You can post questions or issues or feedback through the following channels: 1. `Discussion Board`_: For **questions about Ray usage** or **feature requests**. 2. `GitHub Issues`_: For **bug reports**. 3. `Ray Slack`_: For **getting in touch** with Ray maintainers. 4. `StackOverflow`_: Use the [ray] tag for **questions about Ray**. .. _`Discussion Board`: https://discuss.ray.io/ .. _`GitHub Issues`: https://github.com/ray-project/ray/issues .. _`Ray Slack`: https://www.ray.io/join-slack .. _`StackOverflow`: https://stackoverflow.com/questions/tagged/ray --- .. admonition:: Check your version! Things can change quickly, and so does this contributor guide. To make sure you've got the most cutting-edge version of this guide, check out the `latest version `__. --- .. .. note:: Ray 2.40 uses RLlib's new API stack by default. The Ray team has mostly completed transitioning algorithms, example scripts, and documentation to the new code base. If you're still using the old API stack, see :doc:`New API stack migration guide ` for details on how to migrate. --- .. TODO: we comment out the hiring message, as it's too much with the RL conf announcement. uncomment again after the summit on March 29th. .. .. admonition:: We're hiring! The RLlib team at `Anyscale Inc. `__, the company behind Ray, is hiring interns and full-time **reinforcement learning engineers** to help advance and maintain RLlib. If you have a background in ML/RL and are interested in making RLlib **the** industry-leading open-source RL library, `apply here today `__. We'd be thrilled to welcome you on the team! --- .. _ray-cluster-cli: Cluster Management CLI ====================== This section contains commands for managing Ray clusters. .. _ray-start-doc: .. click:: ray.scripts.scripts:start :prog: ray start :show-nested: .. _ray-stop-doc: .. click:: ray.scripts.scripts:stop :prog: ray stop :show-nested: .. _ray-up-doc: .. click:: ray.scripts.scripts:up :prog: ray up :show-nested: .. _ray-down-doc: .. click:: ray.scripts.scripts:down :prog: ray down :show-nested: .. _ray-exec-doc: .. click:: ray.scripts.scripts:exec :prog: ray exec :show-nested: .. _ray-submit-doc: .. click:: ray.scripts.scripts:submit :prog: ray submit :show-nested: .. _ray-attach-doc: .. click:: ray.scripts.scripts:attach :prog: ray attach :show-nested: .. _ray-get_head_ip-doc: .. click:: ray.scripts.scripts:get_head_ip :prog: ray get_head_ip :show-nested: .. _ray-monitor-doc: .. click:: ray.scripts.scripts:monitor :prog: ray monitor :show-nested: --- .. _cluster-FAQ: === FAQ === These are some Frequently Asked Questions for Ray clusters. If you still have questions after reading this FAQ, reach out on the `Ray Discourse forum `__. Do Ray clusters support multi-tenancy? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Yes, you can run multiple :ref:`jobs ` from different users simultaneously in a Ray cluster, but it's not recommended in production. Some Ray features are still missing for multi-tenancy in production: * Ray doesn't provide strong resource isolation: Ray :ref:`resources ` are logical and they don't limit the physical resources a task or actor can use while running. This means simultaneous jobs can interfere with each other, making them less reliable to run in production. * Ray doesn't support priorities: All jobs, tasks, and actors have the same priority, so there is no way to prioritize important jobs under load.
* Ray doesn't support access control: Jobs have full access to a Ray cluster and all of the resources within it. On the other hand, you can run the same job multiple times on the same cluster to save on cluster startup time. .. note:: A Ray :ref:`namespace ` is just a logical grouping of jobs and named actors. Unlike a Kubernetes namespace, it doesn't provide any other multi-tenancy functions like resource quotas. I have multiple Ray users. What's the right way to deploy Ray for them? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Start a Ray cluster for each user to isolate their workloads. What's the difference between ``--node-ip-address`` and ``--address``? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ When starting a head node on a machine with more than one network address, you may need to specify the externally available address so that worker nodes can connect. Use this command: .. code:: bash ray start --head --node-ip-address xx.xx.xx.xx --port nnnn Then, when starting the worker node, use this command to connect to the head node: .. code:: bash ray start --address xx.xx.xx.xx:nnnn What does a worker node failure to connect look like? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ If the worker node can't connect to the head node, you should see this error: Unable to connect to GCS at xx.xx.xx.xx:nnnn. Check that (1) Ray GCS with matching version started successfully at the specified address, and (2) there is no firewall setting preventing access. The most likely cause is that the worker node can't reach the given IP address. You can use ``ip route get xx.xx.xx.xx`` on the worker node to start debugging routing issues. You may also see failures in the log like: This node has an IP address of xx.xx.xx.xx, while we cannot find the matched Raylet address. This may come from when you connect the Ray cluster with a different IP address or connect a container. The cause of this error may be that the head node is overloaded with too many simultaneous connections. The solution is to start the worker nodes more slowly. Problems getting a SLURM cluster to work ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ A class of issues exists when starting Ray on SLURM clusters. While the exact causes aren't fully understood, some Ray improvements (as of June 2023) mitigate some of the resource contention. Some of the reported issues are as follows: * On a machine with a large number of CPUs, starting one worker per CPU together with OpenBLAS (as used in NumPy) may allocate too many threads. This issue is a `known OpenBLAS limitation`_. You can mitigate it by limiting OpenBLAS to one thread per process, as explained in the link. * Resource allocation isn't as expected: usually the configuration has too many CPUs allocated per node. The best practice is to check the SLURM configuration without starting Ray to confirm that the allocations are as expected. For more detailed information, see :ref:`ray-slurm-deploy`. .. _`known OpenBLAS limitation`: http://www.openmathlib.org/OpenBLAS/docs/faq/#how-can-i-use-openblas-in-multi-threaded-applications Where does my Ray Job entrypoint script run? On the head node or worker nodes? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ By default, jobs submitted using the :ref:`Ray Job API ` run their `entrypoint` script on the head node.
You can change this by specifying any of the options `--entrypoint-num-cpus`, `--entrypoint-num-gpus`, `--entrypoint-resources` or `--entrypoint-memory` to `ray job submit`, or the corresponding arguments if using the Python SDK. If these are specified, the job entrypoint will be scheduled on a node that has the requested resources available. --- .. _cluster-index: Ray Clusters Overview ===================== .. toctree:: :hidden: Key Concepts Deploying on Kubernetes Deploying on VMs metrics configure-manage-dashboard Applications Guide faq package-overview usage-stats Ray enables seamless scaling of workloads from a laptop to a large cluster. While Ray works out of the box on single machines with just a call to ``ray.init``, to run Ray applications on multiple nodes you must first *deploy a Ray cluster*. A Ray cluster is a set of worker nodes connected to a common :ref:`Ray head node `. Ray clusters can be fixed-size, or they may :ref:`autoscale up and down ` according to the resources requested by applications running on the cluster. Where can I deploy Ray clusters? -------------------------------- Ray provides native cluster deployment support on the following technology stacks: * On :ref:`AWS, GCP, and Azure `. Community-supported Aliyun and vSphere integrations also exist. * On :ref:`Kubernetes `, via the officially supported KubeRay project. * On `Anyscale `_, a fully managed Ray platform by the creators of Ray. You can either bring an existing AWS, GCP, Azure and Kubernetes clusters, or use the Anyscale hosted compute layer. Advanced users may want to :ref:`deploy Ray manually ` or onto :ref:`platforms not listed here `. .. note:: Multi-node Ray clusters are only supported on Linux. At your own risk, you may deploy Windows and OSX clusters by setting the environment variable ``RAY_ENABLE_WINDOWS_OR_OSX_CLUSTER=1`` during deployment. What's next? ------------ .. grid:: 1 2 2 2 :gutter: 1 :class-container: container pb-3 .. grid-item-card:: **I want to learn key Ray cluster concepts** ^^^ Understand the key concepts and main ways of interacting with a Ray cluster. +++ .. button-ref:: cluster-key-concepts :color: primary :outline: :expand: Learn Key Concepts .. grid-item-card:: **I want to run Ray on Kubernetes** ^^^ Deploy a Ray application to a Kubernetes cluster. You can run the tutorial on a Kubernetes cluster or on your laptop via Kind. +++ .. button-ref:: kuberay-quickstart :color: primary :outline: :expand: Get Started with Ray on Kubernetes .. grid-item-card:: **I want to run Ray on a cloud provider** ^^^ Take a sample application designed to run on a laptop and scale it up in the cloud. Access to an AWS or GCP account is required. +++ .. button-ref:: vm-cluster-quick-start :color: primary :outline: :expand: Get Started with Ray on VMs .. grid-item-card:: **I want to run my application on an existing Ray cluster** ^^^ Guide to submitting applications as Jobs to existing Ray clusters. +++ .. button-ref:: jobs-quickstart :color: primary :outline: :expand: Job Submission --- Key Concepts ============ .. _cluster-key-concepts: This page introduces key concepts for Ray clusters: .. contents:: :local: Ray Cluster ----------- A Ray cluster consists of a single :ref:`head node ` and any number of connected :ref:`worker nodes `: .. figure:: images/ray-cluster.svg :align: center :width: 600px *A Ray cluster with two worker nodes. Each node runs Ray helper processes to facilitate distributed scheduling and memory management. 
The head node runs additional control processes (highlighted in blue).* The number of worker nodes may be *autoscaled* with application demand as specified by your Ray cluster configuration. The head node runs the :ref:`autoscaler `. .. note:: Ray nodes are implemented as pods when :ref:`running on Kubernetes `. Users can submit jobs for execution on the Ray cluster, or can interactively use the cluster by connecting to the head node and running `ray.init`. See :ref:`Ray Jobs ` for more information. .. _cluster-head-node: Head Node --------- Every Ray cluster has one node which is designated as the *head node* of the cluster. The head node is identical to other worker nodes, except that it also runs singleton processes responsible for cluster management such as the :ref:`autoscaler `, :term:`GCS ` and the Ray driver processes :ref:`which run Ray jobs `. Ray may schedule tasks and actors on the head node just like any other worker node, which is not desired in large-scale clusters. See :ref:`vms-large-cluster-configure-head-node` for the best practice in large-scale clusters. .. _cluster-worker-nodes: Worker Node ------------ *Worker nodes* do not run any head node management processes, and serve only to run user code in Ray tasks and actors. They participate in distributed scheduling, as well as the storage and distribution of Ray objects in :ref:`cluster memory `. .. _cluster-autoscaler: Autoscaling ----------- The *Ray autoscaler* is a process that runs on the :ref:`head node ` (or as a sidecar container in the head pod if :ref:`using Kubernetes `). When the resource demands of the Ray workload exceed the current capacity of the cluster, the autoscaler will try to increase the number of worker nodes. When worker nodes sit idle, the autoscaler will remove worker nodes from the cluster. It is important to understand that the autoscaler only reacts to task and actor resource requests, and not application metrics or physical resource utilization. To learn more about autoscaling, refer to the user guides for Ray clusters on :ref:`VMs ` and :ref:`Kubernetes `. .. note:: Version 2.10.0 introduces the alpha release of Autoscaling V2 on KubeRay. Discover the enhancements and configuration details :ref:`here `. .. _cluster-clients-and-jobs: Ray Jobs -------- A Ray job is a single application: it is the collection of Ray tasks, objects, and actors that originate from the same script. The worker that runs the Python script is known as the *driver* of the job. There are two ways to run a Ray job on a Ray cluster: 1. (Recommended) Submit the job using the :ref:`Ray Jobs API `. 2. Run the driver script directly on the Ray cluster, for interactive development. For details on these workflows, refer to the :ref:`Ray Jobs API guide `. .. figure:: images/ray-job-diagram.svg :align: center :width: 650px *Two ways of running a job on a Ray cluster.* --- .. _kuberay-gpu: Using GPUs ========== This document provides tips on GPU usage with KubeRay. To use GPUs on Kubernetes, configure both your Kubernetes setup and add additional values to your Ray cluster configuration. To learn about GPU usage on different clouds, see instructions for `GKE`_, for `EKS`_, and for `AKS`_. Quickstart: Serve a GPU-based StableDiffusion model ___________________________________________________ You can find several GPU workload examples in the :ref:`examples ` section of the docs. The :ref:`StableDiffusion example ` is a good place to start. 
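Before moving on, it can help to confirm that your Kubernetes nodes actually advertise GPU capacity to the scheduler. The following is a minimal sketch using standard ``kubectl`` output formatting; it assumes the NVIDIA device plugin (or your cloud provider's equivalent) is already installed so that nodes expose the ``nvidia.com/gpu`` resource:

.. code-block:: bash

   # List each node together with its allocatable NVIDIA GPU count.
   # Nodes without the device plugin report <none> in the GPU column.
   kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"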
Dependencies for GPU-based machine learning ___________________________________________ The `Ray Docker Hub `_ hosts CUDA-based container images packaged with Ray and certain machine learning libraries. For example, the image ``rayproject/ray-ml:2.6.3-gpu`` is ideal for running GPU-based ML workloads with Ray 2.6.3. The Ray ML images are packaged with dependencies (such as TensorFlow and PyTorch) needed for the Ray Libraries that are used in these docs. To add custom dependencies, use one, or both, of the following methods: * Building a docker image using one of the official :ref:`Ray docker images ` as base. * Using :ref:`Ray Runtime environments `. Configuring Ray pods for GPU usage __________________________________ Using NVIDIA GPUs requires specifying `nvidia.com/gpu` resource `limits` and `requests` in the container fields of your `RayCluster`'s `headGroupSpec` and/or `workerGroupSpecs`. Here is a config snippet for a RayCluster workerGroup of up to 5 GPU workers. .. code-block:: yaml groupName: gpu-group replicas: 0 minReplicas: 0 maxReplicas: 5 ... template: spec: ... containers: - name: ray-node image: rayproject/ray-ml:2.6.3-gpu ... resources: nvidia.com/gpu: 1 # Optional, included just for documentation. cpu: 3 memory: 50Gi limits: nvidia.com/gpu: 1 # Required to use GPU. cpu: 3 memory: 50Gi ... Each of the Ray pods in the group can be scheduled on an AWS `p2.xlarge` instance (1 GPU, 4vCPU, 61Gi RAM). .. tip:: GPU instances are expensive -- consider setting up autoscaling for your GPU Ray workers, as demonstrated with the `minReplicas:0` and `maxReplicas:5` settings above. To enable autoscaling, remember also to set `enableInTreeAutoscaling:True` in your RayCluster's `spec` Finally, make sure you configured the group or pool of GPU Kubernetes nodes, to autoscale. Refer to your :ref:`cloud provider's documentation ` for details on autoscaling node pools. GPU multi-tenancy _________________ If a Pod doesn't include `nvidia.com/gpu` in its resource configurations, users typically expect the Pod to be unaware of any GPU devices, even if it's scheduled on a GPU node. However, when `nvidia.com/gpu` isn't specified, the default value for `NVIDIA_VISIBLE_DEVICES` becomes `all`, giving the Pod awareness of all GPU devices on the node. This behavior isn't unique to KubeRay, but is a known issue for NVIDIA. A workaround is to set the `NVIDIA_VISIBLE_DEVICES` environment variable to `void` in the Pods which don't require GPU devices. Some useful links: - `NVIDIA/k8s-device-plugin#61`_ - `NVIDIA/k8s-device-plugin#87`_ - `[NVIDIA] Preventing unprivileged access to GPUs in Kubernetes`_ - `ray-project/ray#29753`_ GPUs and Ray ____________ This section discuss GPU usage for Ray applications running on Kubernetes. For general guidance on GPU usage with Ray, see also :ref:`gpu-support`. The KubeRay operator advertises container GPU resource limits to the Ray scheduler and the Ray autoscaler. In particular, the Ray container's `ray start` entrypoint will be automatically configured with the appropriate `--num-gpus` option. GPU workload scheduling ~~~~~~~~~~~~~~~~~~~~~~~ After a Ray pod with access to GPU is deployed, it will be able to execute tasks and actors annotated with gpu requests. For example, the decorator `@ray.remote(num_gpus=1)` annotates a task or actor requiring 1 GPU. GPU autoscaling ~~~~~~~~~~~~~~~ The Ray autoscaler is aware of each Ray worker group's GPU capacity. 
Say we have a RayCluster configured as in the config snippet above: - There is a worker group of Ray pods with 1 unit of GPU capacity each. - The Ray cluster does not currently have any workers from that group. - `maxReplicas` for the group is at least 2. Then the following Ray program will trigger upscaling of 2 GPU workers. .. code-block:: python import ray ray.init() @ray.remote(num_gpus=1) class GPUActor: def say_hello(self): print("I live in a pod with GPU access.") # Request actor placement. gpu_actors = [GPUActor.remote() for _ in range(2)] # The following command will block until two Ray pods with GPU access are scaled # up and the actors are placed. ray.get([actor.say_hello.remote() for actor in gpu_actors]) After the program exits, the actors will be garbage collected. The GPU worker pods will be scaled down after the idle timeout (60 seconds by default). If the GPU worker pods were running on an autoscaling pool of Kubernetes nodes, the Kubernetes nodes will be scaled down as well. Requesting GPUs ~~~~~~~~~~~~~~~ You can also make a :ref:`direct request to the autoscaler ` to scale up GPU resources. .. code-block:: python import ray ray.init() ray.autoscaler.sdk.request_resources(bundles=[{"GPU": 1}] * 2) After the nodes are scaled up, they will persist until the request is explicitly overridden. The following program will remove the resource request. .. code-block:: python import ray ray.init() ray.autoscaler.sdk.request_resources(bundles=[]) The GPU workers can then scale down. .. _kuberay-gpu-override: Overriding Ray GPU capacity (advanced) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ For specialized use-cases, it is possible to override the Ray pod GPU capacities advertised to Ray. To do so, set a value for the `num-gpus` key of the head or worker group's `rayStartParams`. For example, .. code-block:: yaml rayStartParams: # Note that all rayStartParam values must be supplied as strings. num-gpus: "2" The Ray scheduler and autoscaler will then account 2 units of GPU capacity for each Ray pod in the group, even if the container limits do not indicate the presence of GPU. GPU pod scheduling (advanced) _____________________________ GPU taints and tolerations ~~~~~~~~~~~~~~~~~~~~~~~~~~ .. note:: Managed Kubernetes services typically take care of GPU-related taints and tolerations for you. If you are using a managed Kubernetes service, you might not need to worry about this section. The `NVIDIA gpu plugin`_ for Kubernetes applies `taints`_ to GPU nodes; these taints prevent non-GPU pods from being scheduled on GPU nodes. Managed Kubernetes services like GKE, EKS, and AKS automatically apply matching `tolerations`_ to pods requesting GPU resources. Tolerations are applied by means of Kubernetes's `ExtendedResourceToleration`_ `admission controller`_. If this admission controller is not enabled for your Kubernetes cluster, you may need to manually add a GPU toleration to each of your GPU pod configurations. For example, .. code-block:: yaml apiVersion: v1 kind: Pod metadata: generateName: example-cluster-ray-worker spec: ... tolerations: - effect: NoSchedule key: nvidia.com/gpu operator: Exists ... containers: - name: ray-node image: rayproject/ray:nightly-gpu ... Node selectors and node labels ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ To ensure Ray pods are bound to Kubernetes nodes satisfying specific conditions (such as the presence of GPU hardware), you may wish to use the `nodeSelector` field of your `workerGroup`'s pod template `spec`. 
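For example, here is a minimal sketch of a ``workerGroupSpecs`` entry whose pod template uses a ``nodeSelector`` to pin GPU workers to labeled nodes. The label key and value are illustrative only; substitute whatever label your GPU node pool actually carries (GKE, for instance, labels GPU nodes with ``cloud.google.com/gke-accelerator``):

.. code-block:: yaml

   workerGroupSpecs:
   - groupName: gpu-group
     ...
     template:
       spec:
         nodeSelector:
           # Illustrative label; use the label applied to your GPU nodes.
           cloud.google.com/gke-accelerator: nvidia-tesla-t4
         containers:
         - name: ray-node
           image: rayproject/ray-ml:2.6.3-gpu
           ...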
See the `Kubernetes docs`_ for more about Pod-to-Node assignment. Further reference and discussion -------------------------------- Read about Kubernetes device plugins `here `__, about Kubernetes GPU plugins `here `__, and about NVIDIA's GPU plugin for Kubernetes `here `__. .. _`GKE`: https://cloud.google.com/kubernetes-engine/docs/how-to/gpus .. _`EKS`: https://docs.aws.amazon.com/eks/latest/userguide/eks-optimized-ami.html .. _`AKS`: https://docs.microsoft.com/en-us/azure/aks/gpu-cluster .. _`NVIDIA/k8s-device-plugin#61`: https://github.com/NVIDIA/k8s-device-plugin/issues/61 .. _`NVIDIA/k8s-device-plugin#87`: https://github.com/NVIDIA/k8s-device-plugin/issues/87 .. _`[NVIDIA] Preventing unprivileged access to GPUs in Kubernetes`: https://docs.google.com/document/d/1zy0key-EL6JH50MZgwg96RPYxxXXnVUdxLZwGiyqLd8/edit?usp=sharing .. _`ray-project/ray#29753`: https://github.com/ray-project/ray/issues/29753 .. _`tolerations`: https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/ .. _`taints`: https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/ .. _`NVIDIA gpu plugin`: https://github.com/NVIDIA/k8s-device-plugin .. _`admission controller`: https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/ .. _`ExtendedResourceToleration`: https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/#extendedresourcetoleration .. _`Kubernetes docs`: https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/ .. _`bug`: https://github.com/ray-project/kuberay/pull/497/ --- .. _cluster-api-ref: Ray Cluster Management API ========================== This section contains a reference for the cluster management API. If there is anything missing, please open an issue on `GitHub`_. .. _`GitHub`: https://github.com/ray-project/ray/issues .. toctree:: :maxdepth: 2 cli.rst running-applications/job-submission/jobs-package-ref.rst running-applications/job-submission/cli.rst running-applications/autoscaling/reference.rst --- .. _ref-autoscaler-sdk: Programmatic Cluster Scaling ============================ .. _ref-autoscaler-sdk-request-resources: ray.autoscaler.sdk.request_resources ------------------------------------ Within a Ray program, you can command the autoscaler to scale the cluster up to a desired size with ``request_resources()`` call. The cluster will immediately attempt to scale to accommodate the requested resources, bypassing normal upscaling speed constraints. .. autofunction:: ray.autoscaler.sdk.request_resources :noindex: --- .. _ray-job-submission-cli-ref: Ray Jobs CLI API Reference ========================== This section contains commands for :ref:`Ray Job Submission `. .. _ray-job-submit-doc: .. click:: ray.dashboard.modules.job.cli:submit :prog: ray job submit .. warning:: When using the CLI, do not wrap the entrypoint command in quotes. For example, use ``ray job submit --working-dir="." -- python script.py`` instead of ``ray job submit --working-dir="." -- "python script.py"``. Otherwise you may encounter the error ``/bin/sh: 1: python script.py: not found``. .. warning:: You must provide the entrypoint command, ``python script.py``, last (after the ``--``), and any other arguments to `ray job submit` (e.g., ``--working-dir="."``) must be provided before the two hyphens (``--``). For example, use ``ray job submit --working-dir="." -- python script.py`` instead of ``ray job submit -- python script.py --working-dir="."``. 
This syntax supports the use of ``--`` to separate arguments to `ray job submit` from arguments to the entrypoint command. .. _ray-job-status-doc: .. click:: ray.dashboard.modules.job.cli:status :prog: ray job status :show-nested: .. _ray-job-stop-doc: .. click:: ray.dashboard.modules.job.cli:stop :prog: ray job stop :show-nested: .. _ray-job-logs-doc: .. click:: ray.dashboard.modules.job.cli:logs :prog: ray job logs :show-nested: .. _ray-job-list-doc: .. click:: ray.dashboard.modules.job.cli:list :prog: ray job list :show-nested: .. _ray-job-delete-doc: .. click:: ray.dashboard.modules.job.cli:delete :prog: ray job delete :show-nested: --- .. _ray-job-submission-sdk-ref: Python SDK API Reference ======================== .. currentmodule:: ray.job_submission For an overview with examples see :ref:`Ray Jobs `. For the CLI reference see :ref:`Ray Job Submission CLI Reference `. .. _job-submission-client-ref: JobSubmissionClient ------------------- .. autosummary:: :nosignatures: :toctree: doc/ JobSubmissionClient .. autosummary:: :nosignatures: :toctree: doc/ JobSubmissionClient.submit_job JobSubmissionClient.stop_job JobSubmissionClient.get_job_status JobSubmissionClient.get_job_info JobSubmissionClient.list_jobs JobSubmissionClient.get_job_logs JobSubmissionClient.tail_job_logs JobSubmissionClient.delete_job .. _job-status-ref: JobStatus --------- .. autosummary:: :nosignatures: :toctree: doc/ JobStatus .. _job-info-ref: JobInfo ------- .. autosummary:: :nosignatures: :toctree: doc/ JobInfo .. _job-details-ref: JobDetails ---------- .. autosummary:: :nosignatures: :toctree: doc/ JobDetails .. _job-type-ref: JobType ------- .. autosummary:: :nosignatures: :toctree: doc/ JobType .. _driver-info-ref: DriverInfo ---------- .. autosummary:: :nosignatures: :toctree: doc/ DriverInfo --- .. _jobs-quickstart: ================================= Quickstart using the Ray Jobs CLI ================================= This guide walks through the Ray Jobs CLI commands available for submitting and interacting with a Ray Job. To use the Jobs API programmatically with a Python SDK instead of a CLI, see :ref:`ray-job-sdk`. Setup ----- Ray Jobs is available in versions 1.9+ and requires a full installation of Ray. You can install Ray by running: .. code-block:: shell pip install "ray[default]" See the :ref:`installation guide ` for more details on installing Ray. To submit a job, you need to send HTTP requests to a Ray Cluster. This guide assumes that you are using a local Ray Cluster, which you can start by running: .. code-block:: shell ray start --head # ... # 2022-08-10 09:54:57,664 INFO services.py:1476 -- View the Ray dashboard at http://127.0.0.1:8265 # ... This command creates a Ray head node on a local machine that you can use for development purposes. Note the Ray Dashboard URL that appears on stdout when starting or connecting to a Ray Cluster. Use this URL later to submit a job. For more details on production deployment scenarios, see the guides for deploying Ray on :ref:`VMs ` and :ref:`Kubernetes `. Submitting a job ---------------- Start with a sample script that you can run locally. The following script uses Ray APIs to submit a task and print its return value: .. code-block:: python # script.py import ray @ray.remote def hello_world(): return "hello world" # Automatically connect to the running Ray cluster. ray.init() print(ray.get(hello_world.remote())) Create an empty working directory with the preceding Python script inside a file named ``script.py``. .. 
code-block:: bash | your_working_directory | ├── script.py Next, find the HTTP address of the Ray Cluster to which you can submit a job request. Submit jobs to the same address that the **Ray Dashboard** uses. By default, this job uses port 8265. If you are using a local Ray Cluster (``ray start --head``), connect directly at ``http://127.0.0.1:8265``. If you are using a Ray Cluster started on VMs or Kubernetes, follow the instructions there for setting up network access from a client. See :ref:`Using a Remote Cluster ` for tips. To tell the Ray Jobs CLI how to find your Ray Cluster, pass the Ray Dashboard address. Set the ``RAY_API_SERVER_ADDRESS`` environment variable: .. code-block:: bash $ export RAY_API_SERVER_ADDRESS="http://127.0.0.1:8265" Alternatively, you can also pass the ``--address=http://127.0.0.1:8265`` flag explicitly to each Ray Jobs CLI command, or prepend each command with ``RAY_API_SERVER_ADDRESS=http://127.0.0.1:8265``. Additionally, if you wish to pass headers per HTTP request to the Cluster, use the `RAY_JOB_HEADERS` environment variable. This environment variable must be in JSON form. .. code-block:: bash $ export RAY_JOB_HEADERS='{"KEY": "VALUE"}' To submit the job, use ``ray job submit``. Make sure to specify the path to the working directory in the ``--working-dir`` argument. For local clusters this argument isn't strictly necessary, but for remote clusters this argument is required in order to upload the working directory to the cluster. .. code-block:: bash $ ray job submit --working-dir your_working_directory -- python script.py # Job submission server address: http://127.0.0.1:8265 # ------------------------------------------------------- # Job 'raysubmit_inB2ViQuE29aZRJ5' submitted successfully # ------------------------------------------------------- # Next steps # Query the logs of the job: # ray job logs raysubmit_inB2ViQuE29aZRJ5 # Query the status of the job: # ray job status raysubmit_inB2ViQuE29aZRJ5 # Request the job to be stopped: # ray job stop raysubmit_inB2ViQuE29aZRJ5 # Tailing logs until the job exits (disable with --no-wait): # hello world # ------------------------------------------ # Job 'raysubmit_inB2ViQuE29aZRJ5' succeeded # ------------------------------------------ This command runs the entrypoint script on the Ray Cluster's head node and waits until the job finishes. Note that it also streams the `stdout` and `stderr` of the entrypoint script back to the client (``hello world`` in this case). Ray also makes the contents of the directory passed as `--working-dir` available to the Ray job by downloading the directory to all nodes in your cluster. .. note:: The double dash (`--`) separates the arguments for the entrypoint command (e.g., `python script.py --arg1=val1`) from the arguments to `ray job submit`. .. note:: By default the entrypoint script runs on the head node. To override this behavior, specify one of the `--entrypoint-num-cpus`, `--entrypoint-num-gpus`, `--entrypoint-resources`, or `--entrypoint-memory` arguments to the `ray job submit` command. See :ref:`Specifying CPU and GPU resources ` for more details. Interacting with Long-running Jobs ---------------------------------- For long-running applications, you probably don't want to require the client to wait for the job to finish. To do this, pass the ``--no-wait`` flag to ``ray job submit`` and use the other CLI commands to check on the job's status. Try this modified script that submits a task every second in an infinite loop: .. 
code-block:: python # script.py import ray import time @ray.remote def hello_world(): return "hello world" ray.init() while True: print(ray.get(hello_world.remote())) time.sleep(1) Now submit the job: .. code-block:: shell $ ray job submit --no-wait --working-dir your_working_directory -- python script.py # Job submission server address: http://127.0.0.1:8265 # ------------------------------------------------------- # Job 'raysubmit_tUAuCKubPAEXh6CW' submitted successfully # ------------------------------------------------------- # Next steps # Query the logs of the job: # ray job logs raysubmit_tUAuCKubPAEXh6CW # Query the status of the job: # ray job status raysubmit_tUAuCKubPAEXh6CW # Request the job to be stopped: # ray job stop raysubmit_tUAuCKubPAEXh6CW We can later get the stdout using the provided ``ray job logs`` command: .. code-block:: shell $ ray job logs raysubmit_tUAuCKubPAEXh6CW # Job submission server address: http://127.0.0.1:8265 # hello world # hello world # hello world # hello world # hello world Get the current status of the job using ``ray job status``: .. code-block:: shell $ ray job status raysubmit_tUAuCKubPAEXh6CW # Job submission server address: http://127.0.0.1:8265 # Status for job 'raysubmit_tUAuCKubPAEXh6CW': RUNNING # Status message: Job is currently running. Finally, to cancel the job, use ``ray job stop``: .. code-block:: shell $ ray job stop raysubmit_tUAuCKubPAEXh6CW # Job submission server address: http://127.0.0.1:8265 # Attempting to stop job raysubmit_tUAuCKubPAEXh6CW # Waiting for job 'raysubmit_tUAuCKubPAEXh6CW' to exit (disable with --no-wait): # Job 'raysubmit_tUAuCKubPAEXh6CW' was stopped $ ray job status raysubmit_tUAuCKubPAEXh6CW # Job submission server address: http://127.0.0.1:8265 # Job 'raysubmit_tUAuCKubPAEXh6CW' was stopped .. _jobs-remote-cluster: Using a remote cluster ---------------------- The preceding example is for a local Ray cluster. When connecting to a `remote` cluster, you need to access the dashboard port of the cluster over HTTP. One way to access the port is to port forward ``127.0.0.1:8265`` on your local machine to ``127.0.0.1:8265`` on the head node. If you started your remote cluster with the :ref:`Ray Cluster Launcher `, then you can set up automatic port forwarding using the ``ray dashboard`` command. See :ref:`monitor-cluster` for details. Run the following command on your local machine, where ``cluster.yaml`` is the configuration file you used to launch your cluster: .. code-block:: bash ray dashboard cluster.yaml Once this command is running, verify that you can view the Ray Dashboard in your local browser at ``http://127.0.0.1:8265``. Also, verify that you set the environment variable ``RAY_API_SERVER_ADDRESS`` to ``"http://127.0.0.1:8265"``. After this setup, you can use the Jobs CLI on the local machine as in the preceding example to interact with the remote Ray cluster. Using the CLI on Kubernetes ^^^^^^^^^^^^^^^^^^^^^^^^^^^ The preceding instructions still apply, but you can achieve the dashboard port forwarding using ``kubectl port-forward``: https://kubernetes.io/docs/tasks/access-application-cluster/port-forward-access-application-cluster/ Alternatively, you can set up Ingress to the dashboard port of the cluster over HTTP: https://kubernetes.io/docs/concepts/services-networking/ingress/ Dependency management --------------------- To run a distributed application, ensure that all workers run in the same environment. 
This configuration can be challenging if multiple applications in the same Ray Cluster have different and conflicting dependencies. To avoid dependency conflicts, Ray provides a mechanism called :ref:`runtime environments `. Runtime environments allow an application to override the default environment on the Ray Cluster and run in an isolated environment, similar to virtual environments in single-node Python. Dependencies can include both files and Python packages. The Ray Jobs API provides an option to specify the runtime environment when submitting a job. On the Ray Cluster, Ray installs the runtime environment across the workers and ensures that tasks in that job run in the same environment. To demonstrate this feature, this Python script prints the current version of the ``requests`` module in a Ray task. .. code-block:: python import ray import requests @ray.remote def get_requests_version(): return requests.__version__ # Note: No need to specify the runtime_env in ray.init() in the driver script. ray.init() print("requests version:", ray.get(get_requests_version.remote())) Submit this job using the default environment. This environment is the environment you started the Ray Cluster in. .. code-block:: bash $ ray job submit -- python script.py # Job submission server address: http://127.0.0.1:8265 # # ------------------------------------------------------- # Job 'raysubmit_seQk3L4nYWcUBwXD' submitted successfully # ------------------------------------------------------- # # Next steps # Query the logs of the job: # ray job logs raysubmit_seQk3L4nYWcUBwXD # Query the status of the job: # ray job status raysubmit_seQk3L4nYWcUBwXD # Request the job to be stopped: # ray job stop raysubmit_seQk3L4nYWcUBwXD # # Tailing logs until the job exits (disable with --no-wait): # requests version: 2.28.1 # # ------------------------------------------ # Job 'raysubmit_seQk3L4nYWcUBwXD' succeeded # ------------------------------------------ Now submit the job with a runtime environment that pins the version of the ``requests`` module: .. code-block:: bash $ ray job submit --runtime-env-json='{"pip": ["requests==2.26.0"]}' -- python script.py # Job submission server address: http://127.0.0.1:8265 # ------------------------------------------------------- # Job 'raysubmit_vGGV4MiP9rYkYUnb' submitted successfully # ------------------------------------------------------- # Next steps # Query the logs of the job: # ray job logs raysubmit_vGGV4MiP9rYkYUnb # Query the status of the job: # ray job status raysubmit_vGGV4MiP9rYkYUnb # Request the job to be stopped: # ray job stop raysubmit_vGGV4MiP9rYkYUnb # Tailing logs until the job exits (disable with --no-wait): # requests version: 2.26.0 # ------------------------------------------ # Job 'raysubmit_vGGV4MiP9rYkYUnb' succeeded # ------------------------------------------ .. note:: If both the Driver and Job specify a runtime environment, Ray tries to merge them and raises an exception if they conflict. See :ref:`runtime environments ` for more details. - See :ref:`Ray Jobs CLI ` for a full API reference of the CLI. - See :ref:`Ray Jobs SDK ` for a full API reference of the SDK. - For more information, see :ref:`Programmatic job submission ` and :ref:`Job submission using REST `. --- .. _ray-client-ref: Ray Client ========== .. warning:: Ray Client requires pip package `ray[client]`. If you installed the minimal Ray (e.g. `pip install ray`), please reinstall by executing `pip install ray[client]`. 
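For example, to reinstall with the Client extra (quoting the package spec so that shells such as zsh don't expand the square brackets):

.. code-block:: bash

   # Installs Ray together with the dependencies required by Ray Client.
   pip install -U "ray[client]"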
**What is the Ray Client?** The Ray Client is an API that connects a Python script to a **remote** Ray cluster. Effectively, it allows you to leverage a remote Ray cluster just like you would with Ray running on your local machine. By changing ``ray.init()`` to ``ray.init("ray://:")``, you can connect from your laptop (or anywhere) directly to a remote cluster and scale-out your Ray code, while maintaining the ability to develop interactively in a Python shell. **This will only work with Ray 1.5+.** .. code-block:: python # You can run this code outside of the Ray cluster! import ray # Starting the Ray client. This connects to a remote Ray cluster. ray.init("ray://:10001") # Normal Ray code follows @ray.remote def do_work(x): return x ** x do_work.remote(2) #.... When to use Ray Client ---------------------- .. note:: Ray Client has architectural limitations and may not work as expected when using Ray for ML workloads (like Ray Tune or Ray Train). Use :ref:`Ray Jobs API` for interactive development on ML projects. Ray Client can be used when you want to connect an interactive Python shell to a **remote** cluster. * Use ``ray.init("ray://:10001")`` (Ray Client) if you've set up a remote cluster at ```` and you want to do interactive work. This will connect your shell to the cluster. See the section on :ref:`using Ray Client` for more details on setting up your cluster. * Use ``ray.init()`` (non-client connection, no address specified) if you're developing locally and want to connect to an existing cluster (i.e. ``ray start --head`` has already been run), or automatically create a local cluster and attach directly to it. This can also be used for :ref:`Ray Job ` submission. Ray Client is useful for developing interactively in a local Python shell. However, it requires a stable connection to the remote cluster and will terminate the workload if the connection is lost for :ref:`more than 30 seconds `. If you have a long running workload that you want to run on your cluster, we recommend using :ref:`Ray Jobs ` instead. Client arguments ---------------- Ray Client is used when the address passed into ``ray.init`` is prefixed with ``ray://``. Besides the address, Client mode currently accepts two other arguments: - ``namespace`` (optional): Sets the namespace for the session. - ``runtime_env`` (optional): Sets the :ref:`runtime environment ` for the session, allowing you to dynamically specify environment variables, packages, local files, and more. .. code-block:: python # Connects to an existing cluster at 1.2.3.4 listening on port 10001, using # the namespace "my_namespace". The Ray workers will run inside a cluster-side # copy of the local directory "files/my_project", in a Python environment with # `toolz` and `requests` installed. ray.init( "ray://1.2.3.4:10001", namespace="my_namespace", runtime_env={"working_dir": "files/my_project", "pip": ["toolz", "requests"]}, ) #.... .. _how-do-you-use-the-ray-client: How do you use the Ray Client? ------------------------------ Step 1: Set up your Ray cluster ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ If you have a running Ray cluster (version >= 1.5), Ray Client server is likely already running on port ``10001`` of the head node by default. Otherwise, you'll want to create a Ray cluster. To start a Ray cluster locally, you can run .. code-block:: bash ray start --head To start a Ray cluster remotely, you can follow the directions in :ref:`vm-cluster-quick-start`. 
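Once the cluster is up, you can quickly check that the Ray Client server is listening before trying to connect. This is a minimal sketch that assumes the default port ``10001``, is run either on the head node itself or against a locally forwarded port, and requires ``nc`` (netcat) to be available:

.. code-block:: bash

   # Exits successfully only if something is listening on the Ray Client port.
   nc -z 127.0.0.1 10001 && echo "Ray Client server is reachable"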
If necessary, you can modify the Ray Client server port to be other than ``10001``, by specifying ``--ray-client-server-port=...`` to the ``ray start`` :ref:`command `. Step 2: Configure Access ~~~~~~~~~~~~~~~~~~~~~~~~ Ensure that your local machine can access the Ray Client port on the head node. The easiest way to accomplish this is to use SSH port forwarding or `K8s port-forwarding `_. This allows you to connect to the Ray Client server on the head node via ``localhost``. First, open up an SSH connection with your Ray cluster and forward the listening port (``10001``). For Clusters launched with the Ray Cluster launcher this looks like: .. code-block:: bash $ ray up cluster.yaml $ ray attach cluster.yaml -p 10001 Then connect to the Ray cluster **from another terminal** using ``localhost`` as the ``head_node_host``. .. code-block:: python import ray # This will connect to the cluster via the open SSH session. ray.init("ray://localhost:10001") # Normal Ray code follows @ray.remote def do_work(x): return x ** x do_work.remote(2) #.... Step 3: Run Ray code ~~~~~~~~~~~~~~~~~~~~ Now, connect to the Ray Cluster with the following and then use Ray like you normally would: .. .. code-block:: python import ray # replace with the appropriate host and port ray.init("ray://:10001") # Normal Ray code follows @ray.remote def do_work(x): return x ** x do_work.remote(2) #.... Alternative Connection Approach: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Instead of port-forwarding, you can directly connect to the Ray Client server on the head node if your computer has network access to the head node. This is an option if your computer is on the same network as the Cluster or if your computer can connect to the Cluster with a VPN. If your computer does not have direct access, you can modify the network configuration to grant access. On `EC2 `_, this can be done by modifying the security group to allow inbound access from your local IP address to the Ray Client server port (``10001`` by default). .. tab-set:: .. tab-item:: AWS With the Ray cluster launcher, you can configure the security group to allow inbound access by defining :ref:`cluster-configuration-security-group` in your `cluster.yaml`. .. code-block:: yaml # An unique identifier for the head node and workers of this cluster. cluster_name: minimal_security_group # Cloud-provider specific configuration. provider: type: aws region: us-west-2 security_group: GroupName: ray_client_security_group IpPermissions: - FromPort: 10001 ToPort: 10001 IpProtocol: TCP IpRanges: # Allow traffic only from your local IP address. - CidrIp: /32 .. warning:: Anyone with Ray Client access can execute arbitrary code on the Ray Cluster. **Do not expose this to `0.0.0.0/0`.** Connect to multiple Ray clusters (Experimental) ----------------------------------------------- Ray Client allows connecting to multiple Ray clusters in one Python process. To do this, just pass ``allow_multiple=True`` to ``ray.init``: .. code-block:: python import ray # Create a default client. ray.init("ray://:10001") # Connect to other clusters. cli1 = ray.init("ray://:10001", allow_multiple=True) cli2 = ray.init("ray://:10001", allow_multiple=True) # Data is put into the default cluster. obj = ray.put("obj") with cli1: obj1 = ray.put("obj1") with cli2: obj2 = ray.put("obj2") with cli1: assert ray.get(obj1) == "obj1" try: ray.get(obj2) # Cross-cluster ops not allowed. 
except: print("Failed to get object which doesn't belong to this cluster") with cli2: assert ray.get(obj2) == "obj2" try: ray.get(obj1) # Cross-cluster ops not allowed. except: print("Failed to get object which doesn't belong to this cluster") assert "obj" == ray.get(obj) cli1.disconnect() cli2.disconnect() When using Ray multi-client, there are some different behaviors to pay attention to: * The client won't be disconnected automatically. Call ``disconnect`` explicitly to close the connection. * Object references can only be used by the client from which it was obtained. * ``ray.init`` without ``allow_multiple`` will create a default global Ray client. Things to know -------------- .. _client-disconnections: Client disconnections ~~~~~~~~~~~~~~~~~~~~~ When the client disconnects, any object or actor references held by the server on behalf of the client are dropped, as if directly disconnecting from the cluster. If the client disconnects unexpectedly, i.e. due to a network failure, the client will attempt to reconnect to the server for 30 seconds before all of the references are dropped. You can increase this time by setting the environment variable ``RAY_CLIENT_RECONNECT_GRACE_PERIOD=N``, where ``N`` is the number of seconds that the client should spend trying to reconnect before giving up. Versioning requirements ~~~~~~~~~~~~~~~~~~~~~~~ Generally, the client Ray version must match the server Ray version. An error will be raised if an incompatible version is used. Similarly, the minor Python (e.g., 3.6 vs 3.7) must match between the client and server. An error will be raised if this is not the case. Starting a connection on older Ray versions ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ If you encounter ``socket.gaierror: [Errno -2] Name or service not known`` when using ``ray.init("ray://...")`` then you may be on a version of Ray prior to 1.5 that does not support starting client connections through ``ray.init``. Connection through the Ingress ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ If you encounter the following error message when connecting to the ``Ray Cluster`` using an ``Ingress``, it may be caused by the Ingress's configuration. .. .. code-block:: python grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with: status = StatusCode.INVALID_ARGUMENT details = "" debug_error_string = "{"created":"@1628668820.164591000","description":"Error received from peer ipv4:10.233.120.107:443","file":"src/core/lib/surface/call.cc","file_line":1062,"grpc_message":"","grpc_status":3}" > Got Error from logger channel -- shutting down: <_MultiThreadedRendezvous of RPC that terminated with: status = StatusCode.INVALID_ARGUMENT details = "" debug_error_string = "{"created":"@1628668820.164713000","description":"Error received from peer ipv4:10.233.120.107:443","file":"src/core/lib/surface/call.cc","file_line":1062,"grpc_message":"","grpc_status":3}" > If you are using the ``nginx-ingress-controller``, you may be able to resolve the issue by adding the following Ingress configuration. .. code-block:: yaml metadata: annotations: nginx.ingress.kubernetes.io/server-snippet: | underscores_in_headers on; ignore_invalid_headers on; Ray client logs ~~~~~~~~~~~~~~~ Ray client logs can be found at ``/tmp/ray/session_latest/logs`` on the head node. Uploads ~~~~~~~ If a ``working_dir`` is specified in the runtime env, when running ``ray.init()`` the Ray client will upload the ``working_dir`` on the laptop to ``/tmp/ray/session_latest/runtime_resources/_ray_pkg_``. 
Ray workers are started in the ``/tmp/ray/session_latest/runtime_resources/_ray_pkg_`` directory on the cluster. This means that relative paths in the remote tasks and actors in the code will work on the laptop and on the cluster without any code changes. For example, if the ``working_dir`` on the laptop contains ``data.txt`` and ``run.py``, inside the remote task definitions in ``run.py`` you can just use the relative path ``"data.txt"``. Then ``python run.py`` will work on your laptop, and also on the cluster. As a side note, since relative paths can be used in the code, the absolute path is only useful for debugging purposes. Troubleshooting --------------- Error: Attempted to reconnect a session that has already been cleaned up ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ This error happens when Ray Client reconnects to a head node that does not recognize the client. This can happen if the head node restarts unexpectedly and loses state. On Kubernetes, this can happen if the head pod restarts after being evicted or crashing. --- .. _ray-job-rest-api: Ray Jobs REST API ^^^^^^^^^^^^^^^^^ Under the hood, both the Python SDK and the CLI make HTTP calls to the job server running on the Ray head node. You can also directly send requests to the corresponding endpoints via HTTP if needed. Continue on for examples, or jump to the :ref:`OpenAPI specification `. **Submit Job** .. code-block:: python import requests import json import time resp = requests.post( "http://127.0.0.1:8265/api/jobs/", # Don't forget the trailing slash! json={ "entrypoint": "echo hello", "runtime_env": {}, "job_id": None, "metadata": {"job_submission_id": "123"} } ) rst = json.loads(resp.text) job_id = rst["job_id"] print(job_id) **Query and poll for Job status** .. code-block:: python from ray.job_submission import JobStatus start = time.time() while time.time() - start <= 10: resp = requests.get( f"http://127.0.0.1:8265/api/jobs/{job_id}" ) rst = json.loads(resp.text) status = rst["status"] print(f"status: {status}") if status in {JobStatus.SUCCEEDED, JobStatus.STOPPED, JobStatus.FAILED}: break time.sleep(1) **Query for logs** .. code-block:: python resp = requests.get( f"http://127.0.0.1:8265/api/jobs/{job_id}/logs" ) rst = json.loads(resp.text) logs = rst["logs"] print(logs) **List all jobs** .. code-block:: python resp = requests.get( "http://127.0.0.1:8265/api/jobs/" ) print(resp.json()) # {"job_id": {"metadata": ..., "status": ..., "message": ...}, ...} **Stop a Job** .. code-block:: python import json import time import requests resp = requests.post( f"http://127.0.0.1:8265/api/jobs/{job_id}/stop", ) rst = json.loads(resp.text) stopped = rst["stopped"] print(stopped) .. _ray-job-rest-api-spec: OpenAPI Documentation (Beta) ---------------------------- We provide an OpenAPI specification for the Ray Job API. You can use this to generate client libraries for other languages. View the `Ray Jobs REST API OpenAPI documentation `_. --- .. _ray-job-sdk: Python SDK Overview ^^^^^^^^^^^^^^^^^^^ The Ray Jobs Python SDK is the recommended way to submit jobs programmatically. Jump to the :ref:`API Reference `, or continue reading for a quick overview. Setup ----- Ray Jobs is available in versions 1.9+ and requires a full installation of Ray. You can do this by running: .. code-block:: shell pip install "ray[default]" See the :ref:`installation guide ` for more details on installing Ray. To run a Ray Job, we also need to be able to send HTTP requests to a Ray Cluster.
For convenience, this guide will assume that you are using a local Ray Cluster, which we can start by running: .. code-block:: shell ray start --head # ... # 2022-08-10 09:54:57,664 INFO services.py:1476 -- View the Ray dashboard at http://127.0.0.1:8265 # ... This will create a Ray head node on our local machine that we can use for development purposes. Note the Ray Dashboard URL that is printed when starting or connecting to a Ray Cluster; we will use this URL later to submit a Ray Job. See :ref:`Using a Remote Cluster ` for tips on port-forwarding if using a remote cluster. For more details on production deployment scenarios, check out the guides for deploying Ray on :ref:`VMs ` and :ref:`Kubernetes `. Submitting a Ray Job -------------------- Let's start with a sample script that can be run locally. The following script uses Ray APIs to submit a task and print its return value: .. code-block:: python # script.py import ray @ray.remote def hello_world(): return "hello world" ray.init() print(ray.get(hello_world.remote())) SDK calls are made via a ``JobSubmissionClient`` object. To initialize the client, provide the Ray cluster head node address and the port used by the Ray Dashboard (``8265`` by default). For this example, we'll use a local Ray cluster, but the same example will work for remote Ray cluster addresses; see :ref:`Using a Remote Cluster ` for details on setting up port forwarding. .. code-block:: python from ray.job_submission import JobSubmissionClient # If using a remote cluster, replace 127.0.0.1 with the head node's IP address or set up port forwarding. client = JobSubmissionClient("http://127.0.0.1:8265") job_id = client.submit_job( # Entrypoint shell command to execute entrypoint="python script.py", # Path to the local directory that contains the script.py file runtime_env={"working_dir": "./"} ) print(job_id) .. tip:: By default, the Ray job server will generate a new ``job_id`` and return it, but you can alternatively choose a unique ``job_id`` string first and pass it into :code:`submit_job`. In this case, the Job will be executed with your given id, and will throw an error if the same ``job_id`` is submitted more than once for the same Ray cluster. Because job submission is asynchronous, the above call will return immediately with output like the following: .. code-block:: bash raysubmit_g8tDzJ6GqrCy7pd6 Now we can write a simple polling loop that checks the job status until it reaches a terminal state (namely, ``JobStatus.SUCCEEDED``, ``JobStatus.STOPPED``, or ``JobStatus.FAILED``). We can also get the output of the job by calling ``client.get_job_logs``. .. code-block:: python from ray.job_submission import JobSubmissionClient, JobStatus import time # If using a remote cluster, replace 127.0.0.1 with the head node's IP address. client = JobSubmissionClient("http://127.0.0.1:8265") job_id = client.submit_job( # Entrypoint shell command to execute entrypoint="python script.py", # Path to the local directory that contains the script.py file runtime_env={"working_dir": "./"} ) print(job_id) def wait_until_status(job_id, status_to_wait_for, timeout_seconds=5): start = time.time() while time.time() - start <= timeout_seconds: status = client.get_job_status(job_id) print(f"status: {status}") if status in status_to_wait_for: break time.sleep(1) wait_until_status(job_id, {JobStatus.SUCCEEDED, JobStatus.STOPPED, JobStatus.FAILED}) logs = client.get_job_logs(job_id) print(logs) The output should look something like this: .. 
code-block:: bash raysubmit_pBwfn5jqRE1E7Wmc status: PENDING status: PENDING status: RUNNING status: RUNNING status: RUNNING 2022-08-22 15:05:55,652 INFO worker.py:1203 -- Using address 127.0.0.1:6379 set in the environment variable RAY_ADDRESS 2022-08-22 15:05:55,652 INFO worker.py:1312 -- Connecting to existing Ray cluster at address: 127.0.0.1:6379... 2022-08-22 15:05:55,660 INFO worker.py:1487 -- Connected to Ray cluster. View the dashboard at http://127.0.0.1:8265. hello world Interacting with Long-running Jobs ---------------------------------- In addition to getting the current status and output of a job, a submitted job can also be stopped by the user before it finishes executing. .. code-block:: python job_id = client.submit_job( # Entrypoint shell command to execute entrypoint="python -c 'import time; print(\"Sleeping...\"); time.sleep(60)'" ) wait_until_status(job_id, {JobStatus.RUNNING}) print(f'Stopping job {job_id}') client.stop_job(job_id) wait_until_status(job_id, {JobStatus.SUCCEEDED, JobStatus.STOPPED, JobStatus.FAILED}) logs = client.get_job_logs(job_id) print(logs) The output should look something like the following: .. code-block:: bash status: PENDING status: PENDING status: RUNNING Stopping job raysubmit_VYCZZ2BQb4tfeCjq status: STOPPED Sleeping... To get information about all jobs, call ``client.list_jobs()``. This returns a ``Dict[str, JobInfo]`` object mapping Job IDs to their information. Job information (status and associated metadata) is stored on the cluster indefinitely. To delete this information, you may call ``client.delete_job(job_id)`` for any job that is already in a terminal state. See the :ref:`SDK API Reference ` for more details. Dependency Management --------------------- Similar to the :ref:`Jobs CLI `, we can also package our application's dependencies by using a Ray :ref:`runtime environment `. Using the Python SDK, the syntax looks something like this: .. code-block:: python job_id = client.submit_job( # Entrypoint shell command to execute entrypoint="python script.py", # Runtime environment for the job, specifying a working directory and pip package runtime_env={ "working_dir": "./", "pip": ["requests==2.26.0"] } ) .. tip:: Instead of a local directory (``"./"`` in this example), you can also specify remote URIs for your job's working directory, such as S3 buckets or Git repositories. See :ref:`remote-uris` for details. For full details, see the :ref:`API Reference `. .. _ray-job-cpu-gpu-resources: Specifying CPU and GPU resources -------------------------------- By default, the job entrypoint script always runs on the head node. We recommend doing heavy computation within Ray tasks, actors, or Ray libraries, not directly in the top level of your entrypoint script. No extra configuration is needed to do this. However, if you need to do computation directly in the entrypoint script and would like to reserve CPU and GPU resources for the entrypoint script, you may specify the ``entrypoint_num_cpus``, ``entrypoint_num_gpus``, ``entrypoint_memory`` and ``entrypoint_resources`` arguments to ``submit_job``. These arguments function identically to the ``num_cpus``, ``num_gpus``, ``resources``, and ``_memory`` arguments to ``@ray.remote()`` decorator for tasks and actors as described in :ref:`resource-requirements`. If any of these arguments are specified, the entrypoint script will be scheduled on a node with at least the specified resources, instead of the head node, which is the default. 
For example, the following code will schedule the entrypoint script on a node with at least 1 GPU: .. code-block:: python job_id = client.submit_job( entrypoint="python script.py", runtime_env={ "working_dir": "./", }, # Reserve 1 GPU for the entrypoint script entrypoint_num_gpus=1 ) The same arguments are also available as options ``--entrypoint-num-cpus``, ``--entrypoint-num-gpus``, ``--entrypoint-memory``, and ``--entrypoint-resources`` to ``ray job submit`` in the Jobs CLI; see :ref:`Ray Job Submission CLI Reference `. If ``entrypoint_num_gpus`` is not specified, GPUs will still be available to the entrypoint script, but Ray will not provide isolation in terms of visible devices. To be precise, the environment variable ``CUDA_VISIBLE_DEVICES`` will not be set in the entrypoint script; it will only be set inside tasks and actors that have ``num_gpus`` specified in their ``@ray.remote()`` decorator. .. note:: Resources specified by ``entrypoint_num_cpus``, ``entrypoint_num_gpus``, ``entrypoint_memory``, and ``entrypoint_resources`` are separate from any resources specified for tasks and actors within the job. For example, if you specify ``entrypoint_num_gpus=1``, then the entrypoint script will be scheduled on a node with at least 1 GPU, but if your script also contains a Ray task defined with ``@ray.remote(num_gpus=1)``, then the task will be scheduled to use a different GPU (on the same node if the node has at least 2 GPUs, or on a different node otherwise). .. note:: As with the ``num_cpus``, ``num_gpus``, ``resources``, and ``_memory`` arguments to ``@ray.remote()`` described in :ref:`resource-requirements`, these arguments only refer to logical resources used for scheduling purposes. The actual CPU and GPU utilization is not controlled or limited by Ray. .. note:: By default, 0 CPUs and 0 GPUs are reserved for the entrypoint script. Client Configuration -------------------------------- Additional client connection options, such as custom HTTP headers and cookies, can be passed to the ``JobSubmissionClient`` class. A full list of options can be found in the :ref:`API Reference `. TLS Verification ~~~~~~~~~~~~~~~~~ By default, any HTTPS client connections will be verified using system certificates found by the underlying ``requests`` and ``aiohttp`` libraries. The ``verify`` parameter can be set to override this behavior. For example: .. code-block:: python client = JobSubmissionClient("https://", verify="/path/to/cert.pem") will use the certificate found at ``/path/to/cert.pem`` to verify the job server's certificate. Certificate verification can be disabled by setting the ``verify`` parameter to ``False``. --- .. _ref-usage-stats: Usage Stats Collection ====================== Starting in Ray 1.13, Ray collects usage stats data by default (guarded by an opt-out prompt). This data will be used by the open-source Ray engineering team to better understand how to improve our libraries and core APIs, and how to prioritize bug fixes and enhancements. Here are the guiding principles of our collection policy: - **No surprises** — you will be notified before we begin collecting data. You will be notified of any changes to the data being collected or how it is used. - **Easy opt-out:** You will be able to easily opt out of data collection - **Transparency** — you will be able to review all data that is sent to us - **Control** — you will have control over your data, and we will honor requests to delete your data.
- We will **not** collect any personally identifiable data or proprietary code/data - We will **not** sell data or buy data about you. You will always be able to :ref:`disable the usage stats collection `. For more context, please refer to this `RFC `_. What data is collected? ----------------------- We collect non-sensitive data that helps us understand how Ray is used (e.g., which Ray libraries are used). **Personally identifiable data will never be collected.** Please check the UsageStatsToReport class to see the data we collect. .. _usage-disable: How to disable it ----------------- There are multiple ways to disable usage stats collection before starting a cluster: #. Add the ``--disable-usage-stats`` option to the command that starts the Ray cluster (e.g., ``ray start --head --disable-usage-stats`` :ref:`command `). #. Run :ref:`ray disable-usage-stats ` to disable collection for all future clusters. This won't affect currently running clusters. Under the hood, this command writes ``{"usage_stats": false}`` to the global config file ``~/.ray/config.json``. #. Set the environment variable ``RAY_USAGE_STATS_ENABLED`` to 0 (e.g., ``RAY_USAGE_STATS_ENABLED=0 ray start --head`` :ref:`command `). #. If you're using `KubeRay `_, you can add ``disable-usage-stats: 'true'`` to ``.spec.[headGroupSpec|workerGroupSpecs].rayStartParams.``. Currently there is no way to enable or disable collection for a running cluster; you have to stop and restart the cluster. How does it work? ----------------- When a Ray cluster is started via :ref:`ray start --head `, :ref:`ray up `, :ref:`ray submit --start ` or :ref:`ray exec --start `, Ray will decide whether to enable usage stats collection by considering the following factors in order: #. It checks whether the environment variable ``RAY_USAGE_STATS_ENABLED`` is set: 1 means enabled and 0 means disabled. #. If the environment variable is not set, it reads the value of key ``usage_stats`` in the global config file ``~/.ray/config.json``: true means enabled and false means disabled. #. If neither is set and the console is interactive, then the user will be prompted to enable or disable the collection. If the console is non-interactive, usage stats collection will be enabled by default. The decision will be saved to ``~/.ray/config.json``, so the prompt is only shown once. Note: usage stats collection is not enabled when using local dev clusters started via ``ray.init()`` unless it's a nightly wheel. This means that Ray will never collect data from third-party library users not using Ray directly. If usage stats collection is enabled, a background process on the head node will collect the usage stats and report to ``https://usage-stats.ray.io/`` every hour. The reported usage stats will also be saved to ``/tmp/ray/session_xxx/usage_stats.json`` on the head node for inspection. You can check the existence of this file to see if collection is enabled. Usage stats collection is very lightweight and should have no impact on your workload in any way. Requesting removal of collected data ------------------------------------ To request removal of collected data, please email us at ``usage_stats@ray.io`` with the ``session_id`` that you can find in ``/tmp/ray/session_xxx/usage_stats.json``. Frequently Asked Questions (FAQ) -------------------------------- **Does the session_id map to personal data?** No, the uuid will be a Ray session/job-specific random ID that cannot be used to identify a specific person or machine.
It will not live beyond the lifetime of your Ray session; and is primarily captured to enable us to honor deletion requests. The session_id is logged so that deletion requests can be honored. **Could an enterprise easily configure an additional endpoint or substitute a different endpoint?** We definitely see this use case and would love to chat with you to make this work -- email ``usage_stats@ray.io``. Contact us ---------- If you have any feedback regarding usage stats collection, please email us at ``usage_stats@ray.io``. --- .. _vm-cluster-quick-start: Getting Started =============== This quick start demonstrates the capabilities of the Ray cluster. Using the Ray cluster, we'll take a sample application designed to run on a laptop and scale it up in the cloud. Ray will launch clusters and scale Python with just a few commands. For launching a Ray cluster manually, you can refer to the :ref:`on-premise cluster setup ` guide. About the demo -------------- This demo will walk through an end-to-end flow: 1. Create a (basic) Python application. 2. Launch a cluster on a cloud provider. 3. Run the application in the cloud. Requirements ~~~~~~~~~~~~ To run this demo, you will need: * Python installed on your development machine (typically your laptop), and * an account at your preferred cloud provider (AWS, GCP, Azure, Aliyun, or vSphere). Setup ~~~~~ Before we start, you will need to install some Python dependencies as follows: .. tab-set:: .. tab-item:: Ray Team Supported :sync: Ray Team Supported .. tab-set:: .. tab-item:: AWS :sync: AWS .. code-block:: shell $ pip install -U "ray[default]" boto3 .. tab-item:: Azure :sync: Azure .. code-block:: shell $ pip install -U "ray[default]" azure-cli azure-core .. tab-item:: GCP :sync: GCP .. code-block:: shell $ pip install -U "ray[default]" google-api-python-client .. tab-item:: Community Supported :sync: Community Supported .. tab-set:: .. tab-item:: Aliyun :sync: Aliyun .. code-block:: shell $ pip install -U "ray[default]" aliyun-python-sdk-core aliyun-python-sdk-ecs Aliyun Cluster Launcher Maintainers (GitHub handles): @zhuangzhuang131419, @chenk008 .. tab-item:: vSphere :sync: vSphere .. code-block:: shell $ pip install -U "ray[default]" vSphere Cluster Launcher Maintainers (GitHub handles): @roshankathawate, @ankitasonawane30, @VamshikShetty Next, if you're not set up to use your cloud provider from the command line, you'll have to configure your credentials: .. tab-set:: .. tab-item:: Ray Team Supported :sync: Ray Team Supported .. tab-set:: .. tab-item:: AWS :sync: AWS Configure your credentials in ``~/.aws/credentials`` as described in `the AWS docs `_. .. tab-item:: Azure :sync: Azure Log in using ``az login``, then configure your credentials with ``az account set -s ``. .. tab-item:: GCP :sync: GCP Set the ``GOOGLE_APPLICATION_CREDENTIALS`` environment variable as described in `the GCP docs `_. .. tab-item:: Community Supported :sync: Community Supported .. tab-set:: .. tab-item:: Aliyun :sync: Aliyun Obtain and set the AccessKey pair of the Aliyun account as described in `the docs `__. Make sure to grant the necessary permissions to the RAM user and set the AccessKey pair in your cluster config file. Refer to the provided `aliyun/example-full.yaml `__ for a sample cluster config. .. 
tab-item:: vSphere :sync: vSphere Make sure Ray supervisor service is up and running as per `the Ray-on-VCF docs ` Create a (basic) Python application ----------------------------------- We will write a simple Python application that tracks the IP addresses of the machines that its tasks are executed on: .. code-block:: python from collections import Counter import socket import time def f(): time.sleep(0.001) # Return IP address. return socket.gethostbyname("localhost") ip_addresses = [f() for _ in range(10000)] print(Counter(ip_addresses)) Save this application as ``script.py`` and execute it by running the command ``python script.py``. The application should take 10 seconds to run and output something similar to ``Counter({'127.0.0.1': 10000})``. With some small changes, we can make this application run on Ray (for more information on how to do this, refer to :ref:`the Ray Core Walkthrough `): .. code-block:: python from collections import Counter import socket import time import ray ray.init() @ray.remote def f(): time.sleep(0.001) # Return IP address. return socket.gethostbyname("localhost") object_ids = [f.remote() for _ in range(10000)] ip_addresses = ray.get(object_ids) print(Counter(ip_addresses)) Finally, let's add some code to make the output more interesting: .. code-block:: python from collections import Counter import socket import time import ray ray.init() print('''This cluster consists of {} nodes in total {} CPU resources in total '''.format(len(ray.nodes()), ray.cluster_resources()['CPU'])) @ray.remote def f(): time.sleep(0.001) # Return IP address. return socket.gethostbyname("localhost") object_ids = [f.remote() for _ in range(10000)] ip_addresses = ray.get(object_ids) print('Tasks executed') for ip_address, num_tasks in Counter(ip_addresses).items(): print(' {} tasks on {}'.format(num_tasks, ip_address)) Running ``python script.py`` should now output something like: .. parsed-literal:: This cluster consists of 1 nodes in total 4.0 CPU resources in total Tasks executed 10000 tasks on 127.0.0.1 Launch a cluster on a cloud provider ------------------------------------ To start a Ray Cluster, first we need to define the cluster configuration. The cluster configuration is defined within a YAML file that will be used by the Cluster Launcher to launch the head node, and by the Autoscaler to launch worker nodes. A minimal sample cluster configuration file looks as follows: .. tab-set:: .. tab-item:: Ray Team Supported :sync: Ray Team Supported .. tab-set:: .. tab-item:: AWS :sync: AWS .. literalinclude:: ../../../../python/ray/autoscaler/aws/example-minimal.yaml :language: yaml .. tab-item:: Azure :sync: Azure .. code-block:: yaml # An unique identifier for the head node and workers of this cluster. cluster_name: minimal # Cloud-provider specific configuration. provider: type: azure location: westus2 resource_group: ray-cluster # How Ray will authenticate with newly launched nodes. auth: ssh_user: ubuntu # you must specify paths to matching private and public key pair files # use `ssh-keygen -t rsa -b 4096` to generate a new ssh key pair ssh_private_key: ~/.ssh/id_rsa # changes to this should match what is specified in file_mounts ssh_public_key: ~/.ssh/id_rsa.pub .. tab-item:: GCP :sync: GCP .. code-block:: yaml # A unique identifier for the head node and workers of this cluster. cluster_name: minimal # Cloud-provider specific configuration. provider: type: gcp region: us-west1 .. tab-item:: Community Supported :sync: Community Supported .. tab-set:: .. 
tab-item:: Aliyun :sync: Aliyun Please refer to `example-full.yaml `__. Make sure your account balance is not less than 100 RMB, otherwise you will receive the error `InvalidAccountStatus.NotEnoughBalance`. .. tab-item:: vSphere :sync: vSphere .. literalinclude:: ../../../../python/ray/autoscaler/vsphere/example-minimal.yaml :language: yaml Save this configuration file as ``config.yaml``. You can specify a lot more details in the configuration file: instance types to use, minimum and maximum number of workers to start, autoscaling strategy, files to sync, and more. For a full reference on the available configuration properties, please refer to the :ref:`cluster YAML configuration options reference `. After defining our configuration, we will use the Ray cluster launcher to start a cluster on the cloud, creating a designated "head node" and worker nodes. To start the Ray cluster, we will use the :ref:`Ray CLI `. Run the following command: .. code-block:: shell $ ray up -y config.yaml Running applications on a Ray Cluster ------------------------------------- We are now ready to execute an application on our Ray Cluster. ``ray.init()`` will now automatically connect to the newly created cluster. As a quick example, we execute a Python command on the Ray Cluster that connects to Ray and exits: .. code-block:: shell $ ray exec config.yaml 'python -c "import ray; ray.init()"' 2022-08-10 11:23:17,093 INFO worker.py:1312 -- Connecting to existing Ray cluster at address: :6379... 2022-08-10 11:23:17,097 INFO worker.py:1490 -- Connected to Ray cluster. You can also optionally get a remote shell using ``ray attach`` and run commands directly on the cluster. This command will create an SSH connection to the head node of the Ray Cluster. .. code-block:: shell # From a remote client: $ ray attach config.yaml # Now on the head node... $ python -c "import ray; ray.init()" For a full reference on the Ray Cluster CLI tools, please refer to :ref:`the cluster commands reference `. While these tools are useful for ad-hoc execution on the Ray Cluster, the recommended way to execute an application on a Ray Cluster is to use :ref:`Ray Jobs `. Check out the :ref:`quickstart guide ` to get started! Deleting a Ray Cluster ---------------------- To shut down your cluster, run the following command: .. code-block:: shell $ ray down -y config.yaml --- .. _cluster-commands: Cluster Launcher Commands ========================= This document overviews common commands for using the Ray cluster launcher. See the :ref:`Cluster Configuration ` docs on how to customize the configuration file. Launching a cluster (``ray up``) -------------------------------- This will start up the machines in the cloud, install your dependencies and run any setup commands that you have, configure the Ray cluster automatically, and prepare you to scale your distributed system. See :ref:`the documentation ` for ``ray up``. The example config files can be accessed `here `_. .. tip:: The worker nodes will start only after the head node has finished starting. To monitor the progress of the cluster setup, you can run `ray monitor `. .. code-block:: shell # Replace '' with one of: 'aws', 'gcp', 'kubernetes', or 'local'. $ BACKEND= # Create or update the cluster. $ ray up ray/python/ray/autoscaler/$BACKEND/example-full.yaml # Tear down the cluster. 
$ ray down ray/python/ray/autoscaler/$BACKEND/example-full.yaml Updating an existing cluster (``ray up``) ----------------------------------------- If you want to update your cluster configuration (add more files, change dependencies), run ``ray up`` again on the existing cluster. This command checks if the local configuration differs from the applied configuration of the cluster. This includes any changes to synced files specified in the ``file_mounts`` section of the config. If so, the new files and config will be uploaded to the cluster. Following that, Ray services/processes will be restarted. .. tip:: Don't do this for the cloud provider specifications (e.g., change from AWS to GCP on a running cluster) or change the cluster name (as this will just start a new cluster and orphan the original one). You can also run ``ray up`` to restart a cluster if it seems to be in a bad state (this will restart all Ray services even if there are no config changes). Running ``ray up`` on an existing cluster will do all the following: * If the head node matches the cluster specification, the filemounts will be reapplied and the ``setup_commands`` and ``ray start`` commands will be run. There may be some caching behavior here to skip setup/file mounts. * If the head node is out of date from the specified YAML (e.g., ``head_node_type`` has changed on the YAML), then the out-of-date node will be terminated and a new node will be provisioned to replace it. Setup/file mounts/``ray start`` will be applied. * After the head node reaches a consistent state (after ``ray start`` commands are finished), the same above procedure will be applied to all the worker nodes. The ``ray start`` commands tend to run a ``ray stop`` + ``ray start``, so this will kill currently working jobs. If you don't want the update to restart services (e.g. because the changes don't require a restart), pass ``--no-restart`` to the update call. If you want to force re-generation of the config to pick up possible changes in the cloud environment, pass ``--no-config-cache`` to the update call. If you want to skip the setup commands and only run ``ray stop``/``ray start`` on all nodes, pass ``--restart-only`` to the update call. See :ref:`the documentation ` for ``ray up``. .. code-block:: shell # Reconfigure autoscaling behavior without interrupting running jobs. $ ray up ray/python/ray/autoscaler/$BACKEND/example-full.yaml \ --max-workers=N --no-restart Running shell commands on the cluster (``ray exec``) ---------------------------------------------------- You can use ``ray exec`` to conveniently run commands on clusters. See :ref:`the documentation ` for ``ray exec``. .. code-block:: shell # Run a command on the cluster $ ray exec cluster.yaml 'echo "hello world"' # Run a command on the cluster, starting it if needed $ ray exec cluster.yaml 'echo "hello world"' --start # Run a command on the cluster, stopping the cluster after it finishes $ ray exec cluster.yaml 'echo "hello world"' --stop # Run a command on a new cluster called 'experiment-1', stopping it after $ ray exec cluster.yaml 'echo "hello world"' \ --start --stop --cluster-name experiment-1 # Run a command in a detached tmux session $ ray exec cluster.yaml 'echo "hello world"' --tmux # Run a command in a screen (experimental) $ ray exec cluster.yaml 'echo "hello world"' --screen If you want to run applications on the cluster that are accessible from a web browser (e.g., Jupyter notebook), you can use the ``--port-forward``. The local port opened is the same as the remote port. 
.. code-block:: shell $ ray exec cluster.yaml --port-forward=8899 'source ~/anaconda3/bin/activate tensorflow_p36 && jupyter notebook --port=8899' .. note:: For Kubernetes clusters, the ``port-forward`` option cannot be used while executing a command. To port forward and run a command you need to call ``ray exec`` twice separately. Running Ray scripts on the cluster (``ray submit``) --------------------------------------------------- You can also use ``ray submit`` to execute Python scripts on clusters. This will ``rsync`` the designated file onto the cluster's head node and execute it with the given arguments. See :ref:`the documentation ` for ``ray submit``. .. code-block:: shell # Run a Python script in a detached tmux session $ ray submit cluster.yaml --tmux --start --stop tune_experiment.py # Run a Python script with arguments. # This executes script.py on the head node of the cluster, using # the command: python ~/script.py --arg1 --arg2 --arg3 $ ray submit cluster.yaml script.py -- --arg1 --arg2 --arg3 Attaching to a running cluster (``ray attach``) ----------------------------------------------- You can use ``ray attach`` to attach to an interactive screen session on the cluster. See :ref:`the documentation ` for ``ray attach`` or run ``ray attach --help``. .. code-block:: shell # Open a screen on the cluster $ ray attach cluster.yaml # Open a screen on a new cluster called 'session-1' $ ray attach cluster.yaml --start --cluster-name=session-1 # Attach to tmux session on cluster (creates a new one if none available) $ ray attach cluster.yaml --tmux .. _ray-rsync: Synchronizing files from the cluster (``ray rsync-up/down``) ------------------------------------------------------------ To download files from or upload files to the cluster head node, use ``ray rsync_down`` or ``ray rsync_up``: .. code-block:: shell $ ray rsync_down cluster.yaml '/path/on/cluster' '/local/path' $ ray rsync_up cluster.yaml '/local/path' '/path/on/cluster' .. _monitor-cluster: Monitoring cluster status (``ray dashboard/status``) ----------------------------------------------------- Ray also comes with an online dashboard. The dashboard is accessible via HTTP on the head node (by default it listens on ``localhost:8265``). You can also use the built-in ``ray dashboard`` command to set up port forwarding automatically, making the remote dashboard viewable in your local browser at ``localhost:8265``. .. code-block:: shell $ ray dashboard cluster.yaml You can monitor cluster usage and auto-scaling status by running (on the head node): .. code-block:: shell $ ray status To see live updates to the status: .. code-block:: shell $ watch -n 1 ray status The Ray autoscaler also reports per-node status in the form of instance tags. In your cloud provider console, you can click on a Node, go to the "Tags" pane, and add the ``ray-node-status`` tag as a column. This lets you see per-node statuses at a glance: .. image:: /images/autoscaler-status.png Common Workflow: Syncing git branches ------------------------------------- A common use case is syncing a particular local git branch to all workers of the cluster. However, if you just put a `git checkout ` in the setup commands, the autoscaler won't know when to rerun the command to pull in updates. There is a nice workaround for this by including the git SHA in the input (the hash of the file will change if the branch is updated): ..
code-block:: yaml file_mounts: { "/tmp/current_branch_sha": "/path/to/local/repo/.git/refs/heads/", } setup_commands: - test -e || git clone https://github.com//.git - cd && git fetch && git checkout `cat /tmp/current_branch_sha` This tells ``ray up`` to sync the current git branch SHA from your personal computer to a temporary file on the cluster (assuming you've pushed the branch head already). Then, the setup commands read that file to figure out which SHA they should checkout on the nodes. Note that each command runs in its own session. The final workflow to update the cluster then becomes just this: 1. Make local changes to a git branch 2. Commit the changes with ``git commit`` and ``git push`` 3. Update files on your Ray cluster with ``ray up`` --- .. _cluster-config: Cluster YAML Configuration Options ================================== The cluster configuration is defined within a YAML file that will be used by the Cluster Launcher to launch the head node, and by the Autoscaler to launch worker nodes. Once the cluster configuration is defined, you will need to use the :ref:`Ray CLI ` to perform any operations such as starting and stopping the cluster. Syntax ------ .. parsed-literal:: :ref:`cluster_name `: str :ref:`max_workers `: int :ref:`upscaling_speed `: float :ref:`idle_timeout_minutes `: int :ref:`docker `: :ref:`docker ` :ref:`provider `: :ref:`provider ` :ref:`auth `: :ref:`auth ` :ref:`available_node_types `: :ref:`node_types ` :ref:`head_node_type `: str :ref:`file_mounts `: :ref:`file_mounts ` :ref:`cluster_synced_files `: - str :ref:`rsync_exclude `: - str :ref:`rsync_filter `: - str :ref:`initialization_commands `: - str :ref:`setup_commands `: - str :ref:`head_setup_commands `: - str :ref:`worker_setup_commands `: - str :ref:`head_start_ray_commands `: - str :ref:`worker_start_ray_commands `: - str Custom types ------------ .. _cluster-configuration-docker-type: Docker ~~~~~~ .. parsed-literal:: :ref:`image `: str :ref:`head_image `: str :ref:`worker_image `: str :ref:`container_name `: str :ref:`pull_before_run `: bool :ref:`run_options `: - str :ref:`head_run_options `: - str :ref:`worker_run_options `: - str :ref:`disable_automatic_runtime_detection `: bool :ref:`disable_shm_size_detection `: bool .. _cluster-configuration-auth-type: Auth ~~~~ .. tab-set:: .. tab-item:: AWS .. parsed-literal:: :ref:`ssh_user `: str :ref:`ssh_private_key `: str .. tab-item:: Azure .. parsed-literal:: :ref:`ssh_user `: str :ref:`ssh_private_key `: str :ref:`ssh_public_key `: str .. tab-item:: GCP .. parsed-literal:: :ref:`ssh_user `: str :ref:`ssh_private_key `: str .. tab-item:: vSphere .. parsed-literal:: :ref:`ssh_user `: str .. _cluster-configuration-provider-type: Provider ~~~~~~~~ .. tab-set:: .. tab-item:: AWS .. parsed-literal:: :ref:`type `: str :ref:`region `: str :ref:`availability_zone `: str :ref:`cache_stopped_nodes `: bool :ref:`security_group `: :ref:`Security Group ` :ref:`use_internal_ips `: bool .. tab-item:: Azure .. parsed-literal:: :ref:`type `: str :ref:`location `: str :ref:`availability_zone `: str :ref:`resource_group `: str :ref:`subscription_id `: str :ref:`msi_name `: str :ref:`msi_resource_group `: str :ref:`cache_stopped_nodes `: bool :ref:`use_internal_ips `: bool :ref:`use_external_head_ip `: bool .. tab-item:: GCP .. parsed-literal:: :ref:`type `: str :ref:`region `: str :ref:`availability_zone `: str :ref:`project_id `: str :ref:`cache_stopped_nodes `: bool :ref:`use_internal_ips `: bool .. tab-item:: vSphere .. 
parsed-literal:: :ref:`type `: str :ref:`vsphere_config `: :ref:`vSphere Config ` .. _cluster-configuration-security-group-type: Security Group ~~~~~~~~~~~~~~ .. tab-set:: .. tab-item:: AWS .. parsed-literal:: :ref:`GroupName `: str :ref:`IpPermissions `: - `IpPermission `_ .. _cluster-configuration-vsphere-config-type: vSphere Config ~~~~~~~~~~~~~~ .. tab-set:: .. tab-item:: vSphere .. parsed-literal:: :ref:`credentials `: :ref:`vSphere Credentials ` :ref:`frozen_vm `: :ref:`vSphere Frozen VM Configs ` :ref:`gpu_config `: :ref:`vSphere GPU Configs ` .. _cluster-configuration-vsphere-credentials-type: vSphere Credentials ~~~~~~~~~~~~~~~~~~~ .. tab-set:: .. tab-item:: vSphere .. parsed-literal:: :ref:`user `: str :ref:`password `: str :ref:`server `: str .. _cluster-configuration-vsphere-frozen-vm-configs: vSphere Frozen VM Configs ~~~~~~~~~~~~~~~~~~~~~~~~~ .. tab-set:: .. tab-item:: vSphere .. parsed-literal:: :ref:`name `: str :ref:`library_item `: str :ref:`resource_pool `: str :ref:`cluster `: str :ref:`datastore `: str .. _cluster-configuration-vsphere-gpu-configs: vSphere GPU Configs ~~~~~~~~~~~~~~~~~~~ .. tab-set:: .. tab-item:: vSphere .. parsed-literal:: :ref:`dynamic_pci_passthrough `: bool .. _cluster-configuration-node-types-type: Node types ~~~~~~~~~~ The ``available_nodes_types`` object's keys represent the names of the different node types. Deleting a node type from ``available_node_types`` and updating with :ref:`ray up ` will cause the autoscaler to scale down all nodes of that type. In particular, changing the key of a node type object will result in removal of nodes corresponding to the old key; nodes with the new key name will then be created according to cluster configuration and Ray resource demands. .. parsed-literal:: : :ref:`node_config `: :ref:`Node config ` :ref:`resources `: :ref:`Resources ` :ref:`min_workers `: int :ref:`max_workers `: int :ref:`worker_setup_commands `: - str :ref:`docker `: :ref:`Node Docker ` : ... ... .. _cluster-configuration-node-config-type: Node config ~~~~~~~~~~~ Cloud-specific configuration for nodes of a given node type. Modifying the ``node_config`` and updating with :ref:`ray up ` will cause the autoscaler to scale down all existing nodes of the node type; nodes with the newly applied ``node_config`` will then be created according to cluster configuration and Ray resource demands. .. tab-set:: .. tab-item:: AWS A YAML object which conforms to the EC2 ``create_instances`` API in `the AWS docs `_. .. tab-item:: Azure A YAML object as defined in `the deployment template `_ whose resources are defined in `the Azure docs `_. .. tab-item:: GCP A YAML object as defined in `the GCP docs `_. .. tab-item:: vSphere .. parsed-literal:: # The resource pool where the head node should live, if unset, will be # the frozen VM's resource pool. resource_pool: str # The datastore to store the vmdk of the head node vm, if unset, will be # the frozen VM's datastore. datastore: str .. _cluster-configuration-node-docker-type: Node Docker ~~~~~~~~~~~ .. parsed-literal:: :ref:`worker_image `: str :ref:`pull_before_run `: bool :ref:`worker_run_options `: - str :ref:`disable_automatic_runtime_detection `: bool :ref:`disable_shm_size_detection `: bool .. _cluster-configuration-resources-type: Resources ~~~~~~~~~ .. parsed-literal:: :ref:`CPU `: int :ref:`GPU `: int :ref:`object_store_memory `: int :ref:`memory `: int : int : int ... .. _cluster-configuration-file-mounts-type: File mounts ~~~~~~~~~~~ .. 
parsed-literal:: : str # Path 1 on local machine : str # Path 2 on local machine ... Properties and Definitions -------------------------- .. _cluster-configuration-cluster-name: ``cluster_name`` ~~~~~~~~~~~~~~~~ The name of the cluster. This is the namespace of the cluster. * **Required:** Yes * **Importance:** High * **Type:** String * **Default:** "default" * **Pattern:** ``[a-zA-Z0-9_]+`` .. _cluster-configuration-max-workers: ``max_workers`` ~~~~~~~~~~~~~~~ The maximum number of workers the cluster will have at any given time. * **Required:** No * **Importance:** High * **Type:** Integer * **Default:** ``2`` * **Minimum:** ``0`` * **Maximum:** Unbounded .. _cluster-configuration-upscaling-speed: ``upscaling_speed`` ~~~~~~~~~~~~~~~~~~~ The number of nodes allowed to be pending as a multiple of the current number of nodes. For example, if set to 1.0, the cluster can grow in size by at most 100% at any time, so if the cluster currently has 20 nodes, at most 20 pending launches are allowed. Note that although the autoscaler will scale down to `min_workers` (which could be 0), it will always scale up to 5 nodes at a minimum when scaling up. * **Required:** No * **Importance:** Medium * **Type:** Float * **Default:** ``1.0`` * **Minimum:** ``0.0`` * **Maximum:** Unbounded .. _cluster-configuration-idle-timeout-minutes: ``idle_timeout_minutes`` ~~~~~~~~~~~~~~~~~~~~~~~~ The number of minutes that need to pass before an idle worker node is removed by the Autoscaler. * **Required:** No * **Importance:** Medium * **Type:** Integer * **Default:** ``5`` * **Minimum:** ``0`` * **Maximum:** Unbounded .. _cluster-configuration-docker: ``docker`` ~~~~~~~~~~ Configure Ray to run in Docker containers. * **Required:** No * **Importance:** High * **Type:** :ref:`Docker ` * **Default:** ``{}`` In rare cases when Docker is not available on the system by default (e.g., bad AMI), add the following commands to :ref:`initialization_commands ` to install it. .. code-block:: yaml initialization_commands: - curl -fsSL https://get.docker.com -o get-docker.sh - sudo sh get-docker.sh - sudo usermod -aG docker $USER - sudo systemctl restart docker -f .. _cluster-configuration-provider: ``provider`` ~~~~~~~~~~~~ The cloud provider-specific configuration properties. * **Required:** Yes * **Importance:** High * **Type:** :ref:`Provider ` .. _cluster-configuration-auth: ``auth`` ~~~~~~~~ Authentication credentials that Ray will use to launch nodes. * **Required:** Yes * **Importance:** High * **Type:** :ref:`Auth ` .. _cluster-configuration-available-node-types: ``available_node_types`` ~~~~~~~~~~~~~~~~~~~~~~~~ Tells the autoscaler the allowed node types and the resources they provide. Each node type is identified by a user-specified key. * **Required:** No * **Importance:** High * **Type:** :ref:`Node types ` * **Default:** .. tab-set:: .. tab-item:: AWS .. code-block:: yaml available_node_types: ray.head.default: node_config: InstanceType: m5.large BlockDeviceMappings: - DeviceName: /dev/sda1 Ebs: VolumeSize: 140 resources: {"CPU": 2} ray.worker.default: node_config: InstanceType: m5.large InstanceMarketOptions: MarketType: spot resources: {"CPU": 2} min_workers: 0 .. _cluster-configuration-head-node-type: ``head_node_type`` ~~~~~~~~~~~~~~~~~~ The key for one of the node types in :ref:`available_node_types `. This node type will be used to launch the head node. If the field ``head_node_type`` is changed and an update is executed with :ref:`ray up `, the currently running head node will be considered outdated. 
The user will receive a prompt asking to confirm scale-down of the outdated head node, and the cluster will restart with a new head node. Changing the :ref:`node_config` of the :ref:`node_type` with key ``head_node_type`` will also result in cluster restart after a user prompt. * **Required:** Yes * **Importance:** High * **Type:** String * **Pattern:** ``[a-zA-Z0-9_]+`` .. _cluster-configuration-file-mounts: ``file_mounts`` ~~~~~~~~~~~~~~~ The files or directories to copy to the head and worker nodes. * **Required:** No * **Importance:** High * **Type:** :ref:`File mounts ` * **Default:** ``[]`` .. _cluster-configuration-cluster-synced-files: ``cluster_synced_files`` ~~~~~~~~~~~~~~~~~~~~~~~~ A list of paths to the files or directories to copy from the head node to the worker nodes. The same path on the head node will be copied to the worker node. This behavior is a subset of the file_mounts behavior, so in the vast majority of cases one should just use :ref:`file_mounts `. * **Required:** No * **Importance:** Low * **Type:** List of String * **Default:** ``[]`` .. _cluster-configuration-rsync-exclude: ``rsync_exclude`` ~~~~~~~~~~~~~~~~~ A list of patterns for files to exclude when running ``rsync up`` or ``rsync down``. The filter is applied on the source directory only. Example for a pattern in the list: ``**/.git/**``. * **Required:** No * **Importance:** Low * **Type:** List of String * **Default:** ``[]`` .. _cluster-configuration-rsync-filter: ``rsync_filter`` ~~~~~~~~~~~~~~~~ A list of patterns for files to exclude when running ``rsync up`` or ``rsync down``. The filter is applied on the source directory and recursively through all subdirectories. Example for a pattern in the list: ``.gitignore``. * **Required:** No * **Importance:** Low * **Type:** List of String * **Default:** ``[]`` .. _cluster-configuration-initialization-commands: ``initialization_commands`` ~~~~~~~~~~~~~~~~~~~~~~~~~~~ A list of commands that will be run before the :ref:`setup commands `. If Docker is enabled, these commands will run outside the container and before Docker is setup. * **Required:** No * **Importance:** Medium * **Type:** List of String * **Default:** ``[]`` .. _cluster-configuration-setup-commands: ``setup_commands`` ~~~~~~~~~~~~~~~~~~ A list of commands to run to set up nodes. These commands will always run on the head and worker nodes and will be merged with :ref:`head setup commands ` for head and with :ref:`worker setup commands ` for workers. * **Required:** No * **Importance:** Medium * **Type:** List of String * **Default:** .. tab-set:: .. tab-item:: AWS .. code-block:: yaml # Default setup_commands: setup_commands: - echo 'export PATH="$HOME/anaconda3/envs/tensorflow_p36/bin:$PATH"' >> ~/.bashrc - pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl - Setup commands should ideally be *idempotent* (i.e., can be run multiple times without changing the result); this allows Ray to safely update nodes after they have been created. You can usually make commands idempotent with small modifications, e.g. ``git clone foo`` can be rewritten as ``test -e foo || git clone foo`` which checks if the repo is already cloned first. - Setup commands are run sequentially but separately. For example, if you are using anaconda, you need to run ``conda activate env && pip install -U ray`` because splitting the command into two setup commands will not work. 
- Ideally, you should avoid using setup_commands by creating a docker image with all the dependencies preinstalled to minimize startup time. - **Tip**: if you also want to run apt-get commands during setup add the following list of commands: .. code-block:: yaml setup_commands: - sudo pkill -9 apt-get || true - sudo pkill -9 dpkg || true - sudo dpkg --configure -a .. _cluster-configuration-head-setup-commands: ``head_setup_commands`` ~~~~~~~~~~~~~~~~~~~~~~~ A list of commands to run to set up the head node. These commands will be merged with the general :ref:`setup commands `. * **Required:** No * **Importance:** Low * **Type:** List of String * **Default:** ``[]`` .. _cluster-configuration-worker-setup-commands: ``worker_setup_commands`` ~~~~~~~~~~~~~~~~~~~~~~~~~ A list of commands to run to set up the worker nodes. These commands will be merged with the general :ref:`setup commands `. * **Required:** No * **Importance:** Low * **Type:** List of String * **Default:** ``[]`` .. _cluster-configuration-head-start-ray-commands: ``head_start_ray_commands`` ~~~~~~~~~~~~~~~~~~~~~~~~~~~ Commands to start ray on the head node. You don't need to change this. * **Required:** No * **Importance:** Low * **Type:** List of String * **Default:** .. tab-set:: .. tab-item:: AWS .. code-block:: yaml head_start_ray_commands: - ray stop - ulimit -n 65536; ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml .. _cluster-configuration-worker-start-ray-commands: ``worker_start_ray_commands`` ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Command to start ray on worker nodes. You don't need to change this. * **Required:** No * **Importance:** Low * **Type:** List of String * **Default:** .. tab-set:: .. tab-item:: AWS .. code-block:: yaml worker_start_ray_commands: - ray stop - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076 .. _cluster-configuration-image: ``docker.image`` ~~~~~~~~~~~~~~~~ The default Docker image to pull in the head and worker nodes. This can be overridden by the :ref:`head_image ` and :ref:`worker_image ` fields. If neither `image` nor (:ref:`head_image ` and :ref:`worker_image `) are specified, Ray will not use Docker. * **Required:** Yes (If Docker is in use.) * **Importance:** High * **Type:** String The Ray project provides Docker images on `DockerHub `_. The repository includes following images: * ``rayproject/ray-ml:latest-gpu``: CUDA support, includes ML dependencies. * ``rayproject/ray:latest-gpu``: CUDA support, no ML dependencies. * ``rayproject/ray-ml:latest``: No CUDA support, includes ML dependencies. * ``rayproject/ray:latest``: No CUDA support, no ML dependencies. .. _cluster-configuration-head-image: ``docker.head_image`` ~~~~~~~~~~~~~~~~~~~~~ Docker image for the head node to override the default :ref:`docker image `. * **Required:** No * **Importance:** Low * **Type:** String .. _cluster-configuration-worker-image: ``docker.worker_image`` ~~~~~~~~~~~~~~~~~~~~~~~ Docker image for the worker nodes to override the default :ref:`docker image `. * **Required:** No * **Importance:** Low * **Type:** String .. _cluster-configuration-container-name: ``docker.container_name`` ~~~~~~~~~~~~~~~~~~~~~~~~~ The name to use when starting the Docker container. * **Required:** Yes (If Docker is in use.) * **Importance:** Low * **Type:** String * **Default:** ray_container .. 
_cluster-configuration-pull-before-run: ``docker.pull_before_run`` ~~~~~~~~~~~~~~~~~~~~~~~~~~ If enabled, the latest version of image will be pulled when starting Docker. If disabled, ``docker run`` will only pull the image if no cached version is present. * **Required:** No * **Importance:** Medium * **Type:** Boolean * **Default:** ``True`` .. _cluster-configuration-run-options: ``docker.run_options`` ~~~~~~~~~~~~~~~~~~~~~~ The extra options to pass to ``docker run``. * **Required:** No * **Importance:** Medium * **Type:** List of String * **Default:** ``[]`` .. _cluster-configuration-head-run-options: ``docker.head_run_options`` ~~~~~~~~~~~~~~~~~~~~~~~~~~~ The extra options to pass to ``docker run`` for head node only. * **Required:** No * **Importance:** Low * **Type:** List of String * **Default:** ``[]`` .. _cluster-configuration-worker-run-options: ``docker.worker_run_options`` ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The extra options to pass to ``docker run`` for worker nodes only. * **Required:** No * **Importance:** Low * **Type:** List of String * **Default:** ``[]`` .. _cluster-configuration-disable-automatic-runtime-detection: ``docker.disable_automatic_runtime_detection`` ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ If enabled, Ray will not try to use the NVIDIA Container Runtime if GPUs are present. * **Required:** No * **Importance:** Low * **Type:** Boolean * **Default:** ``False`` .. _cluster-configuration-disable-shm-size-detection: ``docker.disable_shm_size_detection`` ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ If enabled, Ray will not automatically specify the size ``/dev/shm`` for the started container and the runtime's default value (64MiB for Docker) will be used. If ``--shm-size=<>`` is manually added to ``run_options``, this is *automatically* set to ``True``, meaning that Ray will defer to the user-provided value. * **Required:** No * **Importance:** Low * **Type:** Boolean * **Default:** ``False`` .. _cluster-configuration-ssh-user: ``auth.ssh_user`` ~~~~~~~~~~~~~~~~~ The user that Ray will authenticate with when launching new nodes. * **Required:** Yes * **Importance:** High * **Type:** String .. _cluster-configuration-ssh-private-key: ``auth.ssh_private_key`` ~~~~~~~~~~~~~~~~~~~~~~~~ .. tab-set:: .. tab-item:: AWS The path to an existing private key for Ray to use. If not configured, Ray will create a new private keypair (default behavior). If configured, the key must be added to the project-wide metadata and ``KeyName`` has to be defined in the :ref:`node configuration `. * **Required:** No * **Importance:** Low * **Type:** String .. tab-item:: Azure The path to an existing private key for Ray to use. * **Required:** Yes * **Importance:** High * **Type:** String You may use ``ssh-keygen -t rsa -b 4096`` to generate a new ssh keypair. .. tab-item:: GCP The path to an existing private key for Ray to use. If not configured, Ray will create a new private keypair (default behavior). If configured, the key must be added to the project-wide metadata and ``KeyName`` has to be defined in the :ref:`node configuration `. * **Required:** No * **Importance:** Low * **Type:** String .. tab-item:: vSphere Not available. The vSphere provider expects the key to be located at a fixed path ``~/ray-bootstrap-key.pem``. .. _cluster-configuration-ssh-public-key: ``auth.ssh_public_key`` ~~~~~~~~~~~~~~~~~~~~~~~ .. tab-set:: .. tab-item:: AWS Not available. .. tab-item:: Azure The path to an existing public key for Ray to use. * **Required:** Yes * **Importance:** High * **Type:** String .. 
tab-item:: GCP Not available. .. tab-item:: vSphere Not available. .. _cluster-configuration-type: ``provider.type`` ~~~~~~~~~~~~~~~~~ .. tab-set:: .. tab-item:: AWS The cloud service provider. For AWS, this must be set to ``aws``. * **Required:** Yes * **Importance:** High * **Type:** String .. tab-item:: Azure The cloud service provider. For Azure, this must be set to ``azure``. * **Required:** Yes * **Importance:** High * **Type:** String .. tab-item:: GCP The cloud service provider. For GCP, this must be set to ``gcp``. * **Required:** Yes * **Importance:** High * **Type:** String .. tab-item:: vSphere The cloud service provider. For vSphere and VCF, this must be set to ``vsphere``. * **Required:** Yes * **Importance:** High * **Type:** String .. _cluster-configuration-region: ``provider.region`` ~~~~~~~~~~~~~~~~~~~ .. tab-set:: .. tab-item:: AWS The region to use for deployment of the Ray cluster. * **Required:** Yes * **Importance:** High * **Type:** String * **Default:** us-west-2 .. tab-item:: Azure Not available. .. tab-item:: GCP The region to use for deployment of the Ray cluster. * **Required:** Yes * **Importance:** High * **Type:** String * **Default:** us-west1 .. tab-item:: vSphere Not available. .. _cluster-configuration-availability-zone: ``provider.availability_zone`` ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. tab-set:: .. tab-item:: AWS A string specifying a comma-separated list of availability zone(s) that nodes may be launched in. Nodes will be launched in the first listed availability zone and will be tried in the following availability zones if launching fails. * **Required:** No * **Importance:** Low * **Type:** String * **Default:** us-west-2a,us-west-2b .. tab-item:: Azure A string specifying a comma-separated list of availability zone(s) that nodes may be launched in. This can be specified at the provider level to set defaults for all node types, or at the node level to override the provider setting for specific node types. For Azure, availability zone availability depends on each specific VM size / location combination. Node-level configuration in ``available_node_types..node_config.azure_arm_parameters.availability_zone`` takes precedence over provider-level configuration. * **Required:** No * **Importance:** Low * **Type:** String * **Default:** "auto" (let Azure automatically pick zones) * **Example values:** * ``"1,2,3"`` - Use zones 1, 2, and 3 * ``"1"`` - Use only zone 1 * ``"none"`` - Explicitly disable zones * ``"auto"`` or omit - Let Azure automatically pick zones See the following example Azure cluster config for more details: .. literalinclude:: ../../../../../python/ray/autoscaler/azure/example-availability-zones.yaml :language: yaml .. tab-item:: GCP A string specifying a comma-separated list of availability zone(s) that nodes may be launched in. * **Required:** No * **Importance:** Low * **Type:** String * **Default:** us-west1-a .. tab-item:: vSphere Not available. .. _cluster-configuration-location: ``provider.location`` ~~~~~~~~~~~~~~~~~~~~~ .. tab-set:: .. tab-item:: AWS Not available. .. tab-item:: Azure The location to use for deployment of the Ray cluster. * **Required:** Yes * **Importance:** High * **Type:** String * **Default:** westus2 .. tab-item:: GCP Not available. .. tab-item:: vSphere Not available. .. _cluster-configuration-resource-group: ``provider.resource_group`` ~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. tab-set:: .. tab-item:: AWS Not available. .. tab-item:: Azure The resource group to use for deployment of the Ray cluster. 
* **Required:** Yes * **Importance:** High * **Type:** String * **Default:** ray-cluster .. tab-item:: GCP Not available. .. tab-item:: vSphere Not available. .. _cluster-configuration-subscription-id: ``provider.subscription_id`` ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. tab-set:: .. tab-item:: AWS Not available. .. tab-item:: Azure The subscription ID to use for deployment of the Ray cluster. If not specified, Ray will use the default from the Azure CLI. * **Required:** No * **Importance:** High * **Type:** String * **Default:** ``""`` .. tab-item:: GCP Not available. .. tab-item:: vSphere Not available. .. _cluster-configuration-msi-name: ``provider.msi_name`` ~~~~~~~~~~~~~~~~~~~~~ .. tab-set:: .. tab-item:: AWS Not available. .. tab-item:: Azure The name of the managed identity to use for deployment of the Ray cluster. If not specified, Ray will create a default user-assigned managed identity. * **Required:** No * **Importance:** Low * **Type:** String * **Default:** ray-default-msi .. tab-item:: GCP Not available. .. tab-item:: vSphere Not available. .. _cluster-configuration-msi-resource-group: ``provider.msi_resource_group`` ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. tab-set:: .. tab-item:: AWS Not available. .. tab-item:: Azure The name of the managed identity's resource group to use for deployment of the Ray cluster, used in conjunction with msi_name. If not specified, Ray will create a default user-assigned managed identity in resource group specified in the provider config. * **Required:** No * **Importance:** Low * **Type:** String * **Default:** ray-cluster .. tab-item:: GCP Not available. .. tab-item:: vSphere Not available. .. _cluster-configuration-project-id: ``provider.project_id`` ~~~~~~~~~~~~~~~~~~~~~~~ .. tab-set:: .. tab-item:: AWS Not available. .. tab-item:: Azure Not available. .. tab-item:: GCP The globally unique project ID to use for deployment of the Ray cluster. * **Required:** Yes * **Importance:** Low * **Type:** String * **Default:** ``null`` .. tab-item:: vSphere Not available. .. _cluster-configuration-cache-stopped-nodes: ``provider.cache_stopped_nodes`` ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ If enabled, nodes will be *stopped* when the cluster scales down. If disabled, nodes will be *terminated* instead. Stopped nodes launch faster than terminated nodes. * **Required:** No * **Importance:** Low * **Type:** Boolean * **Default:** ``True`` .. _cluster-configuration-use-internal-ips: ``provider.use_internal_ips`` ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ If enabled, Ray will use private IP addresses for communication between nodes. This should be omitted if your network interfaces use public IP addresses. If enabled, Ray CLI commands (e.g. ``ray up``) will have to be run from a machine that is part of the same VPC as the cluster. This option does not affect the existence of public IP addresses for the nodes, it only affects which IP addresses are used by Ray. The existence of public IP addresses is controlled by your cloud provider's configuration. * **Required:** No * **Importance:** Low * **Type:** Boolean * **Default:** ``False`` .. _cluster-configuration-use-external-head-ip: ``provider.use_external_head_ip`` ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. tab-set:: .. tab-item:: AWS Not available. .. tab-item:: Azure If enabled, Ray will provision and use a public IP address for communication with the head node, regardless of the value of ``use_internal_ips``. 
This option can be used in combination with ``use_internal_ips`` to avoid provisioning excess public IPs for worker nodes (i.e., communicate among nodes using private IPs, but provision a public IP for head node communication only). If ``use_internal_ips`` is ``False``, then this option has no effect. * **Required:** No * **Importance:** Low * **Type:** Boolean * **Default:** ``False`` .. tab-item:: GCP Not available. .. tab-item:: vSphere Not available. .. _cluster-configuration-security-group: ``provider.security_group`` ~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. tab-set:: .. tab-item:: AWS A security group that can be used to specify custom inbound rules. * **Required:** No * **Importance:** Medium * **Type:** :ref:`Security Group ` .. tab-item:: Azure Not available. .. tab-item:: GCP Not available. .. tab-item:: vSphere Not available. .. _cluster-configuration-vsphere-config: ``provider.vsphere_config`` ~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. tab-set:: .. tab-item:: AWS Not available. .. tab-item:: Azure Not available. .. tab-item:: GCP Not available. .. tab-item:: vSphere vSphere configurations used to connect vCenter Server. If not configured, the VSPHERE_* environment variables will be used. * **Required:** No * **Importance:** Low * **Type:** :ref:`vSphere Config ` .. _cluster-configuration-group-name: ``security_group.GroupName`` ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The name of the security group. This name must be unique within the VPC. * **Required:** No * **Importance:** Low * **Type:** String * **Default:** ``"ray-autoscaler-{cluster-name}"`` .. _cluster-configuration-ip-permissions: ``security_group.IpPermissions`` ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The inbound rules associated with the security group. * **Required:** No * **Importance:** Medium * **Type:** `IpPermission `_ .. _cluster-configuration-vsphere-credentials: ``vsphere_config.credentials`` ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The credential to connect to the vSphere vCenter Server. * **Required:** No * **Importance:** Low * **Type:** :ref:`vSphere Credentials ` .. _cluster-configuration-vsphere-user: ``vsphere_config.credentials.user`` ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Username to connect to vCenter Server. * **Required:** No * **Importance:** Low * **Type:** String .. _cluster-configuration-vsphere-password: ``vsphere_config.credentials.password`` ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Password of the user to connect to vCenter Server. * **Required:** No * **Importance:** Low * **Type:** String .. _cluster-configuration-vsphere-server: ``vsphere_config.credentials.server`` ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The vSphere vCenter Server address. * **Required:** No * **Importance:** Low * **Type:** String .. _cluster-configuration-vsphere-frozen-vm: ``vsphere_config.frozen_vm`` ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The frozen VM related configurations. If the frozen VM(s) is/are existing, then ``library_item`` should be unset. Either an existing frozen VM should be specified by ``name``, or a resource pool name of frozen VMs on every ESXi (https://docs.vmware.com/en/VMware-vSphere/index.html) host should be specified by ``resource_pool``. If the frozen VM(s) is/are to be deployed from OVF template, then `library_item` must be set to point to an OVF template (https://docs.vmware.com/en/VMware-vSphere/8.0/vsphere-vm-administration/GUID-AFEDC48B-C96F-4088-9C1F-4F0A30E965DE.html) in the content library. In such a case, ``name`` must be set to indicate the name or the name prefix of the frozen VM(s). 
Then, either ``resource_pool`` should be set to indicate that a set of frozen VMs will be created on each ESXi host of the resource pool, or ``cluster`` should be set to indicate that creating a single frozen VM in the vSphere cluster. The config ``datastore`` (https://docs.vmware.com/en/VMware-vSphere/7.0/com.vmware.vsphere.storage.doc/GUID-D5AB2BAD-C69A-4B8D-B468-25D86B8D39CE.html) is mandatory in this case. Valid examples: 1. ``ray up`` on a frozen VM to be deployed from an OVF template: .. code-block:: yaml frozen_vm: name: single-frozen-vm library_item: frozen-vm-template cluster: vsanCluster datastore: vsanDatastore 2. ``ray up`` on an existing frozen VM: .. code-block:: yaml frozen_vm: name: existing-single-frozen-vm 3. ``ray up`` on a resource pool of frozen VMs to be deployed from an OVF template: .. code-block:: yaml frozen_vm: name: frozen-vm-prefix library_item: frozen-vm-template resource_pool: frozen-vm-resource-pool datastore: vsanDatastore 4. ``ray up`` on an existing resource pool of frozen VMs: .. code-block:: yaml frozen_vm: resource_pool: frozen-vm-resource-pool Other cases not in above examples are invalid. * **Required:** Yes * **Importance:** High * **Type:** :ref:`vSphere Frozen VM Configs ` .. _cluster-configuration-vsphere-frozen-vm-name: ``vsphere_config.frozen_vm.name`` ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The name or the name prefix of the frozen VM. Can only be unset when ``resource_pool`` is set and pointing to an existing resource pool of frozen VMs. * **Required:** No * **Importance:** Medium * **Type:** String .. _cluster-configuration-vsphere-frozen-vm-library-item: ``vsphere_config.frozen_vm.library_item`` ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The library item (https://docs.vmware.com/en/VMware-vSphere/8.0/vsphere-vm-administration/GUID-D3DD122F-16A5-4F36-8467-97994A854B16.html#GUID-D3DD122F-16A5-4F36-8467-97994A854B16) of the OVF template of the frozen VM. If set, the frozen VM or a set of frozen VMs will be deployed from an OVF template specified by ``library_item``. Otherwise, frozen VM(s) should be existing. Visit the VM Packer for Ray project (https://github.com/vmware-ai-labs/vm-packer-for-ray) to know how to create an OVF template for frozen VMs. * **Required:** No * **Importance:** Low * **Type:** String .. _cluster-configuration-vsphere-frozen-vm-resource-pool: ``vsphere_config.frozen_vm.resource_pool`` ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The resource pool name of the frozen VMs, can point to an existing resource pool of frozen VMs. Otherwise, ``library_item`` must be specified and a set of frozen VMs will be deployed on each ESXi host. The frozen VMs will be named as "{frozen_vm.name}-{the vm's ip address}" * **Required:** No * **Importance:** Medium * **Type:** String .. _cluster-configuration-vsphere-frozen-vm-cluster: ``vsphere_config.frozen_vm.cluster`` ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The vSphere cluster name, only takes effect when ``library_item`` is set and ``resource_pool`` is unset. Indicates to deploy a single frozen VM on the vSphere cluster from OVF template. * **Required:** No * **Importance:** Medium * **Type:** String .. _cluster-configuration-vsphere-frozen-vm-datastore: ``vsphere_config.frozen_vm.datastore`` ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The target vSphere datastore name for storing the virtual machine files of the frozen VM to be deployed from OVF template. Will take effect only when ``library_item`` is set. If ``resource_pool`` is also set, this datastore must be a shared datastore among the ESXi hosts. 
* **Required:** No * **Importance:** Low * **Type:** String .. _cluster-configuration-vsphere-gpu-config: ``vsphere_config.gpu_config`` ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. _cluster-configuration-vsphere-gpu-config-pci-passthrough: ``vsphere_config.gpu_config.dynamic_pci_passthrough`` ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Controls how a GPU on the ESXi host is bound to the Ray node VM. The default value is False, which indicates regular PCI passthrough. If set to True, dynamic PCI passthrough (https://docs.vmware.com/en/VMware-vSphere/8.0/vsphere-esxi-host-client/GUID-2B6D43A6-9598-47C4-A2E7-5924E3367BB6.html) will be enabled for the GPU. A VM with a dynamic PCI passthrough GPU can still support vSphere DRS (https://www.vmware.com/products/vsphere/drs-dpm.html). * **Required:** No * **Importance:** Low * **Type:** Boolean .. _cluster-configuration-node-config: ``available_node_types..node_type.node_config`` ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The configuration to be used to launch the nodes on the cloud service provider. Among other things, this will specify the instance type to be launched. * **Required:** Yes * **Importance:** High * **Type:** :ref:`Node config ` .. _cluster-configuration-resources: ``available_node_types..node_type.resources`` ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The resources that a node type provides, which enables the autoscaler to automatically select the right type of nodes to launch given the resource demands of the application. The resources specified will be automatically passed to the ``ray start`` command for the node via an environment variable. If not provided, Autoscaler can automatically detect them only for AWS/Kubernetes cloud providers. For more information, see also the `resource demand scheduler `_. * **Required:** Yes (except for AWS/K8s) * **Importance:** High * **Type:** :ref:`Resources ` * **Default:** ``{}`` In some cases, adding special nodes without any resources may be desirable. Such nodes can be used as a driver which connects to the cluster to launch jobs. To manually add a node to an autoscaled cluster, set the *ray-cluster-name* tag and set the *ray-node-type* tag to unmanaged. Unmanaged nodes can be created by setting the resources to ``{}`` and the :ref:`maximum workers ` to 0. The Autoscaler will not attempt to start, stop, or update unmanaged nodes. The user is responsible for properly setting up and cleaning up unmanaged nodes. .. _cluster-configuration-node-min-workers: ``available_node_types..node_type.min_workers`` ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The minimum number of workers to maintain for this node type regardless of utilization. * **Required:** No * **Importance:** High * **Type:** Integer * **Default:** ``0`` * **Minimum:** ``0`` * **Maximum:** Unbounded .. _cluster-configuration-node-max-workers: ``available_node_types..node_type.max_workers`` ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The maximum number of workers to have in the cluster for this node type regardless of utilization. This takes precedence over :ref:`minimum workers `. By default, the number of workers of a node type is unbounded, constrained only by the cluster-wide :ref:`max_workers `. (Prior to Ray 1.3.0, the default value for this field was 0.) Note that for nodes of type ``head_node_type``, the default number of max workers is 0.
* **Required:** No * **Importance:** High * **Type:** Integer * **Default:** cluster-wide :ref:`max_workers ` * **Minimum:** ``0`` * **Maximum:** cluster-wide :ref:`max_workers ` .. _cluster-configuration-node-type-worker-setup-commands: ``available_node_types..node_type.worker_setup_commands`` ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ A list of commands to run to set up worker nodes of this type. These commands will replace the general :ref:`worker setup commands ` for the node. * **Required:** No * **Importance:** Low * **Type:** List of String * **Default:** ``[]`` .. _cluster-configuration-cpu: ``available_node_types..node_type.resources.CPU`` ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. tab-set:: .. tab-item:: AWS The number of CPUs made available by this node. If not configured, Autoscaler can automatically detect them only for AWS/Kubernetes cloud providers. * **Required:** Yes (except for AWS/K8s) * **Importance:** High * **Type:** Integer .. tab-item:: Azure The number of CPUs made available by this node. * **Required:** Yes * **Importance:** High * **Type:** Integer .. tab-item:: GCP The number of CPUs made available by this node. * **Required:** No * **Importance:** High * **Type:** Integer .. tab-item:: vSphere The number of CPUs made available by this node. If not configured, the nodes will use the same settings as the frozen VM. * **Required:** No * **Importance:** High * **Type:** Integer .. _cluster-configuration-gpu: ``available_node_types..node_type.resources.GPU`` ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. tab-set:: .. tab-item:: AWS The number of GPUs made available by this node. If not configured, Autoscaler can automatically detect them only for AWS/Kubernetes cloud providers. * **Required:** No * **Importance:** Low * **Type:** Integer .. tab-item:: Azure The number of GPUs made available by this node. * **Required:** No * **Importance:** High * **Type:** Integer .. tab-item:: GCP The number of GPUs made available by this node. * **Required:** No * **Importance:** High * **Type:** Integer .. tab-item:: vSphere The number of GPUs made available by this node. * **Required:** No * **Importance:** High * **Type:** Integer .. _cluster-configuration-memory: ``available_node_types..node_type.resources.memory`` ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. tab-set:: .. tab-item:: AWS The memory in bytes allocated for python worker heap memory on the node. If not configured, Autoscaler will automatically detect the amount of RAM on the node for AWS/Kubernetes and allocate 70% of it for the heap. * **Required:** No * **Importance:** Low * **Type:** Integer .. tab-item:: Azure The memory in bytes allocated for python worker heap memory on the node. * **Required:** No * **Importance:** High * **Type:** Integer .. tab-item:: GCP The memory in bytes allocated for python worker heap memory on the node. * **Required:** No * **Importance:** High * **Type:** Integer .. tab-item:: vSphere The memory in megabytes allocated for python worker heap memory on the node. If not configured, the node will use the same memory settings as the frozen VM. * **Required:** No * **Importance:** High * **Type:** Integer .. _cluster-configuration-object-store-memory: ``available_node_types..node_type.resources.object-store-memory`` ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. tab-set:: ..
tab-item:: AWS The memory in bytes allocated for the object store on the node. If not configured, Autoscaler will automatically detect the amount of RAM on the node for AWS/Kubernetes and allocate 30% of it for the object store. * **Required:** No * **Importance:** Low * **Type:** Integer .. tab-item:: Azure The memory in bytes allocated for the object store on the node. * **Required:** No * **Importance:** High * **Type:** Integer .. tab-item:: GCP The memory in bytes allocated for the object store on the node. * **Required:** No * **Importance:** High * **Type:** Integer .. tab-item:: vSphere The memory in bytes allocated for the object store on the node. * **Required:** No * **Importance:** High * **Type:** Integer .. _cluster-configuration-node-docker: ``available_node_types..docker`` ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ A set of overrides to the top-level :ref:`Docker ` configuration. * **Required:** No * **Importance:** Low * **Type:** :ref:`docker ` * **Default:** ``{}`` Examples -------- Minimal configuration ~~~~~~~~~~~~~~~~~~~~~ .. tab-set:: .. tab-item:: AWS .. literalinclude:: ../../../../../python/ray/autoscaler/aws/example-minimal.yaml :language: yaml .. tab-item:: Azure .. literalinclude:: ../../../../../python/ray/autoscaler/azure/example-minimal.yaml :language: yaml .. tab-item:: GCP .. literalinclude:: ../../../../../python/ray/autoscaler/gcp/example-minimal.yaml :language: yaml .. tab-item:: vSphere .. literalinclude:: ../../../../../python/ray/autoscaler/vsphere/example-minimal.yaml :language: yaml Full configuration ~~~~~~~~~~~~~~~~~~ .. tab-set:: .. tab-item:: AWS .. literalinclude:: ../../../../../python/ray/autoscaler/aws/example-full.yaml :language: yaml .. tab-item:: Azure .. literalinclude:: ../../../../../python/ray/autoscaler/azure/example-full.yaml :language: yaml .. tab-item:: GCP .. literalinclude:: ../../../../../python/ray/autoscaler/gcp/example-full.yaml :language: yaml .. tab-item:: vSphere .. literalinclude:: ../../../../../python/ray/autoscaler/vsphere/example-full.yaml :language: yaml TPU Configuration ~~~~~~~~~~~~~~~~~ It is possible to use `TPU VMs `_ on GCP. Currently, `TPU pods `_ (TPUs other than v2-8, v3-8 and v4-8) are not supported. Before using a config with TPUs, ensure that the `TPU API is enabled for your GCP project `_. .. tab-set:: .. tab-item:: GCP .. literalinclude:: ../../../../../python/ray/autoscaler/gcp/tpu.yaml :language: yaml --- .. _ref-cluster-setup: Community Supported Cluster Managers ==================================== .. toctree:: :hidden: yarn slurm lsf .. note:: If you're using AWS, Azure, GCP or vSphere you can use the :ref:`Ray cluster launcher ` to simplify the cluster setup process. The following is a list of community supported cluster managers. .. toctree:: :maxdepth: 2 yarn.rst slurm.rst lsf.rst spark.rst .. _ref-additional-cloud-providers: Using a custom cloud or cluster manager ======================================= The Ray cluster launcher currently supports AWS, Azure, GCP, Aliyun, vSphere and KubeRay out of the box. To use the Ray cluster launcher and Autoscaler on other cloud providers or cluster managers, you can implement the `node_provider.py `_ interface (100 LOC). Once the node provider is implemented, you can register it in the `provider section `_ of the cluster launcher config. .. code-block:: yaml provider: type: "external" module: "my.module.MyCustomNodeProvider" You can refer to `AWSNodeProvider `_, `KubeRayNodeProvider `_ and `LocalNodeProvider `_ for more examples. --- .. 
_ray-LSF-deploy: Deploying on LSF ================ This document describes a couple of high-level steps to run Ray clusters on LSF. 1) Obtain desired nodes from the LSF scheduler using bsub directives. 2) Obtain free ports on the desired nodes to start ray services like the dashboard, GCS, etc. 3) Start the ray head node on one of the available nodes. 4) Connect all the worker nodes to the head node. 5) Perform port forwarding to access the ray dashboard. Steps 1-4 have been automated and can be easily run as a script; refer to the GitHub repo below to access the script and run sample workloads: - `ray_LSF`_ Ray with LSF. Users can start up a Ray cluster on LSF and run DL workloads through it in either batch or interactive mode. .. _`ray_LSF`: https://github.com/IBMSpectrumComputing/ray-integration --- :orphan: .. _slurm-basic: slurm-basic.sh ~~~~~~~~~~~~~~ .. literalinclude:: /cluster/doc_code/slurm-basic.sh :language: bash --- :orphan: .. _slurm-launch: slurm-launch.py ~~~~~~~~~~~~~~~ .. literalinclude:: /cluster/doc_code/slurm-launch.py --- :orphan: .. _slurm-template: slurm-template.sh ~~~~~~~~~~~~~~~~~ .. literalinclude:: /cluster/doc_code/slurm-template.sh :language: bash --- .. _ray-slurm-deploy: Deploying on Slurm ================== Slurm usage with Ray can be a little bit unintuitive. * SLURM requires that multiple copies of the same program be submitted to the same cluster to do cluster programming. This is particularly well-suited for MPI-based workloads. * Ray, on the other hand, expects a head-worker architecture with a single point of entry. That is, you'll need to start a Ray head node, multiple Ray worker nodes, and run your Ray script on the head node. To bridge this gap, Ray 2.49 and above introduces the ``ray symmetric-run`` command, which starts a Ray cluster on all nodes with the given CPU and GPU resources and runs your entrypoint script ONLY on the head node. Below, we provide a walkthrough using ``ray symmetric-run`` to run Ray on SLURM. .. contents:: :local: Walkthrough using Ray with SLURM -------------------------------- Many SLURM deployments require you to interact with slurm via ``sbatch``, which executes a batch script on SLURM. To run a Ray job with ``sbatch``, you will want to start a Ray cluster in the sbatch job with multiple ``srun`` commands (tasks), and then execute your python script that uses Ray. Each task will run on a separate node and start/connect to a Ray runtime. The walkthrough below does the following: 1. Set the proper headers for the ``sbatch`` script. 2. Load the proper environment/modules. 3. Fetch a list of available computing nodes and their IP addresses. 4. Launch a head ray process on one of the nodes (called the head node). 5. Launch Ray processes on the (n-1) worker nodes and connect them to the head node by providing the head node address. 6. After the underlying ray cluster is ready, submit the user-specified task. See :ref:`slurm-basic.sh ` for an end-to-end example. .. _ray-slurm-headers: sbatch directives ~~~~~~~~~~~~~~~~~ In your sbatch script, you'll want to add `directives to provide context `__ for your job to SLURM. .. code-block:: bash #!/bin/bash #SBATCH --job-name=my-workload You'll need to tell SLURM to allocate nodes specifically for Ray. Ray will then find and manage all resources on each node. .. code-block:: bash ### Modify this according to your Ray workload. #SBATCH --nodes=4 #SBATCH --exclusive Important: To ensure that each Ray worker runtime will run on a separate node, set ``tasks-per-node``. ..
code-block:: bash #SBATCH --tasks-per-node=1 Since we've set `tasks-per-node = 1`, this will be used to guarantee that each Ray worker runtime will obtain the proper resources. In this example, we ask for at least 5 CPUs and 5 GB of memory per node. .. code-block:: bash ### Modify this according to your Ray workload. #SBATCH --cpus-per-task=5 #SBATCH --mem-per-cpu=1GB ### Similarly, you can also specify the number of GPUs per node. ### Modify this according to your Ray workload. Sometimes this ### should be 'gres' instead. #SBATCH --gpus-per-task=1 You can also add other optional flags to your sbatch directives. Loading your environment ~~~~~~~~~~~~~~~~~~~~~~~~ First, you'll often want to load modules or your own conda environment at the beginning of the script. Note that this is an optional step, but it is often required for enabling the right set of dependencies. .. code-block:: bash # Example: module load pytorch/v1.4.0-gpu # Example: conda activate my-env conda activate my-env Obtain the head IP address ~~~~~~~~~~~~~~~~~~~~~~~~~~ Next, we'll want to obtain a hostname and a node IP address for the head node. This way, when we start worker nodes, we'll be able to properly connect to the right head node. .. literalinclude:: /cluster/doc_code/slurm-basic.sh :language: bash :start-after: __doc_head_address_start__ :end-before: __doc_head_address_end__ .. note:: In Ray 2.49 and above, you can use IPv6 addresses/hostnames. Starting Ray and executing your script ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. note:: `ray symmetric-run` is available in Ray 2.49 and above. Check older versions of the documentation if you are using an older version of Ray. Now, we'll use `ray symmetric-run` to start Ray on all nodes with the given CPU and GPU resources and run your entrypoint script ONLY on the head node. Below, you'll see that we explicitly specify the number of CPUs (``num-cpus``) and number of GPUs (``num-gpus``) to Ray, as this will prevent Ray from using more resources than allocated. We also need to explicitly indicate the ``address`` parameter, which the head node uses to identify itself and the other nodes use to connect: .. literalinclude:: /cluster/doc_code/slurm-basic.sh :language: bash :start-after: __doc_symmetric_run_start__ :end-before: __doc_symmetric_run_end__ After the training job is completed, the Ray cluster will be stopped automatically. .. note:: The -u argument tells python to print to stdout unbuffered, which is important with how slurm deals with rerouting output. If this argument is not included, you may get strange printing behavior such as printed statements not being logged by slurm until the program has terminated. .. _slurm-network-ray: SLURM networking caveats ~~~~~~~~~~~~~~~~~~~~~~~~ There are two important networking aspects to keep in mind when working with SLURM and Ray: 1. Port binding. 2. IP binding. One common use of a SLURM cluster is to have multiple users running concurrent jobs on the same infrastructure. This can easily conflict with Ray due to the way the head node communicates with its workers. Consider 2 users: if they both schedule a SLURM job using Ray at the same time, each of them creates a head node. In the backend, Ray will assign some internal ports to a few services. The issue is that as soon as the first head node is created, it will bind some ports and prevent them from being used by another head node. To prevent any conflicts, users have to manually specify non-overlapping ranges of ports. The following ports need to be adjusted.
For an explanation on ports, see :ref:`here `:: # used for all ports --node-manager-port --object-manager-port --min-worker-port --max-worker-port # used for the head node --port --ray-client-server-port --redis-shard-ports For instance, again with 2 users, they would run the following commands. Note that we don't use symmetric-run here because it does not currently work in multi-tenant environments: .. code-block:: bash # user 1 ... srun --nodes=1 --ntasks=1 -w "$head_node" \ ray start --head --node-ip-address="$head_node_ip" \ --port=6379 \ --node-manager-port=6700 \ --object-manager-port=6701 \ --ray-client-server-port=10001 \ --redis-shard-ports=6702 \ --min-worker-port=10002 \ --max-worker-port=19999 \ --num-cpus "${SLURM_CPUS_PER_TASK}" --num-gpus "${SLURM_GPUS_PER_TASK}" --block & python -u your_script.py # user 2 ... srun --nodes=1 --ntasks=1 -w "$head_node" \ ray start --head --node-ip-address="$head_node_ip" \ --port=6380 \ --node-manager-port=6800 \ --object-manager-port=6801 \ --ray-client-server-port=20001 \ --redis-shard-ports=6802 \ --min-worker-port=20002 \ --max-worker-port=29999 \ --num-cpus "${SLURM_CPUS_PER_TASK}" --num-gpus "${SLURM_GPUS_PER_TASK}" --block & python -u your_script.py As for IP binding, on some cluster architectures the network interfaces do not allow using external IPs between nodes. Instead, there are internal network interfaces (`eth0`, `eth1`, etc.). Currently, it's difficult to set an internal IP (see the open `issue `_). Python-interface SLURM scripts ------------------------------ [Contributed by @pengzhenghao] Below, we provide a helper utility (:ref:`slurm-launch.py `) to auto-generate SLURM scripts and launch them. ``slurm-launch.py`` uses an underlying template (:ref:`slurm-template.sh `) and fills out placeholders given user input. Feel free to copy both files into your cluster for use, and to open PRs with contributions to improve this script! Usage example ~~~~~~~~~~~~~ If you want to utilize a multi-node cluster in slurm: .. code-block:: bash python slurm-launch.py --exp-name test --command "python your_file.py" --num-nodes 3 If you want to specify the computing node(s), just use the same node name(s) in the same format as the output of the ``sinfo`` command: .. code-block:: bash python slurm-launch.py --exp-name test --command "python your_file.py" --num-nodes 3 --node NODE_NAMES There are other options you can use when calling ``python slurm-launch.py``: * ``--exp-name``: The experiment name. Will generate ``{exp-name}_{date}-{time}.sh`` and ``{exp-name}_{date}-{time}.log``. * ``--command``: The command you wish to run. For example: ``rllib train XXX`` or ``python XXX.py``. * ``--num-gpus``: The number of GPUs you wish to use in each computing node. Default: 0. * ``--node`` (``-w``): The specific nodes you wish to use, in the same form as the output of ``sinfo``. Nodes are automatically assigned if not specified. * ``--num-nodes`` (``-n``): The number of nodes you wish to use. Default: 1. * ``--partition`` (``-p``): The partition you wish to use. Default: "", which uses the user's default partition. * ``--load-env``: The command to set up your environment. For example: ``module load cuda/10.1``. Default: "". Note that :ref:`slurm-template.sh ` is compatible with both IPv4 and IPv6 IP addresses of the computing nodes. Implementation ~~~~~~~~~~~~~~ Concretely, the :ref:`slurm-launch.py ` script does the following things: 1. It automatically writes your requirements, e.g.
number of CPUs, GPUs per node, the number of nodes and so on, to an sbatch script named ``{exp-name}_{date}-{time}.sh``. Your command (``--command``) to launch your own job is also written into the sbatch script. 2. Then it submits the sbatch script to the slurm manager via a new process. 3. Finally, the python process terminates itself and leaves a log file named ``{exp-name}_{date}-{time}.log`` to record the progress of your submitted command. In the meantime, the ray cluster and your job are running in the slurm cluster. Examples and templates ---------------------- Here are some community-contributed templates for using SLURM with Ray: - `Ray sbatch submission scripts`_ used at `NERSC `_, a US national lab. - `YASPI`_ (yet another slurm python interface) by @albanie. The goal of yaspi is to provide an interface to submitting slurm jobs, thereby obviating the joys of sbatch files. It does so through recipes: collections of templates and rules for generating sbatch scripts. Supports job submissions for Ray. - `Convenient python interface`_ to launch a ray cluster and submit tasks, by @pengzhenghao .. _`Ray sbatch submission scripts`: https://github.com/NERSC/slurm-ray-cluster .. _`YASPI`: https://github.com/albanie/yaspi .. _`Convenient python interface`: https://github.com/pengzhenghao/use-ray-with-slurm --- .. _ray-Spark-deploy: Deploying on Spark Standalone cluster ===================================== This document describes a couple of high-level steps to run Ray clusters on a `Spark Standalone cluster `_. Running a basic example ----------------------- The following is example spark application code that starts a Ray cluster on spark, executes ray application code, and then shuts down the Ray cluster it started. 1) Create a python file that contains the spark application code, assuming the python file name is 'ray-on-spark-example1.py'. .. code-block:: python import ray from pyspark.sql import SparkSession from ray.util.spark import setup_ray_cluster, shutdown_ray_cluster, MAX_NUM_WORKER_NODES if __name__ == "__main__": spark = SparkSession \ .builder \ .appName("Ray on spark example 1") \ .config("spark.task.cpus", "4") \ .getOrCreate() # Set up a ray cluster on this spark application. It creates a background # spark job in which each spark task launches one ray worker node. # The ray head node is launched on the spark application driver side. # Resources (CPU / GPU / memory) allocated to each ray worker node are equal # to the resources allocated to the corresponding spark task. setup_ray_cluster(max_worker_nodes=MAX_NUM_WORKER_NODES) # You can add any ray application code here; the ray application will be executed # on the ray cluster set up above. # You don't need to set the address for `ray.init`, # it will connect to the cluster created above automatically. ray.init() ... # Terminate the ray cluster explicitly. # If you don't call it, the ray cluster will also be terminated # when the spark application terminates. shutdown_ray_cluster() 2) Submit the spark application above to the spark standalone cluster. .. code-block:: bash #!/bin/bash spark-submit \ --master spark://{spark_master_IP}:{spark_master_port} \ path/to/ray-on-spark-example1.py Creating a long running ray cluster on spark cluster ---------------------------------------------------- The following is example spark application code that starts a long running Ray cluster on spark. The created ray cluster can be accessed by remote python processes, as sketched below.
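Once the long running cluster created by the following steps is up and has printed its ``ray://`` address, a separate Python process can connect to it through Ray Client. The snippet below is a minimal sketch, not part of the original walkthrough: the host and port are placeholders that must be replaced with the address the spark application actually prints, and that address must be reachable from the machine running the snippet.

.. code-block:: python

    import ray

    # Placeholder address: use the "ray://..." address printed by the
    # long running spark application created in the steps below.
    ray.init(address="ray://<spark-driver-host>:10001")

    @ray.remote
    def ping():
        return "pong"

    # This task runs on the Ray-on-Spark cluster, not locally.
    print(ray.get(ping.remote()))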
1) Create a python file that contains the spark application code, assuming the python file name is 'long-running-ray-cluster-on-spark.py'. .. code-block:: python from pyspark.sql import SparkSession import time from ray.util.spark import setup_ray_cluster, MAX_NUM_WORKER_NODES if __name__ == "__main__": spark = SparkSession \ .builder \ .appName("long running ray cluster on spark") \ .config("spark.task.cpus", "4") \ .getOrCreate() cluster_address = setup_ray_cluster( max_worker_nodes=MAX_NUM_WORKER_NODES ) print("Ray cluster is set up, you can connect to this ray cluster " f"via address ray://{cluster_address}") # Sleep forever until the spark application is terminated; # at that time, the ray cluster will also be terminated. while True: time.sleep(10) 2) Submit the spark application above to the spark standalone cluster. .. code-block:: bash #!/bin/bash spark-submit \ --master spark://{spark_master_IP}:{spark_master_port} \ path/to/long-running-ray-cluster-on-spark.py Ray on Spark APIs ----------------- .. autofunction:: ray.util.spark.setup_ray_cluster .. autofunction:: ray.util.spark.shutdown_ray_cluster .. autofunction:: ray.util.spark.setup_global_ray_cluster --- .. _ray-yarn-deploy: Deploying on YARN ================= .. warning:: Running Ray on YARN is still a work in progress. If you have a suggestion for how to improve this documentation or want to request a missing feature, please feel free to create a pull request or get in touch using one of the channels in the `Questions or Issues?`_ section below. This document assumes that you have access to a YARN cluster and will walk you through using `Skein`_ to deploy a YARN job that starts a Ray cluster and runs an example script on it. Skein uses a declarative specification (either written as a yaml file or using the Python API) and allows users to launch jobs and scale applications without the need to write Java code. You will first need to install Skein: ``pip install skein``. The Skein ``yaml`` file and example Ray program used here are provided in the `Ray repository`_ to get you started. Refer to the provided ``yaml`` files to be sure that you maintain important configuration options for Ray to function properly. .. _`Ray repository`: https://github.com/ray-project/ray/tree/master/doc/yarn Skein Configuration ------------------- A Ray job is configured to run as two `Skein services`: 1. The ``ray-head`` service that starts the Ray head node and then runs the application. 2. The ``ray-worker`` service that starts worker nodes that join the Ray cluster. You can change the number of instances in this configuration or at runtime using ``skein container scale`` to scale the cluster up/down. The specification for each service consists of the necessary files and commands that will be run to start the service. .. code-block:: yaml services: ray-head: # There should only be one instance of the head node per cluster. instances: 1 resources: # The resources for the head node. vcores: 1 memory: 2048 files: ... script: ... ray-worker: # Number of ray worker nodes to start initially. # This can be scaled using 'skein container scale'. instances: 3 resources: # The resources for the worker node. vcores: 1 memory: 2048 files: ... script: ... Packaging Dependencies ---------------------- Use the ``files`` option to specify files that will be copied into the YARN container for the application to use. See `the Skein file distribution page `_ for more information. ..
code-block:: yaml services: ray-head: # There should only be one instance of the head node per cluster. instances: 1 resources: # The resources for the head node. vcores: 1 memory: 2048 files: # ray/doc/yarn/example.py example.py: example.py # ray/doc/yarn/dashboard.py dashboard.py: dashboard.py # # A packaged python environment using `conda-pack`. Note that Skein # # doesn't require any specific way of distributing files, but this # # is a good one for python projects. This is optional. # # See https://jcrist.github.io/skein/distributing-files.html # environment: environment.tar.gz Ray Setup in YARN ----------------- Below is a walkthrough of the bash commands used to start the ``ray-head`` and ``ray-worker`` services. Note that this configuration will launch a new Ray cluster for each application, not reuse the same cluster. Head node commands ~~~~~~~~~~~~~~~~~~ Start by activating a pre-existing environment for dependency management. .. code-block:: bash source environment/bin/activate Register the Ray head address needed by the workers in the Skein key-value store. .. code-block:: bash skein kv put --key=RAY_HEAD_ADDRESS --value=$(hostname -i) current Start all the processes needed on the ray head node. By default, we set object store memory and heap memory to roughly 200 MB. This is conservative and should be set according to application needs. .. code-block:: bash ray start --head --port=6379 --object-store-memory=200000000 --memory 200000000 --num-cpus=1 Register the ray dashboard to Skein. This exposes the dashboard link on the Skein application page. .. code-block:: bash python dashboard.py "http://$(hostname -i):8265" Execute the user script containing the Ray program. .. code-block:: bash python example.py Clean up all started processes even if the application fails or is killed. .. code-block:: bash ray stop skein application shutdown current Putting things together, we have: .. literalinclude:: /cluster/doc_code/yarn/ray-skein.yaml :language: yaml :start-after: # Head service :end-before: # Worker service Worker node commands ~~~~~~~~~~~~~~~~~~~~ Fetch the address of the head node from the Skein key-value store. .. code-block:: bash RAY_HEAD_ADDRESS=$(skein kv get current --key=RAY_HEAD_ADDRESS) Start all of the processes needed on a ray worker node, blocking until killed by Skein/YARN via SIGTERM. After receiving SIGTERM, all started processes should also die (ray stop). .. code-block:: bash ray start --object-store-memory=200000000 --memory 200000000 --num-cpus=1 --address=$RAY_HEAD_ADDRESS:6379 --block; ray stop Putting things together, we have: .. literalinclude:: /cluster/doc_code/yarn/ray-skein.yaml :language: yaml :start-after: # Worker service Running a Job ------------- Within your Ray script, use the following to connect to the started Ray cluster: .. literalinclude:: /cluster/doc_code/yarn/example.py :language: python :start-after: if __name__ == "__main__" You can use the following command to launch the application as specified by the Skein YAML file. .. code-block:: bash skein application submit [TEST.YAML] Once it has been submitted, you can see the job running on the YARN dashboard. .. image:: /cluster/images/yarn-job.png If you have registered the Ray dashboard address in the Skein as shown above, you can retrieve it on Skein's application page: .. image:: /cluster/images/yarn-job-dashboard.png Cleaning Up ----------- To clean up a running job, use the following (using the application ID): .. 
code-block:: bash skein application shutdown $appid Questions or Issues? -------------------- .. include:: /_includes/_help.rst .. _`Skein`: https://jcrist.github.io/skein/ --- .. _vms-autoscaling: Configuring Autoscaling ======================= This guide explains how to configure the Ray autoscaler using the Ray cluster launcher. The Ray autoscaler is a Ray cluster process that automatically scales a cluster up and down based on resource demand. The autoscaler does this by adjusting the number of nodes in the cluster based on the resources required by tasks, actors or placement groups. Note that the autoscaler only considers logical resource requests for scaling (i.e., those specified in ``@ray.remote`` and displayed in `ray status`), not physical machine utilization. If a user tries to launch an actor, task, or placement group but there are insufficient resources, the request will be queued. The autoscaler adds nodes to satisfy resource demands in this queue. The autoscaler also removes nodes after they become idle for some time. A node is considered idle if it has no active tasks, actors, or objects. .. tip:: **When to use Autoscaling?** Autoscaling can reduce workload costs, but adds node launch overheads and can be tricky to configure. We recommend starting with non-autoscaling clusters if you're new to Ray. Cluster Config Parameters ------------------------- The following options are available in your cluster config file. It is recommended that you set these before launching your cluster, but you can also modify them at run-time by updating the cluster config. `max_workers[default_value=2, min_value=0]`: The max number of cluster worker nodes to launch. Note that this does not include the head node. `min_workers[default_value=0, min_value=0]`: The min number of cluster worker nodes to launch, regardless of utilization. Note that this does not include the head node. This number must be less than the ``max_workers``. .. note:: If `max_workers` is modified at runtime, the autoscaler will immediately remove nodes until this constraint is satisfied. This may disrupt running workloads. If you are using more than one node type, you can also set min and max workers for each individual type: `available_node_types..max_workers[default_value=cluster max_workers, min_value=0]`: The maximum number of worker nodes of a given type to launch. This number must be less than or equal to the `max_workers` for the cluster. `available_node_types..min_workers[default_value=0, min_value=0]`: The minimum number of worker nodes of a given type to launch, regardless of utilization. The sum of `min_workers` across all node types must be less than or equal to the `max_workers` for the cluster. Upscaling and downscaling speed ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ If needed, you can also control the rate at which nodes should be added to or removed from the cluster. For applications with many short-lived tasks, you may wish to adjust the upscaling and downscaling speed to be more conservative. `upscaling_speed[default_value=1.0, min_value=1.0]`: The number of nodes allowed to be pending as a multiple of the current number of nodes. The higher the value, the more aggressive upscaling will be. For example, if this is set to 1.0, the cluster can grow in size by at most 100% at any time, so if the cluster currently has 20 nodes, at most 20 pending launches are allowed. The minimum number of pending launches is 5 regardless of this setting. 
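To make the parameters described above concrete, the following is a minimal sketch of how they might appear together in a cluster config file. The node type name and all values are illustrative placeholders, not recommendations:

.. code-block:: yaml

    # Illustrative values only; tune them for your workload.
    max_workers: 10        # cluster-wide cap on worker nodes (excludes the head node)
    upscaling_speed: 1.0   # pending launches allowed as a multiple of current nodes

    available_node_types:
      ray.worker.default:  # hypothetical node type name
        min_workers: 1     # keep at least one worker of this type
        max_workers: 10    # must not exceed the cluster-wide max_workers
        resources: {}
        node_config: {}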
`idle_timeout_minutes[default_value=5, min_value=0]`: The number of minutes that need to pass before an idle worker node is removed by the autoscaler. The smaller the value, the more aggressive downscaling will be. Worker nodes are considered idle when they hold no active tasks, actors, or referenced objects (either in-memory or spilled to disk). This parameter does not affect the head node. Programmatic Scaling -------------------- For more information on programmatic access to the autoscaler, see the :ref:`Programmatic Cluster Scaling Guide `. --- .. _vms-large-cluster: Best practices for deploying large clusters ------------------------------------------- This section aims to document best practices for deploying Ray clusters at large scale. Networking configuration ^^^^^^^^^^^^^^^^^^^^^^^^ End users should only need to directly interact with the head node of the cluster. In particular, there are 2 services which should be exposed to users: 1. The dashboard 2. The Ray client server .. note:: While users only need 2 ports to connect to a cluster, the nodes within a cluster require a much wider range of ports to communicate. See :ref:`Ray port configuration ` for a comprehensive list. Applications (such as :ref:`Ray Serve `) may also require additional ports to work properly. System configuration ^^^^^^^^^^^^^^^^^^^^ There are a few system level configurations that should be set when using Ray at a large scale. * Make sure ``ulimit -n`` is set to at least 65535. Ray opens many direct connections between worker processes to avoid bottlenecks, so it can quickly use a large number of file descriptors. * Make sure ``/dev/shm`` is sufficiently large. Most ML/RL applications rely heavily on the plasma store. By default, Ray will try to use ``/dev/shm`` for the object store, but if it is not large enough (i.e. ``--object-store-memory`` > size of ``/dev/shm``), Ray will write the plasma store to disk instead, which may cause significant performance problems. * Use NVMe SSDs (or other high performance storage) if possible. If :ref:`object spilling ` is enabled, Ray will spill objects to disk if necessary. This is most commonly needed for data processing workloads. .. _vms-large-cluster-configure-head-node: Configuring the head node ^^^^^^^^^^^^^^^^^^^^^^^^^ In addition to the above changes, when deploying a large cluster, Ray's architecture means that the head node has extra stress due to additional system processes running on it like GCS. * A good starting hardware specification for the head node is 8 CPUs and 32 GB memory. The actual hardware specification depends on the workload and the size of the cluster. Metrics that are useful for deciding the hardware specification are CPU usage, memory usage, and network bandwidth usage. * Make sure the head node has sufficient bandwidth. The most heavily stressed resource on the head node is outbound bandwidth. For large clusters (see the scalability envelope), we recommend using machines with networking characteristics at least as good as an r5dn.16xlarge on AWS EC2. * Set ``resources: {"CPU": 0}`` on the head node, as shown in the sketch after this list. (For Ray clusters deployed using KubeRay, set ``rayStartParams: {"num-cpus": "0"}``. See the :ref:`configuration guide for KubeRay clusters `.) Due to the heavy networking load (and the GCS and dashboard processes), we recommend setting the quantity of logical CPU resources to 0 on the head node to avoid scheduling additional tasks on it.
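As a sketch of the last recommendation above, a cluster-launcher config might mark the head node type as follows. The node type name and instance type are placeholders; the ``resources: {"CPU": 0}`` setting is the point being illustrated:

.. code-block:: yaml

    # Illustrative sketch; adapt names and instance types to your setup.
    head_node_type: ray.head.default
    available_node_types:
      ray.head.default:
        resources: {"CPU": 0}  # keep Ray from scheduling tasks on the head node
        node_config:
          InstanceType: r5dn.16xlarge  # network-optimized instance, as recommended above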
Configuring the autoscaler ^^^^^^^^^^^^^^^^^^^^^^^^^^ For large, long running clusters, there are a few parameters that can be tuned. * Ensure your quotas for node types are set correctly. * For long running clusters, set the ``AUTOSCALER_MAX_NUM_FAILURES`` environment variable to a large number (or ``inf``) to avoid unexpected autoscaler crashes. The variable can be set by prepending \ ``export AUTOSCALER_MAX_NUM_FAILURES=inf;`` to the head node's Ray start command. (Note: you may want a separate mechanism to detect if the autoscaler errors too often). * For large clusters, consider tuning ``upscaling_speed`` for faster autoscaling. Picking nodes ^^^^^^^^^^^^^ Here are some tips for how to set your ``available_node_types`` for a cluster, using AWS instance types as a concrete example. General recommendations with AWS instance types: **When to use GPUs** * You're using an RL/ML framework * You're doing something with tensorflow/pytorch/jax (some framework that can leverage GPUs well) **What type of GPU?** * The latest gen GPU is almost always the best bang for your buck (p3 > p2, g4 > g3); for most well-designed applications the performance outweighs the price. (The instance price may be higher, but you use the instance for less time.) * You may want to consider using older instances if you're doing dev work and won't actually fully utilize the GPUs though. * If you're doing training (ML or RL), you should use a P instance. If you're doing inference, you should use a G instance. The difference is the processing:VRAM ratio (training requires more memory). **What type of CPU?** * Again, stick to the latest generation; they're typically cheaper and faster. * When in doubt, use M instances; they typically have the highest availability. * If you know your application is memory intensive (memory utilization is full, but CPU is not), go with an R instance. * If you know your application is CPU intensive, go with a C instance. * If you have a big cluster, make the head node an instance with an n (r5dn or c5n). **How many CPUs/GPUs?** * Focus on your CPU:GPU ratio first and look at the utilization (the Ray dashboard should help with this). If your CPU utilization is low, add GPUs, or vice versa. * The exact ratio will be very dependent on your workload. * Once you find a good ratio, you should be able to scale up and keep the same ratio. * You can't scale forever. Eventually, as you add more machines, your performance improvements will become sub-linear/not worth it. There may not be a good one-size-fits-all strategy at this point. .. note:: If you're using RLlib, check out :ref:`the RLlib scaling guide ` for RLlib-specific recommendations. --- .. _launching-vm-clusters: Launching Ray Clusters on AWS, GCP, Azure, vSphere, On-Prem =========================================================== In this section, you can find guides for launching Ray clusters in various clouds or on-premises. Table of Contents ----------------- .. toctree:: :maxdepth: 2 aws.md gcp.md azure.md vsphere.md on-premises.md --- .. _aggregations: Aggregating Data ================ Ray Data provides a flexible and performant API for performing aggregations on a :class:`~ray.data.dataset.Dataset`. Basic Aggregations ------------------ Ray Data provides several built-in aggregation functions like :class:`~ray.data.Dataset.max`, :class:`~ray.data.Dataset.min`, and :class:`~ray.data.Dataset.sum`. These can be used directly on a Dataset or a GroupedData object, as shown below: ..
testcode:: import ray # Create a sample dataset ds = ray.data.range(100) ds = ds.add_column("group_key", lambda x: x["id"] % 3) # Schema: {'id': int64, 'group_key': int64} # Find the max result = ds.max("id") # result: 99 # Find the minimum value per group result = ds.groupby("group_key").min("id") # result: [{'group_key': 0, 'min(id)': 0}, {'group_key': 1, 'min(id)': 1}, {'group_key': 2, 'min(id)': 2}] The full list of built-in aggregation functions is available in the :ref:`Dataset API reference `. Each of the preceding methods also has a corresponding :ref:`AggregateFnV2 ` object. These objects can be used in :meth:`~ray.data.Dataset.aggregate()` or :meth:`Dataset.groupby().aggregate() `. Aggregation objects can be used directly with a Dataset as shown below: .. testcode:: import ray from ray.data.aggregate import Count, Mean, Quantile # Create a sample dataset ds = ray.data.range(100) ds = ds.add_column("group_key", lambda x: x["id"] % 3) # Count all rows result = ds.aggregate(Count()) # result: {'count()': 100} # Calculate mean per group result = ds.groupby("group_key").aggregate(Mean(on="id")).take_all() # result: [{'group_key': 0, 'mean(id)': ...}, # {'group_key': 1, 'mean(id)': ...}, # {'group_key': 2, 'mean(id)': ...}] # Calculate 75th percentile result = ds.aggregate(Quantile(on="id", q=0.75)) # result: {'quantile(id)': 75.0} Multiple aggregations can also be computed at once: .. testcode:: import ray from ray.data.aggregate import Count, Mean, Min, Max, Std ds = ray.data.range(100) ds = ds.add_column("group_key", lambda x: x["id"] % 3) # Compute multiple aggregations at once result = ds.groupby("group_key").aggregate( Count(on="id"), Mean(on="id"), Min(on="id"), Max(on="id"), Std(on="id") ).take_all() # result: [{'group_key': 0, 'count(id)': 34, 'mean(id)': ..., 'min(id)': ..., 'max(id)': ..., 'std(id)': ...}, # {'group_key': 1, 'count(id)': 33, 'mean(id)': ..., 'min(id)': ..., 'max(id)': ..., 'std(id)': ...}, # {'group_key': 2, 'count(id)': 33, 'mean(id)': ..., 'min(id)': ..., 'max(id)': ..., 'std(id)': ...}] Custom Aggregations -------------------- You can create custom aggregations by implementing the :class:`~ray.data.aggregate.AggregateFnV2` interface. The AggregateFnV2 interface has three key methods to implement: 1. `aggregate_block`: Processes a single block of data and returns a partial aggregation result 2. `combine`: Merges two partial aggregation results into a single result 3. `finalize`: Transforms the final accumulated result into the desired output format The aggregation process follows these steps: 1. **Initialization**: For each group (if grouping) or for the entire dataset, an initial accumulator is created using `zero_factory` 2. **Block Aggregation**: The `aggregate_block` method is applied to each block independently 3. **Combination**: The `combine` method merges partial results into a single accumulator 4. **Finalization**: The `finalize` method transforms the final accumulator into the desired output Example: Creating a Custom Mean Aggregator ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Here's an example of creating a custom aggregator that calculates the Mean of values in a column: ..
testcode:: import numpy as np from ray.data.aggregate import AggregateFnV2 from ray.data._internal.util import is_null from ray.data.block import Block, BlockAccessor, AggType, U import pyarrow.compute as pc from typing import List, Optional class Mean(AggregateFnV2): """Defines mean aggregation.""" def __init__( self, on: Optional[str] = None, ignore_nulls: bool = True, alias_name: Optional[str] = None, ): super().__init__( alias_name if alias_name else f"mean({str(on)})", on=on, ignore_nulls=ignore_nulls, # NOTE: We've to copy returned list here, as some # aggregations might be modifying elements in-place zero_factory=lambda: list([0, 0]), # noqa: C410 ) def aggregate_block(self, block: Block) -> AggType: block_acc = BlockAccessor.for_block(block) count = block_acc.count(self._target_col_name, self._ignore_nulls) if count == 0 or count is None: # Empty or all null. return None sum_ = block_acc.sum(self._target_col_name, self._ignore_nulls) if is_null(sum_): # In case of ignore_nulls=False and column containing 'null' # return as is (to prevent unnecessary type conversions, when, for ex, # using Pandas and returning None) return sum_ return [sum_, count] def combine(self, current_accumulator: AggType, new: AggType) -> AggType: return [current_accumulator[0] + new[0], current_accumulator[1] + new[1]] def finalize(self, accumulator: AggType) -> Optional[U]: if accumulator[1] == 0: return np.nan return accumulator[0] / accumulator[1] .. note:: Internally, aggregations support both the :ref:`hash-shuffle backend ` and the :ref:`range based backend `. Hash-shuffling can provide better performance for aggregations in certain cases. For more information see `comparison between hash based shuffling and Range Based shuffling approach `_ . To use the hash-shuffle algorithm for aggregations, you need to set the shuffle strategy explicitly: ``ray.data.DataContext.get_current().shuffle_strategy = ShuffleStrategy.HASH_SHUFFLE`` before creating a ``Dataset`` --- :orphan: .. # This file is only used to auto-generate API docs. .. # It should not be included in the toctree. .. # .. # For any classes that you want to include in the .. # API docs, add them to the list of autosummary .. # below, then include the generated ray.data..rst .. # file in your top level rst file. .. currentmodule:: ray.data .. autosummary:: :nosignatures: :template: autosummary/class_v2.rst :toctree: DataIterator Dataset Schema stats.DatasetSummary grouped_data.GroupedData aggregate.AggregateFn aggregate.AggregateFnV2 --- .. _aggregations_api_ref: Aggregation API =============== Pass :class:`AggregateFnV2 ` objects to :meth:`Dataset.aggregate() ` or :meth:`Dataset.groupby().aggregate() ` to compute aggregations. .. currentmodule:: ray.data.aggregate .. autosummary:: :nosignatures: :toctree: doc/ AggregateFnV2 AggregateFn Count Sum Min Max Mean Std AbsMax Quantile Unique ValueCounter MissingValuePercentage ZeroPercentage ApproximateQuantile ApproximateTopK --- .. _data-api: Ray Data API ================ .. toctree:: :maxdepth: 2 input_output.rst dataset.rst data_iterator.rst execution_options.rst aggregate.rst grouped_data.rst expressions.rst datatype.rst data_context.rst preprocessor.rst llm.rst from_other_data_libs.rst --- .. _data-context-api: Global configuration ==================== .. currentmodule:: ray.data.context .. autoclass:: DataContext .. autosummary:: :nosignatures: :toctree: doc/ DataContext.get_current .. autoclass:: AutoscalingConfig --- .. _dataset-iterator-api: DataIterator API ================ .. 
include:: ray.data.DataIterator.rst --- .. _dataset-api: Dataset API ============== .. include:: ray.data.Dataset.rst Compute Strategy API -------------------- .. currentmodule:: ray.data .. autosummary:: :nosignatures: :toctree: doc/ ActorPoolStrategy TaskPoolStrategy Schema ------ .. currentmodule:: ray.data .. autoclass:: Schema :members: DatasetSummary -------------- .. currentmodule:: ray.data.stats .. autoclass:: DatasetSummary :members: Developer API ------------- .. currentmodule:: ray.data .. autosummary:: :nosignatures: :toctree: doc/ Dataset.to_pandas_refs Dataset.to_numpy_refs Dataset.to_arrow_refs Dataset.iter_internal_ref_bundles block.Block block.BlockExecStats block.BlockMetadata block.BlockAccessor Deprecated API -------------- .. currentmodule:: ray.data .. autosummary:: :nosignatures: :toctree: doc/ Dataset.iter_tf_batches --- .. _datatype-api: Data types ========== .. currentmodule:: ray.data.datatype Class ----- .. autoclass:: DataType :members: Enumeration ----------- .. autoclass:: TypeCategory :members: --- .. _execution-options-api: ExecutionOptions API ==================== .. currentmodule:: ray.data Constructor ----------- .. autosummary:: :nosignatures: :toctree: doc/ :template: autosummary/class_without_autosummary.rst ExecutionOptions Resource Options ---------------- .. autosummary:: :nosignatures: :toctree: doc/ :template: autosummary/class_without_autosummary.rst ExecutionResources --- .. _expressions-api: Expressions API ================ .. currentmodule:: ray.data.expressions Expressions provide a way to specify column-based operations on datasets. Use :func:`col` to reference columns and :func:`lit` to create literal values. You can combine these with operators to create complex expressions for filtering, transformations, and computations. Public API ---------- .. autosummary:: :nosignatures: :toctree: doc/ star col lit udf pyarrow_udf download Expression Classes ------------------ These classes represent the structure of expressions. You typically don't need to instantiate them directly, but you may encounter them when working with expressions. .. autosummary:: :nosignatures: :toctree: doc/ Expr ColumnExpr LiteralExpr BinaryExpr UnaryExpr UDFExpr StarExpr Expression namespaces ------------------------------------ These namespace classes provide specialized operations for list, string, and struct columns. You access them through properties on expressions: ``.list``, ``.str``, and ``.struct``. The following example shows how to use the string namespace to transform text columns: .. testcode:: import ray from ray.data.expressions import col # Create a dataset with a text column ds = ray.data.from_items([ {"name": "alice"}, {"name": "bob"}, {"name": "charlie"} ]) # Use the string namespace to uppercase the names ds = ds.with_column("upper_name", col("name").str.upper()) ds.show() .. testoutput:: {'name': 'alice', 'upper_name': 'ALICE'} {'name': 'bob', 'upper_name': 'BOB'} {'name': 'charlie', 'upper_name': 'CHARLIE'} The following example demonstrates using the list namespace to work with array columns: .. testcode:: import ray from ray.data.expressions import col # Create a dataset with list columns ds = ray.data.from_items([ {"scores": [85, 90, 78]}, {"scores": [92, 88]}, {"scores": [76, 82, 88, 91]} ]) # Use the list namespace to get the length of each list ds = ds.with_column("num_scores", col("scores").list.len()) ds.show() .. 
testoutput:: {'scores': [85, 90, 78], 'num_scores': 3} {'scores': [92, 88], 'num_scores': 2} {'scores': [76, 82, 88, 91], 'num_scores': 4} The following example shows how to use the struct namespace to access nested fields: .. testcode:: import ray from ray.data.expressions import col # Create a dataset with struct columns ds = ray.data.from_items([ {"user": {"name": "alice", "age": 25}}, {"user": {"name": "bob", "age": 30}}, {"user": {"name": "charlie", "age": 35}} ]) # Use the struct namespace to extract a specific field ds = ds.with_column("user_name", col("user").struct.field("name")) ds.show() .. testoutput:: {'user': {'name': 'alice', 'age': 25}, 'user_name': 'alice'} {'user': {'name': 'bob', 'age': 30}, 'user_name': 'bob'} {'user': {'name': 'charlie', 'age': 35}, 'user_name': 'charlie'} .. autoclass:: _ListNamespace :members: :exclude-members: _expr .. autoclass:: _StringNamespace :members: :exclude-members: _expr .. autoclass:: _StructNamespace :members: :exclude-members: _expr --- .. _api-guide-for-users-from-other-data-libs: API Guide for Users from Other Data Libraries ============================================= Ray Data is a data loading and preprocessing library for ML. It shares certain similarities with other ETL data processing libraries, but also has its own focus. This guide provides API mappings for users who come from those data libraries, so you can quickly map what you may already know to Ray Data APIs. .. note:: - This is meant to map APIs that perform comparable but not necessarily identical operations. Select the API reference for exact semantics and usage. - This list may not be exhaustive: It focuses on common APIs or APIs that are less obvious to see a connection. .. _api-guide-for-pandas-users: For Pandas Users ---------------- .. list-table:: Pandas DataFrame vs. Ray Data APIs :header-rows: 1 * - Pandas DataFrame API - Ray Data API * - df.head() - :meth:`ds.show() `, :meth:`ds.take() `, or :meth:`ds.take_batch() ` * - df.dtypes - :meth:`ds.schema() ` * - len(df) or df.shape[0] - :meth:`ds.count() ` * - df.truncate() - :meth:`ds.limit() ` * - df.iterrows() - :meth:`ds.iter_rows() ` * - df.drop() - :meth:`ds.drop_columns() ` * - df.transform() - :meth:`ds.map_batches() ` or :meth:`ds.map() ` * - df.groupby() - :meth:`ds.groupby() ` * - df.groupby().apply() - :meth:`ds.groupby().map_groups() ` * - df.sample() - :meth:`ds.random_sample() ` * - df.sort_values() - :meth:`ds.sort() ` * - df.append() - :meth:`ds.union() ` * - df.aggregate() - :meth:`ds.aggregate() ` * - df.min() - :meth:`ds.min() ` * - df.max() - :meth:`ds.max() ` * - df.sum() - :meth:`ds.sum() ` * - df.mean() - :meth:`ds.mean() ` * - df.std() - :meth:`ds.std() ` .. _api-guide-for-pyarrow-users: For PyArrow Users ----------------- .. list-table:: PyArrow Table vs. Ray Data APIs :header-rows: 1 * - PyArrow Table API - Ray Data API * - ``pa.Table.schema`` - :meth:`ds.schema() ` * - ``pa.Table.num_rows`` - :meth:`ds.count() ` * - ``pa.Table.filter()`` - :meth:`ds.filter() ` * - ``pa.Table.drop()`` - :meth:`ds.drop_columns() ` * - ``pa.Table.add_column()`` - :meth:`ds.with_column() ` * - ``pa.Table.groupby()`` - :meth:`ds.groupby() ` * - ``pa.Table.sort_by()`` - :meth:`ds.sort() ` For PyTorch Dataset & DataLoader Users -------------------------------------- For more details, see the :ref:`Migrating from PyTorch to Ray Data `. --- .. _grouped-dataset-api: GroupedData API =============== .. currentmodule:: ray.data The groupby call returns GroupedData objects: :meth:`Dataset.groupby() `. .. 
include:: ray.data.grouped_data.GroupedData.rst --- .. _input-output: Input/Output ============ .. currentmodule:: ray.data Synthetic Data -------------- .. autosummary:: :nosignatures: :toctree: doc/ range range_tensor Python Objects -------------- .. autosummary:: :nosignatures: :toctree: doc/ from_items Parquet ------- .. autosummary:: :nosignatures: :toctree: doc/ read_parquet Dataset.write_parquet CSV --- .. autosummary:: :nosignatures: :toctree: doc/ read_csv Dataset.write_csv JSON ---- .. autosummary:: :nosignatures: :toctree: doc/ read_json Dataset.write_json Text ---- .. autosummary:: :nosignatures: :toctree: doc/ read_text Audio ----- .. autosummary:: :nosignatures: :toctree: doc/ read_audio Avro ---- .. autosummary:: :nosignatures: :toctree: doc/ read_avro Images ------ .. autosummary:: :nosignatures: :toctree: doc/ read_images Dataset.write_images Binary ------ .. autosummary:: :nosignatures: :toctree: doc/ read_binary_files TFRecords --------- .. autosummary:: :nosignatures: :toctree: doc/ read_tfrecords Dataset.write_tfrecords TFXReadOptions Pandas ------ .. autosummary:: :nosignatures: :toctree: doc/ from_pandas from_pandas_refs Dataset.to_pandas Dataset.to_pandas_refs NumPy ----- .. autosummary:: :nosignatures: :toctree: doc/ read_numpy from_numpy from_numpy_refs Dataset.write_numpy Dataset.to_numpy_refs Arrow ----- .. autosummary:: :nosignatures: :toctree: doc/ from_arrow from_arrow_refs Dataset.to_arrow_refs MongoDB ------- .. autosummary:: :nosignatures: :toctree: doc/ read_mongo Dataset.write_mongo BigQuery -------- .. autosummary:: :toctree: doc/ read_bigquery Dataset.write_bigquery SQL Databases ------------- .. autosummary:: :nosignatures: :toctree: doc/ read_sql Dataset.write_sql Databricks ---------- .. autosummary:: :nosignatures: :toctree: doc/ read_databricks_tables Snowflake --------- .. autosummary:: :nosignatures: :toctree: doc/ read_snowflake Dataset.write_snowflake Unity Catalog ------------- .. autosummary:: :nosignatures: :toctree: doc/ read_unity_catalog Delta Sharing ------------- .. autosummary:: :nosignatures: :toctree: doc/ read_delta_sharing_tables Hudi ---- .. autosummary:: :nosignatures: :toctree: doc/ read_hudi Iceberg ------- .. autosummary:: :nosignatures: :toctree: doc/ read_iceberg Dataset.write_iceberg Delta Lake ---------- .. autosummary:: :nosignatures: :toctree: doc/ read_delta Lance ----- .. autosummary:: :nosignatures: :toctree: doc/ read_lance Dataset.write_lance MCAP (Message Capture) ---------------------- .. autosummary:: :nosignatures: :toctree: doc/ read_mcap ClickHouse ---------- .. autosummary:: :nosignatures: :toctree: doc/ read_clickhouse Dataset.write_clickhouse Daft ---- .. autosummary:: :nosignatures: :toctree: doc/ from_daft Dataset.to_daft Dask ---- .. autosummary:: :nosignatures: :toctree: doc/ from_dask Dataset.to_dask Spark ----- .. autosummary:: :nosignatures: :toctree: doc/ from_spark Dataset.to_spark Modin ----- .. autosummary:: :nosignatures: :toctree: doc/ from_modin Dataset.to_modin Mars ---- .. autosummary:: :nosignatures: :toctree: doc/ from_mars Dataset.to_mars Torch ----- .. autosummary:: :nosignatures: :toctree: doc/ from_torch Hugging Face ------------ .. autosummary:: :nosignatures: :toctree: doc/ from_huggingface TensorFlow ---------- .. autosummary:: :nosignatures: :toctree: doc/ from_tf Video ----- .. autosummary:: :nosignatures: :toctree: doc/ read_videos WebDataset ---------- .. autosummary:: :nosignatures: :toctree: doc/ read_webdataset .. _data_source_api: Kafka ----- .. 
autosummary:: :nosignatures: :toctree: doc/ read_kafka Datasource API -------------- .. autosummary:: :nosignatures: :toctree: doc/ read_datasource Datasource ReadTask datasource.FilenameProvider Datasink API ------------ .. autosummary:: :nosignatures: :toctree: doc/ Dataset.write_datasink Datasink datasource.RowBasedFileDatasink datasource.BlockBasedFileDatasink datasource.FileBasedDatasource datasource.WriteResult datasource.WriteReturnType Partitioning API ---------------- .. autosummary:: :nosignatures: :toctree: doc/ datasource.Partitioning datasource.PartitionStyle datasource.PathPartitionParser datasource.PathPartitionFilter .. _metadata_provider: MetadataProvider API -------------------- .. autosummary:: :nosignatures: :toctree: doc/ datasource.FileMetadataProvider datasource.BaseFileMetadataProvider datasource.DefaultFileMetadataProvider Shuffling API ------------- .. autosummary:: :nosignatures: :toctree: doc/ FileShuffleConfig --- .. _llm-ref: Large Language Model (LLM) API ============================== .. currentmodule:: ray.data.llm LLM processor builder --------------------- .. autosummary:: :nosignatures: :toctree: doc/ ~build_processor Processor --------- .. autosummary:: :nosignatures: :toctree: doc/ ~Processor Processor configs ----------------- .. autosummary:: :nosignatures: :template: autosummary/class_without_autosummary_noinheritance.rst :toctree: doc/ ~ProcessorConfig ~HttpRequestProcessorConfig ~vLLMEngineProcessorConfig ~SGLangEngineProcessorConfig --- .. _preprocessor-ref: Preprocessor ============ Preprocessor Interface ------------------------ .. currentmodule:: ray.data Constructor ~~~~~~~~~~~ .. autosummary:: :nosignatures: :toctree: doc/ ~preprocessor.Preprocessor Fit/Transform APIs ~~~~~~~~~~~~~~~~~~ .. autosummary:: :nosignatures: :toctree: doc/ ~preprocessor.Preprocessor.fit ~preprocessor.Preprocessor.fit_transform ~preprocessor.Preprocessor.transform ~preprocessor.Preprocessor.transform_batch ~preprocessor.PreprocessorNotFittedException Generic Preprocessors --------------------- .. autosummary:: :nosignatures: :toctree: doc/ ~preprocessors.Concatenator ~preprocessors.SimpleImputer ~preprocessors.Chain Categorical Encoders -------------------- .. autosummary:: :nosignatures: :toctree: doc/ ~preprocessors.Categorizer ~preprocessors.LabelEncoder ~preprocessors.MultiHotEncoder ~preprocessors.OneHotEncoder ~preprocessors.OrdinalEncoder Feature Scalers --------------- .. autosummary:: :nosignatures: :toctree: doc/ ~preprocessors.MaxAbsScaler ~preprocessors.MinMaxScaler ~preprocessors.Normalizer ~preprocessors.PowerTransformer ~preprocessors.RobustScaler ~preprocessors.StandardScaler K-Bins Discretizers ------------------- .. autosummary:: :nosignatures: :toctree: doc/ ~preprocessors.CustomKBinsDiscretizer ~preprocessors.UniformKBinsDiscretizer Feature Hashers and Vectorizers ------------------------------- .. autosummary:: :nosignatures: :toctree: doc/ ~preprocessors.FeatureHasher ~preprocessors.CountVectorizer ~preprocessors.HashingVectorizer Specialized Preprocessors ------------------------- .. autosummary:: :nosignatures: :toctree: doc/ ~preprocessors.TorchVisionPreprocessor --- .. _batch_inference_home: End-to-end: Offline Batch Inference =================================== Offline batch inference is a process for generating model predictions on a fixed set of input data. Ray Data offers an efficient and scalable solution for batch inference, providing faster execution and cost-effectiveness for deep learning applications. .. 
https://docs.google.com/presentation/d/1l03C1-4jsujvEFZUM4JVNy8Ju8jnY5Lc_3q7MBWi2PQ/edit#slide=id.g230eb261ad2_0_0 .. image:: images/stream-example.png :width: 650px :align: center .. note:: This guide is primarily focused on batch inference with deep learning frameworks. For more information on batch inference with LLMs, see :ref:`Working with LLMs `. .. _batch_inference_quickstart: Quickstart ---------- To start, install Ray Data: .. code-block:: bash pip install -U "ray[data]" Using Ray Data for offline inference involves four basic steps: - **Step 1:** Load your data into a Ray Dataset. Ray Data supports many different datasources and formats. For more details, see :ref:`Loading Data `. - **Step 2:** Define a Python class to load the pre-trained model. - **Step 3:** Transform your dataset using the pre-trained model by calling :meth:`ds.map_batches() `. For more details, see :ref:`Transforming Data `. - **Step 4:** Get the final predictions by either iterating through the output or saving the results. For more details, see the :ref:`Iterating over data ` and :ref:`Saving data ` user guides. For more in-depth examples for your use case, see :doc:`the batch inference examples`. For how to configure batch inference, see :ref:`the configuration guide`. .. tab-set:: .. tab-item:: HuggingFace :sync: HuggingFace .. testcode:: from typing import Dict import numpy as np import ray # Step 1: Create a Ray Dataset from in-memory Numpy arrays. # You can also create a Ray Dataset from many other sources and file # formats. ds = ray.data.from_numpy(np.asarray(["Complete this", "for me"])) # Step 2: Define a Predictor class for inference. # Use a class to initialize the model just once in `__init__` # and reuse it for inference across multiple batches. class HuggingFacePredictor: def __init__(self): from transformers import pipeline # Initialize a pre-trained GPT2 Huggingface pipeline. self.model = pipeline("text-generation", model="gpt2") # Logic for inference on 1 batch of data. def __call__(self, batch: Dict[str, np.ndarray]) -> Dict[str, list]: # Get the predictions from the input batch. predictions = self.model(list(batch["data"]), max_length=20, num_return_sequences=1) # `predictions` is a list of length-one lists. For example: # [[{'generated_text': 'output_1'}], ..., [{'generated_text': 'output_2'}]] # Modify the output to get it into the following format instead: # ['output_1', 'output_2'] batch["output"] = [sequences[0]["generated_text"] for sequences in predictions] return batch # Step 2: Map the Predictor over the Dataset to get predictions. # Use 2 parallel actors for inference. Each actor predicts on a # different partition of data. predictions = ds.map_batches(HuggingFacePredictor, compute=ray.data.ActorPoolStrategy(size=2)) # Step 3: Show one prediction output. predictions.show(limit=1) .. testoutput:: :options: +MOCK {'data': 'Complete this', 'output': 'Complete this information or purchase any item from this site.\n\nAll purchases are final and non-'} .. tab-item:: PyTorch :sync: PyTorch .. testcode:: from typing import Dict import numpy as np import torch import torch.nn as nn import ray # Step 1: Create a Ray Dataset from in-memory Numpy arrays. # You can also create a Ray Dataset from many other sources and file # formats. ds = ray.data.from_numpy(np.ones((1, 100))) # Step 2: Define a Predictor class for inference. # Use a class to initialize the model just once in `__init__` # and reuse it for inference across multiple batches. 
class TorchPredictor: def __init__(self): # Load a dummy neural network. # Set `self.model` to your pre-trained PyTorch model. self.model = nn.Sequential( nn.Linear(in_features=100, out_features=1), nn.Sigmoid(), ) self.model.eval() # Logic for inference on 1 batch of data. def __call__(self, batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]: tensor = torch.as_tensor(batch["data"], dtype=torch.float32) with torch.inference_mode(): # Get the predictions from the input batch. return {"output": self.model(tensor).numpy()} # Step 2: Map the Predictor over the Dataset to get predictions. # Use 2 parallel actors for inference. Each actor predicts on a # different partition of data. predictions = ds.map_batches(TorchPredictor, compute=ray.data.ActorPoolStrategy(size=2)) # Step 3: Show one prediction output. predictions.show(limit=1) .. testoutput:: :options: +MOCK {'output': array([0.5590901], dtype=float32)} .. tab-item:: TensorFlow :sync: TensorFlow .. testcode:: from typing import Dict import numpy as np import ray # Step 1: Create a Ray Dataset from in-memory Numpy arrays. # You can also create a Ray Dataset from many other sources and file # formats. ds = ray.data.from_numpy(np.ones((1, 100))) # Step 2: Define a Predictor class for inference. # Use a class to initialize the model just once in `__init__` # and reuse it for inference across multiple batches. class TFPredictor: def __init__(self): from tensorflow import keras # Load a dummy neural network. # Set `self.model` to your pre-trained Keras model. input_layer = keras.Input(shape=(100,)) output_layer = keras.layers.Dense(1, activation="sigmoid") self.model = keras.Sequential([input_layer, output_layer]) # Logic for inference on 1 batch of data. def __call__(self, batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]: # Get the predictions from the input batch. return {"output": self.model(batch["data"]).numpy()} # Step 2: Map the Predictor over the Dataset to get predictions. # Use 2 parallel actors for inference. Each actor predicts on a # different partition of data. predictions = ds.map_batches(TFPredictor, compute=ray.data.ActorPoolStrategy(size=2)) # Step 3: Show one prediction output. predictions.show(limit=1) .. testoutput:: :options: +MOCK {'output': array([0.625576], dtype=float32)} .. tab-item:: LLM Inference :sync: vLLM Ray Data offers native integration with vLLM, a high-performance inference engine for large language models (LLMs). .. testcode:: :skipif: True import ray from ray.data.llm import vLLMEngineProcessorConfig, build_processor import numpy as np config = vLLMEngineProcessorConfig( model="unsloth/Llama-3.1-8B-Instruct", engine_kwargs={ "enable_chunked_prefill": True, "max_num_batched_tokens": 4096, "max_model_len": 16384, }, concurrency=1, batch_size=64, ) processor = build_processor( config, preprocess=lambda row: dict( messages=[ {"role": "system", "content": "You are a bot that responds with haikus."}, {"role": "user", "content": row["item"]} ], sampling_params=dict( temperature=0.3, max_tokens=250, ) ), postprocess=lambda row: dict( answer=row["generated_text"] ), ) ds = ray.data.from_items(["Start of the haiku is: Complete this for me..."]) ds = processor(ds) ds.show(limit=1) .. testoutput:: :options: +MOCK {'answer': 'Snowflakes gently fall\nBlanketing the winter scene\nFrozen peaceful hush'} .. _batch_inference_configuration: Configuration and troubleshooting --------------------------------- .. 
_batch_inference_gpu: Using GPUs for inference ~~~~~~~~~~~~~~~~~~~~~~~~ To use GPUs for inference, make the following changes to your code: 1. Update the class implementation to move the model and data to and from GPU. 2. Specify ``num_gpus=1`` in the :meth:`ds.map_batches() ` call to indicate that each actor should use 1 GPU. 3. Specify a ``batch_size`` for inference. For more details on how to configure the batch size, see :ref:`Configuring Batch Size `. The remaining is the same as the :ref:`Quickstart `. .. tab-set:: .. tab-item:: HuggingFace :sync: HuggingFace .. testcode:: from typing import Dict import numpy as np import ray ds = ray.data.from_numpy(np.asarray(["Complete this", "for me"])) class HuggingFacePredictor: def __init__(self): from transformers import pipeline # Set "cuda:0" as the device so the Huggingface pipeline uses GPU. self.model = pipeline("text-generation", model="gpt2", device="cuda:0") def __call__(self, batch: Dict[str, np.ndarray]) -> Dict[str, list]: predictions = self.model(list(batch["data"]), max_length=20, num_return_sequences=1) batch["output"] = [sequences[0]["generated_text"] for sequences in predictions] return batch # Use 2 actors, each actor using 1 GPU. 2 GPUs total. predictions = ds.map_batches( HuggingFacePredictor, num_gpus=1, # Specify the batch size for inference. # Increase this for larger datasets. batch_size=1, # Set the concurrency to the number of GPUs in your cluster. compute=ray.data.ActorPoolStrategy(size=2), ) predictions.show(limit=1) .. testoutput:: :options: +MOCK {'data': 'Complete this', 'output': 'Complete this poll. Which one do you think holds the most promise for you?\n\nThank you'} .. tab-item:: PyTorch :sync: PyTorch .. testcode:: from typing import Dict import numpy as np import torch import torch.nn as nn import ray ds = ray.data.from_numpy(np.ones((1, 100))) class TorchPredictor: def __init__(self): # Move the neural network to GPU device by specifying "cuda". self.model = nn.Sequential( nn.Linear(in_features=100, out_features=1), nn.Sigmoid(), ).cuda() self.model.eval() def __call__(self, batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]: # Move the input batch to GPU device by specifying "cuda". tensor = torch.as_tensor(batch["data"], dtype=torch.float32, device="cuda") with torch.inference_mode(): # Move the prediction output back to CPU before returning. return {"output": self.model(tensor).cpu().numpy()} # Use 2 actors, each actor using 1 GPU. 2 GPUs total. predictions = ds.map_batches( TorchPredictor, num_gpus=1, # Specify the batch size for inference. # Increase this for larger datasets. batch_size=1, # Set the concurrency to the number of GPUs in your cluster. compute=ray.data.ActorPoolStrategy(size=2), ) predictions.show(limit=1) .. testoutput:: :options: +MOCK {'output': array([0.5590901], dtype=float32)} .. tab-item:: TensorFlow :sync: TensorFlow .. testcode:: from typing import Dict import numpy as np import ray ds = ray.data.from_numpy(np.ones((1, 100))) class TFPredictor: def __init__(self): import tensorflow as tf from tensorflow import keras # Move the neural network to GPU by specifying the GPU device. with tf.device("GPU:0"): input_layer = keras.Input(shape=(100,)) output_layer = keras.layers.Dense(1, activation="sigmoid") self.model = keras.Sequential([input_layer, output_layer]) def __call__(self, batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]: import tensorflow as tf # Move the input batch to GPU by specifying GPU device. 
with tf.device("GPU:0"): return {"output": self.model(batch["data"]).numpy()} # Use 2 actors, each actor using 1 GPU. 2 GPUs total. predictions = ds.map_batches( TFPredictor, num_gpus=1, # Specify the batch size for inference. # Increase this for larger datasets. batch_size=1, # Set the concurrency to the number of GPUs in your cluster. compute=ray.data.ActorPoolStrategy(size=2), ) predictions.show(limit=1) .. testoutput:: :options: +MOCK {'output': array([0.625576], dtype=float32)} .. _batch_inference_batch_size: Configuring Batch Size ~~~~~~~~~~~~~~~~~~~~~~ Configure the size of the input batch that's passed to ``__call__`` by setting the ``batch_size`` argument for :meth:`ds.map_batches() `. Increasing batch size results in faster execution because inference is a vectorized operation. For GPU inference, increasing batch size increases GPU utilization. Set the batch size as large as possible without running out of memory. If you encounter out-of-memory errors, decreasing ``batch_size`` may help. .. testcode:: from typing import Dict import numpy as np import ray ds = ray.data.from_numpy(np.ones((10, 100))) def assert_batch(batch: Dict[str, np.ndarray]): # Each batch is a dict of column arrays; check the number of rows. assert len(batch["data"]) == 2 return batch # Specify that each input batch should be of size 2. ds.map_batches(assert_batch, batch_size=2) .. caution:: The default ``batch_size`` of ``4096`` may be too large for datasets with large rows (for example, tables with many columns or a collection of large images). Handling GPU out-of-memory failures ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ If you run into CUDA out-of-memory issues, your batch size is likely too large. Decrease the batch size by following :ref:`these steps `. If your batch size is already set to 1, then use either a smaller model or GPU devices with more memory. For advanced users working with large models, you can use model parallelism to shard the model across multiple GPUs. Optimizing expensive CPU preprocessing ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ If your workload involves expensive CPU preprocessing in addition to model inference, you can optimize throughput by separating the preprocessing and inference logic into separate operations. This separation allows inference on batch :math:`N` to execute concurrently with preprocessing on batch :math:`N+1`. For an example where preprocessing is done in a separate `map` call, see :doc:`Image Classification Batch Inference with PyTorch ResNet18 `. Handling CPU out-of-memory failures ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ If you run out of CPU RAM, you likely have too many model replicas running concurrently on the same node. For example, if a model uses 5 GB of RAM when created and run, and a machine has 16 GB of RAM total, then no more than three of these models can run at the same time. The default resource assignment of one CPU per task/actor might lead to `OutOfMemoryError` from Ray in this situation. Suppose your cluster has 4 nodes, each with 16 CPUs. To limit to at most 3 of these actors per node, you can override the CPU or memory: ..
testcode:: :skipif: True from typing import Dict import numpy as np import ray ds = ray.data.from_numpy(np.asarray(["Complete this", "for me"])) class HuggingFacePredictor: def __init__(self): from transformers import pipeline self.model = pipeline("text-generation", model="gpt2") def __call__(self, batch: Dict[str, np.ndarray]) -> Dict[str, list]: predictions = self.model(list(batch["data"]), max_length=20, num_return_sequences=1) batch["output"] = [sequences[0]["generated_text"] for sequences in predictions] return batch predictions = ds.map_batches( HuggingFacePredictor, # Require 5 CPUs per actor (so at most 3 can fit per 16 CPU node). num_cpus=5, # 3 actors per node, with 4 nodes in the cluster means concurrency of 12. compute=ray.data.ActorPoolStrategy(size=12), ) predictions.show(limit=1) --- Comparing Ray Data to other systems =================================== How does Ray Data compare to other solutions for offline inference? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. dropdown:: Batch Services: AWS Batch, GCP Batch Cloud providers such as AWS, GCP, and Azure provide batch services to manage compute infrastructure for you. Each service uses the same process: you provide the code, and the service runs your code on each node in a cluster. However, while infrastructure management is necessary, it is often not enough. These services have limitations, such as a lack of software libraries to address optimized parallelization, efficient data transfer, and easy debugging. These solutions are suitable only for experienced users who can write their own optimized batch inference code. Ray Data abstracts away not only the infrastructure management, but also the sharding of your dataset, the parallelization of the inference over these shards, and the transfer of data from storage to CPU to GPU. .. dropdown:: Online inference solutions: Bento ML, Sagemaker Batch Transform Solutions like `Bento ML `_, `Sagemaker Batch Transform `_, or :ref:`Ray Serve ` provide APIs to make it easy to write performant inference code and can abstract away infrastructure complexities. But they are designed for online inference rather than offline batch inference, which are two different problems with different sets of requirements. These solutions introduce additional complexity like HTTP, and cannot effectively handle large datasets leading inference service providers like `Bento ML to integrating with Apache Spark `_ for offline inference. Ray Data is built for offline batch jobs, without all the extra complexities of starting servers or sending HTTP requests. For a more detailed performance comparison between Ray Data and Sagemaker Batch Transform, see `Offline Batch Inference: Comparing Ray, Apache Spark, and SageMaker `_. .. dropdown:: Distributed Data Processing Frameworks: Apache Spark and Daft Ray Data handles many of the same batch processing workloads as `Apache Spark `_ and `Daft `_, but with a streaming paradigm that is better suited for GPU workloads for deep learning inference. However, Ray Data doesn't have a SQL interface unlike Spark and Daft. For a more detailed performance comparison between Ray Data and Apache Spark, see `Offline Batch Inference: Comparing Ray, Apache Spark, and SageMaker `_. How does Ray Data compare to other solutions for ML training ingest? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. 
dropdown:: PyTorch Dataset and DataLoader * **Framework-agnostic:** Datasets is framework-agnostic and portable between different distributed training frameworks, while `Torch datasets `__ are specific to Torch. * **No built-in IO layer:** Torch datasets do not have an I/O layer for common file formats or in-memory exchange with other frameworks; users need to bring in other libraries and roll this integration themselves. * **Generic distributed data processing:** Datasets is more general: it can handle generic distributed operations, including global per-epoch shuffling, which would otherwise have to be implemented by stitching together two separate systems. Torch datasets would require such stitching for anything more involved than batch-based preprocessing, and does not natively support shuffling across worker shards. See our `blog post `__ on why this shared infrastructure is important for 3rd generation ML architectures. * **Lower overhead:** Datasets is lower overhead: it supports zero-copy exchange between processes, in contrast to the multi-processing-based pipelines of Torch datasets. .. dropdown:: TensorFlow Dataset * **Framework-agnostic:** Datasets is framework-agnostic and portable between different distributed training frameworks, while `TensorFlow datasets `__ is specific to TensorFlow. * **Unified single-node and distributed:** Datasets unifies single and multi-node training under the same abstraction. TensorFlow datasets presents `separate concepts `__ for distributed data loading and prevents code from being seamlessly scaled to larger clusters. * **Generic distributed data processing:** Datasets is more general: it can handle generic distributed operations, including global per-epoch shuffling, which would otherwise have to be implemented by stitching together two separate systems. TensorFlow datasets would require such stitching for anything more involved than basic preprocessing, and does not natively support full-shuffling across worker shards; only file interleaving is supported. See our `blog post `__ on why this shared infrastructure is important for 3rd generation ML architectures. * **Lower overhead:** Datasets is lower overhead: it supports zero-copy exchange between processes, in contrast to the multi-processing-based pipelines of TensorFlow datasets. .. dropdown:: Petastorm * **Supported data types:** `Petastorm `__ only supports Parquet data, while Ray Data supports many file formats. * **Lower overhead:** Datasets is lower overhead: it supports zero-copy exchange between processes, in contrast to the multi-processing-based pipelines used by Petastorm. * **No data processing:** Petastorm does not expose any data processing APIs. .. dropdown:: NVTabular * **Supported data types:** `NVTabular `__ only supports tabular (Parquet, CSV, Avro) data, while Ray Data supports many other file formats. * **Lower overhead:** Datasets is lower overhead: it supports zero-copy exchange between processes, in contrast to the multi-processing-based pipelines used by NVTabular. * **Heterogeneous compute:** NVTabular doesn't support mixing heterogeneous resources in dataset transforms (e.g. both CPU and GPU transformations), while Ray Data supports this. --- ======================== Contributing to Ray Data ======================== .. toctree:: :maxdepth: 2 contributing-guide how-to-write-tests --- .. _custom_datasource: Advanced: Read and Write Custom File Types ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. vale off .. Ignoring Vale because of future tense. 
This guide shows you how to extend Ray Data to read and write file types that aren't natively supported. This is an advanced guide, and you'll use unstable internal APIs. .. vale on Images are already supported with the :func:`~ray.data.read_images` and :meth:`~ray.data.Dataset.write_images` APIs, but this example shows you how to implement them for illustrative purposes. Read data from files -------------------- .. tip:: If you're not contributing to Ray Data, you don't need to create a :class:`~ray.data.Datasource`. Instead, you can call :func:`~ray.data.read_binary_files` and decode files with :meth:`~ray.data.Dataset.map`. The core abstraction for reading files is :class:`~ray.data.datasource.FileBasedDatasource`. It provides file-specific functionality on top of the :class:`~ray.data.Datasource` interface. To subclass :class:`~ray.data.datasource.FileBasedDatasource`, implement the constructor and ``_read_stream``. Implement the constructor ========================= Call the superclass constructor and specify the files you want to read. Optionally, specify valid file extensions. Ray Data ignores files with other extensions. .. literalinclude:: doc_code/custom_datasource_example.py :language: python :start-after: __datasource_constructor_start__ :end-before: __datasource_constructor_end__ Implement ``_read_stream`` ========================== ``_read_stream`` is a generator that yields one or more blocks of data from a file. `Blocks `_ are a Data-internal abstraction for a collection of rows. They can be PyArrow tables, pandas DataFrames, or dictionaries of NumPy arrays. Don't create a block directly. Instead, add rows of data to a `DelegatingBlockBuilder `_. .. literalinclude:: doc_code/custom_datasource_example.py :language: python :start-after: __read_stream_start__ :end-before: __read_stream_end__ Read your data ============== Once you've implemented ``ImageDatasource``, call :func:`~ray.data.read_datasource` to read images into a :class:`~ray.data.Dataset`. Ray Data reads your files in parallel. .. literalinclude:: doc_code/custom_datasource_example.py :language: python :start-after: __read_datasource_start__ :end-before: __read_datasource_end__ Write data to files ------------------- .. note:: The write interface is under active development and might change in the future. If you have feature requests, `open a GitHub Issue `_. The core abstractions for writing data to files are :class:`~ray.data.datasource.RowBasedFileDatasink` and :class:`~ray.data.datasource.BlockBasedFileDatasink`. They provide file-specific functionality on top of the :class:`~ray.data.Datasink` interface. If you want to write one row per file, subclass :class:`~ray.data.datasource.RowBasedFileDatasink`. Otherwise, subclass :class:`~ray.data.datasource.BlockBasedFileDatasink`. .. vale off .. Ignoring Vale because of future tense. In this example, you'll write one image per file, so you'll subclass :class:`~ray.data.datasource.RowBasedFileDatasink`. To subclass :class:`~ray.data.datasource.RowBasedFileDatasink`, implement the constructor and :meth:`~ray.data.datasource.RowBasedFileDatasink.write_row_to_file`. .. vale on Implement the constructor ========================= Call the superclass constructor and specify the folder to write to. Optionally, specify a string representing the file format (for example, ``"png"``). Ray Data uses the file format as the file extension. .. 
literalinclude:: doc_code/custom_datasource_example.py :language: python :start-after: __datasink_constructor_start__ :end-before: __datasink_constructor_end__ Implement ``write_row_to_file`` =============================== ``write_row_to_file`` writes a row of data to a file. Each row is a dictionary that maps column names to values. .. literalinclude:: doc_code/custom_datasource_example.py :language: python :start-after: __write_row_to_file_start__ :end-before: __write_row_to_file_end__ Write your data =============== Once you've implemented ``ImageDatasink``, call :meth:`~ray.data.Dataset.write_datasink` to write images to files. Ray Data writes to multiple files in parallel. .. literalinclude:: doc_code/custom_datasource_example.py :language: python :start-after: __write_datasink_start__ :end-before: __write_datasink_end__ --- .. _datasets_scheduling: ================== Ray Data Internals ================== This guide describes the implementation of Ray Data. The intended audience is advanced users and Ray Data developers. For a gentler introduction to Ray Data, see :ref:`Quickstart `. .. _dataset_concept: Key concepts ============ Datasets and blocks ------------------- Datasets ~~~~~~~~ :class:`Dataset ` is the main user-facing Python API. It represents a distributed data collection, and defines data loading and processing operations. You typically use the API in this way: 1. Create a Ray Dataset from external storage or in-memory data. 2. Apply transformations to the data. 3. Write the outputs to external storage or feed the outputs to training workers. Blocks ~~~~~~ A *block* is the basic unit of data bulk that Ray Data stores in the object store and transfers over the network. Each block contains a disjoint subset of rows, and Ray Data loads and transforms these blocks in parallel. The following figure visualizes a dataset with three blocks, each holding 1000 rows. Ray Data holds the :class:`~ray.data.Dataset` on the process that triggers execution (which is usually the driver) and stores the blocks as objects in Ray's shared-memory :ref:`object store `. .. image:: images/dataset-arch.svg .. https://docs.google.com/drawings/d/1PmbDvHRfVthme9XD7EYM-LIHPXtHdOfjCbc1SCsM64k/edit Block formats ~~~~~~~~~~~~~ Blocks are Arrow tables or `pandas` DataFrames. Generally, blocks are Arrow tables unless Arrow can’t represent your data. The block format doesn’t affect the type of data returned by APIs like :meth:`~ray.data.Dataset.iter_batches`. Block size limiting ~~~~~~~~~~~~~~~~~~~ Ray Data bounds block sizes to avoid excessive communication overhead and prevent out-of-memory errors. Small blocks are good for latency and more streamed execution, while large blocks reduce scheduler and communication overhead. The default range attempts to make a good tradeoff for most jobs. Ray Data attempts to bound block sizes between 1 MiB and 128 MiB. To change the block size range, configure the ``target_min_block_size`` and ``target_max_block_size`` attributes of :class:`~ray.data.context.DataContext`. .. testcode:: import ray ctx = ray.data.DataContext.get_current() ctx.target_min_block_size = 1 * 1024 * 1024 ctx.target_max_block_size = 128 * 1024 * 1024 Dynamic block splitting ~~~~~~~~~~~~~~~~~~~~~~~ If a block is larger than 192 MiB (50% more than the target max size), Ray Data dynamically splits the block into smaller blocks. To change the size at which Ray Data splits blocks, configure ``MAX_SAFE_BLOCK_SIZE_FACTOR``. The default value is 1.5. .. 
testcode:: import ray ray.data.context.MAX_SAFE_BLOCK_SIZE_FACTOR = 1.5 Ray Data can’t split rows, so if your dataset contains large rows (for example, large images), Ray Data can’t bound the block size. Shuffle Algorithms ------------------ In data processing, shuffling refers to redistributing the individual partitions of a dataset (called :ref:`blocks ` in Ray Data). Ray Data implements two main shuffle algorithms: .. _hash-shuffle: Hash-shuffling ~~~~~~~~~~~~~~ .. note:: Hash-shuffling is available starting in Ray 2.46. Hash-shuffling is a classical hash-partitioning-based shuffle that works as follows: 1. **Partition phase:** rows in every block are hash-partitioned based on values in the *key columns* into a specified number of partitions, following the familiar modulo formula ``hash(key-values) % N`` used in hash tables. 2. **Push phase:** partition shards from individual blocks are then pushed to the corresponding aggregating actors (called ``HashShuffleAggregator``) that handle the respective partitions. 3. **Reduce phase:** aggregators combine the received partition shards back into blocks, optionally applying additional transformations before producing the resulting blocks. Hash-shuffling is particularly useful for operations that require deterministic partitioning based on keys, such as joins, group-by operations, and key-based repartitioning, because it ensures that rows with the same key values are placed in the same partition. .. note:: To use hash-shuffling in your aggregations and repartitioning operations, you currently need to set ``ray.data.DataContext.get_current().shuffle_strategy = ShuffleStrategy.HASH_SHUFFLE`` before creating a ``Dataset``. .. _range-partitioning-shuffle: Range-partitioning shuffle ~~~~~~~~~~~~~~~~~~~~~~~~~~ Range-partitioning shuffle is also a classical algorithm: the dataset is split into a target number of ranges, with boundaries that approximate the ranges of the fully sorted dataset. 1. **Sampling phase:** a fixed number of rows (10) is randomly sampled from every input block. The samples are combined into a single dataset, which is then sorted and split into the target number of partitions to define approximate *range boundaries*. 2. **Partition phase:** every block is sorted and split into partitions based on the *range boundaries* derived in the previous step. 3. **Reduce phase:** individual partitions within the same range are then recombined to produce the resulting block. .. note:: Range-partitioning shuffle is the default shuffling strategy. To set it explicitly, specify ``ray.data.DataContext.get_current().shuffle_strategy = ShuffleStrategy.SORT_SHUFFLE_PULL_BASED`` before creating a ``Dataset``. Operators, plans, and planning ------------------------------ Operators ~~~~~~~~~ There are two types of operators: *logical operators* and *physical operators*. Logical operators are stateless objects that describe “what” to do. Physical operators are stateful objects that describe “how” to do it. An example of a logical operator is ``ReadOp``, and an example of a physical operator is ``TaskPoolMapOperator``. Plans ~~~~~ A *logical plan* is a series of logical operators, and a *physical plan* is a series of physical operators. When you call APIs like :func:`ray.data.read_images` and :meth:`ray.data.Dataset.map_batches`, Ray Data produces a logical plan. When execution starts, the planner generates a corresponding physical plan.
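To make the lazy planning behavior concrete, here's a minimal sketch (not part of the original guide; the printed plan representation varies across Ray versions) that builds a small logical plan and only triggers physical execution when the dataset is consumed:

.. code-block:: python

    import ray

    # Chaining Dataset calls only constructs a logical plan.
    # No read or map tasks run at this point.
    ds = ray.data.range(1000).map_batches(lambda batch: batch)

    # Printing the Dataset displays the chain of operators that make up
    # the plan. The exact formatting differs between Ray versions.
    print(ds)

    # Execution starts when the Dataset is consumed, for example through
    # `materialize()`, `show()`, or an iteration API. At that point the
    # planner translates the logical plan into physical operators.
    ds.materialize()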
The planner ~~~~~~~~~~~ The Ray Data planner translates logical operators to one or more physical operators. For example, the planner translates the ``ReadOp`` logical operator into two physical operators: an ``InputDataBuffer`` and a ``TaskPoolMapOperator``. Whereas the ``ReadOp`` logical operator only describes the input data, the ``TaskPoolMapOperator`` physical operator actually launches tasks to read the data. Plan optimization ~~~~~~~~~~~~~~~~~ Ray Data applies optimizations to both logical and physical plans. For example, the ``OperatorFusionRule`` combines a chain of physical map operators into a single map operator. This prevents unnecessary serialization between map operators. To add custom optimization rules, implement a class that extends ``Rule`` and configure ``DEFAULT_LOGICAL_RULES`` or ``DEFAULT_PHYSICAL_RULES``. .. testcode:: import ray from ray.data._internal.logical.interfaces import Rule from ray.data._internal.logical.optimizers import get_logical_ruleset class CustomRule(Rule): def apply(self, plan): ... logical_ruleset = get_logical_ruleset() logical_ruleset.add(CustomRule) .. testcode:: :hide: logical_ruleset.remove(CustomRule) Types of physical operators ~~~~~~~~~~~~~~~~~~~~~~~~~~~ Physical operators take in a stream of block references and output another stream of block references. Some physical operators launch Ray Tasks and Actors to transform the blocks, and others only manipulate the references. ``MapOperator`` is the most common operator. All read, transform, and write operations are implemented with it. To process data, ``MapOperator`` implementations use either Ray Tasks or Ray Actors. Non-map operators include ``OutputSplitter`` and ``LimitOperator``. These two operators manipulate references to data, but don’t launch tasks or modify the underlying data. Execution --------- The executor ~~~~~~~~~~~~ The *executor* schedules tasks and moves data between physical operators. The executor and operators are located on the process where dataset execution starts. For batch inference jobs, this process is usually the driver. For training jobs, the executor runs on a special actor called ``SplitCoordinator``, which handles :meth:`~ray.data.Dataset.streaming_split`. Tasks and actors launched by operators are scheduled across the cluster, and outputs are stored in Ray’s distributed object store. The executor manipulates references to objects and doesn’t fetch the underlying data to the executor process itself. Out queues ~~~~~~~~~~ Each physical operator has an associated *out queue*. When a physical operator produces outputs, the executor moves the outputs to the operator’s out queue. .. _streaming_execution: Streaming execution ~~~~~~~~~~~~~~~~~~~ In contrast to bulk synchronous execution, Ray Data’s streaming execution doesn’t wait for one operator to complete to start the next. Each operator takes in and outputs a stream of blocks. This approach allows you to process datasets that are too large to fit in your cluster’s memory. The scheduling loop ~~~~~~~~~~~~~~~~~~~ The executor runs a loop. Each step works like this: 1. Wait until running tasks and actors have new outputs. 2. Move new outputs into the appropriate operator out queues. 3. Choose some operators and assign new inputs to them. These operators process the new inputs either by launching new tasks or by manipulating metadata. Choosing the best operator to assign inputs is one of the most important decisions in Ray Data. This decision is critical to the performance, stability, and scalability of a Ray Data job.
The executor can schedule an operator if the operator satisfies the following conditions: * The operator has inputs. * There are adequate resources available. * The operator isn’t backpressured. If there are multiple viable operators, the executor chooses the operator with the smallest out queue. Scheduling ========== Ray Data uses Ray Core for execution. Below is a summary of the :ref:`scheduling strategy ` for Ray Data: * The ``SPREAD`` scheduling strategy ensures that data blocks and map tasks are evenly balanced across the cluster. * Dataset tasks ignore placement groups by default, see :ref:`Ray Data and Placement Groups `. * Map operations use the ``SPREAD`` scheduling strategy if the total argument size is less than 50 MB; otherwise, they use the ``DEFAULT`` scheduling strategy. * Read operations use the ``SPREAD`` scheduling strategy. * All other operations, such as split, sort, and shuffle, use the ``DEFAULT`` scheduling strategy. .. _datasets_pg: Ray Data and placement groups ----------------------------- By default, Ray Data configures its tasks and actors to use the cluster-default scheduling strategy (``"DEFAULT"``). You can inspect this configuration variable here: :class:`ray.data.DataContext.get_current().scheduling_strategy `. This scheduling strategy schedules these Tasks and Actors outside any present placement group. To use current placement group resources specifically for Ray Data, set ``ray.data.DataContext.get_current().scheduling_strategy = None``. Consider this override only for advanced use cases to improve performance predictability. The general recommendation is to let Ray Data run outside placement groups. .. _datasets_tune: Ray Data and Tune ----------------- When using Ray Data in conjunction with :ref:`Ray Tune `, it's important to ensure there are enough free CPUs for Ray Data to run on. By default, Tune tries to fully utilize cluster CPUs. This can prevent Ray Data from scheduling tasks, reducing performance or causing workloads to hang. To ensure CPU resources are always available for Ray Data execution, limit the number of concurrent Tune trials with the ``max_concurrent_trials`` Tune option. .. literalinclude:: ./doc_code/key_concepts.py :language: python :start-after: __resource_allocation_1_begin__ :end-before: __resource_allocation_1_end__ Memory Management ================= This section describes how Ray Data manages execution and object store memory. Execution Memory ---------------- During execution, a task can read multiple input blocks, and write multiple output blocks. Input and output blocks consume both worker heap memory and shared memory through Ray's object store. Ray caps object store memory usage by spilling to disk, but excessive worker heap memory usage can cause out-of-memory errors. For more information on tuning memory usage and preventing out-of-memory errors, see the :ref:`performance guide `. Object Store Memory ------------------- Ray Data uses the Ray object store to store data blocks, which means it inherits the memory management features of the Ray object store. This section discusses the relevant features: * Object Spilling: Since Ray Data uses the Ray object store to store data blocks, any blocks that can't fit into object store memory are automatically spilled to disk. 
The objects are automatically reloaded when needed by downstream compute tasks. * Locality Scheduling: Ray preferentially schedules compute tasks on nodes that already have a local copy of the object, reducing the need to transfer objects between nodes in the cluster. * Reference Counting: Dataset blocks are kept alive by object store reference counting as long as there is any Dataset that references them. To free memory, delete any Python references to the Dataset object. --- .. _data: =================================================== Ray Data: Scalable Data Processing for AI Workloads =================================================== .. toctree:: :hidden: quickstart key-concepts user-guide examples api/api contributing/contributing comparisons benchmark data-internals Ray Data is a scalable data processing library for AI workloads built on Ray. Ray Data provides flexible and performant APIs for common operations such as :ref:`batch inference `, data preprocessing, and data loading for ML training. Unlike other distributed data systems, Ray Data features a :ref:`streaming execution engine ` to efficiently process large datasets and maintain high utilization across both CPU and GPU workloads. Quick start ----------- First, install Ray Data. To learn more about installing Ray and its libraries, see :ref:`Installing Ray `: .. code-block:: console $ pip install -U 'ray[data]' Here is an example of how to perform a simple batch text classification task with Ray Data: .. testcode:: import ray import pandas as pd class ClassificationModel: def __init__(self): from transformers import pipeline self.pipe = pipeline("text-classification") def __call__(self, batch: pd.DataFrame): results = self.pipe(list(batch["text"])) result_df = pd.DataFrame(results) return pd.concat([batch, result_df], axis=1) ds = ray.data.read_text("s3://anonymous@ray-example-data/sms_spam_collection_subset.txt") ds = ds.map_batches( ClassificationModel, compute=ray.data.ActorPoolStrategy(size=2), batch_size=64, batch_format="pandas" # num_gpus=1 # this will set 1 GPU per worker ) ds.show(limit=1) .. testoutput:: :options: +MOCK {'text': 'ham\tGo until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...', 'label': 'NEGATIVE', 'score': 0.9935141801834106} Why choose Ray Data? -------------------- Modern AI workloads revolve around the use of deep learning models, which are computationally intensive and often require specialized hardware such as GPUs. Unlike CPUs, GPUs often come with less memory, have different semantics for scheduling, and are much more expensive to run. Systems built to support traditional data processing pipelines often don't utilize such resources well. Ray Data treats AI workloads as first-class citizens and offers several key advantages: - **Faster and cheaper for deep learning**: Ray Data streams data between CPU preprocessing and GPU inference/training tasks, maximizing resource utilization and reducing costs by keeping GPUs active. - **Framework friendly**: Ray Data provides performant, first-class integration with common AI frameworks (vLLM, PyTorch, HuggingFace, TensorFlow) and common cloud providers (AWS, GCP, Azure). - **Support for multi-modal data**: Ray Data leverages Apache Arrow and Pandas and provides support for many data formats used in ML workloads such as Parquet, Lance, images, JSON, CSV, audio, video, and more.
- **Scalable by default**: Built on Ray for automatic scaling across heterogeneous clusters with different CPU and GPU machines. Code runs unchanged from one machine to hundreds of nodes processing hundreds of TB of data. .. https://docs.google.com/drawings/d/16AwJeBNR46_TsrkOmMbGaBK7u-OPsf_V8fHjU-d2PPQ/edit Learn more ---------- .. grid:: 1 2 2 2 :gutter: 1 :class-container: container pb-5 .. grid-item-card:: **Quickstart** ^^^ Get started with Ray Data with a simple example. +++ .. button-ref:: data_quickstart :color: primary :outline: :expand: Quickstart .. grid-item-card:: **Key Concepts** ^^^ Learn the key concepts behind Ray Data. Learn what Datasets are and how they're used. +++ .. button-ref:: data_key_concepts :color: primary :outline: :expand: Key Concepts .. grid-item-card:: **User Guides** ^^^ Learn how to use Ray Data, from basic usage to end-to-end guides. +++ .. button-ref:: data_user_guide :color: primary :outline: :expand: Learn how to use Ray Data .. grid-item-card:: **Examples** ^^^ Find both simple and scaling-out examples of using Ray Data. +++ .. button-ref:: examples :color: primary :outline: :expand: Ray Data Examples .. grid-item-card:: **API** ^^^ Get more in-depth information about the Ray Data API. +++ .. button-ref:: data-api :color: primary :outline: :expand: Read the API Reference Case studies for Ray Data ------------------------- **Training ingest using Ray Data** - `Pinterest uses Ray Data to do last mile data processing for model training `_ - `DoorDash elevates model training with Ray Data `_ - `Instacart builds distributed machine learning model training on Ray Data `_ - `Predibase speeds up image augmentation for model training using Ray Data `_ **Batch inference using Ray Data** - `ByteDance scales offline inference with multi-modal LLMs to 200 TB on Ray Data `_ - `Spotify's new ML platform built on Ray Data for batch inference `_ - `Sewer AI speeds up object detection on videos 3x using Ray Data `_ --- .. _execution_configurations: ======================== Execution Configurations ======================== Ray Data provides a number of configuration options that control various aspects of execution of Ray Data's :class:`~ray.data.Dataset` on top of configuration of the Ray Core cluster itself. Ray Data's configuration is primarily controlled through either of :class:`~ray.data.ExecutionOptions` or :class:`~ray.data.DataContext`. This guide describes the most important of these configurations and when to use them. Configuring :class:`~ray.data.ExecutionOptions` =============================================== The :class:`~ray.data.ExecutionOptions` class is used to configure options during Ray Dataset execution. To use it, modify the attributes in the current :class:`~ray.data.DataContext` object's `execution_options`. For example: .. testcode:: :hide: import ray .. testcode:: ctx = ray.data.DataContext.get_current() ctx.execution_options.verbose_progress = True * `resource_limits`: Set a soft limit on the resource usage during execution. For example, if there are other parts of the code which require some minimum amount of resources, you may want to limit the amount of resources that Ray Data uses. Auto-detected by default. * `exclude_resources`: Amount of resources to exclude from Ray Data. Set this if you have other workloads running on the same cluster. Note: * If you're using Ray Data with Ray Train, training resources are automatically excluded. Otherwise, off by default. 
* For each resource type, you can't set both ``resource_limits`` and ``exclude_resources``. * `locality_with_output`: Set this to prefer running tasks on the same node as the output node (node driving the execution). It can also be set to a list of node ids to spread the outputs across those nodes. This parameter applies to both :meth:`~ray.data.Dataset.map` and :meth:`~ray.data.Dataset.streaming_split` operations. This setting is useful if you know you are consuming the output data directly on the consumer node (such as for ML training ingest). However, other use cases can incur a performance penalty with this setting. Off by default. * `preserve_order`: Set this to preserve the ordering between blocks processed by operators under the streaming executor. Off by default. * `actor_locality_enabled`: Whether to enable locality-aware task dispatch to actors. This parameter applies to stateful :meth:`~ray.data.Dataset.map` operations. This setting is useful if you know you are consuming the output data directly on the consumer node (such as for ML batch inference). However, other use cases can incur a performance penalty with this setting. Off by default. * `verbose_progress`: Whether to report progress individually per operator. By default, only AllToAll operators and global progress is reported. This option is useful for performance debugging. On by default. For more details on each of the preceding options, see :class:`~ray.data.ExecutionOptions`. Configuring :class:`~ray.data.DataContext` ========================================== The :class:`~ray.data.DataContext` class is used to configure more general options for Ray Data usage, such as observability/logging options, error handling/retry behavior, and internal data formats. To use it, modify the attributes in the current :class:`~ray.data.DataContext` object. For example: .. testcode:: :hide: import ray .. testcode:: ctx = ray.data.DataContext.get_current() ctx.verbose_stats_logs = True Many of the options in :class:`~ray.data.DataContext` are intended for advanced use cases or debugging, and most users shouldn't need to modify them. However, some of the most important options are: * `max_errored_blocks`: Max number of blocks that are allowed to have errors, unlimited if negative. This option allows application-level exceptions in block processing tasks. These exceptions may be caused by UDFs (for example, due to corrupted data samples) or IO errors. Data in the failed blocks are dropped. This option can be useful to prevent a long-running job from failing due to a small number of bad blocks. By default, no retries are allowed. * `write_file_retry_on_errors`: A list of sub-strings of error messages that should trigger a retry when writing files. This is useful for handling transient errors when writing to remote storage systems. By default, retries on common transient AWS S3 errors. * `verbose_stats_logs`: Whether stats logs should be verbose. This includes fields such as ``extra_metrics`` in the stats output, which are excluded by default. Off by default. * `log_internal_stack_trace_to_stdout`: Whether to include internal Ray Data/Ray Core code stack frames when logging to ``stdout``. The full stack trace is always written to the Ray Data log file. Off by default. * `raise_original_map_exception`: Whether to raise the original exception encountered in map UDF instead of wrapping it in a `UserCodeException`. For more details on each of the preceding options, see :class:`~ray.data.DataContext`. --- .. 
_inspecting-data: =============== Inspecting Data =============== Inspect :class:`Datasets ` to better understand your data. This guide shows you how to: * `Describe datasets <#describing-datasets>`_ * `Inspect rows <#inspecting-rows>`_ * `Inspect batches <#inspecting-batches>`_ * `Inspect execution statistics <#inspecting-execution-statistics>`_ .. _describing-datasets: Describing datasets =================== :class:`Datasets ` are tabular. To view a dataset's column names and types, call :meth:`Dataset.schema() `. .. testcode:: import ray ds = ray.data.read_csv("s3://anonymous@air-example-data/iris.csv") print(ds.schema()) .. testoutput:: Column Type ------ ---- sepal length (cm) double sepal width (cm) double petal length (cm) double petal width (cm) double target int64 For more information like the number of rows, print the Dataset. .. testcode:: import ray ds = ray.data.read_csv("s3://anonymous@air-example-data/iris.csv") print(ds) .. testoutput:: Dataset(num_rows=..., schema=...) .. _inspecting-rows: Inspecting rows =============== To get a list of rows, call :meth:`Dataset.take() ` or :meth:`Dataset.take_all() `. Ray Data represents each row as a dictionary. .. testcode:: import ray ds = ray.data.read_csv("s3://anonymous@air-example-data/iris.csv") rows = ds.take(1) print(rows) .. testoutput:: [{'sepal length (cm)': 5.1, 'sepal width (cm)': 3.5, 'petal length (cm)': 1.4, 'petal width (cm)': 0.2, 'target': 0}] For more information on working with rows, see :ref:`Transforming rows ` and :ref:`Iterating over rows `. .. _inspecting-batches: Inspecting batches ================== A batch contains data from multiple rows. To inspect batches, call `Dataset.take_batch() `. By default, Ray Data represents batches as dicts of NumPy ndarrays. To change the type of the returned batch, set ``batch_format``. The batch format is independent from how Ray Data stores the underlying blocks, so you can use any batch format regardless of the internal block representation. .. tab-set:: .. tab-item:: NumPy .. testcode:: import ray ds = ray.data.read_images("s3://anonymous@ray-example-data/image-datasets/simple") batch = ds.take_batch(batch_size=2, batch_format="numpy") print("Batch:", batch) print("Image shape", batch["image"].shape) .. testoutput:: :options: +MOCK Batch: {'image': array([[[[...]]]], dtype=uint8)} Image shape: (2, 32, 32, 3) .. tab-item:: pandas .. testcode:: import ray ds = ray.data.read_csv("s3://anonymous@air-example-data/iris.csv") batch = ds.take_batch(batch_size=2, batch_format="pandas") print(batch) .. testoutput:: :options: +MOCK sepal length (cm) sepal width (cm) ... petal width (cm) target 0 5.1 3.5 ... 0.2 0 1 4.9 3.0 ... 0.2 0 .. tab-item:: pyarrow .. testcode:: import ray ds = ray.data.read_csv("s3://anonymous@air-example-data/iris.csv") batch = ds.take_batch(batch_size=2, batch_format="pyarrow") print(batch) .. testoutput:: pyarrow.Table sepal length (cm): double sepal width (cm): double petal length (cm): double petal width (cm): double target: int64 ---- sepal length (cm): [[5.1,4.9]] sepal width (cm): [[3.5,3]] petal length (cm): [[1.4,1.4]] petal width (cm): [[0.2,0.2]] target: [[0,0]] For more information on working with batches, see :ref:`Transforming batches ` and :ref:`Iterating over batches `. Inspecting execution statistics =============================== Ray Data calculates statistics during execution for each operator, such as wall clock time and memory usage. To view stats about your :class:`Datasets `, call :meth:`Dataset.stats() ` on an executed dataset. 
The stats are also persisted under `/tmp/ray/session_*/logs/ray-data/ray-data.log`. For more on how to read this output, see :ref:`Monitoring Your Workload with the Ray Data Dashboard `. .. This snippet below is skipped because of https://github.com/ray-project/ray/issues/54101. .. testcode:: :skipif: True import ray from huggingface_hub import HfFileSystem def f(batch): return batch def g(row): return True path = "hf://datasets/ylecun/mnist/mnist/" fs = HfFileSystem() train_files = [f["name"] for f in fs.ls(path) if "train" in f["name"] and f["name"].endswith(".parquet")] ds = ( ray.data.read_parquet(train_files, filesystem=fs) .map_batches(f) .filter(g) .materialize() ) print(ds.stats()) .. testoutput:: :options: +MOCK Operator 1 ReadParquet->SplitBlocks(32): 1 tasks executed, 32 blocks produced in 2.92s * Remote wall time: 103.38us min, 1.34s max, 42.14ms mean, 1.35s total * Remote cpu time: 102.0us min, 164.66ms max, 5.37ms mean, 171.72ms total * UDF time: 0us min, 0us max, 0.0us mean, 0us total * Peak heap memory usage (MiB): 266375.0 min, 281875.0 max, 274491 mean * Output num rows per block: 1875 min, 1875 max, 1875 mean, 60000 total * Output size bytes per block: 537986 min, 555360 max, 545963 mean, 17470820 total * Output rows per task: 60000 min, 60000 max, 60000 mean, 1 tasks used * Tasks per node: 1 min, 1 max, 1 mean; 1 nodes used * Operator throughput: * Ray Data throughput: 20579.80984833993 rows/s * Estimated single node throughput: 44492.67361278733 rows/s Operator 2 MapBatches(f)->Filter(g): 32 tasks executed, 32 blocks produced in 3.63s * Remote wall time: 675.48ms min, 1.0s max, 797.07ms mean, 25.51s total * Remote cpu time: 673.41ms min, 897.32ms max, 768.09ms mean, 24.58s total * UDF time: 661.65ms min, 978.04ms max, 778.13ms mean, 24.9s total * Peak heap memory usage (MiB): 152281.25 min, 286796.88 max, 164231 mean * Output num rows per block: 1875 min, 1875 max, 1875 mean, 60000 total * Output size bytes per block: 530251 min, 547625 max, 538228 mean, 17223300 total * Output rows per task: 1875 min, 1875 max, 1875 mean, 32 tasks used * Tasks per node: 32 min, 32 max, 32 mean; 1 nodes used * Operator throughput: * Ray Data throughput: 16512.364546087643 rows/s * Estimated single node throughput: 2352.3683708977856 rows/s Dataset throughput: * Ray Data throughput: 11463.372316361854 rows/s * Estimated single node throughput: 25580.963670075285 rows/s --- .. _iterating-over-data: =================== Iterating over Data =================== Ray Data lets you iterate over rows or batches of data. This guide shows you how to: * `Iterate over rows <#iterating-over-rows>`_ * `Iterate over batches <#iterating-over-batches>`_ * `Iterate over batches with shuffling <#iterating-over-batches-with-shuffling>`_ * `Split datasets for distributed parallel training <#splitting-datasets-for-distributed-parallel-training>`_ .. _iterating-over-rows: Iterating over rows =================== To iterate over the rows of your dataset, call :meth:`Dataset.iter_rows() `. Ray Data represents each row as a dictionary. .. testcode:: import ray ds = ray.data.read_csv("s3://anonymous@air-example-data/iris.csv") for row in ds.iter_rows(): print(row) .. testoutput:: {'sepal length (cm)': 5.1, 'sepal width (cm)': 3.5, 'petal length (cm)': 1.4, 'petal width (cm)': 0.2, 'target': 0} {'sepal length (cm)': 4.9, 'sepal width (cm)': 3.0, 'petal length (cm)': 1.4, 'petal width (cm)': 0.2, 'target': 0} ... 
{'sepal length (cm)': 5.9, 'sepal width (cm)': 3.0, 'petal length (cm)': 5.1, 'petal width (cm)': 1.8, 'target': 2} For more information on working with rows, see :ref:`Transforming rows ` and :ref:`Inspecting rows `. .. _iterating-over-batches: Iterating over batches ====================== A batch contains data from multiple rows. Iterate over batches of dataset in different formats by calling one of the following methods: * `Dataset.iter_batches() ` * `Dataset.iter_torch_batches() ` * `Dataset.to_tf() ` .. tab-set:: .. tab-item:: NumPy :sync: NumPy .. testcode:: import ray ds = ray.data.read_images("s3://anonymous@ray-example-data/image-datasets/simple") for batch in ds.iter_batches(batch_size=2, batch_format="numpy"): print(batch) .. testoutput:: :options: +MOCK {'image': array([[[[...]]]], dtype=uint8)} ... {'image': array([[[[...]]]], dtype=uint8)} .. tab-item:: pandas :sync: pandas .. testcode:: import ray ds = ray.data.read_csv("s3://anonymous@air-example-data/iris.csv") for batch in ds.iter_batches(batch_size=2, batch_format="pandas"): print(batch) .. testoutput:: :options: +MOCK sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) target 0 5.1 3.5 1.4 0.2 0 1 4.9 3.0 1.4 0.2 0 ... sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) target 0 6.2 3.4 5.4 2.3 2 1 5.9 3.0 5.1 1.8 2 .. tab-item:: Torch :sync: Torch .. testcode:: import ray ds = ray.data.read_images("s3://anonymous@ray-example-data/image-datasets/simple") for batch in ds.iter_torch_batches(batch_size=2): print(batch) .. testoutput:: :options: +MOCK {'image': tensor([[[[...]]]], dtype=torch.uint8)} ... {'image': tensor([[[[...]]]], dtype=torch.uint8)} .. tab-item:: TensorFlow :sync: TensorFlow .. testcode:: import ray ds = ray.data.read_csv("s3://anonymous@air-example-data/iris.csv") tf_dataset = ds.to_tf( feature_columns="sepal length (cm)", label_columns="target", batch_size=2 ) for features, labels in tf_dataset: print(features, labels) .. testoutput:: tf.Tensor([5.1 4.9], shape=(2,), dtype=float64) tf.Tensor([0 0], shape=(2,), dtype=int64) ... tf.Tensor([6.2 5.9], shape=(2,), dtype=float64) tf.Tensor([2 2], shape=(2,), dtype=int64) For more information on working with batches, see :ref:`Transforming batches ` and :ref:`Inspecting batches `. .. _iterating-over-batches-with-shuffling: Iterating over batches with shuffling ===================================== :class:`Dataset.random_shuffle ` is slow because it shuffles all rows. If a full global shuffle isn't required, you can shuffle a subset of rows up to a provided buffer size during iteration by specifying ``local_shuffle_buffer_size``. While this isn't a true global shuffle like ``random_shuffle``, it's more performant because it doesn't require excessive data movement. For more details about these options, see :doc:`Shuffling Data `. .. tip:: To configure ``local_shuffle_buffer_size``, choose the smallest value that achieves sufficient randomness. Higher values result in more randomness at the cost of slower iteration. See :ref:`Local shuffle when iterating over batches ` on how to diagnose slowdowns. .. tab-set:: .. tab-item:: NumPy :sync: NumPy .. testcode:: import ray ds = ray.data.read_images("s3://anonymous@ray-example-data/image-datasets/simple") for batch in ds.iter_batches( batch_size=2, batch_format="numpy", local_shuffle_buffer_size=250, ): print(batch) .. testoutput:: :options: +MOCK {'image': array([[[[...]]]], dtype=uint8)} ... {'image': array([[[[...]]]], dtype=uint8)} .. tab-item:: pandas :sync: pandas .. 
testcode:: import ray ds = ray.data.read_csv("s3://anonymous@air-example-data/iris.csv") for batch in ds.iter_batches( batch_size=2, batch_format="pandas", local_shuffle_buffer_size=250, ): print(batch) .. testoutput:: :options: +MOCK sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) target 0 6.3 2.9 5.6 1.8 2 1 5.7 4.4 1.5 0.4 0 ... sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) target 0 5.6 2.7 4.2 1.3 1 1 4.8 3.0 1.4 0.1 0 .. tab-item:: Torch :sync: Torch .. testcode:: import ray ds = ray.data.read_images("s3://anonymous@ray-example-data/image-datasets/simple") for batch in ds.iter_torch_batches( batch_size=2, local_shuffle_buffer_size=250, ): print(batch) .. testoutput:: :options: +MOCK {'image': tensor([[[[...]]]], dtype=torch.uint8)} ... {'image': tensor([[[[...]]]], dtype=torch.uint8)} .. tab-item:: TensorFlow :sync: TensorFlow .. testcode:: import ray ds = ray.data.read_csv("s3://anonymous@air-example-data/iris.csv") tf_dataset = ds.to_tf( feature_columns="sepal length (cm)", label_columns="target", batch_size=2, local_shuffle_buffer_size=250, ) for features, labels in tf_dataset: print(features, labels) .. testoutput:: :options: +MOCK tf.Tensor([5.2 6.3], shape=(2,), dtype=float64) tf.Tensor([1 2], shape=(2,), dtype=int64) ... tf.Tensor([5. 5.8], shape=(2,), dtype=float64) tf.Tensor([0 0], shape=(2,), dtype=int64) Splitting datasets for distributed parallel training ==================================================== If you're performing distributed data parallel training, call :meth:`Dataset.streaming_split ` to split your dataset into disjoint shards. .. note:: If you're using :ref:`Ray Train `, you don't need to split the dataset. Ray Train automatically splits your dataset for you. To learn more, see :ref:`Data Loading for ML Training guide `. .. testcode:: import ray @ray.remote class Worker: def train(self, data_iterator): for batch in data_iterator.iter_batches(batch_size=8): pass ds = ray.data.read_csv("s3://anonymous@air-example-data/iris.csv") workers = [Worker.remote() for _ in range(4)] shards = ds.streaming_split(n=4, equal=True) ray.get([w.train.remote(s) for w, s in zip(workers, shards)]) --- .. _joining-data: ============ Joining Data ============ .. note:: This is a new feature released in Ray 2.46. Note that this is an experimental feature and some things might not work as expected. Ray Data allows multiple :class:`~ray.data.dataset.Dataset` instances to be joined using different join types based on the provided key columns as follows: .. testcode:: import ray doubles_ds = ray.data.range(4).map( lambda row: {"id": row["id"], "double": int(row["id"]) * 2} ) squares_ds = ray.data.range(4).map( lambda row: {"id": row["id"], "square": int(row["id"]) ** 2} ) doubles_and_squares_ds = doubles_ds.join( squares_ds, join_type="inner", num_partitions=2, on=("id",), ) Ray Data supports the following join types (check out `Dataset.join` docs for up-to-date list): **Inner/Outer Joins:** - Inner, Left Outer, Right Outer, Full Outer **Semi Joins:** - Left Semi, Right Semi (returns all rows that have at least one matching row in the other table, only returning columns from the requested side) **Anti Joins:** - Left Anti, Right Anti (return rows that have no matching rows in the other table, only returning columns from the requested side) Internally joins are currently powered by the :ref:`hash-shuffle backend `. 
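Continuing the snippet above, the joined dataset behaves like any other ``Dataset``, so you can sort and inspect it as usual. A rough sketch (the row contents follow from the example inputs, but the unsorted order depends on partitioning):

.. code-block:: python

    # Sort by the join key so the output order is deterministic, then collect the rows.
    joined_rows = doubles_and_squares_ds.sort("id").take_all()
    print(joined_rows)
    # Roughly: [{'id': 0, 'double': 0, 'square': 0},
    #           {'id': 1, 'double': 2, 'square': 1}, ...]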
Configuring Joins
-----------------

Joins are generally memory-intensive operations that require accurate memory accounting and projection, and are therefore sensitive to skew and imbalance in the dataset.

Ray Data provides the following levers for tuning join performance for your workload:

- `num_partitions`: (required) The number of partitions that both incoming datasets are hash-partitioned into. See the :ref:`configuring number of partitions <joins_configuring_num_partitions>` section for guidance on how to tune this value.

- `partition_size_hint`: (optional) A hint to the join operator about the estimated average size of an individual partition (in bytes). If not specified, it defaults to `DataContext.target_max_block_size` (128MiB by default).

  - Ideally, `num_partitions * partition_size_hint` should approximate the actual dataset size, that is, you can estimate `partition_size_hint` as the dataset size divided by `num_partitions` (assuming relatively evenly sized partitions).
  - However, when you expect the dataset partitioning to be heavily skewed, `partition_size_hint` should approximate the largest partition to prevent Out-of-Memory (OOM) errors.

.. note:: Be mindful that by default Ray reserves only 30% of memory for its Object Store. Setting this to at least **50%** is recommended for all Ray Data workloads, and especially for ones using joins. To configure the Object Store to use 50% of memory, set the following environment variable in your image:

    .. code-block:: console

        RAY_DEFAULT_OBJECT_STORE_MEMORY_PROPORTION=0.5

.. _joins_configuring_num_partitions:

Configuring number of partitions
--------------------------------

The number of partitions (also referred to as blocks) provides an important trade-off between the size of the individual batches of rows handled by individual tasks and the memory requirements of the operations performed on them.

**Rule of thumb**: *keep partitions large, but not so large that they cause Out-of-Memory (OOM) errors.*

1. It's important not to oversize partitions for joins, because joined partitions that are too large to fit in memory lead to OOM errors.
2. It's also important not to create too many small partitions, because that creates the overhead of passing around a large number of small objects.

Configuring number of Aggregators
---------------------------------

"Aggregators" are worker actors that perform the actual joins, aggregations, and shuffling. They receive individual partition chunks from the incoming blocks and then "aggregate" them in the way required to perform the given operation.

The following are important considerations for configuring the number of aggregators in your pool:

- The number of aggregators defaults to 64, or `num_partitions` when there are fewer than 64 partitions.
- Individual aggregators might be assigned more than one partition (partitions are split evenly among the aggregators in round-robin fashion).
- Aggregators are stateful components that hold their state (partitions) **in memory** during shuffling.

.. note:: The rule of thumb is to avoid setting `num_partitions` much larger than the number of aggregators, as that might create bottlenecks.

1. Setting `DataContext.max_hash_shuffle_aggregators` caps the number of aggregators.
2. Setting it to a large enough value allocates one partition to one aggregator (when `max_hash_shuffle_aggregators >= num_partitions`).

---

.. _data_key_concepts:

Key Concepts
============

Datasets and blocks
-------------------

There are two main concepts in Ray Data: Datasets and Blocks.
A :class:`Dataset ` represents a distributed data collection and defines data loading and processing operations. It's the primary user-facing API for Ray Data.

Users typically use the API by creating a :class:`Dataset ` from external storage or in-memory data, applying transformations to the data, and writing the outputs to external storage or feeding the outputs to training workers.

The Dataset API is lazy, meaning that operations aren't executed until you materialize or consume the dataset, with methods like :meth:`~ray.data.Dataset.show`. This allows Ray Data to optimize the execution plan and execute operations in a pipelined, streaming fashion.

A *block* is a set of rows representing a single partition of the dataset. Blocks, as collections of rows represented in columnar formats (like Arrow), are the basic unit of data processing in Ray Data:

1. Every dataset is partitioned into a number of blocks, then
2. Processing of the whole dataset is distributed and parallelized at the block level (blocks are processed in parallel and, for the most part, independently).

The following figure visualizes a dataset with three blocks, each holding 1000 rows. Ray Data holds the :class:`~ray.data.Dataset` on the process that triggers execution (which is usually the entrypoint of the program, referred to as the :term:`driver`) and stores the blocks as objects in Ray's shared-memory :ref:`object store `. Internally, Ray Data can natively handle blocks either as Pandas ``DataFrame`` or PyArrow ``Table``.

.. image:: images/dataset-arch-with-blocks.svg

.. https://docs.google.com/drawings/d/1kOYQqHdMrBp2XorDIn0u0G_MvFj-uSA4qm6xf9tsFLM/edit

Operators and Plans
-------------------

Ray Data uses a two-phase planning process to execute operations efficiently. When you write a program using the Dataset API, Ray Data first builds a *logical plan* - a high-level description of what operations to perform. When execution begins, it converts this into a *physical plan* that specifies exactly how to execute those operations.

This diagram illustrates the complete planning process:

.. https://docs.google.com/drawings/d/1WrVAg3LwjPo44vjLsn17WLgc3ta2LeQGgRfE8UHrDA0/edit
.. image:: images/get_execution_plan.svg
    :width: 600
    :align: center

The building blocks of these plans are operators:

* Logical plans consist of *logical operators* that describe *what* operation to perform. For example, when you write ``dataset = ray.data.read_parquet(...)``, Ray Data creates a ``ReadOp`` logical operator to specify what data to read.
* Physical plans consist of *physical operators* that describe *how* to execute the operation. For example, Ray Data converts the ``ReadOp`` logical operator into a ``TaskPoolMapOperator`` physical operator that launches Ray tasks to read the data.

Here is a simple example of how Ray Data builds a logical plan. As you chain operations together, Ray Data constructs the logical plan behind the scenes:

.. testcode::

    import ray

    dataset = ray.data.range(100)
    dataset = dataset.add_column("test", lambda x: x["id"] + 1)
    dataset = dataset.select_columns("test")

You can inspect the resulting logical plan by printing the dataset:

.. code-block::

    Project
    +- MapBatches(add_column)
       +- Dataset(schema={...})

When execution begins, Ray Data optimizes the logical plan, then translates it into a physical plan - a series of operators that implement the actual data transformations. During this translation:

1. A single logical operator may become multiple physical operators.
For example, ``ReadOp`` becomes both ``InputDataBuffer`` and ``TaskPoolMapOperator``. 2. Both logical and physical plans go through optimization passes. For example, ``OperatorFusionRule`` combines map operators to reduce serialization overhead. Physical operators work by: * Taking in a stream of block references * Performing their operation (either transforming data with Ray Tasks/Actors or manipulating references) * Outputting another stream of block references For more details on Ray Tasks and Actors, see :ref:`Ray Core Concepts `. .. note:: A dataset's execution plan only runs when you materialize or consume the dataset through operations like :meth:`~ray.data.Dataset.show`. .. _streaming-execution: Streaming execution model ------------------------- Ray Data can stream data through a pipeline of operators to efficiently process large datasets. This means that different operators in an execution can be scaled independently while running concurrently, allowing for more flexible and fine-grained resource allocation. For example, if two map operators require different amounts or types of resources, the streaming execution model can allow them to run concurrently and independently while still maintaining high performance. Note that this is primarily useful for non-shuffle operations. Shuffle operations like :meth:`ds.sort() ` and :meth:`ds.groupby() ` require materializing data, which stops streaming until the shuffle is complete. Here is an example of how the streaming execution works in Ray Data. .. code-block:: python import ray # Create a dataset with 1K rows ds = ray.data.read_parquet(...) # Define a pipeline of operations ds = ds.map(cpu_function, num_cpus=2) ds = ds.map(GPUClass, num_gpus=1) ds = ds.map(cpu_function2, num_cpus=4) ds = ds.filter(filter_func) # Data starts flowing when you call a method like show() ds.show(5) This creates a logical plan like the following: .. code-block:: Filter(filter_func) +- Map(cpu_function2) +- Map(GPUClass) +- Map(cpu_function) +- Dataset(schema={...}) The streaming topology looks like the following: .. https://docs.google.com/drawings/d/10myFIVtpI_ZNdvTSxsaHlOhA_gHRdUde_aHRC9zlfOw/edit .. image:: images/streaming-topology.svg :width: 1000 :align: center In the streaming execution model, operators are connected in a pipeline, with each operator's output queue feeding directly into the input queue of the next downstream operator. This creates an efficient flow of data through the execution plan. This enables multiple stages to execute concurrently, improving overall performance and resource utilization. For example, if the map operator requires GPU resources, the streaming execution model can execute the map operator concurrently with the filter operator (which may run on CPUs), effectively utilizing the GPU through the entire duration of the pipeline. You can read more about the streaming execution model in this `blog post `__. --- .. _loading_data: ============ Loading Data ============ Ray Data loads data from various sources. This guide shows you how to: * `Read files <#reading-files>`_ like images * `Load in-memory data <#loading-data-from-other-libraries>`_ like pandas DataFrames * `Read databases <#reading-databases>`_ like MySQL .. _reading-files: Reading files ============= Ray Data reads files from local disk or cloud storage in a variety of file formats. To view the full list of supported file formats, see the :ref:`Input/Output reference `. .. tab-set:: .. tab-item:: Parquet To read Parquet files, call :func:`~ray.data.read_parquet`. 
.. testcode:: import ray ds = ray.data.read_parquet("s3://anonymous@ray-example-data/iris.parquet") print(ds.schema()) .. testoutput:: Column Type ------ ---- sepal.length double sepal.width double petal.length double petal.width double variety string .. tip:: When reading parquet files, you can take advantage of column pruning to efficiently filter columns at the file scan level. See :ref:`Parquet column pruning ` for more details on the projection pushdown feature. .. tab-item:: Images To read raw images, call :func:`~ray.data.read_images`. Ray Data represents images as NumPy ndarrays. .. testcode:: import ray ds = ray.data.read_images("s3://anonymous@ray-example-data/batoidea/JPEGImages/") print(ds.schema()) .. testoutput:: Column Type ------ ---- image ArrowTensorTypeV2(shape=(32, 32, 3), dtype=uint8) .. tab-item:: Text To read lines of text, call :func:`~ray.data.read_text`. .. testcode:: import ray ds = ray.data.read_text("s3://anonymous@ray-example-data/this.txt") print(ds.schema()) .. testoutput:: Column Type ------ ---- text string .. tab-item:: CSV To read CSV files, call :func:`~ray.data.read_csv`. .. testcode:: import ray ds = ray.data.read_csv("s3://anonymous@ray-example-data/iris.csv") print(ds.schema()) .. testoutput:: Column Type ------ ---- sepal length (cm) double sepal width (cm) double petal length (cm) double petal width (cm) double target int64 .. tab-item:: Binary To read raw binary files, call :func:`~ray.data.read_binary_files`. .. testcode:: import ray ds = ray.data.read_binary_files("s3://anonymous@ray-example-data/documents") print(ds.schema()) .. testoutput:: Column Type ------ ---- bytes binary .. tab-item:: TFRecords To read TFRecords files, call :func:`~ray.data.read_tfrecords`. .. testcode:: import ray ds = ray.data.read_tfrecords("s3://anonymous@ray-example-data/iris.tfrecords") print(ds.schema()) .. testoutput:: :options: +MOCK Column Type ------ ---- label binary petal.length float sepal.width float petal.width float sepal.length float Reading files from local disk ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ To read files from local disk, call a function like :func:`~ray.data.read_parquet` and specify paths with the ``local://`` schema. Paths can point to files or directories. To read formats other than Parquet, see the :ref:`Input/Output reference `. .. tip:: If your files are accessible on every node, exclude ``local://`` to parallelize the read tasks across the cluster. .. testcode:: :skipif: True import ray ds = ray.data.read_parquet("local:///tmp/iris.parquet") print(ds.schema()) .. testoutput:: Column Type ------ ---- sepal.length double sepal.width double petal.length double petal.width double variety string Reading files from cloud storage ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ To read files in cloud storage, authenticate all nodes with your cloud service provider. Then, call a method like :func:`~ray.data.read_parquet` and specify URIs with the appropriate schema. URIs can point to buckets, folders, or objects. To read formats other than Parquet, see the :ref:`Input/Output reference `. .. tab-set:: .. tab-item:: S3 To read files from Amazon S3, specify URIs with the ``s3://`` scheme. .. testcode:: import ray ds = ray.data.read_parquet("s3://anonymous@ray-example-data/iris.parquet") print(ds.schema()) .. testoutput:: Column Type ------ ---- sepal.length double sepal.width double petal.length double petal.width double variety string Ray Data relies on PyArrow for authentication with Amazon S3. 
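For example, for data in a private bucket you can construct a ``pyarrow.fs.S3FileSystem`` yourself and pass it to the read function. The following is a sketch with a hypothetical bucket and region; PyArrow resolves credentials from the usual AWS sources such as environment variables, config files, or instance roles.

.. code-block:: python

    import ray
    from pyarrow import fs

    # Hypothetical private bucket and region, shown for illustration only.
    filesystem = fs.S3FileSystem(region="us-west-2")

    ds = ray.data.read_parquet(
        "s3://your-private-bucket/iris.parquet",
        filesystem=filesystem,
    )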
For more on how to configure your credentials to be compatible with PyArrow, see their `S3 Filesystem docs `_. .. tab-item:: GCS To read files from Google Cloud Storage, install the `Filesystem interface to Google Cloud Storage `_ .. code-block:: console pip install gcsfs Then, create a ``GCSFileSystem`` and specify URIs with the ``gs://`` scheme. .. testcode:: :skipif: True import ray filesystem = gcsfs.GCSFileSystem(project="my-google-project") ds = ray.data.read_parquet( "gs://...", filesystem=filesystem ) print(ds.schema()) .. testoutput:: Column Type ------ ---- sepal.length double sepal.width double petal.length double petal.width double variety string Ray Data relies on PyArrow for authentication with Google Cloud Storage. For more on how to configure your credentials to be compatible with PyArrow, see their `GCS Filesystem docs `_. .. tab-item:: ABS To read files from Azure Blob Storage, install the `Filesystem interface to Azure-Datalake Gen1 and Gen2 Storage `_ .. code-block:: console pip install adlfs Then, create a ``AzureBlobFileSystem`` and specify URIs with the `az://` scheme. .. testcode:: :skipif: True import adlfs import ray ds = ray.data.read_parquet( "az://ray-example-data/iris.parquet", adlfs.AzureBlobFileSystem(account_name="azureopendatastorage") ) print(ds.schema()) .. testoutput:: Column Type ------ ---- sepal.length double sepal.width double petal.length double petal.width double variety string Ray Data relies on PyArrow for authentication with Azure Blob Storage. For more on how to configure your credentials to be compatible with PyArrow, see their `fsspec-compatible filesystems docs `_. Reading files from NFS ~~~~~~~~~~~~~~~~~~~~~~ To read files from NFS filesystems, call a function like :func:`~ray.data.read_parquet` and specify files on the mounted filesystem. Paths can point to files or directories. To read formats other than Parquet, see the :ref:`Input/Output reference `. .. testcode:: :skipif: True import ray ds = ray.data.read_parquet("/mnt/cluster_storage/iris.parquet") print(ds.schema()) .. testoutput:: Column Type ------ ---- sepal.length double sepal.width double petal.length double petal.width double variety string Handling compressed files ~~~~~~~~~~~~~~~~~~~~~~~~~ To read a compressed file, specify ``compression`` in ``arrow_open_stream_args``. You can use any `codec supported by Arrow `__. .. testcode:: import ray ds = ray.data.read_csv( "s3://anonymous@ray-example-data/iris.csv.gz", arrow_open_stream_args={"compression": "gzip"}, ) Downloading files from URIs ~~~~~~~~~~~~~~~~~~~~~~~~~~~ Sometimes you may have a metadata table with a column of URIs and you want to download the files referenced by the URIs. You can download data in bulk by leveraging the :func:`~ray.data.Dataset.with_column` method together with the :func:`~ray.data.expressions.download` expression. This approach lets the system handle the parallel downloading of files referenced by URLs in your dataset, without needing to manage async code within your own transformations. The following example shows how to download a batch of images from URLs listed in a Parquet file: .. testcode:: import ray from ray.data.expressions import download # Read a Parquet file containing a column of image URLs ds = ray.data.read_parquet("s3://anonymous@ray-example-data/imagenet/metadata_file.parquet") # Use `with_column` and `download` to download the images in parallel. # This creates a new column 'bytes' with the downloaded file contents. 
ds = ds.with_column( "bytes", download("image_url"), ) ds.take(1) Loading data from other libraries ================================= Loading data from single-node data libraries ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Ray Data interoperates with libraries like pandas, NumPy, and Arrow. .. tab-set:: .. tab-item:: Python objects To create a :class:`~ray.data.dataset.Dataset` from Python objects, call :func:`~ray.data.from_items` and pass in a list of ``Dict``. Ray Data treats each ``Dict`` as a row. .. testcode:: import ray ds = ray.data.from_items([ {"food": "spam", "price": 9.34}, {"food": "ham", "price": 5.37}, {"food": "eggs", "price": 0.94} ]) print(ds) .. testoutput:: MaterializedDataset( num_blocks=3, num_rows=3, schema={food: string, price: double} ) You can also create a :class:`~ray.data.dataset.Dataset` from a list of regular Python objects. In the schema, the column name defaults to "item". .. testcode:: import ray ds = ray.data.from_items([1, 2, 3, 4, 5]) print(ds) .. testoutput:: MaterializedDataset(num_blocks=5, num_rows=5, schema={item: int64}) .. tab-item:: NumPy To create a :class:`~ray.data.dataset.Dataset` from a NumPy array, call :func:`~ray.data.from_numpy`. Ray Data treats the outer axis as the row dimension. .. testcode:: import numpy as np import ray array = np.ones((3, 2, 2)) ds = ray.data.from_numpy(array) print(ds) .. testoutput:: MaterializedDataset( num_blocks=1, num_rows=3, schema={data: ArrowTensorTypeV2(shape=(2, 2), dtype=double)} ) .. tab-item:: pandas To create a :class:`~ray.data.dataset.Dataset` from a pandas DataFrame, call :func:`~ray.data.from_pandas`. .. testcode:: import pandas as pd import ray df = pd.DataFrame({ "food": ["spam", "ham", "eggs"], "price": [9.34, 5.37, 0.94] }) ds = ray.data.from_pandas(df) print(ds) .. testoutput:: MaterializedDataset( num_blocks=1, num_rows=3, schema={food: object, price: float64} ) .. tab-item:: PyArrow To create a :class:`~ray.data.dataset.Dataset` from an Arrow table, call :func:`~ray.data.from_arrow`. .. testcode:: import pyarrow as pa table = pa.table({ "food": ["spam", "ham", "eggs"], "price": [9.34, 5.37, 0.94] }) ds = ray.data.from_arrow(table) print(ds) .. testoutput:: MaterializedDataset( num_blocks=1, num_rows=3, schema={food: string, price: double} ) .. _loading_datasets_from_distributed_df: Loading data from distributed DataFrame libraries ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Ray Data interoperates with distributed data processing frameworks like `Daft `_, :ref:`Dask `, :ref:`Spark `, :ref:`Modin `, and :ref:`Mars `. .. note:: The Ray Community provides these operations but may not actively maintain them. If you run into issues, create a GitHub issue `here `__. .. tab-set:: .. tab-item:: Daft To create a :class:`~ray.data.dataset.Dataset` from a `Daft DataFrame `_, call :func:`~ray.data.from_daft`. This function executes the Daft dataframe and constructs a ``Dataset`` backed by the resultant arrow data produced by your Daft query. .. warning:: :func:`~ray.data.from_daft` doesn't work with PyArrow 14 and later. For more information, see `this issue `__. .. testcode:: :skipif: True import daft import ray df = daft.from_pydict({"int_col": [i for i in range(10000)], "str_col": [str(i) for i in range(10000)]}) ds = ray.data.from_daft(df) ds.show(3) .. testoutput:: {'int_col': 0, 'str_col': '0'} {'int_col': 1, 'str_col': '1'} {'int_col': 2, 'str_col': '2'} .. tab-item:: Dask To create a :class:`~ray.data.dataset.Dataset` from a `Dask DataFrame `__, call :func:`~ray.data.from_dask`. 
This function constructs a ``Dataset`` backed by the distributed Pandas DataFrame partitions that underly the Dask DataFrame. .. We skip the code snippet below because `from_dask` doesn't work with PyArrow 14 and later. For more information, see https://github.com/ray-project/ray/issues/54837 .. testcode:: :skipif: True import dask.dataframe as dd import pandas as pd import ray df = pd.DataFrame({"col1": list(range(10000)), "col2": list(map(str, range(10000)))}) ddf = dd.from_pandas(df, npartitions=4) # Create a Dataset from a Dask DataFrame. ds = ray.data.from_dask(ddf) ds.show(3) .. testoutput:: {'col1': 0, 'col2': '0'} {'col1': 1, 'col2': '1'} {'col1': 2, 'col2': '2'} .. tab-item:: Spark To create a :class:`~ray.data.dataset.Dataset` from a `Spark DataFrame `__, call :func:`~ray.data.from_spark`. This function creates a ``Dataset`` backed by the distributed Spark DataFrame partitions that underly the Spark DataFrame. .. TODO: This code snippet might not work correctly. We should test it. .. testcode:: :skipif: True import ray import raydp spark = raydp.init_spark(app_name="Spark -> Datasets Example", num_executors=2, executor_cores=2, executor_memory="500MB") df = spark.createDataFrame([(i, str(i)) for i in range(10000)], ["col1", "col2"]) ds = ray.data.from_spark(df) ds.show(3) .. testoutput:: {'col1': 0, 'col2': '0'} {'col1': 1, 'col2': '1'} {'col1': 2, 'col2': '2'} .. tab-item:: Iceberg To create a :class:`~ray.data.dataset.Dataset` from an `Iceberg Table `__, call :func:`~ray.data.read_iceberg`. This function creates a ``Dataset`` backed by the distributed files that underlie the Iceberg table. .. testcode:: :skipif: True import ray from pyiceberg.expressions import EqualTo ds = ray.data.read_iceberg( table_identifier="db_name.table_name", row_filter=EqualTo("column_name", "literal_value"), catalog_kwargs={"name": "default", "type": "glue"} ) ds.show(3) .. testoutput:: :options: +MOCK {'col1': 0, 'col2': '0'} {'col1': 1, 'col2': '1'} {'col1': 2, 'col2': '2'} .. tab-item:: Modin To create a :class:`~ray.data.dataset.Dataset` from a Modin DataFrame, call :func:`~ray.data.from_modin`. This function constructs a ``Dataset`` backed by the distributed Pandas DataFrame partitions that underly the Modin DataFrame. .. testcode:: import modin.pandas as md import pandas as pd import ray df = pd.DataFrame({"col1": list(range(10000)), "col2": list(map(str, range(10000)))}) mdf = md.DataFrame(df) # Create a Dataset from a Modin DataFrame. ds = ray.data.from_modin(mdf) ds.show(3) .. testoutput:: {'col1': 0, 'col2': '0'} {'col1': 1, 'col2': '1'} {'col1': 2, 'col2': '2'} .. tab-item:: Mars To create a :class:`~ray.data.dataset.Dataset` from a Mars DataFrame, call :func:`~ray.data.from_mars`. This function constructs a ``Dataset`` backed by the distributed Pandas DataFrame partitions that underly the Mars DataFrame. .. testcode:: :skipif: True import mars import mars.dataframe as md import pandas as pd import ray cluster = mars.new_cluster_in_ray(worker_num=2, worker_cpu=1) df = pd.DataFrame({"col1": list(range(10000)), "col2": list(map(str, range(10000)))}) mdf = md.DataFrame(df, num_partitions=8) # Create a tabular Dataset from a Mars DataFrame. ds = ray.data.from_mars(mdf) ds.show(3) .. testoutput:: {'col1': 0, 'col2': '0'} {'col1': 1, 'col2': '1'} {'col1': 2, 'col2': '2'} .. 
_loading_huggingface_datasets: Loading Hugging Face datasets ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ To read datasets from the Hugging Face Hub, use :func:`~ray.data.read_parquet` (or other read functions) with the ``HfFileSystem`` filesystem. This approach provides better performance and scalability than loading datasets into memory first. First, install the required dependencies .. code-block:: console pip install huggingface_hub Set your Hugging Face token to authenticate. While public datasets can be read without a token, Hugging Face rate limits are more aggressive without a token. To read Hugging Face datasets without a token, simply set the filesystem argument to ``HfFileSystem()``. .. code-block:: console export HF_TOKEN= For most Hugging Face datasets, the data is stored in Parquet files. You can directly read from the dataset path: .. testcode:: :skipif: True import os import ray from huggingface_hub import HfFileSystem ds = ray.data.read_parquet( "hf://datasets/wikimedia/wikipedia", file_extensions=["parquet"], filesystem=HfFileSystem(token=os.environ["HF_TOKEN"]), ) print(f"Dataset count: {ds.count()}") print(ds.schema()) .. testoutput:: Dataset count: 61614907 Column Type ------ ---- id string url string title string text string .. tip:: If you encounter serialization errors when reading from Hugging Face filesystems, try upgrading ``huggingface_hub`` to version 1.1.6 or later. For more details, see this issue: https://github.com/ray-project/ray/issues/59029 .. _loading_datasets_from_ml_libraries: Loading data from ML libraries ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Ray Data interoperates with PyTorch and TensorFlow datasets. .. tab-set:: .. tab-item:: HuggingFace To load a HuggingFace Dataset into Ray Data, use the HuggingFace Hub ``HfFileSystem`` with :func:`~ray.data.read_parquet`, :func:`~ray.data.read_csv`, or :func:`~ray.data.read_json`. Since HuggingFace datasets are often backed by these file formats, this approach enables efficient distributed reads directly from the Hub. .. testcode:: :skipif: True import ray.data from huggingface_hub import HfFileSystem path = "hf://datasets/Salesforce/wikitext/wikitext-2-raw-v1/" fs = HfFileSystem() ds = ray.data.read_parquet(path, filesystem=fs) print(ds.take(5)) .. testoutput:: :options: +MOCK [{'text': '...'}, {'text': '...'}] .. tab-item:: PyTorch To convert a PyTorch dataset to a Ray Dataset, call :func:`~ray.data.from_torch`. .. testcode:: import ray from torch.utils.data import Dataset from torchvision import datasets from torchvision.transforms import ToTensor tds = datasets.CIFAR10(root="data", train=True, download=True, transform=ToTensor()) ds = ray.data.from_torch(tds) print(ds) .. testoutput:: :options: +MOCK Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to data/cifar-10-python.tar.gz 100%|███████████████████████| 170498071/170498071 [00:07<00:00, 23494838.54it/s] Extracting data/cifar-10-python.tar.gz to data Dataset(num_rows=50000, schema={item: object}) .. tab-item:: TensorFlow To convert a TensorFlow dataset to a Ray Dataset, call :func:`~ray.data.from_tf`. .. warning:: :class:`~ray.data.from_tf` doesn't support parallel reads. Only use this function with small datasets like MNIST or CIFAR. .. testcode:: import ray import tensorflow_datasets as tfds tf_ds, _ = tfds.load("cifar10", split=["train", "test"]) ds = ray.data.from_tf(tf_ds) print(ds) .. The following `testoutput` is mocked to avoid illustrating download logs like "Downloading and preparing dataset 162.17 MiB". .. 
testoutput:: :options: +MOCK MaterializedDataset( num_blocks=..., num_rows=50000, schema={ id: binary, image: ArrowTensorTypeV2(shape=(32, 32, 3), dtype=uint8), label: int64 } ) Reading databases ================= Ray Data reads from databases like MySQL, PostgreSQL, MongoDB, and BigQuery. .. _reading_sql: Reading SQL databases ~~~~~~~~~~~~~~~~~~~~~ Call :func:`~ray.data.read_sql` to read data from a database that provides a `Python DB API2-compliant `_ connector. .. tab-set:: .. tab-item:: MySQL To read from MySQL, install `MySQL Connector/Python `_. It's the first-party MySQL database connector. .. code-block:: console pip install mysql-connector-python Then, define your connection logic and query the database. .. testcode:: :skipif: True import mysql.connector import ray def create_connection(): return mysql.connector.connect( user="admin", password=..., host="example-mysql-database.c2c2k1yfll7o.us-west-2.rds.amazonaws.com", connection_timeout=30, database="example", ) # Get all movies dataset = ray.data.read_sql("SELECT * FROM movie", create_connection) # Get movies after the year 1980 dataset = ray.data.read_sql( "SELECT title, score FROM movie WHERE year >= 1980", create_connection ) # Get the number of movies per year dataset = ray.data.read_sql( "SELECT year, COUNT(*) FROM movie GROUP BY year", create_connection ) .. tab-item:: PostgreSQL To read from PostgreSQL, install `Psycopg 2 `_. It's the most popular PostgreSQL database connector. .. code-block:: console pip install psycopg2-binary Then, define your connection logic and query the database. .. testcode:: :skipif: True import psycopg2 import ray def create_connection(): return psycopg2.connect( user="postgres", password=..., host="example-postgres-database.c2c2k1yfll7o.us-west-2.rds.amazonaws.com", dbname="example", ) # Get all movies dataset = ray.data.read_sql("SELECT * FROM movie", create_connection) # Get movies after the year 1980 dataset = ray.data.read_sql( "SELECT title, score FROM movie WHERE year >= 1980", create_connection ) # Get the number of movies per year dataset = ray.data.read_sql( "SELECT year, COUNT(*) FROM movie GROUP BY year", create_connection ) .. tab-item:: Snowflake To read from Snowflake, install the `Snowflake Connector for Python `_. .. code-block:: console pip install snowflake-connector-python Then, define your connection logic and query the database. .. testcode:: :skipif: True import snowflake.connector import ray def create_connection(): return snowflake.connector.connect( user=..., password=... account="ZZKXUVH-IPB52023", database="example", ) # Get all movies dataset = ray.data.read_sql("SELECT * FROM movie", create_connection) # Get movies after the year 1980 dataset = ray.data.read_sql( "SELECT title, score FROM movie WHERE year >= 1980", create_connection ) # Get the number of movies per year dataset = ray.data.read_sql( "SELECT year, COUNT(*) FROM movie GROUP BY year", create_connection ) .. tab-item:: Databricks To read from Databricks, set the ``DATABRICKS_TOKEN`` environment variable to your Databricks warehouse access token. .. code-block:: console export DATABRICKS_TOKEN=... If you're not running your program on the Databricks runtime, also set the ``DATABRICKS_HOST`` environment variable. .. code-block:: console export DATABRICKS_HOST=adb-..azuredatabricks.net Then, call :func:`ray.data.read_databricks_tables` to read from the Databricks SQL warehouse. .. 
testcode:: :skipif: True import ray dataset = ray.data.read_databricks_tables( warehouse_id='...', # Databricks SQL warehouse ID catalog='catalog_1', # Unity catalog name schema='db_1', # Schema name query="SELECT title, score FROM movie WHERE year >= 1980", ) .. tab-item:: BigQuery To read from BigQuery, install the `Python Client for Google BigQuery `_ and the `Python Client for Google BigQueryStorage `_. .. code-block:: console pip install google-cloud-bigquery pip install google-cloud-bigquery-storage To read data from BigQuery, call :func:`~ray.data.read_bigquery` and specify the project id, dataset, and query (if applicable). .. testcode:: :skipif: True import ray # Read the entire dataset. Do not specify query. ds = ray.data.read_bigquery( project_id="my_gcloud_project_id", dataset="bigquery-public-data.ml_datasets.iris", ) # Read from a SQL query of the dataset. Do not specify dataset. ds = ray.data.read_bigquery( project_id="my_gcloud_project_id", query = "SELECT * FROM `bigquery-public-data.ml_datasets.iris` LIMIT 50", ) # Write back to BigQuery ds.write_bigquery( project_id="my_gcloud_project_id", dataset="destination_dataset.destination_table", overwrite_table=True, ) .. _reading_mongodb: Reading MongoDB ~~~~~~~~~~~~~~~ To read data from MongoDB, call :func:`~ray.data.read_mongo` and specify the source URI, database, and collection. You also need to specify a pipeline to run against the collection. .. testcode:: :skipif: True import ray # Read a local MongoDB. ds = ray.data.read_mongo( uri="mongodb://localhost:27017", database="my_db", collection="my_collection", pipeline=[{"$match": {"col": {"$gte": 0, "$lt": 10}}}, {"$sort": "sort_col"}], ) # Reading a remote MongoDB is the same. ds = ray.data.read_mongo( uri="mongodb://username:password@mongodb0.example.com:27017/?authSource=admin", database="my_db", collection="my_collection", pipeline=[{"$match": {"col": {"$gte": 0, "$lt": 10}}}, {"$sort": "sort_col"}], ) # Write back to MongoDB. ds.write_mongo( MongoDatasource(), uri="mongodb://username:password@mongodb0.example.com:27017/?authSource=admin", database="my_db", collection="my_collection", ) Reading from Kafka ====================== Ray Data reads from message queues like Kafka. .. _reading_kafka: To read data from Kafka topics, call :func:`~ray.data.read_kafka` and specify the topic names and broker addresses. Ray Data performs bounded reads between a start and end offset. First, install the required dependencies: .. code-block:: console pip install kafka-python Then, specify your Kafka configuration and read from topics. .. testcode:: :skipif: True import ray # Read from a single topic with offset range ds = ray.data.read_kafka( topics="my-topic", bootstrap_servers="localhost:9092", start_offset=0, end_offset=1000, ) # Read from multiple topics ds = ray.data.read_kafka( topics=["topic1", "topic2"], bootstrap_servers="localhost:9092", start_offset="earliest", end_offset="latest", ) # Read with authentication from ray.data import KafkaAuthConfig auth_config = KafkaAuthConfig( security_protocol="SASL_SSL", sasl_mechanism="PLAIN", sasl_plain_username="your-username", sasl_plain_password="your-password", ) ds = ray.data.read_kafka( topics="secure-topic", bootstrap_servers="localhost:9092", kafka_auth_config=auth_config, ) print(ds.schema()) .. 
testoutput:: Column Type ------ ---- offset int64 key binary value binary topic string partition int32 timestamp int64 timestamp_type int32 headers map Creating synthetic data ======================= Synthetic datasets can be useful for testing and benchmarking. .. tab-set:: .. tab-item:: Int Range To create a synthetic :class:`~ray.data.Dataset` from a range of integers, call :func:`~ray.data.range`. Ray Data stores the integer range in a single column called "id". .. testcode:: import ray ds = ray.data.range(10000) print(ds.schema()) .. testoutput:: Column Type ------ ---- id int64 .. tab-item:: Tensor Range To create a synthetic :class:`~ray.data.Dataset` containing arrays, call :func:`~ray.data.range_tensor`. Ray Data packs an integer range into ndarrays of the provided shape. In the schema, the column name defaults to "data". .. testcode:: import ray ds = ray.data.range_tensor(10, shape=(64, 64)) print(ds.schema()) .. testoutput:: Column Type ------ ---- data ArrowTensorTypeV2(shape=(64, 64), dtype=int64) Loading other datasources ========================== If Ray Data can't load your data, subclass :class:`~ray.data.Datasource`. Then, construct an instance of your custom datasource and pass it to :func:`~ray.data.read_datasource`. To write results, you might also need to subclass :class:`ray.data.Datasink`. Then, create an instance of your custom datasink and pass it to :func:`~ray.data.Dataset.write_datasink`. For more details, see :ref:`Advanced: Read and Write Custom File Types `. .. testcode:: :skipif: True # Read from a custom datasource. ds = ray.data.read_datasource(YourCustomDatasource(), **read_args) # Write to a custom datasink. ds.write_datasink(YourCustomDatasink()) Performance considerations ========================== By default, the number of output blocks from all read tasks is dynamically decided based on input data size and available resources. It should work well in most cases. However, you can also override the default value by setting the ``override_num_blocks`` argument. Ray Data decides internally how many read tasks to run concurrently to best utilize the cluster, ranging from ``1...override_num_blocks`` tasks. In other words, the higher the ``override_num_blocks``, the smaller the data blocks in the Dataset and hence more opportunities for parallel execution. For more information on how to tune the number of output blocks and other suggestions for optimizing read performance, see `Optimizing reads `__. --- .. _monitoring-your-workload: Monitoring Your Workload ======================== This section helps you debug and monitor the execution of your :class:`~ray.data.Dataset` by viewing the: * :ref:`Ray Data progress bars ` * :ref:`Ray Data dashboard ` * :ref:`Ray Data logs ` * :ref:`Ray Data stats ` .. _ray-data-progress-bars: Ray Data progress bars ---------------------- When you execute a :class:`~ray.data.Dataset`, Ray Data displays a set of progress bars in the console. These progress bars show various execution and progress-related metrics, including the number of rows completed/remaining, resource usage, and task/actor status. See the annotated image for a breakdown of how to interpret the progress bar outputs: .. image:: images/dataset-progress-bar.png :align: center Some additional notes on progress bars: * The progress bars are updated every second; resource usage, metrics, and task/actor status may take up to 5 seconds to update. 
* When the tasks section contains the label `[backpressure]`, it indicates that the operator is *backpressured*, meaning that the operator won't submit more tasks until the downstream operator is ready to accept more data. * The global resource usage is the sum of resources used by all operators, active and requested (includes pending scheduling and pending node assignment). Configuring the progress bar ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Depending on your use case, you may not be interested in the full progress bar output, or wish to turn them off altogether. Ray Data provides several ways to accomplish this: * Disabling operator-level progress bars: Set `DataContext.get_current().enable_operator_progress_bars = False`. This only shows the global progress bar, and omits operator-level progress bars. * Disabling all progress bars: Set `DataContext.get_current().enable_progress_bars = False`. This disables all progress bars from Ray Data related to dataset execution. * Disabling `ray_tqdm`: Set `DataContext.get_current().use_ray_tqdm = False`. This configures Ray Data to use the base `tqdm` library instead of the custom distributed `tqdm` implementation, which could be useful when debugging logging issues in a distributed setting. For operator names longer than a threshold of 100 characters, Ray Data truncates the names by default, to prevent the case when the operator names are long and the progress bar is too wide to fit on the screen. * To turn off this behavior and show the full operator name, set `DataContext.get_current().enable_progress_bar_name_truncation = False`. * To change the threshold of truncating the name, update the constant `ray.data._internal.progress_bar.ProgressBar.MAX_NAME_LENGTH = 42`. .. tip:: There is a new experimental console UI to show progress bars. Set `DataContext.get_current().enable_rich_progress_bars = True` or set the `RAY_DATA_ENABLE_RICH_PROGRESS_BARS=1` environment variable to enable. .. _ray-data-dashboard: Ray Data dashboard ------------------ Ray Data emits Prometheus metrics in real-time while a Dataset is executing. These metrics are tagged by both dataset and operator, and are displayed in multiple views across the Ray dashboard. .. note:: Most metrics are only available for physical operators that use the map operation. For example, physical operators created by :meth:`~ray.data.Dataset.map_batches`, :meth:`~ray.data.Dataset.map`, and :meth:`~ray.data.Dataset.flat_map`. Jobs: Ray Data overview ~~~~~~~~~~~~~~~~~~~~~~~ For an overview of all datasets that have been running on your cluster, see the Ray Data Overview in the :ref:`jobs view `. This table appears once the first dataset starts executing on the cluster, and shows dataset details such as: * execution progress (measured in blocks) * execution state (running, failed, or finished) * dataset start/end time * dataset-level metrics (for example, sum of rows processed over all operators) .. image:: images/data-overview-table.png :align: center For a more fine-grained overview, each dataset row in the table can also be expanded to display the same details for individual operators. .. image:: images/data-overview-table-expanded.png :align: center .. tip:: To evaluate a dataset-level metric where it's not appropriate to sum the values of all the individual operators, it may be more useful to look at the operator-level metrics of the last operator. 
For example, to calculate a dataset's throughput, use the "Rows Outputted" of the dataset's last operator, because the dataset-level metric contains the sum of rows outputted over all operators. Ray dashboard metrics ~~~~~~~~~~~~~~~~~~~~~ For a time-series view of these metrics, see the Ray Data section in the :ref:`Metrics view `. This section contains time-series graphs of all metrics emitted by Ray Data. Execution metrics are grouped by dataset and operator, and iteration metrics are grouped by dataset. The metrics recorded include: * Bytes spilled by objects from object store to disk * Bytes of objects allocated in object store * Bytes of objects freed in object store * Current total bytes of objects in object store * Logical CPUs allocated to dataset operators * Logical GPUs allocated to dataset operators * Bytes outputted by dataset operators * Rows outputted by dataset operators * Input blocks received by data operators * Input blocks/bytes processed in tasks by data operators * Input bytes submitted to tasks by data operators * Output blocks/bytes/rows generated in tasks by data operators * Output blocks/bytes taken by downstream operators * Output blocks/bytes from finished tasks * Submitted tasks * Running tasks * Tasks with at least one output block * Finished tasks * Failed tasks * Operator internal inqueue size (in blocks/bytes) * Operator internal outqueue size (in blocks/bytes) * Size of blocks used in pending tasks * Freed memory in object store * Spilled memory in object store * Time spent generating blocks * Time spent in task submission backpressure * Time spent to initialize iteration. * Time user code is blocked during iteration. * Time spent in user code during iteration. .. image:: images/data-dashboard.png :align: center To learn more about the Ray dashboard, including detailed setup instructions, see :ref:`Ray Dashboard `. Prometheus metrics ~~~~~~~~~~~~~~~~~~ Ray Data emits Prometheus metrics that you can use to monitor dataset execution. The metrics are tagged with `dataset` and `operator` labels to help you identify which dataset and operator the metrics are coming from. To access these metrics, you can query the Prometheus server running on the Ray head node. The default Prometheus server URL is `http://:8080`. The following tables list all available Ray Data metrics grouped by category. Overview metrics ^^^^^^^^^^^^^^^^ These metrics provide high-level information about dataset execution and resource usage. .. list-table:: :header-rows: 1 :widths: 30 70 * - Metric name - Description * - `data_spilled_bytes` - Bytes spilled by dataset operators. Set `DataContext.enable_get_object_locations_for_metrics` to `True` to report this metric. * - `data_freed_bytes` - Bytes freed by dataset operators * - `data_current_bytes` - Bytes of object store memory used by dataset operators * - `data_cpu_usage_cores` - CPUs allocated to dataset operators * - `data_gpu_usage_cores` - GPUs allocated to dataset operators * - `data_output_bytes` - Bytes outputted by dataset operators * - `data_output_rows` - Rows outputted by dataset operators Input metrics ^^^^^^^^^^^^^ These metrics track input data flowing into operators. .. 
list-table:: :header-rows: 1 :widths: 30 70 * - Metric name - Description * - `num_inputs_received` - Number of input blocks received by operator * - `num_row_inputs_received` - Number of input rows received by operator * - `bytes_inputs_received` - Byte size of input blocks received by operator * - `num_task_inputs_processed` - Number of input blocks that the operator's tasks finished processing * - `bytes_task_inputs_processed` - Byte size of input blocks that the operator's tasks finished processing * - `bytes_inputs_of_submitted_tasks` - Byte size of input blocks passed to submitted tasks * - `rows_inputs_of_submitted_tasks` - Number of rows in the input blocks passed to submitted tasks * - `average_num_inputs_per_task` - Average number of input blocks per task, or `None` if no task finished * - `average_bytes_inputs_per_task` - Average size in bytes of ref bundles passed to tasks, or `None` if no tasks submitted * - `average_rows_inputs_per_task` - Average number of rows in input blocks per task, or `None` if no task submitted Output metrics ^^^^^^^^^^^^^^ These metrics track output data generated by operators. .. list-table:: :header-rows: 1 :widths: 30 70 * - Metric name - Description * - `num_task_outputs_generated` - Number of output blocks generated by tasks * - `bytes_task_outputs_generated` - Byte size of output blocks generated by tasks * - `rows_task_outputs_generated` - Number of output rows generated by tasks * - `row_outputs_taken` - Number of rows that are already taken by downstream operators * - `block_outputs_taken` - Number of blocks that are already taken by downstream operators * - `num_outputs_taken` - Number of output blocks that are already taken by downstream operators * - `bytes_outputs_taken` - Byte size of output blocks that are already taken by downstream operators * - `num_outputs_of_finished_tasks` - Number of generated output blocks that are from finished tasks * - `bytes_outputs_of_finished_tasks` - Total byte size of generated output blocks produced by finished tasks * - `rows_outputs_of_finished_tasks` - Number of rows generated by finished tasks * - `num_external_inqueue_blocks` - Number of blocks in the external inqueue * - `num_external_inqueue_bytes` - Byte size of blocks in the external inqueue * - `num_external_outqueue_blocks` - Number of blocks in the external outqueue * - `num_external_outqueue_bytes` - Byte size of blocks in the external outqueue * - `average_num_outputs_per_task` - Average number of output blocks per task, or `None` if no task finished * - `average_bytes_per_output` - Average size in bytes of output blocks * - `average_bytes_outputs_per_task` - Average total output size of task in bytes, or `None` if no task finished * - `average_rows_outputs_per_task` - Average number of rows produced per task, or `None` if no task finished * - `num_output_blocks_per_task_s` - Average number of output blocks per task per second Task metrics ^^^^^^^^^^^^ These metrics track task execution and scheduling. .. 
list-table:: :header-rows: 1 :widths: 30 70 * - Metric name - Description * - `num_tasks_submitted` - Number of submitted tasks * - `num_tasks_running` - Number of running tasks * - `num_tasks_have_outputs` - Number of tasks with at least one output * - `num_tasks_finished` - Number of finished tasks * - `num_tasks_failed` - Number of failed tasks * - `block_generation_time` - Time spent generating blocks in tasks * - `task_submission_backpressure_time` - Time spent in task submission backpressure * - `task_output_backpressure_time` - Time spent in task output backpressure * - `task_completion_time` - Histogram of time spent running tasks to completion * - `block_completion_time` - Histogram of time spent running a single block to completion. If multiple blocks are generated per task, Ray Data approximates this by assuming each block took an equal amount of time to process. * - `task_completion_time_s` - Time spent running tasks to completion * - `task_completion_time_excl_backpressure_s` - Time spent running tasks to completion without backpressure * - `block_size_bytes` - Histogram of block sizes in bytes generated by tasks * - `block_size_rows` - Histogram of number of rows in blocks generated by tasks * - `average_total_task_completion_time_s` - Average task completion time in seconds including throttling. This includes Ray Core and Ray Data backpressure. * - `average_task_completion_excl_backpressure_time_s` - Average task completion time in seconds excluding throttling * - `average_max_uss_per_task` - Average Unique Set Size (USS) memory usage of tasks. USS is the amount of memory unique to a process that would be freed if the process was terminated. Actor metrics ^^^^^^^^^^^^^ These metrics track actor lifecycle for operations that use actors. .. list-table:: :header-rows: 1 :widths: 30 70 * - Metric name - Description * - `num_alive_actors` - Number of alive actors * - `num_restarting_actors` - Number of restarting actors * - `num_pending_actors` - Number of pending actors Object store memory metrics ^^^^^^^^^^^^^^^^^^^^^^^^^^^ These metrics track memory usage in the Ray object store. .. list-table:: :header-rows: 1 :widths: 30 70 * - Metric name - Description * - `obj_store_mem_internal_inqueue_blocks` - Number of blocks in the operator's internal input queue * - `obj_store_mem_internal_outqueue_blocks` - Number of blocks in the operator's internal output queue * - `obj_store_mem_freed` - Byte size of freed memory in object store * - `obj_store_mem_spilled` - Byte size of spilled memory in object store * - `obj_store_mem_used` - Byte size of used memory in object store * - `obj_store_mem_internal_inqueue` - Byte size of input blocks in the operator's internal input queue * - `obj_store_mem_internal_outqueue` - Byte size of output blocks in the operator's internal output queue * - `obj_store_mem_pending_task_inputs` - Byte size of input blocks used by pending tasks Scheduling and resource metrics ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ These metrics track resource allocation and scheduling behavior in the streaming executor. .. 
list-table:: :header-rows: 1 :widths: 30 70 * - Metric name - Description * - `data_sched_loop_duration_s` - Duration of the scheduling loop in seconds * - `data_cpu_budget` - CPU budget allocated per operator * - `data_gpu_budget` - GPU budget allocated per operator * - `data_memory_budget` - Memory budget allocated per operator * - `data_object_store_memory_budget` - Object store memory budget allocated per operator * - `data_max_bytes_to_read` - Maximum bytes to read from streaming generator buffer per operator .. _ray-data-logs: Ray Data logs ------------- During execution, Ray Data periodically logs updates to `ray-data.log`. Every five seconds, Ray Data logs the execution progress of every operator in the dataset. For more frequent updates, set `RAY_DATA_TRACE_SCHEDULING=1` so that the progress is logged after each task is dispatched. .. code-block:: text Execution Progress: 0: - Input: 0 active, 0 queued, 0.0 MiB objects, Blocks Outputted: 200/200 1: - ReadRange->MapBatches(): 10 active, 190 queued, 381.47 MiB objects, Blocks Outputted: 100/200 When an operator completes, the metrics for that operator are also logged. .. code-block:: text Operator InputDataBuffer[Input] -> TaskPoolMapOperator[ReadRange->MapBatches()] completed. Operator Metrics: {'num_inputs_received': 20, 'bytes_inputs_received': 46440, 'num_task_inputs_processed': 20, 'bytes_task_inputs_processed': 46440, 'num_task_outputs_generated': 20, 'bytes_task_outputs_generated': 800, 'rows_task_outputs_generated': 100, 'num_outputs_taken': 20, 'bytes_outputs_taken': 800, 'num_outputs_of_finished_tasks': 20, 'bytes_outputs_of_finished_tasks': 800, 'num_tasks_submitted': 20, 'num_tasks_running': 0, 'num_tasks_have_outputs': 20, 'num_tasks_finished': 20, 'obj_store_mem_freed': 46440, 'obj_store_mem_spilled': 0, 'block_generation_time': 1.191296085, 'cpu_usage': 0, 'gpu_usage': 0, 'ray_remote_args': {'num_cpus': 1, 'scheduling_strategy': 'SPREAD'}} This log file can be found locally at `/tmp/ray/{SESSION_NAME}/logs/ray-data/ray-data.log`. It can also be found on the Ray Dashboard under the head node's logs in the :ref:`Logs view `. .. _ray-data-stats: Ray Data stats -------------- To see detailed stats on the execution of a dataset you can use the :meth:`~ray.data.Dataset.stats` method. Operator stats ~~~~~~~~~~~~~~ The stats output includes a summary on the individual operator's execution stats for each operator. Ray Data calculates this summary across many different blocks, so some stats show the min, max, mean, and sum of the stats aggregated over all the blocks. The following are descriptions of the various stats included at the operator level: * **Remote wall time**: The wall time is the start to finish time for an operator. It includes the time where the operator isn't processing data, sleeping, waiting for I/O, etc. * **Remote CPU time**: The CPU time is the process time for an operator which excludes time slept. This time includes both user and system CPU time. * **UDF time**: The UDF time is time spent in functions defined by the user. This time includes functions you pass into Ray Data methods, including :meth:`~ray.data.Dataset.map`, :meth:`~ray.data.Dataset.map_batches`, :meth:`~ray.data.Dataset.filter`, etc. You can use this stat to track the time spent in functions you define and how much time optimizing those functions could save. * **Memory usage**: The output displays memory usage per block in MiB. * **Output stats**: The output includes stats on the number of rows output and size of output in bytes per block. 
The number of output rows per task is also included. All of this together gives you insight into how much data Ray Data is outputting at a per-block and per-task level. * **Task stats**: The output shows the scheduling of tasks to nodes, which allows you to see if you are utilizing all of your nodes as expected. * **Throughput**: The summary calculates the throughput for the operator, and for a point of comparison, it also computes an estimate of the throughput of the same task on a single node. This estimate assumes the total time of the work remains the same, but with no concurrency. The overall summary also calculates the throughput at the dataset level, including a single node estimate. Iterator stats ~~~~~~~~~~~~~~ If you iterate over the data, Ray Data also generates iteration stats. Even if you aren't directly iterating over the data, you might see iteration stats, for example, if you call :meth:`~ray.data.Dataset.take_all`. Some of the stats that Ray Data includes at the iterator level are: * **Iterator initialization**: The time Ray Data spent initializing the iterator. This time is internal to Ray Data. * **Time user thread is blocked**: The time Ray Data spent producing data in the iterator. If you haven't previously materialized the dataset, this is often where the bulk of the dataset's execution happens. * **Time in user thread**: The time spent in the user thread that's iterating over the dataset outside of the Ray Data code. If this time is high, consider optimizing the body of the loop that's iterating over the dataset. * **Batch iteration stats**: Ray Data also includes stats about the prefetching of batches. These times are internal to Ray Data code, but you can further optimize this time by tuning the prefetching process. Verbose stats ~~~~~~~~~~~~~~ By default, Ray Data only logs the most important high-level stats. To enable verbose stats outputs, include the following snippet in your Ray Data code: .. testcode:: from ray.data import DataContext context = DataContext.get_current() context.verbose_stats_logs = True By enabling verbosity, Ray Data adds a few more outputs: * **Extra metrics**: Operators, executors, etc. can add to this dictionary of various metrics. There is some duplication of stats between the default output and this dictionary, but for advanced users this dictionary provides more insight into the dataset's execution. * **Runtime metrics**: These metrics are a high-level breakdown of the runtime of the dataset execution. These stats are a per-operator summary of the time each operator took to complete and the fraction of the total execution time that it consumed. As there are potentially multiple concurrent operators, these percentages don't necessarily sum to 100%. Instead, they show how long each operator ran in the context of the full dataset execution. Example stats ~~~~~~~~~~~~~ As a concrete example, below is a stats output from :doc:`Image Classification Batch Inference with PyTorch ResNet18 `: .. 
code-block:: text Operator 1 ReadImage->Map(preprocess_image): 384 tasks executed, 386 blocks produced in 9.21s * Remote wall time: 33.55ms min, 2.22s max, 1.03s mean, 395.65s total * Remote cpu time: 34.93ms min, 3.36s max, 1.64s mean, 632.26s total * UDF time: 535.1ms min, 2.16s max, 975.7ms mean, 376.62s total * Peak heap memory usage (MiB): 556.32 min, 1126.95 max, 655 mean * Output num rows per block: 4 min, 25 max, 24 mean, 9469 total * Output size bytes per block: 6060399 min, 105223020 max, 31525416 mean, 12168810909 total * Output rows per task: 24 min, 25 max, 24 mean, 384 tasks used * Tasks per node: 32 min, 64 max, 48 mean; 8 nodes used * Operator throughput: * Ray Data throughput: 1028.5218637702708 rows/s * Estimated single node throughput: 23.932674100499128 rows/s Operator 2 MapBatches(ResnetModel): 14 tasks executed, 48 blocks produced in 27.43s * Remote wall time: 523.93us min, 7.01s max, 1.82s mean, 87.18s total * Remote cpu time: 523.23us min, 6.23s max, 1.76s mean, 84.61s total * UDF time: 4.49s min, 17.81s max, 10.52s mean, 505.08s total * Peak heap memory usage (MiB): 4025.42 min, 7920.44 max, 5803 mean * Output num rows per block: 84 min, 334 max, 197 mean, 9469 total * Output size bytes per block: 72317976 min, 215806447 max, 134739694 mean, 6467505318 total * Output rows per task: 319 min, 720 max, 676 mean, 14 tasks used * Tasks per node: 3 min, 4 max, 3 mean; 4 nodes used * Operator throughput: * Ray Data throughput: 345.1533728632648 rows/s * Estimated single node throughput: 108.62003864820711 rows/s Dataset iterator time breakdown: * Total time overall: 38.53s * Total time in Ray Data iterator initialization code: 16.86s * Total time user thread is blocked by Ray Data iter_batches: 19.76s * Total execution time for user thread: 1.9s * Batch iteration time breakdown (summed across prefetch threads): * In ray.get(): 70.49ms min, 2.16s max, 272.8ms avg, 13.09s total * In batch creation: 3.6us min, 5.95us max, 4.26us avg, 204.41us total * In batch formatting: 4.81us min, 7.88us max, 5.5us avg, 263.94us total Dataset throughput: * Ray Data throughput: 1026.5318925757008 rows/s * Estimated single node throughput: 19.611578909587674 rows/s For the same example with verbosity enabled, the stats output is: .. 
code-block:: text Operator 1 ReadImage->Map(preprocess_image): 384 tasks executed, 387 blocks produced in 9.49s * Remote wall time: 22.81ms min, 2.5s max, 999.95ms mean, 386.98s total * Remote cpu time: 24.06ms min, 3.36s max, 1.63s mean, 629.93s total * UDF time: 552.79ms min, 2.41s max, 956.84ms mean, 370.3s total * Peak heap memory usage (MiB): 550.95 min, 1186.28 max, 651 mean * Output num rows per block: 4 min, 25 max, 24 mean, 9469 total * Output size bytes per block: 4444092 min, 105223020 max, 31443955 mean, 12168810909 total * Output rows per task: 24 min, 25 max, 24 mean, 384 tasks used * Tasks per node: 39 min, 60 max, 48 mean; 8 nodes used * Operator throughput: * Ray Data throughput: 997.9207015895857 rows/s * Estimated single node throughput: 24.46899945870273 rows/s * Extra metrics: {'num_inputs_received': 384, 'bytes_inputs_received': 1104723940, 'num_task_inputs_processed': 384, 'bytes_task_inputs_processed': 1104723940, 'bytes_inputs_of_submitted_tasks': 1104723940, 'num_task_outputs_generated': 387, 'bytes_task_outputs_generated': 12168810909, 'rows_task_outputs_generated': 9469, 'num_outputs_taken': 387, 'bytes_outputs_taken': 12168810909, 'num_outputs_of_finished_tasks': 387, 'bytes_outputs_of_finished_tasks': 12168810909, 'num_tasks_submitted': 384, 'num_tasks_running': 0, 'num_tasks_have_outputs': 384, 'num_tasks_finished': 384, 'num_tasks_failed': 0, 'block_generation_time': 386.97945193799995, 'task_submission_backpressure_time': 7.263684450000142, 'obj_store_mem_internal_inqueue_blocks': 0, 'obj_store_mem_internal_inqueue': 0, 'obj_store_mem_internal_outqueue_blocks': 0, 'obj_store_mem_internal_outqueue': 0, 'obj_store_mem_pending_task_inputs': 0, 'obj_store_mem_freed': 1104723940, 'obj_store_mem_spilled': 0, 'obj_store_mem_used': 12582535566, 'cpu_usage': 0, 'gpu_usage': 0, 'ray_remote_args': {'num_cpus': 1, 'scheduling_strategy': 'SPREAD'}} Operator 2 MapBatches(ResnetModel): 14 tasks executed, 48 blocks produced in 28.81s * Remote wall time: 134.84us min, 7.23s max, 1.82s mean, 87.16s total * Remote cpu time: 133.78us min, 6.28s max, 1.75s mean, 83.98s total * UDF time: 4.56s min, 17.78s max, 10.28s mean, 493.48s total * Peak heap memory usage (MiB): 3925.88 min, 7713.01 max, 5688 mean * Output num rows per block: 125 min, 259 max, 197 mean, 9469 total * Output size bytes per block: 75531617 min, 187889580 max, 134739694 mean, 6467505318 total * Output rows per task: 325 min, 719 max, 676 mean, 14 tasks used * Tasks per node: 3 min, 4 max, 3 mean; 4 nodes used * Operator throughput: * Ray Data throughput: 328.71474145609153 rows/s * Estimated single node throughput: 108.6352856660782 rows/s * Extra metrics: {'num_inputs_received': 387, 'bytes_inputs_received': 12168810909, 'num_task_inputs_processed': 0, 'bytes_task_inputs_processed': 0, 'bytes_inputs_of_submitted_tasks': 12168810909, 'num_task_outputs_generated': 1, 'bytes_task_outputs_generated': 135681874, 'rows_task_outputs_generated': 252, 'num_outputs_taken': 1, 'bytes_outputs_taken': 135681874, 'num_outputs_of_finished_tasks': 0, 'bytes_outputs_of_finished_tasks': 0, 'num_tasks_submitted': 14, 'num_tasks_running': 14, 'num_tasks_have_outputs': 1, 'num_tasks_finished': 0, 'num_tasks_failed': 0, 'block_generation_time': 7.229860895999991, 'task_submission_backpressure_time': 0, 'obj_store_mem_internal_inqueue_blocks': 13, 'obj_store_mem_internal_inqueue': 413724657, 'obj_store_mem_internal_outqueue_blocks': 0, 'obj_store_mem_internal_outqueue': 0, 'obj_store_mem_pending_task_inputs': 12168810909, 
'obj_store_mem_freed': 0, 'obj_store_mem_spilled': 0, 'obj_store_mem_used': 1221136866.0, 'cpu_usage': 0, 'gpu_usage': 4} Dataset iterator time breakdown: * Total time overall: 42.29s * Total time in Ray Data iterator initialization code: 20.24s * Total time user thread is blocked by Ray Data iter_batches: 19.96s * Total execution time for user thread: 2.08s * Batch iteration time breakdown (summed across prefetch threads): * In ray.get(): 73.0ms min, 2.15s max, 246.3ms avg, 11.82s total * In batch creation: 3.62us min, 6.6us max, 4.39us avg, 210.7us total * In batch formatting: 4.75us min, 8.67us max, 5.52us avg, 264.98us total Dataset throughput: * Ray Data throughput: 468.11051989434594 rows/s * Estimated single node throughput: 972.8197093015862 rows/s Runtime Metrics: * ReadImage->Map(preprocess_image): 9.49s (46.909%) * MapBatches(ResnetModel): 28.81s (142.406%) * Scheduling: 6.16s (30.448%) * Total: 20.23s (100.000%) --- .. _data_performance_tips: Advanced: Performance Tips and Tuning ===================================== Optimizing transforms --------------------- Batching transforms ~~~~~~~~~~~~~~~~~~~ If your transformation is vectorized like most NumPy or pandas operations, use :meth:`~ray.data.Dataset.map_batches` rather than :meth:`~ray.data.Dataset.map`. It's faster. If your transformation isn't vectorized, there's no performance benefit. Optimizing reads ---------------- .. _read_output_blocks: Tuning output blocks for read ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ By default, Ray Data automatically selects the number of output blocks for read according to the following procedure: - The ``override_num_blocks`` parameter passed to Ray Data's :ref:`read APIs ` specifies the number of output blocks, which is equivalent to the number of read tasks to create. - Usually, if the read is followed by a :func:`~ray.data.Dataset.map` or :func:`~ray.data.Dataset.map_batches`, the map is fused with the read; therefore ``override_num_blocks`` also determines the number of map tasks. Ray Data decides the default value for number of output blocks based on the following heuristics, applied in order: 1. Start with the default value of 200. You can overwrite this by setting :class:`DataContext.read_op_min_num_blocks `. 2. Min block size (default=1 MiB). If number of blocks would make blocks smaller than this threshold, reduce number of blocks to avoid the overhead of tiny blocks. You can override by setting :class:`DataContext.target_min_block_size ` (bytes). 3. Max block size (default=128 MiB). If number of blocks would make blocks larger than this threshold, increase number of blocks to avoid out-of-memory errors during processing. You can override by setting :class:`DataContext.target_max_block_size ` (bytes). 4. Available CPUs. Increase number of blocks to utilize all of the available CPUs in the cluster. Ray Data chooses the number of read tasks to be at least 2x the number of available CPUs. Occasionally, it's advantageous to manually tune the number of blocks to optimize the application. For example, the following code batches multiple files into the same read task to avoid creating blocks that are too large. .. testcode:: :hide: import ray ray.shutdown() .. testcode:: import ray # Pretend there are two CPUs. ray.init(num_cpus=2) # Repeat the iris.csv file 16 times. ds = ray.data.read_csv(["s3://anonymous@ray-example-data/iris.csv"] * 16) print(ds.materialize()) .. testoutput:: :options: +MOCK MaterializedDataset( num_blocks=4, num_rows=2400, ... 
) But suppose that you knew that you wanted to read all 16 files in parallel. This could be, for example, because you know that additional CPUs should get added to the cluster by the autoscaler or because you want the downstream operator to transform each file's contents in parallel. You can get this behavior by setting the ``override_num_blocks`` parameter. Notice how the number of output blocks is equal to ``override_num_blocks`` in the following code: .. testcode:: :hide: import ray ray.shutdown() .. testcode:: import ray # Pretend there are two CPUs. ray.init(num_cpus=2) # Repeat the iris.csv file 16 times. ds = ray.data.read_csv(["s3://anonymous@ray-example-data/iris.csv"] * 16, override_num_blocks=16) print(ds.materialize()) .. testoutput:: :options: +MOCK MaterializedDataset( num_blocks=16, num_rows=2400, ... ) When using the default auto-detected number of blocks, Ray Data attempts to cap each task's output to :class:`DataContext.target_max_block_size ` many bytes. Note however that Ray Data can't perfectly predict the size of each task's output, so it's possible that each task produces one or more output blocks. Thus, the total blocks in the final :class:`~ray.data.Dataset` may differ from the specified ``override_num_blocks``. Here's an example where we manually specify ``override_num_blocks=1``, but the one task still produces multiple blocks in the materialized Dataset: .. testcode:: :hide: import ray ray.shutdown() .. testcode:: import ray # Pretend there are two CPUs. ray.init(num_cpus=2) # Generate ~400MB of data. ds = ray.data.range_tensor(5_000, shape=(10_000, ), override_num_blocks=1) print(ds.materialize()) .. testoutput:: :options: +MOCK MaterializedDataset( num_blocks=3, num_rows=5000, schema={data: ArrowTensorTypeV2(shape=(10000,), dtype=int64)} ) Currently, Ray Data can assign at most one read task per input file. Thus, if the number of input files is smaller than ``override_num_blocks``, the number of read tasks is capped to the number of input files. To ensure that downstream transforms can still execute with the desired number of blocks, Ray Data splits the read tasks' outputs into a total of ``override_num_blocks`` blocks and prevents fusion with the downstream transform. In other words, each read task's output blocks are materialized to Ray's object store before the consuming map task executes. For example, the following code executes :func:`~ray.data.read_csv` with only one task, but its output is split into 4 blocks before executing the :func:`~ray.data.Dataset.map`: .. testcode:: :hide: import ray ray.shutdown() .. testcode:: import ray # Pretend there are two CPUs. ray.init(num_cpus=2) ds = ray.data.read_csv("s3://anonymous@ray-example-data/iris.csv").map(lambda row: row) print(ds.materialize().stats()) .. testoutput:: :options: +MOCK ... Operator 1 ReadCSV->SplitBlocks(4): 1 tasks executed, 4 blocks produced in 0.01s ... Operator 2 Map(): 4 tasks executed, 4 blocks produced in 0.3s ... To turn off this behavior and allow the read and map operators to be fused, set ``override_num_blocks`` manually. For example, this code sets the number of files equal to ``override_num_blocks``: .. testcode:: :hide: import ray ray.shutdown() .. testcode:: import ray # Pretend there are two CPUs. ray.init(num_cpus=2) ds = ray.data.read_csv("s3://anonymous@ray-example-data/iris.csv", override_num_blocks=1).map(lambda row: row) print(ds.materialize().stats()) .. testoutput:: :options: +MOCK ... Operator 1 ReadCSV->Map(): 1 tasks executed, 1 blocks produced in 0.01s ... .. 
_tuning_read_resources: Tuning read resources ~~~~~~~~~~~~~~~~~~~~~ By default, Ray requests 1 CPU per read task, which means one read task per CPU can execute concurrently. For datasources that benefit from more IO parallelism, you can specify a lower ``num_cpus`` value for the read function with the ``ray_remote_args`` parameter. For example, use ``ray.data.read_parquet(path, ray_remote_args={"num_cpus": 0.25})`` to allow up to four read tasks per CPU. .. _parquet_column_pruning: Parquet column pruning (projection pushdown) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ By default, :func:`ray.data.read_parquet` reads all columns in the Parquet files into memory. If you only need a subset of the columns, make sure to specify the list of columns explicitly when calling :func:`ray.data.read_parquet` to avoid loading unnecessary data (projection pushdown). Note that this is more efficient than calling :func:`~ray.data.Dataset.select_columns`, since column selection is pushed down to the file scan. .. testcode:: import ray # Read just two of the five columns of the Iris dataset. ds = ray.data.read_parquet( "s3://anonymous@ray-example-data/iris.parquet", columns=["sepal.length", "variety"], ) print(ds.schema()) .. testoutput:: Column Type ------ ---- sepal.length double variety string .. _data_memory: Reducing memory usage --------------------- .. _data_out_of_memory: Troubleshooting out-of-memory errors ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ During execution, a task can read multiple input blocks, and write multiple output blocks. Input and output blocks consume both worker heap memory and shared memory through Ray's object store. Ray caps object store memory usage by spilling to disk, but excessive worker heap memory usage can cause out-of-memory situations. Ray Data attempts to bound its heap memory usage to ``num_execution_slots * max_block_size``. The number of execution slots is by default equal to the number of CPUs, unless custom resources are specified. The maximum block size is set by the configuration parameter :class:`DataContext.target_max_block_size ` and is set to 128MiB by default. If the Dataset includes an :ref:`all-to-all shuffle operation ` (such as :func:`~ray.data.Dataset.random_shuffle`), then the default maximum block size is controlled by :class:`DataContext.target_shuffle_max_block_size `, set to 1GiB by default to avoid creating too many tiny blocks. .. note:: It's **not** recommended to modify :class:`DataContext.target_max_block_size `. The default is already chosen to balance between high overheads from too many tiny blocks vs. excessive heap memory usage from too-large blocks. When a task's output is larger than the maximum block size, the worker automatically splits the output into multiple smaller blocks to avoid running out of heap memory. However, too-large blocks are still possible, and they can lead to out-of-memory situations. To avoid these issues: 1. Make sure no single item in your dataset is too large. Aim for rows that are <10 MB each. 2. Always call :meth:`ds.map_batches() ` with a batch size small enough such that the output batch can comfortably fit into heap memory. Or, if vectorized execution is not necessary, use :meth:`ds.map() `. 3. If neither of these is sufficient, manually increase the :ref:`read output blocks ` or modify your application code to ensure that each task reads a smaller amount of data. 
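To get a rough sense of the heap memory bound described above (``num_execution_slots * max_block_size``) for your own cluster, you can multiply the CPU count by the configured target block size. The following is a back-of-the-envelope sketch only; it assumes the default ``DataContext`` settings and uses the cluster's CPU count as a stand-in for the number of execution slots:

.. code-block:: python

    import ray
    from ray.data import DataContext

    ray.init(ignore_reinit_error=True)

    # Assumption: with no custom resources configured, the number of execution
    # slots roughly equals the number of CPUs in the cluster.
    num_execution_slots = int(ray.cluster_resources().get("CPU", 1))

    # The configured maximum block size in bytes (128 MiB by default).
    max_block_size = DataContext.get_current().target_max_block_size

    approx_bound_gib = num_execution_slots * max_block_size / 1024**3
    print(f"Approximate Ray Data heap memory bound: ~{approx_bound_gib:.1f} GiB")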
As an example of tuning batch size, the following code uses one task to load a 1 GB :class:`~ray.data.Dataset` with 1000 1 MB rows and applies an identity function using :func:`~ray.data.Dataset.map_batches`. Because the default ``batch_size`` for :func:`~ray.data.Dataset.map_batches` is 1024 rows, this code produces only one very large batch, causing the heap memory usage to increase to 4 GB. .. testcode:: :hide: import ray ray.shutdown() .. testcode:: import ray # Pretend there are two CPUs. ray.init(num_cpus=2) # Force Ray Data to use one task to show the memory issue. ds = ray.data.range_tensor(1000, shape=(125_000, ), override_num_blocks=1) # The default batch size is 1024 rows. ds = ds.map_batches(lambda batch: batch) print(ds.materialize().stats()) .. testoutput:: :options: +MOCK Operator 1 ReadRange->MapBatches(): 1 tasks executed, 7 blocks produced in 1.33s ... * Peak heap memory usage (MiB): 3302.17 min, 4233.51 max, 4100 mean * Output num rows: 125 min, 125 max, 125 mean, 1000 total * Output size bytes: 134000536 min, 196000784 max, 142857714 mean, 1000004000 total ... Setting a lower batch size produces lower peak heap memory usage: .. testcode:: :hide: import ray ray.shutdown() .. testcode:: import ray # Pretend there are two CPUs. ray.init(num_cpus=2) ds = ray.data.range_tensor(1000, shape=(125_000, ), override_num_blocks=1) ds = ds.map_batches(lambda batch: batch, batch_size=32) print(ds.materialize().stats()) .. testoutput:: :options: +MOCK Operator 1 ReadRange->MapBatches(): 1 tasks executed, 7 blocks produced in 0.51s ... * Peak heap memory usage (MiB): 587.09 min, 1569.57 max, 1207 mean * Output num rows: 40 min, 160 max, 142 mean, 1000 total * Output size bytes: 40000160 min, 160000640 max, 142857714 mean, 1000004000 total ... Improving heap memory usage in Ray Data is an active area of development. Here are the current known cases in which heap memory usage may be very high: 1. Reading large (1 GiB or more) binary files. 2. Transforming a Dataset where individual rows are large (100 MiB or more). In these cases, the last resort is to reduce the number of concurrent execution slots. This can be done with custom resources. For example, use :meth:`ds.map_batches(fn, num_cpus=2) ` to halve the number of execution slots for the ``map_batches`` tasks. If these strategies are still insufficient, `file a Ray Data issue on GitHub`_. Avoiding object spilling ~~~~~~~~~~~~~~~~~~~~~~~~ A Dataset's intermediate and output blocks are stored in Ray's object store. Although Ray Data attempts to minimize object store usage with :ref:`streaming execution `, it's still possible that the working set exceeds the object store capacity. In this case, Ray begins spilling blocks to disk, which can slow down execution significantly or even cause out-of-disk errors. There are some cases where spilling is expected. In particular, if the total Dataset's size is larger than object store capacity, and one of the following is true: 1. An :ref:`all-to-all shuffle operation ` is used. Or, 2. There is a call to :meth:`ds.materialize() `. Otherwise, it's best to tune your application to avoid spilling. The recommended strategy is to manually increase the :ref:`read output blocks ` or modify your application code to ensure that each task reads a smaller amount of data. .. note:: This is an active area of development. If your Dataset is causing spilling and you don't know why, `file a Ray Data issue on GitHub`_. 
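To confirm whether spilling is actually happening, one option is to enable verbose stats (see the Ray Data stats section above) and look for non-zero ``obj_store_mem_spilled`` values in the per-operator extra metrics. A minimal sketch, using a small placeholder dataset:

.. code-block:: python

    import ray
    from ray.data import DataContext

    # Enable verbose stats so that per-operator extra metrics, including
    # ``obj_store_mem_spilled``, appear in the stats output.
    DataContext.get_current().verbose_stats_logs = True

    # Placeholder dataset; substitute your own pipeline here.
    ds = ray.data.range(1000).materialize()

    # Non-zero ``obj_store_mem_spilled`` values indicate that blocks were
    # spilled to disk during execution.
    print(ds.stats())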
Handling too-small blocks ~~~~~~~~~~~~~~~~~~~~~~~~~ When different operators of your Dataset produce different-sized outputs, you may end up with very small blocks, which can hurt performance and even cause crashes from excessive metadata. Use :meth:`ds.stats() ` to check that each operator's output blocks are each at least 1 MB and ideally >100 MB. If your blocks are smaller than this, consider repartitioning into larger blocks. There are two ways to do this: 1. If you need control over the exact number of output blocks, use :meth:`ds.repartition(num_partitions) `. Note that this is an :ref:`all-to-all operation ` and it materializes all blocks into memory before performing the repartition. 2. If you don't need control over the exact number of output blocks and just want to produce larger blocks, use :meth:`ds.map_batches(lambda batch: batch, batch_size=batch_size) ` and set ``batch_size`` to the desired number of rows per block. This is executed in a streaming fashion and avoids materialization. When :meth:`ds.map_batches() ` is used, Ray Data coalesces blocks so that each map task can process at least this many rows. Note that the chosen ``batch_size`` is a lower bound on the task's input block size but it does not necessarily determine the task's final *output* block size; see :ref:`the section ` on block memory usage for more information on how block size is determined. To illustrate these, the following code uses both strategies to coalesce the 10 tiny blocks with 1 row each into 1 larger block with 10 rows: .. testcode:: :hide: import ray ray.shutdown() .. testcode:: import ray # Pretend there are two CPUs. ray.init(num_cpus=2) # 1. Use ds.repartition(). ds = ray.data.range(10, override_num_blocks=10).repartition(1) print(ds.materialize().stats()) # 2. Use ds.map_batches(). ds = ray.data.range(10, override_num_blocks=10).map_batches(lambda batch: batch, batch_size=10) print(ds.materialize().stats()) .. testoutput:: :options: +MOCK # 1. ds.repartition() output. Operator 1 ReadRange: 10 tasks executed, 10 blocks produced in 0.33s ... * Output num rows: 1 min, 1 max, 1 mean, 10 total ... Operator 2 Repartition: executed in 0.36s Suboperator 0 RepartitionSplit: 10 tasks executed, 10 blocks produced ... Suboperator 1 RepartitionReduce: 1 tasks executed, 1 blocks produced ... * Output num rows: 10 min, 10 max, 10 mean, 10 total ... # 2. ds.map_batches() output. Operator 1 ReadRange->MapBatches(): 1 tasks executed, 1 blocks produced in 0s ... * Output num rows: 10 min, 10 max, 10 mean, 10 total Configuring execution --------------------- Configuring resources and locality ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ By default, the CPU and GPU limits are set to the cluster size, and the object store memory limit conservatively to 1/4 of the total object store size to avoid the possibility of disk spilling. You may want to customize these limits in the following scenarios: - If running multiple concurrent jobs on the cluster, setting lower limits can avoid resource contention between the jobs. - If you want to fine-tune the memory limit to maximize performance. - For data loading into training jobs, you may want to set the object store memory to a low value (for example, 2 GB) to limit resource usage. You can configure execution options with the global DataContext. The options are applied for future jobs launched in the process: .. 
code-block:: ctx = ray.data.DataContext.get_current() ctx.execution_options.resource_limits = ctx.execution_options.resource_limits.copy( cpu=10, gpu=5, object_store_memory=10e9, ) .. note:: Be mindful that by default Ray reserves only 30% of available memory for its object store. For Ray Data workloads, it's recommended to increase this to at least **50%**. Locality with output (ML ingest use case) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. code-block:: ctx.execution_options.locality_with_output = True Setting this parameter to True tells Ray Data to prefer placing operator tasks onto the consumer node in the cluster, rather than spreading them evenly across the cluster. This setting can be useful if you know you are consuming the output data directly on the consumer node (such as ML training ingest). However, other use cases may incur a performance penalty with this setting. Reproducibility --------------- Deterministic execution ~~~~~~~~~~~~~~~~~~~~~~~ .. code-block:: # By default, this is set to False. ctx.execution_options.preserve_order = True To enable deterministic execution, set ``preserve_order`` to True. This setting may decrease performance, but ensures block ordering is preserved through execution. This flag defaults to False. .. _`file a Ray Data issue on GitHub`: https://github.com/ray-project/ray/issues/new?assignees=&labels=bug%2Ctriage%2Cdata&projects=&template=bug-report.yml&title=[data]+ --- .. _data_quickstart: Ray Data Quickstart =================== Get started with Ray Data's :class:`Dataset ` abstraction for distributed data processing. This guide introduces you to the core capabilities of Ray Data: * :ref:`Loading data ` * :ref:`Transforming data ` * :ref:`Consuming data ` * :ref:`Saving data ` Datasets -------- Ray Data's main abstraction is a :class:`Dataset `, which represents a distributed collection of data. Datasets are specifically designed for machine learning workloads and can efficiently handle data collections that exceed a single machine's memory. .. _loading_key_concept: Loading data ------------ Create datasets from various sources including local files, Python objects, and cloud storage services like S3 or GCS. Ray Data seamlessly integrates with any `filesystem supported by Arrow `__. .. testcode:: import ray # Load a CSV dataset directly from S3 ds = ray.data.read_csv("s3://anonymous@air-example-data/iris.csv") # Preview the first record ds.show(limit=1) .. testoutput:: {'sepal length (cm)': 5.1, 'sepal width (cm)': 3.5, 'petal length (cm)': 1.4, 'petal width (cm)': 0.2, 'target': 0} To learn more about creating datasets from different sources, read :ref:`Loading data `. .. _transforming_key_concept: Transforming data ----------------- Apply user-defined functions (UDFs) to transform datasets. Ray automatically parallelizes these transformations across your cluster for better performance. .. testcode:: from typing import Dict import numpy as np # Define a transformation to compute a "petal area" attribute def transform_batch(batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]: vec_a = batch["petal length (cm)"] vec_b = batch["petal width (cm)"] batch["petal area (cm^2)"] = vec_a * vec_b return batch # Apply the transformation to our dataset transformed_ds = ds.map_batches(transform_batch) # View the updated schema with the new column # .materialize() will execute all the lazy transformations and # materialize the dataset into object store memory print(transformed_ds.materialize()) .. 
testoutput:: MaterializedDataset( num_blocks=..., num_rows=150, schema={ sepal length (cm): double, sepal width (cm): double, petal length (cm): double, petal width (cm): double, target: int64, petal area (cm^2): double } ) To explore more transformation capabilities, read :ref:`Transforming data `. .. _consuming_key_concept: Consuming data -------------- Access dataset contents through convenient methods like :meth:`~ray.data.Dataset.take_batch` and :meth:`~ray.data.Dataset.iter_batches`. You can also pass datasets directly to Ray Tasks or Actors for distributed processing. .. testcode:: # Extract the first 3 rows as a batch for processing print(transformed_ds.take_batch(batch_size=3)) .. testoutput:: :options: +NORMALIZE_WHITESPACE {'sepal length (cm)': array([5.1, 4.9, 4.7]), 'sepal width (cm)': array([3.5, 3. , 3.2]), 'petal length (cm)': array([1.4, 1.4, 1.3]), 'petal width (cm)': array([0.2, 0.2, 0.2]), 'target': array([0, 0, 0]), 'petal area (cm^2)': array([0.28, 0.28, 0.26])} For more details on working with dataset contents, see :ref:`Iterating over Data ` and :ref:`Saving Data `. .. _saving_key_concept: Saving data ----------- Export processed datasets to a variety of formats and storage locations using methods like :meth:`~ray.data.Dataset.write_parquet`, :meth:`~ray.data.Dataset.write_csv`, and more. .. testcode:: :hide: # The number of blocks can be non-deterministic. Repartition the dataset beforehand # so that the number of written files is consistent. transformed_ds = transformed_ds.repartition(2) .. testcode:: import os # Save the transformed dataset as Parquet files transformed_ds.write_parquet("/tmp/iris") # Verify the files were created print(os.listdir("/tmp/iris")) .. testoutput:: :options: +MOCK ['..._000000.parquet', '..._000001.parquet'] For more information on saving datasets, see :ref:`Saving data `. --- .. _saving-data: =========== Saving Data =========== Ray Data lets you save data in files or other Python objects. This guide shows you how to: * `Write data to files <#writing-data-to-files>`_ * `Convert Datasets to other Python libraries <#converting-datasets-to-other-python-libraries>`_ Writing data to files ===================== Ray Data writes to local disk and cloud storage. Writing data to local disk ~~~~~~~~~~~~~~~~~~~~~~~~~~ To save your :class:`~ray.data.dataset.Dataset` to local disk, call a method like :meth:`Dataset.write_parquet ` and specify a local directory with the `local://` scheme. .. warning:: If your cluster contains multiple nodes and you don't use `local://`, Ray Data writes different partitions of data to different nodes. .. testcode:: import ray ds = ray.data.read_csv("s3://anonymous@ray-example-data/iris.csv") ds.write_parquet("local:///tmp/iris/") To write data to formats other than Parquet, read the :ref:`Input/Output reference `. Writing data to cloud storage ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ To save your :class:`~ray.data.dataset.Dataset` to cloud storage, authenticate all nodes with your cloud service provider. Then, call a method like :meth:`Dataset.write_parquet ` and specify a URI with the appropriate scheme. URI can point to buckets or folders. To write data to formats other than Parquet, read the :ref:`Input/Output reference `. .. tab-set:: .. tab-item:: S3 To save data to Amazon S3, specify a URI with the ``s3://`` scheme. .. testcode:: :skipif: True import ray ds = ray.data.read_csv("s3://anonymous@ray-example-data/iris.csv") ds.write_parquet("s3://my-bucket/my-folder") Ray Data relies on PyArrow to authenticate with Amazon S3. 
For more on how to configure your credentials to be compatible with PyArrow, see their `S3 Filesystem docs `_. .. tab-item:: GCS To save data to Google Cloud Storage, install the `Filesystem interface to Google Cloud Storage `_ .. code-block:: console pip install gcsfs Then, create a ``GCSFileSystem`` and specify a URI with the ``gcs://`` scheme. .. testcode:: :skipif: True import ray import gcsfs ds = ray.data.read_csv("s3://anonymous@ray-example-data/iris.csv") filesystem = gcsfs.GCSFileSystem(project="my-google-project") ds.write_parquet("gcs://my-bucket/my-folder", filesystem=filesystem) Ray Data relies on PyArrow for authentication with Google Cloud Storage. For more on how to configure your credentials to be compatible with PyArrow, see their `GCS Filesystem docs `_. .. tab-item:: ABS To save data to Azure Blob Storage, install the `Filesystem interface to Azure-Datalake Gen1 and Gen2 Storage `_ .. code-block:: console pip install adlfs Then, create an ``AzureBlobFileSystem`` and specify a URI with the ``az://`` scheme. .. testcode:: :skipif: True import ray import adlfs ds = ray.data.read_csv("s3://anonymous@ray-example-data/iris.csv") filesystem = adlfs.AzureBlobFileSystem(account_name="azureopendatastorage") ds.write_parquet("az://my-bucket/my-folder", filesystem=filesystem) Ray Data relies on PyArrow for authentication with Azure Blob Storage. For more on how to configure your credentials to be compatible with PyArrow, see their `fsspec-compatible filesystems docs `_. Writing data to NFS ~~~~~~~~~~~~~~~~~~~ To save your :class:`~ray.data.dataset.Dataset` to NFS file systems, call a method like :meth:`Dataset.write_parquet ` and specify a mounted directory. .. testcode:: :skipif: True import ray ds = ray.data.read_csv("s3://anonymous@ray-example-data/iris.csv") ds.write_parquet("/mnt/cluster_storage/iris") To write data to formats other than Parquet, read the :ref:`Input/Output reference `. .. _changing-number-output-files: Changing the number of output files ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ When you call a write method, Ray Data writes your data to several files. To control the number of output files, configure ``min_rows_per_file``. .. note:: ``min_rows_per_file`` is a hint, not a strict limit. Ray Data might write more or fewer rows to each file. Under the hood, if the number of rows per block is larger than the specified value, Ray Data writes the number of rows per block to each file. .. testcode:: import os import ray ds = ray.data.read_csv("s3://anonymous@ray-example-data/iris.csv") ds.write_csv("/tmp/few_files/", min_rows_per_file=75) print(os.listdir("/tmp/few_files/")) .. testoutput:: :options: +MOCK ['0_000001_000000.csv', '0_000000_000000.csv', '0_000002_000000.csv'] Writing into Partitioned Dataset ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ When writing a partitioned dataset (using Hive-style, folder-based partitioning), it's recommended to repartition the dataset by the partition columns prior to writing it. This gives you *control over the number of files and their sizes*. When the dataset is repartitioned by the partition columns, every block should contain all of the rows corresponding to a particular partition, meaning that you can control the number of files created through the configuration passed to, for example, the `write_parquet` method (such as `min_rows_per_file` and `max_rows_per_file`). 
Since every block is written out independently, writing the dataset without prior repartitioning could produce up to N files per partition (where N is the number of blocks in your dataset), with very limited ability to control the number of files and their sizes (since every block could potentially carry rows corresponding to any partition). .. testcode:: import os import ray import pandas as pd from ray.data import DataContext from ray.data.context import ShuffleStrategy def print_directory_tree(start_path: str) -> None: """ Prints the directory tree structure starting from the given path. """ for root, dirs, files in os.walk(start_path): level = root.replace(start_path, '').count(os.sep) indent = ' ' * 4 * (level) print(f'{indent}{os.path.basename(root)}/') subindent = ' ' * 4 * (level + 1) for f in files: print(f'{subindent}{f}') # Sample dataset that we’ll partition by ``city`` and ``year``. df = pd.DataFrame( { "city": ["SF", "SF", "NYC", "NYC", "SF", "NYC", "SF", "NYC"], "year": [2023, 2024, 2023, 2024, 2023, 2023, 2024, 2024], "sales": [100, 120, 90, 115, 105, 95, 130, 110], } ) ds = ray.data.from_pandas(df) DataContext.get_current().shuffle_strategy = ShuffleStrategy.HASH_SHUFFLE # ── Partitioned write ────────────────────────────────────────────────────── # 1. Repartition so all rows with the same (city, year) land in the same # block; this minimizes shuffling during the write. # 2. Pass the same columns to ``partition_cols`` so Ray creates a # Hive-style directory layout: city=/year=/.... # 3. Use ``min_rows_per_file`` / ``max_rows_per_file`` to control how many # rows Ray puts in each Parquet file. ds.repartition(keys=["city", "year"], num_blocks=4).write_parquet( "/tmp/sales_partitioned", partition_cols=["city", "year"], min_rows_per_file=2, # At least 2 rows in each file … max_rows_per_file=3, # … but never more than 3. ) print_directory_tree("/tmp/sales_partitioned") .. testoutput:: :options: +MOCK sales_partitioned/ city=NYC/ year=2024/ 1_a2b8b82cd2904a368ec39f42ae3cf830_000000_000000-0.parquet year=2023/ 1_a2b8b82cd2904a368ec39f42ae3cf830_000001_000000-0.parquet city=SF/ year=2024/ 1_a2b8b82cd2904a368ec39f42ae3cf830_000000_000000-0.parquet year=2023/ 1_a2b8b82cd2904a368ec39f42ae3cf830_000001_000000-0.parquet Converting Datasets to other Python libraries ============================================= Converting Datasets to pandas ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ To convert a :class:`~ray.data.dataset.Dataset` to a pandas DataFrame, call :meth:`Dataset.to_pandas() `. Your data must fit in memory on the head node. .. testcode:: import ray ds = ray.data.read_csv("s3://anonymous@ray-example-data/iris.csv") df = ds.to_pandas() print(df) .. testoutput:: :options: +NORMALIZE_WHITESPACE sepal length (cm) sepal width (cm) ... petal width (cm) target 0 5.1 3.5 ... 0.2 0 1 4.9 3.0 ... 0.2 0 2 4.7 3.2 ... 0.2 0 3 4.6 3.1 ... 0.2 0 4 5.0 3.6 ... 0.2 0 .. ... ... ... ... ... 145 6.7 3.0 ... 2.3 2 146 6.3 2.5 ... 1.9 2 147 6.5 3.0 ... 2.0 2 148 6.2 3.4 ... 2.3 2 149 5.9 3.0 ... 1.8 2 [150 rows x 5 columns] Converting Datasets to distributed DataFrames ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Ray Data interoperates with distributed data processing frameworks like `Daft `_, :ref:`Dask `, :ref:`Spark `, :ref:`Modin `, and :ref:`Mars `. .. tab-set:: .. tab-item:: Daft To convert a :class:`~ray.data.dataset.Dataset` to a `Daft Dataframe `_, call :meth:`Dataset.to_daft() `. .. testcode:: import ray ds = ray.data.read_csv("s3://anonymous@ray-example-data/iris.csv") df = ds.to_daft() print(df) .. 
testoutput:: :options: +MOCK ╭───────────────────┬──────────────────┬───────────────────┬──────────────────┬────────╮ │ sepal length (cm) ┆ sepal width (cm) ┆ petal length (cm) ┆ petal width (cm) ┆ target │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ Float64 ┆ Float64 ┆ Float64 ┆ Float64 ┆ Int64 │ ╞═══════════════════╪══════════════════╪═══════════════════╪══════════════════╪════════╡ │ 5.1 ┆ 3.5 ┆ 1.4 ┆ 0.2 ┆ 0 │ ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ │ 4.9 ┆ 3 ┆ 1.4 ┆ 0.2 ┆ 0 │ ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ │ 4.7 ┆ 3.2 ┆ 1.3 ┆ 0.2 ┆ 0 │ ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ │ 4.6 ┆ 3.1 ┆ 1.5 ┆ 0.2 ┆ 0 │ ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ │ 5 ┆ 3.6 ┆ 1.4 ┆ 0.2 ┆ 0 │ ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ │ 5.4 ┆ 3.9 ┆ 1.7 ┆ 0.4 ┆ 0 │ ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ │ 4.6 ┆ 3.4 ┆ 1.4 ┆ 0.3 ┆ 0 │ ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ │ 5 ┆ 3.4 ┆ 1.5 ┆ 0.2 ┆ 0 │ ╰───────────────────┴──────────────────┴───────────────────┴──────────────────┴────────╯ (Showing first 8 of 150 rows) .. tab-item:: Dask To convert a :class:`~ray.data.dataset.Dataset` to a `Dask DataFrame `__, call :meth:`Dataset.to_dask() `. .. We skip the code snippet below because `to_dask` doesn't work with PyArrow 14 and later. For more information, see https://github.com/ray-project/ray/issues/54837 .. testcode:: :skipif: True import ray ds = ray.data.read_csv("s3://anonymous@ray-example-data/iris.csv") df = ds.to_dask() .. tab-item:: Spark To convert a :class:`~ray.data.dataset.Dataset` to a `Spark DataFrame `__, call :meth:`Dataset.to_spark() `. .. testcode:: :skipif: True import ray import raydp spark = raydp.init_spark( app_name = "example", num_executors = 1, executor_cores = 4, executor_memory = "512M" ) ds = ray.data.read_csv("s3://anonymous@ray-example-data/iris.csv") df = ds.to_spark(spark) .. testcode:: :skipif: True :hide: raydp.stop_spark() .. tab-item:: Modin To convert a :class:`~ray.data.dataset.Dataset` to a Modin DataFrame, call :meth:`Dataset.to_modin() `. .. testcode:: import ray ds = ray.data.read_csv("s3://anonymous@ray-example-data/iris.csv") mdf = ds.to_modin() .. tab-item:: Mars To convert a :class:`~ray.data.dataset.Dataset` from a Mars DataFrame, call :meth:`Dataset.to_mars() `. .. testcode:: :skipif: True import ray ds = ray.data.read_csv("s3://anonymous@ray-example-data/iris.csv") mdf = ds.to_mars() --- .. _shuffling_data: ============== Shuffling Data ============== When consuming or iterating over Ray :class:`Datasets `, it can be useful to shuffle or randomize the order of data (for example, randomizing data ingest order during ML training). This guide shows several different methods of shuffling data with Ray Data and their respective trade-offs. Types of shuffling ================== Ray Data provides several different options for shuffling data, trading off the granularity of shuffle control with memory consumption and runtime. The list below presents options in increasing order of resource consumption and runtime. Choose the most appropriate method for your use case. .. 
_shuffling_file_order: Shuffle the ordering of files ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ To randomly shuffle the ordering of input files before reading, call a :ref:`read function ` that supports shuffling, such as :func:`~ray.data.read_images`, and use the ``shuffle="files"`` parameter. This randomly assigns input files to workers for reading. This is the fastest "shuffle" option: it's purely a metadata operation---the system randomly shuffles the list of files constituting the dataset before fetching them with read tasks. This option, however, doesn't shuffle the rows inside files, so the randomness might not be sufficient if each file contains a large number of rows. .. testcode:: import ray ds = ray.data.read_images( "s3://anonymous@ray-example-data/image-datasets/simple", shuffle="files", ) .. _local_shuffle_buffer: Local buffer shuffle ~~~~~~~~~~~~~~~~~~~~ To locally shuffle a subset of rows using iteration methods, such as :meth:`~ray.data.Dataset.iter_batches`, :meth:`~ray.data.Dataset.iter_torch_batches`, and :meth:`~ray.data.Dataset.iter_tf_batches`, specify `local_shuffle_buffer_size`. This shuffles up to `local_shuffle_buffer_size` rows that are buffered during iteration. See more details in :ref:`Iterating over batches with shuffling `. This is slower than shuffling the ordering of files, and shuffles rows locally without network transfer. You can use this local shuffle buffer together with shuffling the ordering of files. See :ref:`Shuffle the ordering of files `. .. testcode:: import ray ds = ray.data.read_images("s3://anonymous@ray-example-data/image-datasets/simple") for batch in ds.iter_batches( batch_size=2, batch_format="numpy", local_shuffle_buffer_size=250, ): print(batch) .. tip:: If you observe reduced throughput when using ``local_shuffle_buffer_size``, check the total time spent in batch creation by examining the ``ds.stats()`` output (``In batch formatting``, under ``Batch iteration time breakdown``). If this time is significantly larger than the time spent in other steps, decrease ``local_shuffle_buffer_size`` or turn off the local shuffle buffer altogether and only :ref:`shuffle the ordering of files `. Randomizing block order ~~~~~~~~~~~~~~~~~~~~~~~ This option randomizes the order of :ref:`blocks ` in a dataset. While applying this operation alone doesn't involve heavy computation and communication, it requires Ray Data to materialize all blocks in memory before actually randomizing their ordering in the queue for the subsequent operation. .. note:: Ray Data doesn't guarantee any particular ordering of the blocks when reading blocks from different files in parallel by default, unless you set `DataContext.execution_options.preserve_order` to True. Hence, this particular option is primarily relevant in cases where the system yields blocks from a relatively small set of very large files. .. note:: Only use this option when your dataset is small enough to fit into the object store memory. To perform block order shuffling, use :meth:`randomize_block_order `. .. testcode:: import ray ds = ray.data.read_text( "s3://anonymous@ray-example-data/sms_spam_collection_subset.txt" ) # Randomize the block order of this dataset. ds = ds.randomize_block_order() Global shuffle ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ To shuffle all rows globally across the whole dataset, multiple options are available: 1. 
*Random shuffling*: invoking :meth:`~ray.data.Dataset.random_shuffle` randomly permutes individual rows from existing blocks into new ones, using an optionally provided seed. 2. (**New in 2.46**) *Key-based repartitioning*: invoking :meth:`~ray.data.Dataset.repartition` with the `keys` parameter triggers a :ref:`hash-shuffle ` operation, which shuffles rows based on the hash of the values in the provided key columns, deterministically co-locating rows that share the same key values. Note that a shuffle is an expensive operation: it requires materializing the whole dataset in memory and acts as a synchronization barrier---subsequent operators can't start executing until the shuffle completes. Example of random shuffling with seed: .. testcode:: import ray ds = ray.data.read_images("s3://anonymous@ray-example-data/image-datasets/simple") # Random shuffle with seed random_shuffled_ds = ds.random_shuffle(seed=123) Example of hash shuffling based on column `id`: .. testcode:: import ray from ray.data.context import DataContext, ShuffleStrategy # First enable hash-shuffle as shuffling strategy DataContext.get_current().shuffle_strategy = ShuffleStrategy.HASH_SHUFFLE # Hash-shuffle hash_shuffled_ds = ds.repartition(keys="id", num_blocks=200) .. _optimizing_shuffles: Advanced: Optimizing shuffles ============================= .. note:: This is an active area of development. If your Dataset uses a shuffle operation and you are having trouble configuring shuffle, `file a Ray Data issue on GitHub `_. When should you use global per-epoch shuffling? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Use global per-epoch shuffling only if your model is sensitive to the randomness of the training data. Based on a `theoretical foundation `__, all gradient-descent-based model trainers benefit from improved global shuffle quality. In practice, the benefit's particularly pronounced for tabular data/models. However, the more global the shuffle is, the more expensive the shuffling operation. The increase compounds with distributed data-parallel training on a multi-node cluster due to data transfer costs. This cost can be prohibitive when using very large datasets. The best way to determine the right tradeoff between preprocessing time and cost and per-epoch shuffle quality is to measure the precision gain per training step for your particular model under different shuffling policies, such as no shuffling, local shuffling, or global shuffling. As long as your data loading and shuffling throughput is higher than your training throughput, your GPU should saturate. If you have shuffle-sensitive models, push the shuffle quality higher until you reach this threshold. .. _shuffle_performance_tips: Enabling push-based shuffle ~~~~~~~~~~~~~~~~~~~~~~~~~~~ Some Dataset operations require a *shuffle* operation, meaning that the system shuffles data from all of the input partitions to all of the output partitions. These operations include :meth:`Dataset.random_shuffle `, :meth:`Dataset.sort `, and :meth:`Dataset.groupby `. For example, during a sort operation, the system reorders data between blocks and therefore requires shuffling across partitions. Shuffling can be challenging to scale to large data sizes and clusters, especially when the total dataset size can't fit into memory. Ray Data provides an alternative shuffle implementation known as push-based shuffle for improving large-scale performance. 
Try this out if your dataset has more than 1000 blocks or is larger than 1 TB in size. To try this out locally or on a cluster, you can start with the `nightly release test `_ that Ray runs for :meth:`Dataset.random_shuffle ` and :meth:`Dataset.sort `. To get an idea of the performance you can expect, here are some run time results for :meth:`Dataset.random_shuffle ` on 1-10 TB of data on 20 machines - m5.4xlarge instances on AWS EC2, each with 16 vCPUs, 64 GB RAM. .. image:: https://docs.google.com/spreadsheets/d/e/2PACX-1vQvBWpdxHsW0-loasJsBpdarAixb7rjoo-lTgikghfCeKPQtjQDDo2fY51Yc1B6k_S4bnYEoChmFrH2/pubchart?oid=598567373&format=image :align: center To try out push-based shuffle, set the environment variable ``RAY_DATA_PUSH_BASED_SHUFFLE=1`` when running your application: .. code-block:: bash $ wget https://raw.githubusercontent.com/ray-project/ray/master/release/nightly_tests/dataset/sort_benchmark.py $ RAY_DATA_PUSH_BASED_SHUFFLE=1 python sort_benchmark.py --num-partitions=10 --partition-size=1e7 # Dataset size: 10 partitions, 0.01GB partition size, 0.1GB total # [dataset]: Run `pip install tqdm` to enable progress reporting. # 2022-05-04 17:30:28,806 INFO push_based_shuffle.py:118 -- Using experimental push-based shuffle. # Finished in 9.571171760559082 # ... You can also specify the shuffle implementation during program execution by setting the ``DataContext.use_push_based_shuffle`` flag: .. testcode:: :hide: import ray ray.shutdown() .. testcode:: import ray ctx = ray.data.DataContext.get_current() ctx.use_push_based_shuffle = True ds = ( ray.data.range(1000) .random_shuffle() ) Large-scale shuffles can take a while to finish. For debugging purposes, shuffle operations support executing only part of the shuffle, so that you can collect an execution profile more quickly. Here is an example that shows how to limit a random shuffle operation to two output blocks: .. testcode:: :hide: import ray ray.shutdown() .. testcode:: import ray ctx = ray.data.DataContext.get_current() ctx.set_config( "debug_limit_shuffle_execution_to_num_blocks", 2 ) ds = ( ray.data.range(1000, override_num_blocks=10) .random_shuffle() .materialize() ) print(ds.stats()) .. testoutput:: :options: +MOCK Operator 1 ReadRange->RandomShuffle: executed in 0.08s Suboperator 0 ReadRange->RandomShuffleMap: 2/2 blocks executed ... --- .. _transforming_data: ================= Transforming Data ================= Transformations let you process and modify your dataset. You can compose transformations to express a chain of computations. .. note:: Transformations are lazy by default. They aren't executed until you trigger consumption of the data by :ref:`iterating over the Dataset `, :ref:`saving the Dataset `, or :ref:`inspecting properties of the Dataset `. This guide shows you how to scale transformations (or user-defined functions (UDFs)) on your Ray Data dataset. .. _transforming_rows: Transforming rows ================= .. tip:: If your transformation is vectorized, call :meth:`~ray.data.Dataset.map_batches` for better performance. To learn more, see :ref:`Transforming batches `. Transforming rows with map ~~~~~~~~~~~~~~~~~~~~~~~~~~ If your transformation returns exactly one row for each input row, call :meth:`~ray.data.Dataset.map`. This transformation is automatically parallelized across your Ray cluster. .. 
testcode:: import os from typing import Any, Dict import ray def parse_filename(row: Dict[str, Any]) -> Dict[str, Any]: row["filename"] = os.path.basename(row["path"]) return row ds = ( ray.data.read_images("s3://anonymous@ray-example-data/image-datasets/simple", include_paths=True) .map(parse_filename) ) The user defined function passed to :meth:`~ray.data.Dataset.map` should be of type `Callable[[Dict[str, Any]], Dict[str, Any]]`. In other words, your function should input and output a dictionary with keys of strings and values of any type. For example: .. testcode:: from typing import Any, Dict def fn(row: Dict[str, Any]) -> Dict[str, Any]: # access row data value = row["col1"] # add data to row row["col2"] = ... # return row return row Transforming rows with flat map ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ If your transformation returns multiple rows for each input row, call :meth:`~ray.data.Dataset.flat_map`. This transformation is automatically parallelized across your Ray cluster. .. testcode:: from typing import Any, Dict, List import ray def duplicate_row(row: Dict[str, Any]) -> List[Dict[str, Any]]: return [row] * 2 print( ray.data.range(3) .flat_map(duplicate_row) .take_all() ) .. testoutput:: [{'id': 0}, {'id': 0}, {'id': 1}, {'id': 1}, {'id': 2}, {'id': 2}] The user defined function passed to :meth:`~ray.data.Dataset.flat_map` should be of type `Callable[[Dict[str, Any]], List[Dict[str, Any]]]`. In other words your function should input a dictionary with keys of strings and values of any type and output a list of dictionaries that have the same type as the input, for example: .. testcode:: from typing import Any, Dict, List def fn(row: Dict[str, Any]) -> List[Dict[str, Any]]: # access row data value = row["col1"] # add data to row row["col2"] = ... # construct output list output = [row, row] # return list of output rows return output .. _transforming_batches: Transforming batches ==================== If your transformation can be vectorized using NumPy, PyArrow or Pandas operations, transforming batches is considerably more performant than transforming individual rows. This transformation is automatically parallelized across your Ray cluster. .. testcode:: from typing import Dict import numpy as np import ray def increase_brightness(batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]: batch["image"] = np.clip(batch["image"] + 4, 0, 255) return batch ds = ( ray.data.read_images("s3://anonymous@ray-example-data/image-datasets/simple") .map_batches(increase_brightness) ) .. _configure_batch_format: Configuring batch format ~~~~~~~~~~~~~~~~~~~~~~~~ Ray Data represents batches as dicts of NumPy ndarrays, pandas DataFrames or Arrow Tables. By default, Ray Data represents batches as dicts of NumPy ndarrays. To configure the batch type, specify ``batch_format`` in :meth:`~ray.data.Dataset.map_batches`. You can return either format from your function, but ``batch_format`` should match the input of your function. When applying transformations to batches of rows, Ray Data could represent these batches as either NumPy's ``ndarrays``, Pandas ``DataFrame`` or PyArrow ``Table``. When using * ``batch_format=numpy``, the input to the function is a dictionary where keys correspond to column names and values to column values represented as ``ndarrays``. * ``batch_format=pyarrow``, the input to the function is a Pyarrow ``Table``. * ``batch_format=pandas``, the input to the function is a Pandas ``DataFrame``. .. tab-set:: .. tab-item:: NumPy .. 
testcode::

            from typing import Dict

            import numpy as np
            import ray

            def increase_brightness(batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]:
                batch["image"] = np.clip(batch["image"] + 4, 0, 255)
                return batch

            ds = (
                ray.data.read_images("s3://anonymous@ray-example-data/image-datasets/simple")
                .map_batches(increase_brightness, batch_format="numpy")
            )

    .. tab-item:: pandas

        .. testcode::

            import pandas as pd
            import ray

            def drop_nas(batch: pd.DataFrame) -> pd.DataFrame:
                return batch.dropna()

            ds = (
                ray.data.read_csv("s3://anonymous@air-example-data/iris.csv")
                .map_batches(drop_nas, batch_format="pandas")
            )

    .. tab-item:: pyarrow

        .. testcode::

            import pyarrow as pa
            import pyarrow.compute as pc
            import ray

            def drop_nas(batch: pa.Table) -> pa.Table:
                return pc.drop_null(batch)

            ds = (
                ray.data.read_csv("s3://anonymous@air-example-data/iris.csv")
                .map_batches(drop_nas, batch_format="pyarrow")
            )

The user-defined function can also be a Python generator that yields batches, so the function can also be of type ``Callable[[DataBatch], Iterator[DataBatch]]``, where ``DataBatch = Union[pd.DataFrame, Dict[str, np.ndarray], pyarrow.Table]``. In this case, your function would look like:

.. testcode::

    from typing import Dict, Iterator

    import numpy as np

    def fn(batch: Dict[str, np.ndarray]) -> Iterator[Dict[str, np.ndarray]]:
        # Yield the same batch multiple times.
        for _ in range(10):
            yield batch

Choosing the right batch format
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When choosing the batch format for your ``map_batches`` call, the primary consideration is the trade-off between convenience and performance:

1. Batches are a sliding window into the underlying block: the UDF is invoked with the subset of rows of the underlying block that makes up the current batch of the specified ``batch_size``. Specifying ``batch_size=None`` makes the batch include all rows of the block.

2. Depending on the batch format, this view is either *zero-copy* (when the batch format matches the block type, which is either ``pandas`` or ``pyarrow``) or a copy (when the batch format differs from the block type). For example, if the underlying block type is Arrow, specifying ``batch_format="numpy"`` or ``batch_format="pandas"`` might copy the underlying data when converting it from the block type.

Ray Data also strives to minimize the number of data conversions: for example, if your ``map_batches`` operation returns pandas batches, then these batches are combined into blocks *without* conversion and propagated further as pandas blocks.

Most Ray Data datasources produce Arrow blocks, so using batch format ``pyarrow`` can avoid unnecessary data conversions. If you'd like to use a more ergonomic API for transformations while avoiding these performance overheads, you can consider using ``polars`` inside your ``map_batches`` operation with ``batch_format="pyarrow"`` as follows:

.. testcode::

    import pyarrow as pa

    def udf(table: pa.Table) -> pa.Table:
        import polars as pl

        # Convert the Arrow table into a Polars DataFrame (zero-copy where possible).
        df = pl.from_arrow(table)
        # ... apply your Polars transformations to `df` here ...
        return df.to_arrow()

    ds.map_batches(udf, batch_format="pyarrow")

Configuring batch size
~~~~~~~~~~~~~~~~~~~~~~

Increasing ``batch_size`` improves the performance of vectorized transformations as well as the performance of model inference. However, if your batch size is too large, your program might run into out-of-memory (OOM) errors. If you encounter an OOM error, try decreasing your ``batch_size``.

..
_stateful_transforms: Stateful/Class-based Transforms =============================== If your transform requires expensive setup such as downloading model weights, use a callable Python class instead of a function to make the transform stateful. When a Python class is used, the ``__init__`` method is called to perform setup exactly once on each worker. In contrast, functions are stateless, so any setup must be performed for each data item. Internally, Ray Data uses tasks to execute functions, and uses actors to execute classes. To learn more about tasks and actors, read the :ref:`Ray Core Key Concepts `. To transform data with a Python class, complete these steps: 1. Implement a class. Perform setup in ``__init__`` and transform data in ``__call__``. 2. Call :meth:`~ray.data.Dataset.map_batches`, :meth:`~ray.data.Dataset.map`, or :meth:`~ray.data.Dataset.flat_map`. Pass a ``ray.data.ActorPoolStrategy(...)`` object to the ``compute`` argument to control how many workers Ray uses. Each worker transforms a partition of data in parallel. .. tab-set:: .. tab-item:: CPU .. testcode:: from typing import Dict import numpy as np import torch import ray class TorchPredictor: def __init__(self): self.model = torch.nn.Identity() self.model.eval() def __call__(self, batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]: inputs = torch.as_tensor(batch["data"], dtype=torch.float32) with torch.inference_mode(): batch["output"] = self.model(inputs).detach().numpy() return batch ds = ( ray.data.from_numpy(np.ones((32, 100))) .map_batches( TorchPredictor, compute=ray.data.ActorPoolStrategy(size=2), ) ) .. testcode:: :hide: ds.materialize() .. tab-item:: GPU .. testcode:: from typing import Dict import numpy as np import torch import ray class TorchPredictor: def __init__(self): self.model = torch.nn.Identity().cuda() self.model.eval() def __call__(self, batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]: inputs = torch.as_tensor(batch["data"], dtype=torch.float32).cuda() with torch.inference_mode(): batch["output"] = self.model(inputs).detach().cpu().numpy() return batch ds = ( ray.data.from_numpy(np.ones((32, 100))) .map_batches( TorchPredictor, # Two workers with one GPU each compute=ray.data.ActorPoolStrategy(size=2), # Batch size is required if you're using GPUs. batch_size=4, num_gpus=1 ) ) .. testcode:: :hide: ds.materialize() Specifying CPUs, GPUs, and Memory ================================= You can optionally specify logical resources per transformation by using one of the following parameters: ``num_cpus``, ``num_gpus``, ``memory``, ``resources``. * ``num_cpus``: The number of CPUs to use for the transformation. * ``num_gpus``: The number of GPUs to use for the transformation. Ray automatically configures the proper CUDA_VISIBLE_DEVICES environment variable so that GPUs are isolated from other tasks/actors. * ``memory``: The amount of memory to use for the transformation. This is useful for avoiding out-of-memory errors by telling Ray how much memory your function uses, and preventing Ray from scheduling too many tasks on a node. * ``resources``: A dictionary of resources to use for the transformation. This is useful for specifying custom resources. Note that these are logical resources and don't impose limits on actual physical resource usage. Also, both ``num_cpus`` and ``num_gpus`` support fractional values less than 1. For example, specifying ``num_cpus=0.5`` on a cluster with 4 CPUs allows 8 concurrent tasks/actors to run. 
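For example, here's a minimal sketch of requesting fractional CPUs for a transformation; the ``double_id`` function is only a placeholder used to illustrate the resource syntax:

.. testcode::

    import ray

    ds = ray.data.range(8)

    def double_id(batch):
        # A trivial transformation used only to illustrate resource requests.
        batch["id"] = batch["id"] * 2
        return batch

    # Each task logically reserves half a CPU, so on a 4-CPU cluster up to
    # 8 of these tasks can run concurrently.
    ds = ds.map_batches(double_id, num_cpus=0.5)
    ds.take_all()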
You can read more about resources in Ray here: :ref:`resource-requirements`.

.. testcode::
    :hide:

    import ray

    ds = ray.data.range(1)

.. testcode::

    from typing import Dict

    import numpy as np

    def uses_lots_of_memory(batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]:
        ...

    # Tell Ray that the function uses 1 GiB of memory
    ds.map_batches(uses_lots_of_memory, memory=1 * 1024 * 1024 * 1024)

Specifying Concurrency
======================

You can specify the concurrency of the transformation by using the ``compute`` parameter.

For functions, use ``compute=ray.data.TaskPoolStrategy(size=n)`` to cap the number of concurrent tasks. By default, Ray Data automatically determines the number of concurrent tasks.

For classes, use ``compute=ray.data.ActorPoolStrategy(size=n)`` to use a fixed-size actor pool of ``n`` workers. If ``compute`` isn't specified, an autoscaling actor pool is used by default.

.. testcode::

    import ray

    ds = ray.data.range(10).map_batches(lambda batch: {"id": batch["id"] * 2}, compute=ray.data.TaskPoolStrategy(size=2))
    ds.take_all()

.. testoutput::
    :options: +MOCK

    [{'id': 0}, {'id': 2}, {'id': 4}, {'id': 6}, {'id': 8}, {'id': 10}, {'id': 12}, {'id': 14}, {'id': 16}, {'id': 18}]

.. _ordering_of_rows:

Ordering of rows
================

When transforming data, the order of :ref:`blocks ` isn't preserved by default. If the order of blocks needs to be preserved or deterministic, you can use the :meth:`~ray.data.Dataset.sort` method, or set :attr:`ray.data.ExecutionOptions.preserve_order` to `True`. Note that setting this flag may negatively impact performance on larger cluster setups where stragglers are more likely.

.. testcode::

    import ray

    ctx = ray.data.DataContext.get_current()

    # By default, this is set to False.
    ctx.execution_options.preserve_order = True

.. _transforming_groupby:

Group-by and transforming groups
================================

To transform groups, call :meth:`~ray.data.Dataset.groupby` to group rows based on the provided ``key`` column values. Then, call :meth:`~ray.data.grouped_data.GroupedData.map_groups` to execute a transformation on each group.

.. tab-set::

    .. tab-item:: NumPy

        .. testcode::

            from typing import Dict

            import numpy as np
            import ray

            items = [
                {"image": np.zeros((32, 32, 3)), "label": label}
                for _ in range(10)
                for label in range(100)
            ]

            def normalize_images(group: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]:
                group["image"] = (group["image"] - group["image"].mean()) / group["image"].std()
                return group

            ds = (
                ray.data.from_items(items)
                .groupby("label")
                .map_groups(normalize_images)
            )

    .. tab-item:: pandas

        .. testcode::

            import pandas as pd
            import ray

            def normalize_features(group: pd.DataFrame) -> pd.DataFrame:
                target = group["target"]
                group = group.drop("target", axis=1)
                group = (group - group.min()) / group.std()
                group["target"] = target
                return group

            ds = (
                ray.data.read_csv("s3://anonymous@air-example-data/iris.csv")
                .groupby("target")
                .map_groups(normalize_features)
            )

Advanced: Distributed UDFs with Placement Groups
================================================

While all transformations are automatically parallelized across your Ray cluster, often the transformations themselves can also be distributed. For example, if you're using a large model, you may want to distribute the model across multiple nodes. You can do this by using :ref:`placement groups ` and ``ray_remote_args_fn``, which can dynamically create placement groups for each model replica.

..
testcode:: import ray from typing import Dict import numpy as np import torch NUM_SHARDS = 2 @ray.remote class ModelShard: def __init__(self): self.model = torch.nn.Linear(10, 10) def f(self, batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]: return batch class DistributedModel: def __init__(self): self.shards = [ModelShard.remote() for _ in range(NUM_SHARDS)] def __call__(self, batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]: return {"out": np.array(ray.get([shard.f.remote(batch) for shard in self.shards]))} def ray_remote_args_fn(): from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy pg = ray.util.placement_group([{"CPU": 1}] * NUM_SHARDS) return {"scheduling_strategy": PlacementGroupSchedulingStrategy(placement_group=pg)} ds = ray.data.range(10).map_batches(DistributedModel, ray_remote_args_fn=ray_remote_args_fn) ds.take_all() Advanced: Asynchronous Transforms ================================= Ray Data supports asynchronous functions by using the ``async`` keyword. This is useful for performing asynchronous operations such as fetching data from a database or making HTTP requests. Note that this only works when using a class-based transform function and currently requires ``uvloop==0.21.0``. .. testcode:: import ray from typing import Dict import numpy as np class AsyncTransform: async def __call__(self, batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]: return batch ds = ray.data.range(10).map_batches(AsyncTransform) ds.take_all() .. testoutput:: :options: +MOCK [{'id': 0}, {'id': 1}, {'id': 2}, {'id': 3}, {'id': 4}, {'id': 5}, {'id': 6}, {'id': 7}, {'id': 8}, {'id': 9}] Expressions (Alpha) =================== Ray Data expressions provide a way to specify column-based operations on datasets. Use :func:`~ray.data.expressions.col` to reference columns and :func:`~ray.data.expressions.lit` to create literal values. You can combine these with operators to create complex expressions for filtering, transformations, and computations. Expressions have to be used with :meth:`~ray.data.Dataset.with_column`. The core advantage of expressions is that because they operate on specific columns, Ray Data's optimizer can optimize the execution plan by reordering the operations. See :ref:`expressions-api` for more details. .. testcode:: import ray from ray.data.expressions import col ds = ray.data.range(10).with_column("id_2", col("id") * 2) ds.show() To use a custom function with an expression, you can use :func:`~ray.data.expressions.udf`. .. testcode:: from ray.data.expressions import col, udf from ray.data.datatype import DataType import pyarrow as pa import pyarrow.compute as pc import ray # UDF that operates on a batch of values (PyArrow Array) @udf(return_dtype=DataType.int32()) def add_one(x: pa.Array) -> pa.Array: return pc.add(x, 1) # Vectorized operation on the entire Array # UDF that combines multiple columns (each as a PyArrow Array) @udf(return_dtype=DataType.string()) def format_name(first: pa.Array, last: pa.Array) -> pa.Array: return pc.binary_join_element_wise(first, last, " ") # Vectorized string concatenation # Use in dataset operations ds = ray.data.from_items([ {"value": 5, "first": "John", "last": "Doe"}, {"value": 10, "first": "Jane", "last": "Smith"} ]) ds = ds.with_column("value_plus_one", add_one(col("value"))) ds = ds.with_column("full_name", format_name(col("first"), col("last"))) ds = ds.with_column("doubled_plus_one", add_one(col("value")) * 2) ds.show() --- .. 
_data_user_guide: =========== User Guides =========== If you’re new to Ray Data, start with the :ref:`Ray Data Quickstart `. This user guide helps you navigate the Ray Data project and shows you how to achieve several tasks. .. toctree:: :maxdepth: 2 loading-data inspecting-data transforming-data aggregating-data iterating-over-data joining-data shuffling-data saving-data working-with-images working-with-text working-with-tensors working-with-pytorch working-with-llms monitoring-your-workload execution-configurations batch_inference performance-tips custom-datasource-example --- .. _working_with_images: Working with Images =================== With Ray Data, you can easily read and transform large image datasets. This guide shows you how to: * :ref:`Read images ` * :ref:`Transform images ` * :ref:`Perform inference on images ` * :ref:`Save images ` .. _reading_images: Reading images -------------- Ray Data can read images from a variety of formats. To view the full list of supported file formats, see the :ref:`Input/Output reference `. .. tab-set:: .. tab-item:: Raw images To load raw images like JPEG files, call :func:`~ray.data.read_images`. In the schema, the column name defaults to "image". .. note:: :func:`~ray.data.read_images` uses `PIL `_. For a list of supported file formats, see `Image file formats `_. .. testcode:: import ray ds = ray.data.read_images("s3://anonymous@ray-example-data/batoidea/JPEGImages") print(ds.schema()) .. testoutput:: Column Type ------ ---- image ArrowTensorTypeV2(shape=(32, 32, 3), dtype=uint8) .. tab-item:: Images from Dataset of URIs To load images from a dataset of URIs, use the :func:`~ray.data.Dataset.with_column` method together with the :func:`~ray.data.expressions.download` expression. .. testcode:: import ray from ray.data.expressions import download ds = ray.data.read_parquet("s3://anonymous@ray-example-data/imagenet/metadata_file.parquet") ds = ds.with_column("bytes", download("image_url")) print(ds.schema()) .. testoutput:: Column Type ------ ---- image_url string bytes null .. tab-item:: NumPy To load images stored in NumPy format, call :func:`~ray.data.read_numpy`. .. testcode:: import ray ds = ray.data.read_numpy("s3://anonymous@air-example-data/cifar-10/images.npy") print(ds.schema()) .. testoutput:: Column Type ------ ---- data ArrowTensorTypeV2(shape=(32, 32, 3), dtype=uint8) .. tab-item:: TFRecords Image datasets often contain ``tf.train.Example`` messages that look like this: .. code-block:: features { feature { key: "image" value { bytes_list { value: ... # Raw image bytes } } } feature { key: "label" value { int64_list { value: 3 } } } } To load examples stored in this format, call :func:`~ray.data.read_tfrecords`. Then, call :meth:`~ray.data.Dataset.map` to decode the raw image bytes. .. testcode:: import io from typing import Any, Dict import numpy as np from PIL import Image import ray def decode_bytes(row: Dict[str, Any]) -> Dict[str, Any]: data = row["image"] image = Image.open(io.BytesIO(data)) row["image"] = np.asarray(image) return row ds = ( ray.data.read_tfrecords( "s3://anonymous@air-example-data/cifar-10/tfrecords" ) .map(decode_bytes) ) print(ds.schema()) .. The following `testoutput` is mocked because the order of column names can be non-deterministic. For an example, see https://buildkite.com/ray-project/oss-ci-build-branch/builds/4849#01892c8b-0cd0-4432-bc9f-9f86fcd38edd. .. testoutput:: :options: +MOCK Column Type ------ ---- image ArrowTensorTypeV2(shape=(32, 32, 3), dtype=uint8) label int64 .. 
tab-item:: Parquet To load image data stored in Parquet files, call :func:`ray.data.read_parquet`. .. testcode:: import ray ds = ray.data.read_parquet("s3://anonymous@air-example-data/cifar-10/parquet") print(ds.schema()) .. testoutput:: Column Type ------ ---- img struct label int64 For more information on creating datasets, see :ref:`Loading Data `. .. _transforming_images: Transforming images ------------------- To transform images, call :meth:`~ray.data.Dataset.map` or :meth:`~ray.data.Dataset.map_batches`. .. testcode:: from typing import Any, Dict import numpy as np import ray def increase_brightness(batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]: batch["image"] = np.clip(batch["image"] + 4, 0, 255) return batch ds = ( ray.data.read_images("s3://anonymous@ray-example-data/batoidea/JPEGImages") .map_batches(increase_brightness) ) For more information on transforming data, see :ref:`Transforming data `. .. _performing_inference_on_images: Performing inference on images ------------------------------ To perform inference with a pre-trained model, first load and transform your data. .. testcode:: from typing import Any, Dict from torchvision import transforms import ray def transform_image(row: Dict[str, Any]) -> Dict[str, Any]: transform = transforms.Compose([ transforms.ToTensor(), transforms.Resize((32, 32)) ]) row["image"] = transform(row["image"]) return row ds = ( ray.data.read_images("s3://anonymous@ray-example-data/batoidea/JPEGImages") .map(transform_image) ) Next, implement a callable class that sets up and invokes your model. .. testcode:: import torch from torchvision import models class ImageClassifier: def __init__(self): weights = models.ResNet18_Weights.DEFAULT self.model = models.resnet18(weights=weights) self.model.eval() def __call__(self, batch): inputs = torch.from_numpy(batch["image"]) with torch.inference_mode(): outputs = self.model(inputs) return {"class": outputs.argmax(dim=1)} Finally, call :meth:`Dataset.map_batches() `. .. testcode:: predictions = ds.map_batches( ImageClassifier, compute=ray.data.ActorPoolStrategy(size=2), batch_size=4 ) predictions.show(3) .. testoutput:: {'class': 118} {'class': 153} {'class': 296} For more information on performing inference, see :ref:`End-to-end: Offline Batch Inference ` and :ref:`Stateful Transforms `. .. _saving_images: Saving images ------------- Save images with formats like PNG, Parquet, and NumPy. To view all supported formats, see the :ref:`Input/Output reference `. .. tab-set:: .. tab-item:: Images To save images as image files, call :meth:`~ray.data.Dataset.write_images`. .. testcode:: import ray ds = ray.data.read_images("s3://anonymous@ray-example-data/image-datasets/simple") ds.write_images("/tmp/simple", column="image", file_format="png") .. tab-item:: Parquet To save images in Parquet files, call :meth:`~ray.data.Dataset.write_parquet`. .. testcode:: import ray ds = ray.data.read_images("s3://anonymous@ray-example-data/image-datasets/simple") ds.write_parquet("/tmp/simple") .. tab-item:: NumPy To save images in a NumPy file, call :meth:`~ray.data.Dataset.write_numpy`. .. testcode:: import ray ds = ray.data.read_images("s3://anonymous@ray-example-data/image-datasets/simple") ds.write_numpy("/tmp/simple", column="image") For more information on saving data, see :ref:`Saving data `. --- .. _working-with-llms: Working with LLMs ================= The :ref:`ray.data.llm ` module integrates with key large language model (LLM) inference engines and deployed models to enable LLM batch inference. 
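At a high level, a batch inference job configures an engine processor, builds a processor from that configuration, and then applies the processor to a Dataset. The following minimal sketch illustrates that pattern. It assumes a GPU is available and uses the ``vLLMEngineProcessorConfig`` and ``build_llm_processor`` entry points from ``ray.data.llm``; the exact function and argument names may differ slightly across Ray versions, so treat it as an illustration of the pattern rather than a definitive reference.

.. code-block:: python

    import ray
    from ray.data.llm import build_llm_processor, vLLMEngineProcessorConfig

    config = vLLMEngineProcessorConfig(
        model_source="unsloth/Llama-3.1-8B-Instruct",
        concurrency=1,
        batch_size=64,
    )

    processor = build_llm_processor(
        config,
        # Convert each input row into the OpenAI chat format the engine expects.
        preprocess=lambda row: dict(
            messages=[{"role": "user", "content": row["prompt"]}],
            sampling_params=dict(temperature=0.3, max_tokens=150),
        ),
        # Keep the original prompt alongside the generated text.
        postprocess=lambda row: dict(
            prompt=row["prompt"],
            response=row["generated_text"],
        ),
    )

    ds = ray.data.from_items([{"prompt": "What is Ray?"}])
    ds = processor(ds)
    ds.show()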
This guide shows you how to use :ref:`ray.data.llm ` to: * :ref:`Quickstart: vLLM batch inference ` * :ref:`Perform batch inference with LLMs ` * :ref:`Configure vLLM for LLM inference ` * :ref:`Multimodal batch inference ` * :ref:`Batch inference with embedding models ` * :ref:`Batch inference with classification models ` * :ref:`Query deployed models with an OpenAI compatible API endpoint ` .. _vllm_quickstart: Quickstart: vLLM batch inference --------------------------------- Get started with vLLM batch inference in just a few steps. This example shows the minimal setup needed to run batch inference on a dataset. .. note:: This quickstart requires a GPU as vLLM is GPU-accelerated. First, install Ray Data with LLM support: .. code-block:: bash pip install -U "ray[data, llm]>=2.49.1" Here's a complete minimal example that runs batch inference: .. literalinclude:: doc_code/working-with-llms/minimal_quickstart.py :language: python :start-after: __minimal_vllm_quickstart_start__ :end-before: __minimal_vllm_quickstart_end__ This example: 1. Creates a simple dataset with prompts 2. Configures a vLLM processor with minimal settings 3. Builds a processor that handles preprocessing (converting prompts to OpenAI chat format) and postprocessing (extracting generated text) 4. Runs inference on the dataset 5. Iterates through results The processor expects input rows with a ``prompt`` field and outputs rows with both ``prompt`` and ``response`` fields. You can consume results using ``iter_rows()``, ``take()``, ``show()``, or save to files with ``write_parquet()``. For more configuration options and advanced features, see the sections below. .. _batch_inference_llm: Perform batch inference with LLMs --------------------------------- At a high level, the :ref:`ray.data.llm ` module provides a :class:`Processor ` object which encapsulates logic for performing batch inference with LLMs on a Ray Data dataset. You can use the :func:`build_processor ` API to construct a processor. The following example uses the :class:`vLLMEngineProcessorConfig ` to construct a processor for the `unsloth/Llama-3.1-8B-Instruct` model. Upon execution, the Processor object instantiates replicas of the vLLM engine (using :meth:`map_batches ` under the hood). .. .. literalinclude:: doc_code/working-with-llms/basic_llm_example.py .. :language: python .. :start-after: __basic_llm_example_start__ .. :end-before: __basic_llm_example_end__ Here's a simple configuration example: .. literalinclude:: doc_code/working-with-llms/basic_llm_example.py :language: python :start-after: __basic_config_example_start__ :end-before: __basic_config_example_end__ The configuration includes detailed comments explaining: - `concurrency`: Number of vLLM engine replicas (typically 1 per node) - `batch_size`: Number of samples processed per batch (reduce if GPU memory is limited) - `max_num_batched_tokens`: Maximum tokens processed simultaneously (reduce if CUDA OOM occurs) - `accelerator_type`: Specify GPU type for optimal resource allocation The vLLM processor expects input in OpenAI chat format with a 'messages' column and outputs a 'generated_text' column containing model responses. Some models may require a Hugging Face token to be specified. You can specify the token in the `runtime_env` argument. .. literalinclude:: doc_code/working-with-llms/basic_llm_example.py :language: python :start-after: __hf_token_config_example_start__ :end-before: __hf_token_config_example_end__ .. 
_vllm_llm: Configure vLLM for LLM inference -------------------------------- Use the :class:`vLLMEngineProcessorConfig ` to configure the vLLM engine. For handling larger models, specify model parallelism: .. literalinclude:: doc_code/working-with-llms/basic_llm_example.py :language: python :start-after: __parallel_config_example_start__ :end-before: __parallel_config_example_end__ The underlying :class:`Processor ` object instantiates replicas of the vLLM engine and automatically configure parallel workers to handle model parallelism (for tensor parallelism and pipeline parallelism, if specified). To optimize model loading, you can configure the `load_format` to `runai_streamer` or `tensorizer`. .. note:: In this case, install vLLM with runai dependencies: `pip install -U "vllm[runai]>=0.10.1"` .. literalinclude:: doc_code/working-with-llms/basic_llm_example.py :language: python :start-after: __runai_config_example_start__ :end-before: __runai_config_example_end__ If your model is hosted on AWS S3, you can specify the S3 path in the `model_source` argument, and specify `load_format="runai_streamer"` in the `engine_kwargs` argument. .. literalinclude:: doc_code/working-with-llms/basic_llm_example.py :language: python :start-after: __s3_config_example_start__ :end-before: __s3_config_example_end__ To do multi-LoRA batch inference, you need to set LoRA related parameters in `engine_kwargs`. See :doc:`the vLLM with LoRA example` for details. .. literalinclude:: doc_code/working-with-llms/basic_llm_example.py :language: python :start-after: __lora_config_example_start__ :end-before: __lora_config_example_end__ .. _multimodal: Multimodal batch inference -------------------------------------------------------- Ray Data LLM also supports running batch inference with vision language and omni-modal models on multimodal data. To enable multimodal batch inference, apply the following 2 adjustments on top of the previous example: - Set `prepare_multimodal_stage={"enabled": True}` in the `vLLMEngineProcessorConfig` - Prepare multimodal data inside the preprocessor. Prior to running the examples below, install the required dependencies: .. code-block:: bash # Install required dependencies for downloading datasets from Hugging Face pip install datasets>=4.0.0 Image batch inference with vision language model (VLM) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ First, load a vision dataset: .. literalinclude:: doc_code/working-with-llms/vlm_image_example.py :language: python :start-after: def load_vision_dataset(): :end-before: def create_vlm_config(): :dedent: 0 Next, configure the VLM processor with the essential settings: .. literalinclude:: doc_code/working-with-llms/vlm_image_example.py :language: python :start-after: __vlm_config_example_start__ :end-before: __vlm_config_example_end__ Define preprocessing and postprocessing functions to convert dataset rows into the format expected by the VLM and extract model responses. Within the preprocessor, structure image data as part of an OpenAI-compatible message. Both image URL and `PIL.Image.Image` object are supported. .. literalinclude:: doc_code/working-with-llms/vlm_image_example.py :language: python :start-after: __image_message_format_example_start__ :end-before: __image_message_format_example_end__ .. literalinclude:: doc_code/working-with-llms/vlm_image_example.py :language: python :start-after: __vlm_preprocess_example_start__ :end-before: __vlm_preprocess_example_end__ Finally, run the VLM inference: .. 
literalinclude:: doc_code/working-with-llms/vlm_image_example.py :language: python :start-after: def run_vlm_example(): :end-before: # __vlm_run_example_end__ :dedent: 0 Video batch inference with vision language model (VLM) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ First, load a video dataset: .. literalinclude:: doc_code/working-with-llms/vlm_video_example.py :language: python :start-after: def load_video_dataset(): :end-before: def create_vlm_video_config(): :dedent: 0 Next, configure the VLM processor with the essential settings: .. literalinclude:: doc_code/working-with-llms/vlm_video_example.py :language: python :start-after: __vlm_video_config_example_start__ :end-before: __vlm_video_config_example_end__ Define preprocessing and postprocessing functions to convert dataset rows into the format expected by the VLM and extract model responses. Within the preprocessor, structure video data as part of an OpenAI-compatible message. .. literalinclude:: doc_code/working-with-llms/vlm_video_example.py :language: python :start-after: __vlm_video_preprocess_example_start__ :end-before: __vlm_video_preprocess_example_end__ Finally, run the VLM inference: .. literalinclude:: doc_code/working-with-llms/vlm_video_example.py :language: python :start-after: def run_vlm_video_example(): :end-before: # __vlm_video_run_example_end__ :dedent: 0 Audio batch inference with omni-modal model ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ First, load an audio dataset: .. literalinclude:: doc_code/working-with-llms/omni_audio_example.py :language: python :start-after: def load_audio_dataset(): :end-before: def create_omni_audio_config(): :dedent: 0 Next, configure the omni-modal processor with the essential settings: .. literalinclude:: doc_code/working-with-llms/omni_audio_example.py :language: python :start-after: __omni_audio_config_example_start__ :end-before: __omni_audio_config_example_end__ Define preprocessing and postprocessing functions to convert dataset rows into the format expected by the omni-modal model and extract model responses. Within the preprocessor, structure audio data as part of an OpenAI-compatible message. Both audio URL and audio binary data are supported. .. literalinclude:: doc_code/working-with-llms/omni_audio_example.py :language: python :start-after: __audio_message_format_example_start__ :end-before: __audio_message_format_example_end__ .. literalinclude:: doc_code/working-with-llms/omni_audio_example.py :language: python :start-after: __omni_audio_preprocess_example_start__ :end-before: __omni_audio_preprocess_example_end__ Finally, run the omni-modal inference: .. literalinclude:: doc_code/working-with-llms/omni_audio_example.py :language: python :start-after: def run_omni_audio_example(): :end-before: # __omni_audio_run_example_end__ :dedent: 0 .. _embedding_models: Batch inference with embedding models --------------------------------------- Ray Data LLM supports batch inference with embedding models using vLLM: .. literalinclude:: doc_code/working-with-llms/embedding_example.py :language: python :start-after: __embedding_example_start__ :end-before: __embedding_example_end__ .. testoutput:: :options: +MOCK {'text': 'Hello world', 'embedding': [0.1, -0.2, 0.3, ...]} Key differences for embedding models: - Set ``task_type="embed"`` - Set ``apply_chat_template=False`` and ``detokenize=False`` - Use direct ``prompt`` input instead of ``messages`` - Access embeddings through``row["embeddings"]`` For a complete embedding configuration example, see: .. 
literalinclude:: doc_code/working-with-llms/basic_llm_example.py :language: python :start-after: __embedding_config_example_start__ :end-before: __embedding_config_example_end__ .. _classification_models: Batch inference with classification models ------------------------------------------ Ray Data LLM supports batch inference with sequence classification models, such as content classifiers and sentiment analyzers: .. literalinclude:: doc_code/working-with-llms/classification_example.py :language: python :start-after: __classification_example_start__ :end-before: __classification_example_end__ .. testoutput:: :options: +MOCK {'text': 'lol that was so funny haha', 'edu_score': -0.05} {'text': 'Photosynthesis converts light energy...', 'edu_score': 1.73} {'text': "Newton's laws describe...", 'edu_score': 2.52} Key differences for classification models: - Set ``task_type="classify"`` (or ``task_type="score"`` for scoring models) - Set ``apply_chat_template=False`` and ``detokenize=False`` - Use direct ``prompt`` input instead of ``messages`` - Access classification logits through ``row["embeddings"]`` For a complete classification configuration example, see: .. literalinclude:: doc_code/working-with-llms/basic_llm_example.py :language: python :start-after: __classification_config_example_start__ :end-before: __classification_config_example_end__ .. _openai_compatible_api_endpoint: Batch inference with an OpenAI-compatible endpoint -------------------------------------------------- You can also make calls to deployed models that have an OpenAI compatible API endpoint. .. literalinclude:: doc_code/working-with-llms/openai_api_example.py :language: python :start-after: __openai_example_start__ :end-before: __openai_example_end__ Batch inference with serve deployments --------------------------------------- You can configure any :ref:`serve deployment ` for batch inference. This is particularly useful for multi-turn conversations, where you can use a shared vLLM engine across conversations. To achieve this, create an :ref:`LLM serve deployment ` and use the :class:`ServeDeploymentProcessorConfig ` class to configure the processor. .. literalinclude:: doc_code/working-with-llms/basic_llm_example.py :language: python :start-after: __shared_vllm_engine_config_example_start__ :end-before: __shared_vllm_engine_config_example_end__ Cross-node parallelism --------------------------------------- Ray Data LLM supports cross-node parallelism, including tensor parallelism and pipeline parallelism. You can configure the parallelism level through the `engine_kwargs` argument in :class:`vLLMEngineProcessorConfig `. Use `ray` as the distributed executor backend to enable cross-node parallelism. .. literalinclude:: doc_code/working-with-llms/basic_llm_example.py :language: python :start-after: __cross_node_parallelism_config_example_start__ :end-before: __cross_node_parallelism_config_example_end__ In addition, you can customize the placement group strategy to control how Ray places vLLM engine workers across nodes. While you can specify the degree of tensor and pipeline parallelism, the specific assignment of model ranks to GPUs is managed by the vLLM engine and you can't directly configure it through the Ray Data LLM API. .. 
literalinclude:: doc_code/working-with-llms/basic_llm_example.py :language: python :start-after: __custom_placement_group_strategy_config_example_start__ :end-before: __custom_placement_group_strategy_config_example_end__ Besides cross-node parallelism, you can also horizontally scale the LLM stage to multiple nodes. Configure the number of replicas with the `concurrency` argument in :class:`vLLMEngineProcessorConfig `. .. literalinclude:: doc_code/working-with-llms/basic_llm_example.py :language: python :start-after: __concurrent_config_example_start__ :end-before: __concurrent_config_example_end__ Usage Data Collection -------------------------- Data for the following features and attributes is collected to improve Ray Data LLM: - config name used for building the llm processor - number of concurrent users for data parallelism - batch size of requests - model architecture used for building vLLMEngineProcessor - task type used for building vLLMEngineProcessor - engine arguments used for building vLLMEngineProcessor - tensor parallel size and pipeline parallel size used - GPU type used and number of GPUs used If you would like to opt-out from usage data collection, you can follow :ref:`Ray usage stats ` to turn it off. .. _faqs: Frequently Asked Questions (FAQs) -------------------------------------------------- .. _gpu_memory_management: GPU Memory Management and CUDA OOM Prevention ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ If you encounter CUDA out of memory errors, Ray Data LLM provides several configuration options to optimize GPU memory usage: .. literalinclude:: doc_code/working-with-llms/basic_llm_example.py :language: python :start-after: __gpu_memory_config_example_start__ :end-before: __gpu_memory_config_example_end__ **Key strategies for handling GPU memory issues:** - **Reduce batch size**: Start with smaller batches (8-16) and increase gradually - **Lower `max_num_batched_tokens`**: Reduce from 4096 to 2048 or 1024 - **Decrease `max_model_len`**: Use shorter context lengths when possible - **Set `gpu_memory_utilization`**: Use 0.75-0.85 instead of default 0.90 - **Use smaller models**: Consider using smaller model variants for resource-constrained environments If you run into CUDA out of memory, your batch size is likely too large. Set an explicit small batch size or use a smaller model, or a larger GPU. .. _model_cache: How to cache model weight to remote object storage ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ While deploying Ray Data LLM to large scale clusters, model loading may be rate limited by HuggingFace. In this case, you can cache the model to remote object storage (AWS S3 or Google Cloud Storage) for more stable model loading. Ray Data LLM provides the following utility to help uploading models to remote object storage. .. code-block:: bash # Download model from HuggingFace, and upload to GCS python -m ray.llm.utils.upload_model \ --model-source facebook/opt-350m \ --bucket-uri gs://my-bucket/path/to/facebook-opt-350m # Or upload a local custom model to S3 python -m ray.llm.utils.upload_model \ --model-source local/path/to/model \ --bucket-uri s3://my-bucket/path/to/model_name And later you can use remote object store URI as `model_source` in the config. .. literalinclude:: doc_code/working-with-llms/basic_llm_example.py :language: python :start-after: __s3_config_example_start__ :end-before: __s3_config_example_end__ --- .. _working_with_pytorch: Working with PyTorch ==================== Ray Data integrates with the PyTorch ecosystem. 
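The typical pattern is to load data with Ray Data, transform it, and then consume the results as Torch tensors. As a quick preview of the APIs covered in this guide, here's a minimal sketch that uses a small public example dataset:

.. testcode::

    import ray

    ds = ray.data.read_images("s3://anonymous@ray-example-data/image-datasets/simple")

    # Preprocess with Ray Data, then consume the results as Torch tensors.
    def normalize(row):
        row["image"] = row["image"] / 255.0
        return row

    for batch in ds.map(normalize).iter_torch_batches(batch_size=2):
        image_batch = batch["image"]  # A torch.Tensor of shape (2, 32, 32, 3).
        break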
This guide describes how to: * :ref:`Iterate over your dataset as Torch tensors for model training ` * :ref:`Write transformations that deal with Torch tensors ` * :ref:`Perform batch inference with Torch models ` * :ref:`Save Datasets containing Torch tensors ` * :ref:`Migrate from PyTorch Datasets to Ray Data ` .. _iterating_pytorch: Iterating over Torch tensors for training ----------------------------------------- To iterate over batches of data in Torch format, call :meth:`Dataset.iter_torch_batches() `. Each batch is represented as `Dict[str, torch.Tensor]`, with one tensor per column in the dataset. This is useful for training Torch models with batches from your dataset. For configuration details such as providing a ``collate_fn`` for customizing the conversion, see the API reference for :meth:`iter_torch_batches() `. .. testcode:: import ray import torch ds = ray.data.read_images("s3://anonymous@ray-example-data/image-datasets/simple") for batch in ds.iter_torch_batches(batch_size=2): print(batch) .. testoutput:: :options: +MOCK {'image': tensor([[[[...]]]], dtype=torch.uint8)} ... {'image': tensor([[[[...]]]], dtype=torch.uint8)} Integration with Ray Train ~~~~~~~~~~~~~~~~~~~~~~~~~~~ Ray Data integrates with :ref:`Ray Train ` for easy data ingest for data parallel training, with support for PyTorch, PyTorch Lightning, or Hugging Face training. .. testcode:: import torch from torch import nn import ray from ray import train from ray.train import ScalingConfig from ray.train.torch import TorchTrainer def train_func(): model = nn.Sequential(nn.Linear(30, 1), nn.Sigmoid()) loss_fn = torch.nn.BCELoss() optimizer = torch.optim.SGD(model.parameters(), lr=0.001) # Datasets can be accessed in your train_func via ``get_dataset_shard``. train_data_shard = train.get_dataset_shard("train") for epoch_idx in range(2): for batch in train_data_shard.iter_torch_batches(batch_size=128, dtypes=torch.float32): features = torch.stack([batch[col_name] for col_name in batch.keys() if col_name != "target"], axis=1) predictions = model(features) train_loss = loss_fn(predictions, batch["target"].unsqueeze(1)) train_loss.backward() optimizer.step() train_dataset = ray.data.read_csv("s3://anonymous@air-example-data/breast_cancer.csv") trainer = TorchTrainer( train_func, datasets={"train": train_dataset}, scaling_config=ScalingConfig(num_workers=2) ) trainer.fit() For more details, see the :ref:`Ray Train user guide `. .. _transform_pytorch: Transformations with Torch tensors ---------------------------------- Transformations applied with `map` or `map_batches` can return Torch tensors. .. caution:: Under the hood, Ray Data automatically converts Torch tensors to NumPy arrays. Subsequent transformations accept NumPy arrays as input, not Torch tensors. .. tab-set:: .. tab-item:: map .. testcode:: from typing import Dict import numpy as np import torch import ray ds = ray.data.read_images("s3://anonymous@ray-example-data/image-datasets/simple") def convert_to_torch(row: Dict[str, np.ndarray]) -> Dict[str, torch.Tensor]: return {"tensor": torch.as_tensor(row["image"])} # The tensor gets converted into a Numpy array under the hood transformed_ds = ds.map(convert_to_torch) print(transformed_ds.schema()) # Subsequent transformations take in Numpy array as input. def check_numpy(row: Dict[str, np.ndarray]): assert isinstance(row["tensor"], np.ndarray) return row transformed_ds.map(check_numpy).take_all() .. testoutput:: Column Type ------ ---- tensor ArrowTensorTypeV2(shape=(32, 32, 3), dtype=uint8) .. 
tab-item:: map_batches .. testcode:: from typing import Dict import numpy as np import torch import ray ds = ray.data.read_images("s3://anonymous@ray-example-data/image-datasets/simple") def convert_to_torch(batch: Dict[str, np.ndarray]) -> Dict[str, torch.Tensor]: return {"tensor": torch.as_tensor(batch["image"])} # The tensor gets converted into a Numpy array under the hood transformed_ds = ds.map_batches(convert_to_torch, batch_size=2) print(transformed_ds.schema()) # Subsequent transformations take in Numpy array as input. def check_numpy(batch: Dict[str, np.ndarray]): assert isinstance(batch["tensor"], np.ndarray) return batch transformed_ds.map_batches(check_numpy, batch_size=2).take_all() .. testoutput:: Column Type ------ ---- tensor ArrowTensorTypeV2(shape=(32, 32, 3), dtype=uint8) For more information on transforming data, see :ref:`Transforming data `. Built-in PyTorch transforms ~~~~~~~~~~~~~~~~~~~~~~~~~~~ You can use built-in Torch transforms from ``torchvision``, ``torchtext``, and ``torchaudio``. .. tab-set:: .. tab-item:: torchvision .. testcode:: from typing import Dict import numpy as np import torch from torchvision import transforms import ray # Create the Dataset. ds = ray.data.read_images("s3://anonymous@ray-example-data/image-datasets/simple") # Define the torchvision transform. transform = transforms.Compose( [ transforms.ToTensor(), transforms.CenterCrop(10) ] ) # Define the map function def transform_image(row: Dict[str, np.ndarray]) -> Dict[str, torch.Tensor]: row["transformed_image"] = transform(row["image"]) return row # Apply the transform over the dataset. transformed_ds = ds.map(transform_image) print(transformed_ds.schema()) .. testoutput:: Column Type ------ ---- image ArrowTensorTypeV2(shape=(32, 32, 3), dtype=uint8) transformed_image ArrowTensorTypeV2(shape=(3, 10, 10), dtype=float) .. tab-item:: torchtext .. testcode:: from typing import Dict, List import numpy as np from torchtext import transforms import ray # Create the Dataset. ds = ray.data.read_text("s3://anonymous@ray-example-data/simple.txt") # Define the torchtext transform. VOCAB_FILE = "https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt" transform = transforms.BERTTokenizer(vocab_path=VOCAB_FILE, do_lower_case=True, return_tokens=True) # Define the map_batches function. def tokenize_text(batch: Dict[str, np.ndarray]) -> Dict[str, List[str]]: batch["tokenized_text"] = transform(list(batch["text"])) return batch # Apply the transform over the dataset. transformed_ds = ds.map_batches(tokenize_text, batch_size=2) print(transformed_ds.schema()) .. testoutput:: Column Type ------ ---- text string tokenized_text list .. _batch_inference_pytorch: Batch inference with PyTorch ---------------------------- With Ray Datasets, you can do scalable offline batch inference with Torch models by mapping a pre-trained model over your data. .. testcode:: from typing import Dict import numpy as np import torch import torch.nn as nn import ray # Step 1: Create a Ray Dataset from in-memory Numpy arrays. # You can also create a Ray Dataset from many other sources and file # formats. ds = ray.data.from_numpy(np.ones((1, 100))) # Step 2: Define a Predictor class for inference. # Use a class to initialize the model just once in `__init__` # and reuse it for inference across multiple batches. class TorchPredictor: def __init__(self): # Load a dummy neural network. # Set `self.model` to your pre-trained PyTorch model. 
self.model = nn.Sequential( nn.Linear(in_features=100, out_features=1), nn.Sigmoid(), ) self.model.eval() # Logic for inference on 1 batch of data. def __call__(self, batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]: tensor = torch.as_tensor(batch["data"], dtype=torch.float32) with torch.inference_mode(): # Get the predictions from the input batch. return {"output": self.model(tensor).numpy()} # Step 3: Map the Predictor over the Dataset to get predictions. # Use 2 parallel actors for inference. Each actor predicts on a # different partition of data. predictions = ds.map_batches(TorchPredictor, compute=ray.data.ActorPoolStrategy(size=2)) # Step 4: Show one prediction output. predictions.show(limit=1) .. testoutput:: :options: +MOCK {'output': array([0.5590901], dtype=float32)} For more details, see the :ref:`Batch inference user guide `. .. _saving_pytorch: Saving Datasets containing Torch tensors ---------------------------------------- Datasets containing Torch tensors can be saved to files, like parquet or NumPy. For more information on saving data, read :ref:`Saving data `. .. caution:: Torch tensors that are on GPU devices can't be serialized and written to disk. Convert the tensors to CPU (``tensor.to("cpu")``) before saving the data. .. tab-set:: .. tab-item:: Parquet .. testcode:: import torch import ray tensor = torch.Tensor(1) ds = ray.data.from_items([{"tensor": tensor}]) ds.write_parquet("local:///tmp/tensor") .. tab-item:: Numpy .. testcode:: import torch import ray tensor = torch.Tensor(1) ds = ray.data.from_items([{"tensor": tensor}]) ds.write_numpy("local:///tmp/tensor", column="tensor") .. _migrate_pytorch: Migrating from PyTorch Datasets and DataLoaders ----------------------------------------------- If you're currently using PyTorch Datasets and DataLoaders, you can migrate to Ray Data for working with distributed datasets. PyTorch Datasets are replaced by the :class:`Dataset ` abstraction, and the PyTorch DataLoader is replaced by :meth:`Dataset.iter_torch_batches() `. Built-in PyTorch Datasets ~~~~~~~~~~~~~~~~~~~~~~~~~ If you are using built-in PyTorch datasets, for example from ``torchvision``, these can be converted to a Ray Dataset using the :meth:`from_torch() ` API. .. testcode:: import torchvision import ray mnist = torchvision.datasets.MNIST(root="/tmp/", download=True) ds = ray.data.from_torch(mnist) # The data for each item of the Torch dataset is under the "item" key. print(ds.schema()) .. The following `testoutput` is mocked to avoid illustrating download logs like "Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz". .. testoutput:: :options: +MOCK Column Type ------ ---- item Custom PyTorch Datasets ~~~~~~~~~~~~~~~~~~~~~~~ If you have a custom PyTorch Dataset, you can migrate to Ray Data by converting the logic in ``__getitem__`` to Ray Data read and transform operations. Any logic for reading data from cloud storage and disk can be replaced by one of the Ray Data ``read_*`` APIs, and any transformation logic can be applied as a :meth:`map ` call on the Dataset. The following example shows a custom PyTorch Dataset, and what the analogous would look like with Ray Data. .. note:: Unlike PyTorch Map-style datasets, Ray Datasets aren't indexable. .. tab-set:: .. tab-item:: PyTorch Dataset .. 
testcode:: import tempfile import boto3 from botocore import UNSIGNED from botocore.config import Config from torchvision import transforms from torch.utils.data import Dataset from PIL import Image class ImageDataset(Dataset): def __init__(self, bucket_name: str, dir_path: str): self.s3 = boto3.resource("s3", config=Config(signature_version=UNSIGNED)) self.bucket = self.s3.Bucket(bucket_name) self.files = [obj.key for obj in self.bucket.objects.filter(Prefix=dir_path)] self.transform = transforms.Compose([ transforms.ToTensor(), transforms.Resize((128, 128)), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)) ]) def __len__(self): return len(self.files) def __getitem__(self, idx): img_name = self.files[idx] # Infer the label from the file name. last_slash_idx = img_name.rfind("/") dot_idx = img_name.rfind(".") label = int(img_name[last_slash_idx+1:dot_idx]) # Download the S3 file locally. obj = self.bucket.Object(img_name) tmp = tempfile.NamedTemporaryFile() tmp_name = "{}.jpg".format(tmp.name) with open(tmp_name, "wb") as f: obj.download_fileobj(f) f.flush() f.close() image = Image.open(tmp_name) # Preprocess the image. image = self.transform(image) return image, label dataset = ImageDataset(bucket_name="ray-example-data", dir_path="batoidea/JPEGImages/") .. tab-item:: Ray Data .. testcode:: import torchvision import ray ds = ray.data.read_images("s3://anonymous@ray-example-data/batoidea/JPEGImages", include_paths=True) # Extract the label from the file path. def extract_label(row: dict): filepath = row["path"] last_slash_idx = filepath.rfind("/") dot_idx = filepath.rfind('.') label = int(filepath[last_slash_idx+1:dot_idx]) row["label"] = label return row transform = transforms.Compose([ transforms.ToTensor(), transforms.Resize((128, 128)), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)) ]) # Preprocess the images. def transform_image(row: dict): row["transformed_image"] = transform(row["image"]) return row # Map the transformations over the dataset. ds = ds.map(extract_label).map(transform_image) PyTorch DataLoader ~~~~~~~~~~~~~~~~~~ The PyTorch DataLoader can be replaced by calling :meth:`Dataset.iter_torch_batches() ` to iterate over batches of the dataset. The following table describes how the arguments for PyTorch DataLoader map to Ray Data. Note the behavior may not necessarily be identical. For exact semantics and usage, see the API reference for :meth:`iter_torch_batches() `. .. list-table:: :header-rows: 1 * - PyTorch DataLoader arguments - Ray Data API * - ``batch_size`` - ``batch_size`` argument to :meth:`ds.iter_torch_batches() ` * - ``shuffle`` - ``local_shuffle_buffer_size`` argument to :meth:`ds.iter_torch_batches() ` * - ``collate_fn`` - ``collate_fn`` argument to :meth:`ds.iter_torch_batches() ` * - ``sampler`` - Not supported. Can be manually implemented after iterating through the dataset with :meth:`ds.iter_torch_batches() `. * - ``batch_sampler`` - Not supported. Can be manually implemented after iterating through the dataset with :meth:`ds.iter_torch_batches() `. * - ``drop_last`` - ``drop_last`` argument to :meth:`ds.iter_torch_batches() ` * - ``num_workers`` - Use ``prefetch_batches`` argument to :meth:`ds.iter_torch_batches() ` to indicate how many batches to prefetch. The number of prefetching threads are automatically configured according to ``prefetch_batches``. * - ``prefetch_factor`` - Use ``prefetch_batches`` argument to :meth:`ds.iter_torch_batches() ` to indicate how many batches to prefetch. 
The number of prefetching threads are automatically configured according to ``prefetch_batches``. * - ``pin_memory`` - Pass in ``device`` to :meth:`ds.iter_torch_batches() ` to get tensors that have already been moved to the correct device. --- .. _working_with_tensors: Working with Tensors / NumPy ============================ N-dimensional arrays (in other words, tensors) are ubiquitous in ML workloads. This guide describes the limitations and best practices of working with such data. Tensor data representation -------------------------- Ray Data represents tensors as `NumPy ndarrays `__. .. testcode:: import ray ds = ray.data.read_images("s3://anonymous@air-example-data/digits") print(ds) .. testoutput:: Dataset(num_rows=100, schema=...) Batches of fixed-shape tensors ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ If your tensors have a fixed shape, Ray Data represents batches as regular ndarrays. .. doctest:: >>> import ray >>> ds = ray.data.read_images("s3://anonymous@air-example-data/digits") >>> batch = ds.take_batch(batch_size=32) >>> batch["image"].shape (32, 28, 28) >>> batch["image"].dtype dtype('uint8') Batches of variable-shape tensors ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ If your tensors vary in shape, Ray Data represents batches as arrays of object dtype. .. doctest:: >>> import ray >>> ds = ray.data.read_images("s3://anonymous@air-example-data/AnimalDetection") >>> batch = ds.take_batch(batch_size=32) >>> batch["image"].shape (32,) >>> batch["image"].dtype dtype('O') The individual elements of these object arrays are regular ndarrays. .. doctest:: >>> batch["image"][0].dtype dtype('uint8') >>> batch["image"][0].shape # doctest: +SKIP (375, 500, 3) >>> batch["image"][3].shape # doctest: +SKIP (333, 465, 3) .. _transforming_tensors: Transforming tensor data ------------------------ Call :meth:`~ray.data.Dataset.map` or :meth:`~ray.data.Dataset.map_batches` to transform tensor data. .. testcode:: from typing import Any, Dict import ray import numpy as np ds = ray.data.read_images("s3://anonymous@air-example-data/AnimalDetection") def increase_brightness(row: Dict[str, Any]) -> Dict[str, Any]: row["image"] = np.clip(row["image"] + 4, 0, 255) return row # Increase the brightness, record at a time. ds.map(increase_brightness) def batch_increase_brightness(batch: Dict[str, np.ndarray]) -> Dict: batch["image"] = np.clip(batch["image"] + 4, 0, 255) return batch # Increase the brightness, batch at a time. ds.map_batches(batch_increase_brightness) In addition to NumPy ndarrays, Ray Data also treats returned lists of NumPy ndarrays and objects implementing ``__array__`` (for example, ``torch.Tensor``) as tensor data. For more information on transforming data, read :ref:`Transforming data `. Saving tensor data ------------------ Save tensor data with formats like Parquet, NumPy, and JSON. To view all supported formats, see the :ref:`Input/Output reference `. .. tab-set:: .. tab-item:: Parquet Call :meth:`~ray.data.Dataset.write_parquet` to save data in Parquet files. .. testcode:: import ray ds = ray.data.read_images("s3://anonymous@ray-example-data/image-datasets/simple") ds.write_parquet("/tmp/simple") .. tab-item:: NumPy Call :meth:`~ray.data.Dataset.write_numpy` to save an ndarray column in NumPy files. .. testcode:: import ray ds = ray.data.read_images("s3://anonymous@ray-example-data/image-datasets/simple") ds.write_numpy("/tmp/simple", column="image") .. tab-item:: JSON To save images in a JSON file, call :meth:`~ray.data.Dataset.write_json`. .. 
testcode:: import ray ds = ray.data.read_images("s3://anonymous@ray-example-data/image-datasets/simple") ds.write_json("/tmp/simple") For more information on saving data, read :ref:`Saving data `. --- Working with Text ================= With Ray Data, you can easily read and transform large amounts of text data. This guide shows you how to: * :ref:`Read text files ` * :ref:`Transform text data ` * :ref:`Perform inference on text data ` * :ref:`Save text data ` .. _reading-text-files: Reading text files ------------------ Ray Data can read lines of text and JSONL. Alternatively, you can read raw binary files and manually decode data. .. tab-set:: .. tab-item:: Text lines To read lines of text, call :func:`~ray.data.read_text`. Ray Data creates a row for each line of text. In the schema, the column name defaults to "text". .. testcode:: import ray ds = ray.data.read_text("s3://anonymous@ray-example-data/this.txt") ds.show(3) .. testoutput:: {'text': 'The Zen of Python, by Tim Peters'} {'text': 'Beautiful is better than ugly.'} {'text': 'Explicit is better than implicit.'} .. tab-item:: JSON Lines `JSON Lines `_ is a text format for structured data. It's typically used to process data one record at a time. To read JSON Lines files, call :func:`~ray.data.read_json`. Ray Data creates a row for each JSON object. .. testcode:: import ray ds = ray.data.read_json("s3://anonymous@ray-example-data/logs.json") ds.show(3) .. testoutput:: {'timestamp': datetime.datetime(2022, 2, 8, 15, 43, 41), 'size': 48261360} {'timestamp': datetime.datetime(2011, 12, 29, 0, 19, 10), 'size': 519523} {'timestamp': datetime.datetime(2028, 9, 9, 5, 6, 7), 'size': 2163626} .. tab-item:: Other formats To read other text formats, call :func:`~ray.data.read_binary_files`. Then, call :meth:`~ray.data.Dataset.map` to decode your data. .. testcode:: from typing import Any, Dict from bs4 import BeautifulSoup import ray def parse_html(row: Dict[str, Any]) -> Dict[str, Any]: html = row["bytes"].decode("utf-8") soup = BeautifulSoup(html, features="html.parser") return {"text": soup.get_text().strip()} ds = ( ray.data.read_binary_files("s3://anonymous@ray-example-data/index.html") .map(parse_html) ) ds.show() .. testoutput:: {'text': 'Batoidea\nBatoidea is a superorder of cartilaginous fishes...'} For more information on reading files, see :ref:`Loading data `. .. _transforming-text: Transforming text ----------------- To transform text, implement your transformation in a function or callable class. Then, call :meth:`Dataset.map() ` or :meth:`Dataset.map_batches() `. Ray Data transforms your text in parallel. .. testcode:: from typing import Any, Dict import ray def to_lower(row: Dict[str, Any]) -> Dict[str, Any]: row["text"] = row["text"].lower() return row ds = ( ray.data.read_text("s3://anonymous@ray-example-data/this.txt") .map(to_lower) ) ds.show(3) .. testoutput:: {'text': 'the zen of python, by tim peters'} {'text': 'beautiful is better than ugly.'} {'text': 'explicit is better than implicit.'} For more information on transforming data, see :ref:`Transforming data `. .. _performing-inference-on-text: Performing inference on text ---------------------------- To perform inference with a pre-trained model on text data, implement a callable class that sets up and invokes a model. Then, call :meth:`Dataset.map_batches() `. .. 
testcode:: from typing import Dict import numpy as np from transformers import pipeline import ray class TextClassifier: def __init__(self): self.model = pipeline("text-classification") def __call__(self, batch: Dict[str, np.ndarray]) -> Dict[str, list]: predictions = self.model(list(batch["text"])) batch["label"] = [prediction["label"] for prediction in predictions] return batch ds = ( ray.data.read_text("s3://anonymous@ray-example-data/this.txt") .map_batches(TextClassifier, compute=ray.data.ActorPoolStrategy(size=2)) ) ds.show(3) .. testoutput:: {'text': 'The Zen of Python, by Tim Peters', 'label': 'POSITIVE'} {'text': 'Beautiful is better than ugly.', 'label': 'POSITIVE'} {'text': 'Explicit is better than implicit.', 'label': 'POSITIVE'} For more information on handling large language models, see :ref:`Working with LLMs `. For more information on performing inference, see :ref:`End-to-end: Offline Batch Inference ` and :ref:`Stateful Transforms `. .. _saving-text: Saving text ----------- To save text, call a method like :meth:`~ray.data.Dataset.write_parquet`. Ray Data can save text in many formats. To view the full list of supported file formats, see the :ref:`Input/Output reference `. .. testcode:: import ray ds = ray.data.read_text("s3://anonymous@ray-example-data/this.txt") ds.write_parquet("local:///tmp/results") For more information on saving data, see :ref:`Saving data `. --- :html_theme.sidebar_secondary.remove: .. title:: Welcome to Ray! .. toctree:: :hidden: Overview Getting Started Installation Use Cases Examples Ecosystem Ray Core Ray Data Ray Train Ray Tune Ray Serve Ray RLlib More Libraries Ray Clusters Monitoring and Debugging Developer Guides Glossary Security --- Deploying Ray for ML platforms ============================== This page describes how you might use or deploy Ray in your infrastructure. There are two main deployment patterns -- pick and choose, and within existing platforms. The core idea is that Ray can be **complementary** to your existing infrastructure and integration tools. Design Principles ----------------- * Ray and its libraries handle the heavyweight compute aspects of AI apps and services. * Ray relies on external integrations (e.g., Tecton, MLFlow, W&B) for Storage and Tracking. * Workflow Orchestrators (e.g., AirFlow) are an optional component that can be used for scheduling recurring jobs, launching new Ray clusters for jobs, and running non-Ray compute steps. * Lightweight orchestration of task graphs within a single Ray app can be handled using Ray tasks. * Ray libraries can be used independently, within an existing ML platform, or to build a Ray-native ML platform. Pick and choose your own libraries ---------------------------------- You can pick and choose which Ray AI libraries you want to use. This is applicable if you are an ML engineer who wants to independently use a Ray library for a specific AI app or service use case and do not need to integrate with existing ML platforms. For example, Alice wants to use RLlib to train models for her work project. Bob wants to use Ray Serve to deploy his model pipeline. In both cases, Alice and Bob can leverage these libraries independently without any coordination. This scenario describes most usages of Ray libraries today. .. https://docs.google.com/drawings/d/1DcrchNda9m_3MH45NuhgKY49ZCRtj2Xny5dgY0X9PCA/edit .. 
image:: /images/air_arch_1.svg In the above diagram: * Only one library is used -- showing that you can pick and choose and do not need to replace all of your ML infrastructure to use Ray. * You can use one of :ref:`Ray's many deployment modes ` to launch and manage Ray clusters and Ray applications. * Ray AI libraries can read data from external storage systems such as Amazon S3 / Google Cloud Storage, as well as store results there. Existing ML Platform integration -------------------------------- You may already have an existing machine learning platform but want to use some subset of Ray's ML libraries. For example, an ML engineer wants to use Ray within the ML Platform their organization has purchased (e.g., SageMaker, Vertex). Ray can complement existing machine learning platforms by integrating with existing pipeline/workflow orchestrators, storage, and tracking services, without requiring a replacement of your entire ML platform. .. image:: images/air_arch_2.png In the above diagram: 1. A workflow orchestrator such as AirFlow, Oozie, SageMaker Pipelines, etc. is responsible for scheduling and creating Ray clusters and running Ray apps and services. The Ray application may be part of a larger orchestrated workflow (e.g., Spark ETL, then Training on Ray). 2. Lightweight orchestration of task graphs can be handled entirely within Ray. External workflow orchestrators will integrate nicely but are only needed if running non-Ray steps. 3. Ray clusters can also be created for interactive use (e.g., Jupyter notebooks, Google Colab, Databricks Notebooks, etc.). 4. Ray Train, Data, and Serve provide integration with Feature Stores like Feast for Training and Serving. 5. Ray Train and Tune provide integration with tracking services such as MLFlow and Weights & Biases. --- .. _ray-for-ml-infra: Ray for ML Infrastructure ========================= .. tip:: We'd love to hear from you if you are using Ray to build an ML platform! Fill out `this short form `__ to get involved. Ray and its AI libraries provide a unified compute runtime for teams looking to simplify their ML platform. Ray's libraries such as Ray Train, Ray Data, and Ray Serve can be used to compose end-to-end ML workflows, providing features and APIs for data preprocessing as part of training, and transitioning from training to serving. .. https://docs.google.com/drawings/d/1PFA0uJTq7SDKxzd7RHzjb5Sz3o1WvP13abEJbD0HXTE/edit .. image:: /images/ray-air.svg Why Ray for ML Infrastructure? ------------------------------ Ray's AI libraries simplify the ecosystem of machine learning frameworks, platforms, and tools, by providing a seamless, unified, and open experience for scalable ML: .. image:: images/why-air-2.svg .. https://docs.google.com/drawings/d/1oi_JwNHXVgtR_9iTdbecquesUd4hOk0dWgHaTaFj6gk/edit **1. Seamless Dev to Prod**: Ray's AI libraries reduce friction going from development to production. With Ray and its libraries, the same Python code scales seamlessly from a laptop to a large cluster. **2. Unified ML API and Runtime**: Ray's APIs enable swapping between popular frameworks, such as XGBoost, PyTorch, and Hugging Face, with minimal code changes. Everything from training to serving runs on a single runtime (Ray + KubeRay). **3. Open and Extensible**: Ray is fully open-source and can run on any cluster, cloud, or Kubernetes. Build custom components and integrations on top of scalable developer APIs. Example ML Platforms built on Ray --------------------------------- `Merlin `_ is Shopify's ML platform built on Ray. 
It enables fast-iteration and `scaling of distributed applications `_ such as product categorization and recommendations. .. figure:: /images/shopify-workload.png Shopify's Merlin architecture built on Ray. Spotify `uses Ray for advanced applications `_ that include personalizing content recommendations for home podcasts, and personalizing Spotify Radio track sequencing. .. figure:: /images/spotify.png How Ray ecosystem empowers ML scientists and engineers at Spotify. The following highlights feature companies leveraging Ray's unified API to build simpler, more flexible ML platforms. - `[Blog] The Magic of Merlin - Shopify's New ML Platform `_ - `[Slides] Large Scale Deep Learning Training and Tuning with Ray `_ - `[Blog] Griffin: How Instacart’s ML Platform Tripled in a year `_ - `[Talk] Predibase - A low-code deep learning platform built for scale `_ - `[Blog] Building a ML Platform with Kubeflow and Ray on GKE `_ - `[Talk] Ray Summit Panel - ML Platform on Ray `_ .. Deployments on Ray. .. include:: /ray-air/deployment.rst --- .. _api-policy: API Policy ============= Ray APIs refer to classes, class methods, or functions. When we declare an API, we promise our users that they can use these APIs to develop their apps without worrying about changes to these interfaces between different Ray releases. Declaring or deprecating an API has a significant impact on the community. This document proposes simple policies to hold Ray contributors accountable to these promises and manage user expectations. For API exposure levels, see :ref:`API Stability `. API documentation policy ~~~~~~~~~~~~~~~~~~~~~~~~ Documentation is one of the main channels through which we expose our APIs to users. If we provide incorrect information, it can significantly impact the reliability and maintainability of our users' applications. Based on the API exposure level, here is the policy to ensure the accuracy of our information. .. list-table:: API Documentation Policy :widths: 20 16 16 16 16 16 :header-rows: 1 * - Policy/Exposure Level - Stable Public API - Beta Public API - Alpha Public API - Deprecated - Developer API * - Must this API be documented? - Yes - Yes - Yes - Yes - Up to the developers * - Must a method be annotated with one of the API annotations (PublicAPI, DeveloperAPI or Deprecated)? - Yes - Yes - Yes - Yes - No. The absence of annotations implies the Developer API level by default. * - Can this API be private (either inside the _internal module or has an underscore prefix)? - No - No - No - No - No API Lifecycle Policy ~~~~~~~~~~~~~~~~~~~~ Users have high expectations for certain exposure levels, so we need to be cautious when moving APIs between different levels. Here is the policy for managing the API exposure lifecycle. .. list-table:: API Lifecycle Policy :widths: 20 16 16 16 16 16 :header-rows: 1 * - Policy/Exposure Level - Stable Public API - Beta Public API - Alpha Public API - Deprecated API - Developer API * - Can this API be promoted to a higher level without any warnings, heads up to users? - Yes - Yes - Yes - No - Yes * - Can this API be demoted to a lower level? If so then how? - Can be demoted to Deprecated only. The API should emit warning messages and a deadline for deprecations in **6 months (or +25 ray minor versions)**. - Can be demoted to Deprecated only. The API should emit warning messages and a deadline for deprecations in **3 months (or +12 ray minor versions)**. - Users must allow for and expect breaking changes in alpha components, and must have no expectations of stability. 
- Yes - No annotations mean it is a developer API by default * - Can you remove or change this API's parameters? - Yes. The API should emit warning messages and you must set a deadline for the end-of-life of the original version that is **6 months or +25 Ray minor versions**. During the transition period, you must support both the new and old parameters. - Yes. The API should emit warning messages and you must set a deadline for the change in **3 months or +12 Ray minor versions**. During the transition period, you must support both the new and old parameters. - Users must allow for and expect breaking changes in alpha components, and must have no expectations of stability. - No - Yes --- CI Testing Workflow on PRs ========================== This guide helps contributors to understand the Continuous Integration (CI) workflow on a PR. Here CI stands for the automated testing of the codebase on the PR. `microcheck`: default tests on your PR -------------------------------------- With every commit on your PR, by default, we'll run a set of tests called `microcheck`. These tests are designed to be 90% accurate at catching bugs on your PR while running only 10% of the full test suite. As a result, microcheck typically finishes twice as fast and twice cheaper than the full test suite. Some of the notable features of microcheck are: * If a new test is added or an existing test is modified in a pull request, microcheck will ensure these tests are included. * You can manually add more tests to microcheck by including the following line in the body of your git commit message: `@microcheck TEST_TARGET01 TEST_TARGET02 ....`. This line must be in the body of your message, starting from the second line or below (the first line is the commit message title). For example, here is how I manually add tests in my pull request:: // git command to add commit message git commit -a -s // content of the commit message run other serve doc tests @microcheck //doc:source/serve/doc_code/distilbert //doc:source/serve/doc_code/object_detection //doc:source/serve/doc_code/stable_diffusion Signed-off-by: can If microcheck passes, you'll see a green checkmark on your PR. If it fails, you'll see a red cross. In either case, you'll see a summary of the test run statuses in the github UI. Additional tests at merge time ------------------------------ In this workflow, to merge your PR, simply click on the Enable auto-merge button (or ask a committer to do so). This will trigger additional test cases, and the PR will merge automatically once they finish and pass. Alternatively, you can also add a `go` label to manually trigger the full test suite on your PR (be mindful that this is less recommended but we understand you know best about the need of your PR). While we anticipate this will be rarely needed, if you do require it constantly, please let us know. We are continuously improving the effectiveness of microcheck. --- Debugging for Ray Developers ============================ This debugging guide is for contributors to the Ray project. Starting processes in a debugger -------------------------------- When processes are crashing, it is often useful to start them in a debugger. Ray currently allows processes to be started in the following: - valgrind - the valgrind profiler - the perftools profiler - gdb - tmux To use any of these tools, please make sure that you have them installed on your machine first (``gdb`` and ``valgrind`` on MacOS are known to have issues). 
Then, you can launch a subset of ray processes by adding the environment variable ``RAY_{PROCESS_NAME}_{DEBUGGER}=1``. For instance, if you wanted to start the raylet in ``valgrind``, then you simply need to set the environment variable ``RAY_RAYLET_VALGRIND=1``. To start a process inside of ``gdb``, the process must also be started inside of ``tmux``. So if you want to start the raylet in ``gdb``, you would start your Python script with the following: .. code-block:: bash RAY_RAYLET_GDB=1 RAY_RAYLET_TMUX=1 python You can then list the ``tmux`` sessions with ``tmux ls`` and attach to the appropriate one. You can also get a core dump of the ``raylet`` process, which is especially useful when filing `issues`_. The process to obtain a core dump is OS-specific, but usually involves running ``ulimit -c unlimited`` before starting Ray to allow core dump files to be written. .. _backend-logging: Backend logging --------------- The ``raylet`` process logs detailed information about events like task execution and object transfers between nodes. To set the logging level at runtime, you can set the ``RAY_BACKEND_LOG_LEVEL`` environment variable before starting Ray. For example, you can do: .. code-block:: shell export RAY_BACKEND_LOG_LEVEL=debug ray start This will print any ``RAY_LOG(DEBUG)`` lines in the source code to the ``raylet.err`` file, which you can find in :ref:`temp-dir-log-files`. If it worked, you should see as the first line in ``raylet.err``: .. code-block:: shell logging.cc:270: Set ray log level from environment variable RAY_BACKEND_LOG_LEVEL to -1 (-1 is defined as RayLogLevel::DEBUG in logging.h.) .. literalinclude:: /../../src/ray/util/logging.h :language: C :lines: 113,120 Backend event stats ------------------- The ``raylet`` process also periodically dumps event stats to ``debug_state.txt`` and its log file if ``RAY_event_stats=1`` environment variable is set. To alter the interval at which Ray writes stats to log files, you can set ``RAY_event_stats_print_interval_ms``. Event stats include ASIO event handlers, periodic timers, and RPC handlers. Here is a sample of what the event stats look like: .. code-block:: shell Event stats: Global stats: 739128 total (27 active) Queueing time: mean = 47.402 ms, max = 1372.219 s, min = -0.000 s, total = 35035.892 s Execution time: mean = 36.943 us, total = 27.306 s Handler stats: ClientConnection.async_read.ReadBufferAsync - 241173 total (19 active), CPU time: mean = 9.999 us, total = 2.411 s ObjectManager.ObjectAdded - 61215 total (0 active), CPU time: mean = 43.953 us, total = 2.691 s CoreWorkerService.grpc_client.AddObjectLocationOwner - 61204 total (0 active), CPU time: mean = 3.860 us, total = 236.231 ms CoreWorkerService.grpc_client.GetObjectLocationsOwner - 51333 total (0 active), CPU time: mean = 25.166 us, total = 1.292 s ObjectManager.ObjectDeleted - 43188 total (0 active), CPU time: mean = 26.017 us, total = 1.124 s CoreWorkerService.grpc_client.RemoveObjectLocationOwner - 43177 total (0 active), CPU time: mean = 2.368 us, total = 102.252 ms NodeManagerService.grpc_server.PinObjectIDs - 40000 total (0 active), CPU time: mean = 194.860 us, total = 7.794 s Callback latency injection -------------------------- Sometimes, bugs are caused by RPC issues, for example, due to the delay of some requests, the system goes to a deadlock. To debug and reproduce this kind of issue, we need to have a way to inject latency for the RPC request. To enable this, ``RAY_testing_asio_delay_us`` is introduced. 
If you'd like to make the callback of some RPC requests be executed after some time, you can do it with this variable. For example: .. code-block:: shell RAY_testing_asio_delay_us="NodeManagerService.grpc_client.PrepareBundleResources=2000000:2000000" ray start --head The syntax for this is ``RAY_testing_asio_delay_us="method1=min_us:max_us,method2=min_us:max_us"``. Entries are comma separated. There is a special method ``*`` which means all methods. It has a lower priority compared with other entries. .. _`issues`: https://github.com/ray-project/ray/issues --- .. _building-ray: Building Ray from Source ========================= To contribute to the Ray repository, follow the instructions below to build from the latest master branch. .. tip:: If you are only editing Python files, follow instructions for :ref:`python-develop` to avoid long build times. If you already followed the instructions in :ref:`python-develop` and want to switch to the Full build in this section, you will need to first uninstall. .. contents:: :local: Fork the Ray repository ----------------------- Forking an open source repository is a best practice when looking to contribute, as it allows you to make and test changes without affecting the original project, ensuring a clean and organized collaboration process. You can propose changes to the main project by submitting a pull request to the main project's repository. 1. Navigate to the `Ray GitHub repository `_. 2. Follow these `GitHub instructions `_, and do the following: a. `Fork the repo `_ using your preferred method. b. `Clone `_ to your local machine. c. `Connect your repo `_ to the upstream (main project) Ray repo to sync changes. Prepare a Python virtual environment ------------------------------------ Create a virtual environment to prevent version conflicts and to develop with an isolated, project-specific Python setup. .. tab-set:: .. tab-item:: conda Set up a ``conda`` environment named ``myenv``: .. code-block:: shell conda create -c conda-forge python=3.10 -n myenv Activate your virtual environment to tell the shell/terminal to use this particular Python: .. code-block:: shell conda activate myenv You need to activate the virtual environment every time you start a new shell/terminal to work on Ray. .. tab-item:: venv Use Python's integrated ``venv`` module to create a virtual environment called ``myenv`` in the current directory: .. code-block:: shell python -m venv myenv This contains a directory with all the packages used by the local Python of your project. You only need to do this step once. Activate your virtual environment to tell the shell/terminal to use this particular Python: .. code-block:: shell source myenv/bin/activate You need to activate the virtual environment every time you start a new shell/terminal to work on Ray. Creating a new virtual environment can come with older versions of ``pip`` and ``wheel``. To avoid problems when you install packages, use the module ``pip`` to install the latest version of ``pip`` (itself) and ``wheel``: .. code-block:: shell python -m pip install --upgrade pip wheel .. _python-develop: Building Ray (Python Only) -------------------------- .. note:: Unless otherwise stated, directory and file paths are relative to the project root directory. RLlib, Tune, Autoscaler, and most Python files do not require you to build and compile Ray. Follow these instructions to develop Ray's Python files locally without building Ray. 1. Make sure you have a clone of Ray's git repository as explained above. 2. 
Make sure you activate the Python (virtual) environment as described above. 3. Pip install the **latest Ray wheels.** See :ref:`install-nightlies` for instructions. .. code-block:: shell # For example, for Python 3.10: pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp310-cp310-manylinux2014_x86_64.whl 4. Replace Python files in the installed package with your local editable copy. We provide a simple script to help you do this: ``python python/ray/setup-dev.py``. Running the script will remove the ``ray/tune``, ``ray/rllib``, ``ray/autoscaler`` dir (among other directories) bundled with the ``ray`` pip package, and replace them with links to your local code. This way, changing files in your git clone will directly affect the behavior of your installed Ray. .. code-block:: shell # This replaces `/site-packages/ray/` # with your local `ray/python/ray/`. python python/ray/setup-dev.py .. note:: [Advanced] You can also optionally skip creating symbolic link for directories of your choice. .. code-block:: shell # This links all folders except "_private" and "dashboard" without user prompt. python python/ray/setup-dev.py -y --skip _private dashboard .. warning:: Do not run ``pip uninstall ray`` or ``pip install -U`` (for Ray or Ray wheels) if setting up your environment this way. To uninstall or upgrade, you must first ``rm -rf`` the pip-installation site (usually a directory at the ``site-packages/ray`` location), then do a pip reinstall (see the command above), and finally run the above ``setup-dev.py`` script again. .. code-block:: shell # To uninstall, delete the symlinks first. rm -rf /site-packages/ray # Path will be in the output of `setup-dev.py`. pip uninstall ray # or `pip install -U ` Preparing to build Ray on Linux ------------------------------- .. tip:: If you are only editing Tune/RLlib/Autoscaler files, follow instructions for :ref:`python-develop` to avoid long build times. To build Ray on Ubuntu, run the following commands: .. code-block:: bash sudo apt-get update sudo apt-get install -y build-essential curl clang-12 pkg-config psmisc unzip # Install Bazelisk. ci/env/install-bazel.sh # Install node version manager and node 14 curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.1/install.sh | bash nvm install 14 nvm use 14 .. note:: The `install-bazel.sh` script installs `bazelisk` for building Ray. Note that `bazel` is installed at `$HOME/bin/bazel`; make sure it's on the executable `PATH`. If you prefer to use `bazel`, only version `6.5.0` is currently supported. For RHELv8 (Redhat EL 8.0-64 Minimal), run the following commands: .. code-block:: bash sudo yum groupinstall 'Development Tools' sudo yum install psmisc In RedHat, install Bazel manually from this link: https://bazel.build/versions/6.5.0/install/redhat Preparing to build Ray on MacOS ------------------------------- .. tip:: Assuming you already have Brew and Bazel installed on your mac and you also have grpc and protobuf installed on your mac consider removing those (grpc and protobuf) for smooth build through the commands ``brew uninstall grpc``, ``brew uninstall protobuf``. If you have built the source code earlier and it still fails with errors like ``No such file or directory:``, try cleaning previous builds on your host by running the commands ``brew uninstall binutils`` and ``bazel clean --expunge``. To build Ray on MacOS, first install these dependencies: .. code-block:: bash brew update brew install wget # Install Bazel. 
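    # install-bazel.sh installs bazelisk; see the note in the Linux section
    # above about where the `bazel` launcher ends up and keeping it on your PATH.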
ci/env/install-bazel.sh Building Ray on Linux & MacOS (full) ------------------------------------ Make sure you have a local clone of Ray's git repository as explained above. You will also need to install NodeJS_ to build the dashboard. Enter into the project directory, for example: .. code-block:: shell cd ray Now you can build the dashboard. From inside of your local Ray project directory enter into the dashboard client directory: .. code-block:: bash cd python/ray/dashboard/client Then you can install the dependencies and build the dashboard: .. code-block:: bash npm ci npm run build After that, you can now move back to the top level Ray directory: .. code-block:: shell cd - Now let's build Ray for Python. Make sure you activate any Python virtual (or conda) environment you could be using as described above. Enter into the ``python/`` directory inside of the Ray project directory and install the project with ``pip``: .. code-block:: bash # Install Ray. cd python/ # Install required dependencies. pip install -r requirements.txt # You may need to set the following two env vars if you have a macOS ARM64(M1) platform. # See https://github.com/grpc/grpc/issues/25082 for more details. # export GRPC_PYTHON_BUILD_SYSTEM_OPENSSL=1 # export GRPC_PYTHON_BUILD_SYSTEM_ZLIB=1 pip install -e . --verbose # Add --user if you see a permission denied error. The ``-e`` means "editable", so changes you make to files in the Ray directory will take effect without reinstalling the package. .. warning:: if you run ``python setup.py install``, files will be copied from the Ray directory to a directory of Python packages (``/lib/python3.6/site-packages/ray``). This means that changes you make to files in the Ray directory will not have any effect. .. tip:: If your machine is running out of memory during the build or the build is causing other programs to crash, try adding the following line to ``~/.bazelrc``: ``build --local_ram_resources=HOST_RAM*.5 --local_cpu_resources=4`` The ``build --disk_cache=~/bazel-cache`` option can be useful to speed up repeated builds too. .. note:: Warning: If you run into an error building protobuf, switching from miniforge to anaconda might help. .. _NodeJS: https://nodejs.org Building Ray on Windows (full) ------------------------------ **Requirements** The following links were correct during the writing of this section. In case the URLs changed, search at the organizations' sites. - Bazel 6.5.0 (https://github.com/bazelbuild/bazel/releases/tag/6.5.0) - Microsoft Visual Studio 2019 (or Microsoft Build Tools 2019 - https://visualstudio.microsoft.com/downloads/#build-tools-for-visual-studio-2019) - JDK 15 (https://www.oracle.com/java/technologies/javase-jdk15-downloads.html) - Miniforge 3 (https://github.com/conda-forge/miniforge/blob/main/README.md) - git for Windows, version 2.31.1 or later (https://git-scm.com/download/win) You can also use the included script to install Bazel: .. code-block:: bash # Install Bazel. ray/ci/env/install-bazel.sh # (Windows users: please manually place Bazel in your PATH, and point # BAZEL_SH to MSYS2's Bash: ``set BAZEL_SH=C:\Program Files\Git\bin\bash.exe``) **Steps** 1. Enable Developer mode on Windows 10 systems. This is necessary so git can create symlinks. 1. Open Settings app; 2. Go to "Update & Security"; 3. Go to "For Developers" on the left pane; 4. Turn on "Developer mode". 2. Add the following Miniforge subdirectories to PATH. If Miniforge was installed for all users, the following paths are correct. 
If Miniforge is installed for a single user, adjust the paths accordingly.

   - ``C:\ProgramData\miniforge3``
   - ``C:\ProgramData\miniforge3\Scripts``
   - ``C:\ProgramData\miniforge3\Library\bin``

3. Define an environment variable ``BAZEL_SH`` to point to ``bash.exe``. If git for Windows was installed for all users, bash's path should be ``C:\Program Files\Git\bin\bash.exe``. If git was installed for a single user, adjust the path accordingly.

4. Install Bazel 6.5.0. Go to the Bazel 6.5.0 release web page and download ``bazel-6.5.0-windows-x86_64.exe``. Copy the exe into the directory of your choice. Define an environment variable ``BAZEL_PATH`` pointing to the full exe path (example: ``set BAZEL_PATH=C:\bazel\bazel.exe``). Also add the Bazel directory to the ``PATH`` (example: ``set PATH=%PATH%;C:\bazel``).

5. Download the Ray source code and build it.

.. code-block:: shell

    # cd to the directory under which the ray source tree will be downloaded.
    git clone -c core.symlinks=true https://github.com/ray-project/ray.git
    cd ray\python
    pip install -e . --verbose

Environment variables that influence builds
--------------------------------------------

You can tweak the build with the following environment variables (when running ``pip install -e .`` or ``python setup.py install``):

- ``RAY_BUILD_CORE``: If set and equal to ``1``, the core parts will be built. Defaults to ``1``.
- ``RAY_INSTALL_JAVA``: If set and equal to ``1``, extra build steps will be executed to build the Java portions of the codebase.
- ``RAY_INSTALL_CPP``: If set and equal to ``1``, ``ray-cpp`` will be installed.
- ``RAY_BUILD_REDIS``: If set and equal to ``1``, Redis binaries will be built or fetched. These binaries are only used for testing. Defaults to ``1``.
- ``RAY_DISABLE_EXTRA_CPP``: If set and equal to ``1``, a regular (non-``cpp``) build will not provide some ``cpp`` interfaces.
- ``SKIP_BAZEL_BUILD``: If set and equal to ``1``, no Bazel build steps will be executed.
- ``SKIP_THIRDPARTY_INSTALL_CONDA_FORGE``: If set, setup will skip installation of third-party packages required for the build. This is active on conda-forge, where pip is not used to create a build environment.
- ``RAY_DEBUG_BUILD``: Can be set to ``debug``, ``asan``, or ``tsan``. Any other value will be ignored.
- ``BAZEL_ARGS``: If set, pass a space-separated set of arguments to Bazel. This can be useful for restricting resource usage during builds, for example. See https://bazel.build/docs/user-manual for more information about valid arguments.
- ``IS_AUTOMATED_BUILD``: Used in conda-forge CI to tweak the build for the managed CI machines.
- ``SRC_DIR``: Can be set to the root of the source checkout; defaults to ``None``, which means ``cwd()``.
- ``BAZEL_SH``: Used on Windows to find ``bash.exe``; see above.
- ``BAZEL_PATH``: Used on Windows to find ``bazel.exe``; see above.
- ``MINGW_DIR``: Used on Windows to find ``bazel.exe`` if not found in ``BAZEL_PATH``.

Installing additional dependencies for development
--------------------------------------------------

Dependencies for the linter (``pre-commit``) can be installed with:

.. code-block:: shell

    pip install -c python/requirements_compiled.txt pre-commit
    pre-commit install

Dependencies for running Ray unit tests under ``python/ray/tests`` can be installed with:

.. code-block:: shell

    pip install -c python/requirements_compiled.txt -r python/requirements/test-requirements.txt

Requirement files for running Ray Data / ML library tests are under ``python/requirements/``.
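Before moving on to pre-commit hooks and the test suite, it can be worth sanity-checking that your development install works at all. The following is a minimal, illustrative sketch (not part of any official checklist); it only assumes that one of the installation paths above (a full editable build or the wheel plus ``setup-dev.py`` workflow) has completed.

.. code-block:: python

    import ray

    # Where Python found Ray: with a full editable build (`pip install -e .`) this
    # should point into your clone's `python/ray/` directory. With the wheel +
    # setup-dev.py workflow, the top-level package stays in `site-packages`, with
    # symlinks into your clone for the linked subdirectories.
    print("Ray imported from:", ray.__file__)
    print("Ray version:", ray.__version__)

    # Start a small local instance and run a trivial task to confirm that the
    # core processes and the worker startup path are healthy.
    ray.init(num_cpus=1)

    @ray.remote
    def ping() -> str:
        return "pong"

    assert ray.get(ping.remote()) == "pong"
    ray.shutdown()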
Pre-commit Hooks ---------------- Ray uses pre-commit hooks with `the pre-commit python package `_. The ``.pre-commit-config.yaml`` file configures all the linting and formatting checks. To start using ``pre-commit``: .. code-block:: shell pip install pre-commit pre-commit install This will install pre-commit into the current environment, and enable pre-commit checks every time you commit new code changes with git. To temporarily skip pre-commit checks, use the ``-n`` or ``--no-verify`` flag when committing: .. code-block:: shell git commit -n If you encounter any issues with ``pre-commit``, please `report an issue here`_. .. _report an issue here: https://github.com/ray-project/ray/issues/new?template=bug-report.yml Fast, Debug, and Optimized Builds --------------------------------- Currently, Ray is built with optimizations, which can take a long time and interfere with debugging. To perform fast, debug, or optimized builds, you can run the following (via ``-c`` ``fastbuild``/``dbg``/``opt``, respectively): .. code-block:: shell bazel run -c fastbuild //:gen_ray_pkg This will rebuild Ray with the appropriate options (which may take a while). If you need to build all targets, you can use ``bazel build //:all`` instead of ``bazel run //:gen_ray_pkg``. To make this change permanent, you can add an option such as the following line to your user-level ``~/.bazelrc`` file (not to be confused with the workspace-level ``.bazelrc`` file): .. code-block:: shell build --compilation_mode=fastbuild If you do so, remember to revert this change, unless you want it to affect all of your development in the future. Using ``dbg`` instead of ``fastbuild`` generates more debug information, which can make it easier to debug with a debugger like ``gdb``. Building the Docs ----------------- To learn more about building the docs refer to `Contributing to the Ray Documentation`_. .. _Contributing to the Ray Documentation: https://docs.ray.io/en/master/ray-contribute/docs.html Using a local repository for dependencies ----------------------------------------- If you'd like to build Ray with custom dependencies (for example, with a different version of Cython), you can modify your ``.bzl`` file as follows: .. code-block:: python http_archive( name = "cython", ..., ) if False else native.new_local_repository( name = "cython", build_file = "bazel/BUILD.cython", path = "../cython", ) This replaces the existing ``http_archive`` rule with one that references a sibling of your Ray directory (named ``cython``) using the build file provided in the Ray repository (``bazel/BUILD.cython``). If the dependency already has a Bazel build file in it, you can use ``native.local_repository`` instead, and omit ``build_file``. To test switching back to the original rule, change ``False`` to ``True``. .. _`PR template`: https://github.com/ray-project/ray/blob/master/.github/PULL_REQUEST_TEMPLATE.md Troubleshooting --------------- If importing Ray (``python3 -c "import ray"``) in your development clone results in this error: .. code-block:: python Traceback (most recent call last): File "", line 1, in File ".../ray/python/ray/__init__.py", line 63, in import ray._raylet # noqa: E402 File "python/ray/_raylet.pyx", line 98, in init ray._raylet import ray.memory_monitor as memory_monitor File ".../ray/python/ray/memory_monitor.py", line 9, in import psutil # noqa E402 File ".../ray/python/ray/thirdparty_files/psutil/__init__.py", line 159, in from . 
import _psosx as _psplatform File ".../ray/python/ray/thirdparty_files/psutil/_psosx.py", line 15, in from . import _psutil_osx as cext ImportError: cannot import name '_psutil_osx' from partially initialized module 'psutil' (most likely due to a circular import) (.../ray/python/ray/thirdparty_files/psutil/__init__.py) Then you should run the following commands: .. code-block:: bash rm -rf python/ray/thirdparty_files/ python3 -m pip install psutil --- .. _fake-multinode: Testing Autoscaling Locally =========================== Testing autoscaling behavior is important for autoscaler development and the debugging of applications that depend on autoscaler behavior. You can run the autoscaler locally without needing to launch a real cluster with one of the following methods: Using ``RAY_FAKE_CLUSTER=1 ray start`` -------------------------------------- Instructions: 1. Navigate to the root directory of the Ray repo you have cloned locally. 2. Locate the `fake_multi_node/example.yaml `__ example file and fill in the number of CPUs and GPUs the local machine has for the head node type config. The YAML follows the same format as cluster autoscaler configurations, but some fields are not supported. 3. Configure worker types and other autoscaling configs as desired in the YAML file. 4. Start the fake cluster locally: .. code-block:: shell $ ray stop --force $ RAY_FAKE_CLUSTER=1 ray start \ --autoscaling-config=./python/ray/autoscaler/_private/fake_multi_node/example.yaml \ --head --block 5. Connect your application to the fake local cluster with ``ray.init("auto")``. 6. Run ``ray status`` to view the status of your cluster, or ``cat /tmp/ray/session_latest/logs/monitor.*`` to view the autoscaler monitor log: .. code-block:: shell $ ray status ======== Autoscaler status: 2021-10-12 13:10:21.035674 ======== Node status --------------------------------------------------------------- Healthy: 1 ray.head.default 2 ray.worker.cpu Pending: (no pending nodes) Recent failures: (no failures) Resources --------------------------------------------------------------- Usage: 0.0/10.0 CPU 0.00/70.437 GiB memory 0.00/10.306 GiB object_store_memory Demands: (no resource demands) Using ``ray.cluster_utils.AutoscalingCluster`` ---------------------------------------------- To programmatically create a fake multi-node autoscaling cluster and connect to it, you can use `cluster_utils.AutoscalingCluster `__. Here's an example of a basic autoscaling test that launches tasks triggering autoscaling: .. literalinclude:: /../../python/ray/tests/test_autoscaler_fake_multinode.py :language: python :dedent: 4 :start-after: __example_begin__ :end-before: __example_end__ Python documentation: .. autoclass:: ray.cluster_utils.AutoscalingCluster :members: Features and Limitations of ``fake_multinode`` ---------------------------------------------- Most of the features of the autoscaler are supported in fake multi-node mode. For example, if you update the contents of the YAML file, the autoscaler will pick up the new configuration and apply changes, as it does in a real cluster. Node selection, launch, and termination are governed by the same bin-packing and idle timeout algorithms as in a real cluster. However, there are a few limitations: 1. All node raylets run uncontainerized on the local machine, and hence they share the same IP address. See the :ref:`fake_multinode_docker ` section for an alternative local multi node setup. 2. 
Configurations for auth, setup, initialization, Ray start, file sync, and anything cloud-specific are not supported. 3. It's necessary to limit the number of nodes / node CPU / object store memory to avoid overloading your local machine. .. _fake-multinode-docker: Testing containerized multi nodes locally with Docker compose ============================================================= To go one step further and locally test a multi node setup where each node uses its own container (and thus has a separate filesystem, IP address, and Ray processes), you can use the ``fake_multinode_docker`` node provider. The setup is very similar to the :ref:`fake_multinode ` provider. However, you need to start a monitoring process (``docker_monitor.py``) that takes care of running the ``docker compose`` command. Prerequisites: 1. Make sure you have `docker `_ installed. 2. Make sure you have the `docker compose V2 plugin `_ installed. Using ``RAY_FAKE_CLUSTER=1 ray up`` ----------------------------------- Instructions: 1. Navigate to the root directory of the Ray repo you have cloned locally. 2. Locate the `fake_multi_node/example_docker.yaml `__ example file and fill in the number of CPUs and GPUs the local machine has for the head node type config. The YAML follows the same format as cluster autoscaler configurations, but some fields are not supported. 3. Configure worker types and other autoscaling configs as desired in the YAML file. 4. Make sure the ``shared_volume_dir`` is empty on the host system 5. Start the monitoring process: .. code-block:: shell $ python ./python/ray/autoscaler/_private/fake_multi_node/docker_monitor.py \ ./python/ray/autoscaler/_private/fake_multi_node/example_docker.yaml 6. Start the Ray cluster using ``ray up``: .. code-block:: shell $ RAY_FAKE_CLUSTER=1 ray up -y ./python/ray/autoscaler/_private/fake_multi_node/example_docker.yaml 7. Connect your application to the fake local cluster with ``ray.init("ray://localhost:10002")``. 8. Alternatively, get a shell on the head node: .. code-block:: shell $ docker exec -it fake_docker_fffffffffffffffffffffffffffffffffffffffffffffffffff00000_1 bash Using ``ray.autoscaler._private.fake_multi_node.test_utils.DockerCluster`` -------------------------------------------------------------------------- This utility is used to write tests that use multi node behavior. The ``DockerCluster`` class can be used to setup a Docker-compose cluster in a temporary directory, start the monitoring process, wait for the cluster to come up, connect to it, and update the configuration. Please see the API documentation and example test cases on how to use this utility. .. autoclass:: ray.autoscaler._private.fake_multi_node.test_utils.DockerCluster :members: Features and Limitations of ``fake_multinode_docker`` ----------------------------------------------------- The fake multinode docker node provider provides fully fledged nodes in their own containers. However, some limitations still remain: 1. Configurations for auth, setup, initialization, Ray start, file sync, and anything cloud-specific are not supported (but might be in the future). 2. It's necessary to limit the number of nodes / node CPU / object store memory to avoid overloading your local machine. 3. In docker-in-docker setups, a careful setup has to be followed to make the fake multinode docker provider work (see below). 
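As a concrete illustration of steps 6 and 7 above, the following sketch connects to the fake Docker cluster over Ray Client and submits a handful of CPU-bound tasks so the autoscaler has demand to react to. It assumes the ``example_docker.yaml`` cluster is running and that its worker types provide CPUs; the task count and sleep duration are arbitrary.

.. code-block:: python

    import time

    import ray

    # Step 7 above: connect through the Ray Client port exposed by the head node container.
    ray.init("ray://localhost:10002")

    @ray.remote(num_cpus=1)
    def busy(i: int) -> int:
        # Hold a CPU long enough for the autoscaler to notice the resource demand.
        time.sleep(30)
        return i

    # Submitting more tasks than the head node can run at once should make the
    # autoscaler start fake worker containers. Watch progress with `ray status`
    # or in the output of docker_monitor.py.
    print(ray.get([busy.remote(i) for i in range(8)]))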
Shared directories within the docker environment ------------------------------------------------ The containers will mount two locations to host storage: - ``/cluster/node``: This location (in the container) will point to ``cluster_dir/nodes/`` (on the host). This location is individual per node, but it can be used so that the host can examine contents stored in this directory. - ``/cluster/shared``: This location (in the container) will point to ``cluster_dir/shared`` (on the host). This location is shared across nodes and effectively acts as a shared filesystem (comparable to NFS). Setting up in a Docker-in-Docker (dind) environment --------------------------------------------------- When setting up in a Docker-in-Docker (dind) environment (e.g. the Ray OSS Buildkite environment), some things have to be kept in mind. To make this clear, consider these concepts: * The **host** is the not-containerized machine on which the code is executed (e.g. Buildkite runner) * The **outer container** is the container running directly on the **host**. In the Ray OSS Buildkite environment, two containers are started - a *dind* network host and a container with the Ray source code and wheel in it. * The **inner container** is a container started by the fake multinode docker node provider. The control plane for the multinode docker node provider lives in the outer container. However, ``docker compose`` commands are executed from the connected docker-in-docker network. In the Ray OSS Buildkite environment, this is the ``dind-daemon`` container running on the host docker. If you e.g. mounted ``/var/run/docker.sock`` from the host instead, it would be the host docker daemon. We will refer to both as the **host daemon** from now on. The outer container modifies files that have to be mounted in the inner containers (and modified from there as well). This means that the host daemon also has to have access to these files. Similarly, the inner containers expose ports - but because the containers are actually started by the host daemon, the ports are also only accessible on the host (or the dind container). For the Ray OSS Buildkite environment, we thus set some environment variables: * ``RAY_TEMPDIR="/ray-mount"``. This environment variable defines where the temporary directory for the cluster files should be created. This directory has to be accessible by the host, the outer container, and the inner container. In the inner container, we can control the directory name. * ``RAY_HOSTDIR="/ray"``. In the case where the shared directory has a different name on the host, we can rewrite the mount points dynamically. In this example, the outer container is started with ``-v /ray:/ray-mount`` or similar, so the directory on the host is ``/ray`` and in the outer container ``/ray-mount`` (see ``RAY_TEMPDIR``). * ``RAY_TESTHOST="dind-daemon"`` As the containers are started by the host daemon, we can't just connect to ``localhost``, as the ports are not exposed to the outer container. Thus, we can set the Ray host with this environment variable. Lastly, docker-compose obviously requires a docker image. The default docker image is ``rayproject/ray:nightly``. The docker image requires ``openssh-server`` to be installed and enabled. In Buildkite we build a new image from ``rayproject/ray:nightly-py38-cpu`` to avoid installing this on the fly for every node (which is the default way). This base image is built in one of the previous build steps. 
Thus, we set

* ``RAY_DOCKER_IMAGE="rayproject/ray:multinode-py38"``
* ``RAY_HAS_SSH=1``

to use this Docker image and inform our multinode infrastructure that SSH is already installed.

Local development
-----------------

If you're doing local development on the fake multi-node Docker module, you can set

* ``FAKE_CLUSTER_DEV="auto"``

This will mount the ``ray/python/ray/autoscaler`` directory to the started nodes. Note that this probably won't work in your docker-in-docker setup.

If you want to specify which top-level Ray directories to mount, you can use:

* ``FAKE_CLUSTER_DEV_MODULES="autoscaler,tune"``

This will mount both ``ray/python/ray/autoscaler`` and ``ray/python/ray/tune`` within the node containers. The list of modules should be comma-separated, without spaces.

---

.. include:: /_includes/_latest_contribution_doc.rst

.. _getting-involved:

Getting Involved / Contributing
===============================

.. toctree::
    :hidden:

    development
    ci
    docs
    writing-code-snippets
    fake-autoscaler
    testing-tips
    debugging
    profiling

Ray is not only a framework for distributed applications but also an active community of developers, researchers, and folks who love machine learning.

.. tip:: Ask questions on `our forum `_! The community is extremely active in helping people succeed in building their Ray applications.

You can join (and Star!) us `on GitHub`_.

.. _`on GitHub`: https://github.com/ray-project/ray

Contributing to Ray
-------------------

We welcome (and encourage!) all forms of contributions to Ray, including but not limited to:

- Code reviewing of patches and PRs.
- Pushing patches.
- Documentation and examples.
- Community participation in forums and issues.
- Code readability and code comments to improve readability.
- Test cases to make the codebase more robust.
- Tutorials, blog posts, and talks that promote the project.
- Features and major changes via Ray Enhancement Proposals (REP): https://github.com/ray-project/enhancements

What can I work on?
-------------------

We use GitHub labels to categorize issues and help contributors find work that matches their interests and skill level.

Getting started
~~~~~~~~~~~~~~~

If you're new to Ray, start with these labels:

- `good-first-issue`_: Small issues that are good for new contributors to onboard to the codebase.
- `contribution-welcome`_: Impactful issues that are good candidates for community contributions. Reviews for these issues will be prioritized.

By component
~~~~~~~~~~~~

Find issues in the area you're most interested in:

- `core`_: Ray Core (tasks, actors, objects, scheduling).
- `data`_: Ray Data for distributed data processing.
- `train`_: Ray Train for distributed training.
- `tune`_: Ray Tune for hyperparameter tuning.
- `serve`_: Ray Serve for model serving.
- `rllib`_: RLlib for reinforcement learning.

By type
~~~~~~~

Choose the kind of contribution you'd like to make:

- `bug`_: Bug fixes.
- `enhancement`_: New features or improvements.
- `docs`_: Documentation improvements.

You can combine labels in GitHub's search to find issues that match multiple criteria.

.. _`good-first-issue`: https://github.com/ray-project/ray/issues?q=is%3Aissue+is%3Aopen+label%3A%22good-first-issue%22
.. _`contribution-welcome`: https://github.com/ray-project/ray/issues?q=is%3Aissue+is%3Aopen+label%3A%22contribution-welcome%22
.. _`core`: https://github.com/ray-project/ray/issues?q=is%3Aissue+is%3Aopen+label%3Acore
.. _`data`: https://github.com/ray-project/ray/issues?q=is%3Aissue+is%3Aopen+label%3Adata
.. _`train`: https://github.com/ray-project/ray/issues?q=is%3Aissue+is%3Aopen+label%3Atrain
.. _`tune`: https://github.com/ray-project/ray/issues?q=is%3Aissue+is%3Aopen+label%3Atune
.. _`serve`: https://github.com/ray-project/ray/issues?q=is%3Aissue+is%3Aopen+label%3Aserve
.. _`rllib`: https://github.com/ray-project/ray/issues?q=is%3Aissue+is%3Aopen+label%3Arllib
.. _`bug`: https://github.com/ray-project/ray/issues?q=is%3Aissue+is%3Aopen+label%3Abug
.. _`enhancement`: https://github.com/ray-project/ray/issues?q=is%3Aissue+is%3Aopen+label%3Aenhancement
.. _`docs`: https://github.com/ray-project/ray/issues?q=is%3Aissue+is%3Aopen+label%3Adocs

Setting up your development environment
---------------------------------------

To edit the Ray source code, fork the repository, clone it, and build Ray from source. Follow :ref:`these instructions for building ` a local copy of Ray to easily make changes.

Submitting and Merging a Contribution
-------------------------------------

There are a couple of steps to merge a contribution.

1. First, merge the most recent version of master into your development branch.

   .. code:: bash

     git remote add upstream https://github.com/ray-project/ray.git
     git pull . upstream/master

2. Make sure all existing `tests `__ and `linters `__ pass. Run ``setup_hooks.sh`` to create a git hook that will run the linter before you push your changes.
3. If introducing a new feature or patching a bug, be sure to add new test cases in the relevant file in ``ray/python/ray/tests/``.
4. Document the code. Public functions need to be documented, and remember to provide a usage example if applicable. See ``doc/README.md`` for instructions on editing and building public documentation.
5. Address comments on your PR. During the review process you may need to address merge conflicts with other changes. To resolve merge conflicts, run ``git pull . upstream/master`` on your branch (please do not use rebase, as it is less friendly to the GitHub review tool; all commits will be squashed on merge).
6. Reviewers will approve and merge the pull request; be sure to ping them if the pull request is getting stale.

PR Review Process
-----------------

For contributors who are in the ``ray-project`` organization:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- When you first create a PR, add a reviewer to the `assignee` section.
- Assignees will review your PR and add the `@author-action-required` label if further actions are required.
- Address their comments and remove the `@author-action-required` label from the PR.
- Repeat this process until assignees approve your PR.
- Once the PR is approved, the author is in charge of ensuring the PR passes the build. Add the `test-ok` label if the build succeeds.
- Committers will merge the PR once the build is passing.

For contributors who are not in the ``ray-project`` organization:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- Your PRs will have assignees shortly. Assignees of PRs will be actively engaging with contributors to merge the PR.
- Please actively ping assignees after you address their comments!

Testing
-------

Even though we have hooks to run unit tests automatically for each pull request, we recommend running unit tests locally beforehand to reduce the reviewers' burden and speed up the review process.

If you are running tests for the first time, you can install the required dependencies with:

..
code-block:: shell pip install -c python/requirements_compiled.txt -r python/requirements/test-requirements.txt Testing for Python development ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The full suite of tests is too large to run on a single machine. However, you can run individual relevant Python test files. Suppose that one of the tests in a file of tests, e.g., ``python/ray/tests/test_basic.py``, is failing. You can run just that test file locally as follows: .. code-block:: shell # Directly calling `pytest -v ...` may lose import paths. python -m pytest -v -s python/ray/tests/test_basic.py This will run all of the tests in the file. To run a specific test, use the following: .. code-block:: shell # Directly calling `pytest -v ...` may lose import paths. python -m pytest -v -s test_file.py::name_of_the_test Testing for C++ development ~~~~~~~~~~~~~~~~~~~~~~~~~~~ To compile and run all C++ tests, you can run: .. code-block:: shell bazel test $(bazel query 'kind(cc_test, ...)') Alternatively, you can also run one specific C++ test. You can use: .. code-block:: shell bazel test $(bazel query 'kind(cc_test, ...)') --test_filter=ClientConnectionTest --test_output=streamed Code Style ---------- In general, we follow the `Google style guide `__ for C++ code and the `Black code style `__ for Python code. Python imports follow `PEP8 style `__. However, it is more important for code to be in a locally consistent style than to strictly follow guidelines. Whenever in doubt, follow the local code style of the component. For Python documentation, we follow a subset of the `Google pydoc format `__. The following code snippets demonstrate the canonical Ray pydoc formatting: .. testcode:: def ray_canonical_doc_style(param1: int, param2: str) -> bool: """First sentence MUST be inline with the quotes and fit on one line. Additional explanatory text can be added in paragraphs such as this one. Do not introduce multi-line first sentences. Examples: .. doctest:: >>> # Provide code examples for key use cases, as possible. >>> ray_canonical_doc_style(41, "hello") True >>> # A second example. >>> ray_canonical_doc_style(72, "goodbye") False Args: param1: The first parameter. Do not include the types in the docstring. They should be defined only in the signature. Multi-line parameter docs should be indented by four spaces. param2: The second parameter. Returns: The return value. Do not include types here. """ .. testcode:: class RayClass: """The summary line for a class docstring should fit on one line. Additional explanatory text can be added in paragraphs such as this one. Do not introduce multi-line first sentences. The __init__ method is documented here in the class level docstring. All the public methods and attributes should have docstrings. Examples: .. testcode:: obj = RayClass(12, "world") obj.increment_attr1() Args: param1: The first parameter. Do not include the types in the docstring. They should be defined only in the signature. Multi-line parameter docs should be indented by four spaces. param2: The second parameter. """ def __init__(self, param1: int, param2: str): #: Public attribute is documented here. self.attr1 = param1 #: Public attribute is documented here. self.attr2 = param2 @property def attr3(self) -> str: """Public property of the class. Properties created with the @property decorator should be documented here. """ return "hello" def increment_attr1(self) -> None: """Class methods are similar to regular functions. See above about how to document functions. 
""" self.attr1 = self.attr1 + 1 See :ref:`this ` for more details about how to write code snippets in docstrings. Lint and Formatting ~~~~~~~~~~~~~~~~~~~ We also have tests for code formatting and linting that need to pass before merge. * For Python formatting, install the `required dependencies `_ first with: .. code-block:: shell pip install -c python/requirements_compiled.txt -r python/requirements/lint-requirements.txt * If developing for C++, you will need `clang-format `_ version ``12`` (download this version of Clang from `here `_) You can run the following locally: .. code-block:: shell pip install -U pre-commit==3.5.0 pre-commit install # automatic checks before committing pre-commit run ruff -a An output like the following indicates failure: .. code-block:: shell WARNING: clang-format is not installed! # This is harmless From https://github.com/ray-project/ray * branch master -> FETCH_HEAD python/ray/util/sgd/tf/tf_runner.py:4:1: F401 'numpy as np' imported but unused # Below is the failure In addition, there are other formatting and semantic checkers for components like the following (not included in ``pre-commit``): * Python README format: .. code-block:: shell cd python python setup.py check --restructuredtext --strict --metadata * Python & Docs banned words check .. code-block:: shell ./ci/lint/check-banned-words.sh * Bazel format: .. code-block:: shell ./ci/lint/bazel-format.sh * clang-tidy for C++ lint, requires ``clang`` and ``clang-tidy`` version 12 to be installed: .. code-block:: shell ./ci/lint/check-git-clang-tidy-output.sh Understanding CI test jobs -------------------------- The Ray project automatically runs continuous integration (CI) tests once a PR is opened using `Buildkite `_ with multiple CI test jobs. The `CI`_ test folder contains all integration test scripts and they invoke other test scripts via ``pytest``, ``bazel``-based test or other bash scripts. Some of the examples include: * Bazel test command: * ``bazel test --build_tests_only //:all`` * Ray serving test commands: * ``pytest python/ray/serve/tests`` * ``python python/ray/serve/examples/echo_full.py`` If a CI build exception doesn't appear to be related to your change, please visit `this link `_ to check recent tests known to be flaky. .. _`CI`: https://github.com/ray-project/ray/tree/master/ci API compatibility style guide ----------------------------- Ray provides stability guarantees for its public APIs in Ray core and libraries, which are described in the :ref:`API Stability guide `. It's hard to fully capture the semantics of API compatibility into a single annotation (for example, public APIs may have "experimental" arguments). For more granular stability contracts, those can be noted in the pydoc (e.g., "the ``random_shuffle`` option is experimental"). When possible, experimental arguments should also be prefixed by underscores in Python (e.g., `_owner=`). **Other recommendations**: In Python APIs, consider forcing the use of kwargs instead of positional arguments (with the ``*`` operator). Kwargs are easier to keep backwards compatible than positional arguments, e.g. imagine if you needed to deprecate "opt1" below, it's easier with forced kwargs: .. code-block:: python def foo_bar(file, *, opt1=x, opt2=y) pass For callback APIs, consider adding a ``**kwargs`` placeholder as a "forward compatibility placeholder" in case more args need to be passed to the callback in the future, e.g.: .. 
code-block:: python def tune_user_callback(model, score, **future_kwargs): pass Community Examples ------------------ We're always looking for new example contributions! When contributing an example for a Ray library, include a link to your example in the ``examples.yml`` file for that library: .. code-block:: yaml - title: Serve a Java App skill_level: advanced link: tutorials/java contributor: community Give your example a title, a skill level (``beginner``, ``intermediate``, or ``advanced``), and a link (relative links point to other documentation pages, but direct links starting with ``http://`` also work). Include the ``contributor: community`` metadata to ensure that the example is correctly labeled as a community example in the example gallery. Becoming a Reviewer ------------------- We identify reviewers from active contributors. Reviewers are individuals who not only actively contribute to the project but are also willing to participate in the code review of new contributions. A pull request to the project has to be reviewed by at least one reviewer in order to be merged. There is currently no formal process, but active contributors to Ray will be solicited by current reviewers. More Resources for Getting Involved ----------------------------------- .. include:: ../ray-contribute/involvement.rst .. note:: These tips are based on the TVM `contributor guide `__. --- Developer Guides ================ .. toctree:: :maxdepth: 2 stability api-policy getting-involved ../ray-core/configure whitepaper --- Ray is not only a framework for distributed applications but also an active community of developers, researchers, and folks who love machine learning. Here's a list of tips for getting involved with the Ray community: - Join our `community Slack `_ to discuss Ray! - Star and follow us `on GitHub`_. - To post questions or feature requests, check out the `Discussion Board`_. - Follow us and spread the word on `Twitter`_. - Join our `Meetup Group`_ to connect with others in the community. - Use the `[ray]` tag on `StackOverflow`_ to ask and answer questions about Ray usage. .. _`Discussion Board`: https://discuss.ray.io/ .. _`GitHub Issues`: https://github.com/ray-project/ray/issues .. _`StackOverflow`: https://stackoverflow.com/questions/tagged/ray .. _`Pull Requests`: https://github.com/ray-project/ray/pulls .. _`Twitter`: https://x.com/raydistributed .. _`Meetup Group`: https://www.meetup.com/Bay-Area-Ray-Meetup/ .. _`on GitHub`: https://github.com/ray-project/ray --- .. _ray-core-internal-profiling: Profiling for Ray Developers ============================ This guide helps contributors to the Ray project analyze Ray performance. Getting a stack trace of Ray C++ processes ------------------------------------------ You can use the following GDB command to view the current stack trace of any running Ray process (e.g., raylet). This can be useful for debugging 100% CPU utilization or infinite loops (simply run the command a few times to see what the process is stuck on). .. code-block:: shell sudo gdb -batch -ex "thread apply all bt" -p <pid> Note that you can find the pid of the raylet with ``pgrep raylet``. Installation ------------ These instructions are for Ubuntu only. Attempts to get ``pprof`` to correctly symbolize on Mac OS have failed. .. code-block:: bash sudo apt-get install google-perftools libgoogle-perftools-dev You may need to install ``graphviz`` for ``pprof`` to generate flame graphs. ..
code-block:: bash sudo apt-get install graphviz CPU profiling ------------- To launch Ray in profiling mode and profile Raylet, define the following variables: .. code-block:: bash export PERFTOOLS_PATH=/usr/lib/x86_64-linux-gnu/libprofiler.so export PERFTOOLS_LOGFILE=/tmp/pprof.out export RAY_RAYLET_PERFTOOLS_PROFILER=1 The file ``/tmp/pprof.out`` is empty until you let the binary run the target workload for a while and then ``kill`` it via ``ray stop`` or by letting the driver exit. Note: Enabling `RAY_RAYLET_PERFTOOLS_PROFILER` allows profiling of the Raylet component. To profile other modules, use `RAY_{MODULE}_PERFTOOLS_PROFILER`, where `MODULE` represents the uppercase form of the process type, such as `GCS_SERVER`. Visualizing the CPU profile ~~~~~~~~~~~~~~~~~~~~~~~~~~~ You can visualize the output of ``pprof`` in different ways. Below, the output is a zoomable ``.svg`` image displaying the call graph annotated with hot paths. .. code-block:: bash # Use the appropriate path. RAYLET=ray/python/ray/core/src/ray/raylet/raylet google-pprof -svg $RAYLET /tmp/pprof.out > /tmp/pprof.svg # Then open the .svg file with Chrome. # If you realize the call graph is too large, use -focus= to zoom # into subtrees. google-pprof -focus=epoll_wait -svg $RAYLET /tmp/pprof.out > /tmp/pprof.svg Below is a snapshot of an example SVG output, from the official documentation: .. image:: http://goog-perftools.sourceforge.net/doc/pprof-test-big.gif Memory profiling ---------------- To run memory profiling on Ray core components, use `jemalloc `_. Ray supports environment variables that override `LD_PRELOAD` on core components. You can find the component name from `ray_constants.py`. For example, if you'd like to profile gcs_server, search `PROCESS_TYPE_GCS_SERVER` in `ray_constants.py`. You can see the value is `gcs_server`. Users are supposed to provide 4 env vars for memory profiling. * `RAY_JEMALLOC_LIB_PATH`: The path to the jemalloc shared library `libjemalloc.so` * `RAY_JEMALLOC_CONF`: The MALLOC_CONF configuration for jemalloc, using comma-separated values. Read `jemalloc docs `_ for more details. * `RAY_JEMALLOC_PROFILE`: Comma separated Ray components to run Jemalloc `.so`. e.g., ("raylet,gcs_server"). Note that the components should match the process type in `ray_constants.py`. (It means "RAYLET,GCS_SERVER" won't work). * `RAY_LD_PRELOAD_ON_WORKERS`: Default value is `0`, which means Ray doesn't preload Jemalloc for workers if a library is incompatible with Jemalloc. Set to `1` to instruct Ray to preload Jemalloc for a worker using values configured by `RAY_JEMALLOC_LIB_PATH` and `RAY_JEMALLOC_PROFILE`. .. code-block:: bash # Install jemalloc wget https://github.com/jemalloc/jemalloc/releases/download/5.2.1/jemalloc-5.2.1.tar.bz2 tar -xf jemalloc-5.2.1.tar.bz2 cd jemalloc-5.2.1 export JEMALLOC_DIR=$PWD ./configure --enable-prof --enable-prof-libunwind make sudo make install # Verify jeprof is installed. which jeprof # Start a Ray head node with jemalloc enabled. # (1) `prof_prefix` defines the path to the output profile files and the prefix of their file names. # (2) This example only profiles the GCS server component. RAY_JEMALLOC_CONF=prof:true,lg_prof_interval:33,lg_prof_sample:17,prof_final:true,prof_leak:true,prof_prefix:$PATH_TO_OUTPUT_DIR/jeprof.out \ RAY_JEMALLOC_LIB_PATH=$JEMALLOC_DIR/lib/libjemalloc.so \ RAY_JEMALLOC_PROFILE=gcs_server \ ray start --head # Check the output files. You should see files with the format of "jeprof..0.f.heap". 
# Example: jeprof.out.1904189.0.f.heap ls $PATH_TO_OUTPUT_DIR/ # If you don't see any output files, try stopping the Ray cluster to force it to flush the # profile data since `prof_final:true` is set. ray stop # Use jeprof to view the profile data. The first argument is the binary of GCS server. # Note that you can also use `--pdf` or `--svg` to generate different formats of the profile data. jeprof --text $YOUR_RAY_SRC_DIR/python/ray/core/src/ray/gcs/gcs_server $PATH_TO_OUTPUT_DIR/jeprof.out.1904189.0.f.heap # [Example output] Using local file ../ray/core/src/ray/gcs/gcs_server. Using local file jeprof.out.1904189.0.f.heap. addr2line: DWARF error: section .debug_info is larger than its filesize! (0x93f189 vs 0x530e70) Total: 1.0 MB 0.3 25.9% 25.9% 0.3 25.9% absl::lts_20230802::container_internal::InitializeSlots 0.1 12.9% 38.7% 0.1 12.9% google::protobuf::DescriptorPool::Tables::CreateFlatAlloc 0.1 12.4% 51.1% 0.1 12.4% ::do_tcp_client_global_init 0.1 12.3% 63.4% 0.1 12.3% grpc_core::Server::Start 0.1 12.2% 75.6% 0.1 12.2% std::__cxx11::basic_string::_M_assign 0.1 12.2% 87.8% 0.1 12.2% std::__cxx11::basic_string::_M_mutate 0.1 12.2% 100.0% 0.1 12.2% std::__cxx11::basic_string::reserve 0.0 0.0% 100.0% 0.8 75.4% EventTracker::RecordExecution ... Running microbenchmarks ----------------------- To run a set of single-node Ray microbenchmarks, use: .. code-block:: bash ray microbenchmark You can find the microbenchmark results for Ray releases in the `GitHub release logs `__. References ---------- - The `pprof documentation `_. - A `Go version of pprof `_. - The `gperftools `_, including libprofiler, tcmalloc, and other useful tools. --- .. _api-stability: API Stability ============= Ray provides stability guarantees for its public APIs in Ray core and libraries, which are decorated/labeled accordingly. An API can be labeled: * :ref:`PublicAPI `, which means the API is exposed to end users. PublicAPI has three sub-levels (alpha, beta, stable), as described below. * :ref:`DeveloperAPI `, which means the API is explicitly exposed to *advanced* Ray users and library developers * :ref:`Deprecated `, which may be removed in future releases of Ray. Ray's PublicAPI stability definitions are based off the `Google stability level guidelines `_, with minor differences: .. _api-stability-alpha: Alpha ~~~~~ An *alpha* component undergoes rapid iteration with a known set of users who **must** be tolerant of change. The number of users **should** be a curated, manageable set, such that it is feasible to communicate with all of them individually. Breaking changes **must** be both allowed and expected in alpha components, and users **must** have no expectation of stability. .. _api-stability-beta: Beta ~~~~ A *beta* component **must** be considered complete and ready to be declared stable, subject to public testing. Because users of beta components tend to have a lower tolerance of change, beta components **should** be as stable as possible; however, the beta component **must** be permitted to change over time. These changes **should** be minimal but **may** include backwards-incompatible changes to beta components. Backwards-incompatible changes **must** be made only after a reasonable deprecation period to provide users with an opportunity to migrate their code. .. _api-stability-stable: Stable ~~~~~~ A *stable* component **must** be fully-supported over the lifetime of the major API version. 
Because users expect such stability from components marked stable, there **must** be no breaking changes to these components within a major version (excluding extraordinary circumstances). Docstrings ---------- .. _public-api-def: .. autofunction:: ray.util.annotations.PublicAPI .. _developer-api-def: .. autofunction:: ray.util.annotations.DeveloperAPI .. _deprecated-api-def: .. autofunction:: ray.util.annotations.Deprecated Undecorated functions can be generally assumed to not be part of the Ray public API. --- Tips for testing Ray programs ============================= Ray programs can be a little tricky to test due to the nature of parallel programs. We've put together a list of tips and tricks for common testing practices for Ray programs. .. contents:: :local: Tip 1: Fixing the resource quantity with ``ray.init(num_cpus=...)`` ------------------------------------------------------------------- By default, ``ray.init()`` detects the number of CPUs and GPUs on your local machine/cluster. However, your testing environment may have a significantly lower number of resources. For example, the TravisCI build environment only has `2 cores `_ If tests are written to depend on ``ray.init()``, they may be implicitly written in a way that relies on a larger multi-core machine. This may easily result in tests exhibiting unexpected, flaky, or faulty behavior that is hard to reproduce. To overcome this, you should override the detected resources by setting them in ``ray.init`` like: ``ray.init(num_cpus=2)`` Tip 2: Sharing the ray cluster across tests if possible -------------------------------------------------------- It is safest to start a new ray cluster for each test. .. testcode:: import unittest class RayTest(unittest.TestCase): def setUp(self): ray.init(num_cpus=4, num_gpus=0) def tearDown(self): ray.shutdown() However, starting and stopping a Ray cluster can actually incur a non-trivial amount of latency. For example, on a typical Macbook Pro laptop, starting and stopping can take nearly 5 seconds: .. code-block:: bash python -c 'import ray; ray.init(); ray.shutdown()' 3.93s user 1.23s system 116% cpu 4.420 total Across 20 tests, this ends up being 90 seconds of added overhead. Reusing a Ray cluster across tests can provide significant speedups to your test suite. This reduces the overhead to a constant, amortized quantity: .. testcode:: class RayClassTest(unittest.TestCase): @classmethod def setUpClass(cls): # Start it once for the entire test suite/module ray.init(num_cpus=4, num_gpus=0) @classmethod def tearDownClass(cls): ray.shutdown() Depending on your application, there are certain cases where it may be unsafe to reuse a Ray cluster across tests. For example: 1. If your application depends on setting environment variables per process. 2. If your remote actor/task sets any sort of process-level global variables. Tip 3: Create a mini-cluster with ``ray.cluster_utils.Cluster`` --------------------------------------------------------------- If writing an application for a cluster setting, you may want to mock a multi-node Ray cluster. This can be done with the ``ray.cluster_utils.Cluster`` utility. .. note:: On Windows, support for multi-node Ray clusters is currently experimental and untested. If you run into issues please file a report at https://github.com/ray-project/ray/issues. .. testcode:: from ray.cluster_utils import Cluster # Starts a head-node for the cluster. 
cluster = Cluster( initialize_head=True, head_node_args={ "num_cpus": 10, }) After starting a cluster, you can execute a typical ray script in the same process: .. testcode:: import ray ray.init(address=cluster.address) @ray.remote def f(x): return x for _ in range(1): ray.get([f.remote(1) for _ in range(1000)]) for _ in range(10): ray.get([f.remote(1) for _ in range(100)]) for _ in range(100): ray.get([f.remote(1) for _ in range(10)]) for _ in range(1000): ray.get([f.remote(1) for _ in range(1)]) You can also add multiple nodes, each with different resource quantities: .. testcode:: mock_node = cluster.add_node(num_cpus=10) assert ray.cluster_resources()["CPU"] == 20 You can also remove nodes, which is useful when testing failure-handling logic: .. testcode:: cluster.remove_node(mock_node) assert ray.cluster_resources()["CPU"] == 10 See the `Cluster Util for more details `_. Tip 4: Be careful when running tests in parallel ------------------------------------------------ Since Ray starts a variety of services, it is easy to trigger timeouts if too many services are started at once. Therefore, when using tools such as `pytest xdist `_ that run multiple tests in parallel, one should keep in mind that this may introduce flakiness into the test environment. --- .. _whitepaper: Architecture Whitepapers ======================== For an in-depth overview of Ray internals, check out the `Ray 2.0 Architecture whitepaper `__. The previous v1.0 whitepaper can be found `here `__. For more about the scalability and performance of the Ray dataplane, see the `Exoshuffle paper `__. --- .. _writing-code-snippets_ref: ========================== How to write code snippets ========================== Users learn from example. So, whether you're writing a docstring or a user guide, include examples that illustrate the relevant APIs. Your examples should run out-of-the-box so that users can copy them and adapt them to their own needs. This page describes how to write code snippets so that they're tested in CI. .. note:: The examples in this guide use reStructuredText. If you're writing Markdown, use MyST syntax. To learn more, read the `MyST documentation `_. ----------------- Types of examples ----------------- There are three types of examples: *doctest-style*, *code-output-style*, and *literalinclude*. *doctest-style* examples ======================== *doctest-style* examples mimic interactive Python sessions. :: .. doctest:: >>> def is_even(x): ... return (x % 2) == 0 >>> is_even(0) True >>> is_even(1) False They're rendered like this: .. doctest:: >>> def is_even(x): ... return (x % 2) == 0 >>> is_even(0) True >>> is_even(1) False .. tip:: If you're writing docstrings, exclude `.. doctest::` to simplify your code. :: Example: >>> def is_even(x): ... return (x % 2) == 0 >>> is_even(0) True >>> is_even(1) False *code-output-style* examples ============================ *code-output-style* examples contain ordinary Python code. :: .. testcode:: def is_even(x): return (x % 2) == 0 print(is_even(0)) print(is_even(1)) .. testoutput:: True False They're rendered like this: .. testcode:: def is_even(x): return (x % 2) == 0 print(is_even(0)) print(is_even(1)) .. testoutput:: True False *literalinclude* examples ========================= *literalinclude* examples display Python modules. :: .. literalinclude:: ./doc_code/example_module.py :language: python :start-after: __is_even_begin__ :end-before: __is_even_end__ .. literalinclude:: ./doc_code/example_module.py :language: python They're rendered like this: .. 
literalinclude:: ./doc_code/example_module.py :language: python :start-after: __is_even_begin__ :end-before: __is_even_end__ --------------------------------------- Which type of example should you write? --------------------------------------- There's no hard rule about which style you should use. Choose the style that best illustrates your API. .. tip:: If you're not sure which style to use, use *code-output-style*. When to use *doctest-style* =========================== If you're writing a small example that emphasizes object representations, or if you want to print intermediate objects, use *doctest-style*. :: .. doctest:: >>> import ray >>> ds = ray.data.range(100) >>> ds.schema() Column Type ------ ---- id int64 >>> ds.take(5) [{'id': 0}, {'id': 1}, {'id': 2}, {'id': 3}, {'id': 4}] When to use *code-output-style* ====================================== If you're writing a longer example, or if object representations aren't relevant to your example, use *code-output-style*. :: .. testcode:: from typing import Dict import numpy as np import ray ds = ray.data.read_csv("s3://anonymous@air-example-data/iris.csv") # Compute a "petal area" attribute. def transform_batch(batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]: vec_a = batch["petal length (cm)"] vec_b = batch["petal width (cm)"] batch["petal area (cm^2)"] = vec_a * vec_b return batch transformed_ds = ds.map_batches(transform_batch) print(transformed_ds.materialize()) .. testoutput:: MaterializedDataset( num_blocks=..., num_rows=150, schema={ sepal length (cm): double, sepal width (cm): double, petal length (cm): double, petal width (cm): double, target: int64, petal area (cm^2): double } ) When to use *literalinclude* ============================ If you're writing an end-to-end example and your example doesn't contain outputs, use *literalinclude*. ----------------------------------- How to handle hard-to-test examples ----------------------------------- When is it okay to not test an example? ======================================= You don't need to test examples that depend on external systems like Weights and Biases. Skipping *doctest-style* examples ================================= To skip a *doctest-style* example, append `# doctest: +SKIP` to your Python code. :: .. doctest:: >>> import ray >>> ray.data.read_images("s3://private-bucket") # doctest: +SKIP Skipping *code-output-style* examples ========================================== To skip a *code-output-style* example, add `:skipif: True` to the `testcode` block. :: .. testcode:: :skipif: True from ray.air.integrations.wandb import WandbLoggerCallback callback = WandbLoggerCallback( project="Optimization_Project", api_key_file=..., log_config=True ) ---------------------------------------------- How to handle long or non-deterministic outputs ---------------------------------------------- If your Python code is non-deterministic, or if your output is excessively long, you may want to skip all or part of an output. Ignoring *doctest-style* outputs ================================ To ignore parts of a *doctest-style* output, replace problematic sections with ellipses. :: >>> import ray >>> ray.data.read_images("s3://anonymous@ray-example-data/image-datasets/simple") Dataset(num_rows=..., schema=...) To ignore an output altogether, write a *code-output-style* snippet. Don't use `# doctest: +SKIP`.
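An ellipsis can stand in for any nondeterministic fragment of a line, not just dataset summaries. As a minimal sketch (not taken from the existing docs; it relies on the same ellipsis matching that the examples above already use), a timestamp can be masked like this. ::

    .. doctest::

        >>> import time
        >>> print(f"finished at {time.time()}")
        finished at ...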
Ignoring *code-output-style* outputs ======================================== If parts of your output are long or non-deterministic, replace problematic sections with ellipses. :: .. testcode:: import ray ds = ray.data.read_images("s3://anonymous@ray-example-data/image-datasets/simple") print(ds) .. testoutput:: Dataset(num_rows=..., schema=...) If your output is nondeterministic and you want to display a sample output, add `:options: +MOCK`. :: .. testcode:: import random print(random.random()) .. testoutput:: :options: +MOCK 0.969461416250246 If your output is hard to test and you don't want to display a sample output, exclude the ``testoutput``. :: .. testcode:: print("This output is hidden and untested") ------------------------------ How to test examples with GPUs ------------------------------ To configure Bazel to run an example with GPUs, complete the following steps: #. Open the corresponding ``BUILD`` file. If your example is in the ``doc/`` folder, open ``doc/BUILD``. If your example is in the ``python/`` folder, open a file like ``python/ray/train/BUILD``. #. Locate the ``doctest`` rule. It looks like this: :: doctest( files = glob( include=["source/**/*.rst"], ), size = "large", tags = ["team:none"] ) #. Add the file that contains your example to the list of excluded files. :: doctest( files = glob( include=["source/**/*.rst"], exclude=["source/data/requires-gpus.rst"] ), tags = ["team:none"] ) #. If it doesn't already exist, create a ``doctest`` rule with ``gpu`` set to ``True``. :: doctest( files = [], tags = ["team:none"], gpu = True ) #. Add the file that contains your example to the GPU rule. :: doctest( files = ["source/data/requires-gpus.rst"] size = "large", tags = ["team:none"], gpu = True ) For a practical example, see ``doc/BUILD`` or ``python/ray/train/BUILD``. ---------------------------- How to locally test examples ---------------------------- To locally test examples, install the Ray fork of `pytest-sphinx`. .. code-block:: bash pip install git+https://github.com/ray-project/pytest-sphinx Then, run pytest on a module, docstring, or user guide. .. code-block:: bash pytest --doctest-modules python/ray/data/read_api.py pytest --doctest-modules python/ray/data/read_api.py::ray.data.read_api.range pytest --doctest-modules doc/source/data/getting-started.rst --- :orphan: .. _accelerator_types: Accelerator types ================= Ray supports the following accelerator types: .. literalinclude:: ../../../python/ray/util/accelerators/accelerators.py :language: python --- Utility Classes =============== Actor Pool ~~~~~~~~~~ .. tab-set:: .. tab-item:: Python The ``ray.util`` module contains a utility class, ``ActorPool``. This class is similar to multiprocessing.Pool and lets you schedule Ray tasks over a fixed pool of actors. .. literalinclude:: ../doc_code/actor-pool.py See the :class:`package reference ` for more information. .. tab-item:: Java Actor pool hasn't been implemented in Java yet. .. tab-item:: C++ Actor pool hasn't been implemented in C++ yet. Message passing using Ray Queue ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Sometimes just using one signal to synchronize is not enough. If you need to send data among many tasks or actors, you can use :class:`ray.util.queue.Queue `. .. literalinclude:: ../doc_code/actor-queue.py Ray's Queue API has a similar API to Python's ``asyncio.Queue`` and ``queue.Queue``. --- AsyncIO / Concurrency for Actors ================================ Within a single actor process, it is possible to execute concurrent threads. 
Ray offers two types of concurrency within an actor: * :ref:`async execution ` * :ref:`threading ` Keep in mind that the Python's `Global Interpreter Lock (GIL) `_ will only allow one thread of Python code running at once. This means if you are just parallelizing Python code, you won't get true parallelism. If you call Numpy, Cython, Tensorflow, or PyTorch code, these libraries will release the GIL when calling into C/C++ functions. **Neither the** :ref:`threaded-actors` nor :ref:`async-actors` **model will allow you to bypass the GIL.** .. _async-actors: AsyncIO for Actors ------------------ Since Python 3.5, it is possible to write concurrent code using the ``async/await`` `syntax `__. Ray natively integrates with asyncio. You can use Ray alongside popular async frameworks like aiohttp, aioredis, etc. .. testcode:: import ray import asyncio @ray.remote class AsyncActor: def __init__(self, expected_num_tasks: int): self._event = asyncio.Event() self._curr_num_tasks = 0 self._expected_num_tasks = expected_num_tasks # Multiple invocations of this method can run concurrently on the same event loop. async def run_concurrent(self): self._curr_num_tasks += 1 if self._curr_num_tasks == self._expected_num_tasks: print("All coroutines are executing concurrently, unblocking.") self._event.set() else: print("Waiting for other coroutines to start.") await self._event.wait() print("All coroutines ran concurrently.") actor = AsyncActor.remote(4) refs = [actor.run_concurrent.remote() for _ in range(4)] # Fetch results using regular `ray.get`. ray.get(refs) # Fetch results using `asyncio` APIs. async def get_async(): return await asyncio.gather(*refs) asyncio.run(get_async()) .. testoutput:: :options: +MOCK (AsyncActor pid=9064) Waiting for other coroutines to start. (AsyncActor pid=9064) Waiting for other coroutines to start. (AsyncActor pid=9064) Waiting for other coroutines to start. (AsyncActor pid=9064) All coroutines are executing concurrently, unblocking. (AsyncActor pid=9064) All coroutines ran concurrently. (AsyncActor pid=9064) All coroutines ran concurrently. (AsyncActor pid=9064) All coroutines ran concurrently. (AsyncActor pid=9064) All coroutines ran concurrently. .. testcode:: :hide: # NOTE: The outputs from the previous code block can show up in subsequent tests. # To prevent flakiness, we wait for a grace period. import time print("Sleeping...") time.sleep(1) .. testoutput:: ... ObjectRefs as asyncio.Futures ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ObjectRefs can be translated to asyncio.Futures. This feature make it possible to ``await`` on ray futures in existing concurrent applications. Instead of: .. testcode:: import ray @ray.remote def some_task(): return 1 ray.get(some_task.remote()) ray.wait([some_task.remote()]) you can wait on the ref with Python 3.9 and Python 3.10: .. testcode:: import ray import asyncio @ray.remote def some_task(): return 1 async def await_obj_ref(): await some_task.remote() await asyncio.wait([some_task.remote()]) asyncio.run(await_obj_ref()) or the Future object directly with Python 3.11+: .. testcode:: import asyncio async def convert_to_asyncio_future(): ref = some_task.remote() fut: asyncio.Future = asyncio.wrap_future(ref.future()) print(await fut) asyncio.run(convert_to_asyncio_future()) .. testoutput:: 1 See the `asyncio doc `__ for more `asyncio` patterns including timeouts and ``asyncio.gather``. .. 
_async-ref-to-futures: ObjectRefs as concurrent.futures.Futures ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ObjectRefs can also be wrapped into ``concurrent.futures.Future`` objects. This is useful for interfacing with existing ``concurrent.futures`` APIs: .. testcode:: import concurrent refs = [some_task.remote() for _ in range(4)] futs = [ref.future() for ref in refs] for fut in concurrent.futures.as_completed(futs): assert fut.done() print(fut.result()) .. testoutput:: 1 1 1 1 Defining an Async Actor ~~~~~~~~~~~~~~~~~~~~~~~ By using `async` method definitions, Ray will automatically detect whether an actor support `async` calls or not. .. testcode:: import ray import asyncio @ray.remote class AsyncActor: def __init__(self, expected_num_tasks: int): self._event = asyncio.Event() self._curr_num_tasks = 0 self._expected_num_tasks = expected_num_tasks async def run_task(self): print("Started task") self._curr_num_tasks += 1 if self._curr_num_tasks == self._expected_num_tasks: self._event.set() else: # Yield the event loop for multiple coroutines to run concurrently. await self._event.wait() print("Finished task") actor = AsyncActor.remote(5) # All 5 tasks will start at once and run concurrently. ray.get([actor.run_task.remote() for _ in range(5)]) .. testoutput:: :options: +MOCK (AsyncActor pid=3456) Started task (AsyncActor pid=3456) Started task (AsyncActor pid=3456) Started task (AsyncActor pid=3456) Started task (AsyncActor pid=3456) Started task (AsyncActor pid=3456) Finished task (AsyncActor pid=3456) Finished task (AsyncActor pid=3456) Finished task (AsyncActor pid=3456) Finished task (AsyncActor pid=3456) Finished task Under the hood, Ray runs all of the methods inside a single python event loop. Please note that running blocking ``ray.get`` or ``ray.wait`` inside async actor method is not allowed, because ``ray.get`` will block the execution of the event loop. In async actors, only one task can be running at any point in time (though tasks can be multiplexed). There will be only one thread in AsyncActor! See :ref:`threaded-actors` if you want a threadpool. Setting concurrency in Async Actors ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ You can set the number of "concurrent" task running at once using the ``max_concurrency`` flag. By default, 1000 tasks can be running concurrently. .. testcode:: import asyncio import ray @ray.remote class AsyncActor: def __init__(self, batch_size: int): self._event = asyncio.Event() self._curr_tasks = 0 self._batch_size = batch_size async def run_task(self): print("Started task") self._curr_tasks += 1 if self._curr_tasks == self._batch_size: self._event.set() else: await self._event.wait() self._event.clear() self._curr_tasks = 0 print("Finished task") actor = AsyncActor.options(max_concurrency=2).remote(2) # Only 2 tasks will run concurrently. # Once 2 finish, the next 2 should run. ray.get([actor.run_task.remote() for _ in range(8)]) .. testoutput:: :options: +MOCK (AsyncActor pid=5859) Started task (AsyncActor pid=5859) Started task (AsyncActor pid=5859) Finished task (AsyncActor pid=5859) Finished task (AsyncActor pid=5859) Started task (AsyncActor pid=5859) Started task (AsyncActor pid=5859) Finished task (AsyncActor pid=5859) Finished task (AsyncActor pid=5859) Started task (AsyncActor pid=5859) Started task (AsyncActor pid=5859) Finished task (AsyncActor pid=5859) Finished task (AsyncActor pid=5859) Started task (AsyncActor pid=5859) Started task (AsyncActor pid=5859) Finished task (AsyncActor pid=5859) Finished task .. 
_threaded-actors: Threaded Actors --------------- Sometimes, asyncio is not an ideal solution for your actor. For example, you may have one method that performs some computation heavy task while blocking the event loop, not giving up control via ``await``. This would hurt the performance of an Async Actor because Async Actors can only execute 1 task at a time and rely on ``await`` to context switch. Instead, you can use the ``max_concurrency`` Actor options without any async methods, allowing you to achieve threaded concurrency (like a thread pool). .. warning:: When there is at least one ``async def`` method in actor definition, Ray will recognize the actor as AsyncActor instead of ThreadedActor. .. testcode:: @ray.remote class ThreadedActor: def task_1(self): print("I'm running in a thread!") def task_2(self): print("I'm running in another thread!") a = ThreadedActor.options(max_concurrency=2).remote() ray.get([a.task_1.remote(), a.task_2.remote()]) .. testoutput:: :options: +MOCK (ThreadedActor pid=4822) I'm running in a thread! (ThreadedActor pid=4822) I'm running in another thread! Each invocation of the threaded actor will be running in a thread pool. The size of the threadpool is limited by the ``max_concurrency`` value. AsyncIO for Remote Tasks ------------------------ We don't support asyncio for remote tasks. The following snippet will fail: .. testcode:: :skipif: True @ray.remote async def f(): pass Instead, you can wrap the ``async`` function with a wrapper to run the task synchronously: .. testcode:: async def f(): pass @ray.remote def wrapper(): import asyncio asyncio.run(f()) --- Limiting Concurrency Per-Method with Concurrency Groups ======================================================= Besides setting the max concurrency overall for an actor, Ray allows methods to be separated into *concurrency groups*, each with its own thread(s). This allows you to limit the concurrency per-method, e.g., allow a health-check method to be given its own concurrency quota separate from request serving methods. .. tip:: Concurrency groups work with both asyncio and threaded actors. The syntax is the same. .. _defining-concurrency-groups: Defining Concurrency Groups --------------------------- This defines two concurrency groups, "io" with max concurrency = 2 and "compute" with max concurrency = 4. The methods ``f1`` and ``f2`` are placed in the "io" group, and the methods ``f3`` and ``f4`` are placed into the "compute" group. Note that there is always a default concurrency group for actors, which has a default concurrency of 1000 for AsyncIO actors and 1 otherwise. .. tab-set:: .. tab-item:: Python You can define concurrency groups for actors using the ``concurrency_group`` decorator argument: .. testcode:: import ray @ray.remote(concurrency_groups={"io": 2, "compute": 4}) class AsyncIOActor: def __init__(self): pass @ray.method(concurrency_group="io") async def f1(self): pass @ray.method(concurrency_group="io") async def f2(self): pass @ray.method(concurrency_group="compute") async def f3(self): pass @ray.method(concurrency_group="compute") async def f4(self): pass async def f5(self): pass a = AsyncIOActor.remote() a.f1.remote() # executed in the "io" group. a.f2.remote() # executed in the "io" group. a.f3.remote() # executed in the "compute" group. a.f4.remote() # executed in the "compute" group. a.f5.remote() # executed in the default group. .. tab-item:: Java You can define concurrency groups for concurrent actors using the API ``setConcurrencyGroups()`` argument: .. 
code-block:: java class ConcurrentActor { public long f1() { return Thread.currentThread().getId(); } public long f2() { return Thread.currentThread().getId(); } public long f3(int a, int b) { return Thread.currentThread().getId(); } public long f4() { return Thread.currentThread().getId(); } public long f5() { return Thread.currentThread().getId(); } } ConcurrencyGroup group1 = new ConcurrencyGroupBuilder() .setName("io") .setMaxConcurrency(1) .addMethod(ConcurrentActor::f1) .addMethod(ConcurrentActor::f2) .build(); ConcurrencyGroup group2 = new ConcurrencyGroupBuilder() .setName("compute") .setMaxConcurrency(1) .addMethod(ConcurrentActor::f3) .addMethod(ConcurrentActor::f4) .build(); ActorHandle myActor = Ray.actor(ConcurrentActor::new) .setConcurrencyGroups(group1, group2) .remote(); myActor.task(ConcurrentActor::f1).remote(); // executed in the "io" group. myActor.task(ConcurrentActor::f2).remote(); // executed in the "io" group. myActor.task(ConcurrentActor::f3, 3, 5).remote(); // executed in the "compute" group. myActor.task(ConcurrentActor::f4).remote(); // executed in the "compute" group. myActor.task(ConcurrentActor::f5).remote(); // executed in the "default" group. .. _default-concurrency-group: Default Concurrency Group ------------------------- By default, methods are placed in a default concurrency group which has a concurrency limit of 1000 for AsyncIO actors and 1 otherwise. The concurrency of the default group can be changed by setting the ``max_concurrency`` actor option. .. tab-set:: .. tab-item:: Python The following actor has 2 concurrency groups: "io" and "default". The max concurrency of "io" is 2, and the max concurrency of "default" is 10. .. testcode:: @ray.remote(concurrency_groups={"io": 2}) class AsyncIOActor: async def f1(self): pass actor = AsyncIOActor.options(max_concurrency=10).remote() .. tab-item:: Java The following concurrent actor has 2 concurrency groups: "io" and "default". The max concurrency of "io" is 2, and the max concurrency of "default" is 10. .. code-block:: java class ConcurrentActor { public long f1() { return Thread.currentThread().getId(); } } ConcurrencyGroup group = new ConcurrencyGroupBuilder() .setName("io") .setMaxConcurrency(2) .addMethod(ConcurrentActor::f1) .build(); ActorHandle myActor = Ray.actor(ConcurrentActor::new) .setConcurrencyGroups(group) .setMaxConcurrency(10) .remote(); .. _setting-the-concurrency-group-at-runtime: Setting the Concurrency Group at Runtime ---------------------------------------- You can also dispatch actor methods into a specific concurrency group at runtime. The following snippet demonstrates setting the concurrency group of the ``f2`` method dynamically at runtime. .. tab-set:: .. tab-item:: Python You can use the ``.options`` method. .. testcode:: # Executed in the "io" group (as defined in the actor class). a.f2.options().remote() # Executed in the "compute" group. a.f2.options(concurrency_group="compute").remote() .. tab-item:: Java You can use ``setConcurrencyGroup`` method. .. code-block:: java // Executed in the "io" group (as defined in the actor creation). myActor.task(ConcurrentActor::f2).remote(); // Executed in the "compute" group. myActor.task(ConcurrentActor::f2).setConcurrencyGroup("compute").remote(); --- Named Actors ============ An actor can be given a unique name within their :ref:`namespace `. This allows you to retrieve the actor from any job in the Ray cluster. 
This can be useful if you cannot directly pass the actor handle to the task that needs it, or if you are trying to access an actor launched by another driver. Note that the actor will still be garbage-collected if no handles to it exist. See :ref:`actor-lifetimes` for more details. .. tab-set:: .. tab-item:: Python .. testcode:: import ray @ray.remote class Counter: pass # Create an actor with a name counter = Counter.options(name="some_name").remote() # Retrieve the actor later somewhere counter = ray.get_actor("some_name") .. tab-item:: Java .. code-block:: java // Create an actor with a name. ActorHandle counter = Ray.actor(Counter::new).setName("some_name").remote(); ... // Retrieve the actor later somewhere Optional> counter = Ray.getActor("some_name"); Assert.assertTrue(counter.isPresent()); .. tab-item:: C++ .. code-block:: c++ // Create an actor with a globally unique name ActorHandle counter = ray::Actor(CreateCounter).SetGlobalName("some_name").Remote(); ... // Retrieve the actor later somewhere boost::optional> counter = ray::GetGlobalActor("some_name"); We also support non-global named actors in C++, which means that the actor name is only valid within the job and the actor cannot be accessed from another job. .. code-block:: c++ // Create an actor with a job-scope-unique name ActorHandle counter = ray::Actor(CreateCounter).SetName("some_name").Remote(); ... // Retrieve the actor later somewhere in the same job boost::optional> counter = ray::GetActor("some_name"); .. note:: Named actors are scoped by namespace. If no namespace is assigned, they will be placed in an anonymous namespace by default. .. tab-set:: .. tab-item:: Python .. testcode:: :skipif: True import ray @ray.remote class Actor: pass # driver_1.py # Job 1 creates an actor, "orange" in the "colors" namespace. ray.init(address="auto", namespace="colors") Actor.options(name="orange", lifetime="detached").remote() # driver_2.py # Job 2 is now connecting to a different namespace. ray.init(address="auto", namespace="fruit") # This fails because "orange" was defined in the "colors" namespace. ray.get_actor("orange") # You can also specify the namespace explicitly. ray.get_actor("orange", namespace="colors") # driver_3.py # Job 3 connects to the original "colors" namespace ray.init(address="auto", namespace="colors") # This returns the "orange" actor we created in the first job. ray.get_actor("orange") .. tab-item:: Java .. code-block:: java import ray class Actor { } // Driver1.java // Job 1 creates an actor, "orange" in the "colors" namespace. System.setProperty("ray.job.namespace", "colors"); Ray.init(); Ray.actor(Actor::new).setName("orange").remote(); // Driver2.java // Job 2 is now connecting to a different namespace. System.setProperty("ray.job.namespace", "fruits"); Ray.init(); // This fails because "orange" was defined in the "colors" namespace. Optional> actor = Ray.getActor("orange"); Assert.assertFalse(actor.isPresent()); // actor.isPresent() is false. // Driver3.java System.setProperty("ray.job.namespace", "colors"); Ray.init(); // This returns the "orange" actor we created in the first job. Optional> actor = Ray.getActor("orange"); Assert.assertTrue(actor.isPresent()); // actor.isPresent() is true. Get-Or-Create a Named Actor --------------------------- A common use case is to create an actor only if it doesn't exist. Ray provides a ``get_if_exists`` option for actor creation that does this out of the box. This method is available after you set a name for the actor via ``.options()``. 
If the actor already exists, a handle to the actor will be returned and the arguments will be ignored. Otherwise, a new actor will be created with the specified arguments. .. tab-set:: .. tab-item:: Python .. literalinclude:: ../doc_code/get_or_create.py .. tab-item:: Java .. code-block:: java // This feature is not yet available in Java. .. tab-item:: C++ .. code-block:: c++ // This feature is not yet available in C++. .. _actor-lifetimes: Actor Lifetimes --------------- Separately, actor lifetimes can be decoupled from the job, allowing an actor to persist even after the driver process of the job exits. We call these actors *detached*. .. tab-set:: .. tab-item:: Python .. testcode:: counter = Counter.options(name="CounterActor", lifetime="detached").remote() The ``CounterActor`` will be kept alive even after the driver running above script exits. Therefore it is possible to run the following script in a different driver: .. testcode:: counter = ray.get_actor("CounterActor") Note that an actor can be named but not detached. If we only specified the name without specifying ``lifetime="detached"``, then the CounterActor can only be retrieved as long as the original driver is still running. .. tab-item:: Java .. code-block:: java System.setProperty("ray.job.namespace", "lifetime"); Ray.init(); ActorHandle counter = Ray.actor(Counter::new).setName("some_name").setLifetime(ActorLifetime.DETACHED).remote(); The CounterActor will be kept alive even after the driver running above process exits. Therefore it is possible to run the following code in a different driver: .. code-block:: java System.setProperty("ray.job.namespace", "lifetime"); Ray.init(); Optional> counter = Ray.getActor("some_name"); Assert.assertTrue(counter.isPresent()); .. tab-item:: C++ Customizing lifetime of an actor hasn't been implemented in C++ yet. Unlike normal actors, detached actors are not automatically garbage-collected by Ray. Detached actors must be manually destroyed once you are sure that they are no longer needed. To do this, use ``ray.kill`` to :ref:`manually terminate ` the actor. After this call, the actor's name may be reused. --- Out-of-band Communication ========================= Typically, Ray actor communication is done through actor method calls and data is shared through the distributed object store. However, in some use cases out-of-band communication can be useful. Wrapping Library Processes -------------------------- Many libraries already have mature, high-performance internal communication stacks and they leverage Ray as a language-integrated actor scheduler. The actual communication between actors is mostly done out-of-band using existing communication stacks. For example, Horovod-on-Ray uses NCCL or MPI-based collective communications, and RayDP uses Spark's internal RPC and object manager. See `Ray Distributed Library Patterns `_ for more details. Ray Collective -------------- Ray's collective communication library (\ ``ray.util.collective``\ ) allows efficient out-of-band collective and point-to-point communication between distributed CPUs or GPUs. See :ref:`Ray Collective ` for more details. HTTP Server ----------- You can start an HTTP server inside the actor and expose HTTP endpoints to clients so users outside of the Ray cluster can communicate with the actor. .. tab-set:: .. tab-item:: Python .. literalinclude:: ../doc_code/actor-http-server.py Similarly, you can expose other types of servers as well (e.g., gRPC servers). 
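The bundled ``actor-http-server.py`` example isn't reproduced inline here. As a rough sketch of the idea using only the Python standard library (the actor name, port, and choice of ``http.server`` are illustrative assumptions, not the contents of the actual doc_code example):

.. code-block:: python

    import ray
    from http.server import BaseHTTPRequestHandler, HTTPServer

    @ray.remote
    class HTTPServerActor:
        """Minimal sketch: run an HTTP server inside an actor process."""

        def serve(self, port: int = 8000) -> None:
            class Handler(BaseHTTPRequestHandler):
                def do_GET(self):
                    self.send_response(200)
                    self.end_headers()
                    self.wfile.write(b"hello from inside the actor")

            # Bind to all interfaces so clients outside this node can connect,
            # provided the port is reachable through your network setup.
            HTTPServer(("0.0.0.0", port), Handler).serve_forever()

    actor = HTTPServerActor.remote()
    # The call blocks inside the actor process, so don't `ray.get` it.
    actor.serve.remote(8000)

Clients that can reach the node can then issue plain HTTP requests to that port without going through Ray at all.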
Limitations ----------- When using out-of-band communication with Ray actors, keep in mind that Ray does not manage the calls between actors. This means that functionality like distributed reference counting will not work with out-of-band communication, so you should take care not to pass object references in this way. --- .. _actor-task-order: Actor Task Execution Order ========================== Synchronous, Single-Threaded Actor ---------------------------------- In Ray, an actor receives tasks from multiple submitters (including driver and workers). For tasks received from the same submitter, a synchronous, single-threaded actor executes them in the order they were submitted, unless you set ``allow_out_of_order_execution``, or Ray retries tasks. In other words, a given task will not be executed until previously submitted tasks from the same submitter have finished execution. For actors where `max_task_retries` is set to a non-zero number, the task execution order is not guaranteed when task retries occur. .. tab-set:: .. tab-item:: Python .. testcode:: import ray @ray.remote class Counter: def __init__(self): self.value = 0 def add(self, addition): self.value += addition return self.value counter = Counter.remote() # For tasks from the same submitter, # they are executed according to submission order. value0 = counter.add.remote(1) value1 = counter.add.remote(2) # Output: 1. The first submitted task is executed first. print(ray.get(value0)) # Output: 3. The later submitted task is executed later. print(ray.get(value1)) .. testoutput:: 1 3 However, the actor does not guarantee the execution order of the tasks from different submitters. For example, suppose an unfulfilled argument blocks a previously submitted task. In this case, the actor can still execute tasks submitted by a different worker. .. tab-set:: .. tab-item:: Python .. testcode:: import time import ray @ray.remote class Counter: def __init__(self): self.value = 0 def add(self, addition): self.value += addition return self.value counter = Counter.remote() # Submit task from a worker @ray.remote def submitter(value): return ray.get(counter.add.remote(value)) # Simulate delayed result resolution. @ray.remote def delayed_resolution(value): time.sleep(1) return value # Submit tasks from different workers, with # the first submitted task waiting for # dependency resolution. value0 = submitter.remote(delayed_resolution.remote(1)) value1 = submitter.remote(2) # Output: 3. The first submitted task is executed later. print(ray.get(value0)) # Output: 2. The later submitted task is executed first. print(ray.get(value1)) .. testoutput:: 3 2 Asynchronous or Threaded Actor ------------------------------ :ref:`Asynchronous or threaded actors ` do not guarantee the task execution order. This means the system might execute a task even though previously submitted tasks are pending execution. .. tab-set:: .. tab-item:: Python .. testcode:: import time import ray @ray.remote class AsyncCounter: def __init__(self): self.value = 0 async def add(self, addition): self.value += addition return self.value counter = AsyncCounter.remote() # Simulate delayed result resolution. @ray.remote def delayed_resolution(value): time.sleep(1) return value # Submit tasks from the driver, with # the first submitted task waiting for # dependency resolution. value0 = counter.add.remote(delayed_resolution.remote(1)) value1 = counter.add.remote(2) # Output: 3. The first submitted task is executed later. print(ray.get(value0)) # Output: 2. 
The later submitted task is executed first. print(ray.get(value1)) .. testoutput:: 3 2 --- Terminating Actors ================== Actor processes will be terminated automatically when all copies of the actor handle have gone out of scope in Python, or if the original creator process dies. When actors terminate gracefully, Ray calls the actor's ``__ray_shutdown__()`` method if defined, allowing for cleanup of resources (see :ref:`actor-cleanup` for details). Note that automatic termination of actors is not yet supported in Java or C++. .. _ray-kill-actors: Manual termination via an actor handle ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ In most cases, Ray will automatically terminate actors that have gone out of scope, but you may sometimes need to terminate an actor forcefully. This should be reserved for cases where an actor is unexpectedly hanging or leaking resources, and for :ref:`detached actors `, which must be manually destroyed. .. tab-set:: .. tab-item:: Python .. testcode:: import ray @ray.remote class Actor: pass actor_handle = Actor.remote() ray.kill(actor_handle) # Force kill: the actor exits immediately without cleanup. # This will NOT call __ray_shutdown__() or atexit handlers. .. tab-item:: Java .. code-block:: java actorHandle.kill(); // This will not go through the normal Java System.exit teardown logic, so any // shutdown hooks installed in the actor using ``Runtime.addShutdownHook(...)`` will // not be called. .. tab-item:: C++ .. code-block:: c++ actor_handle.Kill(); // This will not go through the normal C++ std::exit // teardown logic, so any exit handlers installed in // the actor using ``std::atexit`` will not be called. This will cause the actor to immediately exit its process, causing any current, pending, and future tasks to fail with a ``RayActorError``. If you would like Ray to :ref:`automatically restart ` the actor, make sure to set a nonzero ``max_restarts`` in the ``@ray.remote`` options for the actor, then pass the flag ``no_restart=False`` to ``ray.kill``. For :ref:`named and detached actors `, calling ``ray.kill`` on an actor handle destroys the actor and allows the name to be reused. Use `ray list actors --detail` from :ref:`State API ` to see the death cause of dead actors: .. code-block:: bash # This API is only available when you download Ray via `pip install "ray[default]"` ray list actors --detail .. code-block:: bash --- - actor_id: e8702085880657b355bf7ef001000000 class_name: Actor state: DEAD job_id: '01000000' name: '' node_id: null pid: 0 ray_namespace: dbab546b-7ce5-4cbb-96f1-d0f64588ae60 serialized_runtime_env: '{}' required_resources: {} death_cause: actor_died_error_context: # <---- You could see the error message w.r.t why the actor exits. error_message: The actor is dead because `ray.kill` killed it. owner_id: 01000000ffffffffffffffffffffffffffffffffffffffffffffffff owner_ip_address: 127.0.0.1 ray_namespace: dbab546b-7ce5-4cbb-96f1-d0f64588ae60 class_name: Actor actor_id: e8702085880657b355bf7ef001000000 never_started: true node_ip_address: '' pid: 0 name: '' is_detached: false placement_group_id: null repr_name: '' Manual termination within the actor ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ If necessary, you can manually terminate an actor from within one of the actor methods. This will kill the actor process and release resources associated/assigned to the actor. .. tab-set:: .. tab-item:: Python .. 
testcode:: @ray.remote class Actor: def exit(self): ray.actor.exit_actor() actor = Actor.remote() actor.exit.remote() This approach should generally not be necessary as actors are automatically garbage collected. The ``ObjectRef`` resulting from the task can be waited on to wait for the actor to exit (calling ``ray.get()`` on it will raise a ``RayActorError``). .. tab-item:: Java .. code-block:: java Ray.exitActor(); Garbage collection for actors hasn't been implemented yet, so this is currently the only way to terminate an actor gracefully. The ``ObjectRef`` resulting from the task can be waited on to wait for the actor to exit (calling ``ObjectRef::get`` on it will throw a ``RayActorException``). .. tab-item:: C++ .. code-block:: c++ ray::ExitActor(); Garbage collection for actors hasn't been implemented yet, so this is currently the only way to terminate an actor gracefully. The ``ObjectRef`` resulting from the task can be waited on to wait for the actor to exit (calling ``ObjectRef::Get`` on it will throw a ``RayActorException``). Note that this method of termination waits until any previously submitted tasks finish executing and then exits the process gracefully with sys.exit. You could see the actor is dead as a result of the user's `exit_actor()` call: .. code-block:: bash # This API is only available when you download Ray via `pip install "ray[default]"` ray list actors --detail .. code-block:: bash --- - actor_id: 070eb5f0c9194b851bb1cf1602000000 class_name: Actor state: DEAD job_id: '02000000' name: '' node_id: 47ccba54e3ea71bac244c015d680e202f187fbbd2f60066174a11ced pid: 47978 ray_namespace: 18898403-dda0-485a-9c11-e9f94dffcbed serialized_runtime_env: '{}' required_resources: {} death_cause: actor_died_error_context: error_message: 'The actor is dead because its worker process has died. Worker exit type: INTENDED_USER_EXIT Worker exit detail: Worker exits by a user request. exit_actor() is called.' owner_id: 02000000ffffffffffffffffffffffffffffffffffffffffffffffff owner_ip_address: 127.0.0.1 node_ip_address: 127.0.0.1 pid: 47978 ray_namespace: 18898403-dda0-485a-9c11-e9f94dffcbed class_name: Actor actor_id: 070eb5f0c9194b851bb1cf1602000000 name: '' never_started: false is_detached: false placement_group_id: null repr_name: '' .. _actor-cleanup: Actor cleanup with `__ray_shutdown__` ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ When an actor terminates gracefully, Ray calls the ``__ray_shutdown__()`` method if it exists, allowing cleanup of resources like database connections or file handles. .. tab-set:: .. tab-item:: Python .. testcode:: import ray import tempfile import os @ray.remote class FileProcessorActor: def __init__(self): self.temp_file = tempfile.NamedTemporaryFile(delete=False) self.temp_file.write(b"processing data") self.temp_file.flush() def __ray_shutdown__(self): # Clean up temporary file if hasattr(self, 'temp_file'): self.temp_file.close() os.unlink(self.temp_file.name) def process(self): return "done" actor = FileProcessorActor.remote() ray.get(actor.process.remote()) del actor # __ray_shutdown__() is called automatically When ``__ray_shutdown__()`` is called: - **Automatic termination**: When all actor handles go out of scope (``del actor`` or natural scope exit) - **Manual graceful termination**: When you call ``actor.__ray_terminate__.remote()`` When ``__ray_shutdown__()`` is **NOT** called: - **Force kill**: When you use ``ray.kill(actor)`` - the actor is killed immediately without cleanup. 
- **Unexpected termination**: When the actor process crashes or exits unexpectedly (such as a segfault or being killed by the OOM killer). **Important notes:** - ``__ray_shutdown__()`` runs after all actor tasks complete. - By default, Ray waits 30 seconds for the graceful shutdown procedure (including ``__ray_shutdown__()``) to complete. If the actor doesn't exit within this timeout, it's force killed. Configure this with ``ray.init(_system_config={"actor_graceful_shutdown_timeout_ms": 60000})``. - Exceptions in ``__ray_shutdown__()`` are caught and logged but don't prevent actor termination. - ``__ray_shutdown__()`` must be a synchronous method, including for async actors. --- .. _ray-remote-classes: .. _actor-guide: Actors ====== Actors extend the Ray API from functions (tasks) to classes. An actor is essentially a stateful worker (or a service). When you instantiate a new actor, Ray creates a new worker and schedules methods of the actor on that specific worker. The methods can access and mutate the state of that worker. .. tab-set:: .. tab-item:: Python The ``ray.remote`` decorator indicates that instances of the ``Counter`` class are actors. Each actor runs in its own Python process. .. testcode:: import ray @ray.remote class Counter: def __init__(self): self.value = 0 def increment(self): self.value += 1 return self.value def get_counter(self): return self.value # Create an actor from this class. counter = Counter.remote() .. tab-item:: Java ``Ray.actor`` is used to create actors from regular Java classes. .. code-block:: java // A regular Java class. public class Counter { private int value = 0; public int increment() { this.value += 1; return this.value; } } // Create an actor from this class. // `Ray.actor` takes a factory method that can produce // a `Counter` object. Here, we pass `Counter`'s constructor // as the argument. ActorHandle counter = Ray.actor(Counter::new).remote(); .. tab-item:: C++ ``ray::Actor`` is used to create actors from regular C++ classes. .. code-block:: c++ // A regular C++ class. class Counter { private: int value = 0; public: int Increment() { value += 1; return value; } }; // Factory function of Counter class. static Counter *CreateCounter() { return new Counter(); }; RAY_REMOTE(&Counter::Increment, CreateCounter); // Create an actor from this class. // `ray::Actor` takes a factory method that can produce // a `Counter` object. Here, we pass `Counter`'s factory function // as the argument. auto counter = ray::Actor(CreateCounter).Remote(); Use `ray list actors` from :ref:`State API ` to see actors states: .. code-block:: bash # This API is only available when you install Ray with `pip install "ray[default]"`. ray list actors .. code-block:: bash ======== List: 2023-05-25 10:10:50.095099 ======== Stats: ------------------------------ Total: 1 Table: ------------------------------ ACTOR_ID CLASS_NAME STATE JOB_ID NAME NODE_ID PID RAY_NAMESPACE 0 9e783840250840f87328c9f201000000 Counter ALIVE 01000000 13a475571662b784b4522847692893a823c78f1d3fd8fd32a2624923 38906 ef9de910-64fb-4575-8eb5-50573faa3ddf Specifying required resources ----------------------------- .. _actor-resource-guide: Specify resource requirements in actors. See :ref:`resource-requirements` for more details. .. tab-set:: .. tab-item:: Python .. testcode:: # Specify required resources for an actor. @ray.remote(num_cpus=2, num_gpus=0.5) class Actor: pass .. tab-item:: Java .. code-block:: java // Specify required resources for an actor. 
Ray.actor(Counter::new).setResource("CPU", 2.0).setResource("GPU", 0.5).remote(); .. tab-item:: C++ .. code-block:: c++ // Specify required resources for an actor. ray::Actor(CreateCounter).SetResource("CPU", 2.0).SetResource("GPU", 0.5).Remote(); Calling the actor ----------------- You can interact with the actor by calling its methods with the ``remote`` operator. You can then call ``get`` on the object ref to retrieve the actual value. .. tab-set:: .. tab-item:: Python .. testcode:: # Call the actor. obj_ref = counter.increment.remote() print(ray.get(obj_ref)) .. testoutput:: 1 .. tab-item:: Java .. code-block:: java // Call the actor. ObjectRef<Integer> objectRef = counter.task(Counter::increment).remote(); Assert.assertTrue(objectRef.get() == 1); .. tab-item:: C++ .. code-block:: c++ // Call the actor. auto object_ref = counter.Task(&Counter::Increment).Remote(); assert(*object_ref.Get() == 1); Methods called on different actors execute in parallel, and methods called on the same actor execute serially in the order you call them. Methods on the same actor share state with one another, as shown below. .. tab-set:: .. tab-item:: Python .. testcode:: # Create ten Counter actors. counters = [Counter.remote() for _ in range(10)] # Increment each Counter once and get the results. These tasks all happen in # parallel. results = ray.get([c.increment.remote() for c in counters]) print(results) # Increment the first Counter five times. These tasks are executed serially # and share state. results = ray.get([counters[0].increment.remote() for _ in range(5)]) print(results) .. testoutput:: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1] [2, 3, 4, 5, 6] .. tab-item:: Java .. code-block:: java // Create ten Counter actors. List<ActorHandle<Counter>> counters = new ArrayList<>(); for (int i = 0; i < 10; i++) { counters.add(Ray.actor(Counter::new).remote()); } // Increment each Counter once and get the results. These tasks all happen in // parallel. List<ObjectRef<Integer>> objectRefs = new ArrayList<>(); for (ActorHandle<Counter> counterActor : counters) { objectRefs.add(counterActor.task(Counter::increment).remote()); } // prints [1, 1, 1, 1, 1, 1, 1, 1, 1, 1] System.out.println(Ray.get(objectRefs)); // Increment the first Counter five times. These tasks are executed serially // and share state. objectRefs = new ArrayList<>(); for (int i = 0; i < 5; i++) { objectRefs.add(counters.get(0).task(Counter::increment).remote()); } // prints [2, 3, 4, 5, 6] System.out.println(Ray.get(objectRefs)); .. tab-item:: C++ .. code-block:: c++ // Create ten Counter actors. std::vector<ray::ActorHandle<Counter>> counters; for (int i = 0; i < 10; i++) { counters.emplace_back(ray::Actor(CreateCounter).Remote()); } // Increment each Counter once and get the results. These tasks all happen in // parallel. std::vector<ray::ObjectRef<int>> object_refs; for (ray::ActorHandle<Counter> counter_actor : counters) { object_refs.emplace_back(counter_actor.Task(&Counter::Increment).Remote()); } // prints 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 auto results = ray::Get(object_refs); for (const auto &result : results) { std::cout << *result; } // Increment the first Counter five times. These tasks are executed serially // and share state. object_refs.clear(); for (int i = 0; i < 5; i++) { object_refs.emplace_back(counters[0].Task(&Counter::Increment).Remote()); } // prints 2, 3, 4, 5, 6 results = ray::Get(object_refs); for (const auto &result : results) { std::cout << *result; } Passing around actor handles ---------------------------- You can pass actor handles into other tasks. You can also define remote functions or actor methods that use actor handles. .. 
tab-set:: .. tab-item:: Python .. testcode:: import time @ray.remote def f(counter): for _ in range(10): time.sleep(0.1) counter.increment.remote() .. tab-item:: Java .. code-block:: java public static class MyRayApp { public static void foo(ActorHandle<Counter> counter) throws InterruptedException { for (int i = 0; i < 1000; i++) { TimeUnit.MILLISECONDS.sleep(100); counter.task(Counter::increment).remote(); } } } .. tab-item:: C++ .. code-block:: c++ void Foo(ray::ActorHandle<Counter> counter) { for (int i = 0; i < 1000; i++) { std::this_thread::sleep_for(std::chrono::milliseconds(100)); counter.Task(&Counter::Increment).Remote(); } } If you instantiate an actor, you can pass the handle around to various tasks. .. tab-set:: .. tab-item:: Python .. testcode:: counter = Counter.remote() # Start some tasks that use the actor. [f.remote(counter) for _ in range(3)] # Print the counter value. for _ in range(10): time.sleep(0.1) print(ray.get(counter.get_counter.remote())) .. testoutput:: :options: +MOCK 0 3 8 10 15 18 20 25 30 30 .. tab-item:: Java .. code-block:: java ActorHandle<Counter> counter = Ray.actor(Counter::new).remote(); // Start some tasks that use the actor. for (int i = 0; i < 3; i++) { Ray.task(MyRayApp::foo, counter).remote(); } // Print the counter value. for (int i = 0; i < 10; i++) { TimeUnit.SECONDS.sleep(1); System.out.println(counter.task(Counter::getCounter).remote().get()); } .. tab-item:: C++ .. code-block:: c++ auto counter = ray::Actor(CreateCounter).Remote(); // Start some tasks that use the actor. for (int i = 0; i < 3; i++) { ray::Task(Foo).Remote(counter); } // Print the counter value. for (int i = 0; i < 10; i++) { std::this_thread::sleep_for(std::chrono::seconds(1)); std::cout << *counter.Task(&Counter::GetCounter).Remote().Get() << std::endl; } Type hints and static typing for actors --------------------------------------- Ray supports Python type hints for both remote functions and actors, enabling better IDE support and static type checking. To get the best type inference and pass type checkers when working with actors, follow these patterns: - **Prefer** ``ray.remote(MyClass)`` **over** ``@ray.remote`` **for actors**: Instead of decorating your class with ``@ray.remote``, use ``ActorClass = ray.remote(MyClass)``. This preserves the original class type and allows type checkers and IDEs to infer the correct types. - **Use** ``@ray.method`` **for actor methods**: Decorate actor methods with ``@ray.method`` to enable type hints for remote method calls on actor handles. - **Use the** ``ActorClass`` **and** ``ActorProxy`` **types**: When you instantiate an actor, annotate the handle as ``ActorProxy[MyClass]`` to get type hints for remote methods. **Example:** .. testcode:: import ray from ray.actor import ActorClass, ActorProxy class Counter: def __init__(self): self.value = 0 @ray.method def increment(self) -> int: self.value += 1 return self.value CounterActor: ActorClass[Counter] = ray.remote(Counter) counter: ActorProxy[Counter] = CounterActor.remote() # Type checkers and IDEs will now provide type hints for remote methods obj_ref: ray.ObjectRef[int] = counter.increment.remote() print(ray.get(obj_ref)) For more details and advanced patterns, see :ref:`Type hints in Ray `. Generators ---------- Ray is compatible with Python generator syntax. See :ref:`Ray Generators ` for more details. Cancelling actor tasks ---------------------- Cancel Actor Tasks by calling :func:`ray.cancel() ` on the returned `ObjectRef`. .. tab-set:: .. tab-item:: Python .. 
literalinclude:: doc_code/actors.py :language: python :start-after: __cancel_start__ :end-before: __cancel_end__ In Ray, Task cancellation behavior depends on the Task's current state: **Unscheduled tasks**: If Ray hasn't scheduled an Actor Task yet, Ray attempts to cancel the scheduling. If Ray successfully cancels the Task at this stage, calling ``ray.get(actor_task_ref)`` raises a :class:`TaskCancelledError `. **Running actor tasks (regular actor, threaded actor)**: For tasks classified as a single-threaded Actor or a multi-threaded Actor, Ray sets a cancellation flag that can be checked via ``ray.get_runtime_context().is_canceled()``. This allows for graceful cancellation by periodically checking the cancellation status within the task. **Running async actor tasks**: For Tasks classified as :ref:`async Actors `, Ray seeks to cancel the associated `asyncio.Task`. This cancellation approach aligns with the standards presented in `asyncio task cancellation `__. Note that `asyncio.Task` won't be interrupted in the middle of execution if you don't `await` within the async function. Note: ``ray.get_runtime_context().is_canceled()`` is not supported for async actors and will raise a ``RuntimeError``. **Cancellation guarantee**: Ray attempts to cancel Tasks on a *best-effort* basis, meaning cancellation isn't always guaranteed. For example, if the cancellation request doesn't get through to the executor, the Task might not be cancelled. You can check whether a Task was successfully cancelled using ``ray.get(actor_task_ref)``. **Recursive cancellation**: Ray tracks all child and Actor Tasks. When the ``recursive=True`` argument is given, it cancels all child and Actor Tasks. Detecting cancellation in running actor tasks ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ For non-async actor tasks, you can periodically check whether a cancellation has been requested by calling ``ray.get_runtime_context().is_canceled()``. This allows tasks to detect cancellation and perform cleanup operations before exiting gracefully. .. tab-set:: .. tab-item:: Python .. literalinclude:: doc_code/actors.py :language: python :start-after: __cancel_graceful_actor_start__ :end-before: __cancel_graceful_actor_end__ **Important notes:** - For **non-async actor tasks**, direct interruption is not supported. You need to check ``is_canceled()`` periodically to detect cancellation requests. - ``is_canceled()`` is **not supported** for async actor tasks and will raise a ``RuntimeError``. Scheduling ---------- For each actor, Ray chooses a node to run it on, and bases the scheduling decision on a few factors like :ref:`the actor's resource requirements ` and :ref:`the specified scheduling strategy `. See :ref:`Ray scheduling ` for more details. Fault Tolerance --------------- By default, Ray actors won't be :ref:`restarted ` and actor tasks won't be retried when actors crash unexpectedly. You can change this behavior by setting ``max_restarts`` and ``max_task_retries`` options in :func:`ray.remote() ` and :meth:`.options() `. See :ref:`Ray fault tolerance ` for more details. FAQ: Actors, Workers and Resources ---------------------------------- What's the difference between a worker and an actor? Each "Ray worker" is a Python process. Ray treats a worker differently for tasks and actors. For tasks, Ray uses a "Ray worker" to execute multiple Ray tasks. For actors, Ray starts a "Ray worker" as a dedicated Ray actor. * **Tasks**: When Ray starts on a machine, a number of Ray workers start automatically (1 per CPU by default). 
Ray uses them to execute tasks (like a process pool). If you execute 8 tasks with `num_cpus=2`, and total number of CPUs is 16 (`ray.cluster_resources()["CPU"] == 16`), you end up with 8 of your 16 workers idling. * **Actor**: A Ray Actor is also a "Ray worker" but you instantiate it at runtime with `actor_cls.remote()`. All of its methods run on the same process, using the same resources Ray designates when you define the Actor. Note that unlike tasks, Ray doesn't reuse the Python processes that run Ray Actors. Ray terminates them when you delete the Actor. To maximally utilize your resources, you want to maximize the time that your workers work. You also want to allocate enough cluster resources so Ray can run all of your needed actors and any other tasks you define. This also implies that Ray schedules tasks more flexibly, and that if you don't need the stateful part of an actor, it's better to use tasks. Task Events ----------- By default, Ray traces the execution of actor tasks, reporting task status events and profiling events that Ray Dashboard and :ref:`State API ` use. You can disable task event reporting for the actor by setting the `enable_task_events` option to `False` in :func:`ray.remote() ` and :meth:`.options() `. This setting reduces the overhead of task execution by reducing the amount of data Ray sends to the Ray Dashboard. You can also disable task event reporting for some actor methods by setting the `enable_task_events` option to `False` in :func:`ray.remote() ` and :meth:`.options() ` on the actor method. Method settings override the actor setting: .. literalinclude:: doc_code/actors.py :language: python :start-after: __enable_task_events_start__ :end-before: __enable_task_events_end__ More about Ray Actors --------------------- .. toctree:: :maxdepth: 1 actors/named-actors.rst actors/terminating-actors.rst actors/async_api.rst actors/concurrency_group_api.rst actors/actor-utils.rst actors/out-of-band-communication.rst actors/task-orders.rst --- Advanced topics =============== This section covers extended topics on how to use Ray. .. toctree:: :maxdepth: -1 tips-for-first-time type-hint starting-ray ray-generator namespaces cross-language using-ray-with-jupyter ray-dag miscellaneous runtime_env_auth user-spawn-processes --- Ray Core CLI ============ .. _ray-cli: Debugging applications ---------------------- This section contains commands for inspecting and debugging the current cluster. .. _ray-stack-doc: .. click:: ray.scripts.scripts:stack :prog: ray stack :show-nested: .. _ray-memory-doc: .. click:: ray.scripts.scripts:memory :prog: ray memory :show-nested: .. _ray-timeline-doc: .. click:: ray.scripts.scripts:timeline :prog: ray timeline :show-nested: .. _ray-status-doc: .. click:: ray.scripts.scripts:status :prog: ray status :show-nested: .. click:: ray.scripts.scripts:debug :prog: ray debug :show-nested: Usage Stats ----------- This section contains commands to enable/disable :ref:`Ray usage stats `. .. _ray-disable-usage-stats-doc: .. click:: ray.scripts.scripts:disable_usage_stats :prog: ray disable-usage-stats :show-nested: .. _ray-enable-usage-stats-doc: .. click:: ray.scripts.scripts:enable_usage_stats :prog: ray enable-usage-stats :show-nested: --- Core API ======== .. autosummary:: :nosignatures: :toctree: doc/ ray.init ray.shutdown ray.is_initialized ray.job_config.JobConfig ray.LoggingConfig Tasks ----- .. autosummary:: :nosignatures: :toctree: doc/ ray.remote ray.remote_function.RemoteFunction.options ray.cancel Actors ------ .. 
autosummary:: :nosignatures: :toctree: doc/ ray.remote ray.actor.ActorClass ray.actor.ActorClass.options ray.actor.ActorMethod ray.actor.ActorHandle ray.actor.ActorClassInheritanceException ray.actor.exit_actor ray.method ray.get_actor ray.kill Objects ------- .. autosummary:: :nosignatures: :toctree: doc/ ray.get ray.wait ray.put ray.util.as_completed ray.util.map_unordered .. _runtime-context-apis: Runtime Context --------------- .. autosummary:: :nosignatures: :toctree: doc/ ray.runtime_context.get_runtime_context ray.runtime_context.RuntimeContext ray.get_gpu_ids Cross Language -------------- .. autosummary:: :nosignatures: :toctree: doc/ ray.cross_language.java_function ray.cross_language.java_actor_class --- Ray Direct Transport (RDT) API ============================== Usage with Core APIs -------------------- Enable RDT for actor tasks with the :func:`@ray.method ` decorator, or pass `_tensor_transport` to :func:`ray.put`. You can then pass the resulting `ray.ObjectRef` to other actor tasks, or use :func:`ray.get` to retrieve the result. See :ref:`Ray Direct Transport (RDT) ` for more details on usage. .. autosummary:: :nosignatures: :toctree: doc/ ray.method ray.put ray.get Collective tensor transports ---------------------------- Collective tensor transports require a collective group to be created before RDT objects can be used. Use these methods to create and manage collective groups for the `gloo` and `nccl` tensor transports. .. autosummary:: :nosignatures: :toctree: doc/ ray.experimental.collective.create_collective_group ray.experimental.collective.get_collective_groups ray.experimental.collective.destroy_collective_group Advanced APIs ------------- .. autosummary:: :nosignatures: :toctree: doc/ ray.experimental.wait_tensor_freed --- .. _ray-core-exceptions: Exceptions ========== .. autosummary:: :nosignatures: :toctree: doc/ ray.exceptions.RayError ray.exceptions.RayTaskError ray.exceptions.RayActorError ray.exceptions.TaskCancelledError ray.exceptions.TaskUnschedulableError ray.exceptions.ActorDiedError ray.exceptions.ActorUnschedulableError ray.exceptions.ActorUnavailableError ray.exceptions.AsyncioActorExit ray.exceptions.LocalRayletDiedError ray.exceptions.WorkerCrashedError ray.exceptions.TaskPlacementGroupRemoved ray.exceptions.ActorPlacementGroupRemoved ray.exceptions.ObjectStoreFullError ray.exceptions.OutOfDiskError ray.exceptions.OutOfMemoryError ray.exceptions.ObjectLostError ray.exceptions.ObjectFetchTimedOutError ray.exceptions.GetTimeoutError ray.exceptions.OwnerDiedError ray.exceptions.PendingCallsLimitExceeded ray.exceptions.PlasmaObjectNotAvailable ray.exceptions.ObjectReconstructionFailedError ray.exceptions.ObjectReconstructionFailedMaxAttemptsExceededError ray.exceptions.ObjectReconstructionFailedLineageEvictedError ray.exceptions.RayChannelError ray.exceptions.RayChannelTimeoutError ray.exceptions.RayCgraphCapacityExceeded ray.exceptions.RuntimeEnvSetupError ray.exceptions.CrossLanguageError ray.exceptions.RaySystemError ray.exceptions.NodeDiedError ray.exceptions.UnserializableException ray.exceptions.AuthenticationError --- Ray Core API ============ .. toctree:: :maxdepth: 2 core.rst scheduling.rst runtime-env.rst utility.rst exceptions.rst cli.rst ../../ray-observability/reference/cli.rst ../../ray-observability/reference/api.rst direct-transport.rst --- Runtime Env API =============== .. 
autosummary:: :nosignatures: :toctree: doc/ ray.runtime_env.RuntimeEnvConfig ray.runtime_env.RuntimeEnv --- Scheduling API ============== Scheduling Strategy ------------------- .. autosummary:: :nosignatures: :toctree: doc/ ray.util.scheduling_strategies.PlacementGroupSchedulingStrategy ray.util.scheduling_strategies.NodeAffinitySchedulingStrategy .. _ray-placement-group-ref: Placement Group --------------- .. autosummary:: :nosignatures: :toctree: doc/ ray.util.placement_group ray.util.placement_group.get_placement_group ray.util.placement_group.PlacementGroup ray.util.placement_group_table ray.util.remove_placement_group ray.util.get_current_placement_group --- Utility ======= .. autosummary:: :nosignatures: :toctree: doc/ ray.util.ActorPool ray.util.queue.Queue ray.util.list_named_actors ray.util.serialization.register_serializer ray.util.serialization.deregister_serializer ray.util.tpu.get_current_pod_worker_count ray.util.tpu.get_current_pod_name ray.util.tpu.get_num_tpu_chips_on_node ray.util.tpu.get_tpu_coordinator_env_vars ray.util.tpu.get_tpu_version_from_type ray.util.tpu.get_tpu_worker_resources ray.util.tpu.SlicePlacementGroup ray.util.tpu.slice_placement_group ray.nodes ray.cluster_resources ray.available_resources .. Other docs have references to these ray.util.queue.Empty ray.util.queue.Full .. _custom-metric-api-ref: Custom Metrics -------------- .. autosummary:: :nosignatures: :toctree: doc/ ray.util.metrics.Counter ray.util.metrics.Gauge ray.util.metrics.Histogram .. _package-ref-debugging-apis: Debugging --------- .. autosummary:: :nosignatures: :toctree: doc/ ray.util.rpdb.set_trace ray.util.inspect_serializability ray.timeline --- Compiled Graph API ================== Input and Output Nodes ---------------------- .. autosummary:: :nosignatures: :toctree: doc/ ray.dag.input_node.InputNode ray.dag.output_node.MultiOutputNode DAG Construction ---------------- .. autosummary:: :nosignatures: :toctree: doc/ ray.actor.ActorMethod.bind ray.dag.DAGNode.with_tensor_transport ray.experimental.compiled_dag_ref.CompiledDAGRef Compiled Graph Operations ------------------------- .. autosummary:: :nosignatures: :toctree: doc/ ray.dag.DAGNode.experimental_compile ray.dag.compiled_dag_node.CompiledDAG.execute ray.dag.compiled_dag_node.CompiledDAG.visualize Configurations -------------- .. autosummary:: :nosignatures: :toctree: doc/ ray.dag.context.DAGContext --- .. _compiled-graph-overlap: Experimental: Overlapping communication and computation ======================================================= Compiled Graph currently provides experimental support for GPU communication and computation overlap. When you turn this feature on, it automatically overlaps the GPU communication with computation operations, thereby hiding the communication overhead and improving performance. To enable this feature, specify ``_overlap_gpu_communication=True`` when calling :func:`dag.experimental_compile() `. The following code has GPU communication and computation operations that benefit from overlapping. .. literalinclude:: ../doc_code/cgraph_overlap.py :language: python :start-after: __cgraph_overlap_start__ :end-before: __cgraph_overlap_end__ The output of the preceding code includes the following two lines: .. testoutput:: overlap_gpu_communication=False, duration=1.0670117866247892 overlap_gpu_communication=True, duration=0.9211348341777921 The actual performance numbers may vary on different hardware, but enabling ``_overlap_gpu_communication`` improves latency by about 14% for this example. 
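For reference, the following is a minimal sketch of where the flag goes. The ``sender`` and ``receiver`` handles and the ``data`` input are hypothetical GPU actors and inputs that exchange ``torch.Tensor`` objects over NCCL, so the snippet isn't meant to run as-is:

.. testcode::
    :skipif: True

    import ray
    from ray.dag import InputNode

    # `sender` and `receiver` are hypothetical GPU actor handles whose
    # methods produce and consume torch.Tensor objects.
    with InputNode() as inp:
        dag = sender.send.bind(inp)
        # Transfer the tensor over NCCL so that the communication can
        # overlap with the receiver's computation.
        dag = dag.with_tensor_transport("nccl")
        dag = receiver.recv.bind(dag)

    # Enable the experimental overlap of GPU communication and computation.
    compiled_dag = dag.experimental_compile(_overlap_gpu_communication=True)
    result = ray.get(compiled_dag.execute(data))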
--- Profiling ========= Ray Compiled Graph provides both PyTorch-based and Nsight-based profiling functionalities to better understand the performance of individual tasks, system overhead, and performance bottlenecks. You can pick your favorite profiler based on your preference. PyTorch profiler ---------------- To run PyTorch Profiling on Compiled Graph, simply set the environment variable ``RAY_CGRAPH_ENABLE_TORCH_PROFILING=1`` when running the script. For example, for a Compiled Graph script in ``example.py``, run the following command: .. code-block:: bash RAY_CGRAPH_ENABLE_TORCH_PROFILING=1 python3 example.py After execution, Compiled Graph generates the profiling results in the `compiled_graph_torch_profiles` directory under the current working directory. Compiled Graph generates one trace file per actor. You can visualize traces by using https://ui.perfetto.dev/. Nsight system profiler ---------------------- Compiled Graph builds on top of Ray's profiling capabilities, and leverages Nsight system profiling. To run Nsight Profiling on Compiled Graph, specify the runtime_env for the involved actors as described in :ref:`Run Nsight on Ray `. For example, .. literalinclude:: ../doc_code/cgraph_profiling.py :language: python :start-after: __profiling_setup_start__ :end-before: __profiling_setup_end__ Then, create a Compiled Graph as usual. .. literalinclude:: ../doc_code/cgraph_profiling.py :language: python :start-after: __profiling_execution_start__ :end-before: __profiling_execution_end__ Finally, run the script as usual. .. code-block:: bash python3 example.py After execution, Compiled Graph generates the profiling results under the `/tmp/ray/session_*/logs/{profiler_name}` directory. For fine-grained performance analysis of method calls and system overhead, set the environment variable ``RAY_CGRAPH_ENABLE_NVTX_PROFILING=1`` when running the script: .. code-block:: bash RAY_CGRAPH_ENABLE_NVTX_PROFILING=1 python3 example.py This command leverages the `NVTX library `_ under the hood to automatically annotate all methods called in the execution loops of compiled graph. To visualize the profiling results, follow the same instructions as described in :ref:`Nsight Profiling Result `. Visualization ------------- To visualize the graph structure, call the :func:`visualize ` method after calling :func:`experimental_compile ` on the graph. .. literalinclude:: ../doc_code/cgraph_visualize.py :language: python :start-after: __cgraph_visualize_start__ :end-before: __cgraph_visualize_end__ By default, Ray generates a PNG image named ``compiled_graph.png`` and saves it in the current working directory. Note that this requires ``graphviz``. The following image shows the visualization for the preceding code. Tasks that belong to the same actor are the same color. .. image:: ../../images/compiled_graph_viz.png :alt: Visualization of Graph Structure :align: center --- Quickstart ========== Hello World ----------- This "hello world" example uses Ray Compiled Graph. First, install Ray. .. code-block:: bash pip install "ray[cgraph]" # For a ray version before 2.41, use the following instead: # pip install "ray[adag]" First, define a simple actor that echoes its argument. .. literalinclude:: ../doc_code/cgraph_quickstart.py :language: python :start-after: __simple_actor_start__ :end-before: __simple_actor_end__ Next instantiate the actor and use the classic Ray Core APIs ``remote`` and ``ray.get`` to execute tasks on the actor. .. 
literalinclude:: ../doc_code/cgraph_quickstart.py :language: python :start-after: __ray_core_usage_start__ :end-before: __ray_core_usage_end__ .. code-block:: Execution takes 969.0364822745323 us Now, create an equivalent program using Ray Compiled Graph. First, define a graph to execute using classic Ray Core, without any compilation. Later, compile this graph, to apply optimizations and prevent further modifications to the graph. First, create a :ref:`Ray DAG ` (directed acyclic graph), which is a lazily executed graph of Ray tasks. Note 3 key differences with the classic Ray Core APIs: 1. Use the :class:`ray.dag.InputNode ` context manager to indicate which inputs to the DAG should be provided at run time. 2. Use :func:`bind() ` instead of :func:`remote() ` to indicate lazily executed Ray tasks. 3. Use :func:`execute() ` to execute the DAG. Here, define a graph and execute it. Note that there is **no** compilation happening here. This uses the same execution backend as the preceding example: .. literalinclude:: ../doc_code/cgraph_quickstart.py :language: python :start-after: __dag_usage_start__ :end-before: __dag_usage_end__ Next, compile the ``dag`` using the :func:`experimental_compile ` API. The graph uses the same APIs for execution: .. literalinclude:: ../doc_code/cgraph_quickstart.py :language: python :start-after: __cgraph_usage_start__ :end-before: __cgraph_usage_end__ .. code-block:: Execution takes 86.72196418046951 us The performance of the same task graph improved by 10X. This is because the function ``echo`` is cheap and thus highly affected by the system overhead. Due to various bookkeeping and distributed protocols, the classic Ray Core APIs usually have 1 ms+ system overhead. Because the system knows the task graph ahead of time, Ray Compiled Graphs can pre-allocate all necessary resources ahead of time and greatly reduce the system overhead. For example, if the actor ``a`` is on the same node as the driver, Ray Compiled Graphs uses shared memory instead of RPC to transfer data directly between the driver and the actor. Currently, the DAG tasks run on a **background thread** of the involved actors. An actor can only participate in one DAG at a time. Normal tasks can still execute on the actors while the actors participate in a Compiled Graph, but these tasks execute on the main thread. Once you're done, you can tear down the Compiled Graph by deleting it or explicitly calling ``dag.teardown()``. This allows reuse of the actors in a new Compiled Graph. .. literalinclude:: ../doc_code/cgraph_quickstart.py :language: python :start-after: __teardown_start__ :end-before: __teardown_end__ Specifying data dependencies ---------------------------- When creating the DAG, a ``ray.dag.DAGNode`` can be passed as an argument to other ``.bind`` calls to specify data dependencies. For example, the following uses the preceding example to create a DAG that passes the same message from one actor to another: .. literalinclude:: ../doc_code/cgraph_quickstart.py :language: python :start-after: __cgraph_bind_start__ :end-before: __cgraph_bind_end__ .. code-block:: hello Here is another example that passes the same message to both actors, which can then execute in parallel. It uses :class:`ray.dag.MultiOutputNode ` to indicate that this DAG returns multiple outputs. Then, :func:`dag.execute() ` returns multiple :class:`CompiledDAGRef ` objects, one per node: .. 
literalinclude:: ../doc_code/cgraph_quickstart.py :language: python :start-after: __cgraph_multi_output_start__ :end-before: __cgraph_multi_output_end__ .. code-block:: Execution takes 86.72196418046951 us Be aware that: * On the same actor, a Compiled Graph executes in order. If an actor has multiple tasks in the same Compiled Graph, it executes all of them to completion before executing on the next DAG input. * Across actors in the same Compiled Graph, the execution may be pipelined. An actor may begin executing on the next DAG input while a downstream actor executes on the current one. * Compiled Graphs currently only supports actor tasks. Non-actor tasks aren't supported. ``asyncio`` support ------------------- If your Compiled Graph driver is running in an ``asyncio`` event loop, use the ``async`` APIs to ensure that executing the Compiled Graph and getting the results doesn't block the event loop. First, pass ``enable_async=True`` to the ``dag.experimental_compile()``: .. literalinclude:: ../doc_code/cgraph_quickstart.py :language: python :start-after: __cgraph_async_compile_start__ :end-before: __cgraph_async_compile_end__ Next, use `execute_async` to invoke the Compiled Graph. Calling ``await`` on ``execute_async`` will return once the input has been submitted, and it returns a future that can be used to get the result. Finally, use `await` to get the result of the Compiled Graph. .. literalinclude:: ../doc_code/cgraph_quickstart.py :language: python :start-after: __cgraph_async_execute_start__ :end-before: __cgraph_async_execute_end__ Execution and failure semantics ------------------------------- Like classic Ray Core, Ray Compiled Graph propagates exceptions to the final output. In particular: - **Application exceptions**: If an application task throws an exception, Compiled Graph wraps the exception in a :class:`RayTaskError ` and raises it when the caller calls :func:`ray.get() ` on the result. The thrown exception inherits from both :class:`RayTaskError ` and the original exception class. - **System exceptions**: System exceptions include actor death or unexpected errors such as network errors. For actor death, Compiled Graph raises a :class:`ActorDiedError `, and for other errors, it raises a :class:`RayChannelError `. The graph can still execute after application exceptions. However, the graph automatically shuts down in the case of system exceptions. If an actor's death causes the graph to shut down, the remaining actors stay alive. For example, this example explicitly destroys an actor while it's participating in a Compiled Graph. The remaining actors are reusable: .. literalinclude:: ../doc_code/cgraph_quickstart.py :language: python :start-after: __cgraph_actor_death_start__ :end-before: __cgraph_actor_death_end__ Execution Timeouts ------------------ Some errors, such as NCCL network errors, require additional handling to avoid hanging. In the future, Ray may attempt to detect such errors, but currently as a fallback, it allows configurable timeouts for :func:`compiled_dag.execute() ` and :func:`ray.get() `. The default timeout is 10 seconds for both. Set the following environment variables to change the default timeout: - ``RAY_CGRAPH_submit_timeout``: Timeout for :func:`compiled_dag.execute() `. - ``RAY_CGRAPH_get_timeout``: Timeout for :func:`ray.get() `. :func:`ray.get() ` also has a timeout parameter to set timeout on a per-call basis. 
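For example, here is a minimal sketch of the per-call form. The ``compiled_dag`` and ``data`` names are placeholders, and the snippet assumes the timeout value is given in seconds, matching the 10-second default:

.. testcode::
    :skipif: True

    # Allow up to 60 seconds for this particular result instead of the
    # default configured by RAY_CGRAPH_get_timeout.
    ref = compiled_dag.execute(data)
    result = ray.get(ref, timeout=60)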
CPU to GPU communication ------------------------ With classic Ray Core, passing ``torch.Tensors`` between actors can become expensive, especially when transferring between devices. This is because Ray Core doesn't know the final destination device. Therefore, you may see unnecessary copies across devices other than the source and destination devices. Ray Compiled Graph ships with native support for passing ``torch.Tensors`` between actors executing on different devices. Developers can now use type hint annotations in the Compiled Graph declaration to indicate the final destination device of a ``torch.Tensor``. .. literalinclude:: ../doc_code/cgraph_nccl.py :language: python :start-after: __cgraph_cpu_to_gpu_actor_start__ :end-before: __cgraph_cpu_to_gpu_actor_end__ In Ray Core, if you try to pass a CPU tensor from the driver, the GPU actor receives a CPU tensor: .. testcode:: :skipif: True # This will fail because the driver passes a CPU copy of the tensor, # and the GPU actor also receives a CPU copy. ray.get(actor.process.remote(torch.zeros(10))) With Ray Compiled Graph, you can annotate DAG nodes with type hints to indicate that there may be a ``torch.Tensor`` contained in the value: .. literalinclude:: ../doc_code/cgraph_nccl.py :language: python :start-after: __cgraph_cpu_to_gpu_start__ :end-before: __cgraph_cpu_to_gpu_end__ Under the hood, the Ray Compiled Graph backend copies the ``torch.Tensor`` to the GPU assigned to the ``GPUActor`` by Ray Core. Of course, you can also do this yourself, but there are advantages to using Compiled Graph instead: - Ray Compiled Graph can minimize the number of data copies made. For example, passing from one CPU to multiple GPUs requires one copy to a shared memory buffer, and then one host-to-device copy per destination GPU. - In the future, this can be further optimized through techniques such as `memory pinning `_, using zero-copy deserialization when the CPU is the destination, etc. GPU to GPU communication ------------------------ Ray Compiled Graphs supports NCCL-based transfers of CUDA ``torch.Tensor`` objects, avoiding any copies through Ray's CPU-based shared-memory object store. With user-provided type hints, Ray prepares NCCL communicators and operation scheduling ahead of time, avoiding deadlock and :ref:`overlapping compute and communication `. Ray Compiled Graph uses `cupy `_ under the hood to support NCCL operations. The cupy version affects the NCCL version. The Ray team is also planning to support custom communicators in the future, for example to support collectives across CPUs or to reuse existing collective groups. First, create sender and receiver actors. Note that this example requires at least 2 GPUs. .. literalinclude:: ../doc_code/cgraph_nccl.py :language: python :start-after: __cgraph_nccl_setup_start__ :end-before: __cgraph_nccl_setup_end__ To support GPU-to-GPU communication with NCCL, wrap the DAG node that contains the ``torch.Tensor`` that you want to transmit using the ``with_tensor_transport`` API hint: .. literalinclude:: ../doc_code/cgraph_nccl.py :language: python :start-after: __cgraph_nccl_exec_start__ :end-before: __cgraph_nccl_exec_end__ Current limitations include: * ``torch.Tensor`` and NVIDIA NCCL only * Support for peer-to-peer transfers. Collective communication operations are coming soon. * Communication operations are currently done synchronously. :ref:`Overlapping compute and communication ` is an experimental feature. --- .. _ray-compiled-graph: Ray Compiled Graph (beta) ========================= .. 
warning:: Ray Compiled Graph is currently in beta (since Ray 2.44). The APIs are subject to change and expected to evolve. The API is available from Ray 2.32, but it's recommended to use a version after 2.44. As large language models (LLMs) become common, programming distributed systems with multiple GPUs is essential. :ref:`Ray Core APIs ` facilitate using multiple GPUs but have limitations such as: * System overhead of ~1 ms per task launch, which is unsuitable for high-performance tasks like LLM inference. * Lack of support for direct GPU-to-GPU communication, requiring manual development with external libraries like NVIDIA Collective Communications Library (`NCCL `_). Ray Compiled Graph gives you a Ray Core-like API but with: - **Less than 50us system overhead** for workloads that repeatedly execute the same task graph. - **Native support for GPU-GPU communication** with NCCL. For example, consider the following Ray Core code, which sends data to an actor and gets the result: .. testcode:: :skipif: True # Ray Core API for remote execution. # ~1ms overhead to invoke `recv`. ref = receiver.recv.remote(data) ray.get(ref) This code shows how to compile and execute the same example as a Compiled Graph. .. testcode:: :skipif: True # Compiled Graph for remote execution. # less than 50us overhead to invoke `recv` (during `graph.execute(data)`). with InputNode() as inp: graph = receiver.recv.bind(inp) graph = graph.experimental_compile() ref = graph.execute(data) ray.get(ref) Ray Compiled Graph has a static execution model. It's different from classic Ray APIs, which are eager. Because of the static nature, Ray Compiled Graph can perform various optimizations such as: - Pre-allocate resources so that it can reduce system overhead. - Prepare NCCL communicators and apply deadlock-free scheduling. - (experimental) Automatically overlap GPU compute and communication. - Improve multi-node performance. Use Cases --------- Ray Compiled Graph APIs simplify development of high-performance multi-GPU workloads such as LLM inference or distributed training that require: - Sub-millisecond level task orchestration. - Direct GPU-GPU peer-to-peer or collective communication. - `Heterogeneous `_ or MPMD (Multiple Program Multiple Data) execution. More Resources -------------- - `Ray Compiled Graph blog `_ - `Ray Compiled Graph talk at Ray Summit `_ - `Heterogeneous training with Ray Compiled Graph `_ - `Distributed LLM inference with Ray Compiled Graph `_ Table of Contents ----------------- Learn more details about Ray Compiled Graph from the following links. .. toctree:: :maxdepth: 1 quickstart profiling overlap troubleshooting compiled-graph-api --- Troubleshooting =============== This page contains common issues and solutions for Compiled Graph execution. Limitations ----------- Compiled Graph is a new feature and has some limitations: - Invoking Compiled Graph - Only the process that compiles the Compiled Graph may call it. - A Compiled Graph has a maximum number of in-flight executions. When using the DAG API, if there aren't enough resources at the time of ``dag.execute()``, Ray will queue the tasks for later execution. Ray Compiled Graph currently doesn't support queuing past its maximum capacity. Therefore, you may need to consume some results using ``ray.get()`` before submitting more executions. As a stopgap, ``dag.execute()`` throws a ``RayCgraphCapacityExceeded`` exception if the call takes too long. In the future, Compiled Graph may have better error handling and queuing. 
- Compiled Graph Execution - Ideally, you should try not to execute other tasks on the actor while it is participating in a Compiled Graph. Compiled Graph tasks will be executed on a **background thread**. Any concurrent tasks submitted to the actor can still execute on the main thread, but you are responsible for synchronization with the Compiled Graph background thread. - For now, actors can only execute one Compiled Graph at a time. To execute a different Compiled Graph on the same actor, you must teardown the current Compiled Graph. See :ref:`Return NumPy arrays ` for more details. - Passing and getting Compiled Graph results (:class:`CompiledDAGRef `) - Compiled Graph results can't be passed to another task or actor. This restriction may be loosened in the future, but for now, it allows for better performance because the backend knows exactly where to push the results. - ``ray.get()`` can be called at most once on a :class:`CompiledDAGRef `. An exception will be raised if it is called twice on the same :class:`CompiledDAGRef `. This is because the underlying memory for the result may need to be reused for a future DAG execution. Restricting ``ray.get()`` to once per reference simplifies the tracking of the memory buffers. - If the value returned by ``ray.get()`` is zero-copy deserialized, then subsequent executions of the same DAG will block until the value goes out of scope in Python. Thus, if you hold onto zero-copy deserialized values returned by ``ray.get()``, and you try to execute the Compiled Graph above its max concurrency, it may deadlock. This case will be detected in the future, but for now you will receive a ``RayChannelTimeoutError``. See :ref:`Explicitly teardown before reusing the same actors ` for more details. - Collective operations - For GPU to GPU communication, Compiled Graph only supports peer-to-peer transfers. Collective communication operations are coming soon. Keep an eye out for additional features in future Ray releases: - Support better queuing of DAG inputs, to enable more concurrent executions of the same DAG. - Support for more collective operations with NCCL. - Support for multiple DAGs executing on the same actor. - General performance improvements. If you run into additional issues, or have other feedback or questions, file an issue on `GitHub `_. For a full list of known issues, check the ``compiled-graphs`` label on Ray GitHub. .. _troubleshoot-numpy: Returning NumPy arrays ---------------------- Ray zero-copy deserializes NumPy arrays when possible. If you execute compiled graph with a NumPy array output multiple times, you could possibly run into issues if a NumPy array output from a previous Compiled Graph execution isn't deleted before attempting to get the result of a following execution of the same Compiled Graph. This is because the NumPy array stays in the buffer of the Compiled Graph until you or Python delete it. It's recommended to explicitly delete the NumPy array as Python may not always garbage collect the NumPy array immediately as you may expect. For example, the following code sample could result in a hang or RayChannelTimeoutError if the NumPy array isn't deleted: .. literalinclude:: ../doc_code/cgraph_troubleshooting.py :language: python :start-after: __numpy_troubleshooting_start__ :end-before: __numpy_troubleshooting_end__ In the preceding code snippet, Python may not garbage collect the NumPy array in `result` on each iteration of the loop. 
Therefore, you should explicitly delete the NumPy array before you try to get the result of subsequent Compiled Graph executions. .. _troubleshoot-teardown: Explicitly teardown before reusing the same actors -------------------------------------------------- If you want to reuse the actors of a Compiled Graph, it's important to explicitly teardown the Compiled Graph before reusing the actors. Without explicitly tearing down the Compiled Graph, the resources created for actors in a Compiled Graph may have conflicts with further usage of those actors. For example, in the following code, Python could delay garbage collection, which triggers the implicit teardown of the first Compiled Graph. This could lead to a segfault due to the resource conflicts mentioned: .. literalinclude:: ../doc_code/cgraph_troubleshooting.py :language: python :start-after: __teardown_troubleshooting_start__ :end-before: __teardown_troubleshooting_end__ --- .. _configuring-ray: Configuring Ray =============== .. note:: For running Java applications, see `Java Applications`_. This page discusses the various ways to configure Ray, both from the Python API and from the command line. Take a look at the ``ray.init`` `documentation `__ for a complete overview of the configurations. .. important:: For the multi-node setting, you must first run ``ray start`` on the command line to start the Ray cluster services on the machine before ``ray.init`` in Python to connect to the cluster services. On a single machine, you can run ``ray.init()`` without ``ray start``, which both starts the Ray cluster services and connects to them. .. _cluster-resources: Cluster resources ----------------- Ray by default detects available resources. .. testcode:: :hide: import ray ray.shutdown() .. testcode:: import ray # This automatically detects available resources in the single machine. ray.init() If not running cluster mode, you can specify cluster resources overrides through ``ray.init`` as follows. .. testcode:: :hide: ray.shutdown() .. testcode:: # If not connecting to an existing cluster, you can specify resources overrides: ray.init(num_cpus=8, num_gpus=1) .. testcode:: :hide: ray.shutdown() .. testcode:: # Specifying custom resources ray.init(num_gpus=1, resources={'Resource1': 4, 'Resource2': 16}) When starting Ray from the command line, pass the ``--num-cpus`` and ``--num-gpus`` flags into ``ray start``. You can also specify custom resources. .. code-block:: bash # To start a head node. $ ray start --head --num-cpus= --num-gpus= # To start a non-head node. $ ray start --address=
<address> --num-cpus=<NUM_CPUS> --num-gpus=<NUM_GPUS> # Specifying custom resources ray start [--head] --num-cpus=<NUM_CPUS> --resources='{"Resource1": 4, "Resource2": 16}' If using the command line, connect to the Ray cluster as follows: .. testcode:: :skipif: True # Connect to Ray. Notice that if you connect to an existing cluster, you don't specify resources. ray.init(address=<cluster-address>
) .. _temp-dir-log-files: Logging and debugging --------------------- Each Ray session has a unique name. By default, the name is ``session_{timestamp}_{pid}``. The format of ``timestamp`` is ``%Y-%m-%d_%H-%M-%S_%f`` (See `Python time format `__ for details); the pid belongs to the startup process (the process calling ``ray.init()`` or the Ray process executed by a shell in ``ray start``). For each session, Ray places all its temporary files under the *session directory*. A *session directory* is a subdirectory of the *root temporary path* (``/tmp/ray`` by default), so the default session directory is ``/tmp/ray/{ray_session_name}``. You can sort by their names to find the latest session. Change the *root temporary directory* by passing ``--temp-dir={your temp path}`` to ``ray start``. There currently isn't a stable way to change the root temporary directory when calling ``ray.init()``, but if you need to, you can provide the ``_temp_dir`` argument to ``ray.init()``. See :ref:`Logging Directory Structure ` for more details. .. _ray-ports: Ports configurations -------------------- Ray requires bi-directional communication among its nodes in a cluster. Each node opens specific ports to receive incoming network requests. All Nodes ~~~~~~~~~ - ``--node-manager-port``: Raylet port for node manager. Default: Random value. - ``--object-manager-port``: Raylet port for object manager. Default: Random value. - ``--runtime-env-agent-port``: Raylet port for runtime env agent. Default: Random value. The node manager and object manager run as separate processes with their own ports for communication. The following options specify the ports used by dashboard agent process. - ``--dashboard-agent-grpc-port``: The port to listen for grpc on. Default: Random value. - ``--dashboard-agent-listen-port``: The port to listen for http on. Default: 52365. - ``--metrics-export-port``: The port to use to expose Ray metrics. Default: Random value. The following options specify the range of ports used by worker processes across machines. All ports in the range should be open. - ``--min-worker-port``: Minimum port number for the worker to bind to. Default: 10002. - ``--max-worker-port``: Maximum port number for the worker to bind to. Default: 19999. Port numbers are how Ray differentiates input and output to and from multiple workers on a single node. Each worker takes input and gives output on a single port number. Therefore, by default, there's a maximum of 10,000 workers on each node, irrespective of number of CPUs. In general, you should give Ray a wide range of possible worker ports, in case any of those ports happen to be in use by some other program on your machine. However, when debugging, it's useful to explicitly specify a short list of worker ports such as ``--worker-port-list=10000,10001,10002,10003,10004`` Note that this practice limits the number of workers, just like specifying a narrow range. Head node ~~~~~~~~~ In addition to ports specified in the preceding section, the head node needs to open several more ports. - ``--port``: Port of the Ray GCS server. The head node starts a GCS server listening on this port. Default: 6379. - ``--ray-client-server-port``: Listening port for Ray Client Server. Default: 10001. - ``--redis-shard-ports``: Comma-separated list of ports for non-primary Redis shards. Default: Random values. - ``--dashboard-grpc-port``: (Deprecated) No longer used. Only kept for backward compatibility. 
- If ``--include-dashboard`` is true (the default), then the head node must open ``--dashboard-port``. Default: 8265. If ``--include-dashboard`` is true but the ``--dashboard-port`` isn't open on the head node, you won't be able to access the dashboard, and you repeatedly get .. code-block:: bash WARNING worker.py:1114 -- The agent on node failed with the following error: Traceback (most recent call last): File "/usr/local/lib/python3.8/dist-packages/grpc/aio/_call.py", line 285, in __await__ raise _create_rpc_error(self._cython_call._initial_metadata, grpc.aio._call.AioRpcError: --from-file=ca.key= Step 2: Generate individual private keys and self-signed certificates for the Ray head and workers ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The `YAML file `__, has a ConfigMap named `tls` that includes two shell scripts: `gencert_head.sh` and `gencert_worker.sh`. These scripts produce the private key and self-signed certificate files (`tls.key` and `tls.crt`) for both head and worker Pods in the initContainer of each deployment. By using the initContainer, we can dynamically retrieve the `POD_IP` to the `[alt_names]` section. The scripts perform the following steps: first, it generates a 2048-bit RSA private key and saves the key as `/etc/ray/tls/tls.key`. Then, a Certificate Signing Request (CSR) is generated using the `tls.key` file and the `csr.conf` configuration file. Finally, a self-signed certificate (`tls.crt`) is created using the Certificate Authority's (`ca.key and ca.crt`) keypair and the CSR (`ca.csr`). Step 3: Set the environment variables for both Ray head and worker to enable TLS ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ You enable TLS by setting environment variables. - ``RAY_USE_TLS``: Either 1 or 0 to use/not-use TLS. If you set it to 1, you must set the environment variables below. Default: 0. - ``RAY_TLS_SERVER_CERT``: Location of a `certificate file (tls.crt)`, which Ray presents to other endpoints to achieve mutual authentication. - ``RAY_TLS_SERVER_KEY``: Location of a `private key file (tls.key)`, which is the cryptographic means to prove to other endpoints that you are the authorized user of a given certificate. - ``RAY_TLS_CA_CERT``: Location of a `CA certificate file (ca.crt)`, which allows TLS to decide whether an the correct authority signed the endpoint's certificate. Step 4: Verify TLS authentication ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. code-block:: bash # Log in to the worker Pod kubectl exec -it ${WORKER_POD} -- bash # Since the head Pod has the certificate of the full qualified DNS resolution for the Ray head service, the connection to the worker Pods # is established successfully ray health-check --address service-ray-head.default.svc.cluster.local:6379 # Since service-ray-head hasn't added to the alt_names section in the certificate, the connection fails and an error # message similar to the following is displayed: "Peer name service-ray-head is not in peer certificate". ray health-check --address service-ray-head:6379 # After you add `DNS.3 = service-ray-head` to the alt_names sections and deploy the YAML again, the connection is able to work. Enabling TLS causes a performance hit due to the extra overhead of mutual authentication and encryption. Testing has shown that this overhead is large for small workloads and becomes relatively smaller for large workloads. The exact overhead depends on the nature of your workload. Java applications ----------------- .. 
important:: For the multi-node setting, you must first run ``ray start`` on the command line to start the Ray cluster services on the machine before ``ray.init()`` in Java to connect to the cluster services. On a single machine, you can run ``ray.init()`` without ``ray start``. It both starts the Ray cluster services and connects to them. .. _code_search_path: Code search path ~~~~~~~~~~~~~~~~ If you want to run a Java application in a multi-node cluster, you must specify the code search path in your driver. The code search path tells Ray where to load jars when starting Java workers. You must distribute your jar files to the same paths on all nodes of the Ray cluster before running your code. .. code-block:: bash $ java -classpath \ -Dray.address=
<address> \ -Dray.job.code-search-path=/path/to/jars/ \ <classname> <args> The ``/path/to/jars/`` points to a directory which contains jars. Workers load all jars in the directory. You can also provide multiple directories for this parameter. .. code-block:: bash $ java -classpath <classpath> \ -Dray.address=<address>
\ -Dray.job.code-search-path=/path/to/jars1:/path/to/jars2:/path/to/pys1:/path/to/pys2 \ You don't need to configure code search path if you run a Java application in a single-node cluster. See ``ray.job.code-search-path`` under :ref:`Driver Options ` for more information. .. note:: Currently there's no way to configure Ray when running a Java application in single machine mode. If you need to configure Ray, run ``ray start`` to start the Ray cluster first. .. _java-driver-options: Driver options ~~~~~~~~~~~~~~ There's a limited set of options for Java drivers. They're not for configuring the Ray cluster, but only for configuring the driver. Ray uses `Typesafe Config `__ to read options. There are several ways to set options: - System properties. You can configure system properties either by adding options in the format of ``-Dkey=value`` in the driver command line, or by invoking ``System.setProperty("key", "value");`` before ``Ray.init()``. - A `HOCON format `__ configuration file. By default, Ray will try to read the file named ``ray.conf`` in the root of the classpath. You can customize the location of the file by setting system property ``ray.config-file`` to the path of the file. .. note:: Options configured by system properties have higher priority than options configured in the configuration file. The list of available driver options: - ``ray.address`` - The cluster address if the driver connects to an existing Ray cluster. If it's empty, Ray creates a new Ray cluster. - Type: ``String`` - Default: empty string. - ``ray.job.code-search-path`` - The paths for Java workers to load code from. Currently, Ray only supports directories. You can specify one or more directories split by a ``:``. You don't need to configure code search path if you run a Java application in single machine mode or local mode. Ray also uses the code search path to load Python code, if specified. This parameter is required for :ref:`cross_language`. If you specify a code search path, you can only run Python remote functions which you can find in the code search path. - Type: ``String`` - Default: empty string. - Example: ``/path/to/jars1:/path/to/jars2:/path/to/pys1:/path/to/pys2`` - ``ray.job.namespace`` - The namespace of this job. Ray uses it for isolation between jobs. Jobs in different namespaces can't access each other. If it's not specified, Ray uses a randomized value. - Type: ``String`` - Default: A random UUID string value. .. _`Apache Arrow`: https://arrow.apache.org/ --- .. _cross_language: Cross-language programming ========================== This page shows you how to use Ray's cross-language programming feature. Setup the driver ----------------- You need to set :ref:`code_search_path` in your driver. .. tab-set:: .. tab-item:: Python .. literalinclude:: ./doc_code/cross_language.py :language: python :start-after: __crosslang_init_start__ :end-before: __crosslang_init_end__ .. tab-item:: Java .. code-block:: bash java -classpath \ -Dray.address=
\ -Dray.job.code-search-path=/path/to/code/ \ <classname> <args> You may want to include multiple directories to load both Python and Java code for workers, if you place them in different directories. .. tab-set:: .. tab-item:: Python .. literalinclude:: ./doc_code/cross_language.py :language: python :start-after: __crosslang_multidir_start__ :end-before: __crosslang_multidir_end__ .. tab-item:: Java .. code-block:: bash java -classpath <classpath> \ -Dray.address=<address>
\ -Dray.job.code-search-path=/path/to/jars:/path/to/pys \ Python calling Java ------------------- Suppose you have a Java static method and a Java class as follows: .. code-block:: java package io.ray.demo; public class Math { public static int add(int a, int b) { return a + b; } } .. code-block:: java package io.ray.demo; // A regular Java class. public class Counter { private int value = 0; public int increment() { this.value += 1; return this.value; } } Then, in Python, you can call the preceding Java remote function, or create an actor from the preceding Java class. .. literalinclude:: ./doc_code/cross_language.py :language: python :start-after: __python_call_java_start__ :end-before: __python_call_java_end__ Java calling Python ------------------- Suppose you have a Python module as follows: .. literalinclude:: ./doc_code/cross_language.py :language: python :start-after: __python_module_start__ :end-before: __python_module_end__ .. note:: * You should decorate the function or class with `@ray.remote`. Then, in Java, you can call the preceding Python remote function, or create an actor from the preceding Python class. .. code-block:: java package io.ray.demo; import io.ray.api.ObjectRef; import io.ray.api.PyActorHandle; import io.ray.api.Ray; import io.ray.api.function.PyActorClass; import io.ray.api.function.PyActorMethod; import io.ray.api.function.PyFunction; import org.testng.Assert; public class JavaCallPythonDemo { public static void main(String[] args) { // Set the code-search-path to the directory of your `ray_demo.py` file. System.setProperty("ray.job.code-search-path", "/path/to/the_dir/"); Ray.init(); // Define a Python class. PyActorClass actorClass = PyActorClass.of( "ray_demo", "Counter"); // Create a Python actor and call actor method. PyActorHandle actor = Ray.actor(actorClass).remote(); ObjectRef objRef1 = actor.task( PyActorMethod.of("increment", int.class)).remote(); Assert.assertEquals(objRef1.get(), 1); ObjectRef objRef2 = actor.task( PyActorMethod.of("increment", int.class)).remote(); Assert.assertEquals(objRef2.get(), 2); // Call the Python remote function. ObjectRef objRef3 = Ray.task(PyFunction.of( "ray_demo", "add", int.class), 1, 2).remote(); Assert.assertEquals(objRef3.get(), 3); Ray.shutdown(); } } Cross-language data serialization --------------------------------- Ray automatically serializes and deserializes the arguments and return values of Ray calls if their types are the following: - Primitive data types =========== ======= ======= MessagePack Python Java =========== ======= ======= nil None null bool bool Boolean int int Short / Integer / Long / BigInteger float float Float / Double str str String bin bytes byte[] =========== ======= ======= - Basic container types =========== ======= ======= MessagePack Python Java =========== ======= ======= array list Array =========== ======= ======= - Ray builtin types - ActorHandle .. note:: * Be aware of float / double precision between Python and Java. If Java is using a float type to receive the input argument, the double precision Python data reduces to float precision in Java. * BigInteger can support a max value of 2^64-1. See: https://github.com/msgpack/msgpack/blob/master/spec.md#int-format-family. If the value is larger than 2^64-1, then sending the value to Python raises an exception. The following example shows how to pass these types as parameters and how to return these types. You can write a Python function which returns the input data: .. 
literalinclude:: ./doc_code/cross_language.py :language: python :start-after: __serialization_start__ :end-before: __serialization_end__ Then you can transfer the object from Java to Python, and back from Python to Java: .. code-block:: java package io.ray.demo; import io.ray.api.ObjectRef; import io.ray.api.Ray; import io.ray.api.function.PyFunction; import java.math.BigInteger; import org.testng.Assert; public class SerializationDemo { public static void main(String[] args) { Ray.init(); Object[] inputs = new Object[]{ true, // Boolean Byte.MAX_VALUE, // Byte Short.MAX_VALUE, // Short Integer.MAX_VALUE, // Integer Long.MAX_VALUE, // Long BigInteger.valueOf(Long.MAX_VALUE), // BigInteger "Hello World!", // String 1.234f, // Float 1.234, // Double "example binary".getBytes()}; // byte[] for (Object o : inputs) { ObjectRef res = Ray.task( PyFunction.of("ray_serialization", "py_return_input", o.getClass()), o).remote(); Assert.assertEquals(res.get(), o); } Ray.shutdown(); } } Cross-language exception stacks ------------------------------- Suppose you have a Java package as follows: .. code-block:: java package io.ray.demo; import io.ray.api.ObjectRef; import io.ray.api.Ray; import io.ray.api.function.PyFunction; public class MyRayClass { public static int raiseExceptionFromPython() { PyFunction raiseException = PyFunction.of( "ray_exception", "raise_exception", Integer.class); ObjectRef refObj = Ray.task(raiseException).remote(); return refObj.get(); } } and a Python module as follows: .. literalinclude:: ./doc_code/cross_language.py :language: python :start-after: __raise_exception_start__ :end-before: __raise_exception_end__ Then, run the following code: .. literalinclude:: ./doc_code/cross_language.py :language: python :start-after: __raise_exception_demo_start__ :end-before: __raise_exception_demo_end__ The exception stack will be: .. code-block:: text Traceback (most recent call last): File "ray_exception_demo.py", line 9, in ray.get(obj_ref) # <-- raise exception from here. File "ray/python/ray/_private/client_mode_hook.py", line 105, in wrapper return func(*args, **kwargs) File "ray/python/ray/_private/worker.py", line 2247, in get raise value ray.exceptions.CrossLanguageError: An exception raised from JAVA: io.ray.api.exception.RayTaskException: (pid=61894, ip=172.17.0.2) Error executing task c8ef45ccd0112571ffffffffffffffffffffffff01000000 at io.ray.runtime.task.TaskExecutor.execute(TaskExecutor.java:186) at io.ray.runtime.RayNativeRuntime.nativeRunTaskExecutor(Native Method) at io.ray.runtime.RayNativeRuntime.run(RayNativeRuntime.java:231) at io.ray.runtime.runner.worker.DefaultWorker.main(DefaultWorker.java:15) Caused by: io.ray.api.exception.CrossLanguageException: An exception raised from PYTHON: ray.exceptions.RayTaskError: ray::raise_exception() (pid=62041, ip=172.17.0.2) File "ray_exception.py", line 7, in raise_exception 1 / 0 ZeroDivisionError: division by zero --- .. _direct-transport: ************************** Ray Direct Transport (RDT) ************************** Ray objects are normally stored in Ray's CPU-based object store and copied and deserialized when accessed by a Ray task or actor. For GPU data specifically, this can lead to unnecessary and expensive data transfers. For example, passing a CUDA ``torch.Tensor`` from one Ray task to another would require a copy from GPU to CPU memory, then back again to GPU memory. *Ray Direct Transport (RDT)* is a new feature that allows Ray to store and pass objects directly between Ray actors. 
This feature augments the familiar Ray :class:`ObjectRef ` API by: - Keeping GPU data in GPU memory until a transfer is necessary - Avoiding expensive serialization and copies to and from the Ray object store - Using efficient data transports like collective communication libraries (`Gloo `__ or `NCCL `__) or point-to-point RDMA (via `NVIDIA's NIXL `__) to transfer data directly between devices, including both CPU and GPUs .. note:: RDT is currently in **alpha** and doesn't support all Ray Core APIs yet. Future releases may introduce breaking API changes. See the :ref:`limitations ` section for more details. Getting started =============== .. tip:: RDT currently supports ``torch.Tensor`` objects created by Ray actor tasks. Other datatypes and Ray non-actor tasks may be supported in future releases. This walkthrough will show how to create and use RDT with different *tensor transports*, i.e. the mechanism used to transfer the tensor between actors. Currently, RDT supports the following tensor transports: 1. `Gloo `__: A collective communication library for PyTorch and CPUs. 2. `NVIDIA NCCL `__: A collective communication library for NVIDIA GPUs. 3. `NVIDIA NIXL `__ (backed by `UCX `__): A library for accelerating point-to-point transfers via RDMA, especially between various types of memory and NVIDIA GPUs. For ease of following along, we'll start with the `Gloo `__ transport, which can be used without any physical GPUs. .. _direct-transport-gloo: Usage with Gloo (CPUs only) --------------------------- Installation ^^^^^^^^^^^^ .. note:: Under construction. Walkthrough ^^^^^^^^^^^ To get started, define an actor class and a task that returns a ``torch.Tensor``: .. literalinclude:: doc_code/direct_transport_gloo.py :language: python :start-after: __normal_example_start__ :end-before: __normal_example_end__ As written, when the ``torch.Tensor`` is returned, it will be copied into Ray's CPU-based object store. For CPU-based tensors, this can require an expensive step to copy and serialize the object, while GPU-based tensors additionally require a copy to and from CPU memory. To enable RDT, use the ``tensor_transport`` option in the :func:`@ray.method ` decorator. .. literalinclude:: doc_code/direct_transport_gloo.py :language: python :start-after: __gloo_example_start__ :end-before: __gloo_example_end__ This decorator can be added to any actor tasks that return a ``torch.Tensor``, or that return ``torch.Tensors`` nested inside other Python objects. Adding this decorator will change Ray's behavior in the following ways: 1. When returning the tensor, Ray will store a *reference* to the tensor instead of copying it to CPU memory. 2. When the :class:`ray.ObjectRef` is passed to another task, Ray will use Gloo to transfer the tensor to the destination task. Note that for (2) to work, the :func:`@ray.method(tensor_transport) ` decorator only needs to be added to the actor task that *returns* the tensor. It should not be added to actor tasks that *consume* the tensor (unless those tasks also return tensors). Also, for (2) to work, we must first create a *collective group* of actors. Creating a collective group ^^^^^^^^^^^^^^^^^^^^^^^^^^^ To create a collective group for use with RDT: 1. Create multiple Ray actors. 2. Create a collective group on the actors using the :func:`ray.experimental.collective.create_collective_group ` function. The `backend` specified must match the `tensor_transport` used in the :func:`@ray.method ` decorator. Here is an example: .. 
literalinclude:: doc_code/direct_transport_gloo.py :language: python :start-after: __gloo_group_start__ :end-before: __gloo_group_end__ The actors can now communicate directly via gloo. The group can also be destroyed using the :func:`ray.experimental.collective.destroy_collective_group ` function. After calling this function, a new collective group can be created on the same actors. Passing objects to other actors ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Now that we have a collective group, we can create and pass RDT objects between the actors. Here is a full example: .. literalinclude:: doc_code/direct_transport_gloo.py :language: python :start-after: __gloo_full_example_start__ :end-before: __gloo_full_example_end__ When the :class:`ray.ObjectRef` is passed to another task, Ray will use Gloo to transfer the tensor directly from the source actor to the destination actor instead of the default object store. Note that the :func:`@ray.method(tensor_transport) ` decorator is only added to the actor task that *returns* the tensor; once this hint has been added, the receiving actor task `receiver.sum` will automatically use Gloo to receive the tensor. In this example, because `MyActor.sum` does not have the :func:`@ray.method(tensor_transport) ` decorator, it will use the default Ray object store transport to return `torch.sum(tensor)`. RDT also supports passing tensors nested inside Python data structures, as well as actor tasks that return multiple tensors, like in this example: .. literalinclude:: doc_code/direct_transport_gloo.py :language: python :start-after: __gloo_multiple_tensors_example_start__ :end-before: __gloo_multiple_tensors_example_end__ Passing RDT objects to the actor that produced them ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ RDT :class:`ray.ObjectRefs ` can also be passed to the actor that produced them. This avoids any copies and just provides a reference to the same ``torch.Tensor`` that was previously created. For example: .. literalinclude:: doc_code/direct_transport_gloo.py :language: python :start-after: __gloo_intra_actor_start__ :end-before: __gloo_intra_actor_end__ .. note:: Ray only keeps a reference to the tensor created by the user, so the tensor objects are *mutable*. If ``sender.sum`` were to modify the tensor in the above example, the changes would also be seen by ``receiver.sum``. This differs from the normal Ray Core API, which always makes an immutable copy of data returned by actors. ``ray.get`` ^^^^^^^^^^^ The :func:`ray.get ` function can also be used as usual to retrieve the result of an RDT object. However, :func:`ray.get ` will by default use the same tensor transport as the one specified in the :func:`@ray.method ` decorator. For collective-based transports, this will not work if the caller is not part of the collective group. Therefore, users need to specify the Ray object store as the tensor transport explicitly by setting ``_use_object_store`` in :func:`ray.get `. .. literalinclude:: doc_code/direct_transport_gloo.py :language: python :start-after: __gloo_get_start__ :end-before: __gloo_get_end__ Object mutability ^^^^^^^^^^^^^^^^^ Unlike objects in the Ray object store, RDT objects are *mutable*, meaning that Ray only holds a reference to the tensor and will not copy it until a transfer is requested. 
This means that if the actor that returns a tensor also keeps a reference to the tensor, and the actor later modifies it in place while Ray is still storing the tensor reference, it's possible that some or all of the changes may be seen by receiving actors. Here is an example of what can go wrong: .. literalinclude:: doc_code/direct_transport_gloo.py :language: python :start-after: __gloo_wait_tensor_freed_bad_start__ :end-before: __gloo_wait_tensor_freed_bad_end__ In this example, the sender actor returns a tensor to Ray, but it also keeps a reference to the tensor in its local state. Then, in `sender.increment_and_sum_stored_tensor`, the sender actor modifies the tensor in place while Ray is still holding the tensor reference. Then, the `receiver.increment_and_sum` task receives the modified tensor instead of the original, so the assertion fails. To fix this kind of error, use the :func:`ray.experimental.wait_tensor_freed ` function to wait for Ray to release all references to the tensor, so that the actor can safely write to the tensor again. :func:`wait_tensor_freed ` will unblock once all tasks that depend on the tensor have finished executing and all corresponding `ObjectRefs` have gone out of scope. Ray tracks tasks that depend on the tensor by keeping track of which tasks take the `ObjectRef` corresponding to the tensor as an argument. Here's a fixed version of the earlier example. .. literalinclude:: doc_code/direct_transport_gloo.py :language: python :start-after: __gloo_wait_tensor_freed_start__ :end-before: __gloo_wait_tensor_freed_end__ The main changes are: 1. `sender` calls :func:`wait_tensor_freed ` before modifying the tensor in place. 2. The driver skips :func:`ray.get ` because :func:`wait_tensor_freed ` blocks until all `ObjectRefs` pointing to the tensor are freed, so calling :func:`ray.get ` here would cause a deadlock. 3. The driver calls `del tensor` to release its reference to the tensor. Again, this is necessary because :func:`wait_tensor_freed ` blocks until all `ObjectRefs` pointing to the tensor are freed. When an RDT `ObjectRef` is passed back to the same actor that produced it, Ray passes back a *reference* to the tensor instead of a copy. Therefore, the same kind of bug can occur. To help catch such cases, Ray will print a warning if an RDT object is passed to the actor that produced it and a different actor, like so: .. literalinclude:: doc_code/direct_transport_gloo.py :language: python :start-after: __gloo_object_mutability_warning_start__ :end-before: __gloo_object_mutability_warning_end__ Usage with NCCL (NVIDIA GPUs only) ---------------------------------- RDT requires just a few lines of code change to switch tensor transports. Here is the :ref:`Gloo example `, modified to use NVIDIA GPUs and the `NCCL `__ library for collective GPU communication. .. literalinclude:: doc_code/direct_transport_nccl.py :language: python :start-after: __nccl_full_example_start__ :end-before: __nccl_full_example_end__ The main code differences are: 1. The :func:`@ray.method ` uses ``tensor_transport="nccl"`` instead of ``tensor_transport="gloo"``. 2. The :func:`ray.experimental.collective.create_collective_group ` function is used to create a collective group. 3. The tensor is created on the GPU using the ``.cuda()`` method. Usage with NIXL (CPUs or NVIDIA GPUs) ------------------------------------- Installation ^^^^^^^^^^^^ For maximum performance, run the `install_gdrcopy.sh `__ script (e.g., ``install_gdrcopy.sh "${GDRCOPY_OS_VERSION}" "12.8" "x64"``). 
You can find available OS versions `here `__. If `gdrcopy` is not installed, things will still work with a plain ``pip install nixl``, just with lower performance. `nixl` and `ucx` are installed as dependencies via pip. Walkthrough ^^^^^^^^^^^ NIXL can transfer data between different devices, including CPUs and NVIDIA GPUs, but doesn't require a collective group to be created ahead of time. This means that any actor that has NIXL installed in its environment can be used to create and pass an RDT object. Otherwise, the usage is the same as in the :ref:`Gloo example `. Here is an example showing how to use NIXL to transfer an RDT object between two actors: .. literalinclude:: doc_code/direct_transport_nixl.py :language: python :start-after: __nixl_full_example_start__ :end-before: __nixl_full_example_end__ Compared to the :ref:`Gloo example `, the main code differences are: 1. The :func:`@ray.method ` uses ``tensor_transport="nixl"`` instead of ``tensor_transport="gloo"``. 2. No collective group is needed. ray.put and ray.get with NIXL ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Unlike the collective-based tensor transports (Gloo and NCCL), the :func:`ray.get ` function can use NIXL to retrieve a copy of the result. By default, the tensor transport for :func:`ray.get ` will be the one specified in the :func:`@ray.method ` decorator. .. literalinclude:: doc_code/direct_transport_nixl.py :language: python :start-after: __nixl_get_start__ :end-before: __nixl_get_end__ You can also use NIXL to retrieve the result from references created by :func:`ray.put `. .. literalinclude:: doc_code/direct_transport_nixl.py :language: python :start-after: __nixl_put__and_get_start__ :end-before: __nixl_put__and_get_end__ Summary ------- RDT allows Ray to store and pass objects directly between Ray actors, using accelerated transports like GLOO, NCCL, and NIXL. Here are the main points to keep in mind: * If using a collective-based tensor transport (Gloo or NCCL), a collective group must be created ahead of time. NIXL just requires all involved actors to have NIXL installed. * Unlike objects in the Ray object store, RDT objects are *mutable*, meaning that Ray only holds a reference, not a copy, to the stored tensor(s). * Otherwise, actors can be used as normal. For a full list of limitations, see the :ref:`limitations ` section. Microbenchmarks =============== .. note:: Under construction. .. _limitations: Limitations =========== RDT is currently in alpha and currently has the following limitations, which may be addressed in future releases: * Support for ``torch.Tensor`` objects only. * Support for Ray actors only, not Ray tasks. * Not yet compatible with `asyncio `__. Follow the `tracking issue `__ for updates. * Support for the following transports: Gloo, NCCL, and NIXL. * Support for CPUs and NVIDIA GPUs only. * RDT objects are *mutable*. This means that Ray only holds a reference to the tensor, and will not copy it until a transfer is requested. Thus, if the application code also keeps a reference to a tensor before returning it, and modifies the tensor in place, then some or all of the changes may be seen by the receiving actor. For collective-based tensor transports (Gloo and NCCL): * Only the process that created the collective group can submit actor tasks that return and pass RDT objects. If the creating process passes the actor handles to other processes, those processes can submit actor tasks as usual, but will not be able to use RDT objects. 
* Similarly, the process that created the collective group cannot serialize and pass RDT :class:`ray.ObjectRefs ` to other Ray tasks or actors. Instead, the :class:`ray.ObjectRef`\s can only be passed as direct arguments to other actor tasks, and those actors must be in the same collective group. * Each actor can only be in one collective group per tensor transport at a time. * No support for :func:`ray.put `. Due to a known issue, for NIXL, we currently do not support storing different GPU objects at the same actor, where the objects contain an overlapping but not equal set of tensors. To support this pattern, ensure that the first `ObjectRef` has gone out of scope before storing the same tensor(s) again in a second object. .. literalinclude:: doc_code/direct_transport_nixl.py :language: python :start-after: __nixl_limitations_start__ :end-before: __nixl_limitations_end__ Error handling ============== * Application-level errors, i.e. exceptions raised by user code, will not destroy the collective group and will instead be propagated to any dependent task(s), as for non-RDT Ray objects. * If a system-level error occurs during a GLOO or NCCL collective operation, the collective group will be destroyed and the actors will be killed to prevent any hanging. * If a system-level error occurs during a NIXL transfer, Ray or NIXL will abort the transfer with an exception and Ray will raise the exception in the dependent task or on the ray.get on the NIXL ref. * System-level errors include: * Errors internal to the third-party transport, e.g., NCCL network errors * Actor or node failures * Transport errors due to tensor device / transport mismatches, e.g., a CPU tensor when using NCCL * Ray RDT object fetch timeouts (can be overridden by setting the ``RAY_rdt_fetch_fail_timeout_milliseconds`` environment variable) * Any unexpected system bugs Advanced: RDT Internals ======================= .. note:: Under construction. --- .. _monte-carlo-pi: Monte Carlo Estimation of π =========================== .. raw:: html Run on Anyscale

This tutorial shows you how to estimate the value of π using a `Monte Carlo method `_ that works by randomly sampling points within a 2x2 square. We can use the proportion of the points that are contained within the unit circle centered at the origin to estimate the ratio of the area of the circle to the area of the square. Given that we know the true ratio to be π/4, we can multiply our estimated ratio by 4 to approximate the value of π. The more points that we sample to calculate this approximation, the closer the value should be to the true value of π. .. image:: ../images/monte_carlo_pi.png We use Ray :ref:`tasks ` to distribute the work of sampling and Ray :ref:`actors ` to track the progress of these distributed sampling tasks. The code can run on your laptop and can be easily scaled to large :ref:`clusters ` to increase the accuracy of the estimate. To get started, install Ray via ``pip install -U ray``. See :ref:`Installing Ray ` for more installation options. Starting Ray ------------ First, let's import all modules needed for this tutorial and start a local Ray cluster with :func:`ray.init() `: .. literalinclude:: ../doc_code/monte_carlo_pi.py :language: python :start-after: __starting_ray_start__ :end-before: __starting_ray_end__ Defining the Progress Actor --------------------------- Next, we define a Ray actor that can be called by sampling tasks to update progress. Ray actors are essentially stateful services: anyone who holds an instance (a handle) of the actor can call its methods. .. literalinclude:: ../doc_code/monte_carlo_pi.py :language: python :start-after: __defining_actor_start__ :end-before: __defining_actor_end__ We define a Ray actor by decorating a normal Python class with :func:`ray.remote `. The progress actor has a ``report_progress()`` method that sampling tasks call to update their progress individually, and a ``get_progress()`` method to get the overall progress. Defining the Sampling Task -------------------------- With the actor defined, we now define a Ray task that does the sampling up to ``num_samples`` and returns the number of samples that are inside the circle. Ray tasks are stateless functions. They execute asynchronously and run in parallel. .. literalinclude:: ../doc_code/monte_carlo_pi.py :language: python :start-after: __defining_task_start__ :end-before: __defining_task_end__ To convert a normal Python function into a Ray task, we decorate the function with :func:`ray.remote `. The sampling task takes a progress actor handle as an input and reports progress to it. The above code shows an example of calling actor methods from tasks. Creating a Progress Actor ------------------------- Once the actor is defined, we can create an instance of it. .. literalinclude:: ../doc_code/monte_carlo_pi.py :language: python :start-after: __creating_actor_start__ :end-before: __creating_actor_end__ To create an instance of the progress actor, simply call the ``ActorClass.remote()`` method with arguments to the constructor. This creates and runs the actor on a remote worker process. The return value of ``ActorClass.remote(...)`` is an actor handle that can be used to call its methods. Executing Sampling Tasks ------------------------ Now that the task is defined, we can execute it asynchronously. .. literalinclude:: ../doc_code/monte_carlo_pi.py :language: python :start-after: __executing_task_start__ :end-before: __executing_task_end__ We execute the sampling task by calling the ``remote()`` method with arguments to the function.
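Because the tutorial's snippets live in an external file (``monte_carlo_pi.py``), here is a condensed, self-contained sketch of the same flow for reference. The names and constants (``ProgressActor``, ``sampling_task``, ``NUM_SAMPLING_TASKS``, ``NUM_SAMPLES_PER_TASK``) are illustrative and may differ from the actual tutorial code:

.. code-block:: python

    import math
    import random

    import ray

    ray.init()

    NUM_SAMPLING_TASKS = 10
    NUM_SAMPLES_PER_TASK = 1_000_000


    @ray.remote
    class ProgressActor:
        """Tracks how many samples each sampling task has completed so far."""

        def __init__(self, total_num_samples: int):
            self.total_num_samples = total_num_samples
            self.num_samples_completed_per_task = {}

        def report_progress(self, task_id: int, num_samples_completed: int) -> None:
            self.num_samples_completed_per_task[task_id] = num_samples_completed

        def get_progress(self) -> float:
            return sum(self.num_samples_completed_per_task.values()) / self.total_num_samples


    @ray.remote
    def sampling_task(num_samples: int, task_id: int, progress_actor) -> int:
        num_inside = 0
        for _ in range(num_samples):
            x, y = random.uniform(-1, 1), random.uniform(-1, 1)
            if math.hypot(x, y) <= 1:
                num_inside += 1
        # Report progress to the actor; this is a fire-and-forget actor method call.
        progress_actor.report_progress.remote(task_id, num_samples)
        return num_inside


    progress_actor = ProgressActor.remote(NUM_SAMPLING_TASKS * NUM_SAMPLES_PER_TASK)

    # Submit all sampling tasks; each .remote() call returns an ObjectRef immediately.
    results = [
        sampling_task.remote(NUM_SAMPLES_PER_TASK, i, progress_actor)
        for i in range(NUM_SAMPLING_TASKS)
    ]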
This immediately returns an ``ObjectRef`` as a future and then executes the function asynchronously on a remote worker process. Calling the Progress Actor -------------------------- While sampling tasks are running, we can periodically query the progress by calling the actor ``get_progress()`` method. .. literalinclude:: ../doc_code/monte_carlo_pi.py :language: python :start-after: __calling_actor_start__ :end-before: __calling_actor_end__ To call an actor method, use ``actor_handle.method.remote()``. This invocation immediately returns an ``ObjectRef`` as a future and then executes the method asynchronously on the remote actor process. To fetch the actual returned value of ``ObjectRef``, we use the blocking :func:`ray.get() `. Calculating π ------------- Finally, we get number of samples inside the circle from the remote sampling tasks and calculate π. .. literalinclude:: ../doc_code/monte_carlo_pi.py :language: python :start-after: __calculating_pi_start__ :end-before: __calculating_pi_end__ As we can see from the above code, besides a single ``ObjectRef``, :func:`ray.get() ` can also take a list of ``ObjectRef`` and return a list of results. If you run this tutorial, you will see output like: .. code-block:: text Progress: 0% Progress: 15% Progress: 28% Progress: 40% Progress: 50% Progress: 60% Progress: 70% Progress: 80% Progress: 90% Progress: 100% Estimated value of π is: 3.1412202 --- .. _ray-core-examples-tutorial: Ray Core Examples ================= .. toctree:: :hidden: :glob: * .. Organize example .rst files in the same manner as the .py files in ray/python/ray/train/examples. Below are examples for using Ray Core for a variety use cases. Beginner -------- .. list-table:: * - :doc:`A Gentle Introduction to Ray Core by Example ` * - :doc:`Using Ray for Highly Parallelizable Tasks ` * - :doc:`Monte Carlo Estimation of π ` Intermediate ------------ .. list-table:: * - :doc:`Running a Simple MapReduce Example with Ray Core ` * - :doc:`Speed Up Your Web Crawler by Parallelizing it with Ray ` Advanced -------- .. list-table:: * - :doc:`Build Batch Prediction Using Ray ` * - :doc:`Build a Simple Parameter Server Using Ray ` * - :doc:`Simple Parallel Model Selection ` * - :doc:`Learning to Play Pong ` --- .. _fault-tolerance: Fault tolerance =============== Ray is a distributed system, and that means failures can happen. Generally, Ray classifies failures into two classes: 1. application-level failures 2. system-level failures Bugs in user-level code or external system failures trigger application-level failures. Node failures, network failures, or just bugs in Ray trigger system-level failures. The following section contains the mechanisms that Ray provides to allow applications to recover from failures. To handle application-level failures, Ray provides mechanisms to catch errors, retry failed code, and handle misbehaving code. See the pages for :ref:`task ` and :ref:`actor ` fault tolerance for more information on these mechanisms. Ray also provides several mechanisms to automatically recover from internal system-level failures like :ref:`node failures `. In particular, Ray can automatically recover from some failures in the :ref:`distributed object store `. How to write fault tolerant Ray applications -------------------------------------------- There are several recommendations to make Ray applications fault tolerant: First, if the fault tolerance mechanisms provided by Ray don't work for you, you can always catch :ref:`exceptions ` caused by failures and recover manually. .. 
literalinclude:: doc_code/fault_tolerance_tips.py :language: python :start-after: __manual_retry_start__ :end-before: __manual_retry_end__ Second, avoid letting an ``ObjectRef`` outlive its :ref:`owner ` task or actor (the task or actor that creates the initial ``ObjectRef`` by calling :meth:`ray.put() ` or ``foo.remote()``). As long as there are still references to an object, the owner worker of the object keeps running even after the corresponding task or actor finishes. If the owner worker fails, Ray :ref:`cannot recover ` the object automatically for those who try to access the object. One example of creating such outlived objects is returning ``ObjectRef`` created by ``ray.put()`` from a task: .. literalinclude:: doc_code/fault_tolerance_tips.py :language: python :start-after: __return_ray_put_start__ :end-before: __return_ray_put_end__ In the preceding example, object ``x`` outlives its owner task ``a``. If the worker process running task ``a`` fails, calling ``ray.get`` on ``x_ref`` afterwards results in an ``OwnerDiedError`` exception. The following example is a fault tolerant version which returns ``x`` directly. In this example, the driver owns ``x`` and you only access it within the lifetime of the driver. If ``x`` is lost, Ray can automatically recover it via :ref:`lineage reconstruction `. See :doc:`/ray-core/patterns/return-ray-put` for more details. .. literalinclude:: doc_code/fault_tolerance_tips.py :language: python :start-after: __return_directly_start__ :end-before: __return_directly_end__ Third, avoid using :ref:`custom resource requirements ` that only particular nodes can satisfy. If that particular node fails, Ray won't retry the running tasks or actors. .. literalinclude:: doc_code/fault_tolerance_tips.py :language: python :start-after: __node_ip_resource_start__ :end-before: __node_ip_resource_end__ If you prefer running a task on a particular node, you can use the :class:`NodeAffinitySchedulingStrategy `. It allows you to specify the affinity as a soft constraint so even if the target node fails, the task can still be retried on other nodes. .. literalinclude:: doc_code/fault_tolerance_tips.py :language: python :start-after: __node_affinity_scheduling_strategy_start__ :end-before: __node_affinity_scheduling_strategy_end__ More about Ray fault tolerance ------------------------------ .. toctree:: :maxdepth: 1 fault_tolerance/tasks.rst fault_tolerance/actors.rst fault_tolerance/objects.rst fault_tolerance/nodes.rst fault_tolerance/gcs.rst --- .. _fault-tolerance-actors: .. _actor-fault-tolerance: Actor Fault Tolerance ===================== Actors can fail if the actor process dies, or if the **owner** of the actor dies. The owner of an actor is the worker that originally created the actor by calling ``ActorClass.remote()``. :ref:`Detached actors ` do not have an owner process and are cleaned up when the Ray cluster is destroyed. Actor process failure --------------------- Ray can automatically restart actors that crash unexpectedly. This behavior is controlled using ``max_restarts``, which sets the maximum number of times that an actor will be restarted. The default value of ``max_restarts`` is 0, meaning that the actor won't be restarted. If set to -1, the actor will be restarted infinitely many times. When an actor is restarted, its state will be recreated by rerunning its constructor. After the specified number of restarts, subsequent actor methods will raise a ``RayActorError``. 
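As a minimal sketch of this option (the ``FlakyCounter`` class below is a made-up example, not part of the linked sample code), an actor declared with ``max_restarts`` looks like this:

.. code-block:: python

    import ray

    ray.init()


    @ray.remote(max_restarts=3)  # Ray restarts the actor up to 3 times if its process crashes.
    class FlakyCounter:
        def __init__(self):
            # The constructor reruns on every restart, so this state starts over from 0.
            self.count = 0

        def increment(self) -> int:
            self.count += 1
            return self.count


    counter = FlakyCounter.remote()
    assert ray.get(counter.increment.remote()) == 1

Once all three restarts are used up, subsequent calls such as ``counter.increment.remote()`` raise ``RayActorError`` when you ``ray.get`` their results.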
By default, actor tasks execute with at-most-once semantics (``max_task_retries=0`` in the ``@ray.remote`` :func:`decorator `). This means that if an actor task is submitted to an actor that is unreachable, Ray will report the error with ``RayActorError``, a Python-level exception that is thrown when ``ray.get`` is called on the future returned by the task. Note that this exception may be thrown even though the task did indeed execute successfully. For example, this can happen if the actor dies immediately after executing the task. Ray also offers at-least-once execution semantics for actor tasks (``max_task_retries=-1`` or ``max_task_retries > 0``). This means that if an actor task is submitted to an actor that is unreachable, the system will automatically retry the task. With this option, the system will only throw a ``RayActorError`` to the application if one of the following occurs: (1) the actor’s ``max_restarts`` limit has been exceeded and the actor cannot be restarted anymore, or (2) the ``max_task_retries`` limit has been exceeded for this particular task. Note that if the actor is currently restarting when a task is submitted, this will count for one retry. The retry limit can be set to infinity with ``max_task_retries = -1``. You can experiment with this behavior by running the following code. .. literalinclude:: ../doc_code/actor_restart.py :language: python :start-after: __actor_restart_begin__ :end-before: __actor_restart_end__ For at-least-once actors, the system will still guarantee execution ordering according to the initial submission order. For example, any tasks submitted after a failed actor task will not execute on the actor until the failed actor task has been successfully retried. The system will not attempt to re-execute any tasks that executed successfully before the failure (unless ``max_task_retries`` is nonzero and the task is needed for :ref:`object reconstruction `). .. note:: For :ref:`async or threaded actors `, :ref:`tasks might be executed out of order `. Upon actor restart, the system will only retry *incomplete* tasks. Previously completed tasks will not be re-executed. At-least-once execution is best suited for read-only actors or actors with ephemeral state that does not need to be rebuilt after a failure. For actors that have critical state, the application is responsible for recovering the state, e.g., by taking periodic checkpoints and recovering from the checkpoint upon actor restart. Actor checkpointing ~~~~~~~~~~~~~~~~~~~ ``max_restarts`` automatically restarts the crashed actor, but it doesn't automatically restore application level state in your actor. Instead, you should manually checkpoint your actor's state and recover upon actor restart. For actors that are restarted manually, the actor's creator should manage the checkpoint and manually restart and recover the actor upon failure. This is recommended if you want the creator to decide when the actor should be restarted and/or if the creator is coordinating actor checkpoints with other execution: .. literalinclude:: ../doc_code/actor_checkpointing.py :language: python :start-after: __actor_checkpointing_manual_restart_begin__ :end-before: __actor_checkpointing_manual_restart_end__ Alternatively, if you are using Ray's automatic actor restart, the actor can checkpoint itself manually and restore from a checkpoint in the constructor: .. 
literalinclude:: ../doc_code/actor_checkpointing.py :language: python :start-after: __actor_checkpointing_auto_restart_begin__ :end-before: __actor_checkpointing_auto_restart_end__ .. note:: If the checkpoint is saved to external storage, make sure it's accessible to the entire cluster since the actor can be restarted on a different node. For example, save the checkpoint to cloud storage (e.g., S3) or a shared directory (e.g., via NFS). Actor creator failure --------------------- For :ref:`non-detached actors `, the owner of an actor is the worker that created it, i.e. the worker that called ``ActorClass.remote()``. Similar to :ref:`objects `, if the owner of an actor dies, then the actor will also fate-share with the owner. Ray will not automatically recover an actor whose owner is dead, even if it has a nonzero ``max_restarts``. Since :ref:`detached actors ` do not have an owner, they will still be restarted by Ray even if their original creator dies. Detached actors will continue to be automatically restarted until the maximum restarts is exceeded, the actor is destroyed, or until the Ray cluster is destroyed. You can try out this behavior in the following code. .. literalinclude:: ../doc_code/actor_creator_failure.py :language: python :start-after: __actor_creator_failure_begin__ :end-before: __actor_creator_failure_end__ Force-killing a misbehaving actor --------------------------------- Sometimes application-level code can cause an actor to hang or leak resources. In these cases, Ray allows you to recover from the failure by :ref:`manually terminating ` the actor. You can do this by calling ``ray.kill`` on any handle to the actor. Note that it does not need to be the original handle to the actor. If ``max_restarts`` is set, you can also allow Ray to automatically restart the actor by passing ``no_restart=False`` to ``ray.kill``. Unavailable actors ---------------------- When an actor can't accept method calls, a ``ray.get`` on the method's returned object reference may raise ``ActorUnavailableError``. This exception indicates the actor isn't accessible at the moment, but may recover after waiting and retrying. Typical cases include: - The actor is restarting. For example, it's waiting for resources or running the class constructor during the restart. - The actor is experiencing transient network issues, like connection outages. - The actor is dead, but the death hasn't yet been reported to the system. Actor method calls are executed at-most-once. When a ``ray.get()`` call raises the ``ActorUnavailableError`` exception, there's no guarantee on whether the actor executed the task or not. If the method has side effects, they may or may not be observable. Ray does guarantee that the method won't be executed twice, unless the actor or the method is configured with retries, as described in the next section. The actor may or may not recover in the next calls. Those subsequent calls may raise ``ActorDiedError`` if the actor is confirmed dead, ``ActorUnavailableError`` if it's still unreachable, or return values normally if the actor recovered. As a best practice, if the caller gets the ``ActorUnavailableError`` error, it should "quarantine" the actor and stop sending traffic to the actor. It can then periodically ping the actor until it raises ``ActorDiedError`` or returns OK. If a task has ``max_task_retries > 0`` and it received ``ActorUnavailableError``, Ray will retry the task up to ``max_task_retries`` times. 
If the actor is restarting in its constructor, the task retry will fail, consuming one retry count. If there are still retries remaining, Ray will retry again after ``RAY_task_retry_delay_ms``, until all retries are consumed or the actor is ready to accept tasks. If the constructor takes a long time to run, consider increasing ``max_task_retries`` or increase ``RAY_task_retry_delay_ms``. Actor method exceptions ----------------------- Sometimes you want to retry when an actor method raises exceptions. Use ``max_task_retries`` with ``retry_exceptions`` to enable this. Note that by default, retrying on user raised exceptions is disabled. To enable it, make sure the method is **idempotent**, that is, invoking it multiple times should be equivalent to invoking it only once. You can set ``retry_exceptions`` in the `@ray.method(retry_exceptions=...)` decorator, or in the `.options(retry_exceptions=...)` in the method call. Retry behavior depends on the value you set ``retry_exceptions`` to: - ``False`` (default): No retries for user exceptions. - ``True``: Ray retries a method on user exception up to ``max_task_retries`` times. - A list of exceptions: Ray retries a method on user exception up to ``max_task_retries`` times, only if the method raises an exception from these specific classes. ``max_task_retries`` applies to both exceptions and actor crashes. A Ray actor can set this option to apply to all of its methods. A method can also set an overriding option for itself. Ray searches for the first non-default value of ``max_task_retries`` in this order: - The method call's value, for example, `actor.method.options(max_task_retries=2)`. Ray ignores this value if you don't set it. - The method definition's value, for example, `@ray.method(max_task_retries=2)`. Ray ignores this value if you don't set it. - The actor creation call's value, for example, `Actor.options(max_task_retries=2)`. Ray ignores this value if you didn't set it. - The Actor class definition's value, for example, `@ray.remote(max_task_retries=2)` decorator. Ray ignores this value if you didn't set it. - The default value, `0`. For example, if a method sets `max_task_retries=5` and `retry_exceptions=True`, and the actor sets `max_restarts=2`, Ray executes the method up to 6 times: once for the initial invocation, and 5 additional retries. The 6 invocations may include 2 actor crashes. After the 6th invocation, a `ray.get` call to the result Ray ObjectRef raises the exception raised in the last invocation, or `ray.exceptions.RayActorError` if the actor crashed in the last invocation. --- .. _fault-tolerance-gcs: GCS Fault Tolerance =================== The Global Control Service, or GCS, manages cluster-level metadata. It also provides a handful of cluster-level operations including :ref:`actor `, :ref:`placement groups ` and node management. By default, the GCS isn't fault tolerant because it stores all data in memory. If it fails, the entire Ray cluster fails. To enable GCS fault tolerance, you need a highly available Redis instance, known as HA Redis. Then, when the GCS restarts, it loads all the data from the Redis instance and resumes regular functions. During the recovery period, the following functions aren't available: - Actor creation, deletion and reconstruction. - Placement group creation, deletion and reconstruction. - Resource management. - Worker node registration. - Worker process creation. However, running Ray tasks and actors remain alive, and any existing objects stay available. 
Setting up Redis ---------------- .. tab-set:: .. tab-item:: KubeRay (officially supported) If you are using :ref:`KubeRay `, refer to :ref:`KubeRay docs on GCS Fault Tolerance `. .. tab-item:: ray start If you are using :ref:`ray start ` to start the Ray head node, set the OS environment ``RAY_REDIS_ADDRESS`` to the Redis address, and supply the ``--redis-password`` flag with the password when calling ``ray start``: .. code-block:: shell RAY_REDIS_ADDRESS=redis_ip:port ray start --head --redis-password PASSWORD --redis-username default .. tab-item:: ray up If you are using :ref:`ray up ` to start the Ray cluster, change :ref:`head_start_ray_commands ` field to add ``RAY_REDIS_ADDRESS`` and ``--redis-password`` to the ``ray start`` command: .. code-block:: yaml head_start_ray_commands: - ray stop - ulimit -n 65536; RAY_REDIS_ADDRESS=redis_ip:port ray start --head --redis-password PASSWORD --redis-username default --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host=0.0.0.0 After you back the GCS with Redis, it recovers its state from Redis when it restarts. While the GCS recovers, each raylet tries to reconnect to it. If a raylet can't reconnect for more than 60 seconds, that raylet exits and the corresponding node fails. Set this timeout threshold with the OS environment variable ``RAY_gcs_rpc_server_reconnect_timeout_s``. If the GCS IP address might change after restarts, use a qualified domain name and pass it to all raylets at start time. Each raylet resolves the domain name and connects to the correct GCS. You need to ensure that at any time, only one GCS is alive. .. note:: GCS fault tolerance with external Redis is officially supported only if you are using :ref:`KubeRay ` for :ref:`Ray serve fault tolerance `. For other cases, you can use it at your own risk and you need to implement additional mechanisms to detect the failure of GCS or the head node and restart it. .. note:: You can also enable GCS fault tolerance when running Ray on `Anyscale `_. See the Anyscale `documentation `_ for instructions. --- .. _fault-tolerance-nodes: Node Fault Tolerance ==================== A Ray cluster consists of one or more worker nodes, each of which consists of worker processes and system processes (e.g. raylet). One of the worker nodes is designated as the head node and has extra processes like the GCS. Here, we describe node failures and their impact on tasks, actors, and objects. Worker node failure ------------------- When a worker node fails, all the running tasks and actors will fail and all the objects owned by worker processes of this node will be lost. In this case, the :ref:`tasks `, :ref:`actors `, :ref:`objects ` fault tolerance mechanisms will kick in and try to recover the failures using other worker nodes. Head node failure ----------------- When a head node fails, the entire Ray cluster fails. To tolerate head node failures, we need to make :ref:`GCS fault tolerant ` so that when we start a new head node we still have all the cluster-level data. Raylet failure -------------- When a raylet process fails, the corresponding node will be marked as dead and is treated the same as a node failure. Each raylet is associated with a unique id, so even if the raylet restarts on the same physical machine, it'll be treated as a new raylet/node to the Ray cluster. --- .. _fault-tolerance-objects: .. 
_object-fault-tolerance: Object Fault Tolerance ====================== A Ray object has both data (the value returned when calling ``ray.get``) and metadata (e.g., the location of the value). Data is stored in the Ray object store while the metadata is stored at the object's **owner**. The owner of an object is the worker process that creates the original ``ObjectRef``, e.g., by calling ``f.remote()`` or ``ray.put()``. Note that this worker is usually a distinct process from the worker that creates the **value** of the object, except in cases of ``ray.put``. .. literalinclude:: ../doc_code/owners.py :language: python :start-after: __owners_begin__ :end-before: __owners_end__ Ray can automatically recover from data loss but not owner failure. .. _fault-tolerance-objects-reconstruction: Recovering from data loss ------------------------- When an object value is lost from the object store, such as during node failures, Ray will use *lineage reconstruction* to recover the object. Ray will first automatically attempt to recover the value by looking for copies of the same object on other nodes. If none are found, then Ray will automatically recover the value by :ref:`re-executing ` the task that previously created the value. Arguments to the task are recursively reconstructed through the same mechanism. Lineage reconstruction currently has the following limitations: * The object, and any of its transitive dependencies, must have been generated by a task (actor or non-actor). This means that **objects created by ray.put are not recoverable**. * Tasks are assumed to be deterministic and idempotent. Thus, **by default, objects created by actor tasks are not reconstructable**. To allow reconstruction of actor task results, set the ``max_task_retries`` parameter to a non-zero value (see :ref:`actor fault tolerance ` for more details). * Tasks will only be re-executed up to their maximum number of retries. By default, a non-actor task can be retried up to 3 times and an actor task cannot be retried. This can be overridden with the ``max_retries`` parameter for :ref:`remote functions ` and the ``max_task_retries`` parameter for :ref:`actors `. * The owner of the object must still be alive (see :ref:`below `). Lineage reconstruction can cause higher than usual driver memory usage because the driver keeps the descriptions of any tasks that may be re-executed in case of failure. To limit the amount of memory used by lineage, set the environment variable ``RAY_max_lineage_bytes`` (default 1GB) to evict lineage if the threshold is exceeded. To disable lineage reconstruction entirely, set the environment variable ``RAY_TASK_MAX_RETRIES=0`` during ``ray start`` or ``ray.init``. With this setting, if there are no copies of an object left, an ``ObjectLostError`` will be raised. .. _fault-tolerance-ownership: Recovering from owner failure ----------------------------- The owner of an object can die because of node or worker process failure. Currently, **Ray does not support recovery from owner failure**. In this case, Ray will clean up any remaining copies of the object's value to prevent a memory leak. Any workers that subsequently try to get the object's value will receive an ``OwnerDiedError`` exception, which can be handled manually. Understanding ``ObjectLostErrors`` ---------------------------------- Ray throws an ``ObjectLostError`` to the application when an object cannot be retrieved due to application or system error. 
This can occur during a ``ray.get()`` call or when fetching a task's arguments, and can happen for a number of reasons. Here is a guide to understanding the root cause for different error types: - ``OwnerDiedError``: The owner of an object, i.e., the Python worker that first created the ``ObjectRef`` via ``.remote()`` or ``ray.put()``, has died. The owner stores critical object metadata and an object cannot be retrieved if this process is lost. - ``ObjectReconstructionFailedError``: This error is thrown if an object, or another object that this object depends on, cannot be reconstructed due to one of the limitations described :ref:`above `. - ``ReferenceCountingAssertionError``: The object has already been deleted, so it cannot be retrieved. Ray implements automatic memory management through distributed reference counting, so this error should not happen in general. However, there is a `known edge case `_ that can produce this error. - ``ObjectFetchTimedOutError``: A node timed out while trying to retrieve a copy of the object from a remote node. This error usually indicates a system-level bug. The timeout period can be configured using the ``RAY_fetch_fail_timeout_milliseconds`` environment variable (default 10 minutes). - ``ObjectLostError``: The object was successfully created, but no copy is reachable. This is a generic error thrown when lineage reconstruction is disabled and all copies of the object are lost from the cluster. --- .. _fault-tolerance-tasks: .. _task-fault-tolerance: Task Fault Tolerance ==================== Tasks can fail due to application-level errors, e.g., Python-level exceptions, or system-level failures, e.g., a machine fails. Here, we describe the mechanisms that an application developer can use to recover from these errors. Catching application-level failures ----------------------------------- Ray surfaces application-level failures as Python-level exceptions. When a task on a remote worker or actor fails due to a Python-level exception, Ray wraps the original exception in a ``RayTaskError`` and stores this as the task's return value. This wrapped exception will be thrown to any worker that tries to get the result, either by calling ``ray.get`` or if the worker is executing another task that depends on the object. If the user's exception type can be subclassed, the raised exception is an instance of both ``RayTaskError`` and the user's exception type so the user can try-catch either of them. Otherwise, the wrapped exception is just ``RayTaskError`` and the actual user's exception type can be accessed via the ``cause`` field of the ``RayTaskError``. .. literalinclude:: ../doc_code/task_exceptions.py :language: python :start-after: __task_exceptions_begin__ :end-before: __task_exceptions_end__ Example code of catching the user exception type when the exception type can be subclassed: .. literalinclude:: ../doc_code/task_exceptions.py :language: python :start-after: __catch_user_exceptions_begin__ :end-before: __catch_user_exceptions_end__ Example code of accessing the user exception type when the exception type can *not* be subclassed: .. literalinclude:: ../doc_code/task_exceptions.py :language: python :start-after: __catch_user_final_exceptions_begin__ :end-before: __catch_user_final_exceptions_end__ If Ray can't serialize the user's exception, it converts the exception to a ``RayError``. .. 
literalinclude:: ../doc_code/task_exceptions.py :language: python :start-after: __unserializable_exceptions_begin__ :end-before: __unserializable_exceptions_end__ Use `ray list tasks` from :ref:`State API CLI ` to query task exit details: .. code-block:: bash # This API is only available when you download Ray via `pip install "ray[default]"` ray list tasks .. code-block:: bash ======== List: 2023-05-26 10:32:00.962610 ======== Stats: ------------------------------ Total: 3 Table: ------------------------------ TASK_ID ATTEMPT_NUMBER NAME STATE JOB_ID ACTOR_ID TYPE FUNC_OR_CLASS_NAME PARENT_TASK_ID NODE_ID WORKER_ID ERROR_TYPE 0 16310a0f0a45af5cffffffffffffffffffffffff01000000 0 f FAILED 01000000 NORMAL_TASK f ffffffffffffffffffffffffffffffffffffffff01000000 767bd47b72efb83f33dda1b661621cce9b969b4ef00788140ecca8ad b39e3c523629ab6976556bd46be5dbfbf319f0fce79a664122eb39a9 TASK_EXECUTION_EXCEPTION 1 c2668a65bda616c1ffffffffffffffffffffffff01000000 0 g FAILED 01000000 NORMAL_TASK g ffffffffffffffffffffffffffffffffffffffff01000000 767bd47b72efb83f33dda1b661621cce9b969b4ef00788140ecca8ad b39e3c523629ab6976556bd46be5dbfbf319f0fce79a664122eb39a9 TASK_EXECUTION_EXCEPTION 2 c8ef45ccd0112571ffffffffffffffffffffffff01000000 0 f FAILED 01000000 NORMAL_TASK f ffffffffffffffffffffffffffffffffffffffff01000000 767bd47b72efb83f33dda1b661621cce9b969b4ef00788140ecca8ad b39e3c523629ab6976556bd46be5dbfbf319f0fce79a664122eb39a9 TASK_EXECUTION_EXCEPTION .. _task-retries: Retrying failed tasks --------------------- When a worker is executing a task, if the worker dies unexpectedly, either because the process crashed or because the machine failed, Ray will rerun the task until either the task succeeds or the maximum number of retries is exceeded. The default number of retries is 3 and can be overridden by specifying ``max_retries`` in the ``@ray.remote`` decorator. Specifying -1 allows infinite retries, and 0 disables retries. To override the default number of retries for all tasks submitted, set the OS environment variable ``RAY_TASK_MAX_RETRIES``. e.g., by passing this to your driver script or by using :ref:`runtime environments`. You can experiment with this behavior by running the following code. .. literalinclude:: ../doc_code/tasks_fault_tolerance.py :language: python :start-after: __tasks_fault_tolerance_retries_begin__ :end-before: __tasks_fault_tolerance_retries_end__ When a task returns a result in the Ray object store, it is possible for the resulting object to be lost **after** the original task has already finished. In these cases, Ray will also try to automatically recover the object by re-executing the tasks that created the object. This can be configured through the same ``max_retries`` option described here. See :ref:`object fault tolerance ` for more information. By default, Ray will **not** retry tasks upon exceptions thrown by application code. However, you may control whether application-level errors are retried, and even **which** application-level errors are retried, via the ``retry_exceptions`` argument. This is ``False`` by default. To enable retries upon application-level errors, set ``retry_exceptions=True`` to retry upon any exception, or pass a list of retryable exceptions. An example is shown below. .. 
literalinclude:: ../doc_code/tasks_fault_tolerance.py :language: python :start-after: __tasks_fault_tolerance_retries_exception_begin__ :end-before: __tasks_fault_tolerance_retries_exception_end__ Use `ray list tasks -f task_id=\` from :ref:`State API CLI ` to see task attempts failures and retries: .. code-block:: bash # This API is only available when you download Ray via `pip install "ray[default]"` ray list tasks -f task_id=16310a0f0a45af5cffffffffffffffffffffffff01000000 .. code-block:: bash ======== List: 2023-05-26 10:38:08.809127 ======== Stats: ------------------------------ Total: 2 Table: ------------------------------ TASK_ID ATTEMPT_NUMBER NAME STATE JOB_ID ACTOR_ID TYPE FUNC_OR_CLASS_NAME PARENT_TASK_ID NODE_ID WORKER_ID ERROR_TYPE 0 16310a0f0a45af5cffffffffffffffffffffffff01000000 0 potentially_fail FAILED 01000000 NORMAL_TASK potentially_fail ffffffffffffffffffffffffffffffffffffffff01000000 94909e0958e38d10d668aa84ed4143d0bf2c23139ae1a8b8d6ef8d9d b36d22dbf47235872ad460526deaf35c178c7df06cee5aa9299a9255 WORKER_DIED 1 16310a0f0a45af5cffffffffffffffffffffffff01000000 1 potentially_fail FINISHED 01000000 NORMAL_TASK potentially_fail ffffffffffffffffffffffffffffffffffffffff01000000 94909e0958e38d10d668aa84ed4143d0bf2c23139ae1a8b8d6ef8d9d 22df7f2a9c68f3db27498f2f435cc18582de991fbcaf49ce0094ddb0 Cancelling misbehaving tasks ---------------------------- If a task is hanging, you may want to cancel the task to continue to make progress. You can do this by calling ``ray.cancel`` on an ``ObjectRef`` returned by the task. By default, this will send a KeyboardInterrupt to the task's worker if it is mid-execution. Passing ``force=True`` to ``ray.cancel`` will force-exit the worker. See :func:`the API reference ` for ``ray.cancel`` for more details. Note that currently, Ray will not automatically retry tasks that have been cancelled. Sometimes, application-level code may cause memory leaks on a worker after repeated task executions, e.g., due to bugs in third-party libraries. To make progress in these cases, you can set the ``max_calls`` option in a task's ``@ray.remote`` decorator. Once a worker has executed this many invocations of the given remote function, it will automatically exit. By default, ``max_calls`` is set to infinity. --- .. _handling_dependencies: Environment Dependencies ======================== Your Ray application may have dependencies that exist outside of your Ray script. For example: * Your Ray script may import/depend on some Python packages. * Your Ray script may be looking for some specific environment variables to be available. * Your Ray script may import some files outside of the script. One frequent problem when running on a cluster is that Ray expects these "dependencies" to exist on each Ray node. If these are not present, you may run into issues such as ``ModuleNotFoundError``, ``FileNotFoundError`` and so on. To address this problem, you can (1) prepare your dependencies on the cluster in advance (e.g. using a container image) using the Ray :ref:`Cluster Launcher `, or (2) use Ray's :ref:`runtime environments ` to install them on the fly. For production usage or non-changing environments, we recommend installing your dependencies into a container image and specifying the image using the Cluster Launcher. For dynamic environments (e.g. for development and experimentation), we recommend using runtime environments. Concepts -------- - **Ray Application**. A program including a Ray script that calls ``ray.init()`` and uses Ray tasks or actors. 
- **Dependencies**, or **Environment**. Anything outside of the Ray script that your application needs to run, including files, packages, and environment variables. - **Files**. Code files, data files or other files that your Ray application needs to run. - **Packages**. External libraries or executables required by your Ray application, often installed via ``pip`` or ``conda``. - **Local machine** and **Cluster**. Usually, you may want to separate the Ray cluster compute machines/pods from the machine/pod that handles and submits the application. You can submit a Ray Job via :ref:`the Ray Job Submission mechanism `, or use `ray attach` to connect to a cluster interactively. We call the machine submitting the job your *local machine*. - **Job**. A :ref:`Ray job ` is a single application: it is the collection of Ray tasks, objects, and actors that originate from the same script. .. _using-the-cluster-launcher: Preparing an environment using the Ray Cluster launcher ------------------------------------------------------- The first way to set up dependencies is to prepare a single environment across the cluster before starting the Ray runtime. - You can build all your files and dependencies into a container image and specify this in your :ref:`Cluster YAML Configuration `. - You can also install packages using ``setup_commands`` in the Ray Cluster configuration file (:ref:`reference `); these commands will be run as each node joins the cluster. Note that for production settings, it is recommended to build any necessary packages into a container image instead. - You can push local files to the cluster using ``ray rsync_up`` (:ref:`reference`). .. _runtime-environments: Runtime environments -------------------- .. note:: This feature requires a full installation of Ray using ``pip install "ray[default]"``. This feature is available starting with Ray 1.4.0 and is currently supported on macOS and Linux, with beta support on Windows. The second way to set up dependencies is to install them dynamically while Ray is running. A **runtime environment** describes the dependencies your Ray application needs to run, including :ref:`files, packages, environment variables, and more `. It is installed dynamically on the cluster at runtime and cached for future use (see :ref:`Caching and Garbage Collection ` for details about the lifecycle). Runtime environments can be used on top of the prepared environment from :ref:`the Ray Cluster launcher ` if it was used. For example, you can use the Cluster launcher to install a base set of packages, and then use runtime environments to install additional packages. In contrast with the base cluster environment, a runtime environment will only be active for Ray processes. (For example, if using a runtime environment specifying a ``pip`` package ``my_pkg``, the statement ``import my_pkg`` will fail if called outside of a Ray task, actor, or job.) Runtime environments also allow you to set dependencies per-task, per-actor, and per-job on a long-running Ray cluster. .. testcode:: :hide: import ray ray.shutdown() .. testcode:: import ray runtime_env = {"pip": ["emoji"]} ray.init(runtime_env=runtime_env) @ray.remote def f(): import emoji return emoji.emojize('Python is :thumbs_up:') print(ray.get(f.remote())) .. testoutput:: Python is 👍 A runtime environment can be described by a Python `dict`: .. 
literalinclude:: /ray-core/doc_code/runtime_env_example.py :language: python :start-after: __runtime_env_pip_def_start__ :end-before: __runtime_env_pip_def_end__ Alternatively, you can use :class:`ray.runtime_env.RuntimeEnv `: .. literalinclude:: /ray-core/doc_code/runtime_env_example.py :language: python :start-after: __strong_typed_api_runtime_env_pip_def_start__ :end-before: __strong_typed_api_runtime_env_pip_def_end__ For more examples, jump to the :ref:`API Reference `. There are two primary scopes for which you can specify a runtime environment: * :ref:`Per-Job `, and * :ref:`Per-Task/Actor, within a job `. .. _rte-per-job: Specifying a Runtime Environment Per-Job ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ You can specify a runtime environment for your whole job, whether running a script directly on the cluster, using the :ref:`Ray Jobs API `, or submitting a :ref:`KubeRay RayJob `: .. literalinclude:: /ray-core/doc_code/runtime_env_example.py :language: python :start-after: __ray_init_start__ :end-before: __ray_init_end__ .. testcode:: :skipif: True # Option 2: Using Ray Jobs API (Python SDK) from ray.job_submission import JobSubmissionClient client = JobSubmissionClient("http://:8265") job_id = client.submit_job( entrypoint="python my_ray_script.py", runtime_env=runtime_env, ) .. code-block:: bash # Option 3: Using Ray Jobs API (CLI). (Note: can use --runtime-env to pass a YAML file instead of an inline JSON string.) $ ray job submit --address="http://:8265" --runtime-env-json='{"working_dir": "/data/my_files", "pip": ["emoji"]}' -- python my_ray_script.py .. code-block:: yaml # Option 4: Using KubeRay RayJob. You can specify the runtime environment in the RayJob YAML manifest. # [...] spec: runtimeEnvYAML: | pip: - requests==2.26.0 - pendulum==2.1.2 env_vars: KEY: "VALUE" .. warning:: Specifying the ``runtime_env`` argument in the ``submit_job`` or ``ray job submit`` call ensures the runtime environment is installed on the cluster before the entrypoint script is run. If ``runtime_env`` is specified from ``ray.init(runtime_env=...)``, the runtime env is only applied to all children Tasks and Actors, not the entrypoint script (Driver) itself. If ``runtime_env`` is specified by both ``ray job submit`` and ``ray.init``, the runtime environments are merged. See :ref:`Runtime Environment Specified by Both Job and Driver ` for more details. .. note:: There are two options for when to install the runtime environment: 1. As soon as the job starts (i.e., as soon as ``ray.init()`` is called), the dependencies are eagerly downloaded and installed. 2. The dependencies are installed only when a task is invoked or an actor is created. The default is option 1. To change the behavior to option 2, add ``"eager_install": False`` to the ``config`` of ``runtime_env``. .. _rte-per-task-actor: Specifying a Runtime Environment Per-Task or Per-Actor ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ You can specify different runtime environments per-actor or per-task using ``.options()`` or the ``@ray.remote`` decorator: .. literalinclude:: /ray-core/doc_code/runtime_env_example.py :language: python :start-after: __per_task_per_actor_start__ :end-before: __per_task_per_actor_end__ This allows you to have actors and tasks running in their own environments, independent of the surrounding environment. (The surrounding environment could be the job's runtime environment, or the system environment of the cluster.) .. 
warning:: Ray does not guarantee compatibility between tasks and actors with conflicting runtime environments. For example, if an actor whose runtime environment contains a ``pip`` package tries to communicate with an actor with a different version of that package, it can lead to unexpected behavior such as unpickling errors. Common Workflows ^^^^^^^^^^^^^^^^ This section describes some common use cases for runtime environments. These use cases are not mutually exclusive; all of the options described below can be combined in a single runtime environment. .. _workflow-local-files: Using Local Files """"""""""""""""" Your Ray application might depend on source files or data files. For a development workflow, these might live on your local machine, but when it comes time to run things at scale, you will need to get them to your remote cluster. The following simple example explains how to get your local files on the cluster. .. testcode:: :hide: import ray ray.shutdown() .. testcode:: import os import ray os.makedirs("/tmp/runtime_env_working_dir", exist_ok=True) with open("/tmp/runtime_env_working_dir/hello.txt", "w") as hello_file: hello_file.write("Hello World!") # Specify a runtime environment for the entire Ray job ray.init(runtime_env={"working_dir": "/tmp/runtime_env_working_dir"}) # Create a Ray task, which inherits the above runtime env. @ray.remote def f(): # The function will have its working directory changed to its node's # local copy of /tmp/runtime_env_working_dir. return open("hello.txt").read() print(ray.get(f.remote())) .. testoutput:: Hello World! .. note:: The example above is written to run on a local machine, but as for all of these examples, it also works when specifying a Ray cluster to connect to (e.g., using ``ray.init("ray://123.456.7.89:10001", runtime_env=...)`` or ``ray.init(address="auto", runtime_env=...)``). The specified local directory will automatically be pushed to the cluster nodes when ``ray.init()`` is called. You can also specify files via a remote cloud storage URI; see :ref:`remote-uris` for details. If you specify a `working_dir`, Ray always prepares it first, and it's present in the creation of other runtime environments in the `${RAY_RUNTIME_ENV_CREATE_WORKING_DIR}` environment variable. This sequencing allows `pip` and `conda` to reference local files in the `working_dir` like `requirements.txt` or `environment.yml`. See `pip` and `conda` sections in :ref:`runtime-environments-api-ref` for more details. Using ``conda`` or ``pip`` packages """"""""""""""""""""""""""""""""""" Your Ray application might depend on Python packages (for example, ``pendulum`` or ``requests``) via ``import`` statements. Ray ordinarily expects all imported packages to be preinstalled on every node of the cluster; in particular, these packages are not automatically shipped from your local machine to the cluster or downloaded from any repository. However, using runtime environments you can dynamically specify packages to be automatically downloaded and installed in a virtual environment for your Ray job, or for specific Ray tasks or actors. .. testcode:: :hide: import ray ray.shutdown() .. testcode:: import ray import requests # This example runs on a local machine, but you can also do # ray.init(address=..., runtime_env=...) to connect to a cluster. ray.init(runtime_env={"pip": ["requests"]}) @ray.remote def reqs(): return requests.get("https://www.ray.io/").status_code print(ray.get(reqs.remote())) .. 
testoutput:: 200 You may also specify your ``pip`` dependencies either via a Python list or a local ``requirements.txt`` file. Consider specifying a ``requirements.txt`` file when your ``pip install`` command requires options such as ``--extra-index-url`` or ``--find-links``; see ``_ for details. Alternatively, you can specify a ``conda`` environment, either as a Python dictionary or via a local ``environment.yml`` file. This conda environment can include ``pip`` packages. For details, head to the :ref:`API Reference `. .. warning:: Since the packages in the ``runtime_env`` are installed at runtime, be cautious when specifying ``conda`` or ``pip`` packages whose installations involve building from source, as this can be slow. .. note:: When using the ``"pip"`` field, the specified packages will be installed "on top of" the base environment using ``virtualenv``, so existing packages on your cluster will still be importable. By contrast, when using the ``conda`` field, your Ray tasks and actors will run in an isolated environment. The ``conda`` and ``pip`` fields cannot both be used in a single ``runtime_env``. .. note:: The ``ray[default]`` package itself will automatically be installed in the environment. For the ``conda`` field only, if you are using any other Ray libraries (for example, Ray Serve), then you will need to specify the library in the runtime environment (e.g. ``runtime_env = {"conda": {"dependencies": ["pytorch", "pip", {"pip": ["requests", "ray[serve]"]}]}}``.) .. note:: ``conda`` environments must have the same Python version as the Ray cluster. Do not list ``ray`` in the ``conda`` dependencies, as it will be automatically installed. .. _use-uv-for-package-management: Using ``uv`` for package management """"""""""""""""""""""""""""""""""" The recommended approach for package management with `uv` in runtime environments is through `uv run`. This method offers several key advantages: First, it keeps dependencies synchronized between your driver and Ray workers. Additionally, it provides full support for `pyproject.toml` including editable packages. It also allows you to lock package versions using `uv lock`. For more details, see the `UV scripts documentation `_ as well as `our blog post `_. Create a file `pyproject.toml` in your working directory like the following: .. code-block:: toml [project] name = "test" version = "0.1" dependencies = [ "emoji", "ray", ] And then a `test.py` like the following: .. testcode:: :skipif: True import emoji import ray @ray.remote def f(): return emoji.emojize('Python is :thumbs_up:') # Execute 1000 copies of f across a cluster. print(ray.get([f.remote() for _ in range(1000)])) and run the driver script with `uv run test.py`. This runs 1000 copies of the `f` function across a number of Python worker processes in a Ray cluster. The `emoji` dependency, in addition to being available for the main script, is also available for all worker processes. Also, the source code in the current working directory is available to all the workers. This workflow also supports editable packages, for example, you can use `uv add --editable ./path/to/package` where `./path/to/package` must be inside your current working directory so it's available to all workers. See `here `_ for an end-to-end example of how to use `uv run` to run a batch inference workload with Ray Data. **Using uv in a Ray Job:** With the same `pyproject.toml` and `test.py` files as above, you can submit a Ray Job via .. code-block:: sh ray job submit --working-dir . 
-- uv run test.py This command makes sure both the driver and workers of the job run in the uv environment as specified by your `pyproject.toml`. **Using uv with Ray Serve:** With appropriate `pyproject.toml` and `app.py` files, you can run a Ray Serve application with `uv run serve run app:main`. **Best Practices and Tips:** - If you are running on a Ray Cluster, the Ray and Python versions of your uv environment must be the same as the Ray and Python versions of your cluster, or you will get a version mismatch exception. There are multiple ways to solve this: 1. If you are using ephemeral Ray clusters, run the application on a cluster with the right versions. 2. If you need to run on a cluster with different versions, consider modifying the versions of your uv environment by updating the `pyproject.toml` file or by using the `--active` flag with `uv run` (i.e., `uv run --active main.py`). - Use `uv lock` to generate a lockfile and make sure all your dependencies are frozen, so things won't change in uncontrolled ways if a new version of a package gets released. - If you have a requirements.txt file, you can use `uv add -r requirements.txt` to add the dependencies to your `pyproject.toml` and then use that with `uv run`. - If your `pyproject.toml` is in some subdirectory, you can use `uv run --project` to use it from there. - If you use `uv run` and want to set the working directory to something that isn't the current working directory, use the `--directory` flag. The Ray uv integration makes sure your `working_dir` is set accordingly. **Advanced use cases:** Under the hood, the `uv run` support is implemented using a low-level runtime environment plugin called `py_executable`. It allows you to specify the Python executable (including arguments) that Ray workers will be started in. In the case of uv, the `py_executable` is set to `uv run` with the same parameters that were used to run the driver. Also, the `working_dir` runtime environment is used to propagate the working directory of the driver (including the `pyproject.toml`) to the workers. This allows uv to set up the right dependencies and environment for the workers to run in. There are some advanced use cases where you might want to use the `py_executable` mechanism directly in your programs: - *Applications with heterogeneous dependencies:* Ray supports using a different runtime environment for different tasks or actors. This is useful for deploying different inference engines, models, or microservices in different `Ray Serve deployments `_ and also for heterogeneous data pipelines in Ray Data. To implement this, you can specify a different `py_executable` for each of the runtime environments and use `uv run` with a different `--project` parameter for each. Alternatively, you can use a different `working_dir` for each environment. - *Customizing the command the worker runs in:* On the workers, you might want to customize uv with some special arguments that aren't used for the driver. Or, you might want to run processes using `poetry run`, a build system like bazel, a profiler, or a debugger. In these cases, you can explicitly specify the executable the worker should run in via `py_executable`. It could even be a shell script that is stored in `working_dir` if you are trying to wrap multiple processes in more complex ways. .. note:: The uv environment is inherited by all child tasks and actors.
If you want to mix environments, for example, `pip` runtime environments with `uv run`, you need to set the Python executable back to an executable that's not running in the isolated uv environment like the following: .. code-block:: toml [project] name = "test" version = "0.1" dependencies = [ "emoji", "ray", "pip", "virtualenv", ] .. testcode:: :skipif: True import ray @ray.remote(runtime_env={"pip": ["wikipedia"], "py_executable": "python"}) def f(): import wikipedia return wikipedia.summary("Wikipedia") @ray.remote def g(): import emoji return emoji.emojize('Python is :thumbs_up:') print(ray.get(f.remote())) print(ray.get(g.remote())) While the above pattern can be useful for supporting legacy applications, the Ray Team recommends also using uv for tracking nested environments. You can use this approach by creating a separate `pyproject.toml` containing the dependencies of the nested environment. Library Development """"""""""""""""""" Suppose you are developing a library ``my_module`` on Ray. A typical iteration cycle will involve 1. Making some changes to the source code of ``my_module`` 2. Running a Ray script to test the changes, perhaps on a distributed cluster. To ensure your local changes show up across all Ray workers and can be imported properly, use the ``py_modules`` field. .. testcode:: :skipif: True import ray import my_module ray.init("ray://123.456.7.89:10001", runtime_env={"py_modules": [my_module]}) @ray.remote def test_my_module(): # No need to import my_module inside this function. my_module.test() ray.get(test_my_module.remote()) .. _runtime-environments-api-ref: API Reference ^^^^^^^^^^^^^ The ``runtime_env`` is a Python dictionary or a Python class :class:`ray.runtime_env.RuntimeEnv ` including one or more of the following fields: - ``working_dir`` (str): Specifies the working directory for the Ray workers. This must either be (1) an local existing directory with total size at most 500 MiB, (2) a local existing zipped file with total unzipped size at most 500 MiB (Note: ``excludes`` has no effect), or (3) a URI to a remotely-stored zip file containing the working directory for your job (no file size limit is enforced by Ray). See :ref:`remote-uris` for details. The specified directory will be downloaded to each node on the cluster, and Ray workers will be started in their node's copy of this directory. - Examples - ``"." # cwd`` - ``"/src/my_project"`` - ``"/src/my_project.zip"`` - ``"s3://path/to/my_dir.zip"`` Note: Setting a local directory per-task or per-actor is currently unsupported; it can only be set per-job (i.e., in ``ray.init()``). Note: By default, if the local directory contains a ``.gitignore`` and/or ``.rayignore`` file, the specified files are not uploaded to the cluster. To disable the ``.gitignore`` from being considered, set ``RAY_RUNTIME_ENV_IGNORE_GITIGNORE=1`` on the machine doing the uploading. Note: If the local directory contains symbolic links, Ray follows the links and the files they point to are uploaded to the cluster. - ``py_modules`` (List[str|module]): Specifies Python modules to be available for import in the Ray workers. (For more ways to specify packages, see also the ``pip`` and ``conda`` fields below.) Each entry must be either (1) a path to a local file or directory, (2) a URI to a remote zip or wheel file (see :ref:`remote-uris` for details), (3) a Python module object, or (4) a path to a local `.whl` file. 
- Examples of entries in the list: - ``"."`` - ``"/local_dependency/my_dir_module"`` - ``"/local_dependency/my_file_module.py"`` - ``"s3://bucket/my_module.zip"`` - ``my_module # Assumes my_module has already been imported, e.g. via 'import my_module'`` - ``my_module.whl`` - ``"s3://bucket/my_module.whl"`` The modules will be downloaded to each node on the cluster. Note: Setting options (1), (3) and (4) per-task or per-actor is currently unsupported; they can only be set per-job (i.e., in ``ray.init()``). Note: For option (1), by default, if the local directory contains a ``.gitignore`` and/or ``.rayignore`` file, the specified files are not uploaded to the cluster. To disable the ``.gitignore`` from being considered, set ``RAY_RUNTIME_ENV_IGNORE_GITIGNORE=1`` on the machine doing the uploading. - ``py_executable`` (str): Specifies the executable used for running the Ray workers. It can include arguments as well. The executable can be located in the `working_dir`. This runtime environment is useful for running workers in a custom debugger or profiler, as well as in an environment set up by a package manager like `UV` (see :ref:`here `). Note: ``py_executable`` is new functionality and currently experimental. If you have some requirements or run into any problems, raise issues in `github `__. - ``excludes`` (List[str]): When used with ``working_dir`` or ``py_modules``, specifies a list of files or paths to exclude from being uploaded to the cluster. This field uses the pattern-matching syntax used by ``.gitignore`` files: see ``_ for details. Note: In accordance with ``.gitignore`` syntax, if there is a separator (``/``) at the beginning or middle (or both) of the pattern, then the pattern is interpreted relative to the level of the ``working_dir``. In particular, you shouldn't use absolute paths (e.g. `/Users/my_working_dir/subdir/`) with `excludes`; rather, you should use the relative path `/subdir/` (written here with a leading `/` to match only the top-level `subdir` directory, rather than all directories named `subdir` at all levels.) - Example: ``{"working_dir": "/Users/my_working_dir/", "excludes": ["my_file.txt", "/subdir/", "path/to/dir", "*.log"]}`` - ``pip`` (dict | List[str] | str): Either (1) a list of pip `requirements specifiers `_, (2) a string containing the path to a local pip `“requirements.txt” `_ file, or (3) a python dictionary that has the following fields: (a) ``packages`` (required, List[str]): a list of pip packages, (b) ``pip_check`` (optional, bool): whether to enable `pip check `_ at the end of pip install, defaults to ``False``. (c) ``pip_version`` (optional, str): the version of pip; Ray prepends the package name "pip" to the ``pip_version`` to form the final requirement string. (d) ``pip_install_options`` (optional, List[str]): user-provided options for the ``pip install`` command. Defaults to ``["--disable-pip-version-check", "--no-cache-dir"]``. The syntax of a requirement specifier is defined in full in `PEP 508 `_. This will be installed in the Ray workers at runtime. Packages in the preinstalled cluster environment will still be available. To use a library like Ray Serve or Ray Tune, you will need to include ``"ray[serve]"`` or ``"ray[tune]"`` here. The Ray version must match that of the cluster.
- Example: ``["requests==1.0.0", "aiohttp", "ray[serve]"]`` - Example: ``"./requirements.txt"`` - Example: ``{"packages":["tensorflow", "requests"], "pip_check": False, "pip_version": "==22.0.2;python_version=='3.8.11'"}`` When specifying a path to a ``requirements.txt`` file, the file must be present on your local machine and it must be a valid absolute path or relative filepath relative to your local current working directory, *not* relative to the ``working_dir`` specified in the ``runtime_env``. Furthermore, referencing local files *within* a ``requirements.txt`` file isn't directly supported (e.g., ``-r ./my-laptop/more-requirements.txt``, ``./my-pkg.whl``). Instead, use the ``${RAY_RUNTIME_ENV_CREATE_WORKING_DIR}`` environment variable in the creation process. For example, use ``-r ${RAY_RUNTIME_ENV_CREATE_WORKING_DIR}/my-laptop/more-requirements.txt`` or ``${RAY_RUNTIME_ENV_CREATE_WORKING_DIR}/my-pkg.whl`` to reference local files, while ensuring they're in the ``working_dir``. - ``uv`` (dict | List[str] | str): Alpha version feature. This plugin is the ``uv pip`` version of the ``pip`` plugin above. If you are looking for ``uv run`` support with ``pyproject.toml`` and ``uv.lock`` support, use :ref:`the uv run runtime environment plugin ` instead. Either (1) a list of uv `requirements specifiers `_, (2) a string containing the path to a local uv `“requirements.txt” `_ file, or (3) a python dictionary that has three fields: (a) ``packages`` (required, List[str]): a list of uv packages, (b) ``uv_version`` (optional, str): the version of uv; Ray will spell the package name "uv" in front of the ``uv_version`` to form the final requirement string. (c) ``uv_check`` (optional, bool): whether to enable pip check at the end of uv install, default to False. (d) ``uv_pip_install_options`` (optional, List[str]): user-provided options for ``uv pip install`` command, default to ``["--no-cache"]``. To override the default options and install without any options, use an empty list ``[]`` as install option value. The syntax of a requirement specifier is the same as ``pip`` requirements. This will be installed in the Ray workers at runtime. Packages in the preinstalled cluster environment will still be available. To use a library like Ray Serve or Ray Tune, you will need to include ``"ray[serve]"`` or ``"ray[tune]"`` here. The Ray version must match that of the cluster. - Example: ``["requests==1.0.0", "aiohttp", "ray[serve]"]`` - Example: ``"./requirements.txt"`` - Example: ``{"packages":["tensorflow", "requests"], "uv_version": "==0.4.0;python_version=='3.8.11'"}`` When specifying a path to a ``requirements.txt`` file, the file must be present on your local machine and it must be a valid absolute path or relative filepath relative to your local current working directory, *not* relative to the ``working_dir`` specified in the ``runtime_env``. Furthermore, referencing local files *within* a ``requirements.txt`` file isn't directly supported (e.g., ``-r ./my-laptop/more-requirements.txt``, ``./my-pkg.whl``). Instead, use the ``${RAY_RUNTIME_ENV_CREATE_WORKING_DIR}`` environment variable in the creation process. For example, use ``-r ${RAY_RUNTIME_ENV_CREATE_WORKING_DIR}/my-laptop/more-requirements.txt`` or ``${RAY_RUNTIME_ENV_CREATE_WORKING_DIR}/my-pkg.whl`` to reference local files, while ensuring they're in the ``working_dir``. 
- ``conda`` (dict | str): Either (1) a dict representing the conda environment YAML, (2) a string containing the path to a local `conda “environment.yml” `_ file, or (3) the name of a local conda environment already installed on each node in your cluster (e.g., ``"pytorch_p36"``) or its absolute path (e.g. ``"/home/youruser/anaconda3/envs/pytorch_p36"``) . In the first two cases, the Ray and Python dependencies will be automatically injected into the environment to ensure compatibility, so there is no need to manually include them. The Python and Ray version must match that of the cluster, so you likely should not specify them manually. Note that the ``conda`` and ``pip`` keys of ``runtime_env`` cannot both be specified at the same time---to use them together, please use ``conda`` and add your pip dependencies in the ``"pip"`` field in your conda ``environment.yaml``. - Example: ``{"dependencies": ["pytorch", "torchvision", "pip", {"pip": ["pendulum"]}]}`` - Example: ``"./environment.yml"`` - Example: ``"pytorch_p36"`` - Example: ``"/home/youruser/anaconda3/envs/pytorch_p36"`` When specifying a path to a ``environment.yml`` file, the file must be present on your local machine and it must be a valid absolute path or a relative filepath relative to your local current working directory, *not* relative to the ``working_dir`` specified in the ``runtime_env``. Furthermore, referencing local files *within* a ``environment.yml`` file isn't directly supported (e.g., ``-r ./my-laptop/more-requirements.txt``, ``./my-pkg.whl``). Instead, use the ``${RAY_RUNTIME_ENV_CREATE_WORKING_DIR}`` environment variable in the creation process. For example, use ``-r ${RAY_RUNTIME_ENV_CREATE_WORKING_DIR}/my-laptop/more-requirements.txt`` or ``${RAY_RUNTIME_ENV_CREATE_WORKING_DIR}/my-pkg.whl`` to reference local files, while ensuring they're in the ``working_dir``. - ``env_vars`` (Dict[str, str]): Environment variables to set. Environment variables already set on the cluster will still be visible to the Ray workers; so there is no need to include ``os.environ`` or similar in the ``env_vars`` field. By default, these environment variables override the same name environment variables on the cluster. You can also reference existing environment variables using ${ENV_VAR} to achieve the appending behavior. If the environment variable doesn't exist, it becomes an empty string `""`. - Example: ``{"OMP_NUM_THREADS": "32", "TF_WARNINGS": "none"}`` - Example: ``{"LD_LIBRARY_PATH": "${LD_LIBRARY_PATH}:/home/admin/my_lib"}`` - Non-existent variable example: ``{"ENV_VAR_NOT_EXIST": "${ENV_VAR_NOT_EXIST}:/home/admin/my_lib"}`` -> ``ENV_VAR_NOT_EXIST=":/home/admin/my_lib"``. - ``nsight`` (Union[str, Dict[str, str]]): specifies the config for the Nsight System Profiler. The value is either (1) "default", which refers to the `default config `_, or (2) a dict of Nsight System Profiler options and their values. See :ref:`here ` for more details on setup and usage. - Example: ``"default"`` - Example: ``{"stop-on-exit": "true", "t": "cuda,cublas,cudnn", "ftrace": ""}`` - ``image_uri`` (dict): Require a given Docker image. The worker process runs in a container with this image. - Example: ``{"image_uri": "anyscale/ray:2.31.0-py39-cpu"}`` Note: ``image_uri`` is experimental. If you have some requirements or run into any problems, raise issues in `github `__. - ``config`` (dict | :class:`ray.runtime_env.RuntimeEnvConfig `): config for runtime environment. Either a dict or a RuntimeEnvConfig. 
Fields: (1) ``setup_timeout_seconds`` (int): The timeout for runtime environment creation, in seconds. - Example: ``{"setup_timeout_seconds": 10}`` - Example: ``RuntimeEnvConfig(setup_timeout_seconds=10)`` (2) ``eager_install`` (bool): Indicates whether to install the runtime environment on the cluster at ``ray.init()`` time, before the workers are leased. This flag is set to ``True`` by default. If set to ``False``, the runtime environment will only be installed when the first task is invoked or when the first actor is created. Currently, specifying this option per-actor or per-task is not supported. - Example: ``{"eager_install": False}`` - Example: ``RuntimeEnvConfig(eager_install=False)`` .. _runtime-environments-caching: Caching and Garbage Collection """""""""""""""""""""""""""""" Runtime environment resources on each node (such as conda environments, pip packages, or downloaded ``working_dir`` or ``py_modules`` directories) will be cached on the cluster to enable quick reuse across different runtime environments within a job. Each field (``working_dir``, ``py_modules``, etc.) has its own cache whose size defaults to 10 GB. To change this default, you may set the environment variable ``RAY_RUNTIME_ENV_<FIELD>_CACHE_SIZE_GB`` on each node in your cluster before starting Ray, e.g., ``export RAY_RUNTIME_ENV_WORKING_DIR_CACHE_SIZE_GB=1.5``. When the cache size limit is exceeded, resources not currently used by any Actor, Task, or Job are deleted. .. _runtime-environments-job-conflict: Runtime Environment Specified by Both Job and Driver """""""""""""""""""""""""""""""""""""""""""""""""""" When running an entrypoint script (Driver), the runtime environment can be specified via `ray.init(runtime_env=...)` or `ray job submit --runtime-env` (see :ref:`Specifying a Runtime Environment Per-Job ` for more details). - If the runtime environment is specified by ``ray job submit --runtime-env=...``, the runtime environments are applied to the entrypoint script (Driver) and all the tasks and actors created from it. - If the runtime environment is specified by ``ray.init(runtime_env=...)``, the runtime environments are applied to all the tasks and actors, but not the entrypoint script (Driver) itself. Since ``ray job submit`` submits a Driver (that calls ``ray.init``), sometimes runtime environments are specified by both of them. When both the Ray Job and Driver specify runtime environments, their runtime environments are merged if there's no conflict. This means the driver script uses the runtime environment specified by `ray job submit`, and all the tasks and actors use the merged runtime environment. Ray raises an exception if the runtime environments conflict. * The ``runtime_env["env_vars"]`` of `ray job submit --runtime-env=...` is merged with the ``runtime_env["env_vars"]`` of `ray.init(runtime_env=...)`. Note that individual ``env_vars`` keys are merged. If the environment variables conflict, Ray raises an exception. * Every other field in the ``runtime_env`` will be merged. If any key conflicts, Ray raises an exception. Example: .. testcode:: # `ray job submit --runtime-env=...` {"pip": ["requests", "chess"], "env_vars": {"A": "a", "B": "b"}} # ray.init(runtime_env=...) {"env_vars": {"C": "c"}} # Driver's actual `runtime_env` (merged with Job's) {"pip": ["requests", "chess"], "env_vars": {"A": "a", "B": "b", "C": "c"}} Conflict Example: ..
testcode:: # Example 1, env_vars conflicts # `ray job submit --runtime-env=...` {"pip": ["requests", "chess"], "env_vars": {"C": "a", "B": "b"}} # ray.init(runtime_env=...) {"env_vars": {"C": "c"}} # Ray raises an exception because the "C" env var conflicts. # Example 2, other field (e.g., pip) conflicts # `ray job submit --runtime-env=...` {"pip": ["requests", "chess"]} # ray.init(runtime_env=...) {"pip": ["torch"]} # Ray raises an exception because "pip" conflicts. You can set an environment variable `RAY_OVERRIDE_JOB_RUNTIME_ENV=1` to avoid raising an exception upon a conflict. In this case, the runtime environments are inherited in the same way as :ref:`Driver and Task and Actor both specify runtime environments `, where ``ray job submit`` is a parent and ``ray.init`` is a child. .. _runtime-environments-inheritance: Inheritance """"""""""" .. _runtime-env-driver-to-task-inheritance: The runtime environment is inheritable, so it applies to all Tasks and Actors within a Job and all child Tasks and Actors of a Task or Actor once set, unless it is overridden. If an Actor or Task specifies a new ``runtime_env``, it overrides the parent’s ``runtime_env`` (i.e., the parent Actor's or Task's ``runtime_env``, or the Job's ``runtime_env`` if the Actor or Task doesn't have a parent) as follows: * The ``runtime_env["env_vars"]`` field will be merged with the ``runtime_env["env_vars"]`` field of the parent. This allows for environment variables set in the parent's runtime environment to be automatically propagated to the child, even if new environment variables are set in the child's runtime environment. * Every other field in the ``runtime_env`` will be *overridden* by the child, not merged. For example, if ``runtime_env["py_modules"]`` is specified, it will replace the ``runtime_env["py_modules"]`` field of the parent. Example: .. testcode:: # Parent's `runtime_env` {"pip": ["requests", "chess"], "env_vars": {"A": "a", "B": "b"}} # Child's specified `runtime_env` {"pip": ["torch", "ray[serve]"], "env_vars": {"B": "new", "C": "c"}} # Child's actual `runtime_env` (merged with parent's) {"pip": ["torch", "ray[serve]"], "env_vars": {"A": "a", "B": "new", "C": "c"}} .. _runtime-env-faq: Frequently Asked Questions ^^^^^^^^^^^^^^^^^^^^^^^^^^ Are environments installed on every node? """"""""""""""""""""""""""""""""""""""""" If a runtime environment is specified in ``ray.init(runtime_env=...)``, then the environment will be installed on every node. See :ref:`Per-Job ` for more details. (Note: by default, the runtime environment will be installed eagerly on every node in the cluster. If you want to lazily install the runtime environment on demand, set the ``eager_install`` option to false: ``ray.init(runtime_env={..., "config": {"eager_install": False}})``.) When is the environment installed? """""""""""""""""""""""""""""""""" When specified per-job, the environment is installed when you call ``ray.init()`` (unless ``"eager_install": False`` is set). When specified per-task or per-actor, the environment is installed when the task is invoked or the actor is instantiated (i.e., when you call ``my_task.remote()`` or ``MyActor.remote()``). See :ref:`Per-Job ` and :ref:`Per-Task/Actor, within a job ` for more details. Where are the environments cached? """""""""""""""""""""""""""""""""" Any local files downloaded by the environments are cached at ``/tmp/ray/session_latest/runtime_resources``. How long does it take to install or to load from cache?
""""""""""""""""""""""""""""""""""""""""""""""""""""""" The install time usually mostly consists of the time it takes to run ``pip install`` or ``conda create`` / ``conda activate``, or to upload/download a ``working_dir``, depending on which ``runtime_env`` options you're using. This could take seconds or minutes. On the other hand, loading a runtime environment from the cache should be nearly as fast as the ordinary Ray worker startup time, which is on the order of a few seconds. A new Ray worker is started for every Ray actor or task that requires a new runtime environment. (Note that loading a cached ``conda`` environment could still be slow, since the ``conda activate`` command sometimes takes a few seconds.) You can set ``setup_timeout_seconds`` config to avoid the installation hanging for a long time. If the installation is not finished within this time, your tasks or actors will fail to start. What is the relationship between runtime environments and Docker? """"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" They can be used independently or together. A container image can be specified in the :ref:`Cluster Launcher ` for large or static dependencies, and runtime environments can be specified per-job or per-task/actor for more dynamic use cases. The runtime environment will inherit packages, files, and environment variables from the container image. My ``runtime_env`` was installed, but when I log into the node I can't import the packages. """"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" The runtime environment is only active for the Ray worker processes; it does not install any packages "globally" on the node. .. _remote-uris: Remote URIs ----------- The ``working_dir`` and ``py_modules`` arguments in the ``runtime_env`` dictionary can specify either local path(s) or remote URI(s). A local path must be a directory path. The directory's contents will be directly accessed as the ``working_dir`` or a ``py_module``. A remote URI must be a link directly to a zip file or a wheel file (only for ``py_module``). **The zip file must contain only a single top-level directory.** The contents of this directory will be directly accessed as the ``working_dir`` or a ``py_module``. For example, suppose you want to use the contents in your local ``/some_path/example_dir`` directory as your ``working_dir``. If you want to specify this directory as a local path, your ``runtime_env`` dictionary should contain: .. testcode:: :skipif: True runtime_env = {..., "working_dir": "/some_path/example_dir", ...} Suppose instead you want to host your files in your ``/some_path/example_dir`` directory remotely and provide a remote URI. You would need to first compress the ``example_dir`` directory into a zip file. There should be no other files or directories at the top level of the zip file, other than ``example_dir``. You can use the following command in the Terminal to do this: .. code-block:: bash cd /some_path zip -r zip_file_name.zip example_dir Note that this command must be run from the *parent directory* of the desired ``working_dir`` to ensure that the resulting zip file contains a single top-level directory. In general, the zip file's name and the top-level directory's name can be anything. The top-level directory's contents will be used as the ``working_dir`` (or ``py_module``). You can check that the zip file contains a single top-level directory by running the following command in the Terminal: .. 
code-block:: bash zipinfo -1 zip_file_name.zip # example_dir/ # example_dir/my_file_1.txt # example_dir/subdir/my_file_2.txt Suppose you upload the compressed ``example_dir`` directory to AWS S3 at the S3 URI ``s3://example_bucket/example.zip``. Your ``runtime_env`` dictionary should contain: .. testcode:: :skipif: True runtime_env = {..., "working_dir": "s3://example_bucket/example.zip", ...} .. warning:: Check for hidden files and metadata directories in zipped dependencies. You can inspect a zip file's contents by running the ``zipinfo -1 zip_file_name.zip`` command in the Terminal. Some zipping methods can cause hidden files or metadata directories to appear in the zip file at the top level. To avoid this, use the ``zip -r`` command directly on the directory you want to compress from its parent's directory. For example, if you have a directory structure such as: ``a/b`` and you want to compress ``b``, issue the ``zip -r b`` command from the directory ``a.`` If Ray detects more than a single directory at the top level, it will use the entire zip file instead of the top-level directory, which may lead to unexpected behavior. Currently, four types of remote URIs are supported for hosting ``working_dir`` and ``py_modules`` packages: - ``HTTPS``: ``HTTPS`` refers to URLs that start with ``https``. These are particularly useful because remote Git providers (e.g. GitHub, Bitbucket, GitLab, etc.) use ``https`` URLs as download links for repository archives. This allows you to host your dependencies on remote Git providers, push updates to them, and specify which dependency versions (i.e. commits) your jobs should use. To use packages via ``HTTPS`` URIs, you must have the ``smart_open`` library (you can install it using ``pip install smart_open``). - Example: - ``runtime_env = {"working_dir": "https://github.com/example_username/example_repository/archive/HEAD.zip"}`` - ``S3``: ``S3`` refers to URIs starting with ``s3://`` that point to compressed packages stored in `AWS S3 `_. To use packages via ``S3`` URIs, you must have the ``smart_open`` and ``boto3`` libraries (you can install them using ``pip install smart_open`` and ``pip install boto3``). Ray does not explicitly pass in any credentials to ``boto3`` for authentication. ``boto3`` will use your environment variables, shared credentials file, and/or AWS config file to authenticate access. See the `AWS boto3 documentation `_ to learn how to configure these. - Example: - ``runtime_env = {"working_dir": "s3://example_bucket/example_file.zip"}`` - ``GS``: ``GS`` refers to URIs starting with ``gs://`` that point to compressed packages stored in `Google Cloud Storage `_. To use packages via ``GS`` URIs, you must have the ``smart_open`` and ``google-cloud-storage`` libraries (you can install them using ``pip install smart_open`` and ``pip install google-cloud-storage``). Ray does not explicitly pass in any credentials to the ``google-cloud-storage``'s ``Client`` object. ``google-cloud-storage`` will use your local service account key(s) and environment variables by default. Follow the steps on Google Cloud Storage's `Getting started with authentication `_ guide to set up your credentials, which allow Ray to access your remote package. - Example: - ``runtime_env = {"working_dir": "gs://example_bucket/example_file.zip"}`` - ``Azure``: ``Azure`` refers to URIs starting with ``azure://`` that point to compressed packages stored in `Azure Blob Storage `_. 
To use packages via ``Azure`` URIs, you must have the ``smart_open``, ``azure-storage-blob``, and ``azure-identity`` libraries (you can install them using ``pip install smart_open[azure] azure-storage-blob azure-identity``). Ray supports two authentication methods for Azure Blob Storage: 1. Connection string: Set the environment variable ``AZURE_STORAGE_CONNECTION_STRING`` with your Azure storage connection string. 2. Managed Identity: Set the environment variable ``AZURE_STORAGE_ACCOUNT`` with your Azure storage account name. This will use Azure's Managed Identity for authentication. - Example: - ``runtime_env = {"working_dir": "azure://container-name/example_file.zip"}`` Note that the ``smart_open``, ``boto3``, ``google-cloud-storage``, ``azure-storage-blob``, and ``azure-identity`` packages are not installed by default, and it is not sufficient to specify them in the ``pip`` section of your ``runtime_env``. The relevant packages must already be installed on all nodes of the cluster when Ray starts. Hosting a Dependency on a Remote Git Provider: Step-by-Step Guide ----------------------------------------------------------------- You can store your dependencies in repositories on a remote Git provider (e.g. GitHub, Bitbucket, GitLab, etc.), and you can periodically push changes to keep them updated. In this section, you will learn how to store a dependency on GitHub and use it in your runtime environment. .. note:: These steps will also be useful if you use another large, remote Git provider (e.g. BitBucket, GitLab, etc.). For simplicity, this section refers to GitHub alone, but you can follow along on your provider. First, create a repository on GitHub to store your ``working_dir`` contents or your ``py_module`` dependency. By default, when you download a zip file of your repository, the zip file will already contain a single top-level directory that holds the repository contents, so you can directly upload your ``working_dir`` contents or your ``py_module`` dependency to the GitHub repository. Once you have uploaded your ``working_dir`` contents or your ``py_module`` dependency, you need the HTTPS URL of the repository zip file, so you can specify it in your ``runtime_env`` dictionary. You have two options to get the HTTPS URL. Option 1: Download Zip (quicker to implement, but not recommended for production environments) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The first option is to use the remote Git provider's "Download Zip" feature, which provides an HTTPS link that zips and downloads your repository. This is quick, but it is **not recommended** because it only allows you to download a zip file of a repository branch's latest commit. To find a GitHub URL, navigate to your repository on `GitHub `_, choose a branch, and click on the green "Code" drop down button: .. figure:: images/ray_repo.png :width: 500px This will drop down a menu that provides three options: "Clone" which provides HTTPS/SSH links to clone the repository, "Open with GitHub Desktop", and "Download ZIP." Right-click on "Download Zip." This will open a pop-up near your cursor. Select "Copy Link Address": .. figure:: images/download_zip_url.png :width: 300px Now your HTTPS link is copied to your clipboard. You can paste it into your ``runtime_env`` dictionary. .. warning:: Using the HTTPS URL from your Git provider's "Download as Zip" feature is not recommended if the URL always points to the latest commit. 
For instance, using this method on GitHub generates a link that always points to the latest commit on the chosen branch. By specifying this link in the ``runtime_env`` dictionary, your Ray Cluster always uses the chosen branch's latest commit. This creates a consistency risk: if you push an update to your remote Git repository while your cluster's nodes are pulling the repository's contents, some nodes may pull the version of your package just before you pushed, and some nodes may pull the version just after. For consistency, it is better to specify a particular commit, so all the nodes use the same package. See "Option 2: Manually Create URL" to create a URL pointing to a specific commit. Option 2: Manually Create URL (slower to implement, but recommended for production environments) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The second option is to manually create this URL by pattern-matching your specific use case with one of the following examples. **This is recommended** because it provides finer-grained control over which repository branch and commit to use when generating your dependency zip file. These options prevent consistency issues on Ray Clusters (see the warning above for more info). To create the URL, pick a URL template below that fits your use case, and fill in all parameters in brackets (e.g. [username], [repository], etc.) with the specific values from your repository. For instance, suppose your GitHub username is ``example_user``, the repository's name is ``example_repository``, and the desired commit hash is ``abcdefg``. If ``example_repository`` is public and you want to retrieve the ``abcdefg`` commit (which matches the first example use case), the URL would be: .. testcode:: runtime_env = {"working_dir": ("https://github.com" "/example_user/example_repository/archive/abcdefg.zip")} Here is a list of different use cases and corresponding URLs: - Example: Retrieve package from a specific commit hash on a public GitHub repository .. testcode:: runtime_env = {"working_dir": ("https://github.com" "/[username]/[repository]/archive/[commit hash].zip")} - Example: Retrieve package from a private GitHub repository using a Personal Access Token **during development**. **For production** see :ref:`this document ` to learn how to authenticate private dependencies safely. .. testcode:: runtime_env = {"working_dir": ("https://[username]:[personal access token]@github.com" "/[username]/[private repository]/archive/[commit hash].zip")} - Example: Retrieve package from a public GitHub repository's latest commit .. testcode:: runtime_env = {"working_dir": ("https://github.com" "/[username]/[repository]/archive/HEAD.zip")} - Example: Retrieve package from a specific commit hash on a public Bitbucket repository .. testcode:: runtime_env = {"working_dir": ("https://bitbucket.org" "/[owner]/[repository]/get/[commit hash].tar.gz")} .. tip:: It is recommended to specify a particular commit instead of always using the latest commit. This prevents consistency issues on a multi-node Ray Cluster. See the warning below "Option 1: Download Zip" for more info. Once you have specified the URL in your ``runtime_env`` dictionary, you can pass the dictionary into a ``ray.init()`` or ``.options()`` call. Congratulations! You have now hosted a ``runtime_env`` dependency remotely on GitHub! 
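To tie the steps together, here is a minimal sketch of that last step, reusing the hypothetical ``example_user``/``example_repository``/``abcdefg`` URL from the examples above (``some_file.txt`` is a hypothetical file committed to that repository). It shows passing the same ``runtime_env`` to ``ray.init()`` for the whole job, with the per-task ``.options()`` alternative noted in a comment:

.. testcode::
    :skipif: True

    import ray

    # Hypothetical commit URL built with the pattern from Option 2 above.
    runtime_env = {
        "working_dir": (
            "https://github.com"
            "/example_user/example_repository/archive/abcdefg.zip"
        )
    }

    # Per-job: every task and actor in the job inherits this environment.
    ray.init(runtime_env=runtime_env)

    @ray.remote
    def read_repo_file():
        # Runs with the repository contents as its working directory.
        return open("some_file.txt").read()

    # Per-task alternative:
    # read_repo_file.options(runtime_env=runtime_env).remote()
    print(ray.get(read_repo_file.remote()))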
Debugging --------- If runtime_env cannot be set up (e.g., network issues, download failures, etc.), Ray will fail to schedule tasks/actors that require the runtime_env. If you call ``ray.get``, it will raise ``RuntimeEnvSetupError`` with the error message in detail. .. testcode:: import ray import time @ray.remote def f(): pass @ray.remote class A: def f(self): pass start = time.time() bad_env = {"conda": {"dependencies": ["this_doesnt_exist"]}} # [Tasks] will raise `RuntimeEnvSetupError`. try: ray.get(f.options(runtime_env=bad_env).remote()) except ray.exceptions.RuntimeEnvSetupError: print("Task fails with RuntimeEnvSetupError") # [Actors] will raise `RuntimeEnvSetupError`. a = A.options(runtime_env=bad_env).remote() try: ray.get(a.f.remote()) except ray.exceptions.RuntimeEnvSetupError: print("Actor fails with RuntimeEnvSetupError") .. testoutput:: Task fails with RuntimeEnvSetupError Actor fails with RuntimeEnvSetupError Full logs can always be found in the file ``runtime_env_setup-[job_id].log`` for per-actor, per-task and per-job environments, or in ``runtime_env_setup-ray_client_server_[port].log`` for per-job environments when using Ray Client. You can also enable ``runtime_env`` debugging log streaming by setting an environment variable ``RAY_RUNTIME_ENV_LOG_TO_DRIVER_ENABLED=1`` on each node before starting Ray, for example using ``setup_commands`` in the Ray Cluster configuration file (:ref:`reference `). This will print the full ``runtime_env`` setup log messages to the driver (the script that calls ``ray.init()``). Example log output: .. testcode:: :hide: ray.shutdown() .. testcode:: ray.init(runtime_env={"pip": ["requests"]}) .. testoutput:: :options: +MOCK (pid=runtime_env) 2022-02-28 14:12:33,653 INFO pip.py:188 -- Creating virtualenv at /tmp/ray/session_2022-02-28_14-12-29_909064_87908/runtime_resources/pip/0cc818a054853c3841171109300436cad4dcf594/virtualenv, current python dir /Users/user/anaconda3/envs/ray-py38 (pid=runtime_env) 2022-02-28 14:12:33,653 INFO utils.py:76 -- Run cmd[1] ['/Users/user/anaconda3/envs/ray-py38/bin/python', '-m', 'virtualenv', '--app-data', '/tmp/ray/session_2022-02-28_14-12-29_909064_87908/runtime_resources/pip/0cc818a054853c3841171109300436cad4dcf594/virtualenv_app_data', '--reset-app-data', '--no-periodic-update', '--system-site-packages', '--no-download', '/tmp/ray/session_2022-02-28_14-12-29_909064_87908/runtime_resources/pip/0cc818a054853c3841171109300436cad4dcf594/virtualenv'] (pid=runtime_env) 2022-02-28 14:12:34,267 INFO utils.py:97 -- Output of cmd[1]: created virtual environment CPython3.8.11.final.0-64 in 473ms (pid=runtime_env) creator CPython3Posix(dest=/private/tmp/ray/session_2022-02-28_14-12-29_909064_87908/runtime_resources/pip/0cc818a054853c3841171109300436cad4dcf594/virtualenv, clear=False, no_vcs_ignore=False, global=True) (pid=runtime_env) seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/private/tmp/ray/session_2022-02-28_14-12-29_909064_87908/runtime_resources/pip/0cc818a054853c3841171109300436cad4dcf594/virtualenv_app_data) (pid=runtime_env) added seed packages: pip==22.0.3, setuptools==60.6.0, wheel==0.37.1 (pid=runtime_env) activators BashActivator,CShellActivator,FishActivator,NushellActivator,PowerShellActivator,PythonActivator (pid=runtime_env) (pid=runtime_env) 2022-02-28 14:12:34,268 INFO utils.py:76 -- Run cmd[2] ['/tmp/ray/session_2022-02-28_14-12-29_909064_87908/runtime_resources/pip/0cc818a054853c3841171109300436cad4dcf594/virtualenv/bin/python', '-c', 
'import ray; print(ray.__version__, ray.__path__[0])'] (pid=runtime_env) 2022-02-28 14:12:35,118 INFO utils.py:97 -- Output of cmd[2]: 3.0.0.dev0 /Users/user/ray/python/ray (pid=runtime_env) (pid=runtime_env) 2022-02-28 14:12:35,120 INFO pip.py:236 -- Installing python requirements to /tmp/ray/session_2022-02-28_14-12-29_909064_87908/runtime_resources/pip/0cc818a054853c3841171109300436cad4dcf594/virtualenv (pid=runtime_env) 2022-02-28 14:12:35,122 INFO utils.py:76 -- Run cmd[3] ['/tmp/ray/session_2022-02-28_14-12-29_909064_87908/runtime_resources/pip/0cc818a054853c3841171109300436cad4dcf594/virtualenv/bin/python', '-m', 'pip', 'install', '--disable-pip-version-check', '--no-cache-dir', '-r', '/tmp/ray/session_2022-02-28_14-12-29_909064_87908/runtime_resources/pip/0cc818a054853c3841171109300436cad4dcf594/requirements.txt'] (pid=runtime_env) 2022-02-28 14:12:38,000 INFO utils.py:97 -- Output of cmd[3]: Requirement already satisfied: requests in /Users/user/anaconda3/envs/ray-py38/lib/python3.8/site-packages (from -r /tmp/ray/session_2022-02-28_14-12-29_909064_87908/runtime_resources/pip/0cc818a054853c3841171109300436cad4dcf594/requirements.txt (line 1)) (2.26.0) (pid=runtime_env) Requirement already satisfied: idna<4,>=2.5 in /Users/user/anaconda3/envs/ray-py38/lib/python3.8/site-packages (from requests->-r /tmp/ray/session_2022-02-28_14-12-29_909064_87908/runtime_resources/pip/0cc818a054853c3841171109300436cad4dcf594/requirements.txt (line 1)) (3.2) (pid=runtime_env) Requirement already satisfied: certifi>=2017.4.17 in /Users/user/anaconda3/envs/ray-py38/lib/python3.8/site-packages (from requests->-r /tmp/ray/session_2022-02-28_14-12-29_909064_87908/runtime_resources/pip/0cc818a054853c3841171109300436cad4dcf594/requirements.txt (line 1)) (2021.10.8) (pid=runtime_env) Requirement already satisfied: urllib3<1.27,>=1.21.1 in /Users/user/anaconda3/envs/ray-py38/lib/python3.8/site-packages (from requests->-r /tmp/ray/session_2022-02-28_14-12-29_909064_87908/runtime_resources/pip/0cc818a054853c3841171109300436cad4dcf594/requirements.txt (line 1)) (1.26.7) (pid=runtime_env) Requirement already satisfied: charset-normalizer~=2.0.0 in /Users/user/anaconda3/envs/ray-py38/lib/python3.8/site-packages (from requests->-r /tmp/ray/session_2022-02-28_14-12-29_909064_87908/runtime_resources/pip/0cc818a054853c3841171109300436cad4dcf594/requirements.txt (line 1)) (2.0.6) (pid=runtime_env) (pid=runtime_env) 2022-02-28 14:12:38,001 INFO utils.py:76 -- Run cmd[4] ['/tmp/ray/session_2022-02-28_14-12-29_909064_87908/runtime_resources/pip/0cc818a054853c3841171109300436cad4dcf594/virtualenv/bin/python', '-c', 'import ray; print(ray.__version__, ray.__path__[0])'] (pid=runtime_env) 2022-02-28 14:12:38,804 INFO utils.py:97 -- Output of cmd[4]: 3.0.0.dev0 /Users/user/ray/python/ray See :ref:`Logging Directory Structure ` for more details. --- .. _autoscaler-v2: Autoscaler v2 ============= This document explains how the open-source autoscaler v2 works in Ray 2.48 and outlines its high-level responsibilities and implementation details. Overview -------- The autoscaler is responsible for resizing the cluster based on resource demand from tasks, actors, and placement groups. To achieve this, it follows a structured process: evaluating worker group configurations, periodically reconciling cluster state with user constraints, applying bin-packing strategies to pending workload demands, and interacting with cloud instance providers through the Instance Manager. The following sections describe these components in detail. 
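The user constraints mentioned above come from the autoscaler SDK. As a brief illustration (a sketch assuming a running cluster; ``request_resources`` only sets a minimum cluster size hint and doesn't launch any work):

.. code-block:: python

    import ray
    from ray.autoscaler.sdk import request_resources

    ray.init()

    # Ask the autoscaler to keep capacity for at least 16 CPUs plus one
    # 4-GPU bundle, regardless of the currently pending tasks and actors.
    request_resources(num_cpus=16, bundles=[{"GPU": 4}])

    # Subsequent calls overwrite earlier ones, so a later call with smaller
    # values relaxes the constraint.
    request_resources(num_cpus=0)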
Worker Group Configurations --------------------------- Worker groups (also referred to as node types) define the sets of nodes that the Ray autoscaler scales. Each worker group represents a logical category of nodes with the same resource configurations, such as CPU, memory, GPU, or custom resources. The autoscaler dynamically adjusts the cluster size by adding or removing nodes within each group as workload demands change. In other words, it scales the cluster by modifying the number of nodes per worker group according to the specified scaling rules and resource requirements. Worker groups can be configured in these ways: - The `available_node_types `__ field in the Cluster YAML file, if you are using the ``ray up`` cluster launcher. - The `workerGroupSpecs `__ field in the RayCluster CRD, if you are using KubeRay. The configuration specifies the logical resources each node has in a worker group, along with the minimum and maximum number of nodes that should exist in each group. .. note:: Although the autoscaler fulfills pending resource demands and releases idle nodes, it doesn't perform the actual scheduling of Ray tasks, actors, or placement groups. Scheduling is handled internally by Ray. The autoscaler periodically runs its own simulation of scheduling decisions over pending demands to determine which nodes to launch or stop. See the next sections for details. Periodic Reconciliation ----------------------- The entry point of the autoscaler is `monitor.py `__, which starts a GCS client and runs the reconciliation loop. This process is launched on the head node by the `start_head_processes `__ function when using the ``ray up`` cluster launcher. When running under KubeRay, it instead runs as a `separate autoscaler container `__ in the Head Pod. .. warning:: With the cluster launcher, if the autoscaler process crashes, there is no autoscaling until the process is restarted. With KubeRay, the default container restart policy means Kubernetes restarts the autoscaler container if it crashes. The process periodically `reconciles `__ against a snapshot of the following information using the Reconciler: 1. **The latest pending demands** (queried from the `get_cluster_resource_state `__ GCS RPC): Pending Ray tasks, actors, and placement groups. 2. **The latest user cluster constraints** (queried from the `get_cluster_resource_state `__ GCS RPC): The minimum cluster size, if specified via a ``ray.autoscaler.sdk.request_resources`` call. 3. **The latest Ray node information** (queried from the `get_cluster_resource_state `__ GCS RPC): The total and currently available resources of each Ray node in the cluster. Also includes each Ray node's status (ALIVE or DEAD) and other information such as idle duration. See the Appendix for more details. 4. **The latest cloud instances** (`queried from the cloud instance provider's implementation `__): The list of instances managed by the cloud instance provider implementation. 5. **The latest worker group configurations** (queried from the cluster YAML file or the RayCluster CRD). The preceding information is retrieved at the beginning of each reconciliation loop. The Reconciler uses this information to construct its internal state and perform "`passive `__" instance lifecycle transitions based on these observations. This is the `sync phase `__. After the sync phase, the Reconciler performs the `following steps `__ in order with the ``ResourceDemandScheduler``: 1. Enforce configuration constraints, including min/max nodes for each worker group. 2.
Enforce user cluster constraints (if specified by `ray.autoscaler.sdk.request_resources `__ invocation). 3. Fit pending demands into available resources on the cluster snapshot. This is the simulation mentioned earlier. 4. Fit any remaining demands (left over from the previous step) against worker group configurations to determine which nodes to launch. 5. Terminate idle instances (nodes that are needed by the previous 1-4 steps aren't considered idle) according to each node's ``idle_duration_ms`` (queried from GCS) and the configured idle timeout for each group. 6. Send accumulated scaling decisions (steps 1–5) to the Instance Manager with `Reconciler._update_instance_manager `__. 7. `Sleep briefly (5s by default) `__, then return to the sync phase. .. warning:: If any error occurs, such as an error from the cloud instance provider or a timeout in the sync phase, the current reconciliation is aborted and the loop jumps to step 7 to wait for the next reconciliation. .. note:: All scaling decisions from steps 1–5 are accumulated purely in memory. No interaction with the cloud instance provider occurs until step 6. Bin Packing and Worker Group Selection -------------------------------------- The autoscaler applies the following scoring logic to evaluate each existing node. It selects the node with the highest score and assigns it a subset of feasible demands. It also applies the same scoring logic to each worker group and selects the one with the highest score to launch new instances. `Scoring `__ is based on a tuple of four values: 1. Whether the node is a GPU node and whether feasible requests require GPUs: - ``0`` if the node is a GPU node and requests do **not** require GPUs. - ``1`` if the node isn't a GPU node or requests do require GPUs. 2. The number of resource types on the node used by feasible requests. 3. The minimum `utilization rate `__ across all resource types used by feasible requests. 4. The average `utilization rate `__ across all resource types used by feasible requests. .. note:: Utilization rate used by feasible requests is calculated as the difference between the total and available resources divided by the total resources. In other words: - The autoscaler avoids launching GPU nodes unless necessary. - It prefers nodes that maximize utilization and minimize unused resources. Example: - Task requires **2 GPUs**. - Two node types are available: - A: [GPU: 6] - B: [GPU: 2, TPU: 1] Node type **A** should be selected, since node B would leave an unused TPU (with a utilization rate of 0% on TPU), making it less favorable with respect to the third scoring criterion. This process repeats until all feasible pending demands are packed or the maximum cluster size is reached. Instance Manager and Cloud Instance Provider -------------------------------------------- `Cloud Instance Provider `__ is an abstract interface that defines the operations for managing instances in the cloud. `Instance Manager `__ is the component that tracks instance lifecycle and drives event subscribers that call the cloud instance provider. As described in the previous section, the autoscaler accumulates scaling decisions (steps 1–5) in memory and reconciles them with the cloud instance provider through the Instance Manager. Scaling decisions are represented as a list of `InstanceUpdateEvent `__ records. For example: - **For launching new instances**: - ``instance_id``: A randomly generated ID for Instance Manager tracking. - ``instance_type``: The type of instance to launch. 
- ``new_instance_status``: ``QUEUED``. - **For terminating instances**: - ``instance_id``: The ID of the instance to stop. - ``new_instance_status``: ``TERMINATING`` or ``RAY_STOP_REQUESTED``. These update events are passed to the Instance Manager, which transitions instance statuses. A normal transition flow for an instance is: - ``(non-existent) -> QUEUED``: The Reconciler creates an instance with the ``QUEUED`` ``InstanceUpdateEvent`` when it decides to launch a new instance. - ``QUEUED -> REQUESTED``: The Reconciler considers ``max_concurrent_launches`` and ``upscaling_speed`` when selecting an instance from the queue to transition to ``REQUESTED`` during each reconciliation iteration. - ``REQUESTED -> ALLOCATED``: Once the Reconciler detects the instance is allocated from the cloud instance provider, it will transition the instance to ``ALLOCATED``. - ``ALLOCATED -> RAY_INSTALLING``: If the cloud instance provider is not ``KubeRayProvider``, the Reconciler will transition the instance to ``RAY_INSTALLING`` when the instance is allocated. - ``RAY_INSTALLING -> RAY_RUNNING``: Once the Reconciler detects from GCS that Ray has started on the instance, it will transition the instance to ``RAY_RUNNING``. - ``RAY_RUNNING -> RAY_STOP_REQUESTED``: If the instance is idle for longer than the configured timeout, the Reconciler will transition the instance to ``RAY_STOP_REQUESTED`` to start draining the Ray process. - ``RAY_STOP_REQUESTED -> RAY_STOPPING``: Once the Reconciler detects from GCS that the Ray process is draining, it will transition the instance to ``RAY_STOPPING``. - ``RAY_STOPPING -> RAY_STOPPED``: Once the Reconciler detects from GCS that the Ray process has stopped, it will transition the instance to ``RAY_STOPPED``. - ``RAY_STOPPED -> TERMINATING``: The Reconciler will transition the instance from ``RAY_STOPPED`` to ``TERMINATING``. - ``TERMINATING -> TERMINATED``: Once the Reconciler detects that the instance has been terminated by the cloud instance provider, it will transition the instance to ``TERMINATED``. .. note:: The drain request sent by ``RAY_STOP_REQUESTED`` can be rejected if the node is no longer idle when the drain request arrives the node. Then the instance will be transitioned back to ``RAY_RUNNING`` instead. You can find all valid transitions in the `get_valid_transitions `__ method. Once transitions are triggered by the Reconciler, subscribers perform side effects, such as: - ``QUEUED -> REQUESTED``: CloudInstanceUpdater launches the instance through the Cloud Instance Provider. - ``ALLOCATED -> RAY_INSTALLING``: ThreadedRayInstaller installs the Ray process. - ``RAY_RUNNING -> RAY_STOP_REQUESTED``: RayStopper stops the Ray process on the instance. - ``RAY_STOPPED -> TERMINATING``: CloudInstanceUpdater terminates the instance through the Cloud Instance Provider. .. note:: These transitions trigger side effects, but side effects don't trigger new transitions directly. Instead, their results are observed from external state during the sync phase; subsequent transitions are triggered based on those observations. .. note:: Cloud instance provider implementations in autoscaler v2 must implement: - **Listing instances**: Return the set of instances currently managed by the provider. - **Launching instances**: Create new instances given the requested instance type and tags. - **Terminating instances**: Safely remove instances identified by their IDs. ``KubeRayProvider`` is one such cloud instance provider implementation. 
``NodeProviderAdapter`` is an adapter that can wrap a v1 node provider (such as ``AWSNodeProvider``) to act as a cloud instance provider. Appendix -------- How ``get_cluster_resource_state`` Aggregates Cluster State ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The autoscaler retrieves a cluster snapshot through the ``get_cluster_resource_state`` RPC served by GCS (`HandleGetClusterResourceState `__) which builds the reply in `MakeClusterResourceStateInternal `__. Internally, GCS assembles the reply by combining per-node resource reports, pending workload demand, and any user-requested cluster constraints into a single ``ClusterResourceState`` message. - Data sources and ownership: - `GcsAutoscalerStateManager `__ maintains a per-node cache of ``ResourcesData`` that includes totals, availables, and load-by-shape. GCS periodically polls each alive raylet (``GetResourceLoad``) and updates this cache (`GcsServer::InitGcsResourceManager `__, `UpdateResourceLoadAndUsage `__), then uses it to construct snapshots. - `GcsNodeInfo `__ provides static and slowly changing node metadata (node ID, instance ID, node type name, IP, labels, instance type) and dead/alive status. - Placement group demand comes from the `placement group manager `__. - User cluster constraints come from autoscaler SDK requests that GCS records. - Fields assembled in the reply: - ``node_states``: For each node, GCS sets identity and metadata from `GcsNodeInfo `__ and pulls resources and status from the cached ``ResourcesData`` (`GetNodeStates `__). Dead nodes are marked ``DEAD`` and omit resource details. For alive nodes, GCS also includes ``idle_duration_ms`` and any node activity strings. - ``pending_resource_requests``: Computed by aggregating per-node load-by-shape across the cluster (`GetPendingResourceRequests `__). For each resource shape, the count is the sum of infeasible, backlog, and ready requests that haven't been scheduled yet. - ``pending_gang_resource_requests``: Pending or rescheduling placement groups represented as gang requests (`GetPendingGangResourceRequests `__). - ``cluster_resource_constraints``: The set of minimal cluster resource constraints previously requested via ``ray.autoscaler.sdk.request_resources`` (`GetClusterResourceConstraints `__). --- .. _metric-exporter: Metric Exporter Infrastructure ================================ This document is based on Ray version 2.52.1. Ray's metric exporting infrastructure collects metrics from C++ components (raylet, GCS, workers) and Python components, aggregates them, and exports them to Prometheus. This document explains how metrics flow through the system from registration to final export. Architecture Overview --------------------- Ray's metric system uses a multi-stage pipeline: 1. **C++ Components**: Raylet, GCS, and worker processes record metrics using the OpenTelemetry SDK 2. **OTLP Export**: Metrics are exported via OpenTelemetry Protocol (OTLP) over gRPC to the metrics agent 3. **Metrics Agent**: The Python metrics agent (ReporterAgent) receives and processes metrics 4. **Aggregation**: High-cardinality labels are filtered and values are aggregated 5. **Prometheus Export**: Final metrics are exported in Prometheus format The following diagram shows the high-level flow: .. 
code-block:: text C++ Components (raylet, GCS, workers) ↓ (Record metrics via Metric::Record) OpenTelemetryMetricRecorder (C++) ↓ (OTLP gRPC export) Metrics Agent (Python - ReporterAgent) ↓ (Aggregate & process) OpenTelemetryMetricRecorder (Python) ↓ (Prometheus format) Prometheus Server Metric Registration and Recording (C++ Side) --------------------------------------------- Ray's C++ components register and record metrics through the `OpenTelemetryMetricRecorder `__ singleton. The recorder supports four metric types: Gauge, Counter, Sum, and Histogram. Metric Types ~~~~~~~~~~~~ - **Gauge**: Represents a current value that can go up or down (e.g., number of running tasks) - **Counter**: A cumulative metric that only increases (e.g., total tasks submitted) - **Sum (UpDownCounter)**: A cumulative metric that can increase or decrease (e.g., number of objects in object store) - **Histogram**: Tracks the distribution of values over time (e.g., task execution time) Registration Process ~~~~~~~~~~~~~~~~~~~~ Metrics are registered lazily on first use. The `OpenTelemetryMetricRecorder` uses a singleton pattern accessible via `GetInstance() `__. When a metric is first recorded, it's automatically registered if it hasn't been registered already. Registration methods (defined in `open_telemetry_metric_recorder.cc `__): - `RegisterGaugeMetric() `__: Registers an observable gauge with a callback - `RegisterCounterMetric() `__: Registers a synchronous counter - `RegisterSumMetric() `__: Registers a synchronous up-down counter - `RegisterHistogramMetric() `__: Registers a histogram with explicit bucket boundaries Recording Mechanisms ~~~~~~~~~~~~~~~~~~~~ Ray uses two different recording mechanisms depending on the metric type: **Observable Metrics (Gauges)** Observable gauges store values in an intermediate map (`observations_by_name_`) until collection time. When you call `SetMetricValue() `__ for a gauge, the value is stored with its tags. During export, a callback function (`_DoubleGaugeCallback `__) is invoked by the OpenTelemetry SDK, which collects all stored values and clears the map to prevent stale data. The callback implementation is in `CollectGaugeMetricValues() `__. **Synchronous Metrics (Counters, Sums, Histograms)** Synchronous metrics record values directly to their instruments without intermediate storage. When you call `SetMetricValue() `__ for these types, the value is immediately added to the counter or recorded in the histogram via `SetSynchronousMetricValue() `__. Key Implementation Details ~~~~~~~~~~~~~~~~~~~~~~~~~~~ - **Thread Safety**: The recorder uses a mutex (`mutex_`) to protect the observations map and registered instruments - **Lock Ordering**: Callbacks are registered after releasing the mutex to prevent deadlocks between the recorder's mutex and OpenTelemetry SDK's internal locks (see `RegisterGaugeMetric() `__ for details) - **Lazy Registration**: Metrics can be registered multiple times safely; the recorder checks if a metric is already registered before creating a new instrument C++ components record metrics through the `Metric::Record() `__ method, which forwards to `OpenTelemetryMetricRecorder::SetMetricValue() `__. Metric Export from C++ (OTLP gRPC) ----------------------------------- C++ components export metrics to the metrics agent using the OpenTelemetry Protocol (OTLP) over gRPC. The export process is configured when the recorder is started. 
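The subsections below describe how this is wired up in Ray. As a rough analogue only (this sketch uses the OpenTelemetry *Python* SDK rather than Ray's actual C++ code, and assumes an OTLP gRPC endpoint at ``127.0.0.1:4317``), the same MeterProvider / PeriodicExportingMetricReader / OTLP-exporter pattern looks like this:

.. code-block:: python

    from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
    from opentelemetry.sdk.metrics import MeterProvider
    from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader

    # Exporter that ships metrics over OTLP gRPC to a collector/agent endpoint.
    exporter = OTLPMetricExporter(endpoint="127.0.0.1:4317", insecure=True)

    # Reader that collects and exports metrics on a fixed interval.
    reader = PeriodicExportingMetricReader(exporter, export_interval_millis=10_000)

    # The MeterProvider owns meters and drives the periodic reader.
    provider = MeterProvider(metric_readers=[reader])
    meter = provider.get_meter("example")

    counter = meter.create_counter("example_tasks_submitted_total")
    counter.add(1, {"Component": "example"})

Ray additionally configures the exporter for delta aggregation temporality, as described below.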
OpenTelemetry SDK Integration ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The `OpenTelemetryMetricRecorder` initializes the OpenTelemetry SDK in its `constructor `__ and `Start() `__ method with: - **MeterProvider**: Manages meter instances and metric readers - **PeriodicExportingMetricReader**: Collects metrics at regular intervals and exports them - **OTLP gRPC Exporter**: Sends metrics to the metrics agent endpoint Export Configuration ~~~~~~~~~~~~~~~~~~~~~ When `Start() `__ is called, the recorder configures: - **Endpoint**: The metrics agent's gRPC address (typically `127.0.0.1:port`) - **Export Interval**: How often metrics are collected and exported (configurable) - **Export Timeout**: Maximum time to wait for export completion - **Aggregation Temporality**: Set to delta mode to prevent double-counting (see `exporter_options.aggregation_temporality `__) Delta Aggregation Temporality ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Ray uses delta aggregation temporality, which means only the changes since the last export are sent. This is important because the metrics agent accumulates metrics, and re-accumulating them during export would lead to double-counting. Export Process ~~~~~~~~~~~~~~ During each export interval: 1. **Observable Gauges**: The OpenTelemetry SDK invokes registered callbacks, which collect values from `observations_by_name_` and clear the map 2. **Synchronous Metrics**: Values are read directly from the instruments 3. **OTLP Format**: Metrics are converted to OTLP format 4. **gRPC Export**: Metrics are sent to the metrics agent via gRPC Metric Reception and Processing (Python Side) ---------------------------------------------- The metrics agent (ReporterAgent) receives metrics from C++ components via a gRPC service that implements the OpenTelemetry Metrics Service interface. gRPC Service Implementation ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The `ReporterAgent `__ class implements `MetricsServiceServicer`, which provides the `Export() `__ method. This method receives `ExportMetricsServiceRequest` messages containing OTLP-formatted metrics from C++ components. Metric Processing ~~~~~~~~~~~~~~~~~ When metrics are received, the `Export()` method processes them in the following structure: - **Resource Metrics**: Top-level container for metrics from a specific resource (e.g., a raylet process) - **Scope Metrics**: Groups metrics by instrumentation scope - **Metrics**: Individual metric data points The method routes metrics to appropriate handlers based on their type: - **Histogram Metrics**: Processed by `_export_histogram_data() `__ - **Number Metrics** (Gauge, Counter, Sum): Processed by `_export_number_data() `__ Conversion to Internal Format ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The metrics agent converts OTLP format to Ray's internal metric representation and forwards them to the Python `OpenTelemetryMetricRecorder `__ for further processing and aggregation. Metric Aggregation and Cardinality Reduction (Python) ------------------------------------------------------ The Python `OpenTelemetryMetricRecorder `__ handles final aggregation and cardinality reduction before exporting to Prometheus. This step is crucial for managing metric cardinality and preventing metric explosion. OpenTelemetryMetricRecorder (Python) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The Python recorder (defined in `open_telemetry_metric_recorder.py `__) has a similar structure to the C++ version but uses the Prometheus exporter instead of OTLP. 
It maintains: - **Registered Instruments**: Maps metric names to OpenTelemetry instruments - **Observations Map**: Stores gauge values with their tag sets until collection - **Histogram Bucket Midpoints**: Pre-calculated midpoints for histogram bucket conversion High-cardinality labels can cause metric explosion, making metrics systems unusable. Ray implements cardinality reduction through label filtering and value aggregation. **Label Filtering** The system identifies high-cardinality labels based on the `RAY_metric_cardinality_level` environment variable. The logic is implemented in `MetricCardinality.get_high_cardinality_labels_to_drop() `__: - **`legacy`**: All labels are preserved (default behavior before Ray 2.53) - **`recommended`**: The `WorkerId` label is dropped (default since Ray 2.53) - **`low`**: Both `WorkerId` and `Name` labels are dropped for tasks and actors **Aggregation Process** For observable gauges, the aggregation happens in the `callback function `__ within `register_gauge_metric() `__: - **Collection**: All observations for a metric are collected from `_observations_by_name `__ - **Label Filtering**: High-cardinality labels are identified and removed from tag sets using `MetricCardinality.get_high_cardinality_labels_to_drop() `__ - **Grouping**: Observations with the same filtered tag set are grouped together - **Aggregation**: An aggregation function is applied to each group via `MetricCardinality.get_aggregation_function() `__ (sum for tasks/actors, first value for others) - **Export**: Aggregated observations are returned to OpenTelemetry for Prometheus export This process ensures that metrics remain manageable even when there are thousands of workers or unique task names. --- .. _ray-event-exporter: Ray Event Exporter Infrastructure ================================== This document is based on Ray version 2.52.1. Ray's event exporting infrastructure collects events from C++ components (GCS, workers) and Python components, buffers and merges them, and exports them to external HTTP services. This document explains how events flow through the system from creation to final export. Architecture Overview --------------------- Ray's event system uses a multi-stage pipeline: 1. **C++ Components**: GCS and worker processes create events implementing `RayEventInterface `__. Raylet does not emit any Ray events, but there are no technical limitations preventing it from doing so. 2. **Event Buffering**: Events are buffered in a bounded circular buffer 3. **Event Merging**: Events with the same entity ID and type are merged before export 4. **gRPC Export**: Events are exported via gRPC to the aggregator agent 5. **Python Aggregation**: The `AggregatorAgent `__ receives and buffers events 6. **HTTP Publishing**: Events are filtered, converted to JSON, and published to external HTTP services The following diagram shows the high-level flow: .. code-block:: text C++ Components (GCS, workers) ↓ (Create events via RayEventInterface) RayEventRecorder (C++) ↓ (Buffer & merge events) ↓ (gRPC export via EventAggregatorClient) AggregatorAgent (Python) ↓ (Add to MultiConsumerEventBuffer) RayEventPublisher ↓ (Filter & convert to JSON) ↓ (HTTP POST) External HTTP Service Event Types and Structure ------------------------- Ray events are structured using protobuf messages with a base `RayEvent` message that contains event-specific nested messages. 
Event Types ~~~~~~~~~~~ Events are categorized by type, defined in the `EventType` enum in `events_base_event.proto `__: - **TASK_DEFINITION_EVENT**: Task definition information - **TASK_LIFECYCLE_EVENT**: Task state transitions (this covers both normal tasks and actor tasks) - **ACTOR_TASK_DEFINITION_EVENT**: Actor task definition - **ACTOR_DEFINITION_EVENT**: Actor definition - **ACTOR_LIFECYCLE_EVENT**: Actor state transitions - **DRIVER_JOB_DEFINITION_EVENT**: Driver job definition - **DRIVER_JOB_LIFECYCLE_EVENT**: Driver job state transitions - **NODE_DEFINITION_EVENT**: Node definition - **NODE_LIFECYCLE_EVENT**: Node state transitions - **TASK_PROFILE_EVENT**: Task profiling data Event Structure ~~~~~~~~~~~~~~~ The base `RayEvent `__ message contains: - **event_id**: Unique identifier for the event - **source_type**: Component that generated the event - **event_type**: Type of event (from EventType enum) - **timestamp**: When the event was created - **severity**: Event severity level (TRACE, DEBUG, INFO, WARNING, ERROR, FATAL) - **message**: Optional string message - **session_name**: Ray session identifier - **Nested event messages**: One of the event-specific messages (e.g., `task_definition_event`, `actor_lifecycle_event`) Entity ID Concept ~~~~~~~~~~~~~~~~~ The entity ID is a unique identifier for the entity associated with an event. It's used for two purposes: 1. **Association**: Links execution events with definition events (e.g., task lifecycle events with task definition events) 2. **Merging**: Groups events with the same entity ID and type for merging before export For example: - Task events use `task_id + task_attempt` as the entity ID - Actor events use `actor_id` as the entity ID - Driver job events use `job_id` as the entity ID Event Recording and Buffering (C++ Side) ----------------------------------------- C++ components record events through the `RayEventRecorder `__ class, which provides thread-safe event buffering and export. RayEventRecorder ~~~~~~~~~~~~~~~~ The `RayEventRecorder` is a thread-safe event recorder that: - Maintains a bounded circular buffer for events - Merges events with the same entity ID and type before export - Periodically exports events via gRPC to the aggregator agent using `EventAggregatorClient `__ - Tracks dropped events when the buffer is full Adding Events ~~~~~~~~~~~~~ Events are added to the recorder via the `AddEvents() `__ method, which accepts a vector of `RayEventInterface` pointers. The method: 1. Checks if event recording is enabled (via `enable_ray_event` config) 2. Calculates if adding events would exceed the buffer size 3. Drops old events if necessary and records metrics for dropped events 4. Adds new events to the circular buffer Buffer Management ~~~~~~~~~~~~~~~~~ The recorder uses a `boost::circular_buffer `__ to store events. When the buffer is full: - Oldest events are dropped to make room for new ones - Dropped events are tracked via the `dropped_events_counter` metric - The metric includes the source component name for tracking - The default buffer size is 10,000 events, but it can be configured via the `RAY_ray_event_recorder_max_queued_events` environment variable Event Export from C++ (gRPC) ------------------------------ Events are exported from C++ components to the aggregator agent using gRPC. The export process is initiated by calling `StartExportingEvents() `__. StartExportingEvents ~~~~~~~~~~~~~~~~~~~~ This method: 1. Checks if event recording is enabled 2. 
Verifies it hasn't been called before (should only be called once) 3. Sets up a `PeriodicalRunner` to periodically call `ExportEvents()` 4. Uses the configured export interval (`ray_events_report_interval_ms`) ExportEvents Process ~~~~~~~~~~~~~~~~~~~~ The `ExportEvents() `__ method performs the following steps: 1. **Check Buffer**: Returns early if the buffer is empty 2. **Group Events**: Groups events by entity ID and type using a hash map 3. **Merge Events**: Events with the same key are merged using the `Merge() `__ method 4. **Serialize**: Each merged event is serialized to a `RayEvent` protobuf via `Serialize() `__ 5. **Send via gRPC**: Events are sent to the aggregator agent via `EventAggregatorClient::AddEvents() `__ 6. **Clear Buffer**: The buffer is cleared after successful export Event Merging Logic ~~~~~~~~~~~~~~~~~~~ Event merging is an optimization that reduces data size by combining related events. Events with the same entity ID and type are merged: - **Definition Events**: Typically don't change when merged (e.g., actor definition) - **Lifecycle Events**: State transitions are appended to form a time series (e.g., task state transitions: started → running → completed) The merging maintains the order of events while combining them into a single event with all state transitions. Error Handling ~~~~~~~~~~~~~~ If the gRPC export fails: - An error is logged - The process continues (doesn't crash) - The next export interval will attempt to send events again - Events remain in the buffer until successfully exported (or the buffer is full and old events are dropped) Event Reception and Buffering (Python Side) --------------------------------------------- The `AggregatorAgent `__ receives events from C++ components via a gRPC service and buffers them for publishing. AggregatorAgent ~~~~~~~~~~~~~~~ The `AggregatorAgent` is a dashboard agent module that: - Implements `EventAggregatorServiceServicer` for gRPC event reception - Maintains a `MultiConsumerEventBuffer` for event storage - Manages `RayEventPublisher` instances for publishing to external http endpoints - Tracks metrics for events received, buffer and publisher operations AddEvents gRPC Handler ~~~~~~~~~~~~~~~~~~~~~~~ The `AddEvents() `__ method is the gRPC handler that receives events: 1. Checks if event processing is enabled 2. Iterates through events in the request 3. Records metrics for each received event 4. Adds each event to the `MultiConsumerEventBuffer` via `add_event() `__ 5. Handles errors if adding events fails MultiConsumerEventBuffer ~~~~~~~~~~~~~~~~~~~~~~~~~ The `MultiConsumerEventBuffer `__ is an asyncio-friendly buffer that: - **Supports Multiple Consumers**: Each consumer has an independent cursor index. RayEventPublisher and other consumers share this same buffer. - **Tracks Evictions**: When the buffer is full, oldest events are dropped and tracked per consumer - **Bounded Buffer**: Uses `deque` with `maxlen` to limit buffer size - **Asyncio-Safe**: Uses `asyncio.Lock` and `asyncio.Condition` for synchronization Key operations: - **add_event()**: Adds an event to the buffer, dropping oldest if full - **wait_for_batch()**: Waits for a batch of events up to `max_batch_size`, with timeout. The timeout only applies when there is at least one event in the buffer. If the buffer is empty, `wait_for_batch()` can block indefinitely. 
- **register_consumer()**: Registers a new consumer with a unique name Event Filtering ~~~~~~~~~~~~~~~ The agent checks if events can be exposed to external services via `_can_expose_event() `__. Only events whose type is in the `EXPOSABLE_EVENT_TYPES` set are allowed to be published externally. Event Publishing to HTTP ------------------------ Events are published to external HTTP services by the `RayEventPublisher`, which reads from the event buffer and sends HTTP POST requests. RayEventPublisher ~~~~~~~~~~~~~~~~~ The `RayEventPublisher `__ runs a worker loop that: 1. Registers as a consumer of the `MultiConsumerEventBuffer` 2. Continuously waits for batches of events via `wait_for_batch()` 3. Publishes batches using the configured `PublisherClientInterface` 4. Handles retries with exponential backoff on failures 5. Records metrics for publish success, failures, and latency The publisher runs in an async context and uses `asyncio` for non-blocking operations. AsyncHttpPublisherClient ~~~~~~~~~~~~~~~~~~~~~~~~~ The `AsyncHttpPublisherClient `__ handles HTTP publishing: 1. **Event Filtering**: Filters events using `events_filter_fn` (typically `_can_expose_event`) 2. **JSON Conversion**: Converts protobuf events to JSON dictionaries - Uses `message_to_json()` from protobuf - Optionally preserves proto field names or converts to camelCase - Runs in `ThreadPoolExecutor` to avoid blocking the event loop 3. **HTTP POST**: Sends filtered events as JSON to the configured endpoint 4. **Error Handling**: Catches exceptions and returns failure status 5. **Session Management**: Uses `aiohttp.ClientSession` for HTTP requests Batch Publishing ~~~~~~~~~~~~~~~~ Events are published in batches: - Batch size is limited by `max_batch_size` (default: 10,000 events) - Batches are created by `wait_for_batch()` which waits up to a timeout for events - Larger batches reduce HTTP request overhead but increase latency Retry Logic ~~~~~~~~~~~ The publisher implements retry logic with exponential backoff: - Retries failed publishes up to `max_retries` times (default: infinite) - Uses exponential backoff with jitter between retries - If max retries are exhausted, we drop the events and record a metric for dropped events Configuration ~~~~~~~~~~~~~ HTTP publishing is configured via environment variables: - **RAY_DASHBOARD_AGGREGATOR_AGENT_EVENTS_EXPORT_ADDR**: HTTP endpoint URL (e.g., `http://localhost:8080/events`) - **RAY_DASHBOARD_AGGREGATOR_AGENT_EXPOSABLE_EVENT_TYPES**: Comma-separated list of event types to expose - **RAY_DASHBOARD_AGGREGATOR_AGENT_PUBLISH_EVENTS_TO_EXTERNAL_HTTP_SERVICE**: Enable/disable flag (default: True) Creating New Event Types ------------------------- To create a new event type, follow these steps: Step 1: Define Protobuf Message ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Create a new `.proto` file in `src/ray/protobuf/public/` following the naming convention `events__event.proto`. For example, see `events_task_definition_event.proto `__. Define your event-specific message with the fields you need: .. code-block:: protobuf syntax = "proto3"; package ray.rpc.events; message MyNewEvent { // Define your event-specific fields here string entity_id = 1; // ... other fields } Step 2: Add to Base Event ~~~~~~~~~~~~~~~~~~~~~~~~~~ Update `events_base_event.proto `__: 1. Add import for your new proto file 2. Add new `EventType` enum value (e.g., `MY_NEW_EVENT = 11`) 3. 
Add new field to `RayEvent` message (e.g., `MyNewEvent my_new_event = 18`) Step 3: Implement RayEventInterface ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Create a C++ class that implements `RayEventInterface `__. The easiest approach is to extend `RayEvent` template class, as shown in `ray_actor_definition_event.h `__. You need to implement: - **GetEntityId()**: Return a unique identifier for the entity (e.g., task ID + attempt, actor ID) - **MergeData()**: Implement merging logic for events with the same entity ID - Definition events typically don't change when merged - Lifecycle events append state transitions - **SerializeData()**: Convert the event data to a `RayEvent` protobuf - **GetEventType()**: Return the `EventType` enum value for this event See `ray_actor_definition_event.cc `__ for a complete example. Step 4: Update Exposable Event Types (if needed) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ If your event should be exposed to external HTTP services, add it to `DEFAULT_EXPOSABLE_EVENT_TYPES `__ in `aggregator_agent.py`. Alternatively, users can configure it via the `RAY_DASHBOARD_AGGREGATOR_AGENT_EXPOSABLE_EVENT_TYPES` environment variable. Step 5: Update RayEventRecorder to publish your new event type ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Use `RayEventRecorder::AddEvent() `__ to add your new event type to the buffer. Step 6: Update AggregatorAgent to publish your new event type ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Update `AggregatorAgent `__ to publish your new event type. --- .. _rpc-fault-tolerance: RPC Fault Tolerance =================== All RPCs added to Ray Core should be fault tolerant and use the retryable gRPC client. Ideally, they should be idempotent, or at the very least, the lack of idempotency should be documented and the client must be able to take retries into account. If you aren't familiar with what idempotency is, consider a function that writes "hello" to a file. On retry, it writes "hello" again, resulting in "hellohello". This isn't idempotent. To make it idempotent, you could check the file contents before writing "hello" again, ensuring that the observable state after multiple identical function calls is the same as after a single call. This guide walks you through a case study of a RPC that wasn't fault tolerant or idempotent, and how it was fixed. By the end of this guide, you should understand what to look for when adding new RPCs and which testing methods to use to verify fault tolerance. Case study: RequestWorkerLease ------------------------------- Problem ~~~~~~~ Prior to the fix described here, ``RequestWorkerLease`` could not be made retryable because its handler in the Raylet was not idempotent. This was because once leases were granted, they were considered occupied until ``ReturnWorker`` was called. Until this RPC was called, the worker and its resources were never returned to the pool of available workers and resources. The raylet assumed that the original RPC and its retry were both fresh lease requests and couldn't deduplicate them. For example, consider the following sequence of operations: 1. Request a new worker lease (Owner → Raylet) through ``RequestWorkerLease``. 2. Response is lost (Raylet → Owner). 3. Retry ``RequestWorkerLease`` for lease (Owner → Raylet). 4. Two sets of resources and workers are now granted, one for the original AND retry. 
On the retry, the raylet should detect that the lease request is a retry and forward the already leased worker address to the owner so a second lease isn't granted. Solution ~~~~~~~~ To implement idempotency, a unique identifier called ``LeaseID`` was added in `PR #55469 `_, which allowed for the deduplication of incoming lease requests. Once leases are granted, they're tracked in a ``leased_workers`` map which maps lease IDs to workers. If the new lease request is already present in the ``leased_workers`` map, the system knows this lease request is a retry and responds with the already leased worker address. Hidden problem: long-polling RPCs ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Network transient errors can happen at any time. For most RPCs, they finish in one I/O context execution, so guarding against whether the request or response failed is sufficient. However, there are a few RPCs that are long-polling, meaning that once the ``HandleX`` function executes, it won't immediately respond to the client but will rather depend on some state change in the future to trigger the response back to the client. This was the case for ``RequestWorkerLease``. Leases aren't granted until all the args are pulled, so the system can't respond to the client until the pulling process is over. What happens if the client disconnects while the server logic executes on the raylet side and sends a retry? The ``leased_workers`` map only tracks leases that are granted, not in the process of granting. This caused a `RAY_CHECK to be triggered `_ in the ``lease_dependency_manager`` because the system wasn't able to deduplicate lease request retries while the server logic was executing. Specifically, consider this sequence of operations: 1. Request a new worker lease (Owner → Raylet) through ``RequestWorkerLease``. 2. The raylet is pulling the lease args asynchronously for the lease. 3. Retry ``RequestWorkerLease`` for the lease (Owner → Raylet). 4. The lease hasn't been granted yet, so it passes the idempotency check and the raylet fails to deduplicate the lease request. 5. ``RAY_CHECK`` hit since the raylet tries to pull args for the same lease again. The final fix was to take into account that the server logic could still be executing and track the lease as it goes through the phases of lease granting. At any phase, the system should be able to deduplicate requests. For any long-polling RPC, you should be **particularly careful** about idempotency because the client's retry won't necessarily wait for the response to be sent. Retryable gRPC client --------------------- The retryable gRPC client was updated during the RPC fault tolerance project. This section describes how it works and some gotchas to watch out for. For a basic introduction, read the `retryable_grpc_client.h comment `_. How it works ~~~~~~~~~~~~ The retryable gRPC client works as follows: - RPCs are sent using the retryable gRPC client. - If the client encounters a `gRPC transient network error `_, it pushes the callback into a queue. - Several checks are done on a periodic basis: - **Cheap gRPC channel state check**: This checks the state of the `gRPC channel `_ to see whether the system can start sending messages again. This check happens every second by default, but is configurable through `check_channel_status_interval_milliseconds `_. - **Potentially expensive GCS node status check**: If the exponential backoff period has passed and the channel is still down, the system calls `server_unavailable_timeout_callback_ `_. 
This callback is set in the client pool classes (`raylet_client_pool `_, `core_worker_client_pool `_). It checks if the client is subscribed for node status updates, and then checks the local subscriber cache to see whether a node death notification from the GCS has been received. If the client isn't subscribed or if there's no status for the node in the cache, it makes a RPC to the GCS. Note that for the GCS client, the ``server_unavailable_timeout_callback_`` `kills the process once called `_. This happens after ``gcs_rpc_server_reconnect_timeout_s`` seconds (60 by default). - **Per-RPC timeout check**: There's a `timeout check `_ that's customizable per RPC, but it's functionally disabled because it's `always set to -1 `_ (infinity) for each RPC. - With each additional failed RPC, the `exponential backoff period is increased `_, agnostic of the type of RPC that fails. The backoff period caps out to a max that you can customize for the core worker and raylet clients using either the ``core_worker_rpc_server_reconnect_timeout_max_s`` or ``raylet_rpc_server_reconnect_timeout_max_s`` config options. The GCS client doesn't have a max backoff period as noted above. - Once the channel check succeeds, the `exponential backoff period is reset and all RPCs in the queue are retried `_. - If the system successfully receives a node death notification (either through subscription or querying the GCS directly), it destroys the RPC client, which posts each callback to the I/O context with a `gRPC Disconnected error `_. Important considerations ~~~~~~~~~~~~~~~~~~~~~~~~ A few important points to keep in mind: - **Per-client queuing**: Each retryable gRPC client is unique to the client (``WorkerID`` for core worker clients, ``NodeID`` for raylet clients), not to the type of RPC. If you first submit RPC A that fails due to a transient network error, then RPC B to the same client that fails due to a transient network error, the queue will have two items: RPC A then RPC B. There isn't a separate queue on an RPC basis, but on a client basis. - **Client-level timeouts**: Each timeout needs to wait for the previous timeout to complete. If both RPC A and RPC B are submitted in short succession, then RPC A will wait in total for 1 second, and RPC B will wait in total for 1 + 2 = 3 seconds. Different RPCs don't matter and are treated the same. The reasoning is that transient network errors aren't RPC specific. If RPC A sees a network failure, you can assume that RPC B, if sent to the same client, will experience the same failure. Hence, the time that an RPC waits is the sum of the timeouts of all the previous RPCs in the queue and its own timeout. - **Destructor behavior**: In the destructor for ``RetryableGrpcClient``, the system fails all pending RPCs by posting their I/O contexts. These callbacks should ideally never modify state held by the client classes such as ``RayletClient``. If absolutely necessary, they must check if the client is still alive somehow, such as using a weak pointer. An example of this is in `PR #58744 `_. The application code should also take into account the `Disconnected error `_. Testing RPC fault tolerance ---------------------------- Ray Core has three layers of testing for RPC fault tolerance and idempotency. C++ unit tests ~~~~~~~~~~~~~~ For each RPC, there should be some form of C++ idempotency test that calls the ``HandleX`` server function twice and checks that the same result is outputted each time. 
Different state changes between the ``HandleX`` server function calls should be taken into account. For example, in ``RequestWorkerLease``, a C++ unit test was written to model the situation where the retry comes while the initial lease request is stuck in the args pulling stage. Python integration tests ~~~~~~~~~~~~~~~~~~~~~~~~ For each RPC, there should ideally be a Python integration test if it's straightforward. For some RPCs, it's challenging to test them fully deterministically using Python APIs, so having sufficient C++ unit testing can act as a good proxy. Hence, it's more of a nice-to-have, as integration tests also act as examples of how a user could run into idempotency issues. The main testing mechanism uses the ``RAY_testing_rpc_failure`` config option, which allows you to: - Trigger the RPC callback immediately with a gRPC error without sending the RPC (simulating a request failure). - Trigger the RPC callback with a gRPC error once the response arrives from the server (simulating a response failure). - Trigger the RPC callback immediately with a gRPC error but send the RPC to the server as well (simulating an in-flight failure, where the retry should ideally hit the server while it's executing the server code for long-polling RPCs). For more details, see the comment in the Ray config file at `ray_config_def.h `_. Chaos network release tests ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ IP table blackout ^^^^^^^^^^^^^^^^^ The IP table blackout approach involves SSH-ing into each node and blacking out the IP tables for a small amount of time (5 seconds) to simulate transient network errors. The IP table script runs in the background, periodically (60 seconds) causing network blackouts while the test script executes. For core release tests, we've added the IP table blackout approach to all existing chaos release tests in `PR #58868 `_. .. note:: Initially, Amazon FIS was considered. However, it has a 60-second minimum which caused node death due to configuration settings which was hard to debug, so the IP table approach was simpler and more flexible to use. --- .. _task-lifecycle: Task Lifecycle ============== This doc talks about the lifecycle of a task in Ray Core, including how tasks are defined, scheduled and executed. We will use the following code as an example and the internals are based on Ray 2.48. .. testcode:: import ray @ray.remote def my_task(arg): return f"Hello, {arg}!" obj_ref = my_task.remote("Ray") print(ray.get(obj_ref)) .. testoutput:: Hello, Ray! Defining a remote function -------------------------- The first step in the task lifecycle is defining a remote function using the :func:`ray.remote` decorator. :func:`ray.remote` wraps the Python function and returns an instance of `RemoteFunction `__. ``RemoteFunction`` stores the underlying function and all the user specified Ray task :meth:`options ` such as ``num_cpus``. Invoking a remote function -------------------------- Once a remote function is defined, it can be invoked using the `.remote()` method. Each invocation of a remote function creates a Ray task. This method submits the task for execution and returns an object reference (``ObjectRef``) that can be used to retrieve the result later. Under the hood, `.remote()` does the following: 1. `Pickles the underlying function `__ into bytes and `stores the bytes in GCS key-value store `__ with a `key `__ so that, later on, the remote executor (the core worker process that will execute the task) can get the bytes, unpickle, and execute the function. 
This is done once per remote function definition instead of once per invocation. 2. `Calls `__ Cython `submit_task `__ which `prepares `__ the arguments (3 types) and calls the C++ `CoreWorker::SubmitTask `__. 1. Pass-by-reference argument: the argument is an ``ObjectRef``. 2. Pass-by-value inline argument: the argument is a `small `__ Python object and the total size of such arguments so far is below the `threshold `__. In this case, it will be pickled, sent to the remote executor (as part of the ``PushTask`` RPC), and unpickled there. This is called inlining and plasma store is not involved in this case. 3. Pass-by-value non-inline argument: the argument is a normal Python object but it doesn't meet the inline criteria (e.g. size is too big), it is `put `__ in the local plasma store and the argument is replaced by the generated ``ObjectRef``, so it's effectively equivalent to ``.remote(ray.put(arg))``. 3. ``CoreWorker`` `builds `__ a `TaskSpecification `__ that contains all the information about the task including the `ID `__ of the function, all the user specified options and the arguments. This spec will be sent to the executor for execution. 4. The TaskSpecification is `submitted `__ to `NormalTaskSubmitter `__ asynchronously. This means the ``.remote()`` call returns immediately and the task is scheduled and executed asynchronously. Scheduling a task ----------------- Once the task is submitted to ``NormalTaskSubmitter``, a worker process on some Ray node is selected to execute the task and this process is called scheduling. 1. ``NormalTaskSubmitter`` first `waits `__ for all the ``ObjectRef`` arguments to be available. Available means tasks that produce those ``ObjectRef``\s finished execution and the data is available somewhere in the cluster. 1. If the object pointed to by the ``ObjectRef`` is in the plasma store, the ``ObjectRef`` itself is sent to the executor and the executor will resolve the ``ObjectRef`` to the actual data (pull from remote plasma store if needed) before calling the user function. 2. If the object pointed to by the ``ObjectRef`` is in the caller memory store, the data is `inlined `__ and sent to the executor as part of the ``PushTask`` RPC just like other pass-by-value inline arguments. 2. Once all the arguments are available, ``NormalTaskSubmitter`` will try to find an idle worker to execute the task. ``NormalTaskSubmitter`` gets workers for task execution from raylet via a process called worker lease and this is where scheduling happens. Specifically, it will `send `__ a ``RequestWorkerLease`` RPC to a `selected `__ (it's either the local raylet or a data-locality-favored raylet) raylet for a worker lease. 3. Raylet `handles `__ the ``RequestWorkerLease`` RPC. 4. When the ``RequestWorkerLease`` RPC returns with a leased worker address in the response, a worker lease is granted to the caller to execute the task. If the ``RequestWorkerLease`` response contains another raylet address instead, ``NormalTaskSubmitter`` will then request a worker lease from the specified raylet. This process continues until a worker lease is obtained. Executing a task ---------------- Once a leased worker is obtained, the task execution starts. 1. ``NormalTaskSubmitter`` `sends `__ a ``PushTask`` RPC to the leased worker with the ``TaskSpecification`` to execute. 2. The executor `receives `__ the ``PushTask`` RPC and executes (`1 `__ -> `2 `__ -> `3 `__ -> `4 `__ -> `5 `__) the task. 3. 
First step of executing the task is `getting `__ all the pass-by-reference arguments from the local plasma store (data is already pulled from remote plasma store to the local plasma store during scheduling). 4. Then the executor `gets `__ the pickled function bytes from GCS key-value store and unpickles it. 5. The next step is `unpickling `__ the arguments. 6. Finally, the user function is `called `__. Getting the return value ------------------------ After the user function is executed, the caller can get the return values. 1. After the user function returns, the executor `gets and stores `__ all the return values. If the return value is a `small `__ object and the total size of such return values so far is below the `threshold `__, it is returned directly to the caller as part of the ``PushTask`` RPC response. `Otherwise `__, it is put in the local plasma store and the reference is returned to the caller. 2. When the caller `receives `__ the ``PushTask`` RPC response, it `stores `__ the return values (actual data if the return value is small or a special value indicating the data is in plasma store if the return value is big) in the local memory store. 3. When the return value is `added `__ to the local memory store, ``ray.get()`` is `unblocked `__ and returns the value directly if the object is small, or it will `get `__ from the local plasma store (pull from remote plasma store first if needed) if the object is big. --- .. _token-authentication: Token Authentication ==================== Ray v2.52.0 introduced support for token authentication, enabling Ray to enforce the use of a single, statically generated token in the authorization header for all requests to the Ray Dashboard, GCS server, and other control-plane services. This document covers the design and architecture of token authentication in Ray, including configuration, token loading, propagation, and verification across C++, Python, and the Ray dashboard. Authentication Modes -------------------- Ray's authentication behavior is controlled by the **RAY_AUTH_MODE** environment variable. As of now, Ray supports two modes: - ``disabled`` - Default; no authentication. - ``token`` - Static bearer token authentication. **RAY_AUTH_MODE** must be set via the environment and should be configured consistently on every node in the Ray cluster. When ``RAY_AUTH_MODE=token``, token authentication is enabled and all supported RPC and HTTP entry points enforce token based authentication. Token Sources and Precedence ---------------------------- Once token auth is enabled, Ray looks for the token in the following order (highest to lowest precedence): 1. **RAY_AUTH_TOKEN** (environment variable): If set and non-empty, this value is used directly as the token string. 2. **RAY_AUTH_TOKEN_PATH** (environment variable pointing to file): If set, Ray reads the token from that file. If the file cannot be read or is empty, Ray treats this as a fatal misconfiguration and aborts rather than silently falling back. 3. **Default token file path**: If neither of the above are set, Ray falls back to a default path: - ``~/.ray/auth_token`` on POSIX systems - ``%USERPROFILE%\.ray\auth_token`` on Windows For local clusters started with ``ray.init()`` and auth enabled, Ray automatically generates a new token and persists it at the default path if no token exists. .. note:: Whitespace is stripped when reading the token from files to avoid issues from trailing newlines. 
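A minimal sketch of that lookup order (illustrative only, not Ray's actual implementation; the default path shown assumes a POSIX system):

.. code-block:: python

    import os
    from pathlib import Path
    from typing import Optional


    def resolve_auth_token() -> Optional[str]:
        """Illustrative token lookup following the precedence described above."""
        # 1. An explicit token in the environment wins.
        token = os.environ.get("RAY_AUTH_TOKEN")
        if token:
            return token.strip()

        # 2. Otherwise read the file named by RAY_AUTH_TOKEN_PATH.
        #    Ray treats an unreadable or empty file as a fatal misconfiguration.
        token_path = os.environ.get("RAY_AUTH_TOKEN_PATH")
        if token_path:
            token = Path(token_path).read_text().strip()
            if not token:
                raise RuntimeError(f"Empty or unreadable token file: {token_path}")
            return token

        # 3. Fall back to the default token file, if it exists.
        #    Whitespace is stripped to avoid trailing-newline issues.
        default_path = Path.home() / ".ray" / "auth_token"
        if default_path.exists():
            return default_path.read_text().strip()
        return None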
Token Propagation and Verification
----------------------------------

Common Expectations
~~~~~~~~~~~~~~~~~~~

Across both C++ and Python, gRPC servers expect the token to be present in the authorization metadata key as:

.. code-block:: text

    Authorization: Bearer <token>

HTTP servers similarly expect one of:

1. ``Authorization: Bearer <token>`` - Used by the Ray CLI and other internal HTTP clients.
2. Cookie ``ray-authentication-token=<token>`` - Used by the browser-based dashboard.
3. ``X-Ray-Authorization: Bearer <token>`` - Used by KubeRay and environments where a proxy may strip the standard ``Authorization`` header.

C++ Clients and Servers
~~~~~~~~~~~~~~~~~~~~~~~

On the C++ side, token attachment to outgoing RPCs is automated using gRPC's interceptor API. The client interceptor is defined in `token_auth_client_interceptor.h `_. All production C++ gRPC channels must be created through the ``BuildChannel()`` helper, which wires in the interceptor when token auth is enabled. Ray developers must not create channels directly with ``grpc::CreateCustomChannel``; doing so would bypass token attachment. ``BuildChannel()`` is the central enforcement point that ensures all C++ clients automatically add the correct ``Authorization: Bearer <token>`` metadata.

Server-side token validation compares the token presented by the client with the token the cluster was started with. This check is performed in `server_call.h `_ inside the generic request handling path. Because all gRPC services inherit from the same base call implementation, the validation applies uniformly to all C++ gRPC servers when token auth is enabled.

Python Clients and Servers
~~~~~~~~~~~~~~~~~~~~~~~~~~

Most Python components use Cython bindings over the C++ clients, so they automatically inherit the same token behavior without additional Python-level code. For components that construct gRPC clients or servers directly in Python, explicit interceptors (both sync and async) add and validate authentication metadata:

- `Client interceptors `_
- `Server interceptors `_

All Python gRPC clients and servers should be created using the helper utilities from `grpc_utils.py `_. These helpers automatically attach the correct client/server interceptors when token auth is enabled. The convention is to always go through the shared utilities so that auth is consistently enforced, and never to construct raw gRPC channels or servers directly.

HTTP Clients and Servers
~~~~~~~~~~~~~~~~~~~~~~~~

For HTTP services, token authentication is implemented using aiohttp middleware in `http_token_authentication.py `_. The middleware must be explicitly added to each server's middleware list (for example, the ``dashboard_head`` service and the ``runtime_env_agent`` service). Once configured, it:

- Extracts the token from the ``Authorization`` header, the ``X-Ray-Authorization`` header, or the ``ray-authentication-token`` cookie.
- Validates the token and returns:

  - **401 Unauthorized** for a missing token.
  - **403 Forbidden** for an invalid token.

On the client side, HTTP callers can use the ``get_auth_headers_if_auth_enabled()`` helper to attach headers. This helper computes ``Authorization: Bearer <token>`` if token auth is enabled and merges it with any user-supplied headers.

.. note:: For HTTP, middleware and header injection are not automatically wired up for new services; they must be added manually.

Ray Dashboard Flow
------------------

When a Ray cluster is started with ``RAY_AUTH_MODE=token``, accessing the dashboard triggers an authentication flow in the UI:

1. The user sees a dialog prompting them to enter the authentication token.
2. Once the user submits the token, the frontend sends a ``POST`` request to the dashboard head's ``/api/authenticate`` endpoint with an ``Authorization: Bearer <token>`` header.
3. The dashboard head validates the token.
4. If validation succeeds, the server responds with **200 OK** and instructs the browser to set a cookie:

   - Name: ``ray-authentication-token``
   - Value: ``<token>``
   - Attributes: ``HttpOnly``, ``SameSite=Strict`` (and ``Secure`` when running over HTTPS)
   - Max-Age: 30 days (the cookie expires after 30 days)

From this point on, subsequent dashboard UI API calls automatically include the cookie and satisfy the middleware's authentication checks.

If a backend request returns **401 Unauthorized** (no token) or **403 Forbidden** (invalid token or mode change), the dashboard UI interprets this as an authentication failure. It clears any stale state and re-opens the authentication dialog, prompting the user to re-enter a valid token.

This approach keeps the token out of JavaScript-accessible storage and relies on standard browser cookie mechanics to secure subsequent requests.

Ray CLI
-------

Ray CLI commands that talk to an authenticated cluster automatically load the token from the same three mechanisms (in the same precedence order):

- **RAY_AUTH_TOKEN**, **RAY_AUTH_TOKEN_PATH**, or the default token file.

Once loaded, CLI commands pass the token along to their internal RPC calls. Depending on the underlying implementation, they either:

- Use C++ clients (and thus the C++ interceptors via ``BuildChannel()``), or
- Use Python gRPC clients/servers and the Python interceptors via ``grpc_utils.py``, or
- Use HTTP helpers that call ``get_auth_headers_if_auth_enabled()``.

From the user's perspective, as long as the token is configured via one of the supported mechanisms, the CLI works against token-secured clusters.

ray get-auth-token Command
~~~~~~~~~~~~~~~~~~~~~~~~~~

To retrieve and share the token used by a local Ray cluster (for example, to paste into the dashboard UI), Ray provides the ``ray get-auth-token`` command.

By default, ``ray get-auth-token`` attempts to load an existing token from:

- **RAY_AUTH_TOKEN**, **RAY_AUTH_TOKEN_PATH**, or the default token file.

If a token is found, it is printed to ``stdout`` (suitable for scripting and export). If no token exists, the command fails with an error explaining that no token is configured.

Users can pass the ``--generate`` flag to generate a new token and store it at the default token file path if no token is currently configured. This does not overwrite an existing token; it only creates one when none is present.

Adding Token Authentication to New Services
-------------------------------------------

When adding new gRPC or HTTP services to Ray, follow these guidelines to ensure proper token authentication support:

gRPC Services
~~~~~~~~~~~~~

**C++ Services:**

1. Always create gRPC channels through ``BuildChannel()``; never use ``grpc::CreateCustomChannel`` directly.
2. Server-side validation is automatic if your service inherits from the standard base call implementation.

**Python Services:**

1. Use the helper utilities from ``grpc_utils.py`` to create clients and servers.
2. The interceptors are automatically attached when token auth is enabled.

HTTP Services
~~~~~~~~~~~~~

1. Add the authentication middleware from ``http_token_authentication.py`` to your server's middleware list.
2. Use ``get_auth_headers_if_auth_enabled()`` for client-side header attachment.

..
note:: HTTP middleware and header injection are not automatically wired up - they must be added manually to each new HTTP service. --- .. _ray-core-internals: Internals ========= This section provides a look into some of Ray Core internals. It's primarily intended for advanced users and developers of Ray Core. For the high level architecture overview, please refer to the `whitepaper `__. .. toctree:: :maxdepth: 1 internals/task-lifecycle.rst internals/autoscaler-v2.rst internals/rpc-fault-tolerance.rst internals/token-authentication.rst internals/metric-exporter.rst internals/ray-event-exporter.rst --- .. _core-key-concepts: Key Concepts ============ This section overviews Ray's key concepts. These primitives work together to enable Ray to flexibly support a broad range of distributed applications. .. _task-key-concept: Tasks ----- Ray enables arbitrary functions to execute asynchronously on separate worker processes. These asynchronous Ray functions are called tasks. Ray enables tasks to specify their resource requirements in terms of CPUs, GPUs, and custom resources. The cluster scheduler uses these resource requests to distribute tasks across the cluster for parallelized execution. See the :ref:`User Guide for Tasks `. .. _actor-key-concept: Actors ------ Actors extend the Ray API from functions (tasks) to classes. An actor is essentially a stateful worker (or a service). When you instantiate a new actor, Ray creates a new worker and schedules methods of the actor on that specific worker. The methods can access and mutate the state of that worker. Like tasks, actors support CPU, GPU, and custom resource requirements. See the :ref:`User Guide for Actors `. Objects ------- Tasks and actors create objects and compute on objects. You can refer to these objects as *remote objects* because Ray stores them anywhere in a Ray cluster, and you use *object refs* to refer to them. Ray caches remote objects in its distributed `shared-memory `__ *object store* and creates one object store per node in the cluster. In the cluster setting, a remote object can live on one or many nodes, independent of who holds the object ref. See the :ref:`User Guide for Objects `. Placement Groups ---------------- Placement groups allow users to atomically reserve groups of resources across multiple nodes. You can use them to schedule Ray tasks and actors packed as close as possible for locality (PACK), or spread apart (SPREAD). A common use case is gang-scheduling actors or tasks. See the :ref:`User Guide for Placement Groups `. Environment Dependencies ------------------------ When Ray executes tasks and actors on remote machines, their environment dependencies, such as Python packages, local files, and environment variables, must be available on the remote machines. To address this problem, you can 1. Prepare your dependencies on the cluster in advance using the Ray :ref:`Cluster Launcher ` 2. Use Ray's :ref:`runtime environments ` to install them on the fly. See the :ref:`User Guide for Environment Dependencies `. --- Miscellaneous Topics ==================== This page will cover some miscellaneous topics in Ray. .. contents:: :local: Dynamic Remote Parameters ------------------------- You can dynamically adjust resource requirements or return values of ``ray.remote`` during execution with ``.options``. For example, here we instantiate many copies of the same actor with varying resource requirements. 
Note that to create these actors successfully, Ray will need to be started with sufficient CPU resources and the relevant custom resources: .. testcode:: import ray @ray.remote(num_cpus=4) class Counter(object): def __init__(self): self.value = 0 def increment(self): self.value += 1 return self.value a1 = Counter.options(num_cpus=1, resources={"Custom1": 1}).remote() a2 = Counter.options(num_cpus=2, resources={"Custom2": 1}).remote() a3 = Counter.options(num_cpus=3, resources={"Custom3": 1}).remote() You can specify different resource requirements for tasks (but not for actor methods): .. testcode:: :hide: ray.shutdown() .. testcode:: ray.init(num_cpus=1, num_gpus=1) @ray.remote def g(): return ray.get_gpu_ids() object_gpu_ids = g.remote() assert ray.get(object_gpu_ids) == [] dynamic_object_gpu_ids = g.options(num_cpus=1, num_gpus=1).remote() assert ray.get(dynamic_object_gpu_ids) == [0] And vary the number of return values for tasks (and actor methods too): .. testcode:: @ray.remote def f(n): return list(range(n)) id1, id2 = f.options(num_returns=2).remote(2) assert ray.get(id1) == 0 assert ray.get(id2) == 1 And specify a name for tasks (and actor methods too) at task submission time: .. testcode:: import psutil @ray.remote def f(x): assert psutil.Process().cmdline()[0] == "ray::special_f" return x + 1 obj = f.options(name="special_f").remote(3) assert ray.get(obj) == 4 This name will appear as the task name in the machine view of the dashboard, will appear as the worker process name when this task is executing (if a Python task), and will appear as the task name in the logs. .. image:: images/task_name_dashboard.png Overloaded Functions -------------------- Ray Java API supports calling overloaded java functions remotely. However, due to the limitation of Java compiler type inference, one must explicitly cast the method reference to the correct function type. For example, consider the following. Overloaded normal task call: .. code:: java public static class MyRayApp { public static int overloadFunction() { return 1; } public static int overloadFunction(int x) { return x; } } // Invoke overloaded functions. Assert.assertEquals((int) Ray.task((RayFunc0) MyRayApp::overloadFunction).remote().get(), 1); Assert.assertEquals((int) Ray.task((RayFunc1) MyRayApp::overloadFunction, 2).remote().get(), 2); Overloaded actor task call: .. code:: java public static class Counter { protected int value = 0; public int increment() { this.value += 1; return this.value; } } public static class CounterOverloaded extends Counter { public int increment(int diff) { super.value += diff; return super.value; } public int increment(int diff1, int diff2) { super.value += diff1 + diff2; return super.value; } } .. code:: java ActorHandle a = Ray.actor(CounterOverloaded::new).remote(); // Call an overloaded actor method by super class method reference. Assert.assertEquals((int) a.task(Counter::increment).remote().get(), 1); // Call an overloaded actor method, cast method reference first. a.task((RayFunc1) CounterOverloaded::increment).remote(); a.task((RayFunc2) CounterOverloaded::increment, 10).remote(); a.task((RayFunc3) CounterOverloaded::increment, 10, 10).remote(); Assert.assertEquals((int) a.task(Counter::increment).remote().get(), 33); Inspecting Cluster State ------------------------ Applications written on top of Ray will often want to have some information or diagnostics about the cluster. Some common questions include: 1. How many nodes are in my autoscaling cluster? 2. 
What resources are currently available in my cluster, both used and total? 3. What are the objects currently in my cluster? For this, you can use the global state API. Node Information ~~~~~~~~~~~~~~~~ To get information about the current nodes in your cluster, you can use ``ray.nodes()``: .. autofunction:: ray.nodes :noindex: .. testcode:: :hide: ray.shutdown() .. testcode:: import ray ray.init() print(ray.nodes()) .. testoutput:: :options: +MOCK [{'NodeID': '2691a0c1aed6f45e262b2372baf58871734332d7', 'Alive': True, 'NodeManagerAddress': '192.168.1.82', 'NodeManagerHostname': 'host-MBP.attlocal.net', 'NodeManagerPort': 58472, 'ObjectManagerPort': 52383, 'ObjectStoreSocketName': '/tmp/ray/session_2020-08-04_11-00-17_114725_17883/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2020-08-04_11-00-17_114725_17883/sockets/raylet', 'MetricsExportPort': 64860, 'alive': True, 'Resources': {'CPU': 16.0, 'memory': 100.0, 'object_store_memory': 34.0, 'node:192.168.1.82': 1.0}}] The above information includes: - `NodeID`: A unique identifier for the raylet. - `alive`: Whether the node is still alive. - `NodeManagerAddress`: PrivateIP of the node that the raylet is on. - `Resources`: The total resource capacity on the node. - `MetricsExportPort`: The port number at which metrics are exposed to through a `Prometheus endpoint `_. Resource Information ~~~~~~~~~~~~~~~~~~~~ To get information about the current total resource capacity of your cluster, you can use ``ray.cluster_resources()``. .. autofunction:: ray.cluster_resources :noindex: To get information about the current available resource capacity of your cluster, you can use ``ray.available_resources()``. .. autofunction:: ray.available_resources :noindex: Running Large Ray Clusters -------------------------- Here are some tips to run Ray with more than 1k nodes. When running Ray with such a large number of nodes, several system settings may need to be tuned to enable communication between such a large number of machines. Tuning Operating System Settings ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Because all nodes and workers connect to the GCS, many network connections will be created and the operating system has to support that number of connections. Maximum open files ****************** The OS has to be configured to support opening many TCP connections since every worker and raylet connects to the GCS. In POSIX systems, the current limit can be checked by ``ulimit -n`` and if it's small, it should be increased according to the OS manual. ARP cache ********* Another thing that needs to be configured is the ARP cache. In a large cluster, all the worker nodes connect to the head node, which adds a lot of entries to the ARP table. Ensure that the ARP cache size is large enough to handle this many nodes. Failure to do this will result in the head node hanging. When this happens, ``dmesg`` will show errors like ``neighbor table overflow message``. In Ubuntu, the ARP cache size can be tuned in ``/etc/sysctl.conf`` by increasing the value of ``net.ipv4.neigh.default.gc_thresh1`` - ``net.ipv4.neigh.default.gc_thresh3``. For more details, please refer to the OS manual. 
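As a rough sketch, you can check and apply these settings as follows. The values mirror the benchmark setup below and are only illustrative; persist the ``sysctl`` settings in ``/etc/sysctl.conf`` so they survive reboots:

.. code-block:: bash

    # Check the current open-file limit and raise it in the shell that starts Ray.
    ulimit -n
    ulimit -n 1048576

    # Increase the ARP cache thresholds to accommodate a large number of nodes.
    sudo sysctl -w net.ipv4.neigh.default.gc_thresh1=2048
    sudo sysctl -w net.ipv4.neigh.default.gc_thresh2=4096
    sudo sysctl -w net.ipv4.neigh.default.gc_thresh3=8192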
Benchmark ~~~~~~~~~ The machine setup: - 1 head node: m5.4xlarge (16 vCPUs/64GB mem) - 2000 worker nodes: m5.large (2 vCPUs/8GB mem) The OS setup: - Set the maximum number of opening files to 1048576 - Increase the ARP cache size: - ``net.ipv4.neigh.default.gc_thresh1=2048`` - ``net.ipv4.neigh.default.gc_thresh2=4096`` - ``net.ipv4.neigh.default.gc_thresh3=8192`` The Ray setup: - ``RAY_event_stats=false`` Test workload: - Test script: `code `_ .. list-table:: Benchmark result :header-rows: 1 * - Number of actors - Actor launch time - Actor ready time - Total time * - 20k (10 actors / node) - 14.5s - 136.1s - 150.7s --- .. _namespaces-guide: Using Namespaces ================ A namespace is a logical grouping of jobs and named actors. When an actor is named, its name must be unique within the namespace. In order to set your applications namespace, it should be specified when you first connect to the cluster. .. tab-set:: .. tab-item:: Python .. literalinclude:: ./doc_code/namespaces.py :language: python :start-after: __init_namespace_start__ :end-before: __init_namespace_end__ .. tab-item:: Java .. code-block:: java System.setProperty("ray.job.namespace", "hello"); // set it before Ray.init() Ray.init(); .. tab-item:: C++ .. code-block:: c++ ray::RayConfig config; config.ray_namespace = "hello"; ray::Init(config); Please refer to `Driver Options `__ for ways of configuring a Java application. Named actors are only accessible within their namespaces. .. tab-set:: .. tab-item:: Python .. literalinclude:: ./doc_code/namespaces.py :language: python :start-after: __actor_namespace_start__ :end-before: __actor_namespace_end__ .. tab-item:: Java .. code-block:: java // `ray start --head` has been run to launch a local cluster. // Job 1 creates two actors, "orange" and "purple" in the "colors" namespace. System.setProperty("ray.address", "localhost:10001"); System.setProperty("ray.job.namespace", "colors"); try { Ray.init(); Ray.actor(Actor::new).setName("orange").remote(); Ray.actor(Actor::new).setName("purple").remote(); } finally { Ray.shutdown(); } // Job 2 is now connecting to a different namespace. System.setProperty("ray.address", "localhost:10001"); System.setProperty("ray.job.namespace", "fruits"); try { Ray.init(); // This fails because "orange" was defined in the "colors" namespace. Ray.getActor("orange").isPresent(); // return false // This succeeds because the name "orange" is unused in this namespace. Ray.actor(Actor::new).setName("orange").remote(); Ray.actor(Actor::new).setName("watermelon").remote(); } finally { Ray.shutdown(); } // Job 3 connects to the original "colors" namespace. System.setProperty("ray.address", "localhost:10001"); System.setProperty("ray.job.namespace", "colors"); try { Ray.init(); // This fails because "watermelon" was in the fruits namespace. Ray.getActor("watermelon").isPresent(); // return false // This returns the "orange" actor we created in the first job, not the second. Ray.getActor("orange").isPresent(); // return true } finally { Ray.shutdown(); } .. tab-item:: C++ .. code-block:: c++ // `ray start --head` has been run to launch a local cluster. // Job 1 creates two actors, "orange" and "purple" in the "colors" namespace. ray::RayConfig config; config.ray_namespace = "colors"; ray::Init(config); ray::Actor(RAY_FUNC(Counter::FactoryCreate)).SetName("orange").Remote(); ray::Actor(RAY_FUNC(Counter::FactoryCreate)).SetName("purple").Remote(); ray::Shutdown(); // Job 2 is now connecting to a different namespace. 
ray::RayConfig config; config.ray_namespace = "fruits"; ray::Init(config); // This fails because "orange" was defined in the "colors" namespace. ray::GetActor("orange"); // return nullptr; // This succeeds because the name "orange" is unused in this namespace. ray::Actor(RAY_FUNC(Counter::FactoryCreate)).SetName("orange").Remote(); ray::Actor(RAY_FUNC(Counter::FactoryCreate)).SetName("watermelon").Remote(); ray::Shutdown(); // Job 3 connects to the original "colors" namespace. ray::RayConfig config; config.ray_namespace = "colors"; ray::Init(config); // This fails because "watermelon" was in the fruits namespace. ray::GetActor("watermelon"); // return nullptr; // This returns the "orange" actor we created in the first job, not the second. ray::GetActor("orange"); ray::Shutdown(); Specifying namespace for named actors ------------------------------------- You can specify a namespace for a named actor while creating it. The created actor belongs to the specified namespace, no matter what namespace of the current job is. .. tab-set:: .. tab-item:: Python .. literalinclude:: ./doc_code/namespaces.py :language: python :start-after: __specify_actor_namespace_start__ :end-before: __specify_actor_namespace_end__ .. tab-item:: Java .. code-block:: java // `ray start --head` has been run to launch a local cluster. System.setProperty("ray.address", "localhost:10001"); try { Ray.init(); // Create an actor with specified namespace. Ray.actor(Actor::new).setName("my_actor", "actor_namespace").remote(); // It is accessible in its namespace. Ray.getActor("my_actor", "actor_namespace").isPresent(); // return true } finally { Ray.shutdown(); } .. tab-item:: C++ .. code-block:: c++ // `ray start --head` has been run to launch a local cluster. ray::RayConfig config; ray::Init(config); // Create an actor with specified namespace. ray::Actor(RAY_FUNC(Counter::FactoryCreate)).SetName("my_actor", "actor_namespace").Remote(); // It is accessible in its namespace. ray::GetActor("my_actor", "actor_namespace"); ray::Shutdown(); Anonymous namespaces -------------------- When a namespace is not specified, Ray will place your job in an anonymous namespace. In an anonymous namespace, your job will have its own namespace and will not have access to actors in other namespaces. .. tab-set:: .. tab-item:: Python .. literalinclude:: ./doc_code/namespaces.py :language: python :start-after: __anonymous_namespace_start__ :end-before: __anonymous_namespace_end__ .. tab-item:: Java .. code-block:: java // `ray start --head` has been run to launch a local cluster. // Job 1 connects to an anonymous namespace by default. System.setProperty("ray.address", "localhost:10001"); try { Ray.init(); Ray.actor(Actor::new).setName("my_actor").remote(); } finally { Ray.shutdown(); } // Job 2 connects to a _different_ anonymous namespace by default System.setProperty("ray.address", "localhost:10001"); try { Ray.init(); // This succeeds because the second job is in its own namespace. Ray.actor(Actor::new).setName("my_actor").remote(); } finally { Ray.shutdown(); } .. tab-item:: C++ .. code-block:: c++ // `ray start --head` has been run to launch a local cluster. // Job 1 connects to an anonymous namespace by default. ray::RayConfig config; ray::Init(config); ray::Actor(RAY_FUNC(Counter::FactoryCreate)).SetName("my_actor").Remote(); ray::Shutdown(); // Job 2 connects to a _different_ anonymous namespace by default ray::RayConfig config; ray::Init(config); // This succeeds because the second job is in its own namespace. 
ray::Actor(RAY_FUNC(Counter::FactoryCreate)).SetName("my_actor").Remote(); ray::Shutdown(); .. note:: Anonymous namespaces are implemented as UUID's. This makes it possible for a future job to manually connect to an existing anonymous namespace, but it is not recommended. Getting the current namespace ----------------------------- You can access to the current namespace using :ref:`runtime_context APIs `. .. tab-set:: .. tab-item:: Python .. literalinclude:: ./doc_code/namespaces.py :language: python :start-after: __get_namespace_start__ :end-before: __get_namespace_end__ .. tab-item:: Java .. code-block:: java System.setProperty("ray.job.namespace", "colors"); try { Ray.init(); // Will print namespace name "colors". System.out.println(Ray.getRuntimeContext().getNamespace()); } finally { Ray.shutdown(); } .. tab-item:: C++ .. code-block:: c++ ray::RayConfig config; config.ray_namespace = "colors"; ray::Init(config); // Will print namespace name "colors". std::cout << ray::GetNamespace() << std::endl; ray::Shutdown(); --- Object Spilling =============== .. _object-spilling: Ray spills objects to a directory in the local filesystem once the object store is full. By default, Ray spills objects to the temporary directory (for example, ``/tmp/ray/session_2025-03-28_00-05-20_204810_2814690``). Spilling to a custom directory ------------------------------- You can specify a custom directory for spilling objects by setting the ``object_spilling_directory`` parameter in the ``ray.init`` function or the ``--object-spilling-directory`` command line option in the ``ray start`` command. .. tab-set:: .. tab-item:: Python .. doctest:: ray.init(object_spilling_directory="/path/to/spill/dir") .. tab-item:: CLI .. doctest:: ray start --object-spilling-directory=/path/to/spill/dir For advanced usage and customizations, reach out to the `Ray team `_. Stats ----- When spilling is happening, the following INFO level messages are printed to the Raylet logs. For example, ``/tmp/ray/session_latest/logs/raylet.out``:: local_object_manager.cc:166: Spilled 50 MiB, 1 objects, write throughput 230 MiB/s local_object_manager.cc:334: Restored 50 MiB, 1 objects, read throughput 505 MiB/s You can also view cluster-wide spill stats by using the ``ray memory`` command:: --- Aggregate object store stats across all nodes --- Plasma memory usage 50 MiB, 1 objects, 50.0% full Spilled 200 MiB, 4 objects, avg write throughput 570 MiB/s Restored 150 MiB, 3 objects, avg read throughput 1361 MiB/s If you only want to display cluster-wide spill stats, use ``ray memory --stats-only``. --- .. _serialization-guide: Serialization ============= Since Ray processes do not share memory space, data transferred between workers and nodes will need to be **serialized** and **deserialized**. Ray uses the `Plasma object store `_ to efficiently transfer objects across different processes and different nodes. Numpy arrays in the object store are shared between workers on the same node (zero-copy deserialization). Overview -------- Ray has decided to use a customized `Pickle protocol version 5 `_ backport to replace the original PyArrow serializer. This gets rid of several previous limitations (e.g. cannot serialize recursive objects). Ray is currently compatible with Pickle protocol version 5, while Ray supports serialization of a wider range of objects (e.g. lambda & nested functions, dynamic classes) with the help of cloudpickle. .. _plasma-store: Plasma Object Store ~~~~~~~~~~~~~~~~~~~ Plasma is an in-memory object store. 
It was originally developed as part of Apache Arrow. Prior to Ray's 1.0.0 release, Ray forked Arrow's Plasma code into Ray's code base in order to disentangle it and continue development according to Ray's architecture and performance needs.

Plasma is used to efficiently transfer objects across different processes and different nodes. All objects in the Plasma object store are **immutable** and held in shared memory so that many workers on the same node can access them efficiently.

Each node has its own object store. When data is put into the object store, it is not automatically broadcast to other nodes. Data remains local to the writer until requested by another task or actor on another node.

.. _serialize-object-ref:

Serializing ObjectRefs
~~~~~~~~~~~~~~~~~~~~~~

Explicitly serializing `ObjectRefs` using `ray.cloudpickle` should be a last resort. Passing `ObjectRefs` through Ray task arguments and return values is the recommended approach.

Ray `ObjectRefs` can be serialized using `ray.cloudpickle`. The `ObjectRef` can then be deserialized and accessed with `ray.get()`. Note that `ray.cloudpickle` must be used; other pickle tools are not guaranteed to work. Additionally, the process that deserializes the `ObjectRef` must be part of the same Ray cluster that serialized it.

When serialized, the `ObjectRef`'s value remains pinned in Ray's shared memory object store. The object must be explicitly freed by calling `ray._private.internal_api.free(obj_ref)`.

.. warning:: `ray._private.internal_api.free(obj_ref)` is a private API and may be changed in future Ray versions.

This code example demonstrates how to serialize an `ObjectRef`, store it in external storage, deserialize and use it, and finally free its object.

.. literalinclude:: /ray-core/doc_code/object_ref_serialization.py

Numpy Arrays
~~~~~~~~~~~~

Ray optimizes for numpy arrays by using Pickle protocol 5 with out-of-band data. The numpy array is stored as a read-only object, and all Ray workers on the same node can read the numpy array in the object store without copying (zero-copy reads). Each numpy array object in the worker process holds a pointer to the relevant array held in shared memory. Any writes to the read-only object require the user to first copy it into the local process memory.

.. tip:: You can often avoid serialization issues by using only native types (e.g., numpy arrays or lists/dicts of numpy arrays and other primitive types), or by using Actors to hold objects that cannot be serialized.

Fixing "assignment destination is read-only"
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Because Ray puts numpy arrays in the object store, they become read-only when deserialized as arguments in remote functions. For example, the following code snippet will crash:

.. literalinclude:: /ray-core/doc_code/deser.py

To avoid this issue, you can manually copy the array at the destination if you need to mutate it (``arr = arr.copy()``). Note that this is effectively like disabling the zero-copy deserialization feature provided by Ray.

Serialization notes
-------------------

- Ray is currently using Pickle protocol version 5. The default pickle protocol used by most Python distributions is protocol 3. Protocol 4 & 5 are more efficient than protocol 3 for larger objects.
- For non-native objects, Ray will always keep a single copy even if it is referred to multiple times in an object:

..
testcode:: import ray import numpy as np obj = [np.zeros(42)] * 99 l = ray.get(ray.put(obj)) assert l[0] is l[1] # no problem! - Whenever possible, use numpy arrays or Python collections of numpy arrays for maximum performance. - Lock objects are mostly unserializable, because copying a lock is meaningless and could cause serious concurrency problems. You may have to come up with a workaround if your object contains a lock. Zero-Copy Serialization for Read-Only Tensors ---------------------------------------------- Ray provides optional zero-copy serialization for read-only PyTorch tensors. Ray serializes these tensors by converting them to NumPy arrays and leveraging pickle5's zero-copy buffer sharing. This avoids copying the underlying tensor data, which can improve performance when passing large tensors across tasks or actors. However, PyTorch does not natively support read-only tensors, so this feature must be used with caution. When the feature is enabled, Ray won't copy and allow a write to shared memory. One process changing a tensor after `ray.get()` could be reflected in another process if both processes are colocated on the same node. This feature works best under the following conditions: - The tensor has `requires_grad = False` (i.e., is detached from the autograd graph). - The tensor is contiguous in memory (`tensor.is_contiguous()`). - Performance benefits from this are larger if the tensor resides in CPU memory. - You are not using Ray Direct Transport. This feature is disabled by default. You can enable it by setting the environment variable `RAY_ENABLE_ZERO_COPY_TORCH_TENSORS`. Set this variable externally before running your script to enable zero-copy serialization in the driver process: .. code-block:: bash export RAY_ENABLE_ZERO_COPY_TORCH_TENSORS=1 The following example calculates the sum of a 1GiB tensor using `ray.get()`, leveraging zero-copy serialization: .. testcode:: :hide: ray.shutdown() .. testcode:: import ray import torch import time ray.init(runtime_env={"env_vars": {"RAY_ENABLE_ZERO_COPY_TORCH_TENSORS": "1"}}) @ray.remote def process(tensor): return tensor.sum() x = torch.ones(1024, 1024, 256) start_time = time.perf_counter() result = ray.get(process.remote(x)) elapsed_time = time.perf_counter() - start_time print(f"Elapsed time: {elapsed_time}s") assert result == x.sum() In this example, enabling zero-copy serialization reduces end-to-end latency by **66.3%**: .. code-block:: bash # Without Zero-Copy Serialization Elapsed time: 23.53883756196592s # With Zero-Copy Serialization Elapsed time: 7.933729998010676s Customized Serialization ------------------------ Sometimes you may want to customize your serialization process because the default serializer used by Ray (pickle5 + cloudpickle) does not work for you (fail to serialize some objects, too slow for certain objects, etc.). There are at least 3 ways to define your custom serialization process: 1. If you want to customize the serialization of a type of objects, and you have access to the code, you can define ``__reduce__`` function inside the corresponding class. This is commonly done by most Python libraries. Example code: .. testcode:: import ray import sqlite3 class DBConnection: def __init__(self, path): self.path = path self.conn = sqlite3.connect(path) # without '__reduce__', the instance is unserializable. 
def __reduce__(self): deserializer = DBConnection serialized_data = (self.path,) return deserializer, serialized_data original = DBConnection("/tmp/db") print(original.conn) copied = ray.get(ray.put(original)) print(copied.conn) .. testoutput:: 2. If you want to customize the serialization of a type of objects, but you cannot access or modify the corresponding class, you can register the class with the serializer you use: .. testcode:: import ray import threading class A: def __init__(self, x): self.x = x self.lock = threading.Lock() # could not be serialized! try: ray.get(ray.put(A(1))) # fail! except TypeError: pass def custom_serializer(a): return a.x def custom_deserializer(b): return A(b) # Register serializer and deserializer for class A: ray.util.register_serializer( A, serializer=custom_serializer, deserializer=custom_deserializer) ray.get(ray.put(A(1))) # success! # You can deregister the serializer at any time. ray.util.deregister_serializer(A) try: ray.get(ray.put(A(1))) # fail! except TypeError: pass # Nothing happens when deregister an unavailable serializer. ray.util.deregister_serializer(A) NOTE: Serializers are managed locally for each Ray worker. So for every Ray worker, if you want to use the serializer, you need to register the serializer. Deregister a serializer also only applies locally. If you register a new serializer for a class, the new serializer would replace the old serializer immediately in the worker. This API is also idempotent, there are no side effects caused by re-registering the same serializer. 3. We also provide you an example, if you want to customize the serialization of a specific object: .. testcode:: import threading class A: def __init__(self, x): self.x = x self.lock = threading.Lock() # could not serialize! try: ray.get(ray.put(A(1))) # fail! except TypeError: pass class SerializationHelperForA: """A helper class for serialization.""" def __init__(self, a): self.a = a def __reduce__(self): return A, (self.a.x,) ray.get(ray.put(SerializationHelperForA(A(1)))) # success! # the serializer only works for a specific object, not all A # instances, so we still expect failure here. try: ray.get(ray.put(A(1))) # still fail! except TypeError: pass .. _custom-exception-serializer: Custom Serializers for Exceptions ---------------------------------- When Ray tasks raise exceptions that cannot be serialized with the default pickle mechanism, you can register custom serializers to handle them (Note: the serializer must be registered in the driver and all workers). .. 
testcode:: import ray import threading class CustomError(Exception): def __init__(self, message, data): self.message = message self.data = data self.lock = threading.Lock() # Cannot be serialized def custom_serializer(exc): return {"message": exc.message, "data": str(exc.data)} def custom_deserializer(state): return CustomError(state["message"], state["data"]) # Register in the driver ray.util.register_serializer( CustomError, serializer=custom_serializer, deserializer=custom_deserializer ) @ray.remote def task_that_registers_serializer_and_raises(): # Register the custom serializer in the worker ray.util.register_serializer( CustomError, serializer=custom_serializer, deserializer=custom_deserializer ) # Now raise the custom exception raise CustomError("Something went wrong", {"complex": "data"}) # The custom exception will be properly serialized across worker boundaries try: ray.get(task_that_registers_serializer_and_raises.remote()) except ray.exceptions.RayTaskError as e: print(f"Caught exception: {e.cause}") # This will be our CustomError When a custom exception is raised in a remote task, Ray will: 1. Serialize the exception using your custom serializer 2. Wrap it in a :class:`RayTaskError ` 3. The deserialized exception will be available as ``ray_task_error.cause`` Whenever serialization fails, Ray throws an :class:`UnserializableException ` containing the string representation of the original stack trace. Troubleshooting --------------- Use ``ray.util.inspect_serializability`` to identify tricky pickling issues. This function can be used to trace a potential non-serializable object within any Python object -- whether it be a function, class, or object instance. Below, we demonstrate this behavior on a function with a non-serializable object (threading lock): .. testcode:: from ray.util import inspect_serializability import threading lock = threading.Lock() def test(): print(lock) inspect_serializability(test, name="test") The resulting output is: .. testoutput:: :options: +MOCK ============================================================= Checking Serializability of ============================================================= !!! FAIL serialization: cannot pickle '_thread.lock' object Detected 1 global variables. Checking serializability... Serializing 'lock' ... !!! FAIL serialization: cannot pickle '_thread.lock' object WARNING: Did not find non-serializable object in . This may be an oversight. ============================================================= Variable: FailTuple(lock [obj=, parent=]) was found to be non-serializable. There may be multiple other undetected variables that were non-serializable. Consider either removing the instantiation/imports of these variables or moving the instantiation into the scope of the function/class. ============================================================= Check https://docs.ray.io/en/master/ray-core/objects/serialization.html#troubleshooting for more information. If you have any suggestions on how to improve this error message, please reach out to the Ray developers on github.com/ray-project/ray/issues/ ============================================================= For even more detailed information, set environmental variable ``RAY_PICKLE_VERBOSE_DEBUG='2'`` before importing Ray. This enables serialization with python-based backend instead of C-Pickle, so you can debug into python code at the middle of serialization. However, this would make serialization much slower. 
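For example, assuming your entry point is a script named ``my_script.py`` (a placeholder name), you could enable this debugging mode as follows:

.. code-block:: bash

    # Switch serialization to the slower Python-based backend so you can step
    # through the pickling code; set this before the script imports Ray.
    export RAY_PICKLE_VERBOSE_DEBUG='2'
    python my_script.py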
Known Issues ------------ Users could experience memory leak when using certain python3.8 & 3.9 versions. This is due to `a bug in python's pickle module `_. This issue has been solved for Python 3.8.2rc1, Python 3.9.0 alpha 4 or late versions. --- .. _objects-in-ray: Objects ======= In Ray, tasks and actors create and compute on objects. We refer to these objects as **remote objects** because they can be stored anywhere in a Ray cluster, and we use **object refs** to refer to them. Remote objects are cached in Ray's distributed `shared-memory `__ **object store**, and there is one object store per node in the cluster. In the cluster setting, a remote object can live on one or many nodes, independent of who holds the object ref(s). An **object ref** is essentially a pointer or a unique ID that can be used to refer to a remote object without seeing its value. If you're familiar with futures, Ray object refs are conceptually similar. Object refs can be created in two ways. 1. They are returned by remote function calls. 2. They are returned by :func:`ray.put() `. .. tab-set:: .. tab-item:: Python .. testcode:: import ray # Put an object in Ray's object store. y = 1 object_ref = ray.put(y) .. tab-item:: Java .. code-block:: java // Put an object in Ray's object store. int y = 1; ObjectRef objectRef = Ray.put(y); .. tab-item:: C++ .. code-block:: c++ // Put an object in Ray's object store. int y = 1; ray::ObjectRef object_ref = ray::Put(y); .. note:: Remote objects are immutable. That is, their values cannot be changed after creation. This allows remote objects to be replicated in multiple object stores without needing to synchronize the copies. Fetching Object Data -------------------- You can use the :func:`ray.get() ` method to fetch the result of a remote object from an object ref. If the current node's object store does not contain the object, the object is downloaded. .. tab-set:: .. tab-item:: Python If the object is a `numpy array `__ or a collection of numpy arrays, the ``get`` call is zero-copy and returns arrays backed by shared object store memory. Otherwise, we deserialize the object data into a Python object. .. testcode:: import ray import time # Get the value of one object ref. obj_ref = ray.put(1) assert ray.get(obj_ref) == 1 # Get the values of multiple object refs in parallel. assert ray.get([ray.put(i) for i in range(3)]) == [0, 1, 2] # You can also set a timeout to return early from a ``get`` # that's blocking for too long. from ray.exceptions import GetTimeoutError # ``GetTimeoutError`` is a subclass of ``TimeoutError``. @ray.remote def long_running_function(): time.sleep(8) obj_ref = long_running_function.remote() try: ray.get(obj_ref, timeout=4) except GetTimeoutError: # You can capture the standard "TimeoutError" instead print("`get` timed out.") .. testoutput:: `get` timed out. .. tab-item:: Java .. code-block:: java // Get the value of one object ref. ObjectRef objRef = Ray.put(1); Assert.assertTrue(objRef.get() == 1); // You can also set a timeout(ms) to return early from a ``get`` that's blocking for too long. Assert.assertTrue(objRef.get(1000) == 1); // Get the values of multiple object refs in parallel. List> objectRefs = new ArrayList<>(); for (int i = 0; i < 3; i++) { objectRefs.add(Ray.put(i)); } List results = Ray.get(objectRefs); Assert.assertEquals(results, ImmutableList.of(0, 1, 2)); // Ray.get timeout example: Ray.get will throw an RayTimeoutException if time out. 
public class MyRayApp { public static int slowFunction() throws InterruptedException { TimeUnit.SECONDS.sleep(10); return 1; } } Assert.assertThrows(RayTimeoutException.class, () -> Ray.get(Ray.task(MyRayApp::slowFunction).remote(), 3000)); .. tab-item:: C++ .. code-block:: c++ // Get the value of one object ref. ray::ObjectRef obj_ref = ray::Put(1); assert(*obj_ref.Get() == 1); // Get the values of multiple object refs in parallel. std::vector> obj_refs; for (int i = 0; i < 3; i++) { obj_refs.emplace_back(ray::Put(i)); } auto results = ray::Get(obj_refs); assert(results.size() == 3); assert(*results[0] == 0); assert(*results[1] == 1); assert(*results[2] == 2); Passing Object Arguments ------------------------ Ray object references can be freely passed around a Ray application. This means that they can be passed as arguments to tasks, actor methods, and even stored in other objects. Objects are tracked via *distributed reference counting*, and their data is automatically freed once all references to the object are deleted. There are two different ways one can pass an object to a Ray task or method. Depending on the way an object is passed, Ray will decide whether to *de-reference* the object prior to task execution. **Passing an object as a top-level argument**: When an object is passed directly as a top-level argument to a task, Ray will de-reference the object. This means that Ray will fetch the underlying data for all top-level object reference arguments, not executing the task until the object data becomes fully available. .. literalinclude:: doc_code/obj_val.py **Passing an object as a nested argument**: When an object is passed within a nested object, for example, within a Python list, Ray will *not* de-reference it. This means that the task will need to call ``ray.get()`` on the reference to fetch the concrete value. However, if the task never calls ``ray.get()``, then the object value never needs to be transferred to the machine the task is running on. We recommend passing objects as top-level arguments where possible, but nested arguments can be useful for passing objects on to other tasks without needing to see the data. .. literalinclude:: doc_code/obj_ref.py The top-level vs not top-level passing convention also applies to actor constructors and actor method calls: .. testcode:: @ray.remote class Actor: def __init__(self, arg): pass def method(self, arg): pass obj = ray.put(2) # Examples of passing objects to actor constructors. actor_handle = Actor.remote(obj) # by-value actor_handle = Actor.remote([obj]) # by-reference # Examples of passing objects to actor method calls. actor_handle.method.remote(obj) # by-value actor_handle.method.remote([obj]) # by-reference Closure Capture of Objects -------------------------- You can also pass objects to tasks via *closure-capture*. This can be convenient when you have a large object that you want to share verbatim between many tasks or actors, and don't want to pass it repeatedly as an argument. Be aware however that defining a task that closes over an object ref will pin the object via reference-counting, so the object will not be evicted until the job completes. .. literalinclude:: doc_code/obj_capture.py Nested Objects -------------- Ray also supports nested object references. This allows you to build composite objects that themselves hold references to further sub-objects. .. testcode:: # Objects can be nested within each other. 
Ray will keep the inner object # alive via reference counting until all outer object references are deleted. object_ref_2 = ray.put([object_ref]) Fault Tolerance --------------- Ray can automatically recover from object data loss via :ref:`lineage reconstruction ` but not :ref:`owner ` failure. See :ref:`Ray fault tolerance ` for more details. More about Ray Objects ---------------------- .. toctree:: :maxdepth: 1 objects/serialization.rst objects/object-spilling.rst --- Pattern: Using an actor to synchronize other tasks and actors ============================================================= When you have multiple tasks that need to wait on some condition or otherwise need to synchronize across tasks & actors on a cluster, you can use a central actor to coordinate among them. Example use case ---------------- You can use an actor to implement a distributed ``asyncio.Event`` that multiple tasks can wait on. Code example ------------ .. literalinclude:: ../doc_code/actor-sync.py --- Anti-pattern: Closure capturing large objects harms performance =============================================================== **TLDR:** Avoid closure capturing large objects in remote functions or classes, use object store instead. When you define a :func:`ray.remote ` function or class, it is easy to accidentally capture large (more than a few MB) objects implicitly in the definition. This can lead to slow performance or even OOM since Ray is not designed to handle serialized functions or classes that are very large. For such large objects, there are two options to resolve this problem: - Use :func:`ray.put() ` to put the large objects in the Ray object store, and then pass object references as arguments to the remote functions or classes (*"better approach #1"* below) - Create the large objects inside the remote functions or classes by passing a lambda method (*"better approach #2"*). This is also the only option for using unserializable objects. Code example ------------ **Anti-pattern:** .. literalinclude:: ../doc_code/anti_pattern_closure_capture_large_objects.py :language: python :start-after: __anti_pattern_start__ :end-before: __anti_pattern_end__ **Better approach #1:** .. literalinclude:: ../doc_code/anti_pattern_closure_capture_large_objects.py :language: python :start-after: __better_approach_1_start__ :end-before: __better_approach_1_end__ **Better approach #2:** .. literalinclude:: ../doc_code/anti_pattern_closure_capture_large_objects.py :language: python :start-after: __better_approach_2_start__ :end-before: __better_approach_2_end__ --- Pattern: Using asyncio to run actor methods concurrently ======================================================== By default, a Ray :ref:`actor ` runs in a single thread and actor method calls are executed sequentially. This means that a long running method call blocks all the following ones. In this pattern, we use ``await`` to yield control from the long running method call so other method calls can run concurrently. Normally the control is yielded when the method is doing IO operations but you can also use ``await asyncio.sleep(0)`` to yield control explicitly. .. note:: You can also use :ref:`threaded actors ` to achieve concurrency. Example use case ---------------- You have an actor with a long polling method that continuously fetches tasks from the remote store and executes them. You also want to query the number of tasks executed while the long polling method is running. With the default actor, the code will look like this: .. 
literalinclude:: ../doc_code/pattern_async_actor.py :language: python :start-after: __sync_actor_start__ :end-before: __sync_actor_end__ This is problematic because ``TaskExecutor.run`` method runs forever and never yields control to run other methods. We can solve this problem by using :ref:`async actors ` and use ``await`` to yield control: .. literalinclude:: ../doc_code/pattern_async_actor.py :language: python :start-after: __async_actor_start__ :end-before: __async_actor_end__ Here, instead of using the blocking :func:`ray.get() ` to get the value of an ObjectRef, we use ``await`` so it can yield control while we are waiting for the object to be fetched. --- .. _forking-ray-processes-antipattern: Anti-pattern: Forking new processes in application code ======================================================== **Summary:** Don't fork new processes in Ray application code—for example, in driver, tasks or actors. Instead, use the "spawn" method to start new processes or use Ray tasks and actors to parallelize your workload Ray manages the lifecycle of processes for you. Ray Objects, Tasks, and Actors manage sockets to communicate with the Raylet and the GCS. If you fork new processes in your application code, the processes could share the same sockets without any synchronization. This can lead to corrupted messages and unexpected behavior. The solution is to: 1. use the "spawn" method to start new processes so that the parent process's memory space is not copied to the child processes or 2. use Ray tasks and actors to parallelize your workload and let Ray manage the lifecycle of the processes for you. Code example ------------ .. literalinclude:: ../doc_code/anti_pattern_fork_new_processes.py :language: python --- .. _generator-pattern: Pattern: Using generators to reduce heap memory usage ===================================================== In this pattern, we use **generators** in Python to reduce the total heap memory usage during a task. The key idea is that for tasks that return multiple objects, we can return them one at a time instead of all at once. This allows a worker to free the heap memory used by a previous return value before returning the next one. Example use case ---------------- You have a task that returns multiple large values. Another possibility is a task that returns a single large value, but you want to stream this value through Ray's object store by breaking it up into smaller chunks. Using normal Python functions, we can write such a task like this. Here's an example that returns numpy arrays of size 100MB each: .. literalinclude:: ../doc_code/pattern_generators.py :language: python :start-after: __large_values_start__ :end-before: __large_values_end__ However, this will require the task to hold all ``num_returns`` arrays in heap memory at the same time at the end of the task. If there are many return values, this can lead to high heap memory usage and potentially an out-of-memory error. We can fix the above example by rewriting ``large_values`` as a **generator**. Instead of returning all values at once as a tuple or list, we can ``yield`` one value at a time. .. literalinclude:: ../doc_code/pattern_generators.py :language: python :start-after: __large_values_generator_start__ :end-before: __large_values_generator_end__ Code example ------------ .. literalinclude:: ../doc_code/pattern_generators.py :language: python :start-after: __program_start__ .. code-block:: text $ RAY_IGNORE_UNHANDLED_ERRORS=1 python test.py 100 Using normal functions... ... 
-- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker... Worker failed Using generators... (large_values_generator pid=373609) yielded return value 0 (large_values_generator pid=373609) yielded return value 1 (large_values_generator pid=373609) yielded return value 2 ... Success! --- Anti-pattern: Using global variables to share state between tasks and actors ============================================================================ **TLDR:** Don't use global variables to share state with tasks and actors. Instead, encapsulate the global variables in an actor and pass the actor handle to other tasks and actors. Ray drivers, tasks and actors are running in different processes, so they don’t share the same address space. This means that if you modify global variables in one process, changes are not reflected in other processes. The solution is to use an actor's instance variables to hold the global state and pass the actor handle to places where the state needs to be modified or accessed. Note that using class variables to manage state between instances of the same class is not supported. Each actor instance is instantiated in its own process, so each actor will have its own copy of the class variables. Code example ------------ **Anti-pattern:** .. literalinclude:: ../doc_code/anti_pattern_global_variables.py :language: python :start-after: __anti_pattern_start__ :end-before: __anti_pattern_end__ **Better approach:** .. literalinclude:: ../doc_code/anti_pattern_global_variables.py :language: python :start-after: __better_approach_start__ :end-before: __better_approach_end__ --- .. _core-patterns: Design Patterns & Anti-patterns =============================== This section is a collection of common design patterns and anti-patterns for writing Ray applications. .. toctree:: :maxdepth: 1 nested-tasks generators limit-pending-tasks limit-running-tasks concurrent-operations-async-actor actor-sync tree-of-actors pipelining return-ray-put nested-ray-get ray-get-loop unnecessary-ray-get ray-get-submission-order ray-get-too-many-objects too-fine-grained-tasks redefine-task-actor-loop pass-large-arg-by-value closure-capture-large-objects global-variables out-of-band-object-ref-serialization fork-new-processes --- .. _core-patterns-limit-pending-tasks: Pattern: Using ray.wait to limit the number of pending tasks ============================================================ In this pattern, we use :func:`ray.wait() ` to limit the number of pending tasks. If we continuously submit tasks faster than their process time, we will accumulate tasks in the pending task queue, which can eventually cause OOM. With ``ray.wait()``, we can apply backpressure and limit the number of pending tasks so that the pending task queue won't grow indefinitely and cause OOM. .. note:: If we submit a finite number of tasks, it's unlikely that we will hit the issue mentioned above since each task only uses a small amount of memory for bookkeeping in the queue. It's more likely to happen when we have an infinite stream of tasks to run. .. note:: This method is meant primarily to limit how many tasks should be in flight at the same time. It can also be used to limit how many tasks can run *concurrently*, but it is not recommended, as it can hurt scheduling performance. 
Ray automatically decides task parallelism based on resource availability, so the recommended method for adjusting how many tasks can run concurrently is to :ref:`modify each task's resource requirements ` instead. Example use case ---------------- You have a worker actor that processes tasks at a rate of X tasks per second and you want to submit tasks to it at a rate lower than X to avoid OOM. For example, Ray Serve uses this pattern to limit the number of pending queries for each worker. .. figure:: ../images/limit-pending-tasks.svg Limit number of pending tasks Code example ------------ **Without backpressure:** .. literalinclude:: ../doc_code/limit_pending_tasks.py :language: python :start-after: __without_backpressure_start__ :end-before: __without_backpressure_end__ **With backpressure:** .. literalinclude:: ../doc_code/limit_pending_tasks.py :language: python :start-after: __with_backpressure_start__ :end-before: __with_backpressure_end__ --- .. _core-patterns-limit-running-tasks: Pattern: Using resources to limit the number of concurrently running tasks ========================================================================== In this pattern, we use :ref:`resources ` to limit the number of concurrently running tasks. By default, Ray tasks require 1 CPU each and Ray actors require 0 CPU each, so the scheduler limits task concurrency to the available CPUs and actor concurrency to infinite. Tasks that use more than 1 CPU (e.g., via multithreading) may experience slowdown due to interference from concurrent ones, but otherwise are safe to run. However, tasks or actors that use more than their proportionate share of memory may overload a node and cause issues like OOM. If that is the case, we can reduce the number of concurrently running tasks or actors on each node by increasing the amount of resources requested by them. This works because Ray makes sure that the sum of the resource requirements of all of the concurrently running tasks and actors on a given node does not exceed the node's total resources. .. note:: For actor tasks, the number of running actors limits the number of concurrently running actor tasks we can have. Example use case ---------------- You have a data processing workload that processes each input file independently using Ray :ref:`remote functions `. Since each task needs to load the input data into heap memory and do the processing, running too many of them can cause OOM. In this case, you can use the ``memory`` resource to limit the number of concurrently running tasks (usage of other resources like ``num_cpus`` can achieve the same goal as well). Note that similar to ``num_cpus``, the ``memory`` resource requirement is *logical*, meaning that Ray will not enforce the physical memory usage of each task if it exceeds this amount. Code example ------------ **Without limit:** .. literalinclude:: ../doc_code/limit_running_tasks.py :language: python :start-after: __without_limit_start__ :end-before: __without_limit_end__ **With limit:** .. literalinclude:: ../doc_code/limit_running_tasks.py :language: python :start-after: __with_limit_start__ :end-before: __with_limit_end__ --- .. _nested-ray-get: Anti-pattern: Calling ray.get on task arguments harms performance ================================================================= **TLDR:** If possible, pass ``ObjectRefs`` as direct task arguments, instead of passing a list as the task argument and then calling :func:`ray.get() ` inside the task. 
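For instance, here is a minimal sketch of the two approaches, using made-up task names rather than the example in the code listing below:

.. code-block:: python

    import ray

    ray.init()

    @ray.remote
    def preprocess(x):
        return x * 2

    # Anti-pattern: the ObjectRefs are hidden inside a list, so the task
    # has to call ray.get() itself and may block while holding a worker.
    @ray.remote
    def combine_with_get(refs):
        return sum(ray.get(refs))

    # Preferred: pass the ObjectRefs as direct arguments. Ray resolves them
    # to values before the task starts, so no ray.get() is needed inside.
    @ray.remote
    def combine(*values):
        return sum(values)

    refs = [preprocess.remote(i) for i in range(3)]
    assert ray.get(combine_with_get.remote(refs)) == 6  # works, but blocks inside the task
    assert ray.get(combine.remote(*refs)) == 6           # preferred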
When a task calls ``ray.get()``, it must block until the value of the ``ObjectRef`` is ready. If all cores are already occupied, this situation can lead to a deadlock, as the task that produces the ``ObjectRef``'s value may need the caller task's resources in order to run. To handle this issue, if the caller task would block in ``ray.get()``, Ray temporarily releases the caller's CPU resources to allow the pending task to run. This behavior can harm performance and stability because the caller continues to use a process and memory to hold its stack while other tasks run. Therefore, it is always better to pass ``ObjectRefs`` as direct arguments to a task and avoid calling ``ray.get`` inside of the task, if possible. For example, in the following code, prefer the latter method of invoking the dependent task. .. literalinclude:: ../doc_code/anti_pattern_nested_ray_get.py :language: python :start-after: __anti_pattern_start__ :end-before: __anti_pattern_end__ Avoiding ``ray.get`` in nested tasks may not always be possible. Some valid reasons to call ``ray.get`` include: - :doc:`nested-tasks` - If the nested task has multiple ``ObjectRefs`` to ``ray.get``, and it wants to choose the order and number to get. --- .. _nested-tasks: Pattern: Using nested tasks to achieve nested parallelism ========================================================= In this pattern, a remote task can dynamically call other remote tasks (including itself) for nested parallelism. This is useful when sub-tasks can be parallelized. Keep in mind, though, that nested tasks come with their own cost: extra worker processes, scheduling overhead, bookkeeping overhead, etc. To achieve speedup with nested parallelism, make sure each of your nested tasks does significant work. See :doc:`too-fine-grained-tasks` for more details. Example use case ---------------- You want to quick-sort a large list of numbers. By using nested tasks, we can sort the list in a distributed and parallel fashion. .. figure:: ../images/tree-of-tasks.svg Tree of tasks Code example ------------ .. literalinclude:: ../doc_code/pattern_nested_tasks.py :language: python :start-after: __pattern_start__ :end-before: __pattern_end__ We call :func:`ray.get() ` after both ``quick_sort_distributed`` function invocations take place. This allows you to maximize parallelism in the workload. See :doc:`ray-get-loop` for more details. Notice in the execution times above that with smaller tasks, the non-distributed version is faster. However, as the task execution time increases, i.e. because the lists to sort are larger, the distributed version is faster. --- .. _ray-out-of-band-object-ref-serialization: Anti-pattern: Serialize ray.ObjectRef out of band ================================================= **TLDR:** Avoid serializing ``ray.ObjectRef`` because Ray can't know when to garbage collect the underlying object. Ray's ``ray.ObjectRef`` is distributed reference counted. Ray pins the underlying object until the reference isn't used by the system anymore. When all references to the pinned object are gone, Ray garbage collects the pinned object and cleans it up from the system. However, if user code serializes ``ray.ObjectRef``, Ray can't keep track of the reference. To avoid incorrect behavior, if ``ray.cloudpickle`` serializes ``ray.ObjectRef``, Ray pins the object for the lifetime of a worker. "Pin" means that object can't be evicted from the object store until the corresponding owner worker dies. It's prone to Ray object leaks, which can lead to disk spilling. 
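For illustration, the following minimal sketch shows what out-of-band serialization can look like in a driver script. It assumes the default setting that permits out-of-band serialization, and is only a sketch of the anti-pattern, not the example in the code listing below:

.. code-block:: python

    import ray
    from ray import cloudpickle

    ray.init()

    ref = ray.put("some large object")

    # Out-of-band serialization: Ray can no longer track this reference,
    # so it pins the underlying object for the lifetime of this worker.
    blob = cloudpickle.dumps(ref)

    # The reference still works after deserialization, but the pinned
    # object can't be evicted until the owner worker exits, which can
    # leak object store memory and lead to disk spilling.
    restored = cloudpickle.loads(blob)
    assert ray.get(restored) == "some large object"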
See :ref:`this page ` for more details. To detect if this pattern exists in your code, you can set an environment variable ``RAY_allow_out_of_band_object_ref_serialization=0``. If Ray detects that ``ray.cloudpickle`` serialized ``ray.ObjectRef``, it raises an exception with helpful messages. Code example ------------ **Anti-pattern:** .. literalinclude:: ../doc_code/anti_pattern_out_of_band_object_ref_serialization.py :language: python :start-after: __anti_pattern_start__ :end-before: __anti_pattern_end__ --- .. _ray-pass-large-arg-by-value: Anti-pattern: Passing the same large argument by value repeatedly harms performance =================================================================================== **TLDR:** Avoid passing the same large argument by value to multiple tasks, use :func:`ray.put() ` and pass by reference instead. When passing a large argument (>100KB) by value to a task, Ray will implicitly store the argument in the object store and the worker process will fetch the argument to the local object store from the caller's object store before running the task. If we pass the same large argument to multiple tasks, Ray will end up storing multiple copies of the argument in the object store since Ray doesn't do deduplication. Instead of passing the large argument by value to multiple tasks, we should use ``ray.put()`` to store the argument to the object store once and get an ``ObjectRef``, then pass the argument reference to tasks. This way, we make sure all tasks use the same copy of the argument, which is faster and uses less object store memory. Code example ------------ **Anti-pattern:** .. literalinclude:: ../doc_code/anti_pattern_pass_large_arg_by_value.py :language: python :start-after: __anti_pattern_start__ :end-before: __anti_pattern_end__ **Better approach:** .. literalinclude:: ../doc_code/anti_pattern_pass_large_arg_by_value.py :language: python :start-after: __better_approach_start__ :end-before: __better_approach_end__ --- Pattern: Using pipelining to increase throughput ================================================ If you have multiple work items and each requires several steps to complete, you can use the `pipelining `__ technique to improve the cluster utilization and increase the throughput of your system. .. note:: Pipelining is an important technique to improve the performance and is heavily used by Ray libraries. See :ref:`Ray Data ` as an example. .. figure:: ../images/pipelining.svg Example use case ---------------- A component of your application needs to do both compute-intensive work and communicate with other processes. Ideally, you want to overlap computation and communication to saturate the CPU and increase the overall throughput. Code example ------------ .. literalinclude:: ../doc_code/pattern_pipelining.py In the example above, a worker actor pulls work off of a queue and then does some computation on it. Without pipelining, we call :func:`ray.get() ` immediately after requesting a work item, so we block while that RPC is in flight, causing idle CPU time. With pipelining, we instead preemptively request the next work item before processing the current one, so we can use the CPU while the RPC is in flight which increases the CPU utilization. --- .. _ray-get-loop: Anti-pattern: Calling ray.get in a loop harms parallelism ========================================================= **TLDR:** Avoid calling :func:`ray.get() ` in a loop since it's a blocking call; use ``ray.get()`` only for the final result. 
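For instance, a minimal sketch of the difference, with an illustrative task rather than the one in the code listing below:

.. code-block:: python

    import ray

    ray.init()

    @ray.remote
    def square(x):
        return x * x

    # Anti-pattern: ray.get() inside the loop blocks on every task,
    # so the tasks effectively run one after another.
    sequential = [ray.get(square.remote(i)) for i in range(4)]

    # Better: submit all tasks first, then fetch all results at once,
    # so the tasks can run in parallel.
    refs = [square.remote(i) for i in range(4)]
    parallel = ray.get(refs)

    assert sequential == parallel == [0, 1, 4, 9]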
A call to ``ray.get()`` fetches the results of remotely executed functions. However, it is a blocking call, which means that it always waits until the requested result is available. If you call ``ray.get()`` in a loop, the loop will not continue to run until the call to ``ray.get()`` is resolved. If you also spawn the remote function calls in the same loop, you end up with no parallelism at all, as you wait for the previous function call to finish (because of ``ray.get()``) and only spawn the next call in the next iteration of the loop. The solution here is to separate the call to ``ray.get()`` from the call to the remote functions. That way all remote functions are spawned before we wait for the results and can run in parallel in the background. Additionally, you can pass a list of object references to ``ray.get()`` instead of calling it one by one to wait for all of the tasks to finish. Code example ------------ .. literalinclude:: ../doc_code/anti_pattern_ray_get_loop.py :language: python :start-after: __anti_pattern_start__ :end-before: __anti_pattern_end__ .. figure:: ../images/ray-get-loop.svg Calling ``ray.get()`` in a loop When calling ``ray.get()`` right after scheduling the remote work, the loop blocks until the result is received. We thus end up with sequential processing. Instead, we should first schedule all remote calls, which are then processed in parallel. After scheduling the work, we can then request all the results at once. Other ``ray.get()`` related anti-patterns are: - :doc:`nested-ray-get` - :doc:`unnecessary-ray-get` - :doc:`ray-get-submission-order` - :doc:`ray-get-too-many-objects` --- Anti-pattern: Processing results in submission order using ray.get increases runtime ==================================================================================== **TLDR:** Avoid processing independent results in submission order using :func:`ray.get() ` since results may be ready in a different order than the submission order. A batch of tasks is submitted, and we need to process their results individually once they’re done. If each task takes a different amount of time to finish and we process results in submission order, we may waste time waiting for all of the slower (straggler) tasks that were submitted earlier to finish while later faster tasks have already finished. Instead, we want to process the tasks in the order that they finish using :func:`ray.wait() ` to speed up total time to completion. .. figure:: ../images/ray-get-submission-order.svg Processing results in submission order vs completion order Code example ------------ .. literalinclude:: ../doc_code/anti_pattern_ray_get_submission_order.py :language: python :start-after: __anti_pattern_start__ :end-before: __anti_pattern_end__ Other ``ray.get()`` related anti-patterns are: - :doc:`unnecessary-ray-get` - :doc:`ray-get-loop` --- .. _ray-get-too-many-objects: Anti-pattern: Fetching too many objects at once with ray.get causes failure =========================================================================== **TLDR:** Avoid calling :func:`ray.get() ` on too many objects since this will lead to heap out-of-memory or object store out-of-space. Instead fetch and process one batch at a time. If you have a large number of tasks that you want to run in parallel, trying to do ``ray.get()`` on all of them at once could lead to failure with heap out-of-memory or object store out-of-space since Ray needs to fetch all the objects to the caller at the same time. 
Instead you should get and process the results one batch at a time. Once a batch is processed, Ray will evict objects in that batch to make space for future batches. .. figure:: ../images/ray-get-too-many-objects.svg Fetching too many objects at once with ``ray.get()`` Code example ------------ **Anti-pattern:** .. literalinclude:: ../doc_code/anti_pattern_ray_get_too_many_objects.py :language: python :start-after: __anti_pattern_start__ :end-before: __anti_pattern_end__ **Better approach:** .. literalinclude:: ../doc_code/anti_pattern_ray_get_too_many_objects.py :language: python :start-after: __better_approach_start__ :end-before: __better_approach_end__ Here besides getting one batch at a time to avoid failure, we are also using ``ray.wait()`` to process results in the finish order instead of the submission order to reduce the runtime. See :doc:`ray-get-submission-order` for more details. --- Anti-pattern: Redefining the same remote function or class harms performance ============================================================================ **TLDR:** Avoid redefining the same remote function or class. Decorating the same function or class multiple times using the :func:`ray.remote ` decorator leads to slow performance in Ray. For each Ray remote function or class, Ray will pickle it and upload to GCS. Later on, the worker that runs the task or actor will download and unpickle it. Each decoration of the same function or class generates a new remote function or class from Ray's perspective. As a result, the pickle, upload, download and unpickle work will happen every time we redefine and run the remote function or class. Code example ------------ **Anti-pattern:** .. literalinclude:: ../doc_code/anti_pattern_redefine_task_actor_loop.py :language: python :start-after: __anti_pattern_start__ :end-before: __anti_pattern_end__ **Better approach:** .. literalinclude:: ../doc_code/anti_pattern_redefine_task_actor_loop.py :language: python :start-after: __better_approach_start__ :end-before: __better_approach_end__ We should define the same remote function or class outside of the loop instead of multiple times inside a loop so that it's pickled and uploaded only once. --- Anti-pattern: Returning ray.put() ObjectRefs from a task harms performance and fault tolerance ============================================================================================== **TLDR:** Avoid calling :func:`ray.put() ` on task return values and returning the resulting ObjectRefs. Instead, return these values directly if possible. Returning ray.put() ObjectRefs are considered anti-patterns for the following reasons: - It disallows inlining small return values: Ray has a performance optimization to return small (<= 100KB) values inline directly to the caller, avoiding going through the distributed object store. On the other hand, ``ray.put()`` will unconditionally store the value to the object store which makes the optimization for small return values impossible. - Returning ObjectRefs involves extra distributed reference counting protocol which is slower than returning the values directly. - It's less :ref:`fault tolerant `: the worker process that calls ``ray.put()`` is the "owner" of the returned ``ObjectRef`` and the return value fate shares with the owner. If the worker process dies, the return value is lost. In contrast, the caller process (often the driver) is the owner of the return value if it's returned directly. 
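As a minimal sketch of the difference, with hypothetical tasks rather than the examples in the next section:

.. code-block:: python

    import ray

    ray.init()

    # Anti-pattern: the task returns an ObjectRef created with ray.put(), so
    # the caller needs two ray.get() calls and the inner object fate-shares
    # with the worker that created it.
    @ray.remote
    def return_via_put():
        return ray.put(1)

    inner_ref = ray.get(return_via_put.remote())
    assert ray.get(inner_ref) == 1

    # Better: return the value directly. Small values are returned inline to
    # the caller, and the caller owns the result.
    @ray.remote
    def return_directly():
        return 1

    assert ray.get(return_directly.remote()) == 1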
Code example ------------ If you want to return a single value regardless if it's small or large, you should return it directly. .. literalinclude:: ../doc_code/anti_pattern_return_ray_put.py :language: python :start-after: __return_single_value_start__ :end-before: __return_single_value_end__ If you want to return multiple values and you know the number of returns before calling the task, you should use the :ref:`num_returns ` option. .. literalinclude:: ../doc_code/anti_pattern_return_ray_put.py :language: python :start-after: __return_static_multi_values_start__ :end-before: __return_static_multi_values_end__ If you don't know the number of returns before calling the task, you should use the :ref:`dynamic generator ` pattern if possible. .. literalinclude:: ../doc_code/anti_pattern_return_ray_put.py :language: python :start-after: __return_dynamic_multi_values_start__ :end-before: __return_dynamic_multi_values_end__ --- Anti-pattern: Over-parallelizing with too fine-grained tasks harms speedup ========================================================================== **TLDR:** Avoid over-parallelizing. Parallelizing tasks has higher overhead than using normal functions. Parallelizing or distributing tasks usually comes with higher overhead than an ordinary function call. Therefore, if you parallelize a function that executes very quickly, the overhead could take longer than the actual function call! To handle this problem, we should be careful about parallelizing too much. If you have a function or task that’s too small, you can use a technique called **batching** to make your tasks do more meaningful work in a single call. Code example ------------ **Anti-pattern:** .. literalinclude:: ../doc_code/anti_pattern_too_fine_grained_tasks.py :language: python :start-after: __anti_pattern_start__ :end-before: __anti_pattern_end__ **Better approach:** Use batching. .. literalinclude:: ../doc_code/anti_pattern_too_fine_grained_tasks.py :language: python :start-after: __batching_start__ :end-before: __batching_end__ As we can see from the example above, over-parallelizing has higher overhead and the program runs slower than the serial version. Through batching with a proper batch size, we are able to amortize the overhead and achieve the expected speedup. --- Pattern: Using a supervisor actor to manage a tree of actors ============================================================ Actor supervision is a pattern in which a supervising actor manages a collection of worker actors. The supervisor delegates tasks to subordinates and handles their failures. This pattern simplifies the driver since it manages only a few supervisors and does not deal with failures from worker actors directly. Furthermore, multiple supervisors can act in parallel to parallelize more work. .. figure:: ../images/tree-of-actors.svg Tree of actors .. note:: - If the supervisor dies (or the driver), the worker actors are automatically terminated thanks to actor reference counting. - Actors can be nested to multiple levels to form a tree. Example use case ---------------- You want to do data parallel training and train the same model with different hyperparameters in parallel. For each hyperparameter, you can launch a supervisor actor to do the orchestration and it will create worker actors to do the actual training per data shard. .. 
note:: For data parallel training and hyperparameter tuning, it's recommended to use :ref:`Ray Train ` (:py:class:`~ray.train.data_parallel_trainer.DataParallelTrainer` and :ref:`Ray Tune's Tuner `) which applies this pattern under the hood. Code example ------------ .. literalinclude:: ../doc_code/pattern_tree_of_actors.py :language: python --- .. _unnecessary-ray-get: Anti-pattern: Calling ray.get unnecessarily harms performance ============================================================= **TLDR:** Avoid calling :func:`ray.get() ` unnecessarily for intermediate steps. Work with object references directly, and only call ``ray.get()`` at the end to get the final result. When ``ray.get()`` is called, objects must be transferred to the worker/node that calls ``ray.get()``. If you don't need to manipulate the object, you probably don't need to call ``ray.get()`` on it! Typically, it's best practice to wait as long as possible before calling ``ray.get()``, or even design your program to avoid having to call ``ray.get()`` at all. Code example ------------ **Anti-pattern:** .. literalinclude:: ../doc_code/anti_pattern_unnecessary_ray_get.py :language: python :start-after: __anti_pattern_start__ :end-before: __anti_pattern_end__ .. figure:: ../images/unnecessary-ray-get-anti.svg **Better approach:** .. literalinclude:: ../doc_code/anti_pattern_unnecessary_ray_get.py :language: python :start-after: __better_approach_start__ :end-before: __better_approach_end__ .. figure:: ../images/unnecessary-ray-get-better.svg Notice that in the anti-pattern example, we call ``ray.get()``, which forces us to transfer the large rollout to the driver, then again to the *reduce* worker. In the fixed version, we only pass the reference to the object to the *reduce* task. The ``reduce`` worker will implicitly call ``ray.get()`` to fetch the actual rollout data directly from the ``generate_rollout`` worker, avoiding the extra copy to the driver. Other ``ray.get()`` related anti-patterns are: - :doc:`ray-get-loop` - :doc:`ray-get-submission-order` --- .. _ray-dag-guide: Lazy Computation Graphs with the Ray DAG API ============================================ With ``ray.remote`` you have the flexibility of running an application where computation is executed remotely at runtime. For a ``ray.remote`` decorated class or function, you can also use ``.bind`` on the body to build a static computation graph. .. note:: Ray DAG is designed to be a developer-facing API. The recommended use cases are 1) locally iterating and testing applications authored with higher-level libraries, and 2) building libraries on top of the Ray DAG APIs. .. note:: Ray has introduced an experimental API for high-performance workloads that is especially well suited for applications using multiple GPUs. This API is built on top of the Ray DAG API. See :ref:`Ray Compiled Graph ` for more details. When ``.bind()`` is called on a ``ray.remote`` decorated class or function, it generates an intermediate representation (IR) node. These IR nodes are the building blocks that statically hold the computation graph together, and each IR node is resolved to a value at execution time in topological order. An IR node can also be assigned to a variable and passed into other nodes as an argument. Ray DAG with functions ---------------------- The IR node generated by ``.bind()`` on a ``ray.remote`` decorated function is executed as a Ray task at execution time and resolves to the task's output.
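As a minimal sketch, assuming only a trivial ``ray.remote`` function, building and executing a two-node DAG looks roughly like this:

.. code-block:: python

    import ray

    ray.init()

    @ray.remote
    def double(x):
        return 2 * x

    # .bind() builds the graph lazily; nothing runs yet.
    dag = double.bind(double.bind(5))

    # Executing the root node runs the whole DAG as Ray tasks and
    # returns an ObjectRef for the root node's output.
    assert ray.get(dag.execute()) == 20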
The following example shows how to build a chain of functions where each node can be executed as the root node while iterating, or used as input args or kwargs of other functions to form more complex DAGs. Any IR node can be executed directly with ``dag_node.execute()``; it acts as the root of the DAG, and all nodes that aren't reachable from the root are ignored. .. tab-set:: .. tab-item:: Python .. literalinclude:: ./doc_code/ray-dag.py :language: python :start-after: __dag_tasks_begin__ :end-before: __dag_tasks_end__ Ray DAG with classes and class methods -------------------------------------- The IR node generated by ``.bind()`` on a ``ray.remote`` decorated class is executed as a Ray actor upon execution. The actor is instantiated every time the node is executed, and the class method calls can form a chain of function calls specific to the parent actor instance. DAG IR nodes generated from a function, class, or class method can be combined together to form a DAG. .. tab-set:: .. tab-item:: Python .. literalinclude:: ./doc_code/ray-dag.py :language: python :start-after: __dag_actors_begin__ :end-before: __dag_actors_end__ Ray DAG with custom InputNode ----------------------------- ``InputNode`` is the singleton node of a DAG that represents the user input value at runtime. It should be used within a context manager with no args, and passed as an arg to ``dag_node.execute()``. .. tab-set:: .. tab-item:: Python .. literalinclude:: ./doc_code/ray-dag.py :language: python :start-after: __dag_input_node_begin__ :end-before: __dag_input_node_end__ Ray DAG with multiple MultiOutputNode ------------------------------------- ``MultiOutputNode`` is useful when you have more than one output from a DAG. ``dag_node.execute()`` returns a list of Ray object references passed to ``MultiOutputNode``. The example below shows a ``MultiOutputNode`` with two outputs. .. tab-set:: .. tab-item:: Python .. literalinclude:: ./doc_code/ray-dag.py :language: python :start-after: __dag_multi_output_node_begin__ :end-before: __dag_multi_output_node_end__ Reuse Ray Actors in DAGs ------------------------ Actors can be a part of the DAG definition with the ``Actor.bind()`` API. However, when a DAG finishes execution, Ray kills actors created with ``bind``. You can avoid killing your actors whenever a DAG finishes by creating them with ``Actor.remote()`` instead. .. tab-set:: .. tab-item:: Python .. literalinclude:: ./doc_code/ray-dag.py :language: python :start-after: __dag_actor_reuse_begin__ :end-before: __dag_actor_reuse_end__ More resources -------------- You can find more application patterns and examples in the following resources from other Ray libraries built on top of the Ray DAG API. | `Ray Serve Compositions of Models `_ | `Visualization of Ray Compiled Graph `_ --- .. _generators: Ray Generators ============== `Python generators `_ are functions that behave like iterators, yielding one value per iteration. Ray also supports the generator API. Any generator function decorated with ``ray.remote`` becomes a Ray generator task. Generator tasks stream outputs back to the caller before the task finishes. .. code-block:: diff +import ray import time # Takes 25 seconds to finish. +@ray.remote def f(): for i in range(5): time.sleep(5) yield i -for obj in f(): +for obj_ref in f.remote(): # Prints every 5 seconds and stops after 25 seconds. - print(obj) + print(ray.get(obj_ref)) The Ray generator above yields an output every 5 seconds, 5 times. With a normal Ray task, you would have to wait 25 seconds to access the output.
With a Ray generator, the caller can access the object reference before the task ``f`` finishes. **The Ray generator is useful when** - You want to reduce heap memory or object store memory usage by yielding and garbage collecting (GC) the output before the task finishes. - You are familiar with the Python generator and want the equivalent programming models. **Ray libraries use the Ray generator to support streaming use cases** - :ref:`Ray Serve ` uses Ray generators to support :ref:`streaming responses `. - :ref:`Ray Data ` is a streaming data processing library, which uses Ray generators to control and reduce concurrent memory usages. **Ray generator works with existing Ray APIs seamlessly** - You can use Ray generators in both actor and non-actor tasks. - Ray generators work with all actor execution models, including :ref:`threaded actors ` and :ref:`async actors `. - Ray generators work with built-in :ref:`fault tolerance features ` such as retry or lineage reconstruction. - Ray generators work with Ray APIs such as :ref:`ray.wait `, :ref:`ray.cancel `, etc. Getting started --------------- Define a Python generator function and decorate it with ``ray.remote`` to create a Ray generator. .. literalinclude:: doc_code/streaming_generator.py :language: python :start-after: __streaming_generator_define_start__ :end-before: __streaming_generator_define_end__ The Ray generator task returns an ``ObjectRefGenerator`` object, which is compatible with generator and async generator APIs. You can access the ``next``, ``__iter__``, ``__anext__``, ``__aiter__`` APIs from the class. Whenever a task invokes ``yield``, a corresponding output is ready and available from a generator as a Ray object reference. You can call ``next(gen)`` to obtain an object reference. If ``next`` has no more items to generate, it raises ``StopIteration``. If ``__anext__`` has no more items to generate, it raises ``StopAsyncIteration`` The ``next`` API blocks the thread until the task generates a next object reference with ``yield``. Since the ``ObjectRefGenerator`` is just a Python generator, you can also use a for loop to iterate object references. If you want to avoid blocking a thread, you can either use asyncio or :ref:`ray.wait API `. .. literalinclude:: doc_code/streaming_generator.py :language: python :start-after: __streaming_generator_execute_start__ :end-before: __streaming_generator_execute_end__ .. note:: For a normal Python generator, a generator function is paused and resumed when ``next`` function is called on a generator. Ray eagerly executes a generator task to completion regardless of whether the caller is polling the partial results or not. Error handling -------------- If a generator task has a failure (by an application exception or system error such as an unexpected node failure), the ``next(gen)`` returns an object reference that contains an exception. When you call ``ray.get``, Ray raises the exception. .. literalinclude:: doc_code/streaming_generator.py :language: python :start-after: __streaming_generator_exception_start__ :end-before: __streaming_generator_exception_end__ In the above example, if an application fails the task, Ray returns the object reference with an exception in a correct order. For example, if Ray raises the exception after the second yield, the third ``next(gen)`` returns an object reference with an exception all the time. 
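For instance, a minimal sketch of this behavior, with an illustrative task rather than the one in the code listing above:

.. code-block:: python

    import ray

    ray.init()

    @ray.remote
    def failing_generator():
        yield 1
        raise ValueError("application error")

    gen = failing_generator.remote()

    # The first yield succeeded, so the first reference holds a normal value.
    assert ray.get(next(gen)) == 1

    # The next reference holds the application exception;
    # ray.get() re-raises it on the caller side.
    try:
        ray.get(next(gen))
    except ValueError as e:
        print("caught:", e)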
If a system error fails the task (e.g., a node failure or worker process failure), ``next(gen)`` can return the object reference that contains the system-level exception at any time, without an ordering guarantee. This means that when a task has N yields, the generator can create from 1 to N + 1 object references (N outputs plus a reference containing the system-level exception) when failures occur. Generator from Actor Tasks -------------------------- The Ray generator is compatible with **all actor execution models**. It seamlessly works with regular actors, :ref:`async actors `, and :ref:`threaded actors `. .. literalinclude:: doc_code/streaming_generator.py :language: python :start-after: __streaming_generator_actor_model_start__ :end-before: __streaming_generator_actor_model_end__ Using the Ray generator with asyncio ------------------------------------ The returned ``ObjectRefGenerator`` is also compatible with asyncio. You can use ``__anext__`` or ``async for`` loops. .. literalinclude:: doc_code/streaming_generator.py :language: python :start-after: __streaming_generator_asyncio_start__ :end-before: __streaming_generator_asyncio_end__ Garbage collection of object references --------------------------------------- The ref returned from ``next(generator)`` is just a regular Ray object reference and is distributed ref counted in the same way. If references are not consumed from a generator by the ``next`` API, they are garbage collected (GC'ed) when the generator is GC'ed. .. literalinclude:: doc_code/streaming_generator.py :language: python :start-after: __streaming_generator_gc_start__ :end-before: __streaming_generator_gc_end__ In the example above, Ray counts ``ref1`` as a normal Ray object reference after Ray returns it. Other references that aren't consumed with ``next(gen)`` are removed when the generator is GC'ed. In this example, garbage collection happens when you call ``del gen``. Fault tolerance --------------- :ref:`Fault tolerance features ` work with Ray generator tasks and actor tasks. For example: - :ref:`Task fault tolerance features `: ``max_retries``, ``retry_exceptions`` - :ref:`Actor fault tolerance features `: ``max_restarts``, ``max_task_retries`` - :ref:`Object fault tolerance features `: object reconstruction .. _generators-cancel: Cancellation ------------ The :func:`ray.cancel() ` function works with both Ray generator tasks and actor tasks. Semantically, cancelling a generator task is no different from cancelling a regular task. When you cancel a task, ``next(gen)`` can return the reference that contains :class:`TaskCancelledError ` without any special ordering guarantee. .. _generators-wait: How to wait for a generator without blocking a thread (compatibility with ray.wait and ray.get) ------------------------------------------------------------------------------------------- When using a generator, the ``next`` API blocks its thread until the next object reference is available. However, you may not want this behavior all the time. You may want to wait for a generator without blocking a thread. Non-blocking waits are possible with the Ray generator in the following ways: **Wait until a generator task completes** ``ObjectRefGenerator`` has an API ``completed``. It returns an object reference that is available when a generator task finishes or errors. For example, you can do ``ray.get(.completed())`` to wait until a task completes. Note that calling ``ray.get`` directly on an ``ObjectRefGenerator`` isn't allowed. **Use asyncio and await** ``ObjectRefGenerator`` is compatible with asyncio.
You can create multiple asyncio tasks that create a generator task and wait for it to avoid blocking a thread. .. literalinclude:: doc_code/streaming_generator.py :language: python :start-after: __streaming_generator_concurrency_asyncio_start__ :end-before: __streaming_generator_concurrency_asyncio_end__ **Use ray.wait** You can pass ``ObjectRefGenerator`` as an input to ``ray.wait``. The generator is "ready" if a next item is available. Once a generator is found in the ready list, ``next(gen)`` returns the next object reference immediately without blocking. See the example below for more details. .. literalinclude:: doc_code/streaming_generator.py :language: python :start-after: __streaming_generator_wait_simple_start__ :end-before: __streaming_generator_wait_simple_end__ All the input arguments to ``ray.wait`` (such as ``timeout``, ``num_returns``, and ``fetch_local``) work with generators. ``ray.wait`` can also mix regular Ray object references with generators in its input list; in this case, the input arguments work the same way. .. literalinclude:: doc_code/streaming_generator.py :language: python :start-after: __streaming_generator_wait_complex_start__ :end-before: __streaming_generator_wait_complex_end__ Thread safety ------------- The ``ObjectRefGenerator`` object is not thread-safe. Limitation ---------- Ray generators don't support these features: - ``throw``, ``send``, and ``close`` APIs. - ``return`` statements from generators. - Passing ``ObjectRefGenerator`` to another task or actor. - :ref:`Ray Client ` Deprecated Dynamic Generator ---------------------------- .. toctree:: :maxdepth: 1 tasks/dynamic_generators.rst --- .. _gpu-support: .. _accelerator-support: Accelerator Support =================== Accelerators like GPUs are critical for many machine learning apps. Ray Core natively supports many accelerators as pre-defined :ref:`resource ` types and allows tasks and actors to specify their accelerator :ref:`resource requirements `. The accelerators natively supported by Ray Core are: .. list-table:: :header-rows: 1 * - Accelerator - Ray Resource Name - Support Level * - NVIDIA GPU - GPU - Fully tested, supported by the Ray team * - AMD GPU - GPU - Experimental, supported by the community * - Intel GPU - GPU - Experimental, supported by the community * - `AWS Neuron Core `_ - neuron_cores - Experimental, supported by the community * - Google TPU - TPU - Experimental, supported by the community * - Intel Gaudi - HPU - Experimental, supported by the community * - Huawei Ascend - NPU - Experimental, supported by the community * - Rebellions RBLN - RBLN - Experimental, supported by the community * - METAX GPU - GPU - Experimental, supported by the community Starting Ray nodes with accelerators ------------------------------------ By default, Ray sets the quantity of accelerator resources of a node to the physical quantities of accelerators auto-detected by Ray. If you need to, you can :ref:`override ` this. .. tab-set:: .. tab-item:: NVIDIA GPU :sync: NVIDIA GPU .. tip:: You can set the ``CUDA_VISIBLE_DEVICES`` environment variable before starting a Ray node to limit the NVIDIA GPUs that are visible to Ray. For example, ``CUDA_VISIBLE_DEVICES=1,3 ray start --head --num-gpus=2`` lets Ray only see devices 1 and 3. .. tab-item:: AMD GPU :sync: AMD GPU .. tip:: You can set the ``ROCR_VISIBLE_DEVICES`` environment variable before starting a Ray node to limit the AMD GPUs that are visible to Ray.
For example, ``ROCR_VISIBLE_DEVICES=1,3 ray start --head --num-gpus=2`` lets Ray only see devices 1 and 3. .. tab-item:: Intel GPU :sync: Intel GPU .. tip:: You can set the ``ONEAPI_DEVICE_SELECTOR`` environment variable before starting a Ray node to limit the Intel GPUs that are visible to Ray. For example, ``ONEAPI_DEVICE_SELECTOR=1,3 ray start --head --num-gpus=2`` lets Ray only see devices 1 and 3. .. tab-item:: AWS Neuron Core :sync: AWS Neuron Core .. tip:: You can set the ``NEURON_RT_VISIBLE_CORES`` environment variable before starting a Ray node to limit the AWS Neuron Cores that are visible to Ray. For example, ``NEURON_RT_VISIBLE_CORES=1,3 ray start --head --resources='{"neuron_cores": 2}'`` lets Ray only see devices 1 and 3. See the `Amazon documentation `_ for more examples of Ray on Neuron with EKS as an orchestration substrate. .. tab-item:: Google TPU :sync: Google TPU .. tip:: You can set the ``TPU_VISIBLE_CHIPS`` environment variable before starting a Ray node to limit the Google TPUs that are visible to Ray. For example, ``TPU_VISIBLE_CHIPS=1,3 ray start --head --resources='{"TPU": 2}'`` lets Ray only see devices 1 and 3. .. tab-item:: Intel Gaudi :sync: Intel Gaudi .. tip:: You can set the ``HABANA_VISIBLE_MODULES`` environment variable before starting a Ray node to limit the Intel Gaudi HPUs that are visible to Ray. For example, ``HABANA_VISIBLE_MODULES=1,3 ray start --head --resources='{"HPU": 2}'`` lets Ray only see devices 1 and 3. .. tab-item:: Huawei Ascend :sync: Huawei Ascend .. tip:: You can set the ``ASCEND_RT_VISIBLE_DEVICES`` environment variable before starting a Ray node to limit the Huawei Ascend NPUs that are visible to Ray. For example, ``ASCEND_RT_VISIBLE_DEVICES=1,3 ray start --head --resources='{"NPU": 2}'`` lets Ray only see devices 1 and 3. .. tab-item:: Rebellions RBLN :sync: Rebellions RBLN .. tip:: You can set the ``RBLN_DEVICES`` environment variable before starting a Ray node to limit the Rebellions RBLNs that are visible to Ray. For example, ``RBLN_DEVICES=1,3 ray start --head --resources='{"RBLN": 2}'`` lets Ray only see devices 1 and 3. .. tab-item:: METAX GPU :sync: METAX GPU .. tip:: You can set the ``CUDA_VISIBLE_DEVICES`` environment variable before starting a Ray node to limit the METAX GPUs that are visible to Ray. For example, ``CUDA_VISIBLE_DEVICES=1,3 ray start --head --num-gpus=2`` lets Ray only see devices 1 and 3. .. note:: There's nothing preventing you from specifying a larger number of accelerator resources (e.g., ``num_gpus``) than the true number of accelerators on the machine given Ray resources are :ref:`logical `. In this case, Ray acts as if the machine has the number of accelerators you specified for the purposes of scheduling tasks and actors that require accelerators. Trouble only occurs if those tasks and actors attempt to actually use accelerators that don't exist. Using accelerators in Tasks and Actors -------------------------------------- If a task or actor requires accelerators, you can specify the corresponding :ref:`resource requirements ` (e.g. ``@ray.remote(num_gpus=1)``). Ray then schedules the task or actor to a node that has enough free accelerator resources and assign accelerators to the task or actor by setting the corresponding environment variable (e.g. ``CUDA_VISIBLE_DEVICES``) before running the task or actor code. .. tab-set:: .. tab-item:: NVIDIA GPU :sync: NVIDIA GPU .. 
testcode:: import os import ray ray.init(num_gpus=2) @ray.remote(num_gpus=1) class GPUActor: def ping(self): print("GPU IDs: {}".format(ray.get_runtime_context().get_accelerator_ids()["GPU"])) print("CUDA_VISIBLE_DEVICES: {}".format(os.environ["CUDA_VISIBLE_DEVICES"])) @ray.remote(num_gpus=1) def gpu_task(): print("GPU IDs: {}".format(ray.get_runtime_context().get_accelerator_ids()["GPU"])) print("CUDA_VISIBLE_DEVICES: {}".format(os.environ["CUDA_VISIBLE_DEVICES"])) gpu_actor = GPUActor.remote() ray.get(gpu_actor.ping.remote()) # The actor uses the first GPU so the task uses the second one. ray.get(gpu_task.remote()) .. testoutput:: :options: +MOCK (GPUActor pid=52420) GPU IDs: [0] (GPUActor pid=52420) CUDA_VISIBLE_DEVICES: 0 (gpu_task pid=51830) GPU IDs: [1] (gpu_task pid=51830) CUDA_VISIBLE_DEVICES: 1 .. tab-item:: AMD GPU :sync: AMD GPU .. testcode:: :hide: ray.shutdown() .. testcode:: :skipif: True import os import ray ray.init(num_gpus=2) @ray.remote(num_gpus=1) class GPUActor: def ping(self): print("GPU IDs: {}".format(ray.get_runtime_context().get_accelerator_ids()["GPU"])) print("ROCR_VISIBLE_DEVICES: {}".format(os.environ["ROCR_VISIBLE_DEVICES"])) @ray.remote(num_gpus=1) def gpu_task(): print("GPU IDs: {}".format(ray.get_runtime_context().get_accelerator_ids()["GPU"])) print("ROCR_VISIBLE_DEVICES: {}".format(os.environ["ROCR_VISIBLE_DEVICES"])) gpu_actor = GPUActor.remote() ray.get(gpu_actor.ping.remote()) # The actor uses the first GPU so the task uses the second one. ray.get(gpu_task.remote()) .. testoutput:: :options: +MOCK (GPUActor pid=52420) GPU IDs: [0] (GPUActor pid=52420) ROCR_VISIBLE_DEVICES: 0 (gpu_task pid=51830) GPU IDs: [1] (gpu_task pid=51830) ROCR_VISIBLE_DEVICES: 1 .. tab-item:: Intel GPU :sync: Intel GPU .. testcode:: :hide: ray.shutdown() .. testcode:: :skipif: True import os import ray ray.init(num_gpus=2) @ray.remote(num_gpus=1) class GPUActor: def ping(self): print("GPU IDs: {}".format(ray.get_runtime_context().get_accelerator_ids()["GPU"])) print("ONEAPI_DEVICE_SELECTOR: {}".format(os.environ["ONEAPI_DEVICE_SELECTOR"])) @ray.remote(num_gpus=1) def gpu_task(): print("GPU IDs: {}".format(ray.get_runtime_context().get_accelerator_ids()["GPU"])) print("ONEAPI_DEVICE_SELECTOR: {}".format(os.environ["ONEAPI_DEVICE_SELECTOR"])) gpu_actor = GPUActor.remote() ray.get(gpu_actor.ping.remote()) # The actor uses the first GPU so the task uses the second one. ray.get(gpu_task.remote()) .. testoutput:: :options: +MOCK (GPUActor pid=52420) GPU IDs: [0] (GPUActor pid=52420) ONEAPI_DEVICE_SELECTOR: 0 (gpu_task pid=51830) GPU IDs: [1] (gpu_task pid=51830) ONEAPI_DEVICE_SELECTOR: 1 .. tab-item:: AWS Neuron Core :sync: AWS Neuron Core .. testcode:: :hide: ray.shutdown() .. testcode:: import os import ray ray.init(resources={"neuron_cores": 2}) @ray.remote(resources={"neuron_cores": 1}) class NeuronCoreActor: def ping(self): print("Neuron Core IDs: {}".format(ray.get_runtime_context().get_accelerator_ids()["neuron_cores"])) print("NEURON_RT_VISIBLE_CORES: {}".format(os.environ["NEURON_RT_VISIBLE_CORES"])) @ray.remote(resources={"neuron_cores": 1}) def neuron_core_task(): print("Neuron Core IDs: {}".format(ray.get_runtime_context().get_accelerator_ids()["neuron_cores"])) print("NEURON_RT_VISIBLE_CORES: {}".format(os.environ["NEURON_RT_VISIBLE_CORES"])) neuron_core_actor = NeuronCoreActor.remote() ray.get(neuron_core_actor.ping.remote()) # The actor uses the first Neuron Core so the task uses the second one. ray.get(neuron_core_task.remote()) .. 
testoutput:: :options: +MOCK (NeuronCoreActor pid=52420) Neuron Core IDs: [0] (NeuronCoreActor pid=52420) NEURON_RT_VISIBLE_CORES: 0 (neuron_core_task pid=51830) Neuron Core IDs: [1] (neuron_core_task pid=51830) NEURON_RT_VISIBLE_CORES: 1 .. tab-item:: Google TPU :sync: Google TPU .. testcode:: :hide: ray.shutdown() .. testcode:: import os import ray ray.init(resources={"TPU": 2}) @ray.remote(resources={"TPU": 1}) class TPUActor: def ping(self): print("TPU IDs: {}".format(ray.get_runtime_context().get_accelerator_ids()["TPU"])) print("TPU_VISIBLE_CHIPS: {}".format(os.environ["TPU_VISIBLE_CHIPS"])) @ray.remote(resources={"TPU": 1}) def tpu_task(): print("TPU IDs: {}".format(ray.get_runtime_context().get_accelerator_ids()["TPU"])) print("TPU_VISIBLE_CHIPS: {}".format(os.environ["TPU_VISIBLE_CHIPS"])) tpu_actor = TPUActor.remote() ray.get(tpu_actor.ping.remote()) # The actor uses the first TPU so the task uses the second one. ray.get(tpu_task.remote()) .. testoutput:: :options: +MOCK (TPUActor pid=52420) TPU IDs: [0] (TPUActor pid=52420) TPU_VISIBLE_CHIPS: 0 (tpu_task pid=51830) TPU IDs: [1] (tpu_task pid=51830) TPU_VISIBLE_CHIPS: 1 .. tab-item:: Intel Gaudi :sync: Intel Gaudi .. testcode:: :hide: ray.shutdown() .. testcode:: import os import ray ray.init(resources={"HPU": 2}) @ray.remote(resources={"HPU": 1}) class HPUActor: def ping(self): print("HPU IDs: {}".format(ray.get_runtime_context().get_accelerator_ids()["HPU"])) print("HABANA_VISIBLE_MODULES: {}".format(os.environ["HABANA_VISIBLE_MODULES"])) @ray.remote(resources={"HPU": 1}) def hpu_task(): print("HPU IDs: {}".format(ray.get_runtime_context().get_accelerator_ids()["HPU"])) print("HABANA_VISIBLE_MODULES: {}".format(os.environ["HABANA_VISIBLE_MODULES"])) hpu_actor = HPUActor.remote() ray.get(hpu_actor.ping.remote()) # The actor uses the first HPU so the task uses the second one. ray.get(hpu_task.remote()) .. testoutput:: :options: +MOCK (HPUActor pid=52420) HPU IDs: [0] (HPUActor pid=52420) HABANA_VISIBLE_MODULES: 0 (hpu_task pid=51830) HPU IDs: [1] (hpu_task pid=51830) HABANA_VISIBLE_MODULES: 1 .. tab-item:: Huawei Ascend :sync: Huawei Ascend .. testcode:: :hide: ray.shutdown() .. testcode:: import os import ray ray.init(resources={"NPU": 2}) @ray.remote(resources={"NPU": 1}) class NPUActor: def ping(self): print("NPU IDs: {}".format(ray.get_runtime_context().get_accelerator_ids()["NPU"])) print("ASCEND_RT_VISIBLE_DEVICES: {}".format(os.environ["ASCEND_RT_VISIBLE_DEVICES"])) @ray.remote(resources={"NPU": 1}) def npu_task(): print("NPU IDs: {}".format(ray.get_runtime_context().get_accelerator_ids()["NPU"])) print("ASCEND_RT_VISIBLE_DEVICES: {}".format(os.environ["ASCEND_RT_VISIBLE_DEVICES"])) npu_actor = NPUActor.remote() ray.get(npu_actor.ping.remote()) # The actor uses the first NPU so the task uses the second one. ray.get(npu_task.remote()) .. testoutput:: :options: +MOCK (NPUActor pid=52420) NPU IDs: [0] (NPUActor pid=52420) ASCEND_RT_VISIBLE_DEVICES: 0 (npu_task pid=51830) NPU IDs: [1] (npu_task pid=51830) ASCEND_RT_VISIBLE_DEVICES: 1 .. tab-item:: Rebellions RBLN :sync: Rebellions RBLN .. testcode:: :hide: ray.shutdown() .. 
testcode:: import os import ray ray.init(resources={"RBLN": 2}) @ray.remote(resources={"RBLN": 1}) class RBLNActor: def ping(self): print("RBLN IDs: {}".format(ray.get_runtime_context().get_accelerator_ids()["RBLN"])) print("RBLN_DEVICES: {}".format(os.environ["RBLN_DEVICES"])) @ray.remote(resources={"RBLN": 1}) def rbln_task(): print("RBLN IDs: {}".format(ray.get_runtime_context().get_accelerator_ids()["RBLN"])) print("RBLN_DEVICES: {}".format(os.environ["RBLN_DEVICES"])) rbln_actor = RBLNActor.remote() ray.get(rbln_actor.ping.remote()) # The actor uses the first RBLN so the task uses the second one. ray.get(rbln_task.remote()) .. testoutput:: :options: +MOCK (RBLNActor pid=52420) RBLN IDs: [0] (RBLNActor pid=52420) RBLN_DEVICES: 0 (rbln_task pid=51830) RBLN IDs: [1] (rbln_task pid=51830) RBLN_DEVICES: 1 .. tab-item:: METAX GPU :sync: METAX GPU .. testcode:: :hide: ray.shutdown() .. testcode:: import os import ray ray.init(num_gpus=2) @ray.remote(num_gpus=1) class GPUActor: def ping(self): print("GPU IDs: {}".format(ray.get_runtime_context().get_accelerator_ids()["GPU"])) print("CUDA_VISIBLE_DEVICES: {}".format(os.environ["CUDA_VISIBLE_DEVICES"])) @ray.remote(num_gpus=1) def gpu_task(): print("GPU IDs: {}".format(ray.get_runtime_context().get_accelerator_ids()["GPU"])) print("CUDA_VISIBLE_DEVICES: {}".format(os.environ["CUDA_VISIBLE_DEVICES"])) gpu_actor = GPUActor.remote() ray.get(gpu_actor.ping.remote()) # The actor uses the first GPU so the task uses the second one. ray.get(gpu_task.remote()) .. testoutput:: :options: +MOCK (GPUActor pid=52420) GPU IDs: [0] (GPUActor pid=52420) CUDA_VISIBLE_DEVICES: 0 (gpu_task pid=51830) GPU IDs: [1] (gpu_task pid=51830) CUDA_VISIBLE_DEVICES: 1 Inside a task or actor, :func:`ray.get_runtime_context().get_accelerator_ids() ` returns a list of accelerator IDs that are available to the task or actor. Typically, it is not necessary to call ``get_accelerator_ids()`` because Ray automatically sets the corresponding environment variable (e.g. ``CUDA_VISIBLE_DEVICES``), which most ML frameworks respect for purposes of accelerator assignment. **Note:** The remote function or actor defined above doesn't actually use any accelerators. Ray schedules it on a node which has at least one accelerator, and reserves one accelerator for it while it is being executed, however it is up to the function to actually make use of the accelerator. This is typically done through an external library like TensorFlow. Here is an example that actually uses accelerators. In order for this example to work, you need to install the GPU version of TensorFlow. .. testcode:: @ray.remote(num_gpus=1) def gpu_task(): import tensorflow as tf # Create a TensorFlow session. TensorFlow restricts itself to use the # GPUs specified by the CUDA_VISIBLE_DEVICES environment variable. tf.Session() **Note:** It is certainly possible for the person to ignore assigned accelerators and to use all of the accelerators on the machine. Ray does not prevent this from happening, and this can lead to too many tasks or actors using the same accelerator at the same time. However, Ray does automatically set the environment variable (e.g. ``CUDA_VISIBLE_DEVICES``), which restricts the accelerators used by most deep learning frameworks assuming it's not overridden by the user. Fractional Accelerators ----------------------- Ray supports :ref:`fractional resource requirements ` so multiple tasks and actors can share the same accelerator. .. tab-set:: .. tab-item:: NVIDIA GPU :sync: NVIDIA GPU .. 
testcode:: :hide: ray.shutdown() .. testcode:: ray.init(num_cpus=4, num_gpus=1) @ray.remote(num_gpus=0.25) def f(): import time time.sleep(1) # The four tasks created here can execute concurrently # and share the same GPU. ray.get([f.remote() for _ in range(4)]) .. tab-item:: AMD GPU :sync: AMD GPU .. testcode:: :hide: ray.shutdown() .. testcode:: ray.init(num_cpus=4, num_gpus=1) @ray.remote(num_gpus=0.25) def f(): import time time.sleep(1) # The four tasks created here can execute concurrently # and share the same GPU. ray.get([f.remote() for _ in range(4)]) .. tab-item:: Intel GPU :sync: Intel GPU .. testcode:: :hide: ray.shutdown() .. testcode:: ray.init(num_cpus=4, num_gpus=1) @ray.remote(num_gpus=0.25) def f(): import time time.sleep(1) # The four tasks created here can execute concurrently # and share the same GPU. ray.get([f.remote() for _ in range(4)]) .. tab-item:: AWS Neuron Core :sync: AWS Neuron Core AWS Neuron Core doesn't support fractional resource. .. tab-item:: Google TPU :sync: Google TPU Google TPU doesn't support fractional resource. .. tab-item:: Intel Gaudi :sync: Intel Gaudi Intel Gaudi doesn't support fractional resource. .. tab-item:: Huawei Ascend :sync: Huawei Ascend .. testcode:: :hide: ray.shutdown() .. testcode:: ray.init(num_cpus=4, resources={"NPU": 1}) @ray.remote(resources={"NPU": 0.25}) def f(): import time time.sleep(1) # The four tasks created here can execute concurrently # and share the same NPU. ray.get([f.remote() for _ in range(4)]) .. tab-item:: Rebellions RBLN :sync: Rebellions RBLN Rebellions RBLN doesn't support fractional resources. .. tab-item:: METAX GPU :sync: METAX GPU .. testcode:: :hide: ray.shutdown() .. testcode:: ray.init(num_cpus=4, num_gpus=1) @ray.remote(num_gpus=0.25) def f(): import time time.sleep(1) # The four tasks created here can execute concurrently # and share the same GPU. ray.get([f.remote() for _ in range(4)]) **Note:** It is the user's responsibility to make sure that the individual tasks don't use more than their share of the accelerator memory. Pytorch and TensorFlow can be configured to limit its memory usage. When Ray assigns accelerators of a node to tasks or actors with fractional resource requirements, it packs one accelerator before moving on to the next one to avoid fragmentation. .. testcode:: :hide: ray.shutdown() .. testcode:: ray.init(num_gpus=3) @ray.remote(num_gpus=0.5) class FractionalGPUActor: def ping(self): print("GPU id: {}".format(ray.get_runtime_context().get_accelerator_ids()["GPU"])) fractional_gpu_actors = [FractionalGPUActor.remote() for _ in range(3)] # Ray tries to pack GPUs if possible. [ray.get(fractional_gpu_actors[i].ping.remote()) for i in range(3)] .. testoutput:: :options: +MOCK (FractionalGPUActor pid=57417) GPU id: [0] (FractionalGPUActor pid=57416) GPU id: [0] (FractionalGPUActor pid=57418) GPU id: [1] .. _gpu-leak: Workers not Releasing GPU Resources ----------------------------------- Currently, when a worker executes a task that uses a GPU (e.g., through TensorFlow), the task may allocate memory on the GPU and may not release it when the task finishes executing. This can lead to problems the next time a task tries to use the same GPU. To address the problem, Ray disables the worker process reuse between GPU tasks by default, where the GPU resources is released after the task process exits. Since this adds overhead to GPU task scheduling, you can re-enable worker reuse by setting ``max_calls=0`` in the :func:`ray.remote ` decorator. .. 
testcode:: # By default, ray does not reuse workers for GPU tasks to prevent # GPU resource leakage. @ray.remote(num_gpus=1) def leak_gpus(): import tensorflow as tf # This task allocates memory on the GPU and then never release it. tf.Session() .. _accelerator-types: Accelerator Types ----------------- Ray supports resource specific accelerator types. The `accelerator_type` option can be used to force to a task or actor to run on a node with a specific type of accelerator. Under the hood, the accelerator type option is implemented as a :ref:`custom resource requirement ` of ``"accelerator_type:": 0.001``. This forces the task or actor to be placed on a node with that particular accelerator type available. This also lets the multi-node-type autoscaler know that there is demand for that type of resource, potentially triggering the launch of new nodes providing that accelerator. .. testcode:: :hide: ray.shutdown() import ray.util.accelerators v100_resource_name = f"accelerator_type:{ray.util.accelerators.NVIDIA_TESLA_V100}" ray.init(num_gpus=4, resources={v100_resource_name: 1}) .. testcode:: from ray.util.accelerators import NVIDIA_TESLA_V100 @ray.remote(num_gpus=1, accelerator_type=NVIDIA_TESLA_V100) def train(data): return "This function was run on a node with a Tesla V100 GPU" ray.get(train.remote(1)) See :ref:`ray.util.accelerators ` for available accelerator types. --- .. _ray-scheduling: Scheduling ========== This page provides an overview of how Ray decides to schedule tasks and actors to nodes. .. DJS 19 Sept 2025: There should be an overview of all features and configs that impact scheduling here. This should include descriptions for default values and behaviors, and links to things like default labels or resource definitions that can be used for scheduling without customization. Labels ------ Labels provide a simplified solution for controlling scheduling for tasks, actors, and placement group bundles using default and custom labels. See :doc:`./labels`. Labels are a beta feature. As this feature becomes stable, the Ray team recommends using labels to replace the following patterns: - NodeAffinitySchedulingStrategy when `soft=false`. Use the default `ray.io/node-id` label instead. - The `accelerator_type` option for tasks and actors. Use the default `ray.io/accelerator-type` label instead. .. note:: A legacy pattern recommended using custom resources for label-based scheduling. We now recommend only using custom resources when you need to manage scheduling using numeric values. .. _ray-scheduling-resources: Resources --------- Each task or actor has the :ref:`specified resource requirements `. Given that, a node can be in one of the following states: - Feasible: the node has the required resources to run the task or actor. Depending on the current availability of these resources, there are two sub-states: - Available: the node has the required resources and they are free now. - Unavailable: the node has the required resources but they are currently being used by other tasks or actors. - Infeasible: the node doesn't have the required resources. For example a CPU-only node is infeasible for a GPU task. Resource requirements are **hard** requirements meaning that only feasible nodes are eligible to run the task or actor. If there are feasible nodes, Ray will either choose an available node or wait until an unavailable node to become available depending on other factors discussed below. 
If all nodes are infeasible, the task or actor cannot be scheduled until feasible nodes are added to the cluster. .. _ray-scheduling-strategies: Scheduling Strategies --------------------- Tasks or actors support a :func:`scheduling_strategy ` option to specify the strategy used to decide the best node among feasible nodes. Currently the supported strategies are the followings. "DEFAULT" ~~~~~~~~~ ``"DEFAULT"`` is the default strategy used by Ray. Ray schedules tasks or actors onto a group of the top k nodes. Specifically, the nodes are sorted to first favor those that already have tasks or actors scheduled (for locality), then to favor those that have low resource utilization (for load balancing). Within the top k group, nodes are chosen randomly to further improve load-balancing and mitigate delays from cold-start in large clusters. Implementation-wise, Ray calculates a score for each node in a cluster based on the utilization of its logical resources. If the utilization is below a threshold (controlled by the OS environment variable ``RAY_scheduler_spread_threshold``, default is 0.5), the score is 0, otherwise it is the resource utilization itself (score 1 means the node is fully utilized). Ray selects the best node for scheduling by randomly picking from the top k nodes with the lowest scores. The value of ``k`` is the max of (number of nodes in the cluster * ``RAY_scheduler_top_k_fraction`` environment variable) and ``RAY_scheduler_top_k_absolute`` environment variable. By default, it's 20% of the total number of nodes. Currently Ray handles actors that don't require any resources (i.e., ``num_cpus=0`` with no other resources) specially by randomly choosing a node in the cluster without considering resource utilization. Since nodes are randomly chosen, actors that don't require any resources are effectively SPREAD across the cluster. .. literalinclude:: ../doc_code/scheduling.py :language: python :start-after: __default_scheduling_strategy_start__ :end-before: __default_scheduling_strategy_end__ "SPREAD" ~~~~~~~~ ``"SPREAD"`` strategy will try to spread the tasks or actors among available nodes. .. literalinclude:: ../doc_code/scheduling.py :language: python :start-after: __spread_scheduling_strategy_start__ :end-before: __spread_scheduling_strategy_end__ PlacementGroupSchedulingStrategy ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ :py:class:`~ray.util.scheduling_strategies.PlacementGroupSchedulingStrategy` will schedule the task or actor to where the placement group is located. This is useful for actor gang scheduling. See :ref:`Placement Group ` for more details. NodeAffinitySchedulingStrategy ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ :py:class:`~ray.util.scheduling_strategies.NodeAffinitySchedulingStrategy` is a low-level strategy that allows a task or actor to be scheduled onto a particular node specified by its node id. The ``soft`` flag specifies whether the task or actor is allowed to run somewhere else if the specified node doesn't exist (e.g. if the node dies) or is infeasible because it does not have the resources required to run the task or actor. In these cases, if ``soft`` is True, the task or actor will be scheduled onto a different feasible node. Otherwise, the task or actor will fail with :py:class:`~ray.exceptions.TaskUnschedulableError` or :py:class:`~ray.exceptions.ActorUnschedulableError`. As long as the specified node is alive and feasible, the task or actor will only run there regardless of the ``soft`` flag. 
This means if the node currently has no available resources, the task or actor will wait until resources become available. This strategy should *only* be used if other high level scheduling strategies (e.g. :ref:`placement group `) cannot give the desired task or actor placements. It has the following known limitations: - It's a low-level strategy which prevents optimizations by a smart scheduler. - It cannot fully utilize an autoscaling cluster since node ids must be known when the tasks or actors are created. - It can be difficult to make the best static placement decision especially in a multi-tenant cluster: for example, an application won't know what else is being scheduled onto the same nodes. .. literalinclude:: ../doc_code/scheduling.py :language: python :start-after: __node_affinity_scheduling_strategy_start__ :end-before: __node_affinity_scheduling_strategy_end__ .. _ray-scheduling-locality: Locality-Aware Scheduling ------------------------- By default, Ray prefers available nodes that have large task arguments local to avoid transferring data over the network. If there are multiple large task arguments, the node with most object bytes local is preferred. This takes precedence over the ``"DEFAULT"`` scheduling strategy, which means Ray will try to run the task on the locality preferred node regardless of the node resource utilization. However, if the locality preferred node is not available, Ray may run the task somewhere else. When other scheduling strategies are specified, they have higher precedence and data locality is no longer considered. .. note:: Locality-aware scheduling is only for tasks not actors. .. literalinclude:: ../doc_code/scheduling.py :language: python :start-after: __locality_aware_scheduling_start__ :end-before: __locality_aware_scheduling_end__ More about Ray Scheduling ------------------------- .. toctree:: :maxdepth: 1 labels resources accelerators placement-group memory-management ray-oom-prevention --- .. _memory: Memory Management ================= This page describes how memory management works in Ray. Also view :ref:`Debugging Out of Memory ` to learn how to troubleshoot out-of-memory issues. Concepts ~~~~~~~~ There are several ways that Ray applications use memory: .. https://docs.google.com/drawings/d/1wHHnAJZ-NsyIv3TUXQJTYpPz6pjB6PUm2M40Zbfb1Ak/edit .. image:: ../images/memory.svg Ray system memory: this is memory used internally by Ray - **GCS**: memory used for storing the list of nodes and actors present in the cluster. The amount of memory used for these purposes is typically quite small. - **Raylet**: memory used by the C++ raylet process running on each node. This cannot be controlled, but is typically quite small. Application memory: this is memory used by your application - **Worker heap**: memory used by your application (e.g., in Python code or TensorFlow), best measured as the *resident set size (RSS)* of your application minus its *shared memory usage (SHR)* in commands such as ``top``. The reason you need to subtract *SHR* is that object store shared memory is reported by the OS as shared with each worker. Not subtracting *SHR* will result in double counting memory usage. - **Object store memory**: memory used when your application creates objects in the object store via ``ray.put`` and when it returns values from remote functions. Objects are reference counted and evicted when they fall out of scope. An object store server runs on each node. By default, when starting an instance, Ray reserves 30% of available memory. 
The size of the object store can be controlled by `--object-store-memory `_. The memory is by default allocated to ``/dev/shm`` (shared memory) for Linux. For MacOS, Ray uses ``/tmp`` (disk), which can impact the performance compared to Linux. In Ray 1.3+, objects are :ref:`spilled to disk ` if the object store fills up. - **Object store shared memory**: memory used when your application reads objects via ``ray.get``. Note that if an object is already present on the node, this does not cause additional allocations. This allows large objects to be efficiently shared among many actors and tasks. ObjectRef Reference Counting ---------------------------- Ray implements distributed reference counting so that any ``ObjectRef`` in scope in the cluster is pinned in the object store. This includes local python references, arguments to pending tasks, and IDs serialized inside of other objects. .. _debug-with-ray-memory: Debugging using 'ray memory' ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The ``ray memory`` command can be used to help track down what ``ObjectRef`` references are in scope and may be causing an ``ObjectStoreFullError``. Running ``ray memory`` from the command line while a Ray application is running will give you a dump of all of the ``ObjectRef`` references that are currently held by the driver, actors, and tasks in the cluster. .. code-block:: text ======== Object references status: 2021-02-23 22:02:22.072221 ======== Grouping by node address... Sorting by object size... --- Summary for node address: 192.168.0.15 --- Mem Used by Objects Local References Pinned Count Pending Tasks Captured in Objects Actor Handles 287 MiB 4 0 0 1 0 --- Object references for node address: 192.168.0.15 --- IP Address PID Type Object Ref Size Reference Type Call Site 192.168.0.15 6465 Driver ffffffffffffffffffffffffffffffffffffffff0100000001000000 15 MiB LOCAL_REFERENCE (put object) | test.py: :17 192.168.0.15 6465 Driver a67dc375e60ddd1affffffffffffffffffffffff0100000001000000 15 MiB LOCAL_REFERENCE (task call) | test.py: ::18 192.168.0.15 6465 Driver ffffffffffffffffffffffffffffffffffffffff0100000002000000 18 MiB CAPTURED_IN_OBJECT (put object) | test.py: :19 192.168.0.15 6465 Driver ffffffffffffffffffffffffffffffffffffffff0100000004000000 21 MiB LOCAL_REFERENCE (put object) | test.py: :20 192.168.0.15 6465 Driver ffffffffffffffffffffffffffffffffffffffff0100000003000000 218 MiB LOCAL_REFERENCE (put object) | test.py: :20 --- Aggregate object store stats across all nodes --- Plasma memory usage 0 MiB, 4 objects, 0.0% full Each entry in this output corresponds to an ``ObjectRef`` that's currently pinning an object in the object store along with where the reference is (in the driver, in a worker, etc.), what type of reference it is (see below for details on the types of references), the size of the object in bytes, the process ID and IP address where the object was instantiated, and where in the application the reference was created. ``ray memory`` comes with features to make the memory debugging experience more effective. For example, you can add arguments ``sort-by=OBJECT_SIZE`` and ``group-by=STACK_TRACE``, which may be particularly helpful for tracking down the line of code where a memory leak occurs. You can see the full suite of options by running ``ray memory --help``. There are five types of references that can keep an object pinned: **1. Local ObjectRef references** .. 
testcode:: import ray @ray.remote def f(arg): return arg a = ray.put(None) b = f.remote(None) In this example, we create references to two objects: one that is ``ray.put()`` in the object store and another that's the return value from ``f.remote()``. .. code-block:: text --- Summary for node address: 192.168.0.15 --- Mem Used by Objects Local References Pinned Count Pending Tasks Captured in Objects Actor Handles 30 MiB 2 0 0 0 0 --- Object references for node address: 192.168.0.15 --- IP Address PID Type Object Ref Size Reference Type Call Site 192.168.0.15 6867 Driver ffffffffffffffffffffffffffffffffffffffff0100000001000000 15 MiB LOCAL_REFERENCE (put object) | test.py: :12 192.168.0.15 6867 Driver a67dc375e60ddd1affffffffffffffffffffffff0100000001000000 15 MiB LOCAL_REFERENCE (task call) | test.py: ::13 In the output from ``ray memory``, we can see that each of these is marked as a ``LOCAL_REFERENCE`` in the driver process, but the annotation in the "Reference Creation Site" indicates that the first was created as a "put object" and the second from a "task call." **2. Objects pinned in memory** .. testcode:: import numpy as np a = ray.put(np.zeros(1)) b = ray.get(a) del a In this example, we create a ``numpy`` array and then store it in the object store. Then, we fetch the same numpy array from the object store and delete its ``ObjectRef``. In this case, the object is still pinned in the object store because the deserialized copy (stored in ``b``) points directly to the memory in the object store. .. code-block:: text --- Summary for node address: 192.168.0.15 --- Mem Used by Objects Local References Pinned Count Pending Tasks Captured in Objects Actor Handles 243 MiB 0 1 0 0 0 --- Object references for node address: 192.168.0.15 --- IP Address PID Type Object Ref Size Reference Type Call Site 192.168.0.15 7066 Driver ffffffffffffffffffffffffffffffffffffffff0100000001000000 243 MiB PINNED_IN_MEMORY test. py::19 The output from ``ray memory`` displays this as the object being ``PINNED_IN_MEMORY``. If we ``del b``, the reference can be freed. **3. Pending task references** .. testcode:: @ray.remote def f(arg): while True: pass a = ray.put(None) b = f.remote(a) In this example, we first create an object via ``ray.put()`` and then submit a task that depends on the object. .. code-block:: text --- Summary for node address: 192.168.0.15 --- Mem Used by Objects Local References Pinned Count Pending Tasks Captured in Objects Actor Handles 25 MiB 1 1 1 0 0 --- Object references for node address: 192.168.0.15 --- IP Address PID Type Object Ref Size Reference Type Call Site 192.168.0.15 7207 Driver a67dc375e60ddd1affffffffffffffffffffffff0100000001000000 ? LOCAL_REFERENCE (task call) | test.py: ::29 192.168.0.15 7241 Worker ffffffffffffffffffffffffffffffffffffffff0100000001000000 10 MiB PINNED_IN_MEMORY (deserialize task arg) __main__.f 192.168.0.15 7207 Driver ffffffffffffffffffffffffffffffffffffffff0100000001000000 15 MiB USED_BY_PENDING_TASK (put object) | test.py: :28 While the task is running, we see that ``ray memory`` shows both a ``LOCAL_REFERENCE`` and a ``USED_BY_PENDING_TASK`` reference for the object in the driver process. The worker process also holds a reference to the object because the Python ``arg`` is directly referencing the memory in the plasma, so it can't be evicted; therefore it is ``PINNED_IN_MEMORY``. **4. Serialized ObjectRef references** .. 
testcode:: @ray.remote def f(arg): while True: pass a = ray.put(None) b = f.remote([a]) In this example, we again create an object via ``ray.put()``, but then pass it to a task wrapped in another object (in this case, a list). .. code-block:: text --- Summary for node address: 192.168.0.15 --- Mem Used by Objects Local References Pinned Count Pending Tasks Captured in Objects Actor Handles 15 MiB 2 0 1 0 0 --- Object references for node address: 192.168.0.15 --- IP Address PID Type Object Ref Size Reference Type Call Site 192.168.0.15 7411 Worker ffffffffffffffffffffffffffffffffffffffff0100000001000000 ? LOCAL_REFERENCE (deserialize task arg) __main__.f 192.168.0.15 7373 Driver a67dc375e60ddd1affffffffffffffffffffffff0100000001000000 ? LOCAL_REFERENCE (task call) | test.py: ::38 192.168.0.15 7373 Driver ffffffffffffffffffffffffffffffffffffffff0100000001000000 15 MiB USED_BY_PENDING_TASK (put object) | test.py: :37 Now, both the driver and the worker process running the task hold a ``LOCAL_REFERENCE`` to the object in addition to it being ``USED_BY_PENDING_TASK`` on the driver. If this was an actor task, the actor could even hold a ``LOCAL_REFERENCE`` after the task completes by storing the ``ObjectRef`` in a member variable. **5. Captured ObjectRef references** .. testcode:: a = ray.put(None) b = ray.put([a]) del a In this example, we first create an object via ``ray.put()``, then capture its ``ObjectRef`` inside of another ``ray.put()`` object, and delete the first ``ObjectRef``. In this case, both objects are still pinned. .. code-block:: text --- Summary for node address: 192.168.0.15 --- Mem Used by Objects Local References Pinned Count Pending Tasks Captured in Objects Actor Handles 233 MiB 1 0 0 1 0 --- Object references for node address: 192.168.0.15 --- IP Address PID Type Object Ref Size Reference Type Call Site 192.168.0.15 7473 Driver ffffffffffffffffffffffffffffffffffffffff0100000001000000 15 MiB CAPTURED_IN_OBJECT (put object) | test.py: :41 192.168.0.15 7473 Driver ffffffffffffffffffffffffffffffffffffffff0100000002000000 218 MiB LOCAL_REFERENCE (put object) | test.py: :42 In the output of ``ray memory``, we see that the second object displays as a normal ``LOCAL_REFERENCE``, but the first object is listed as ``CAPTURED_IN_OBJECT``. .. _memory-aware-scheduling: Memory Aware Scheduling ~~~~~~~~~~~~~~~~~~~~~~~ By default, Ray does not take into account the potential memory usage of a task or actor when scheduling. This is simply because it cannot estimate ahead of time how much memory is required. However, if you know how much memory a task or actor requires, you can specify it in the resource requirements of its ``ray.remote`` decorator to enable memory-aware scheduling: .. important:: Specifying a memory requirement does NOT impose any limits on memory usage. The requirements are used for admission control during scheduling only (similar to how CPU scheduling works in Ray). It is up to the task itself to not use more memory than it requested. To tell the Ray scheduler a task or actor requires a certain amount of available memory to run, set the ``memory`` argument. The Ray scheduler will then reserve the specified amount of available memory during scheduling, similar to how it handles CPU and GPU resources: .. 
testcode:: # reserve 500MiB of available memory to place this task @ray.remote(memory=500 * 1024 * 1024) def some_function(x): pass # reserve 2.5GiB of available memory to place this actor @ray.remote(memory=2500 * 1024 * 1024) class SomeActor: def __init__(self, a, b): pass In the above example, the memory quota is specified statically by the decorator, but you can also set them dynamically at runtime using ``.options()`` as follows: .. testcode:: # override the memory quota to 100MiB when submitting the task some_function.options(memory=100 * 1024 * 1024).remote(x=1) # override the memory quota to 1GiB when creating the actor SomeActor.options(memory=1000 * 1024 * 1024).remote(a=1, b=2) Questions or Issues? -------------------- .. include:: /_includes/_help.rst --- Placement Groups ================ .. _ray-placement-group-doc-ref: Placement groups allow users to atomically reserve groups of resources across multiple nodes (i.e., gang scheduling). They can then be used to schedule Ray tasks and actors packed together for locality (PACK), or spread apart (SPREAD). Placement groups are generally used for gang-scheduling actors, but also support tasks. Here are some real-world use cases: - **Distributed Machine Learning Training**: Distributed Training (e.g., :ref:`Ray Train ` and :ref:`Ray Tune `) uses the placement group APIs to enable gang scheduling. In these settings, all resources for a trial must be available at the same time. Gang scheduling is a critical technique to enable all-or-nothing scheduling for deep learning training. - **Fault tolerance in distributed training**: Placement groups can be used to configure fault tolerance. In Ray Tune, it can be beneficial to pack related resources from a single trial together, so that a node failure impacts a low number of trials. In libraries that support elastic training (e.g., XGBoost-Ray), spreading the resources across multiple nodes can help to ensure that training continues even when a node dies. Key Concepts ------------ Bundles ~~~~~~~ A **bundle** is a collection of "resources". It could be a single resource, ``{"CPU": 1}``, or a group of resources, ``{"CPU": 1, "GPU": 4}``. A bundle is a unit of reservation for placement groups. "Scheduling a bundle" means we find a node that fits the bundle and reserve the resources specified by the bundle. A bundle must be able to fit on a single node on the Ray cluster. For example, if you only have an 8 CPU node, and if you have a bundle that requires ``{"CPU": 9}``, this bundle cannot be scheduled. Placement Group ~~~~~~~~~~~~~~~ A **placement group** reserves the resources from the cluster. The reserved resources can only be used by tasks or actors that use the :ref:`PlacementGroupSchedulingStrategy `. - Placement groups are represented by a list of bundles. For example, ``{"CPU": 1} * 4`` means you'd like to reserve 4 bundles of 1 CPU (i.e., it reserves 4 CPUs). - Bundles are then placed according to the :ref:`placement strategies ` across nodes on the cluster. - After the placement group is created, tasks or actors can be then scheduled according to the placement group and even on individual bundles. Create a Placement Group (Reserve Resources) -------------------------------------------- You can create a placement group using :func:`ray.util.placement_group`. Placement groups take in a list of bundles and a :ref:`placement strategy `. Note that each bundle must be able to fit on a single node on the Ray cluster. 
For example, if you only have an 8 CPU node, and if you have a bundle that requires ``{"CPU": 9}``, this bundle cannot be scheduled. Bundles are specified by a list of dictionaries, e.g., ``[{"CPU": 1}, {"CPU": 1, "GPU": 1}]``). - ``CPU`` corresponds to ``num_cpus`` as used in :func:`ray.remote `. - ``GPU`` corresponds to ``num_gpus`` as used in :func:`ray.remote `. - ``memory`` corresponds to ``memory`` as used in :func:`ray.remote ` - Other resources corresponds to ``resources`` as used in :func:`ray.remote ` (E.g., ``ray.init(resources={"disk": 1})`` can have a bundle of ``{"disk": 1}``). Placement group scheduling is asynchronous. The `ray.util.placement_group` returns immediately. .. tab-set:: .. tab-item:: Python .. literalinclude:: ../doc_code/placement_group_example.py :language: python :start-after: __create_pg_start__ :end-before: __create_pg_end__ .. tab-item:: Java .. code-block:: java // Initialize Ray. Ray.init(); // Construct a list of bundles. Map bundle = ImmutableMap.of("CPU", 1.0); List> bundles = ImmutableList.of(bundle); // Make a creation option with bundles and strategy. PlacementGroupCreationOptions options = new PlacementGroupCreationOptions.Builder() .setBundles(bundles) .setStrategy(PlacementStrategy.STRICT_SPREAD) .build(); PlacementGroup pg = PlacementGroups.createPlacementGroup(options); .. tab-item:: C++ .. code-block:: c++ // Initialize Ray. ray::Init(); // Construct a list of bundles. std::vector> bundles{{{"CPU", 1.0}}}; // Make a creation option with bundles and strategy. ray::internal::PlacementGroupCreationOptions options{ false, "my_pg", bundles, ray::internal::PlacementStrategy::PACK}; ray::PlacementGroup pg = ray::CreatePlacementGroup(options); You can block your program until the placement group is ready using one of two APIs: * :func:`ready `, which is compatible with ``ray.get`` * :func:`wait `, which blocks the program until the placement group is ready) .. tab-set:: .. tab-item:: Python .. literalinclude:: ../doc_code/placement_group_example.py :language: python :start-after: __ready_pg_start__ :end-before: __ready_pg_end__ .. tab-item:: Java .. code-block:: java // Wait for the placement group to be ready within the specified time(unit is seconds). boolean ready = pg.wait(60); Assert.assertTrue(ready); // You can look at placement group states using this API. List allPlacementGroup = PlacementGroups.getAllPlacementGroups(); for (PlacementGroup group: allPlacementGroup) { System.out.println(group); } .. tab-item:: C++ .. code-block:: c++ // Wait for the placement group to be ready within the specified time(unit is seconds). bool ready = pg.Wait(60); assert(ready); // You can look at placement group states using this API. std::vector all_placement_group = ray::GetAllPlacementGroups(); for (const ray::PlacementGroup &group : all_placement_group) { std::cout << group.GetName() << std::endl; } Let's verify the placement group is successfully created. .. code-block:: bash # This API is only available when you download Ray via `pip install "ray[default]"` ray list placement-groups .. code-block:: bash ======== List: 2023-04-07 01:15:05.682519 ======== Stats: ------------------------------ Total: 1 Table: ------------------------------ PLACEMENT_GROUP_ID NAME CREATOR_JOB_ID STATE 0 3cd6174711f47c14132155039c0501000000 01000000 CREATED The placement group is successfully created. Out of the ``{"CPU": 2, "GPU": 2}`` resources, the placement group reserves ``{"CPU": 1, "GPU": 1}``. 
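If you prefer a programmatic check over the CLI, a minimal sketch along the following lines should also work (this is an illustrative snippet rather than part of the example files used above; it assumes a fresh local cluster started with ``ray.init(num_cpus=2, num_gpus=2)`` and uses ``ray.util.placement_group_table`` to read the state):

.. code-block:: python

    import ray
    from ray.util.placement_group import placement_group, placement_group_table

    ray.init(num_cpus=2, num_gpus=2)

    # Reserve one bundle with 1 CPU and 1 GPU.
    pg = placement_group([{"CPU": 1, "GPU": 1}])

    # Block until the reservation succeeds.
    ray.get(pg.ready())

    # Inspect the placement group state programmatically.
    print(placement_group_table(pg)["state"])  # Expected to print "CREATED".

The ``state`` field mirrors what ``ray list placement-groups`` reports.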
The reserved resources can only be used when you schedule tasks or actors with a placement group. The diagram below demonstrates the "1 CPU and 1 GPU" bundle that the placement group reserved. .. image:: ../images/pg_image_1.png :align: center Placement groups are atomically created; if a bundle cannot fit in any of the current nodes, the entire placement group is not ready and no resources are reserved. To illustrate, let's create another placement group that requires ``{"CPU":1}, {"GPU": 2}`` (2 bundles). .. tab-set:: .. tab-item:: Python .. literalinclude:: ../doc_code/placement_group_example.py :language: python :start-after: __create_pg_failed_start__ :end-before: __create_pg_failed_end__ You can verify the new placement group is pending creation. .. code-block:: bash # This API is only available when you download Ray via `pip install "ray[default]"` ray list placement-groups .. code-block:: bash ======== List: 2023-04-07 01:16:23.733410 ======== Stats: ------------------------------ Total: 2 Table: ------------------------------ PLACEMENT_GROUP_ID NAME CREATOR_JOB_ID STATE 0 3cd6174711f47c14132155039c0501000000 01000000 CREATED 1 e1b043bebc751c3081bddc24834d01000000 01000000 PENDING <---- the new placement group. You can also verify that the ``{"CPU": 1, "GPU": 2}`` bundles cannot be allocated, using the ``ray status`` CLI command. .. code-block:: bash ray status .. code-block:: bash Resources --------------------------------------------------------------- Usage: 0.0/2.0 CPU (0.0 used of 1.0 reserved in placement groups) 0.0/2.0 GPU (0.0 used of 1.0 reserved in placement groups) 0B/3.46GiB memory 0B/1.73GiB object_store_memory Demands: {'CPU': 1.0} * 1, {'GPU': 2.0} * 1 (PACK): 1+ pending placement groups <--- 1 placement group is pending creation. The current cluster has ``{"CPU": 2, "GPU": 2}``. We already created a ``{"CPU": 1, "GPU": 1}`` bundle, so only ``{"CPU": 1, "GPU": 1}`` is left in the cluster. If we create 2 bundles ``{"CPU": 1}, {"GPU": 2}``, we can create a first bundle successfully, but can't schedule the second bundle. Since we cannot create every bundle on the cluster, the placement group is not created, including the ``{"CPU": 1}`` bundle. .. image:: ../images/pg_image_2.png :align: center When the placement group cannot be scheduled in any way, it is called "infeasible". Imagine you schedule ``{"CPU": 4}`` bundle, but you only have a single node with 2 CPUs. There's no way to create this bundle in your cluster. The Ray Autoscaler is aware of placement groups, and auto-scales the cluster to ensure pending groups can be placed as needed. If Ray Autoscaler cannot provide resources to schedule a placement group, Ray does *not* print a warning about infeasible groups and tasks and actors that use the groups. You can observe the scheduling state of the placement group from the :ref:`dashboard or state APIs `. .. _ray-placement-group-schedule-tasks-actors-ref: Schedule Tasks and Actors to Placement Groups (Use Reserved Resources) ---------------------------------------------------------------------- In the previous section, we created a placement group that reserved ``{"CPU": 1, "GPU: 1"}`` from a 2 CPU and 2 GPU node. Now let's schedule an actor to the placement group. You can schedule actors or tasks to a placement group using :class:`options(scheduling_strategy=PlacementGroupSchedulingStrategy(...)) `. .. tab-set:: .. tab-item:: Python .. 
literalinclude:: ../doc_code/placement_group_example.py :language: python :start-after: __schedule_pg_start__ :end-before: __schedule_pg_end__ .. tab-item:: Java .. code-block:: java public static class Counter { private int value; public Counter(int initValue) { this.value = initValue; } public int getValue() { return value; } public static String ping() { return "pong"; } } // Create GPU actors on a gpu bundle. for (int index = 0; index < 1; index++) { Ray.actor(Counter::new, 1) .setPlacementGroup(pg, 0) .remote(); } .. tab-item:: C++ .. code-block:: c++ class Counter { public: Counter(int init_value) : value(init_value){} int GetValue() {return value;} std::string Ping() { return "pong"; } private: int value; }; // Factory function of Counter class. static Counter *CreateCounter() { return new Counter(); }; RAY_REMOTE(&Counter::Ping, &Counter::GetValue, CreateCounter); // Create GPU actors on a gpu bundle. for (int index = 0; index < 1; index++) { ray::Actor(CreateCounter) .SetPlacementGroup(pg, 0) .Remote(1); } .. note:: By default, Ray actors require 1 logical CPU at schedule time, but after being scheduled, they do not acquire any CPU resources. In other words, by default, actors cannot get scheduled on a zero-cpu node, but an infinite number of them can run on any non-zero cpu node. Thus, when scheduling an actor with the default resource requirements and a placement group, the placement group has to be created with a bundle containing at least 1 CPU (since the actor requires 1 CPU for scheduling). However, after the actor is created, it doesn't consume any placement group resources. To avoid any surprises, always specify resource requirements explicitly for actors. If resources are specified explicitly, they are required both at schedule time and at execution time. The actor is scheduled now! One bundle can be used by multiple tasks and actors (i.e., the bundle to task (or actor) is a one-to-many relationship). In this case, since the actor uses 1 CPU, 1 GPU remains from the bundle. You can verify this from the CLI command ``ray status``. You can see the 1 CPU is reserved by the placement group, and 1.0 is used (by the actor we created). .. code-block:: bash ray status .. code-block:: bash Resources --------------------------------------------------------------- Usage: 1.0/2.0 CPU (1.0 used of 1.0 reserved in placement groups) <--- 0.0/2.0 GPU (0.0 used of 1.0 reserved in placement groups) 0B/4.29GiB memory 0B/2.00GiB object_store_memory Demands: (no resource demands) You can also verify the actor is created using ``ray list actors``. .. code-block:: bash # This API is only available when you download Ray via `pip install "ray[default]"` ray list actors --detail .. code-block:: bash - actor_id: b5c990f135a7b32bfbb05e1701000000 class_name: Actor death_cause: null is_detached: false job_id: '01000000' name: '' node_id: b552ca3009081c9de857a31e529d248ba051a4d3aeece7135dde8427 pid: 8795 placement_group_id: d2e660ac256db230dbe516127c4a01000000 <------ ray_namespace: e5b19111-306c-4cd8-9e4f-4b13d42dff86 repr_name: '' required_resources: CPU_group_d2e660ac256db230dbe516127c4a01000000: 1.0 serialized_runtime_env: '{}' state: ALIVE Since 1 GPU remains, let's create a new actor that requires 1 GPU. This time, we also specify the ``placement_group_bundle_index``. Each bundle is given an "index" within the placement group. For example, a placement group of 2 bundles ``[{"CPU": 1}, {"GPU": 1}]`` has index 0 bundle ``{"CPU": 1}`` and index 1 bundle ``{"GPU": 1}``. 
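As an illustration of pinning work to a particular bundle (a minimal sketch, separate from the example file used in this walkthrough; the ``GPUWorker`` actor and the two-bundle group here are hypothetical), pass ``placement_group_bundle_index`` through the scheduling strategy:

.. code-block:: python

    import ray
    from ray.util.placement_group import placement_group
    from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

    ray.init(num_cpus=1, num_gpus=1)

    # Two bundles: index 0 is {"CPU": 1}, index 1 is {"GPU": 1}.
    pg = placement_group([{"CPU": 1}, {"GPU": 1}])
    ray.get(pg.ready())

    @ray.remote(num_cpus=0, num_gpus=1)
    class GPUWorker:
        def ready(self):
            return "placed on the GPU bundle"

    # Pin the actor to bundle index 1, the {"GPU": 1} bundle.
    worker = GPUWorker.options(
        scheduling_strategy=PlacementGroupSchedulingStrategy(
            placement_group=pg,
            placement_group_bundle_index=1,
        )
    ).remote()
    ray.get(worker.ready.remote())

Keep in mind that the actor's resource request must fit within the chosen bundle.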
In this walkthrough, since we only have 1 bundle, the only valid index is 0. If you don't specify a bundle, the actor (or task) is scheduled on a random bundle that has unallocated reserved resources. .. tab-set:: .. tab-item:: Python .. literalinclude:: ../doc_code/placement_group_example.py :language: python :start-after: __schedule_pg_3_start__ :end-before: __schedule_pg_3_end__ We succeeded in scheduling the GPU actor! The image below shows the 2 actors scheduled into the placement group. .. image:: ../images/pg_image_3.png :align: center You can also verify that all of the reserved resources are in use with the ``ray status`` command. .. code-block:: bash ray status .. code-block:: bash Resources --------------------------------------------------------------- Usage: 1.0/2.0 CPU (1.0 used of 1.0 reserved in placement groups) 1.0/2.0 GPU (1.0 used of 1.0 reserved in placement groups) <---- 0B/4.29GiB memory 0B/2.00GiB object_store_memory .. _pgroup-strategy: Placement Strategy ------------------ One of the features placement groups provide is the ability to add placement constraints among bundles. For example, you may want to pack your bundles onto the same node, or spread them out across multiple nodes as much as possible. You can specify the strategy via the ``strategy`` argument. This way, you can make sure your actors and tasks are scheduled with certain placement constraints. The example below creates a placement group with 2 bundles with a PACK strategy; both bundles have to be created on the same node. Note that it is a soft policy. If the bundles cannot be packed into a single node, they are spread to other nodes. If you'd like to avoid this behavior, you can instead use the `STRICT_PACK` policy, which fails to create the placement group if the placement requirements cannot be satisfied. .. literalinclude:: ../doc_code/placement_group_example.py :language: python :start-after: __strategy_pg_start__ :end-before: __strategy_pg_end__ The image below demonstrates the PACK policy. Three of the ``{"CPU": 2}`` bundles are located on the same node. .. image:: ../images/pg_image_4.png :align: center The image below demonstrates the SPREAD policy. Each of the three ``{"CPU": 2}`` bundles is located on a different node. .. image:: ../images/pg_image_5.png :align: center Ray supports four placement group strategies. The default scheduling policy is ``PACK``. **STRICT_PACK** All bundles must be placed into a single node on the cluster. Use this strategy when you want to maximize locality. **PACK** All provided bundles are packed onto a single node on a best-effort basis. If strict packing is not feasible (i.e., some bundles do not fit on the node), bundles can be placed onto other nodes. **STRICT_SPREAD** Each bundle must be scheduled on a separate node. **SPREAD** Each bundle is spread onto separate nodes on a best-effort basis. If strict spreading is not feasible, bundles can be placed on overlapping nodes. Remove Placement Groups (Free Reserved Resources) ------------------------------------------------- By default, a placement group's lifetime is scoped to the driver that creates it (unless you make it a :ref:`detached placement group `). When the placement group is created from a :ref:`detached actor `, the lifetime is scoped to the detached actor. In Ray, the driver is the Python script that calls ``ray.init``. Reserved resources (bundles) from the placement group are automatically freed when the driver or detached actor that created the placement group exits.
To free the reserved resources manually, remove the placement group using the :func:`remove_placement_group ` API (which is also an asynchronous API). .. note:: When you remove the placement group, actors or tasks that still use the reserved resources are forcefully killed. .. tab-set:: .. tab-item:: Python .. literalinclude:: ../doc_code/placement_group_example.py :language: python :start-after: __remove_pg_start__ :end-before: __remove_pg_end__ .. tab-item:: Java .. code-block:: java PlacementGroups.removePlacementGroup(placementGroup.getId()); PlacementGroup removedPlacementGroup = PlacementGroups.getPlacementGroup(placementGroup.getId()); Assert.assertEquals(removedPlacementGroup.getState(), PlacementGroupState.REMOVED); .. tab-item:: C++ .. code-block:: c++ ray::RemovePlacementGroup(placement_group.GetID()); ray::PlacementGroup removed_placement_group = ray::GetPlacementGroup(placement_group.GetID()); assert(removed_placement_group.GetState() == ray::PlacementGroupState::REMOVED); .. _ray-placement-group-observability-ref: Observe and Debug Placement Groups ---------------------------------- Ray provides several useful tools to inspect placement group states and resource usage. - **Ray Status** is a CLI tool for viewing the resource usage and scheduling resource requirements of placement groups. - **Ray Dashboard** is a UI tool for inspecting placement group states. - **Ray State API** is a CLI for inspecting placement group states. .. tab-set:: .. tab-item:: ray status (CLI) The CLI command ``ray status`` provides the autoscaling status of the cluster. It shows the "resource demands" from unscheduled placement groups as well as the resource reservation status. .. code-block:: bash Resources --------------------------------------------------------------- Usage: 1.0/2.0 CPU (1.0 used of 1.0 reserved in placement groups) 0.0/2.0 GPU (0.0 used of 1.0 reserved in placement groups) 0B/4.29GiB memory 0B/2.00GiB object_store_memory .. tab-item:: Dashboard The :ref:`dashboard job view ` provides the placement group table that displays the scheduling state and metadata of the placement group. .. note:: The Ray dashboard is only available when you install Ray with ``pip install "ray[default]"``. .. tab-item:: Ray State API :ref:`Ray state API ` is a CLI tool for inspecting the state of Ray resources (tasks, actors, placement groups, etc.). ``ray list placement-groups`` provides the metadata and the scheduling state of the placement group. ``ray list placement-groups --detail`` provides statistics and scheduling state in greater detail. .. note:: The State API is only available when you install Ray with ``pip install "ray[default]"`` Inspect Placement Group Scheduling State ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ With the above tools, you can see the state of the placement group. The definitions of the states are specified in the following files: - `High level state `_ - `Details `_ .. image:: ../images/pg_image_6.png :align: center [Advanced] Child Tasks and Actors --------------------------------- By default, child actors and tasks don't share the same placement group that the parent uses. To automatically schedule child actors or tasks to the same placement group, set ``placement_group_capture_child_tasks`` to True, as shown in the example below. .. tab-set:: .. tab-item:: Python .. literalinclude:: ../doc_code/placement_group_capture_child_tasks_example.py :language: python :start-after: __child_capture_pg_start__ :end-before: __child_capture_pg_end__ .. tab-item:: Java It's not implemented for Java APIs yet.
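For readers who can't open the example file referenced above, a minimal sketch of the pattern might look like the following (the ``parent``/``child`` tasks and the single ``{"CPU": 2}`` bundle are illustrative assumptions, not the exact contents of that example):

.. code-block:: python

    import ray
    from ray.util.placement_group import placement_group
    from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

    ray.init(num_cpus=2)
    pg = placement_group([{"CPU": 2}])
    ray.get(pg.ready())

    @ray.remote(num_cpus=1)
    def child():
        return "runs in the parent's placement group"

    @ray.remote(num_cpus=1)
    def parent():
        # Because the parent was scheduled with
        # placement_group_capture_child_tasks=True, this child task is
        # scheduled into the same placement group by default.
        return ray.get(child.remote())

    ray.get(
        parent.options(
            scheduling_strategy=PlacementGroupSchedulingStrategy(
                placement_group=pg,
                placement_group_capture_child_tasks=True,
            )
        ).remote()
    )

Without ``placement_group_capture_child_tasks=True``, the ``child`` task wouldn't be placed in the parent's placement group by default.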
If ``placement_group_capture_child_tasks`` is True but you don't want to schedule particular child tasks and actors to the same placement group, specify ``PlacementGroupSchedulingStrategy(placement_group=None)`` for them. .. literalinclude:: ../doc_code/placement_group_capture_child_tasks_example.py :language: python :start-after: __child_capture_disable_pg_start__ :end-before: __child_capture_disable_pg_end__ .. warning:: The value of ``placement_group_capture_child_tasks`` for a given actor isn't inherited from its parent. If you're creating nested actors of depth greater than 1 that should all use the same placement group, explicitly set ``placement_group_capture_child_tasks`` for each actor. [Advanced] Named Placement Group -------------------------------- Within a :ref:`namespace `, you can *name* a placement group. You can use the name of a placement group to retrieve the placement group from any job in the Ray cluster, as long as the job is within the same namespace. This is useful if you can't directly pass the placement group handle to the actor or task that needs it, or if you are trying to access a placement group launched by another driver. The placement group is destroyed when the original creation job completes if its lifetime isn't `detached`. You can avoid this by using a :ref:`detached placement group `. Note that this feature requires that you specify a :ref:`namespace ` associated with it, or else you can't retrieve the placement group across jobs. .. tab-set:: .. tab-item:: Python .. literalinclude:: ../doc_code/placement_group_example.py :language: python :start-after: __get_pg_start__ :end-before: __get_pg_end__ .. tab-item:: Java .. code-block:: java // Create a placement group with a unique name. Map bundle = ImmutableMap.of("CPU", 1.0); List> bundles = ImmutableList.of(bundle); PlacementGroupCreationOptions options = new PlacementGroupCreationOptions.Builder() .setBundles(bundles) .setStrategy(PlacementStrategy.STRICT_SPREAD) .setName("global_name") .build(); PlacementGroup pg = PlacementGroups.createPlacementGroup(options); pg.wait(60); ... // Retrieve the placement group later somewhere. PlacementGroup group = PlacementGroups.getPlacementGroup("global_name"); Assert.assertNotNull(group); .. tab-item:: C++ .. code-block:: c++ // Create a placement group with a globally unique name. std::vector> bundles{{{"CPU", 1.0}}}; ray::PlacementGroupCreationOptions options{ true/*global*/, "global_name", bundles, ray::PlacementStrategy::STRICT_SPREAD}; ray::PlacementGroup pg = ray::CreatePlacementGroup(options); pg.Wait(60); ... // Retrieve the placement group later somewhere. ray::PlacementGroup group = ray::GetGlobalPlacementGroup("global_name"); assert(!group.Empty()); Ray also supports non-global named placement groups in C++, which means that the placement group name is only valid within the job and cannot be accessed from another job. .. code-block:: c++ // Create a placement group with a job-scope-unique name. std::vector> bundles{{{"CPU", 1.0}}}; ray::PlacementGroupCreationOptions options{ false/*non-global*/, "non_global_name", bundles, ray::PlacementStrategy::STRICT_SPREAD}; ray::PlacementGroup pg = ray::CreatePlacementGroup(options); pg.Wait(60); ... // Retrieve the placement group later somewhere in the same job. ray::PlacementGroup group = ray::GetPlacementGroup("non_global_name"); assert(!group.Empty()); ..
_placement-group-detached: [Advanced] Detached Placement Group ----------------------------------- By default, the lifetimes of placement groups belong to the driver and actor. - If the placement group is created from a driver, it is destroyed when the driver is terminated. - If it is created from a detached actor, it is killed when the detached actor is killed. To keep the placement group alive regardless of its job or detached actor, specify `lifetime="detached"`. For example: .. tab-set:: .. tab-item:: Python .. literalinclude:: ../doc_code/placement_group_example.py :language: python :start-after: __detached_pg_start__ :end-before: __detached_pg_end__ .. tab-item:: Java The lifetime argument is not implemented for Java APIs yet. Let's terminate the current script and start a new Python script. Call ``ray list placement-groups``, and you can see the placement group is not removed. Note that the lifetime option is decoupled from the name. If we only specified the name without specifying ``lifetime="detached"``, then the placement group can only be retrieved as long as the original driver is still running. It is recommended to always specify the name when creating the detached placement group. [Advanced] Fault Tolerance -------------------------- .. _ray-placement-group-ft-ref: Rescheduling Bundles on a Dead Node ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ If nodes that contain some bundles of a placement group die, all the bundles are rescheduled on different nodes by GCS (i.e., we try reserving resources again). This means that the initial creation of placement group is "atomic", but once it is created, there could be partial placement groups. Rescheduling bundles have higher scheduling priority than other placement group scheduling. Provide Resources for Partially Lost Bundles ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ If there are not enough resources to schedule the partially lost bundles, the placement group waits, assuming Ray Autoscaler will start a new node to satisfy the resource requirements. If the additional resources cannot be provided (e.g., you don't use the Autoscaler or the Autoscaler hits the resource limit), the placement group remains in the partially created state indefinitely. Fault Tolerance of Actors and Tasks that Use the Bundle ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Actors and tasks that use the bundle (reserved resources) are rescheduled based on their :ref:`fault tolerant policy ` once the bundle is recovered. API Reference ------------- :ref:`Placement Group API reference ` --- .. _ray-oom-prevention: Out-Of-Memory Prevention ======================== If application tasks or actors consume a large amount of heap space, it can cause the node to run out of memory (OOM). When that happens, the operating system will start killing worker or raylet processes, disrupting the application. OOM may also stall metrics and if this happens on the head node, it may stall the :ref:`dashboard ` or other control processes and cause the cluster to become unusable. In this section we will go over: - What is the memory monitor and how it works - How to enable and configure it - How to use the memory monitor to detect and resolve memory issues Also view :ref:`Debugging Out of Memory ` to learn how to troubleshoot out-of-memory issues. .. _ray-oom-monitor: What is the memory monitor? --------------------------- The memory monitor is a component that runs within the :ref:`raylet ` process on each node. 
It periodically checks the memory usage, which includes the worker heap, the object store, and the raylet as described in :ref:`memory management `. If the combined usage exceeds a configurable threshold, the raylet kills a task or actor process to free up memory and prevent Ray from failing. It's available on Linux and is tested with Ray running inside a container that is using cgroup v1/v2. If you encounter issues when running the memory monitor outside of a container, :ref:`file an issue or post a question `. How do I disable the memory monitor? -------------------------------------- The memory monitor is enabled by default and can be disabled by setting the environment variable ``RAY_memory_monitor_refresh_ms`` to zero when Ray starts (e.g., RAY_memory_monitor_refresh_ms=0 ray start ...). How do I configure the memory monitor? -------------------------------------- The memory monitor is controlled by the following environment variables: - ``RAY_memory_monitor_refresh_ms (int, defaults to 250)`` is the interval to check memory usage and kill tasks or actors if needed. Task killing is disabled when this value is 0. The memory monitor selects and kills one task at a time and waits for it to be killed before choosing another one, regardless of how frequently the memory monitor runs. - ``RAY_memory_usage_threshold (float, defaults to 0.95)`` is the fraction of the node's memory capacity beyond which the node is considered to be under memory pressure. If memory usage is above this fraction, the monitor starts killing processes to free up memory. Ranges from [0, 1]. Using the Memory Monitor ------------------------ .. _ray-oom-retry-policy: Retry policy ~~~~~~~~~~~~ When a task or actor is killed by the memory monitor, it is retried with exponential backoff. There is a cap on the retry delay, which is 60 seconds. If tasks are killed by the memory monitor, they are retried infinitely (not respecting :ref:`max_retries `). If actors are killed by the memory monitor, they aren't recreated infinitely (the monitor respects :ref:`max_restarts `, which is 0 by default). Worker killing policy ~~~~~~~~~~~~~~~~~~~~~ The memory monitor avoids infinite loops of task retries by ensuring at least one task is able to run for each caller on each node. If it is unable to ensure this, the workload fails with an OOM error. Note that this is only an issue for tasks, since the memory monitor doesn't indefinitely retry actors. If the workload fails, refer to :ref:`how to address memory issues ` on how to adjust the workload to make it pass. For a code example, see the :ref:`last task ` example below. When a worker needs to be killed, the policy first prioritizes tasks that are retriable, i.e. when :ref:`max_retries ` or :ref:`max_restarts ` is > 0. This is done to minimize workload failure. Actors by default are not retriable since :ref:`max_restarts ` defaults to 0. Therefore, by default, tasks are preferred to actors when it comes to what gets killed first. When there are multiple callers that have created tasks, the policy picks a task from the caller with the greatest number of running tasks. If two callers have the same number of tasks, it picks the caller whose earliest task has a later start time. This is done to ensure fairness and allow each caller to make progress. Among the tasks that share the same caller, the most recently started task is killed first. Below is an example to demonstrate the policy. In the example, we have a script that creates two tasks, which in turn create four more tasks each.
The tasks are colored such that each color forms a "group" of tasks that belong to the same caller. .. image:: ../images/oom_killer_example.svg :width: 1024 :alt: Initial state of the task graph If, at this point, the node runs out of memory, the policy picks a task from the caller with the greatest number of tasks, and kills the task of that caller that started last: .. image:: ../images/oom_killer_example_killed_one.svg :width: 1024 :alt: Task graph after one task is killed If, at this point, the node still runs out of memory, the process repeats: .. image:: ../images/oom_killer_example_killed_two.svg :width: 1024 :alt: Task graph after two tasks are killed .. _last-task-example: .. dropdown:: Example: Workload fails if the last task of the caller is killed Let's create an application oom.py that runs a single task that requires more memory than what is available. It is set to infinite retry by setting ``max_retries`` to -1. Because the task is the last one of its caller, the worker killing policy fails the workload when it kills the task, even though the task is set to retry forever. .. literalinclude:: ../doc_code/ray_oom_prevention.py :language: python :start-after: __last_task_start__ :end-before: __last_task_end__ Set ``RAY_event_stats_print_interval_ms=1000`` so it prints the worker kill summary every second, since by default it prints every minute. .. code-block:: bash RAY_event_stats_print_interval_ms=1000 python oom.py (raylet) node_manager.cc:3040: 1 Workers (tasks / actors) killed due to memory pressure (OOM), 0 Workers crashed due to other reasons at node (ID: 2c82620270df6b9dd7ae2791ef51ee4b5a9d5df9f795986c10dd219c, IP: 172.31.183.172) over the last time period. To see more information about the Workers killed on this node, use `ray logs raylet.out -ip 172.31.183.172` (raylet) (raylet) Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero. task failed with OutOfMemoryError, which is expected Verify the task was indeed executed twice via ``task_oom_retry``: .. dropdown:: Example: memory monitor prefers to kill a retriable task Let's first start Ray and specify the memory threshold. .. code-block:: bash RAY_memory_usage_threshold=0.4 ray start --head Let's create an application two_actors.py that submits two actors, where the first one is retriable and the second one is non-retriable. .. literalinclude:: ../doc_code/ray_oom_prevention.py :language: python :start-after: __two_actors_start__ :end-before: __two_actors_end__ Run the application to see that only the first actor was killed. .. code-block:: bash $ python two_actors.py First started actor, which is retriable, was killed by the memory monitor. Second started actor, which is not-retriable, finished. .. _addressing-memory-issues: Addressing memory issues ------------------------ When the application fails due to OOM, consider reducing the memory usage of the tasks and actors, increasing the memory capacity of the node, or :ref:`limiting the number of concurrently running tasks `. .. _oom-questions: Questions or Issues? -------------------- .. include:: /_includes/_help.rst --- ..
_core-resources: Resources ========= Ray allows you to seamlessly scale your applications from a laptop to a cluster without code change. **Ray resources** are key to this capability. They abstract away physical machines and let you express your computation in terms of resources, while the system manages scheduling and autoscaling based on resource requests. A resource in Ray is a key-value pair where the key denotes a resource name, and the value is a float quantity. For convenience, Ray has native support for CPU, GPU, and memory resource types; CPU, GPU and memory are called **pre-defined resources**. Besides those, Ray also supports :ref:`custom resources `. .. _logical-resources: Physical Resources and Logical Resources ---------------------------------------- Physical resources are resources that a machine physically has such as physical CPUs and GPUs and logical resources are virtual resources defined by a system. Ray resources are **logical** and don’t need to have 1-to-1 mapping with physical resources. For example, you can start a Ray head node with 0 logical CPUs via ``ray start --head --num-cpus=0`` even if it physically has eight (This signals the Ray scheduler to not schedule any tasks or actors that require logical CPU resources on the head node, mainly to reserve the head node for running Ray system processes.). They are mainly used for admission control during scheduling. The fact that resources are logical has several implications: - Resource requirements of tasks or actors do NOT impose limits on actual physical resource usage. For example, Ray doesn't prevent a ``num_cpus=1`` task from launching multiple threads and using multiple physical CPUs. It's your responsibility to make sure tasks or actors use no more resources than specified via resource requirements. - Ray doesn't provide CPU isolation for tasks or actors. For example, Ray won't reserve a physical CPU exclusively and pin a ``num_cpus=1`` task to it. Ray will let the operating system schedule and run the task instead. If needed, you can use operating system APIs like ``sched_setaffinity`` to pin a task to a physical CPU. - Ray does provide :ref:`GPU ` isolation in the form of *visible devices* by automatically setting the ``CUDA_VISIBLE_DEVICES`` environment variable, which most ML frameworks will respect for purposes of GPU assignment. .. _omp-num-thread-note: .. note:: Ray sets the environment variable ``OMP_NUM_THREADS=`` if ``num_cpus`` is set on the task/actor via :func:`ray.remote() ` and :meth:`task.options() `/:meth:`actor.options() `. Ray sets ``OMP_NUM_THREADS=1`` if ``num_cpus`` is not specified; this is done to avoid performance degradation with many workers (issue #6998). You can also override this by explicitly setting ``OMP_NUM_THREADS`` to override anything Ray sets by default. ``OMP_NUM_THREADS`` is commonly used in numpy, PyTorch, and Tensorflow to perform multi-threaded linear algebra. In multi-worker setting, we want one thread per worker instead of many threads per worker to avoid contention. Some other libraries may have their own way to configure parallelism. For example, if you're using OpenCV, you should manually set the number of threads using cv2.setNumThreads(num_threads) (set to 0 to disable multi-threading). .. figure:: ../images/physical_resources_vs_logical_resources.svg Physical resources vs logical resources .. _custom-resources: Custom Resources ---------------- You can specify custom resources for a Ray node and reference them to control scheduling for your tasks or actors. 
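As a quick, hedged illustration (the resource name ``special_hardware`` and the quantities are arbitrary choices for this sketch), you can declare a custom resource when starting Ray and then request it from a task:

.. code-block:: python

    import ray

    # Declare 2 units of a custom resource on this node.
    ray.init(resources={"special_hardware": 2})

    @ray.remote(resources={"special_hardware": 1})
    def use_special_hardware():
        return "scheduled on a node advertising special_hardware"

    ray.get(use_special_hardware.remote())

Because the resource is purely logical, Ray only uses it for admission control during scheduling; it doesn't limit what the task actually does.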
Use custom resources when you need to manage scheduling using numeric values. If you need simple label-based scheduling, use labels instead. See :doc:`labels`. .. _specify-node-resources: Specifying Node Resources ------------------------- By default, Ray nodes start with pre-defined CPU, GPU, and memory resources. The quantities of these logical resources on each node are set to the physical quantities auto detected by Ray. By default, logical resources are configured by the following rule. .. warning:: Ray **does not permit dynamic updates of resource capacities after Ray has been started on a node**. - **Number of logical CPUs** (``num_cpus``): Set to the number of CPUs of the machine/container. - **Number of logical GPUs** (``num_gpus``): Set to the number of GPUs of the machine/container. - **Memory** (``memory``): Set to 70% of "available memory" when ray runtime starts. - **Object Store Memory** (``object_store_memory``): Set to 30% of "available memory" when ray runtime starts. Note that the object store memory is not logical resource, and users cannot use it for scheduling. However, you can always override that by manually specifying the quantities of pre-defined resources and adding custom resources. There are several ways to do that depending on how you start the Ray cluster: .. tab-set:: .. tab-item:: ray.init() If you are using :func:`ray.init() ` to start a single node Ray cluster, you can do the following to manually specify node resources: .. literalinclude:: ../doc_code/resources.py :language: python :start-after: __specifying_node_resources_start__ :end-before: __specifying_node_resources_end__ .. tab-item:: ray start If you are using :ref:`ray start ` to start a Ray node, you can run: .. code-block:: shell ray start --head --num-cpus=3 --num-gpus=4 --resources='{"special_hardware": 1, "custom_label": 1}' .. tab-item:: ray up If you are using :ref:`ray up ` to start a Ray cluster, you can set the :ref:`resources field ` in the yaml file: .. code-block:: yaml available_node_types: head: ... resources: CPU: 3 GPU: 4 special_hardware: 1 custom_label: 1 .. tab-item:: KubeRay If you are using :ref:`KubeRay ` to start a Ray cluster, you can set the :ref:`rayStartParams field ` in the yaml file: .. code-block:: yaml headGroupSpec: rayStartParams: num-cpus: "3" num-gpus: "4" resources: '"{\"special_hardware\": 1, \"custom_label\": 1}"' .. _resource-requirements: Specifying Task or Actor Resource Requirements ---------------------------------------------- Ray allows specifying a task or actor's logical resource requirements (e.g., CPU, GPU, and custom resources). The task or actor will only run on a node if there are enough required logical resources available to execute the task or actor. By default, Ray tasks use 1 logical CPU resource and Ray actors use 1 logical CPU for scheduling, and 0 logical CPU for running. (This means, by default, actors cannot get scheduled on a zero-cpu node, but an infinite number of them can run on any non-zero cpu node. The default resource requirements for actors was chosen for historical reasons. It's recommended to always explicitly set ``num_cpus`` for actors to avoid any surprises. If resources are specified explicitly, they are required both at schedule time and at execution time.) You can also explicitly specify a task's or actor's logical resource requirements (for example, one task may require a GPU) instead of using default ones via :func:`ray.remote() ` and :meth:`task.options() `/:meth:`actor.options() `. .. tab-set:: .. 
tab-item:: Python .. literalinclude:: ../doc_code/resources.py :language: python :start-after: __specifying_resource_requirements_start__ :end-before: __specifying_resource_requirements_end__ .. tab-item:: Java .. code-block:: java // Specify required resources. Ray.task(MyRayApp::myFunction).setResource("CPU", 1.0).setResource("GPU", 1.0).setResource("special_hardware", 1.0).remote(); Ray.actor(Counter::new).setResource("CPU", 2.0).setResource("GPU", 1.0).remote(); .. tab-item:: C++ .. code-block:: c++ // Specify required resources. ray::Task(MyFunction).SetResource("CPU", 1.0).SetResource("GPU", 1.0).SetResource("special_hardware", 1.0).Remote(); ray::Actor(CreateCounter).SetResource("CPU", 2.0).SetResource("GPU", 1.0).Remote(); Task and actor resource requirements have implications for the Ray's scheduling concurrency. In particular, the sum of the logical resource requirements of all of the concurrently executing tasks and actors on a given node cannot exceed the node's total logical resources. This property can be used to :ref:`limit the number of concurrently running tasks or actors to avoid issues like OOM `. .. _fractional-resource-requirements: Fractional Resource Requirements ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Ray supports fractional resource requirements. For example, if your task or actor is IO bound and has low CPU usage, you can specify fractional CPU ``num_cpus=0.5`` or even zero CPU ``num_cpus=0``. The precision of the fractional resource requirement is 0.0001 so you should avoid specifying a double that's beyond that precision. .. literalinclude:: ../doc_code/resources.py :language: python :start-after: __specifying_fractional_resource_requirements_start__ :end-before: __specifying_fractional_resource_requirements_end__ .. note:: GPU, TPU, and neuron_cores resource requirements that are greater than 1, need to be whole numbers. For example, ``num_gpus=1.5`` is invalid. .. tip:: Besides resource requirements, you can also specify an environment for a task or actor to run in, which can include Python packages, local files, environment variables, and more. See :ref:`Runtime Environments ` for details. --- .. _start-ray: Starting Ray ============ This page covers how to start Ray on your single machine or cluster of machines. .. tip:: Be sure to have :ref:`installed Ray ` before following the instructions on this page. What is the Ray runtime? ------------------------ Ray programs are able to parallelize and distribute by leveraging an underlying *Ray runtime*. The Ray runtime consists of multiple services/processes started in the background for communication, data transfer, scheduling, and more. The Ray runtime can be started on a laptop, a single server, or multiple servers. There are three ways of starting the Ray runtime: * Implicitly via ``ray.init()`` (:ref:`start-ray-init`) * Explicitly via CLI (:ref:`start-ray-cli`) * Explicitly via the cluster launcher (:ref:`start-ray-up`) In all cases, ``ray.init()`` will try to automatically find a Ray instance to connect to. It checks, in order: 1. The ``RAY_ADDRESS`` OS environment variable. 2. The concrete address passed to ``ray.init(address=
)``. 3. If no address is provided, the latest Ray instance that was started on the same machine using ``ray start``. .. _start-ray-init: Starting Ray on a single machine -------------------------------- Calling ``ray.init()`` starts a local Ray instance on your laptop/machine. This laptop/machine becomes the "head node". .. note:: In recent versions of Ray (>=1.5), ``ray.init()`` will automatically be called on the first use of a Ray remote API. .. tab-set:: .. tab-item:: Python .. testcode:: :hide: import ray ray.shutdown() .. testcode:: import ray # Other Ray APIs will not work until `ray.init()` is called. ray.init() .. tab-item:: Java .. code-block:: java import io.ray.api.Ray; public class MyRayApp { public static void main(String[] args) { // Other Ray APIs will not work until `Ray.init()` is called. Ray.init(); ... } } .. tab-item:: C++ .. code-block:: c++ #include // Other Ray APIs will not work until `ray::Init()` is called. ray::Init() When the process calling ``ray.init()`` terminates, the Ray runtime will also terminate. To explicitly stop or restart Ray, use the shutdown API. .. tab-set:: .. tab-item:: Python .. testcode:: :hide: ray.shutdown() .. testcode:: import ray ray.init() ... # ray program ray.shutdown() .. tab-item:: Java .. code-block:: java import io.ray.api.Ray; public class MyRayApp { public static void main(String[] args) { Ray.init(); ... // ray program Ray.shutdown(); } } .. tab-item:: C++ .. code-block:: c++ #include ray::Init() ... // ray program ray::Shutdown() To check if Ray is initialized, use the ``is_initialized`` API. .. tab-set:: .. tab-item:: Python .. testcode:: import ray ray.init() assert ray.is_initialized() ray.shutdown() assert not ray.is_initialized() .. tab-item:: Java .. code-block:: java import io.ray.api.Ray; public class MyRayApp { public static void main(String[] args) { Ray.init(); Assert.assertTrue(Ray.isInitialized()); Ray.shutdown(); Assert.assertFalse(Ray.isInitialized()); } } .. tab-item:: C++ .. code-block:: c++ #include int main(int argc, char **argv) { ray::Init(); assert(ray::IsInitialized()); ray::Shutdown(); assert(!ray::IsInitialized()); } See the `Configuration `__ documentation for the various ways to configure Ray. .. _start-ray-cli: Starting Ray via the CLI (``ray start``) ---------------------------------------- Use ``ray start`` from the CLI to start a 1 node ray runtime on a machine. This machine becomes the "head node". .. code-block:: bash $ ray start --head --port=6379 Local node IP: 192.123.1.123 2020-09-20 10:38:54,193 INFO services.py:1166 -- View the Ray dashboard at http://localhost:8265 -------------------- Ray runtime started. -------------------- ... You can connect to this Ray instance by starting a driver process on the same node as where you ran ``ray start``. ``ray.init()`` will now automatically connect to the latest Ray instance. .. tab-set:: .. tab-item:: Python .. testcode:: import ray ray.init() .. tab-item:: java .. code-block:: java import io.ray.api.Ray; public class MyRayApp { public static void main(String[] args) { Ray.init(); ... } } .. code-block:: bash java -classpath \ -Dray.address=
\ .. tab-item:: C++ .. code-block:: c++ #include int main(int argc, char **argv) { ray::Init(); ... } .. code-block:: bash RAY_ADDRESS=
./ You can connect other nodes to the head node, creating a Ray cluster by also calling ``ray start`` on those nodes. See :ref:`on-prem` for more details. Calling ``ray.init()`` on any of the cluster machines will connect to the same Ray cluster. .. _start-ray-up: Launching a Ray cluster (``ray up``) ------------------------------------ Ray clusters can be launched with the :ref:`Cluster Launcher `. The ``ray up`` command uses the Ray cluster launcher to start a cluster on the cloud, creating a designated "head node" and worker nodes. Underneath the hood, it automatically calls ``ray start`` to create a Ray cluster. Your code **only** needs to execute on one machine in the cluster (usually the head node). Read more about :ref:`running programs on a Ray cluster `. To connect to the Ray cluster, call ``ray.init`` from one of the machines in the cluster. This will connect to the latest Ray cluster: .. testcode:: :hide: ray.shutdown() .. testcode:: ray.init() Note that the machine calling ``ray up`` will not be considered as part of the Ray cluster, and therefore calling ``ray.init`` on that same machine will not attach to the cluster. What's next? ------------ Check out our `Deployment section <../cluster/getting-started.html>`_ for more information on deploying Ray in different settings, including `Kubernetes <../cluster/kubernetes/index.html>`_, `YARN <../cluster/vms/user-guides/community/yarn.html>`_, and `SLURM <../cluster/vms/user-guides/community/slurm.html>`_. --- .. _dynamic_generators: Dynamic generators ================== .. warning:: ``num_returns="dynamic"`` :ref:`generator API ` is soft deprecated as of Ray 2.8 due to its :ref:`limitation `. Use the :ref:`streaming generator API` instead. Python generators are functions that behave like iterators, yielding one value per iteration. Ray supports remote generators for two use cases: 1. To reduce max heap memory usage when returning multiple values from a remote function. See the :ref:`design pattern guide ` for an example. 2. When the number of return values is set dynamically by the remote function instead of by the caller. Remote generators can be used in both actor and non-actor tasks. .. _static-generators: `num_returns` set by the task caller ------------------------------------ Where possible, the caller should set the remote function's number of return values using ``@ray.remote(num_returns=x)`` or ``foo.options(num_returns=x).remote()``. Ray will return this many ``ObjectRefs`` to the caller. The remote task should then return the same number of values, usually as a tuple or list. Compared to setting the number of return values dynamically, this adds less complexity to user code and less performance overhead, as Ray will know exactly how many ``ObjectRefs`` to return to the caller ahead of time. Without changing the caller's syntax, we can also use a remote generator function to yield the values iteratively. The generator should yield the same number of return values specified by the caller, and these will be stored one at a time in Ray's object store. An error will be raised for generators that yield a different number of values from the one specified by the caller. For example, we can swap the following code that returns a list of return values: .. literalinclude:: ../doc_code/pattern_generators.py :language: python :start-after: __large_values_start__ :end-before: __large_values_end__ for this code, which uses a generator function: .. 
literalinclude:: ../doc_code/pattern_generators.py :language: python :start-after: __large_values_generator_start__ :end-before: __large_values_generator_end__ The advantage of doing so is that the generator function does not need to hold all of its return values in memory at once. It can yield the arrays one at a time to reduce memory pressure. .. _dynamic-generators: `num_returns` set by the task executor -------------------------------------- In some cases, the caller may not know the number of return values to expect from a remote function. For example, suppose we want to write a task that breaks up its argument into equal-size chunks and returns these. We may not know the size of the argument until we execute the task, so we don't know the number of return values to expect. In these cases, we can use a remote generator function that returns a *dynamic* number of values. To use this feature, set ``num_returns="dynamic"`` in the ``@ray.remote`` decorator or the remote function's ``.options()``. Then, when invoking the remote function, Ray will return a *single* ``ObjectRef`` that will get populated with an ``DynamicObjectRefGenerator`` when the task completes. The ``DynamicObjectRefGenerator`` can be used to iterate over a list of ``ObjectRefs`` containing the actual values returned by the task. .. literalinclude:: ../doc_code/generator.py :language: python :start-after: __dynamic_generator_start__ :end-before: __dynamic_generator_end__ We can also pass the ``ObjectRef`` returned by a task with ``num_returns="dynamic"`` to another task. The task will receive the ``DynamicObjectRefGenerator``, which it can use to iterate over the task's return values. Similarly, you can also pass an ``ObjectRefGenerator`` as a task argument. .. literalinclude:: ../doc_code/generator.py :language: python :start-after: __dynamic_generator_pass_start__ :end-before: __dynamic_generator_pass_end__ Exception handling ------------------ If a generator function raises an exception before yielding all its values, the values that it already stored will still be accessible through their ``ObjectRefs``. The remaining ``ObjectRefs`` will contain the raised exception. This is true for both static and dynamic ``num_returns``. If the task was called with ``num_returns="dynamic"``, the exception will be stored as an additional final ``ObjectRef`` in the ``DynamicObjectRefGenerator``. .. literalinclude:: ../doc_code/generator.py :language: python :start-after: __generator_errors_start__ :end-before: __generator_errors_end__ Note that there is currently a known bug where exceptions will not be propagated for generators that yield more values than expected. This can occur in two cases: 1. When ``num_returns`` is set by the caller, but the generator task returns more than this value. 2. When a generator task with ``num_returns="dynamic"`` is :ref:`re-executed `, and the re-executed task yields more values than the original execution. Note that in general, Ray does not guarantee correctness for task re-execution if the task is nondeterministic, and it is recommended to set ``@ray.remote(max_retries=0)`` for such tasks. .. literalinclude:: ../doc_code/generator.py :language: python :start-after: __generator_errors_unsupported_start__ :end-before: __generator_errors_unsupported_end__ .. _dynamic-generators-limitation: Limitations ----------- Although a generator function creates ``ObjectRefs`` one at a time, currently Ray will not schedule dependent tasks until the entire task is complete and all values have been created. 
This is similar to the semantics used by tasks that return multiple values as a list. --- Nested Remote Functions ======================= Remote functions can call other remote functions, resulting in nested tasks. For example, consider the following. .. literalinclude:: ../doc_code/nested-tasks.py :language: python :start-after: __nested_start__ :end-before: __nested_end__ Then calling ``g`` and ``h`` produces the following behavior. .. code-block:: bash >>> ray.get(g.remote()) [ObjectRef(b1457ba0911ae84989aae86f89409e953dd9a80e), ObjectRef(7c14a1d13a56d8dc01e800761a66f09201104275), ObjectRef(99763728ffc1a2c0766a2000ebabded52514e9a6), ObjectRef(9c2f372e1933b04b2936bb6f58161285829b9914)] >>> ray.get(h.remote()) [1, 1, 1, 1] **One limitation** is that the definition of ``f`` must come before the definitions of ``g`` and ``h`` because as soon as ``g`` is defined, it will be pickled and shipped to the workers, and so if ``f`` hasn't been defined yet, the definition will be incomplete. Yielding Resources While Blocked -------------------------------- Ray will release CPU resources when being blocked. This prevents deadlock cases where the nested tasks are waiting for the CPU resources held by the parent task. Consider the following remote function. .. literalinclude:: ../doc_code/nested-tasks.py :language: python :start-after: __yield_start__ :end-before: __yield_end__ When a ``g`` task is executing, it will release its CPU resources when it gets blocked in the call to ``ray.get``. It will reacquire the CPU resources when ``ray.get`` returns. It will retain its GPU resources throughout the lifetime of the task because the task will most likely continue to use GPU memory. --- .. _ray-remote-functions: Tasks ===== Ray enables arbitrary functions to be executed asynchronously on separate worker processes. Such functions are called **Ray remote functions** and their asynchronous invocations are called **Ray tasks**. Here is an example. .. tab-set:: .. tab-item:: Python .. literalinclude:: doc_code/tasks.py :language: python :start-after: __tasks_start__ :end-before: __tasks_end__ See the :func:`ray.remote` API for more details. .. tab-item:: Java .. code-block:: java public class MyRayApp { // A regular Java static method. public static int myFunction() { return 1; } } // Invoke the above method as a Ray task. // This will immediately return an object ref (a future) and then create // a task that will be executed on a worker process. ObjectRef res = Ray.task(MyRayApp::myFunction).remote(); // The result can be retrieved with ``ObjectRef::get``. Assert.assertTrue(res.get() == 1); public class MyRayApp { public static int slowFunction() throws InterruptedException { TimeUnit.SECONDS.sleep(10); return 1; } } // Ray tasks are executed in parallel. // All computation is performed in the background, driven by Ray's internal event loop. for(int i = 0; i < 4; i++) { // This doesn't block. Ray.task(MyRayApp::slowFunction).remote(); } .. tab-item:: C++ .. code-block:: c++ // A regular C++ function. int MyFunction() { return 1; } // Register as a remote function by `RAY_REMOTE`. RAY_REMOTE(MyFunction); // Invoke the above method as a Ray task. // This will immediately return an object ref (a future) and then create // a task that will be executed on a worker process. auto res = ray::Task(MyFunction).Remote(); // The result can be retrieved with ``ray::ObjectRef::Get``. 
assert(*res.Get() == 1); int SlowFunction() { std::this_thread::sleep_for(std::chrono::seconds(10)); return 1; } RAY_REMOTE(SlowFunction); // Ray tasks are executed in parallel. // All computation is performed in the background, driven by Ray's internal event loop. for(int i = 0; i < 4; i++) { // This doesn't block. ray::Task(SlowFunction).Remote(); } Use `ray summary tasks` from :ref:`State API ` to see running and finished tasks and count: .. code-block:: bash # This API is only available when you download Ray via `pip install "ray[default]"` ray summary tasks .. code-block:: bash ======== Tasks Summary: 2023-05-26 11:09:32.092546 ======== Stats: ------------------------------------ total_actor_scheduled: 0 total_actor_tasks: 0 total_tasks: 5 Table (group by func_name): ------------------------------------ FUNC_OR_CLASS_NAME STATE_COUNTS TYPE 0 slow_function RUNNING: 4 NORMAL_TASK 1 my_function FINISHED: 1 NORMAL_TASK Specifying required resources ----------------------------- You can specify resource requirements in tasks (see :ref:`resource-requirements` for more details.) .. tab-set:: .. tab-item:: Python .. literalinclude:: doc_code/tasks.py :language: python :start-after: __resource_start__ :end-before: __resource_end__ .. tab-item:: Java .. code-block:: java // Specify required resources. Ray.task(MyRayApp::myFunction).setResource("CPU", 4.0).setResource("GPU", 2.0).remote(); .. tab-item:: C++ .. code-block:: c++ // Specify required resources. ray::Task(MyFunction).SetResource("CPU", 4.0).SetResource("GPU", 2.0).Remote(); .. _ray-object-refs: Passing object refs to Ray tasks --------------------------------------- In addition to values, `Object refs `__ can also be passed into remote functions. When the task gets executed, inside the function body **the argument will be the underlying value**. For example, take this function: .. tab-set:: .. tab-item:: Python .. literalinclude:: doc_code/tasks.py :language: python :start-after: __pass_by_ref_start__ :end-before: __pass_by_ref_end__ .. tab-item:: Java .. code-block:: java public class MyRayApp { public static int functionWithAnArgument(int value) { return value + 1; } } ObjectRef objRef1 = Ray.task(MyRayApp::myFunction).remote(); Assert.assertTrue(objRef1.get() == 1); // You can pass an object ref as an argument to another Ray task. ObjectRef objRef2 = Ray.task(MyRayApp::functionWithAnArgument, objRef1).remote(); Assert.assertTrue(objRef2.get() == 2); .. tab-item:: C++ .. code-block:: c++ static int FunctionWithAnArgument(int value) { return value + 1; } RAY_REMOTE(FunctionWithAnArgument); auto obj_ref1 = ray::Task(MyFunction).Remote(); assert(*obj_ref1.Get() == 1); // You can pass an object ref as an argument to another Ray task. auto obj_ref2 = ray::Task(FunctionWithAnArgument).Remote(obj_ref1); assert(*obj_ref2.Get() == 2); Note the following behaviors: - As the second task depends on the output of the first task, Ray will not execute the second task until the first task has finished. - If the two tasks are scheduled on different machines, the output of the first task (the value corresponding to ``obj_ref1/objRef1``) will be sent over the network to the machine where the second task is scheduled. Waiting for Partial Results --------------------------- Calling **ray.get** on Ray task results will block until the task finished execution. After launching a number of tasks, you may want to know which ones have finished executing without blocking on all of them. This could be achieved by :func:`ray.wait() `. 
The function works as follows. .. tab-set:: .. tab-item:: Python .. literalinclude:: doc_code/tasks.py :language: python :start-after: __wait_start__ :end-before: __wait_end__ .. tab-item:: Java .. code-block:: java WaitResult waitResult = Ray.wait(objectRefs, /*num_returns=*/0, /*timeoutMs=*/1000); System.out.println(waitResult.getReady()); // List of ready objects. System.out.println(waitResult.getUnready()); // list of unready objects. .. tab-item:: C++ .. code-block:: c++ ray::WaitResult wait_result = ray::Wait(object_refs, /*num_objects=*/0, /*timeout_ms=*/1000); Generators ---------- Ray is compatible with Python generator syntax. See :ref:`Ray Generators ` for more details. .. _ray-task-returns: Multiple returns ---------------- By default, a Ray task only returns a single Object Ref. However, you can configure Ray tasks to return multiple Object Refs, by setting the ``num_returns`` option. .. tab-set:: .. tab-item:: Python .. literalinclude:: doc_code/tasks.py :language: python :start-after: __multiple_returns_start__ :end-before: __multiple_returns_end__ For tasks that return multiple objects, Ray also supports remote generators that allow a task to return one object at a time to reduce memory usage at the worker. Ray also supports an option to set the number of return values dynamically, which can be useful when the task caller does not know how many return values to expect. See the :ref:`user guide ` for more details on use cases. .. tab-set:: .. tab-item:: Python .. literalinclude:: doc_code/tasks.py :language: python :start-after: __generator_start__ :end-before: __generator_end__ .. _ray-task-cancel: Cancelling tasks ---------------- Ray tasks can be canceled by calling :func:`ray.cancel() ` on the returned Object ref. .. tab-set:: .. tab-item:: Python .. literalinclude:: doc_code/tasks.py :language: python :start-after: __cancel_start__ :end-before: __cancel_end__ Scheduling ---------- For each task, Ray will choose a node to run it and the scheduling decision is based on a few factors like :ref:`the task's resource requirements `, :ref:`the specified scheduling strategy ` and :ref:`locations of task arguments `. See :ref:`Ray scheduling ` for more details. Fault Tolerance --------------- By default, Ray will :ref:`retry ` failed tasks due to system failures and specified application-level failures. You can change this behavior by setting ``max_retries`` and ``retry_exceptions`` options in :func:`ray.remote() ` and :meth:`.options() `. See :ref:`Ray fault tolerance ` for more details. .. _task-events: Task Events ----------- By default, Ray traces the execution of tasks, reporting task status events and profiling events that the Ray Dashboard and :ref:`State API ` use. You can change this behavior by setting ``enable_task_events`` options in :func:`ray.remote() ` and :meth:`.options() ` to disable task events, which reduces the overhead of task execution, and the amount of data the task sends to the Ray Dashboard. Nested tasks don't inherit the task events settings from the parent task. You need to set the task events settings for each task separately. More about Ray Tasks -------------------- .. toctree:: :maxdepth: 1 tasks/nested-tasks.rst --- Tips for first-time users ========================= Ray provides a highly flexible, yet minimalist and easy to use API. On this page, we describe several tips that can help first-time Ray users to avoid some common mistakes that can significantly hurt the performance of their programs. 
For an in-depth treatment of advanced design patterns, please read :ref:`core design patterns `. .. list-table:: The core Ray API we use in this document. :header-rows: 1 * - API - Description * - ``ray.init()`` - Initialize Ray context. * - ``@ray.remote`` - | Function or class decorator specifying that the function will be | executed as a task or the class as an actor in a different process. * - ``.remote()`` - | Postfix to every remote function, remote class declaration, or | invocation of a remote class method. | Remote operations are asynchronous. * - ``ray.put()`` - | Store object in object store, and return its ID. | This ID can be used to pass object as an argument | to any remote function or method call. | This is a synchronous operation. * - ``ray.get()`` - | Return an object or list of objects from the object ID | or list of object IDs. | This is a synchronous (i.e., blocking) operation. * - ``ray.wait()`` - | From a list of object IDs, returns | (1) the list of IDs of the objects that are ready, and | (2) the list of IDs of the objects that are not ready yet. | By default, it returns one ready object ID at a time. All the results reported in this page were obtained on a 13-inch MacBook Pro with a 2.7 GHz Core i7 CPU and 16GB of RAM. While ``ray.init()`` automatically detects the number of cores when it runs on a single machine, to reduce the variability of the results you observe on your machine when running the code below, here we specify num_cpus = 4, i.e., a machine with 4 CPUs. Since each task requests by default one CPU, this setting allows us to execute up to four tasks in parallel. As a result, our Ray system consists of one driver executing the program, and up to four workers running remote tasks or actors. .. _tip-delay-get: Tip 1: Delay ray.get() ---------------------- With Ray, the invocation of every remote operation (e.g., task, actor method) is asynchronous. This means that the operation immediately returns a promise/future, which is essentially an identifier (ID) of the operation’s result. This is key to achieving parallelism, as it allows the driver program to launch multiple operations in parallel. To get the actual results, the programmer needs to call ``ray.get()`` on the IDs of the results. This call blocks until the results are available. As a side effect, this operation also blocks the driver program from invoking other operations, which can hurt parallelism. Unfortunately, it is quite natural for a new Ray user to inadvertently use ``ray.get()``. To illustrate this point, consider the following simple Python code which calls the ``do_some_work()`` function four times, where each invocation takes around 1 sec: .. testcode:: import ray import time def do_some_work(x): time.sleep(1) # Replace this with work you need to do. return x start = time.time() results = [do_some_work(x) for x in range(4)] print("duration =", time.time() - start) print("results =", results) The output of a program execution is below. As expected, the program takes around 4 seconds: .. testoutput:: :options: +MOCK duration = 4.0149290561676025 results = [0, 1, 2, 3] Now, let’s parallelize the above program with Ray. Some first-time users will do this by just making the function remote, i.e., .. testcode:: :hide: import ray ray.shutdown() .. testcode:: import time import ray ray.init(num_cpus=4) # Specify this system has 4 CPUs. @ray.remote def do_some_work(x): time.sleep(1) # Replace this with work you need to do. 
return x start = time.time() results = [do_some_work.remote(x) for x in range(4)] print("duration =", time.time() - start) print("results =", results) However, when executing the above program one gets: .. testoutput:: :options: +MOCK duration = 0.0003619194030761719 results = [ObjectRef(df5a1a828c9685d3ffffffff0100000001000000), ObjectRef(cb230a572350ff44ffffffff0100000001000000), ObjectRef(7bbd90284b71e599ffffffff0100000001000000), ObjectRef(bd37d2621480fc7dffffffff0100000001000000)] When looking at this output, two things jump out. First, the program finishes immediately, i.e., in less than 1 ms. Second, instead of the expected results (i.e., [0, 1, 2, 3]), we get a bunch of identifiers. Recall that remote operations are asynchronous and they return futures (i.e., object IDs) instead of the results themselves. This is exactly what we see here. We measure only the time it takes to invoke the tasks, not their running times, and we get the IDs of the results corresponding to the four tasks. To get the actual results, we need to use ray.get(), and here the first instinct is to just call ``ray.get()`` on the remote operation invocation, i.e., replace line 12 with: .. testcode:: results = [ray.get(do_some_work.remote(x)) for x in range(4)] By re-running the program after this change we get: .. testoutput:: :options: +MOCK duration = 4.018050909042358 results = [0, 1, 2, 3] So now the results are correct, but it still takes 4 seconds, so no speedup! What’s going on? The observant reader will already have the answer: ``ray.get()`` is blocking so calling it after each remote operation means that we wait for that operation to complete, which essentially means that we execute one operation at a time, hence no parallelism! To enable parallelism, we need to call ``ray.get()`` after invoking all tasks. We can easily do so in our example by replacing line 12 with: .. testcode:: results = ray.get([do_some_work.remote(x) for x in range(4)]) By re-running the program after this change we now get: .. testoutput:: :options: +MOCK duration = 1.0064549446105957 results = [0, 1, 2, 3] So finally, success! Our Ray program now runs in just 1 second which means that all invocations of ``do_some_work()`` are running in parallel. In summary, always keep in mind that ``ray.get()`` is a blocking operation, and thus if called eagerly it can hurt the parallelism. Instead, you should try to write your program such that ``ray.get()`` is called as late as possible. Tip 2: Avoid tiny tasks ----------------------- When a first-time developer wants to parallelize their code with Ray, the natural instinct is to make every function or class remote. Unfortunately, this can lead to undesirable consequences; if the tasks are very small, the Ray program can take longer than the equivalent Python program. Let’s consider again the above examples, but this time we make the tasks much shorter (i.e, each takes just 0.1ms), and dramatically increase the number of task invocations to 100,000. .. testcode:: import time def tiny_work(x): time.sleep(0.0001) # Replace this with work you need to do. return x start = time.time() results = [tiny_work(x) for x in range(100000)] print("duration =", time.time() - start) By running this program we get: .. testoutput:: :options: +MOCK duration = 13.36544418334961 This result should be expected since the lower bound of executing 100,000 tasks that take 0.1ms each is 10s, to which we need to add other overheads such as function calls, etc. 
Let's now parallelize this code using Ray, by making every invocation of ``tiny_work()`` remote:

.. testcode::

    import time
    import ray

    @ray.remote
    def tiny_work(x):
        time.sleep(0.0001)  # Replace this with work you need to do.
        return x

    start = time.time()
    result_ids = [tiny_work.remote(x) for x in range(100000)]
    results = ray.get(result_ids)
    print("duration =", time.time() - start)

The result of running this code is:

.. testoutput::
    :options: +MOCK

    duration = 27.46447515487671

Surprisingly, not only did Ray fail to improve the execution time, the Ray program is actually slower than the sequential program! What's going on? Well, the issue here is that every task invocation has a non-trivial overhead (e.g., scheduling, inter-process communication, updating the system state) and this overhead dominates the actual time it takes to execute the task.

One way to speed up this program is to make the remote tasks larger in order to amortize the invocation overhead. Here is one possible solution where we aggregate 1000 ``tiny_work()`` function calls in a single bigger remote function:

.. testcode::

    import time
    import ray

    def tiny_work(x):
        time.sleep(0.0001)  # Replace this with work you need to do.
        return x

    @ray.remote
    def mega_work(start, end):
        return [tiny_work(x) for x in range(start, end)]

    start = time.time()
    result_ids = [mega_work.remote(x * 1000, (x + 1) * 1000) for x in range(100)]
    results = ray.get(result_ids)
    print("duration =", time.time() - start)

Now, if we run the above program we get:

.. testoutput::
    :options: +MOCK

    duration = 3.2539820671081543

This is approximately one fourth of the sequential execution, in line with our expectations (recall that we can run four tasks in parallel). Of course, the natural question is how large is large enough for a task to amortize the remote invocation overhead. One way to find this is to run the following simple program to estimate the per-task invocation overhead:

.. testcode::

    @ray.remote
    def no_work(x):
        return x

    start = time.time()
    num_calls = 1000
    for x in range(num_calls):
        ray.get(no_work.remote(x))
    print("per task overhead (ms) =", (time.time() - start) * 1000 / num_calls)

Running the above program on a 2018 MacBook Pro notebook shows:

.. testoutput::
    :options: +MOCK

    per task overhead (ms) = 0.4739549160003662

In other words, it takes almost half a millisecond to execute an empty task. This suggests that we will need to make sure a task takes at least a few milliseconds to amortize the invocation overhead. One caveat is that the per-task overhead will vary from machine to machine, and between tasks that run on the same machine versus remotely. This being said, making sure that tasks take at least a few milliseconds is a good rule of thumb when developing Ray programs.

Tip 3: Avoid passing the same object repeatedly to remote tasks
----------------------------------------------------------------

When we pass a large object as an argument to a remote function, Ray calls ``ray.put()`` under the hood to store that object in the local object store. This can significantly improve the performance of a remote task invocation when the remote task is executed locally, as all local tasks share the object store.

However, there are cases when automatically calling ``ray.put()`` on a task invocation leads to performance issues. One example is passing the same large object as an argument repeatedly, as illustrated by the program below:
.. testcode::

    import time
    import numpy as np
    import ray

    @ray.remote
    def no_work(a):
        return

    start = time.time()
    a = np.zeros((5000, 5000))
    result_ids = [no_work.remote(a) for x in range(10)]
    results = ray.get(result_ids)
    print("duration =", time.time() - start)

This program outputs:

.. testoutput::
    :options: +MOCK

    duration = 1.0837509632110596

This running time is quite large for a program that calls just 10 remote tasks that do nothing. The reason for this unexpectedly high running time is that each time we invoke ``no_work(a)``, Ray calls ``ray.put(a)``, which copies array ``a`` to the object store. Since array ``a`` has 25 million entries, copying it takes a non-trivial amount of time.

To avoid copying array ``a`` every time ``no_work()`` is invoked, one simple solution is to explicitly call ``ray.put(a)``, and then pass ``a``'s ID to ``no_work()``, as illustrated below:

.. testcode::
    :hide:

    import ray
    ray.shutdown()

.. testcode::

    import time
    import numpy as np
    import ray

    ray.init(num_cpus=4)

    @ray.remote
    def no_work(a):
        return

    start = time.time()
    a_id = ray.put(np.zeros((5000, 5000)))
    result_ids = [no_work.remote(a_id) for x in range(10)]
    results = ray.get(result_ids)
    print("duration =", time.time() - start)

Running this program takes only:

.. testoutput::
    :options: +MOCK

    duration = 0.132796049118042

This is roughly 8 times faster than the original program, which is to be expected since the main overhead of invoking ``no_work(a)`` was copying the array ``a`` to the object store, which now happens only once. Arguably a more important advantage of avoiding multiple copies of the same object is that it prevents the object store from filling up prematurely and incurring the cost of object eviction.

Tip 4: Pipeline data processing
-------------------------------

If we use ``ray.get()`` on the results of multiple tasks, we will have to wait until the last one of these tasks finishes. This can be an issue if tasks take widely different amounts of time.

To illustrate this issue, consider the following example where we run four ``do_some_work()`` tasks in parallel, with each task taking a time uniformly distributed between 0 and 4 seconds. Next, assume the results of these tasks are processed by ``process_results()``, which takes 1 sec per result. The expected running time is then (1) the time it takes to execute the slowest of the ``do_some_work()`` tasks, plus (2) 4 seconds, which is the time it takes to execute ``process_results()``.

.. testcode::

    import time
    import random
    import ray

    @ray.remote
    def do_some_work(x):
        time.sleep(random.uniform(0, 4))  # Replace this with work you need to do.
        return x

    def process_results(results):
        sum = 0
        for x in results:
            time.sleep(1)  # Replace this with some processing code.
            sum += x
        return sum

    start = time.time()
    data_list = ray.get([do_some_work.remote(x) for x in range(4)])
    sum = process_results(data_list)
    print("duration =", time.time() - start, "\nresult = ", sum)

The output of the program shows that it takes close to 8 sec to run:

.. testoutput::
    :options: +MOCK

    duration = 7.82636022567749
    result = 6

Waiting for the last task to finish when the other tasks might have finished much earlier unnecessarily increases the program running time. A better solution would be to process the data as soon as it becomes available. Fortunately, Ray allows you to do exactly this by calling ``ray.wait()`` on a list of object IDs. Without specifying any other parameters, this function returns as soon as an object in its argument list is ready.
This call has two returns: (1) the ID of the ready object, and (2) the list containing the IDs of the objects not ready yet. The modified program is below. Note that one change we need to make is to replace ``process_results()`` with ``process_incremental()``, which processes one result at a time.

.. testcode::

    import time
    import random
    import ray

    @ray.remote
    def do_some_work(x):
        time.sleep(random.uniform(0, 4))  # Replace this with work you need to do.
        return x

    def process_incremental(sum, result):
        time.sleep(1)  # Replace this with some processing code.
        return sum + result

    start = time.time()
    result_ids = [do_some_work.remote(x) for x in range(4)]
    sum = 0
    while len(result_ids):
        done_id, result_ids = ray.wait(result_ids)
        sum = process_incremental(sum, ray.get(done_id[0]))
    print("duration =", time.time() - start, "\nresult = ", sum)

This program now takes just a bit over 4.8 sec, a significant improvement:

.. testoutput::
    :options: +MOCK

    duration = 4.852453231811523
    result = 6

To aid intuition, Figure 1 shows the execution timeline in both cases: when using ``ray.get()`` to wait for all results to become available before processing them, and when using ``ray.wait()`` to start processing the results as soon as they become available.

.. figure:: /images/pipeline.png

    Figure 1: (a) Execution timeline when using ``ray.get()`` to wait for all results from ``do_some_work()`` tasks before calling ``process_results()``. (b) Execution timeline when using ``ray.wait()`` to process results as soon as they become available.

---

.. _core-use-guide:

User Guides
===========

This section explains how to use Ray's key concepts to build distributed applications. If you're brand new to Ray, we recommend starting with the :ref:`walkthrough `.

.. toctree::
    :maxdepth: 4

    tasks
    actors
    objects
    handling-dependencies
    scheduling/index.rst
    fault-tolerance
    patterns/index.rst
    direct-transport
    compiled-graph/ray-compiled-graph
    advanced-topics

---

Lifetimes of a User-Spawned Process
====================================

When you spawn child processes from Ray workers, you are responsible for managing the lifetime of those child processes. However, this is not always possible, especially when the worker crashes or the child processes are spawned by libraries (for example, the torch dataloader). To avoid leaking user-spawned processes, Ray provides mechanisms to kill all user-spawned processes when the worker that started them exits. This feature prevents GPU memory leaks from child processes (e.g., torch).

Ray provides the following mechanisms to handle subprocess killing on worker exit:

- ``RAY_kill_child_processes_on_worker_exit`` (default ``true``): Only works on Linux. If true, the worker kills all *direct* child processes on exit. This won't work if the worker crashed. This is NOT recursive, in that grandchild processes are not killed by this mechanism.
- ``RAY_kill_child_processes_on_worker_exit_with_raylet_subreaper`` (default ``false``): Only works on Linux kernels 3.4 or later. If true, after the worker exits, the Raylet *recursively* kills any child and grandchild processes that the worker spawned. This works even if the worker crashed. The killing happens within 10 seconds of the worker's death.
- ``RAY_process_group_cleanup_enabled`` (default ``false``): If true (POSIX), Ray isolates each worker into its own process group at spawn and cleans up the worker's process group on worker exit via `killpg`. Processes that intentionally call `setsid()` will detach and not be killed by this cleanup.

On non-Linux platforms, the subreaper is not available. Per-worker process groups are supported on POSIX platforms; on Windows, neither the subreaper nor process groups apply. On platforms without support, users should manage child processes explicitly.
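For example, a minimal sketch of turning on the process-group cleanup when starting a node, assuming the flag is passed as a system-config environment variable in the same way as the subreaper flag shown later on this page:

.. code-block:: bash

    # Assumption: RAY_process_group_cleanup_enabled is read as a system-config
    # environment variable at startup, like the other RAY_* flags listed above.
    RAY_process_group_cleanup_enabled=true ray start --head
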
Note: This feature is meant to be a last resort to kill orphaned processes. It is not a replacement for proper process management. Users should still manage the lifetime of their processes and clean up properly.

.. contents::
    :local:

User-Spawned Process Killed on Worker Exit
------------------------------------------

The following example uses a Ray Actor to spawn a user process. The user process is a sleep process.

.. testcode::

    import ray
    import psutil
    import subprocess
    import time
    import os

    ray.init(_system_config={"kill_child_processes_on_worker_exit_with_raylet_subreaper": True})

    @ray.remote
    class MyActor:
        def __init__(self):
            pass

        def start(self):
            # Start a user process.
            process = subprocess.Popen(["/bin/bash", "-c", "sleep 10000"])
            return process.pid

        def signal_my_pid(self):
            import signal
            os.kill(os.getpid(), signal.SIGKILL)

    actor = MyActor.remote()
    pid = ray.get(actor.start.remote())
    assert psutil.pid_exists(pid)  # The subprocess is running.

    actor.signal_my_pid.remote()  # SIGKILL'ed, so the worker's own subprocess killing no longer works.
    time.sleep(11)  # The raylet kills orphans every 10 seconds.
    assert not psutil.pid_exists(pid)

Enabling the feature
--------------------

To enable the subreaper feature (deprecated), set the corresponding system config via `_system_config` or the equivalent cluster configuration at startup. You must restart the cluster to apply the change. Prefer enabling `process_group_cleanup_enabled` instead.

.. code-block:: bash

    RAY_kill_child_processes_on_worker_exit_with_raylet_subreaper=true ray start --head

Another way is to enable it during ``ray.init()`` by adding a ``_system_config`` like this:

.. code-block:: python

    ray.init(_system_config={"kill_child_processes_on_worker_exit_with_raylet_subreaper": True})

⚠️ Caution: Core worker now reaps zombies; toggle it back if you want to ``waitpid``
--------------------------------------------------------------------------------------

When subreaper is enabled, the worker process also becomes a subreaper (Linux), meaning some grandchild processes can be reparented to the worker process. The worker sets ``SIGCHLD`` to ``SIG_IGN``. If you need to wait for a child process to exit, reset ``SIGCHLD`` to ``SIG_DFL`` first:

.. code-block:: python

    import signal

    signal.signal(signal.SIGCHLD, signal.SIG_DFL)

Under the hood
--------------

This feature is implemented by setting the `prctl(PR_SET_CHILD_SUBREAPER, 1)` flag on the Raylet process, which spawns all Ray workers. See `prctl(2) `_. This flag makes the Raylet process a "subreaper", which means that if a descendant process dies, the dead child's own child processes are reparented to the Raylet process. The subreaper mechanism is deprecated in favor of per-worker process groups.

The Raylet maintains a list of "known" direct child PIDs it spawns. When the Raylet process receives the SIGCHLD signal, it knows that one of its child processes (for example, a worker) has died and that there may be reparented orphan processes. The Raylet then lists all of its child PIDs (those whose ppid equals the Raylet PID), and if a child PID is not "known" (that is, not in the list of direct child PIDs), the Raylet treats it as an orphan process and kills it via `SIGKILL`. For a deep chain of process creations, the Raylet does the killing step by step. For example, in a chain like this:
.. code-block:: text

    raylet -> the worker -> user process A -> user process B -> user process C

When ``the worker`` dies, the Raylet kills ``user process A``, because it isn't on the "known" children list. When ``user process A`` dies, the Raylet kills ``user process B``, and so on.

An edge case: if ``the worker`` is still alive but ``user process A`` is dead, then ``user process B`` gets reparented and risks being killed. To mitigate this, Ray also sets ``the worker`` as a subreaper, so it can adopt the reparented processes. The core worker does not kill unknown child processes, so a user "daemon" process, for example a ``user process B`` that outlives ``user process A``, can live on. However, if ``the worker`` dies, the user daemon process gets reparented to the Raylet and gets killed.

Related PR: `Use subreaper to kill unowned subprocesses in raylet. (#42992) `_

---

Working with Jupyter Notebooks & JupyterLab
===========================================

This document describes best practices for using Ray with Jupyter Notebook / JupyterLab. We use AWS for the purpose of illustration, but the arguments should also apply to other cloud providers. Feel free to contribute if you think this document is missing anything.

Setting Up Notebook
-------------------

1. Ensure your EC2 instance has enough EBS volume if you plan to run the Notebook on it. The Deep Learning AMI, pre-installed libraries, and environment setup consume ~76% of the disk by default, prior to any Ray work. With additional applications running, the Notebook could fail frequently due to a full disk. A kernel restart loses in-progress cell outputs, which is especially painful if you rely on them to track experiment progress. Related issue: `Autoscaler should allow configuration of disk space and should use a larger default. `_.

2. Avoid unnecessary memory usage. IPython stores the output of every cell in a local Python variable indefinitely. This causes Ray to pin the objects even though your application may not actually be using them. Therefore, explicitly calling ``print`` or ``repr`` is better than letting the Notebook automatically generate the output. Another option is to disable IPython caching altogether with the following (run from bash/zsh):

.. code-block:: console

    echo 'c = get_config()
    c.InteractiveShell.cache_size = 0 # disable cache
    ' >> ~/.ipython/profile_default/ipython_config.py

This will still allow printing, but stop IPython from caching altogether.

.. tip:: While the above settings help reduce memory footprint, it's always a good practice to remove references that are no longer needed in your application to free space in the object store.

3. Understand the node's responsibility. Assuming the Notebook runs on an EC2 instance, do you plan to start a Ray runtime locally on this instance, or do you plan to use this instance as a cluster launcher? Jupyter Notebook is more suitable for the first scenario. CLIs such as ``ray exec`` and ``ray submit`` fit the second use case better.

4. Forward the ports. Assuming the Notebook runs on an EC2 instance, you should forward both the Notebook port and the Ray Dashboard port. The default ports are 8888 and 8265, respectively; they increase if the defaults are not available. You can forward them with the following (run from bash/zsh):
..
code-block:: console ssh -i /path/my-key-pair.pem -N -f -L localhost:8888:localhost:8888 my-instance-user-name@my-instance-IPv6-address ssh -i /path/my-key-pair.pem -N -f -L localhost:8265:localhost:8265 my-instance-user-name@my-instance-IPv6-address --- .. _core-walkthrough: What's Ray Core? ================= .. toctree:: :maxdepth: 1 :hidden: Key Concepts User Guides Examples api/index Internals Ray Core is a powerful distributed computing framework that provides a small set of essential primitives (tasks, actors, and objects) for building and scaling distributed applications. This walk-through introduces you to these core concepts with simple examples that demonstrate how to transform your Python functions and classes into distributed Ray tasks and actors, and how to work effectively with Ray objects. .. note:: Ray has introduced an experimental API for high-performance workloads that's especially well suited for applications using multiple GPUs. See :ref:`Ray Compiled Graph ` for more details. Getting Started --------------- To get started, install Ray using ``pip install -U ray``. For additional installation options, see :ref:`Installing Ray `. The first step is to import and initialize Ray: .. literalinclude:: doc_code/getting_started.py :language: python :start-after: __starting_ray_start__ :end-before: __starting_ray_end__ .. note:: Unless you explicitly call ``ray.init()``, the first use of a Ray remote API call will implicitly call `ray.init()` with no arguments. Running a Task -------------- Tasks are the simplest way to parallelize your Python functions across a Ray cluster. To create a task: 1. Decorate your function with ``@ray.remote`` to indicate it should run remotely 2. Call the function with ``.remote()`` instead of a normal function call 3. Use ``ray.get()`` to retrieve the result from the returned future (Ray *object reference*) Here's a simple example: .. literalinclude:: doc_code/getting_started.py :language: python :start-after: __running_task_start__ :end-before: __running_task_end__ Calling an Actor ---------------- While tasks are stateless, Ray actors allow you to create stateful workers that maintain their internal state between method calls. When you instantiate a Ray actor: 1. Ray starts a dedicated worker process somewhere in your cluster 2. The actor's methods run on that specific worker and can access and modify its state 3. The actor executes method calls serially in the order it receives them, preserving consistency Here's a simple Counter example: .. literalinclude:: doc_code/getting_started.py :language: python :start-after: __calling_actor_start__ :end-before: __calling_actor_end__ The preceding example demonstrates basic actor usage. For a more comprehensive example that combines both tasks and actors, see the :ref:`Monte Carlo Pi estimation example `. Passing Objects --------------- Ray's distributed object store efficiently manages data across your cluster. There are three main ways to work with objects in Ray: 1. **Implicit creation**: When tasks and actors return values, they are automatically stored in Ray's :ref:`distributed object store `, returning *object references* that can be later retrieved. 2. **Explicit creation**: Use ``ray.put()`` to directly place objects in the store. 3. **Passing references**: You can pass object references to other tasks and actors, avoiding unnecessary data copying and enabling lazy execution. Here's an example showing these techniques: .. 
literalinclude:: doc_code/getting_started.py :language: python :start-after: __passing_object_start__ :end-before: __passing_object_end__ Next Steps ---------- .. tip:: To monitor your application's performance and resource usage, check out the :ref:`Ray dashboard `. You can combine Ray's simple primitives in powerful ways to express virtually any distributed computation pattern. To dive deeper into Ray's :ref:`key concepts `, explore these user guides: .. grid:: 1 2 3 3 :gutter: 1 :class-container: container pb-3 .. grid-item-card:: :img-top: /images/tasks.png :class-img-top: pt-2 w-75 d-block mx-auto fixed-height-img .. button-ref:: ray-remote-functions Using remote functions (Tasks) .. grid-item-card:: :img-top: /images/actors.png :class-img-top: pt-2 w-75 d-block mx-auto fixed-height-img .. button-ref:: ray-remote-classes Using remote classes (Actors) .. grid-item-card:: :img-top: /images/objects.png :class-img-top: pt-2 w-75 d-block mx-auto fixed-height-img .. button-ref:: objects-in-ray Working with Ray Objects --- .. _dask-on-ray: Using Dask on Ray ================= `Dask `__ is a Python parallel computing library geared towards scaling analytics and scientific computing workloads. It provides `big data collections `__ that mimic the APIs of the familiar `NumPy `__ and `Pandas `__ libraries, allowing those abstractions to represent larger-than-memory data and/or allowing operations on that data to be run on a multi-machine cluster, while also providing automatic data parallelism, smart scheduling, and optimized operations. Operations on these collections create a task graph, which is executed by a scheduler. Ray provides a scheduler for Dask (`dask_on_ray`) which allows you to build data analyses using Dask's collections and execute the underlying tasks on a Ray cluster. `dask_on_ray` uses Dask's scheduler API, which allows you to specify any callable as the scheduler that you would like Dask to use to execute your workload. Using the Dask-on-Ray scheduler, the entire Dask ecosystem can be executed on top of Ray. .. note:: We always ensure that the latest Dask versions are compatible with Ray nightly. The table below shows the latest Dask versions that are tested with Ray versions. .. list-table:: Latest Dask versions for each Ray version. :header-rows: 1 * - Ray Version - Dask Version * - ``2.48.0`` or above - | ``2023.6.1 (Python version < 3.12)`` | ``2025.5.0 (Python version >= 3.12)`` * - ``2.40.0`` to ``2.47.1`` - | ``2022.10.2 (Python version < 3.12)`` | ``2024.6.0 (Python version >= 3.12)`` * - ``2.34.0`` to ``2.39.0`` - | ``2022.10.1 (Python version < 3.12)`` | ``2024.6.0 (Python version >= 3.12)`` * - ``2.8.0`` to ``2.33.x`` - ``2022.10.1`` * - ``2.5.0`` to ``2.7.x`` - | ``2022.2.0 (Python version < 3.8)`` | ``2022.10.1 (Python version >= 3.8)`` * - ``2.4.0`` - ``2022.10.1`` * - ``2.3.0`` - ``2022.10.1`` * - ``2.2.0`` - ``2022.10.1`` * - ``2.1.0`` - ``2022.2.0`` * - ``2.0.0`` - ``2022.2.0`` * - ``1.13.0`` - ``2022.2.0`` * - ``1.12.0`` - ``2022.2.0`` * - ``1.11.0`` - ``2022.1.0`` * - ``1.10.0`` - ``2021.12.0`` * - ``1.9.2`` - ``2021.11.0`` * - ``1.9.1`` - ``2021.11.0`` * - ``1.9.0`` - ``2021.11.0`` * - ``1.8.0`` - ``2021.9.1`` * - ``1.7.0`` - ``2021.9.1`` * - ``1.6.0`` - ``2021.8.1`` * - ``1.5.0`` - ``2021.7.0`` * - ``1.4.1`` - ``2021.6.1`` * - ``1.4.0`` - ``2021.5.0`` Scheduler --------- .. _dask-on-ray-scheduler: The Dask-on-Ray scheduler can execute any valid Dask graph, and can be used with any Dask `.compute() `__ call. Here's an example: .. 
literalinclude:: doc_code/dask_on_ray_scheduler_example.py :language: python .. note:: For execution on a Ray cluster, you should *not* use the `Dask.distributed `__ client; simply use plain Dask and its collections, and pass ``ray_dask_get`` to ``.compute()`` calls, set the scheduler in one of the other ways detailed `here `__, or use our ``enable_dask_on_ray`` configuration helper. Follow the instructions for :ref:`using Ray on a cluster ` to modify the ``ray.init()`` call. Why use Dask on Ray? 1. To take advantage of Ray-specific features such as the :ref:`launching cloud clusters ` and :ref:`shared-memory store `. 2. If you'd like to use Dask and Ray libraries in the same application without having two different clusters. 3. If you'd like to create data analyses using the familiar NumPy and Pandas APIs provided by Dask and execute them on a fast, fault-tolerant distributed task execution system geared towards production, like Ray. Dask-on-Ray is an ongoing project and is not expected to achieve the same performance as using Ray directly. All `Dask abstractions `__ should run seamlessly on top of Ray using this scheduler, so if you find that one of these abstractions doesn't run on Ray, please `open an issue `__. Best Practice for Large Scale workloads --------------------------------------- For Ray 1.3, the default scheduling policy is to pack tasks to the same node as much as possible. It is more desirable to spread tasks if you run a large scale / memory intensive Dask on Ray workloads. In this case, there are two recommended setups. - Reducing the config flag `scheduler_spread_threshold` to tell the scheduler to prefer spreading tasks across the cluster instead of packing. - Setting the head node's `num-cpus` to 0 so that tasks are not scheduled on a head node. .. code-block:: bash # Head node. Set `num_cpus=0` to avoid tasks being scheduled on a head node. RAY_scheduler_spread_threshold=0.0 ray start --head --num-cpus=0 # Worker node. RAY_scheduler_spread_threshold=0.0 ray start --address=[head-node-address] Out-of-Core Data Processing --------------------------- .. _dask-on-ray-out-of-core: Processing datasets larger than cluster memory is supported via Ray's :ref:`object spilling `: if the in-memory object store is full, objects will be spilled to external storage (local disk by default). This feature is available but off by default in Ray 1.2, and is on by default in Ray 1.3+. Please see your Ray version's object spilling documentation for steps to enable and/or configure object spilling. Persist ------- .. _dask-on-ray-persist: Dask-on-Ray patches `dask.persist() `__ in order to match `Dask Distributed's persist semantics `__; namely, calling `dask.persist()` with a Dask-on-Ray scheduler will submit the tasks to the Ray cluster and return Ray futures inlined in the Dask collection. This is nice if you wish to compute some base collection (such as a Dask array), followed by multiple different downstream computations (such as aggregations): those downstream computations will be faster since that base collection computation was kicked off early and referenced by all downstream computations, often via shared memory. .. literalinclude:: doc_code/dask_on_ray_persist_example.py :language: python Annotations, Resources, and Task Options ---------------------------------------- .. _dask-on-ray-annotations: Dask-on-Ray supports specifying resources or any other Ray task option via `Dask's annotation API `__. 
This annotation context manager can be used to attach resource requests (or any other Ray task option) to specific Dask operations, with the annotations funneling down to the underlying Ray tasks. Resource requests and other Ray task options can also be specified globally via the ``.compute(ray_remote_args={...})`` API, which will serve as a default for all Ray tasks launched via the Dask workload. Annotations on individual Dask operations will override this global default. .. literalinclude:: doc_code/dask_on_ray_annotate_example.py :language: python Note that you may need to disable graph optimizations since it can break annotations, see `this Dask issue `__. Custom optimization for Dask DataFrame shuffling ------------------------------------------------ .. _dask-on-ray-shuffle-optimization: Dask-on-Ray provides a Dask DataFrame optimizer that leverages Ray's ability to execute multiple-return tasks in order to speed up shuffling by as much as 4x on Ray. Simply set the `dataframe_optimize` configuration option to our optimizer function, similar to how you specify the Dask-on-Ray scheduler: .. literalinclude:: doc_code/dask_on_ray_shuffle_optimization.py :language: python Callbacks --------- .. _dask-on-ray-callbacks: Dask's `custom callback abstraction `__ is extended with Ray-specific callbacks, allowing the user to hook into the Ray task submission and execution lifecycles. With these hooks, implementing Dask-level scheduler and task introspection, such as progress reporting, diagnostics, caching, etc., is simple. Here's an example that measures and logs the execution time of each task using the ``ray_pretask`` and ``ray_posttask`` hooks: .. literalinclude:: doc_code/dask_on_ray_callbacks.py :language: python :start-after: __timer_callback_begin__ :end-before: __timer_callback_end__ The following Ray-specific callbacks are provided: 1. :code:`ray_presubmit(task, key, deps)`: Run before submitting a Ray task. If this callback returns a non-`None` value, a Ray task will _not_ be created and this value will be used as the would-be task's result value. 2. :code:`ray_postsubmit(task, key, deps, object_ref)`: Run after submitting a Ray task. 3. :code:`ray_pretask(key, object_refs)`: Run before executing a Dask task within a Ray task. This executes after the task has been submitted, within a Ray worker. The return value of this task will be passed to the ray_posttask callback, if provided. 4. :code:`ray_posttask(key, result, pre_state)`: Run after executing a Dask task within a Ray task. This executes within a Ray worker. This callback receives the return value of the ray_pretask callback, if provided. 5. :code:`ray_postsubmit_all(object_refs, dsk)`: Run after all Ray tasks have been submitted. 6. :code:`ray_finish(result)`: Run after all Ray tasks have finished executing and the final result has been returned. See the docstring for :class:`~ray.util.dask.RayDaskCallback` for further details about these callbacks, their arguments, and their return values. When creating your own callbacks, you can use :class:`RayDaskCallback ` directly, passing the callback functions as constructor arguments: .. literalinclude:: doc_code/dask_on_ray_callbacks.py :language: python :start-after: __ray_dask_callback_direct_begin__ :end-before: __ray_dask_callback_direct_end__ or you can subclass it, implementing the callback methods that you need: .. 
literalinclude:: doc_code/dask_on_ray_callbacks.py :language: python :start-after: __ray_dask_callback_subclass_begin__ :end-before: __ray_dask_callback_subclass_end__ You can also specify multiple callbacks: .. literalinclude:: doc_code/dask_on_ray_callbacks.py :language: python :start-after: __multiple_callbacks_begin__ :end-before: __multiple_callbacks_end__ Combining Dask callbacks with an actor yields simple patterns for stateful data aggregation, such as capturing task execution statistics and caching results. Here is an example that does both, caching the result of a task if its execution time exceeds some user-defined threshold: .. literalinclude:: doc_code/dask_on_ray_callbacks.py :language: python :start-after: __caching_actor_begin__ :end-before: __caching_actor_end__ .. note:: The existing Dask scheduler callbacks (``start``, ``start_state``, ``pretask``, ``posttask``, ``finish``) are also available, which can be used to introspect the Dask task to Ray task conversion process, but note that the ``pretask`` and ``posttask`` hooks are executed before and after the Ray task is *submitted*, not executed, and that ``finish`` is executed after all Ray tasks have been *submitted*, not executed. This callback API is currently unstable and subject to change. API --- .. autosummary:: :nosignatures: :toctree: doc/ ~ray.util.dask.RayDaskCallback ~ray.util.dask.callbacks.RayDaskCallback._ray_presubmit ~ray.util.dask.callbacks.RayDaskCallback._ray_postsubmit ~ray.util.dask.callbacks.RayDaskCallback._ray_pretask ~ray.util.dask.callbacks.RayDaskCallback._ray_posttask ~ray.util.dask.callbacks.RayDaskCallback._ray_postsubmit_all ~ray.util.dask.callbacks.RayDaskCallback._ray_finish --- More Ray ML Libraries ===================== .. toctree:: :hidden: joblib multiprocessing ray-collective dask-on-ray raydp mars-on-ray modin/index data_juicer_distributed_data_processing .. TODO: we added the three Ray Core examples below, since they don't really belong there. Going forward, make sure that all "Ray Lightning" and XGBoost topics are in one document or group, and not next to each other. Ray has a variety of additional integrations with ecosystem libraries. - :ref:`ray-joblib` - :ref:`ray-multiprocessing` - :ref:`ray-collective` - :ref:`dask-on-ray` - :ref:`spark-on-ray` - :ref:`mars-on-ray` - :ref:`modin-on-ray` - `daft `_ .. _air-ecosystem-map: Ecosystem Map ------------- The following map visualizes the landscape and maturity of Ray components and their integrations. Solid lines denote integrations between Ray components; dotted lines denote integrations with the broader ML ecosystem. * **Stable**: This component is stable. * **Beta**: This component is under development and APIs may be subject to change. * **Alpha**: This component is in early development. * **Community-Maintained**: These integrations are community-maintained and may vary in quality. .. image:: /images/air-ecosystem.svg --- .. _ray-joblib: Distributed Scikit-learn / Joblib ================================= .. _`issue on GitHub`: https://github.com/ray-project/ray/issues Ray supports running distributed `scikit-learn`_ programs by implementing a Ray backend for `joblib`_ using `Ray Actors `__ instead of local processes. This makes it easy to scale existing applications that use scikit-learn from a single node to a cluster. .. note:: This API is new and may be revised in future Ray releases. If you encounter any bugs, please file an `issue on GitHub`_. .. _`joblib`: https://joblib.readthedocs.io .. 
_`scikit-learn`: https://scikit-learn.org Quickstart ---------- To get started, first `install Ray `__, then use ``from ray.util.joblib import register_ray`` and run ``register_ray()``. This will register Ray as a joblib backend for scikit-learn to use. Then run your original scikit-learn code inside ``with joblib.parallel_backend('ray')``. This will start a local Ray cluster. See the `Run on a Cluster`_ section below for instructions to run on a multi-node Ray cluster instead. .. code-block:: python import numpy as np from sklearn.datasets import load_digits from sklearn.model_selection import RandomizedSearchCV from sklearn.svm import SVC digits = load_digits() param_space = { 'C': np.logspace(-6, 6, 30), 'gamma': np.logspace(-8, 8, 30), 'tol': np.logspace(-4, -1, 30), 'class_weight': [None, 'balanced'], } model = SVC(kernel='rbf') search = RandomizedSearchCV(model, param_space, cv=5, n_iter=300, verbose=10) import joblib from ray.util.joblib import register_ray register_ray() with joblib.parallel_backend('ray'): search.fit(digits.data, digits.target) You can also set the ``ray_remote_args`` argument in ``parallel_backend`` to :func:`configure the Ray Actors ` making up the Pool. This can be used to e.g., :ref:`assign resources to Actors, such as GPUs `. .. code-block:: python # Allows to use GPU-enabled estimators, such as cuML with joblib.parallel_backend('ray', ray_remote_args=dict(num_gpus=1)): search.fit(digits.data, digits.target) Run on a Cluster ---------------- This section assumes that you have a running Ray cluster. To start a Ray cluster, see the :ref:`cluster setup ` instructions. To connect scikit-learn to a running Ray cluster, you have to specify the address of the head node by setting the ``RAY_ADDRESS`` environment variable. You can also start Ray manually by calling ``ray.init()`` (with any of its supported configuration options) before calling ``with joblib.parallel_backend('ray')``. .. warning:: If you do not set the ``RAY_ADDRESS`` environment variable and do not provide ``address`` in ``ray.init(address=
)``, then scikit-learn will run on a SINGLE node!

---

.. _mars-on-ray:

Using Mars on Ray
=================

.. _`issue on GitHub`: https://github.com/mars-project/mars/issues

`Mars`_ is a tensor-based unified framework for large-scale data computation that scales NumPy, Pandas, and Scikit-learn. Mars on Ray makes it easy to scale your programs with a Ray cluster. Mars on Ray currently supports both Ray actors and Ray tasks as execution backends. With the Ray actors backend, the Mars scheduler schedules the tasks, so this mode can reuse all Mars scheduler optimizations. With the Ray tasks backend, Ray schedules all tasks, which can reuse the failover and pipelining capabilities provided by Ray futures.

.. _`Mars`: https://mars-project.readthedocs.io/en/latest/

Installation
-------------

You can simply install Mars via pip:

.. code-block:: bash

    pip install pymars>=0.8.3

Getting started
----------------

It's easy to run Mars jobs on a Ray cluster. Start a new Mars on Ray runtime locally:

.. code-block:: python

    import ray
    ray.init()

    import mars
    mars.new_ray_session()
    import mars.tensor as mt
    mt.random.RandomState(0).rand(1000_0000, 5).sum().execute()

Or connect to a Mars on Ray runtime that is already initialized:

.. code-block:: python

    import mars
    mars.new_ray_session('http://:')
    # perform computation

Interact with a Ray Dataset:

.. code-block:: python

    import mars.tensor as mt
    import mars.dataframe as md
    df = md.DataFrame(
        mt.random.rand(1000_0000, 4),
        columns=list('abcd'))
    # Convert mars dataframe to ray dataset
    import ray
    # ds = md.to_ray_dataset(df)
    ds = ray.data.from_mars(df)
    print(ds.schema(), ds.count())
    ds.filter(lambda row: row["a"] > 0.5).show(5)
    # Convert ray dataset to mars dataframe
    # df2 = md.read_ray_dataset(ds)
    df2 = ds.to_mars()
    print(df2.head(5).execute())

Refer to `Mars on Ray`_ for more information.

.. _`Mars on Ray`: https://mars-project.readthedocs.io/en/latest/installation/ray.html#mars-ray

---

.. _modin-on-ray:

Using Pandas on Ray (Modin)
===========================

Modin_, previously Pandas on Ray, is a dataframe manipulation library that allows users to speed up their pandas workloads by acting as a drop-in replacement. Modin also provides support for other APIs (e.g., spreadsheet) and libraries, like xgboost.

.. code-block:: python

    import modin.pandas as pd
    import ray

    ray.init()
    df = pd.read_parquet("s3://my-bucket/big.parquet")

You can use Modin on Ray on your laptop or on a cluster. This document shows how to set up a Modin-compatible Ray cluster and connect Modin to Ray.

.. note::
  In previous versions of Modin, you had to initialize Ray before importing Modin. As of Modin 0.9.0, this is no longer the case.

Using Modin with Ray's autoscaler
---------------------------------

In order to use Modin with :ref:`Ray's autoscaler `, you need to ensure that the correct dependencies are installed at startup. Modin's repository has an example `yaml file and set of tutorial notebooks`_ to ensure that the Ray cluster has the correct dependencies. Once the cluster is up, connect Modin by simply importing it.

.. code-block:: python

    import modin.pandas as pd
    import ray

    ray.init(address="auto")
    df = pd.read_parquet("s3://my-bucket/big.parquet")

As long as Ray is initialized before any dataframes are created, Modin will be able to connect to and use the Ray cluster.
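Once connected, your code can stay in plain pandas style. The following is a minimal sketch (it assumes a locally started Ray instance and a small synthetic DataFrame instead of the S3 parquet file above) showing that ordinary pandas operations run on Ray through Modin:

.. code-block:: python

    import ray
    import modin.pandas as pd

    ray.init()  # or ray.init(address="auto") to attach to a running cluster

    # Build a DataFrame and use the familiar pandas API; Modin partitions the
    # data and executes the operations on Ray under the hood.
    df = pd.DataFrame({
        "key": [i % 10 for i in range(1_000_000)],
        "value": list(range(1_000_000)),
    })
    print(df.groupby("key").sum().head())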
How Modin uses Ray ------------------ Modin has a layered architecture, and the core abstraction for data manipulation is the Modin Dataframe, which implements a novel algebra that enables Modin to handle all of pandas (see Modin's documentation_ for more on the architecture). Modin's internal dataframe object has a scheduling layer that is able to partition and operate on data with Ray. Dataframe operations '''''''''''''''''''' The Modin Dataframe uses Ray Tasks to perform data manipulations. Ray Tasks have a number of benefits over the actor model for data manipulation: - Multiple tasks may be manipulating the same objects simultaneously - Objects in Ray's object store are immutable, making provenance and lineage easier to track - As new workers come online the shuffling of data will happen as tasks are scheduled on the new node - Identical partitions need not be replicated, especially beneficial for operations that selectively mutate the data (e.g., ``fillna``). - Finer grained parallelism with finer grained placement control Machine Learning '''''''''''''''' Modin uses Ray Actors for the machine learning support it currently provides. Modin's implementation of XGBoost is able to spin up one actor for each node and aggregate all of the partitions on that node to the XGBoost actor. Modin is able to specify precisely the node IP for each actor on creation, giving fine-grained control over placement - a must for distributed training performance. .. _Modin: https://github.com/modin-project/modin .. _documentation: https://modin.readthedocs.io/en/latest/development/architecture.html .. _yaml file and set of tutorial notebooks: https://github.com/modin-project/modin/tree/master/examples/tutorial/jupyter/execution/pandas_on_ray/cluster --- .. _ray-multiprocessing: Distributed multiprocessing.Pool ================================ .. _`issue on GitHub`: https://github.com/ray-project/ray/issues Ray supports running distributed Python programs with the `multiprocessing.Pool API`_ using `Ray Actors `__ instead of local processes. This makes it easy to scale existing applications that use ``multiprocessing.Pool`` from a single node to a cluster. .. _`multiprocessing.Pool API`: https://docs.python.org/3/library/multiprocessing.html#module-multiprocessing.pool Quickstart ---------- To get started, first `install Ray `__, then use ``ray.util.multiprocessing.Pool`` in place of ``multiprocessing.Pool``. This will start a local Ray cluster the first time you create a ``Pool`` and distribute your tasks across it. See the `Run on a Cluster`_ section below for instructions to run on a multi-node Ray cluster instead. .. code-block:: python from ray.util.multiprocessing import Pool def f(index): return index pool = Pool() for result in pool.map(f, range(100)): print(result) The full ``multiprocessing.Pool`` API is currently supported. Please see the `multiprocessing documentation`_ for details. .. warning:: The ``context`` argument in the ``Pool`` constructor is ignored when using Ray. .. _`multiprocessing documentation`: https://docs.python.org/3/library/multiprocessing.html#module-multiprocessing.pool Run on a Cluster ---------------- This section assumes that you have a running Ray cluster. To start a Ray cluster, see the :ref:`cluster setup ` instructions. To connect a ``Pool`` to a running Ray cluster, you can specify the address of the head node in one of two ways: - By setting the ``RAY_ADDRESS`` environment variable. - By passing the ``ray_address`` keyword argument to the ``Pool`` constructor. .. 
code-block:: python from ray.util.multiprocessing import Pool # Starts a new local Ray cluster. pool = Pool() # Connects to a running Ray cluster, with the current node as the head node. # Alternatively, set the environment variable RAY_ADDRESS="auto". pool = Pool(ray_address="auto") # Connects to a running Ray cluster, with a remote node as the head node. # Alternatively, set the environment variable RAY_ADDRESS=":". pool = Pool(ray_address=":") You can also start Ray manually by calling ``ray.init()`` (with any of its supported configuration options) before creating a ``Pool``. --- .. This part of the docs is generated from the ray.util.collective readme using m2r To update: - run `m2r RAY_ROOT/python/ray/util/collective/README.md` - copy the contents of README.rst here - Be sure not to delete the API reference section in the bottom of this file. .. _ray-collective: Ray Collective Communication Lib ================================ The Ray collective communication library (\ ``ray.util.collective``\ ) offers a set of native collective primitives for communication between distributed CPUs or GPUs. Ray collective communication library * enables 10x more efficient out-of-band collective communication between Ray actor and task processes, * operates on both distributed CPUs and GPUs, * uses NCCL and GLOO as the optional high-performance communication backends, * is suitable for distributed ML programs on Ray. Collective Primitives Support Matrix ------------------------------------ See below the current support matrix for all collective calls with different backends. .. list-table:: :header-rows: 1 * - Backend - `torch.distributed.gloo `_ - - `nccl `_ - * - Device - CPU - GPU - CPU - GPU * - send - ✔ - ✘ - ✘ - ✔ * - recv - ✔ - ✘ - ✘ - ✔ * - broadcast - ✔ - ✘ - ✘ - ✔ * - allreduce - ✔ - ✘ - ✘ - ✔ * - reduce - ✔ - ✘ - ✘ - ✔ * - allgather - ✔ - ✘ - ✘ - ✔ * - gather - ✘ - ✘ - ✘ - ✘ * - scatter - ✘ - ✘ - ✘ - ✘ * - reduce_scatter - ✔ - ✘ - ✘ - ✔ * - all-to-all - ✘ - ✘ - ✘ - ✘ * - barrier - ✔ - ✘ - ✘ - ✔ Supported Tensor Types ---------------------- * ``torch.Tensor`` * ``numpy.ndarray`` * ``cupy.ndarray`` Usage ----- Installation and Importing ^^^^^^^^^^^^^^^^^^^^^^^^^^ Ray collective library is bundled with the released Ray wheel. Besides Ray, users need to install either `torch `_ or `cupy `_ in order to use collective communication with the GLOO (torch.distributed.gloo) and NCCL backend, respectively. .. code-block:: python pip install torch pip install cupy-cudaxxx # replace xxx with the right cuda version in your environment To use these APIs, import the collective package in your actor/task or driver code via: .. code-block:: python import ray.util.collective as col Initialization ^^^^^^^^^^^^^^ Collective functions operate on collective groups. A collective group contains a number of processes (in Ray, they are usually Ray-managed actors or tasks) that will together enter the collective function calls. Before making collective calls, users need to declare a set of actors/tasks, statically, as a collective group. Below is an example code snippet that uses the two APIs ``init_collective_group()`` and ``create_collective_group()`` to initialize collective groups among a few remote actors. Refer to `APIs <#api-reference>`_ for the detailed descriptions of the two APIs. .. 
code-block:: python import ray import ray.util.collective as collective import cupy as cp @ray.remote(num_gpus=1) class Worker: def __init__(self): self.send = cp.ones((4, ), dtype=cp.float32) self.recv = cp.zeros((4, ), dtype=cp.float32) def setup(self, world_size, rank): collective.init_collective_group(world_size, rank, "nccl", "default") return True def compute(self): collective.allreduce(self.send, "default") return self.send def destroy(self): collective.destroy_group() # imperative num_workers = 2 workers = [] init_rets = [] for i in range(num_workers): w = Worker.remote() workers.append(w) init_rets.append(w.setup.remote(num_workers, i)) _ = ray.get(init_rets) results = ray.get([w.compute.remote() for w in workers]) # declarative for i in range(num_workers): w = Worker.remote() workers.append(w) _options = { "group_name": "177", "world_size": 2, "ranks": [0, 1], "backend": "nccl" } collective.create_collective_group(workers, **_options) results = ray.get([w.compute.remote() for w in workers]) Note that for the same set of actors/task processes, multiple collective groups can be constructed, with ``group_name`` as their unique identifier. This enables specifying complex communication patterns between different (sub)set of processes. Collective Communication ^^^^^^^^^^^^^^^^^^^^^^^^ Check `the support matrix <#collective-primitives-support-matrix>`_ for the current status of supported collective calls and backends. Note that the current set of collective communication APIs are imperative, and exhibit the following behaviours: * All the collective APIs are synchronous blocking calls * Since each API only specifies a part of the collective communication, the API is expected to be called by each participating process of the (pre-declared) collective group. Once all the processes have made the call and rendezvous with each other, the collective communication happens and proceeds. * The APIs are imperative and the communication happens out-of-band --- they need to be used inside the collective process (actor/task) code. An example of using ``ray.util.collective.allreduce`` is below: .. code-block:: python import ray import cupy import ray.util.collective as col @ray.remote(num_gpus=1) class Worker: def __init__(self): self.buffer = cupy.ones((10,), dtype=cupy.float32) def compute(self): col.allreduce(self.buffer, "default") return self.buffer # Create two actors A and B and create a collective group following the previous example... A = Worker.remote() B = Worker.remote() # Invoke allreduce remotely ray.get([A.compute.remote(), B.compute.remote()]) Point-to-point Communication ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ``ray.util.collective`` also supports P2P send/recv communication between processes. The send/recv exhibits the same behavior with the collective functions: they are synchronous blocking calls -- a pair of send and recv must be called together on paired processes in order to specify the entire communication, and must successfully rendezvous with each other to proceed. See the code example below: .. 
code-block:: python import ray import cupy import ray.util.collective as col @ray.remote(num_gpus=1) class Worker: def __init__(self): self.buffer = cupy.ones((10,), dtype=cupy.float32) def get_buffer(self): return self.buffer def do_send(self, target_rank=0): # this call is blocking col.send(target_rank) def do_recv(self, src_rank=0): # this call is blocking col.recv(src_rank) def do_allreduce(self): # this call is blocking as well col.allreduce(self.buffer) return self.buffer # Create two actors A = Worker.remote() B = Worker.remote() # Put A and B in a collective group col.create_collective_group([A, B], options={rank=[0, 1], ...}) # let A to send a message to B; a send/recv has to be specified once at each worker ray.get([A.do_send.remote(target_rank=1), B.do_recv.remote(src_rank=0)]) # An anti-pattern: the following code will hang, because it doesn't instantiate the recv side call ray.get([A.do_send.remote(target_rank=1)]) Single-GPU and Multi-GPU Collective Primitives ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ In many cluster setups, a machine usually has more than 1 GPU; effectively leveraging the GPU-GPU bandwidth, such as `NVLINK `_\ , can significantly improve communication performance. ``ray.util.collective`` supports multi-GPU collective calls, in which case, a process (actor/tasks) manages more than 1 GPU (e.g., via ``ray.remote(num_gpus=4)``\ ). Using these multi-GPU collective functions are normally more performance-advantageous than using single-GPU collective API and spawning the number of processes equal to the number of GPUs. See the API references for the signatures of multi-GPU collective APIs. Also of note that all multi-GPU APIs are with the following restrictions: * Only NCCL backend is supported. * Collective processes that make multi-GPU collective or P2P calls need to own the same number of GPU devices. * The input to multi-GPU collective functions are normally a list of tensors, each located on a different GPU device owned by the caller process. An example code utilizing the multi-GPU collective APIs is provided below: .. code-block:: python import ray import ray.util.collective as collective import cupy as cp from cupy.cuda import Device @ray.remote(num_gpus=2) class Worker: def __init__(self): with Device(0): self.send1 = cp.ones((4, ), dtype=cp.float32) with Device(1): self.send2 = cp.ones((4, ), dtype=cp.float32) * 2 with Device(0): self.recv1 = cp.ones((4, ), dtype=cp.float32) with Device(1): self.recv2 = cp.ones((4, ), dtype=cp.float32) * 2 def setup(self, world_size, rank): self.rank = rank collective.init_collective_group(world_size, rank, "nccl", "177") return True def allreduce_call(self): collective.allreduce_multigpu([self.send1, self.send2], "177") return [self.send1, self.send2] def p2p_call(self): if self.rank == 0: collective.send_multigpu(self.send1 * 2, 1, 1, "8") else: collective.recv_multigpu(self.recv2, 0, 0, "8") return self.recv2 # Note that the world size is 2 but there are 4 GPUs. num_workers = 2 workers = [] init_rets = [] for i in range(num_workers): w = Worker.remote() workers.append(w) init_rets.append(w.setup.remote(num_workers, i)) a = ray.get(init_rets) results = ray.get([w.allreduce_call.remote() for w in workers]) results = ray.get([w.p2p_call.remote() for w in workers]) More Resources -------------- The following links provide helpful resources on how to efficiently leverage the ``ray.util.collective`` library. * `More running examples `_ under ``ray.util.collective.examples``. 
* `Scaling up the spaCy Named Entity Recognition (NER) pipeline `_ using Ray collective library. * `Implementing the AllReduce strategy `_ for data-parallel distributed ML training. API References -------------- .. automodule:: ray.util.collective.collective :members: --- .. _spark-on-ray: ************************** Using Spark on Ray (RayDP) ************************** RayDP combines your Spark and Ray clusters, making it easy to do large scale data processing using the PySpark API and seamlessly use that data to train your models using TensorFlow and PyTorch. For more information and examples, see the RayDP GitHub page: https://github.com/oap-project/raydp ================ Installing RayDP ================ RayDP can be installed from PyPI and supports PySpark 3.0 and 3.1. .. code-block:: bash pip install raydp .. note:: RayDP requires ray >= 1.2.0 .. note:: In order to run Spark, the head and worker nodes will need Java installed. ======================== Creating a Spark Session ======================== To create a Spark session, call ``raydp.init_spark`` For example, .. code-block:: python import ray import raydp ray.init() spark = raydp.init_spark( app_name = "example", num_executors = 10, executor_cores = 64, executor_memory = "256GB" ) ==================================== Deep Learning with a Spark DataFrame ==================================== ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Training a Spark DataFrame with TensorFlow ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ``raydp.tf.TFEstimator`` provides an API for training with TensorFlow. .. code-block:: python from pyspark.sql.functions import col df = spark.range(1, 1000) # calculate z = x + 2y + 1000 df = df.withColumn("x", col("id")*2)\ .withColumn("y", col("id") + 200)\ .withColumn("z", col("x") + 2*col("y") + 1000) from raydp.utils import random_split train_df, test_df = random_split(df, [0.7, 0.3]) # TensorFlow code from tensorflow import keras input_1 = keras.Input(shape=(1,)) input_2 = keras.Input(shape=(1,)) concatenated = keras.layers.concatenate([input_1, input_2]) output = keras.layers.Dense(1, activation='sigmoid')(concatenated) model = keras.Model(inputs=[input_1, input_2], outputs=output) optimizer = keras.optimizers.Adam(0.01) loss = keras.losses.MeanSquaredError() from raydp.tf import TFEstimator estimator = TFEstimator( num_workers=2, model=model, optimizer=optimizer, loss=loss, metrics=["accuracy", "mse"], feature_columns=["x", "y"], label_column="z", batch_size=1000, num_epochs=2, use_gpu=False, config={"fit_config": {"steps_per_epoch": 2}}) estimator.fit_on_spark(train_df, test_df) tensorflow_model = estimator.get_model() estimator.shutdown() ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Training a Spark DataFrame with PyTorch ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Similarly, ``raydp.torch.TorchEstimator`` provides an API for training with PyTorch. .. 
code-block:: python from pyspark.sql.functions import col df = spark.range(1, 1000) # calculate z = x + 2y + 1000 df = df.withColumn("x", col("id")*2)\ .withColumn("y", col("id") + 200)\ .withColumn("z", col("x") + 2*col("y") + 1000) from raydp.utils import random_split train_df, test_df = random_split(df, [0.7, 0.3]) # PyTorch Code import torch class LinearModel(torch.nn.Module): def __init__(self): super(LinearModel, self).__init__() self.linear = torch.nn.Linear(2, 1) def forward(self, x, y): x = torch.cat([x, y], dim=1) return self.linear(x) model = LinearModel() optimizer = torch.optim.Adam(model.parameters()) loss_fn = torch.nn.MSELoss() def lr_scheduler_creator(optimizer, config): return torch.optim.lr_scheduler.MultiStepLR( optimizer, milestones=[150, 250, 350], gamma=0.1) # You can use the RayDP Estimator API or libraries like Ray Train for distributed training. from raydp.torch import TorchEstimator estimator = TorchEstimator( num_workers = 2, model = model, optimizer = optimizer, loss = loss_fn, lr_scheduler_creator=lr_scheduler_creator, feature_columns = ["x", "y"], label_column = ["z"], batch_size = 1000, num_epochs = 2 ) estimator.fit_on_spark(train_df, test_df) pytorch_model = estimator.get_model() estimator.shutdown() --- .. _observability-getting-started: Ray Dashboard ============= Ray provides a web-based dashboard for monitoring and debugging Ray applications. The visual representation of the system state, allows users to track the performance of applications and troubleshoot issues. .. raw:: html
Set up Dashboard ------------------ To access the dashboard, use `ray[default]` or :ref:`other installation commands ` that include the Ray Dashboard component. For example: .. code-block:: bash pip install -U "ray[default]" When you start a single-node Ray Cluster on your laptop, access the dashboard with the URL that Ray prints when it initializes (the default URL is **http://localhost:8265**) or with the context object returned by `ray.init`. .. testcode:: :hide: import ray ray.shutdown() .. testcode:: import ray context = ray.init() print(context.dashboard_url) .. This test output is flaky. If Ray isn't completely shutdown, the port can be "8266" instead of "8265". .. testoutput:: :options: +MOCK 127.0.0.1:8265 .. code-block:: text INFO worker.py:1487 -- Connected to Ray cluster. View the dashboard at 127.0.0.1:8265. .. note:: If you start Ray in a docker container, ``--dashboard-host`` is a required parameter. For example, ``ray start --head --dashboard-host=0.0.0.0``. When you start a remote Ray Cluster with the :ref:`VM Cluster Launcher `, :ref:`KubeRay operator `, or manual configuration, Ray Dashboard launches on the head node but the dashboard port may not be publicly exposed. View :ref:`configuring the dashboard ` for how to view Dashboard from outside the Head Node. .. note:: When using the Ray Dashboard, it is highly recommended to also set up Prometheus and Grafana. They are necessary for critical features such as :ref:`Metrics View `. See :ref:`Configuring and Managing the Dashboard ` for how to integrate Prometheus and Grafana with Ray Dashboard. Navigate the views ------------------ The Dashboard has multiple tabs called views. Depending on your goal, you may use one or a combination of views: - Analyze, monitor, or visualize status and resource utilization metrics for logical or physical components: :ref:`Metrics view `, :ref:`Cluster view ` - Monitor Job and Task progress and status: :ref:`Jobs view ` - Locate logs and error messages for failed Tasks and Actors: :ref:`Jobs view `, :ref:`Logs view ` - Analyze CPU and memory usage of Tasks and Actors: :ref:`Metrics view `, :ref:`Cluster view ` - Monitor a Serve application: :ref:`Serve view ` .. _dash-jobs-view: Jobs view --------- .. raw:: html
The Jobs view lets you monitor the different Jobs that ran on your Ray Cluster. A :ref:`Ray Job ` is a Ray workload that uses Ray APIs (e.g., ``ray.init``). It is recommended to submit Jobs to the Cluster via the :ref:`Ray Job API `. You can also interactively run Ray Jobs (e.g., by executing a Python script within a Head Node).

The Jobs view displays a list of active, finished, and failed Jobs, and clicking on an ID allows users to view detailed information about that Job. For more information on Ray Jobs, see the :ref:`Ray Job Overview section `.

Job Profiling
~~~~~~~~~~~~~

You can profile Ray Jobs by clicking on the “Stack Trace” or “CPU Flame Graph” actions. See :ref:`Profiling ` for more details.

.. _dash-workflow-job-progress:

Task and Actor breakdown
~~~~~~~~~~~~~~~~~~~~~~~~

.. image:: https://raw.githubusercontent.com/ray-project/Images/master/docs/new-dashboard-v2/dashboard-pics/advanced-progress.png
   :align: center

The Jobs view breaks down Tasks and Actors by their states. Tasks and Actors are grouped and nested by default. You can see the nested entries by clicking the expand button. Tasks and Actors are grouped and nested using the following criteria:

- All Tasks and Actors are grouped together. View individual entries by expanding the corresponding row.
- Tasks are grouped by their ``name`` attribute (e.g., ``task.options(name="").remote()``).
- Child Tasks (nested Tasks) are nested under their parent Task's row.
- Actors are grouped by their class name.
- Child Actors (Actors created within an Actor) are nested under their parent Actor's row.
- Actor Tasks (remote methods within an Actor) are nested under the Actor for the corresponding Actor method.

.. note::
  The Job detail page can only display or retrieve up to 10K Tasks per Job. For Jobs with more than 10K Tasks, the Tasks that exceed the 10K limit are not accounted for. The number of unaccounted Tasks is available from the Task breakdown.

.. _dashboard-timeline:

Task Timeline
~~~~~~~~~~~~~

First, download the Chrome tracing file by clicking the download button. Alternatively, you can :ref:`use the CLI or SDK to export the tracing file `.

Second, open a tool like ``chrome://tracing`` or the `Perfetto UI `_ and drop the downloaded Chrome tracing file into it. This guide uses Perfetto because it's the recommended way to visualize Chrome tracing files.

In the timeline visualization of Ray Tasks and Actors, there are Node rows (hardware) and Worker rows (processes). Each Worker row displays a list of Task events (e.g., Task scheduled, Task running, input/output deserialization) happening on that Worker over time.

Ray Status
~~~~~~~~~~

The Jobs view displays the status of the Ray Cluster. This information is the output of the ``ray status`` CLI command.

The left panel shows the autoscaling status, including pending, active, and failed nodes. The right panel displays the resource demands, which are resources that currently cannot be scheduled on the Cluster. This page is useful for debugging resource deadlocks or slow scheduling.

.. note::
  The output shows the aggregated information across the Cluster (not by Job). If you run more than one Job, some of the demands may come from other Jobs.

.. _dash-workflow-state-apis:

Task, Actor, and Placement Group tables
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The Dashboard displays a table of the status of the Job's Tasks, Actors, and Placement Groups. This information is the output of the :ref:`Ray State APIs `. You can expand the table to see a list of each Task, Actor, and Placement Group.

..
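The same information is also available programmatically through the Ray State APIs. A small sketch, assuming ``ray[default]`` is installed and a cluster with some workload is running:

.. code-block:: python

    from ray.util.state import list_actors, list_tasks, summarize_tasks

    # Programmatic counterparts of the Dashboard's Task and Actor tables.
    print(list_actors(limit=5))
    print(list_tasks(limit=5))
    print(summarize_tasks())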
_dash-serve-view: Serve view ---------- .. raw:: html
See your general Serve configurations, a list of the Serve applications, and, if you configured :ref:`Grafana and Prometheus `, high-level metrics of your Serve applications. Click the name of a Serve application to go to the Serve Application Detail page. Serve Application Detail page ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ See the Serve application's configurations and metadata and the list of :ref:`Serve deployments and replicas `. Click the expand button of a deployment to see the replicas. Each deployment has two available actions. You can view the Deployment config and, if you configured :ref:`Grafana and Prometheus `, you can open a Grafana dashboard with detailed metrics about that deployment. For each replica, there are two available actions. You can see the logs of that replica and, if you configured :ref:`Grafana and Prometheus `, you can open a Grafana dashboard with detailed metrics about that replica. Click on the replica name to go to the Serve Replica Detail page. Serve Replica Detail page ~~~~~~~~~~~~~~~~~~~~~~~~~ This page shows metadata about the Serve replica, high-level metrics about the replica if you configured :ref:`Grafana and Prometheus `, and a history of completed :ref:`Tasks ` of that replica. Serve metrics ~~~~~~~~~~~~~ Ray Serve exports various time-series metrics to help you understand the status of your Serve application over time. Find more details about these metrics :ref:`here `. To store and visualize these metrics, set up Prometheus and Grafana by following the instructions :ref:`here `. These metrics are available in the Ray Dashboard in the Serve page and the Serve Replica Detail page. They are also accessible as Grafana dashboards. Within the Grafana dashboard, use the dropdown filters on the top to filter metrics by route, deployment, or replica. Exact descriptions of each graph are available by hovering over the "info" icon on the top left of each graph. .. _dash-node-view: Cluster view ------------ .. raw:: html
The Cluster view is a visualization of the hierarchical relationship of machines (nodes) and Workers (processes). Each host machine consists of many Workers, which you can see by clicking the + button. You can also see the assignment of GPU resources to specific Actors or Tasks. Click the node ID to see the node detail page.

In addition, the Cluster view lets you see **logs** for a node or a Worker.

.. _dash-actors-view:

Actors view
-----------

Use the Actors view to see the logs for an Actor and which Job created the Actor.

.. raw:: html
Ray stores the information for up to 100,000 dead Actors. Override this value with the `RAY_maximum_gcs_destroyed_actor_cached_count` environment variable when starting Ray.

Actor profiling
~~~~~~~~~~~~~~~

Run the profiler on a running Actor. See :ref:`Dashboard Profiling ` for more details.

Actor Detail page
~~~~~~~~~~~~~~~~~

Click the ID to see the detail view of the Actor. On the Actor Detail page, see the metadata, state, and all of the Actor's Tasks that have run.

.. _dash-metrics-view:

Metrics view
------------

.. raw:: html
Ray exports default metrics which are available from the :ref:`Metrics view `. Here are some available example metrics. - Tasks, Actors, and Placement Groups broken down by states - :ref:`Logical resource usage ` across nodes - Hardware resource usage across nodes - Autoscaler status See :ref:`System Metrics Page ` for available metrics. .. note:: The Metrics view requires the Prometheus and Grafana setup. See :ref:`Configuring and managing the Dashboard ` to learn how to set up Prometheus and Grafana. The Metrics view provides visualizations of the time series metrics emitted by Ray. You can select the time range of the metrics in the top right corner. The graphs refresh automatically every 15 seconds. There is also a convenient button to open the Grafana UI from the dashboard. The Grafana UI provides additional customizability of the charts. .. _dash-workflow-cpu-memory-analysis: Analyze the CPU and memory usage of Tasks and Actors ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The :ref:`Metrics view ` in the Dashboard provides a "per-component CPU/memory usage graph" that displays CPU and memory usage over time for each Task and Actor in the application (as well as system components). You can identify Tasks and Actors that may be consuming more resources than expected and optimize the performance of the application. .. image:: https://raw.githubusercontent.com/ray-project/Images/master/docs/new-dashboard-v2/dashboard-pics/node_cpu_by_comp.png :align: center Per component CPU graph. 0.379 cores mean that it uses 40% of a single CPU core. Ray process names start with ``ray::``. ``raylet``, ``agent``, ``dashboard``, or ``gcs`` are system components. .. image:: https://raw.githubusercontent.com/ray-project/Images/master/docs/new-dashboard-v2/dashboard-pics/node_memory_by_comp.png :align: center Per component memory graph. Ray process names start with ``ray::``. ``raylet``, ``agent``, ``dashboard``, or ``gcs`` are system components. .. image:: https://raw.githubusercontent.com/ray-project/Images/master/docs/new-dashboard-v2/dashboard-pics/cluster_page.png :align: center Additionally, users can see a snapshot of hardware utilization from the :ref:`Cluster view `, which provides an overview of resource usage across the entire Ray Cluster. .. _dash-workflow-resource-utilization: View the resource utilization ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Ray requires users to specify the number of :ref:`resources ` their Tasks and Actors use through arguments such as ``num_cpus``, ``num_gpus``, ``memory``, and ``resource``. These values are used for scheduling, but may not always match the actual resource utilization (physical resource utilization). - See the logical and physical resource utilization over time from the :ref:`Metrics view `. - The snapshot of physical resource utilization (CPU, GPU, memory, disk, network) is also available from the :ref:`Cluster view `. .. image:: https://raw.githubusercontent.com/ray-project/Images/master/docs/new-dashboard-v2/dashboard-pics/logical_resource.png :align: center The :ref:`logical resources ` usage. .. image:: https://raw.githubusercontent.com/ray-project/Images/master/docs/new-dashboard-v2/dashboard-pics/physical_resource.png :align: center The physical resources (hardware) usage. Ray provides CPU, GPU, Memory, GRAM, disk, and network usage for each machine in a Cluster. .. _dash-logs-view: Logs view --------- .. raw:: html
The Logs view lists the Ray logs in your Cluster. It is organized by node and log file name. Many log links in the other pages link to this view and filter the list so the relevant logs appear. To understand the logging structure of Ray, see :ref:`logging directory and file structure `. The Logs view provides search functionality to help you find specific log messages. **Driver logs** If the Ray Job is submitted by the :ref:`Job API `, the Job logs are available from the Dashboard. The log file follows the following format: ``job-driver-.log``. .. note:: If you execute the Driver directly on the Head Node of the Ray Cluster (without using the Job API) or run with :ref:`Ray Client `, the Driver logs are not accessible from the Dashboard. In this case, see the terminal or Jupyter Notebook output to view the Driver logs. **Task and Actor Logs (Worker logs)** Task and Actor logs are accessible from the :ref:`Task and Actor table view `. Click the "Log" button. You can see the ``stdout`` and ``stderr`` logs that contain the output emitted from Tasks and Actors. For Actors, you can also see the system logs for the corresponding Worker process. .. note:: Logs of asynchronous Actor Tasks or threaded Actor Tasks (concurrency>1) are only available as part of the Actor logs. Follow the instruction in the Dashboard to view the Actor logs. **Task and Actor errors** You can easily identify failed Tasks or Actors by looking at the Job progress bar. The Task and Actor tables display the name of the failed Tasks or Actors, respectively. They also provide access to their corresponding log or error messages. .. _dash-overview: Overview view ------------- .. image:: ./images/dashboard-overview.png :align: center The Overview view provides a high-level status of the Ray Cluster. **Overview metrics** The Overview Metrics page provides the Cluster-level hardware utilization and autoscaling status (number of pending, active, and failed nodes). **Recent Jobs** The Recent Jobs pane provides a list of recently submitted Ray Jobs. **Serve applications** The Serve Applications pane provides a list of recently deployed Serve applications .. _dash-event: **Events view** .. image:: https://raw.githubusercontent.com/ray-project/Images/master/docs/new-dashboard-v2/dashboard-pics/event-page.png :align: center The Events view displays a list of events associated with a specific type (e.g., Autoscaler or Job) in chronological order. The same information is accessible with the ``ray list cluster-events`` :ref:`(Ray state APIs)` CLI commands. Two types of events are available: - Job: Events related to :ref:`Ray Jobs API `. - Autoscaler: Events related to the :ref:`Ray autoscaler `. Resources --------- - `Ray Summit observability talk `_ - `Ray metrics blog `_ - `Ray Dashboard roadmap `_ - `Observability Training Module `_ --- .. _observability-key-concepts: Key Concepts ============ This section covers key concepts for monitoring and debugging tools and features in Ray. Dashboard (Web UI) ------------------ Ray provides a web-based dashboard to help users monitor and debug Ray applications and Clusters. See :ref:`Getting Started ` for more details about the Dashboard. Ray States ---------- Ray States refer to the state of various Ray entities (e.g., Actor, Task, Object, etc.). Ray 2.0 and later versions support :ref:`querying the states of entities with the CLI and Python APIs ` The following command lists all the Actors from the Cluster: .. code-block:: bash ray list actors .. 
code-block:: text ======== List: 2022-07-23 21:29:39.323925 ======== Stats: ------------------------------ Total: 2 Table: ------------------------------ ACTOR_ID CLASS_NAME NAME PID STATE 0 31405554844820381c2f0f8501000000 Actor 96956 ALIVE 1 f36758a9f8871a9ca993b1d201000000 Actor 96955 ALIVE View :ref:`Monitoring with the CLI or SDK ` for more details. Metrics ------- Ray collects and exposes the physical stats (e.g., CPU, memory, GRAM, disk, and network usage of each node), internal stats (e.g., number of Actors in the cluster, number of Worker failures in the Cluster), and custom application metrics (e.g., metrics defined by users). All stats can be exported as time series data (to Prometheus by default) and used to monitor the Cluster over time. View :ref:`Metrics View ` for where to view the metrics in Ray Dashboard. View :ref:`collecting metrics ` for how to collect metrics from Ray Clusters. Exceptions ---------- Creating a new Task or submitting an Actor Task generates an object reference. When ``ray.get`` is called on the Object Reference, the API raises an exception if anything goes wrong with a related Task, Actor or Object. For example, - :class:`RayTaskError ` is raised when an error from user code throws an exception. - :class:`RayActorError ` is raised when an Actor is dead (by a system failure, such as a node failure, or a user-level failure, such as an exception from ``__init__`` method). - :class:`RuntimeEnvSetupError ` is raised when the Actor or Task can't be started because :ref:`a runtime environment ` failed to be created. See :ref:`Exceptions Reference ` for more details. Debugger -------- Ray has a built-in debugger for debugging your distributed applications. Set breakpoints in Ray Tasks and Actors, and when hitting the breakpoint, drop into a PDB session to: - Inspect variables in that context - Step within a Task or Actor - Move up or down the stack View :ref:`Ray Debugger ` for more details. .. _profiling-concept: Profiling --------- Profiling is a way of analyzing the performance of an application by sampling the resource usage of it. Ray supports various profiling tools: - CPU profiling for Driver and Worker processes, including integration with :ref:`py-spy ` and :ref:`cProfile ` - Memory profiling for Driver and Worker processes with :ref:`memray ` - GPU profiling with :ref:`Pytorch Profiler ` and :ref:`Nsight System ` - Built in Task and Actor profiling tool called :ref:`Ray Timeline ` View :ref:`Profiling ` for more details. Note that this list isn't comprehensive and feel free to contribute to it if you find other useful tools. Tracing ------- To help debug and monitor Ray applications, Ray supports distributed tracing (integration with OpenTelemetry) across Tasks and Actors. See :ref:`Ray Tracing ` for more details. Application logs ---------------- Logs are important for general monitoring and debugging. For distributed Ray applications, logs are even more important but more complicated at the same time. A Ray application runs both on Driver and Worker processes (or even across multiple machines) and the logs of these processes are the main sources of application logs. .. image:: ./images/application-logging.png :alt: Application logging Driver logs ~~~~~~~~~~~ An entry point of Ray applications that calls ``ray.init()`` is called a **Driver**. All the Driver logs are handled in the same way as normal Python programs. .. 
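As a concrete illustration of that point, the standard library ``logging`` module configured in the Driver works exactly as it does in any other Python script (a minimal sketch, assuming a local Ray instance):

.. code-block:: python

    import logging

    import ray

    # Plain Python logging in the Driver; nothing Ray-specific is required.
    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger(__name__)

    ray.init()
    logger.info("Driver started; cluster resources: %s", ray.cluster_resources())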
_ray-worker-logs: Worker logs (stdout and stderr) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Ray executes Tasks or Actors remotely within Ray's Worker processes. Task and Actor logs are captured in the Worker stdout and stderr. Ray has special support to improve the visibility of stdout and stderr produced by Worker processes so that the Ray program appears like a non-distributed program, also known as "Worker log redirection to driver". - Ray directs stdout and stderr from all Tasks and Actors to the Worker log files, including any log messages generated by the Worker. See :ref:`logging directory and file structure ` to understand the Ray logging structure. - The Driver reads the Worker log files (where the stdout and stderr of all Tasks and Actors sit) and sends the log records to its own stdout and stderr (also known as "Worker logs being redirected to Driver output"). For the following code: .. testcode:: import ray # Initiate a driver. ray.init() @ray.remote def task_foo(): print("task!") ray.get(task_foo.remote()) .. testoutput:: :options: +MOCK (task_foo pid=12854) task! #. Ray Task ``task_foo`` runs on a Ray Worker process. String ``task!`` is saved into the corresponding Worker ``stdout`` log file. #. The Driver reads the Worker log file and sends it to its ``stdout`` (terminal) where you should be able to see the string ``task!``. When logs are printed, the process id (pid) and an IP address of the node that executes Tasks or Actors are printed together. Here is the output: .. code-block:: bash (pid=45601) task! Actor log messages look like the following by default: .. code-block:: bash (MyActor pid=480956) actor log message By default, all stdout and stderr of Tasks and Actors are redirected to the Driver output. View :ref:`Configuring Logging ` for how to disable this feature. Job logs ~~~~~~~~ Ray applications are usually run as Ray Jobs. Worker logs of Ray Jobs are always captured in the :ref:`Ray logging directory ` while Driver logs are not. Driver logs are captured only for Ray Jobs submitted via :ref:`Jobs API `. Find the captured Driver logs with the Dashboard UI, CLI (using the ``ray job logs`` :ref:`CLI command `), or the :ref:`Python SDK ` (``JobSubmissionClient.get_logs()`` or ``JobSubmissionClient.tail_job_logs()``). .. note:: View the Driver logs in your terminal or Jupyter Notebooks if you run Ray Jobs by executing the Ray Driver on the Head node directly or connecting via Ray Client. --- .. _ray-distributed-debugger: Ray Distributed Debugger ======================== The Ray Distributed Debugger includes a debugger backend and a `VS Code extension `_ frontend that streamline the debugging process with an interactive debugging experience. The Ray Debugger enables you to: - **Break into remote tasks**: Set a breakpoint in any remote task. A breakpoint pauses execution and allows you to connect with VS Code for debugging. - **Post-mortem debugging**: When Ray tasks fail with unhandled exceptions, Ray automatically freezes the failing task and waits for the Ray Debugger to attach, allowing you to inspect the state of the program at the time of the error. Ray Distributed Debugger abstracts the complexities of debugging distributed systems for you to debug Ray applications more efficiently, saving time and effort in the development workflow. .. note:: The Ray Distributed Debugger frontend is only available in VS Code and other VS Code-compatible IDEs like Cursor. If you need support for other IDEs, file a feature request on `GitHub `_. .. raw:: html
Set up the environment ~~~~~~~~~~~~~~~~~~~~~~ Create a new virtual environment and install dependencies. .. testcode:: :skipif: True conda create -n myenv python=3.10 conda activate myenv pip install "ray[default]" debugpy Start a Ray cluster ~~~~~~~~~~~~~~~~~~~ .. tab-set:: .. tab-item:: Local Run `ray start --head` to start a local Ray cluster. .. tab-item:: KubeRay (SSH) Follow the instructions in :doc:`the RayCluster quickstart <../cluster/kubernetes/getting-started/raycluster-quick-start>` to set up a cluster. You need to connect VS Code to the cluster. For example, add the following to the `ray-head` container and make sure `sshd` is running in the `ray-head` container. .. code-block:: yaml ports: - containerPort: 22 name: ssd .. note:: How to run `sshd` in the `ray-head` container depends on your setup. For example you can use `supervisord`. A simple way to run `sshd` interactively for testing is by logging into the head node pod and running: .. code-block:: bash sudo apt-get update && sudo apt-get install -y openssh-server sudo mkdir -p /run/sshd sudo /usr/sbin/sshd -D You can then connect to the cluster via SSH by running: .. code-block:: bash kubectl port-forward service/raycluster-sample-head-svc 2222:22 After checking that `ssh -p 2222 ray@localhost` works, set up VS Code as described in the `VS Code SSH documentation `_. .. tab-item:: KubeRay (Code Server, Community Maintained) Follow the instructions in :doc:`the RayCluster quickstart <../cluster/kubernetes/getting-started/raycluster-quick-start>` to set up a cluster. A simpler approach is to run a browser-based VS Code (Code Server) as a sidecar container in the Ray head pod. This eliminates network connectivity issues by placing VS Code inside the Kubernetes cluster. Add a sidecar container to the Ray head pod and configure a shared volume. Modify your Ray head pod template with the following additions: .. code-block:: yaml # In your RayCluster YAML, under spec.headGroupSpec.template.spec containers: - name: ray-head # ... your existing ray-head configuration ... # Add this volumeMount: volumeMounts: - mountPath: /tmp/ray name: shared-ray-volume # Add this sidecar container: - name: vscode-debugger image: docker.io/onesizefitsquorum/code-server-with-ray-distributed-debugger:4.101.2 ports: - containerPort: 8443 volumeMounts: - mountPath: /tmp/ray name: shared-ray-volume env: # Specifies the default directory that opens when VSCode Web starts, pointing to the workspace containing the Ray runtime resources. - name: DEFAULT_WORKSPACE value: "/tmp/ray/session_latest/runtime_resources" # Add this volume at the same level as `containers`: volumes: - name: shared-ray-volume emptyDir: {} After the Ray cluster is running, forward the Code Server port: .. code-block:: bash kubectl port-forward pod/ 8443:8443 Access VS Code in your browser at http://127.0.0.1:8443 and use the Ray Distributed Debugger extension to connect to http://127.0.0.1:8265. For more details, see the `Code Server with Ray Distributed Debugger `_ project. Register the cluster ~~~~~~~~~~~~~~~~~~~~ Find and click the Ray extension in the VS Code left side nav. Add the Ray cluster `IP:PORT` to the cluster list. The default `IP:PORT` is `127.0.0.1:8265`. You can change it when you start the cluster. Make sure your current machine can access the IP and port. .. image:: ./images/register-cluster.gif :align: center Create a Ray task ~~~~~~~~~~~~~~~~~ Create a file `job.py` with the following snippet. Add `breakpoint()` in the Ray task. 
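For orientation, a sketch of what such a script could look like is shown here; the exact snippet shipped with the docs is included below, so treat this version (with its hypothetical ``my_task`` function) as illustrative only:

.. code-block:: python

    import ray

    ray.init()

    @ray.remote
    def my_task(x):
        breakpoint()  # execution pauses here until you attach the VS Code debugger
        return x * 2

    print(ray.get(my_task.remote(21)))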
If you want to use the post-mortem debugging below, also add the `RAY_DEBUG_POST_MORTEM=1` environment variable. .. literalinclude:: ./doc_code/ray-distributed-debugger.py :language: python Run your Ray app ~~~~~~~~~~~~~~~~ Start running your Ray app. .. code-block:: bash python job.py Attach to the paused task ~~~~~~~~~~~~~~~~~~~~~~~~~ When the debugger hits a breakpoint: - The task enters a paused state. - The terminal clearly indicates when the debugger pauses a task and waits for the debugger to attach. - The paused task is listed in the Ray Debugger extension. - Click the play icon next to the name of the paused task to attach the VS Code debugger. .. image:: ./images/attach-paused-task.gif :align: center Start and stop debugging ~~~~~~~~~~~~~~~~~~~~~~~~ Debug your Ray app as you would when developing locally. After you're done debugging this particular breakpoint, click the **Disconnect** button in the debugging toolbar so you can join another task in the **Paused Tasks** list. .. figure:: ./images/debugger-disconnect.gif Post-mortem debugging ===================== Use post-mortem debugging when Ray tasks encounter unhandled exceptions. In such cases, Ray automatically freezes the failing task, awaiting attachment by the Ray Debugger. This feature allows you to thoroughly investigate and inspect the program's state at the time of the error. Run a Ray task raised exception ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Run the same `job.py` file with an additional argument to raise an exception. .. code-block:: bash python job.py raise-exception Attach to the paused task ~~~~~~~~~~~~~~~~~~~~~~~~~ When the app throws an exception: - The debugger freezes the task. - The terminal clearly indicates when the debugger pauses a task and waits for the debugger to attach. - The paused task is listed in the Ray Debugger extension. - Click the play icon next to the name of the paused task to attach the debugger and start debugging. .. image:: ./images/post-mortem.gif :align: center Start debugging ~~~~~~~~~~~~~~~ Debug your Ray app as you would when developing locally. Share feedback ============== Join the `#ray-debugger `_ channel on the Ray Slack channel to get help. Next steps ========== - For guidance on debugging distributed apps in Ray, see :doc:`General debugging <./user-guides/debug-apps/general-debugging>`. - For tips on using the Ray debugger, see :doc:`Ray debugging <./user-guides/debug-apps/ray-debugging>`. --- .. _state-api-ref: State API ========= .. note:: APIs are :ref:`alpha `. This feature requires a full installation of Ray using ``pip install "ray[default]"``. For an overview with examples see :ref:`Monitoring Ray States `. For the CLI reference see :ref:`Ray State CLI Reference ` or :ref:`Ray Log CLI Reference `. State Python SDK ----------------- State APIs are also exported as functions. Summary APIs ~~~~~~~~~~~~ .. autosummary:: :nosignatures: :toctree: doc/ ray.util.state.summarize_actors ray.util.state.summarize_objects ray.util.state.summarize_tasks List APIs ~~~~~~~~~~ .. autosummary:: :nosignatures: :toctree: doc/ ray.util.state.list_actors ray.util.state.list_placement_groups ray.util.state.list_nodes ray.util.state.list_jobs ray.util.state.list_workers ray.util.state.list_tasks ray.util.state.list_objects ray.util.state.list_runtime_envs Get APIs ~~~~~~~~~ .. 
autosummary:: :nosignatures: :toctree: doc/ ray.util.state.get_actor ray.util.state.get_placement_group ray.util.state.get_node ray.util.state.get_worker ray.util.state.get_task ray.util.state.get_objects Log APIs ~~~~~~~~ .. autosummary:: :nosignatures: :toctree: doc/ ray.util.state.list_logs ray.util.state.get_log .. _state-api-schema: State APIs Schema ----------------- .. autosummary:: :nosignatures: :toctree: doc/ :template: autosummary/class_without_autosummary.rst ray.util.state.common.ActorState ray.util.state.common.TaskState ray.util.state.common.NodeState ray.util.state.common.PlacementGroupState ray.util.state.common.WorkerState ray.util.state.common.ObjectState ray.util.state.common.RuntimeEnvState ray.util.state.common.JobState ray.util.state.common.StateSummary ray.util.state.common.TaskSummaries ray.util.state.common.TaskSummaryPerFuncOrClassName ray.util.state.common.ActorSummaries ray.util.state.common.ActorSummaryPerClass ray.util.state.common.ObjectSummaries ray.util.state.common.ObjectSummaryPerKey State APIs Exceptions --------------------- .. autosummary:: :nosignatures: :toctree: doc/ ray.util.state.exception.RayStateApiException --- .. _state-api-cli-ref: State CLI ========= State ----- This section contains commands to access the :ref:`live state of Ray resources (actor, task, object, etc.) `. .. note:: APIs are :ref:`alpha `. This feature requires a full installation of Ray using ``pip install "ray[default]"``. This feature also requires the dashboard component to be available. The dashboard component needs to be included when starting the ray cluster, which is the default behavior for ``ray start`` and ``ray.init()``. For more in-depth debugging, you could check the dashboard log at ``/dashboard.log``, which is usually ``/tmp/ray/session_latest/logs/dashboard.log``. State CLI allows users to access the state of various resources (e.g., actor, task, object). .. click:: ray.util.state.state_cli:task_summary :prog: ray summary tasks .. click:: ray.util.state.state_cli:actor_summary :prog: ray summary actors .. click:: ray.util.state.state_cli:object_summary :prog: ray summary objects .. click:: ray.util.state.state_cli:ray_list :prog: ray list .. click:: ray.util.state.state_cli:ray_get :prog: ray get .. _ray-logs-api-cli-ref: Log --- This section contains commands to :ref:`access logs ` from Ray clusters. .. note:: APIs are :ref:`alpha `. This feature requires a full installation of Ray using ``pip install "ray[default]"``. Log CLI allows users to access the log from the cluster. Note that only the logs from alive nodes are available through this API. .. click:: ray.util.state.state_cli:logs_state_cli_group :prog: ray logs --- .. _system-metrics: System Metrics -------------- Ray exports a number of system metrics, which provide introspection into the state of Ray workloads, as well as hardware utilization statistics. The following table describes the officially supported metrics: .. note:: Certain labels are common across all metrics, such as `SessionName` (uniquely identifies a Ray cluster instance), `instance` (per-node label applied by Prometheus), and `JobId` (Ray job ID, as applicable). Starting with Ray 2.53+, the `WorkerId` label is no longer exported by default due to its high cardinality. The Ray team doesn't expect this to be a breaking change, as none of Ray’s built-in components rely on this label. However, if you have custom tooling that depends on `WorkerId` label, take note of this change. 
You can restore or adjust label behavior using the environment variable `RAY_metric_cardinality_level`: - `legacy`: Preserve all labels. (This was the default behavior before Ray 2.53.) - `recommended`: Drop high-cardinality labels. Ray internally determines specific labels; currently this includes only `WorkerId`. (This is the default behavior since Ray 2.53.) - `low`: Same as `recommended`, but also drops the Name label for tasks and actors. .. list-table:: Ray System Metrics :header-rows: 1 * - Prometheus Metric - Labels - Description * - `ray_tasks` - `Name`, `State`, `IsRetry` - Current number of tasks (both remote functions and actor calls) by state. The State label (e.g., RUNNING, FINISHED, FAILED) describes the state of the task. See `rpc::TaskState `_ for more information. The function/method name is available as the Name label. If the task was retried due to failure or reconstruction, the IsRetry label will be set to "1", otherwise "0". * - `ray_actors` - `Name`, `State` - Current number of actors in each state described in `rpc::ActorTableData::ActorState `. ALIVE has two sub-states: ALIVE_IDLE, and ALIVE_RUNNING_TASKS. An actor is considered ALIVE_IDLE if it is not running any tasks. * - `ray_resources` - `Name`, `State`, `instance` - Logical resource usage for each node of the cluster. Each resource has some quantity that's either in the USED or AVAILABLE state. The Name label defines the resource name (e.g., CPU, GPU). * - `ray_object_store_memory` - `Location`, `ObjectState`, `instance` - Object store memory usage in bytes, broken down by logical Location (SPILLED, MMAP_DISK, MMAP_SHM, and WORKER_HEAP). Definitions are as follows. SPILLED--Objects that have spilled to disk or a remote Storage solution (for example, AWS S3). The default is the disk. MMAP_DISK--Objects stored on a memory-mapped page on disk. This mode very slow and only happens under severe memory pressure. MMAP_SHM--Objects store on a memory-mapped page in Shared Memory. This mode is the default, in the absence of memory pressure. WORKER_HEAP--Objects, usually smaller, stored in the memory of the Ray Worker process itself. Small objects are stored in the worker heap. * - `ray_placement_groups` - `State` - Current number of placement groups by state. The State label (e.g., PENDING, CREATED, REMOVED) describes the state of the placement group. See `rpc::PlacementGroupTable `_ for more information. * - `ray_memory_manager_worker_eviction_total` - `Type`, `Name` - The number of tasks and actors killed by the Ray Out of Memory killer (https://docs.ray.io/en/master/ray-core/scheduling/ray-oom-prevention.html) broken down by types (whether it is tasks or actors) and names (name of tasks and actors). * - `ray_node_cpu_utilization` - `instance` - The CPU utilization per node as a percentage quantity (0..100). This should be scaled by the number of cores per node to convert the units into cores. * - `ray_node_cpu_count` - `instance` - The number of CPU cores per node. * - `ray_node_gpus_utilization` - `instance`, `GpuDeviceName`, `GpuIndex` - The GPU utilization per GPU as a percentage quantity (0..NGPU*100). `GpuDeviceName` is a name of a GPU device (e.g., NVIDIA A10G) and `GpuIndex` is the index of the GPU. * - `ray_node_disk_usage` - `instance` - The amount of disk space used per node, in bytes. * - `ray_node_disk_free` - `instance` - The amount of disk space available per node, in bytes. * - `ray_node_disk_write_iops` - `instance`, `node_type` - The disk write operations per second per node. 
* - `ray_node_disk_io_write_speed` - `instance` - The disk write throughput per node, in bytes per second. * - `ray_node_disk_read_iops` - `instance`, `node_type` - The disk read operations per second per node. * - `ray_node_disk_io_read_speed` - `instance` - The disk read throughput per node, in bytes per second. * - `ray_node_mem_available` - `instance`, `node_type` - The amount of physical memory available per node, in bytes. * - `ray_node_mem_shared_bytes` - `instance`, `node_type` - The amount of shared memory per node, in bytes. * - `ray_node_mem_used` - `instance` - The amount of physical memory used per node, in bytes. * - `ray_node_mem_total` - `instance` - The amount of physical memory available per node, in bytes. * - `ray_component_uss_mb` - `Component`, `instance` - The measured unique set size in megabytes, broken down by logical Ray component. Ray components consist of system components (e.g., raylet, gcs, dashboard, or agent) and the method names of running tasks/actors. * - `ray_component_cpu_percentage` - `Component`, `instance` - The measured CPU percentage, broken down by logical Ray component. Ray components consist of system components (e.g., raylet, gcs, dashboard, or agent) and the method names of running tasks/actors. * - `ray_node_gram_available` - `instance`, `node_type`, `GpuIndex`, `GpuDeviceName` - The amount of GPU memory available per GPU, in megabytes. * - `ray_node_gram_used` - `instance`, `GpuDeviceName`, `GpuIndex` - The amount of GPU memory used per GPU, in bytes. * - `ray_node_network_received` - `instance`, `node_type` - The total network traffic received per node, in bytes. * - `ray_node_network_sent` - `instance`, `node_type` - The total network traffic sent per node, in bytes. * - `ray_node_network_receive_speed` - `instance` - The network receive throughput per node, in bytes per second. * - `ray_node_network_send_speed` - `instance` - The network send throughput per node, in bytes per second. * - `ray_cluster_active_nodes` - `node_type` - The number of healthy nodes in the cluster, broken down by autoscaler node type. * - `ray_cluster_failed_nodes` - `node_type` - The number of failed nodes reported by the autoscaler, broken down by node type. * - `ray_cluster_pending_nodes` - `node_type` - The number of pending nodes reported by the autoscaler, broken down by node type. Metrics Semantics and Consistency ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Ray guarantees all its internal state metrics are *eventually* consistent even in the presence of failures--- should any worker fail, eventually the right state will be reflected in the Prometheus time-series output. However, any particular metrics query is not guaranteed to reflect an exact snapshot of the cluster state. For the `ray_tasks` and `ray_actors` metrics, you should use sum queries to plot their outputs (e.g., ``sum(ray_tasks) by (Name, State)``). The reason for this is that Ray's task metrics are emitted from multiple distributed components. Hence, there are multiple metric points, including negative metric points, emitted from different processes that must be summed to produce the correct logical view of the distributed system. For example, for a single task submitted and executed, Ray may emit ``(submitter) SUBMITTED_TO_WORKER: 1, (executor) SUBMITTED_TO_WORKER: -1, (executor) RUNNING: 1``, which reduces to ``SUBMITTED_TO_WORKER: 0, RUNNING: 1`` after summation. --- .. 
_application-level-metrics: Adding Application-Level Metrics -------------------------------- Ray provides a convenient API in :ref:`ray.util.metrics ` for defining and exporting custom metrics for visibility into your applications. Three metrics are supported: Counter, Gauge, and Histogram. These metrics correspond to the same `Prometheus metric types `_. Below is a simple example of an Actor that exports metrics using these APIs: .. literalinclude:: ../doc_code/metrics_example.py :language: python While the script is running, the metrics are exported to ``localhost:8080`` (this is the endpoint that Prometheus would be configured to scrape). Open this in the browser. You should see the following output: .. code-block:: none # HELP ray_request_latency Latencies of requests in ms. # TYPE ray_request_latency histogram ray_request_latency_bucket{Component="core_worker",Version="3.0.0.dev0",actor_name="my_actor",le="0.1"} 2.0 ray_request_latency_bucket{Component="core_worker",Version="3.0.0.dev0",actor_name="my_actor",le="1.0"} 2.0 ray_request_latency_bucket{Component="core_worker",Version="3.0.0.dev0",actor_name="my_actor",le="+Inf"} 2.0 ray_request_latency_count{Component="core_worker",Version="3.0.0.dev0",actor_name="my_actor"} 2.0 ray_request_latency_sum{Component="core_worker",Version="3.0.0.dev0",actor_name="my_actor"} 0.11992454528808594 # HELP ray_curr_count Current count held by the actor. Goes up and down. # TYPE ray_curr_count gauge ray_curr_count{Component="core_worker",Version="3.0.0.dev0",actor_name="my_actor"} -15.0 # HELP ray_num_requests_total Number of requests processed by the actor. # TYPE ray_num_requests_total counter ray_num_requests_total{Component="core_worker",Version="3.0.0.dev0",actor_name="my_actor"} 2.0 Please see :ref:`ray.util.metrics ` for more details. --- .. _observability-programmatic: Monitoring with the CLI or SDK =============================== Monitoring and debugging capabilities in Ray are available through a CLI or SDK. CLI command ``ray status`` ---------------------------- You can monitor node status and resource usage by running the CLI command, ``ray status``, on the head node. It displays - **Node Status**: Nodes that are running and autoscaling up or down. Addresses of running nodes. Information about pending nodes and failed nodes. - **Resource Usage**: The Ray resource usage of the cluster. For example, requested CPUs from all Ray Tasks and Actors. Number of GPUs that are used. Following is an example output: .. code-block:: shell $ ray status ======== Autoscaler status: 2021-10-12 13:10:21.035674 ======== Node status --------------------------------------------------------------- Healthy: 1 ray.head.default 2 ray.worker.cpu Pending: (no pending nodes) Recent failures: (no failures) Resources --------------------------------------------------------------- Usage: 0.0/10.0 CPU 0.00/70.437 GiB memory 0.00/10.306 GiB object_store_memory Demands: (no resource demands) When you need more verbose info about each node, run ``ray status -v``. This is helpful when you need to investigate why particular nodes don't autoscale down. .. _state-api-overview-ref: Ray State CLI and SDK ---------------------------- .. tip:: Provide feedback on using Ray state APIs - `feedback form `_! Use Ray State APIs to access the current state (snapshot) of Ray through the CLI or Python SDK (developer APIs). .. note:: This feature requires a full installation of Ray using ``pip install "ray[default]"``. This feature also requires that the dashboard component is available. 
The dashboard component needs to be included when starting the Ray Cluster, which is the default behavior for ``ray start`` and ``ray.init()``. .. note:: State API CLI commands are :ref:`stable `, while Python SDKs are :ref:`DeveloperAPI `. CLI usage is recommended over Python SDKs. Get started ~~~~~~~~~~~ This example uses the following script that runs two Tasks and creates two Actors. .. testcode:: :hide: import ray ray.shutdown() .. testcode:: import ray import time ray.init(num_cpus=4) @ray.remote def task_running_300_seconds(): time.sleep(300) @ray.remote class Actor: def __init__(self): pass # Create 2 tasks tasks = [task_running_300_seconds.remote() for _ in range(2)] # Create 2 actors actors = [Actor.remote() for _ in range(2)] .. testcode:: :hide: # Wait for the tasks to be submitted. time.sleep(2) See the summarized states of tasks. If it doesn't return the output immediately, retry the command. .. tab-set:: .. tab-item:: CLI (Recommended) :sync: CLI (Recommended) .. code-block:: bash ray summary tasks .. code-block:: text ======== Tasks Summary: 2022-07-22 08:54:38.332537 ======== Stats: ------------------------------------ total_actor_scheduled: 2 total_actor_tasks: 0 total_tasks: 2 Table (group by func_name): ------------------------------------ FUNC_OR_CLASS_NAME STATE_COUNTS TYPE 0 task_running_300_seconds RUNNING: 2 NORMAL_TASK 1 Actor.__init__ FINISHED: 2 ACTOR_CREATION_TASK .. tab-item:: Python SDK (Internal Developer API) :sync: Python SDK (Internal Developer API) .. testcode:: from ray.util.state import summarize_tasks print(summarize_tasks()) .. testoutput:: {'cluster': {'summary': {'task_running_300_seconds': {'func_or_class_name': 'task_running_300_seconds', 'type': 'NORMAL_TASK', 'state_counts': {'RUNNING': 2}}, 'Actor.__init__': {'func_or_class_name': 'Actor.__init__', 'type': 'ACTOR_CREATION_TASK', 'state_counts': {'FINISHED': 2}}}, 'total_tasks': 2, 'total_actor_tasks': 0, 'total_actor_scheduled': 2, 'summary_by': 'func_name'}} List all Actors. .. tab-set:: .. tab-item:: CLI (Recommended) :sync: CLI (Recommended) .. code-block:: bash ray list actors .. code-block:: text ======== List: 2022-07-23 21:29:39.323925 ======== Stats: ------------------------------ Total: 2 Table: ------------------------------ ACTOR_ID CLASS_NAME NAME PID STATE 0 31405554844820381c2f0f8501000000 Actor 96956 ALIVE 1 f36758a9f8871a9ca993b1d201000000 Actor 96955 ALIVE .. tab-item:: Python SDK (Internal Developer API) :sync: Python SDK (Internal Developer API) .. testcode:: from ray.util.state import list_actors print(list_actors()) .. testoutput:: [ActorState(actor_id='...', class_name='Actor', state='ALIVE', job_id='01000000', name='', node_id='...', pid=..., ray_namespace='...', serialized_runtime_env=None, required_resources=None, death_cause=None, is_detached=None, placement_group_id=None, repr_name=None), ActorState(actor_id='...', class_name='Actor', state='ALIVE', job_id='01000000', name='', node_id='...', pid=..., ray_namespace='...', serialized_runtime_env=None, required_resources=None, death_cause=None, is_detached=None, placement_group_id=None, repr_name=None)] Get the state of a single Task using the get API. .. tab-set:: .. tab-item:: CLI (Recommended) :sync: CLI (Recommended) .. code-block:: bash # In this case, 31405554844820381c2f0f8501000000 ray get actors .. code-block:: text --- actor_id: 31405554844820381c2f0f8501000000 class_name: Actor death_cause: null is_detached: false name: '' pid: 96956 resource_mapping: [] serialized_runtime_env: '{}' state: ALIVE .. 
tab-item:: Python SDK (Internal Developer API) :sync: Python SDK (Internal Developer API) .. testcode:: :skipif: True from ray.util.state import get_actor # In this case, 31405554844820381c2f0f8501000000 print(get_actor(id=)) Access logs through the ``ray logs`` API. .. tab-set:: .. tab-item:: CLI (Recommended) :sync: CLI (Recommended) .. code-block:: bash ray list actors # In this case, ACTOR_ID is 31405554844820381c2f0f8501000000 ray logs actor --id .. code-block:: text --- Log has been truncated to last 1000 lines. Use `--tail` flag to toggle. --- :actor_name:Actor Actor created .. tab-item:: Python SDK (Internal Developer API) :sync: Python SDK (Internal Developer API) .. testcode:: :skipif: True from ray.util.state import get_log # In this case, ACTOR_ID is 31405554844820381c2f0f8501000000 for line in get_log(actor_id=): print(line) Key Concepts ~~~~~~~~~~~~~ Ray State APIs allow you to access **states** of **resources** through **summary**, **list**, and **get** APIs. It also supports **logs** API to access logs. - **states**: The state of the cluster of corresponding resources. States consist of immutable metadata (e.g., Actor's name) and mutable states (e.g., Actor's scheduling state or pid). - **resources**: Resources created by Ray. E.g., actors, tasks, objects, placement groups, and etc. - **summary**: API to return the summarized view of resources. - **list**: API to return every individual entity of resources. - **get**: API to return a single entity of resources in detail. - **logs**: API to access the log of Actors, Tasks, Workers, or system log files. User guides ~~~~~~~~~~~~~ Getting a summary of states of entities by type ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Return the summarized information of the given Ray entity (Objects, Actors, Tasks). It is recommended to start monitoring states through summary APIs first. When you find anomalies (e.g., Actors running for a long time, Tasks that are not scheduled for a long time), you can use ``list`` or ``get`` APIs to get more details for an individual abnormal entity. **Summarize all actors** .. tab-set:: .. tab-item:: CLI (Recommended) :sync: CLI (Recommended) .. code-block:: bash ray summary actors .. tab-item:: Python SDK (Internal Developer API) :sync: Python SDK (Internal Developer API) .. testcode:: from ray.util.state import summarize_actors print(summarize_actors()) .. testoutput:: {'cluster': {'summary': {'Actor': {'class_name': 'Actor', 'state_counts': {'ALIVE': 2}}}, 'total_actors': 2, 'summary_by': 'class'}} **Summarize all tasks** .. tab-set:: .. tab-item:: CLI (Recommended) :sync: CLI (Recommended) .. code-block:: bash ray summary tasks .. tab-item:: Python SDK (Internal Developer API) :sync: Python SDK (Internal Developer API) .. testcode:: from ray.util.state import summarize_tasks print(summarize_tasks()) .. testoutput:: {'cluster': {'summary': {'task_running_300_seconds': {'func_or_class_name': 'task_running_300_seconds', 'type': 'NORMAL_TASK', 'state_counts': {'RUNNING': 2}}, 'Actor.__init__': {'func_or_class_name': 'Actor.__init__', 'type': 'ACTOR_CREATION_TASK', 'state_counts': {'FINISHED': 2}}}, 'total_tasks': 2, 'total_actor_tasks': 0, 'total_actor_scheduled': 2, 'summary_by': 'func_name'}} **Summarize all objects** .. note:: By default, objects are summarized by callsite. However, callsite is not recorded by Ray by default. To get callsite info, set env variable `RAY_record_ref_creation_sites=1` when starting the Ray cluster: .. 
code-block:: bash RAY_record_ref_creation_sites=1 ray start --head .. tab-set:: .. tab-item:: CLI (Recommended) :sync: CLI (Recommended) .. code-block:: bash ray summary objects .. tab-item:: Python SDK (Internal Developer API) :sync: Python SDK (Internal Developer API) .. testcode:: from ray.util.state import summarize_objects print(summarize_objects()) .. testoutput:: {'cluster': {'summary': {'disabled': {'total_objects': 6, 'total_size_mb': 0.0, 'total_num_workers': 3, 'total_num_nodes': 1, 'task_state_counts': {'SUBMITTED_TO_WORKER': 2, 'FINISHED': 2, 'NIL': 2}, 'ref_type_counts': {'LOCAL_REFERENCE': 2, 'ACTOR_HANDLE': 4}}}, 'total_objects': 6, 'total_size_mb': 0.0, 'callsite_enabled': False, 'summary_by': 'callsite'}} See :ref:`state CLI reference ` for more details about ``ray summary`` command. List the states of all entities of certain type ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Get a list of resources. Possible resources include: - :ref:`Actors `, e.g., Actor ID, State, PID, death_cause (:class:`output schema `) - :ref:`Tasks `, e.g., name, scheduling state, type, runtime env info (:class:`output schema `) - :ref:`Objects `, e.g., object ID, callsites, reference types (:class:`output schema `) - :ref:`Jobs `, e.g., start/end time, entrypoint, status (:class:`output schema `) - :ref:`Placement Groups `, e.g., name, bundles, stats (:class:`output schema `) - Nodes (Ray worker nodes), e.g., node ID, node IP, node state (:class:`output schema `) - Workers (Ray worker processes), e.g., worker ID, type, exit type and details (:class:`output schema `) - :ref:`Runtime environments `, e.g., runtime envs, creation time, nodes (:class:`output schema `) **List all nodes** .. tab-set:: .. tab-item:: CLI (Recommended) :sync: CLI (Recommended) .. code-block:: bash ray list nodes .. tab-item:: Python SDK (Internal Developer API) :sync: Python SDK (Internal Developer API) .. testcode:: from ray.util.state import list_nodes list_nodes() **List all placement groups** .. tab-set:: .. tab-item:: CLI (Recommended) :sync: CLI (Recommended) .. code-block:: bash ray list placement-groups .. tab-item:: Python SDK (Internal Developer API) :sync: Python SDK (Internal Developer API) .. testcode:: from ray.util.state import list_placement_groups list_placement_groups() **List local referenced objects created by a process** .. tip:: You can list resources with one or multiple filters: using `--filter` or `-f` .. tab-set:: .. tab-item:: CLI (Recommended) :sync: CLI (Recommended) .. code-block:: bash ray list objects -f pid= -f reference_type=LOCAL_REFERENCE .. tab-item:: Python SDK (Internal Developer API) :sync: Python SDK (Internal Developer API) .. testcode:: from ray.util.state import list_objects list_objects(filters=[("pid", "=", 1234), ("reference_type", "=", "LOCAL_REFERENCE")]) **List alive actors** .. tab-set:: .. tab-item:: CLI (Recommended) :sync: CLI (Recommended) .. code-block:: bash ray list actors -f state=ALIVE .. tab-item:: Python SDK (Internal Developer API) :sync: Python SDK (Internal Developer API) .. testcode:: from ray.util.state import list_actors list_actors(filters=[("state", "=", "ALIVE")]) **List running tasks** .. tab-set:: .. tab-item:: CLI (Recommended) :sync: CLI (Recommended) .. code-block:: bash ray list tasks -f state=RUNNING .. tab-item:: Python SDK (Internal Developer API) :sync: Python SDK (Internal Developer API) .. testcode:: from ray.util.state import list_tasks list_tasks(filters=[("state", "=", "RUNNING")]) **List non-running tasks** .. tab-set:: .. 
tab-item:: CLI (Recommended) :sync: CLI (Recommended) .. code-block:: bash ray list tasks -f state!=RUNNING .. tab-item:: Python SDK (Internal Developer API) :sync: Python SDK (Internal Developer API) .. testcode:: from ray.util.state import list_tasks list_tasks(filters=[("state", "!=", "RUNNING")]) **List running tasks that have a name func** .. tab-set:: .. tab-item:: CLI (Recommended) :sync: CLI (Recommended) .. code-block:: bash ray list tasks -f state=RUNNING -f name="task_running_300_seconds()" .. tab-item:: Python SDK (Internal Developer API) :sync: Python SDK (Internal Developer API) .. testcode:: from ray.util.state import list_tasks list_tasks(filters=[("state", "=", "RUNNING"), ("name", "=", "task_running_300_seconds()")]) **List tasks with more details** .. tip:: When ``--detail`` is specified, the API can query more data sources to obtain state information in details. .. tab-set:: .. tab-item:: CLI (Recommended) :sync: CLI (Recommended) .. code-block:: bash ray list tasks --detail .. tab-item:: Python SDK (Internal Developer API) :sync: Python SDK (Internal Developer API) .. testcode:: from ray.util.state import list_tasks list_tasks(detail=True) See :ref:`state CLI reference ` for more details about ``ray list`` command. Get the states of a particular entity (task, actor, etc.) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ **Get a task's states** .. tab-set:: .. tab-item:: CLI (Recommended) :sync: CLI (Recommended) .. code-block:: bash ray get tasks .. tab-item:: Python SDK (Internal Developer API) :sync: Python SDK (Internal Developer API) .. testcode:: :skipif: True from ray.util.state import get_task get_task(id=) **Get a node's states** .. tab-set:: .. tab-item:: CLI (Recommended) :sync: CLI (Recommended) .. code-block:: bash ray get nodes .. tab-item:: Python SDK (Internal Developer API) :sync: Python SDK (Internal Developer API) .. testcode:: :skipif: True from ray.util.state import get_node get_node(id=) See :ref:`state CLI reference ` for more details about ``ray get`` command. Fetch the logs of a particular entity (task, actor, etc.) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. _state-api-log-doc: State API also allows you to access Ray logs. Note that you cannot access the logs from a dead node. By default, the API prints logs from a head node. **Get all retrievable log file names from a head node in a cluster** .. tab-set:: .. tab-item:: CLI (Recommended) :sync: CLI (Recommended) .. code-block:: bash ray logs cluster .. tab-item:: Python SDK (Internal Developer API) :sync: Python SDK (Internal Developer API) .. testcode:: :skipif: True # You could get the node ID / node IP from `ray list nodes` from ray.util.state import list_logs # `ray logs` by default print logs from a head node. # To list the same logs, you should provide the head node ID. # Get the node ID / node IP from `ray list nodes` list_logs(node_id=) **Get a particular log file from a node** .. tab-set:: .. tab-item:: CLI (Recommended) :sync: CLI (Recommended) .. code-block:: bash # Get the node ID / node IP from `ray list nodes` ray logs cluster gcs_server.out --node-id # `ray logs cluster` is alias to `ray logs` when querying with globs. ray logs gcs_server.out --node-id .. tab-item:: Python SDK (Internal Developer API) :sync: Python SDK (Internal Developer API) .. 
testcode:: :skipif: True from ray.util.state import get_log # Node IP can be retrieved from list_nodes() or ray.nodes() for line in get_log(filename="gcs_server.out", node_id=): print(line) **Stream a log file from a node** .. tab-set:: .. tab-item:: CLI (Recommended) :sync: CLI (Recommended) .. code-block:: bash # Get the node ID / node IP from `ray list nodes` ray logs raylet.out --node-ip --follow # Or, ray logs cluster raylet.out --node-ip --follow .. tab-item:: Python SDK (Internal Developer API) :sync: Python SDK (Internal Developer API) .. testcode:: :skipif: True from ray.util.state import get_log # Retrieve the Node IP from list_nodes() or ray.nodes() # The loop blocks with `follow=True` for line in get_log(filename="raylet.out", node_ip=, follow=True): print(line) **Stream log from an actor with actor id** .. tab-set:: .. tab-item:: CLI (Recommended) :sync: CLI (Recommended) .. code-block:: bash ray logs actor --id= --follow .. tab-item:: Python SDK (Internal Developer API) :sync: Python SDK (Internal Developer API) .. testcode:: :skipif: True from ray.util.state import get_log # Get the Actor's ID from the output of `ray list actors`. # The loop blocks with `follow=True` for line in get_log(actor_id=, follow=True): print(line) **Stream log from a pid** .. tab-set:: .. tab-item:: CLI (Recommended) :sync: CLI (Recommended) .. code-block:: bash ray logs worker --pid= --follow .. tab-item:: Python SDK (Internal Developer API) :sync: Python SDK (Internal Developer API) .. testcode:: :skipif: True from ray.util.state import get_log # Retrieve the node IP from list_nodes() or ray.nodes() # get the PID of the worker running the Actor easily when output # of worker is directed to the driver (default) # The loop blocks with `follow=True` for line in get_log(pid=, node_ip=, follow=True): print(line) See :ref:`state CLI reference` for more details about ``ray logs`` command. Failure Semantics ^^^^^^^^^^^^^^^^^^^^^^^^^ The State APIs don't guarantee to return a consistent or complete snapshot of the cluster all the time. By default, all Python SDKs raise an exception when output is missing from the API. The CLI returns a partial result and provides warning messages. Here are cases where there can be missing output from the API. **Query Failures** State APIs query "data sources" (e.g., GCS, raylets, etc.) to obtain and build the snapshot of the Cluster. However, data sources are sometimes unavailable (e.g., the source is down or overloaded). In this case, APIs return a partial (incomplete) snapshot of the Cluster, and users are informed that the output is incomplete through a warning message. All warnings are printed through Python's ``warnings`` library, and they can be suppressed. **Data Truncation** When the returned number of entities (number of rows) is too large (> 100K), state APIs truncate the output data to ensure system stability (when this happens, there's no way to choose truncated data). When truncation happens it is informed through Python's ``warnings`` module. **Garbage Collected Resources** Depending on the lifecycle of the resources, some "finished" resources are not accessible through the APIs because they are already garbage collected. .. note:: Do not to rely on this API to obtain correct information on finished resources. For example, Ray periodically garbage collects DEAD state Actor data to reduce memory usage. Or it cleans up the FINISHED state of Tasks when its lineage goes out of scope. 
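To make these failure semantics concrete, the following is a minimal sketch of defensive calling code. It assumes a running cluster and relies only on the warning and exception behavior described above; the exact failure modes you observe may differ.

.. code-block:: python

    import warnings

    from ray.util.state import list_tasks
    from ray.util.state.exception import RayStateApiException

    try:
        with warnings.catch_warnings():
            # Partial or truncated output is reported through Python's `warnings`
            # module; ignore those warnings here if a partial view is acceptable.
            warnings.simplefilter("ignore")
            tasks = list_tasks(limit=1000)
        print(f"Fetched {len(tasks)} tasks (possibly an incomplete snapshot).")
    except RayStateApiException as exc:
        # Raised when the SDK cannot build the snapshot, for example because a
        # data source such as the GCS is unreachable.
        print(f"State API query failed: {exc}")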
API Reference ~~~~~~~~~~~~~~~~~~~~~~~~~~ - For the CLI Reference, see :ref:`State CLI Reference `. - For the SDK Reference, see :ref:`State API Reference `. - For the Log CLI Reference, see :ref:`Log CLI Reference `. Using Ray CLI tools from outside the cluster -------------------------------------------------------- These CLI commands have to be run on a node in the Ray Cluster. Examples for executing these commands from a machine outside the Ray Cluster are provided below. .. tab-set:: .. tab-item:: VM Cluster Launcher Execute a command on the cluster using ``ray exec``: .. code-block:: shell $ ray exec "ray status" .. tab-item:: KubeRay Execute a command on the cluster using ``kubectl exec`` and the configured RayCluster name. Ray uses the Service targeting the Ray head pod to execute a CLI command on the cluster. .. code-block:: shell # First, find the name of the Ray head service. $ kubectl get pod | grep -head # NAME READY STATUS RESTARTS AGE # -head-xxxxx 2/2 Running 0 XXs # Then, use the name of the Ray head service to run `ray status`. $ kubectl exec -head-xxxxx -- ray status --- .. _observability-debug-failures: Debugging Failures ================== What Kind of Failures Exist in Ray? ----------------------------------- Ray consists of two major APIs. ``.remote()`` to create a Task or Actor, and :func:`ray.get ` to get the result. Debugging Ray means identifying and fixing failures from remote processes that run functions and classes (Tasks and Actors) created by the ``.remote`` API. Ray APIs are future APIs (indeed, it is :ref:`possible to convert Ray object references to standard Python future APIs `), and the error handling model is the same. When any remote Tasks or Actors fail, the returned object ref contains an exception. When you call ``get`` API to the object ref, it raises an exception. .. testcode:: import ray @ray.remote def f(): raise ValueError("it's an application error") # Raises a ValueError. try: ray.get(f.remote()) except ValueError as e: print(e) .. testoutput:: ... ValueError: it's an application error In Ray, there are three types of failures. See exception APIs for more details. - **Application failures**: This means the remote task/actor fails by the user code. In this case, ``get`` API will raise the :func:`RayTaskError ` which includes the exception raised from the remote process. - **Intentional system failures**: This means Ray is failed, but the failure is intended. For example, when you call cancellation APIs like ``ray.cancel`` (for task) or ``ray.kill`` (for actors), the system fails remote tasks and actors, but it is intentional. - **Unintended system failures**: This means the remote tasks and actors failed due to unexpected system failures such as processes crashing (for example, by out-of-memory error) or nodes failing. 1. `Linux Out of Memory killer `_ or :ref:`Ray Memory Monitor ` kills processes with high memory usages to avoid out-of-memory. 2. The machine shuts down (e.g., spot instance termination) or a :term:`raylet ` crashed (e.g., by an unexpected failure). 3. System is highly overloaded or stressed (either machine or system components like Raylet or :term:`GCS `), which makes the system unstable and fail. Debugging Application Failures ------------------------------ Ray distributes users' code to multiple processes across many machines. Application failures mean bugs in users' code. Ray provides a debugging experience that's similar to debugging a single-process Python program. 
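As a concrete illustration of the failure types listed above, the following minimal sketch (illustrative code, not taken from the Ray examples) shows how an application error and an intentional cancellation surface as different exceptions when you call ``ray.get``:

.. code-block:: python

    import time

    import ray
    from ray.exceptions import RayTaskError, TaskCancelledError

    ray.init()

    @ray.remote
    def buggy():
        # Application failure: a bug in user code.
        raise ValueError("application error in user code")

    @ray.remote
    def slow():
        time.sleep(60)

    # Application failures surface as RayTaskError wrapping the original exception.
    try:
        ray.get(buggy.remote())
    except RayTaskError as exc:
        print("application failure:", exc)

    # Intentional system failure: cancel the task, then ray.get raises an error.
    ref = slow.remote()
    ray.cancel(ref)
    try:
        ray.get(ref)
    except (TaskCancelledError, RayTaskError) as exc:
        # Depending on timing, the cancellation may surface as TaskCancelledError
        # or as a RayTaskError that wraps it.
        print("task was cancelled:", exc)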
print ~~~~~ ``print`` debugging is one of the most common ways to debug Python programs. :ref:`Ray's Task and Actor logs are printed to the Ray Driver ` by default, which allows you to simply use the ``print`` function to debug application failures. Debugger ~~~~~~~~ Many Python developers use a debugger to debug Python programs, and `Python pdb `_ is one of the most popular choices. Ray has native integration with ``pdb``. You can simply add ``breakpoint()`` to Actor and Task code to enable ``pdb``. View :ref:`Ray Debugger ` for more details. Running out of file descriptors (``Too many open files``) --------------------------------------------------------- In a Ray cluster, any two system components can communicate with each other and make one or more connections. For example, some workers may need to communicate with GCS to schedule Actors (worker <-> GCS connection). Your Driver can invoke Actor methods (worker <-> worker connection). Ray can support 1000s of raylets and 10000s of worker processes. When a Ray cluster gets larger, each component can have an increasing number of network connections, which requires file descriptors. Linux typically limits the default file descriptors per process to 1024. When a component has more than 1024 connections, it can raise the error message below. .. code-block:: bash Too many open files This error is especially common for the head node GCS process because it is a centralized component that many other components in Ray communicate with. When you see this error message, we recommend you adjust the max file descriptors limit per process via the ``ulimit`` command. We recommend you apply ``ulimit -n 65536`` to your host configuration. However, you can also selectively apply it for Ray components (see the example below). Normally, each worker has 2~3 connections to GCS. Each raylet has 1~2 connections to GCS. 65536 file descriptors can handle 10000~15000 workers and 1000~2000 nodes. If you have more workers, you should consider using a higher number than 65536. .. code-block:: bash # Start head node components with higher ulimit. ulimit -n 65536 ray start --head # Start worker node components with higher ulimit. ulimit -n 65536 ray start --address # Start a Ray driver with higher ulimit. ulimit -n 65536 If that fails, double-check that the hard limit is sufficiently large by running ``ulimit -Hn``. If it is too small, you can increase the hard limit as follows (these instructions work on EC2). * Increase the hard ulimit for open file descriptors system-wide by running the following. .. code-block:: bash sudo bash -c "echo $USER hard nofile 65536 >> /etc/security/limits.conf" * Log out and log back in. Failures due to memory issues -------------------------------- View :ref:`debugging memory issues ` for more details. This document discusses some common problems that people run into when using Ray as well as some known problems. If you encounter other problems, `let us know`_. .. _`let us know`: https://github.com/ray-project/ray/issues --- .. _observability-debug-hangs: Debugging Hangs =============== View stack traces in Ray Dashboard ----------------------------------- The :ref:`Ray dashboard ` lets you profile Ray Driver or Worker processes by clicking the "CPU profiling" or "Stack Trace" actions for active Worker processes, Tasks, Actors, and a Job's driver process. .. image:: /images/profile.png :align: center :width: 80% Clicking "Stack Trace" returns the current stack trace sample using ``py-spy``.
By default, only the Python stack trace is shown. To show native code frames, set the URL parameter ``native=1`` (only supported on Linux). .. image:: /images/stack.png :align: center :width: 60% .. note:: You may run into permission errors when using py-spy in Docker containers. To fix the issue: * If you start Ray manually in a Docker container, follow the `py-spy documentation`_ to resolve it. * If you are a KubeRay user, follow the :ref:`guide to configure KubeRay ` and resolve it. .. note:: The following errors are conditional and not signals of failures for your Python programs: * If you see "No such file or directory", check if your worker process has exited. * If you see "No stack counts found", check if your worker process was sleeping and not active in the last 5s. .. _`py-spy documentation`: https://github.com/benfred/py-spy#how-do-i-run-py-spy-in-docker Use ``ray stack`` CLI command ------------------------------ Once ``py-spy`` is installed (it is automatically installed if the "Ray Dashboard" component is included when :ref:`installing Ray `), you can run ``ray stack`` to dump the stack traces of all Ray Worker processes on the current node. This document discusses some common problems that people run into when using Ray as well as some known problems. If you encounter other problems, please `let us know`_. .. _`let us know`: https://github.com/ray-project/ray/issues --- .. _ray-core-mem-profiling: Debugging Memory Issues ======================= .. _troubleshooting-out-of-memory: Debugging Out of Memory ----------------------- Before reading this section, familiarize yourself with the Ray :ref:`Memory Management ` model. - If your cluster has out-of-memory problems, view :ref:`How to Detect Out-of-Memory Errors `. - To locate the source of the memory leak, view :ref:`Find per Task and Actor Memory Usage `. - If your head node has high memory usage, view :ref:`Head Node Out-of-Memory Error `. - If your memory usage is high due to high parallelism, view :ref:`Reduce Parallelism `. - If you want to profile per Task and Actor memory usage, view :ref:`Profile Task and Actor Memory Usage `. What's the Out-of-Memory Error? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Memory is a limited resource. When a process requests memory and the OS fails to allocate it, the OS executes a routine to free up memory by killing a process that has high memory usage (via SIGKILL) to avoid the OS becoming unstable. This routine is called the `Linux Out of Memory killer `_. One of the common problems of the Linux out-of-memory killer is that SIGKILL kills processes without Ray noticing it. Since SIGKILL cannot be handled by processes, Ray has difficulty raising a proper error message and taking proper actions for fault tolerance. To solve this problem, Ray has (from Ray 2.2) an application-level :ref:`memory monitor `, which continually monitors the memory usage of the host and kills the Ray Workers before the Linux out-of-memory killer executes. .. _troubleshooting-out-of-memory-how-to-detect: Detecting Out-of-Memory errors ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ If the Linux out-of-memory killer terminates Tasks or Actors, Ray Worker processes are unable to catch and display an exact root cause because SIGKILL cannot be handled by processes. If you call ``ray.get`` on the Tasks and Actors that were executed from the dead worker, it raises an exception with one of the following error messages (which indicate that the worker was killed unexpectedly). ..
code-block:: bash Worker exit type: UNEXPECTED_SYSTEM_EXIT Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors. .. code-block:: bash Worker exit type: SYSTEM_ERROR Worker exit detail: The leased worker has unrecoverable failure. Worker is requested to be destroyed when it is returned. You can also use the `dmesg `_ CLI command to verify the processes are killed by the Linux out-of-memory killer. .. image:: ../../images/dmsg.png :align: center If Ray's memory monitor kills the worker, it is automatically retried (see the :ref:`link ` for details). If Tasks or Actors cannot be retried, they raise an exception with a much cleaner error message when you call ``ray.get`` on them. .. code-block:: bash ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory. Task was killed due to the node running low on memory. Memory on the node (IP: 10.0.62.231, ID: e5d953ef03e55e26f13973ea1b5a0fd0ecc729cd820bc89e4aa50451) where the task (task ID: 43534ce9375fa8e4cd0d0ec285d9974a6a95897401000000, name=allocate_memory, pid=11362, memory used=1.25GB) was running was 27.71GB / 28.80GB (0.962273), which exceeds the memory usage threshold of 0.95. Ray killed this worker (ID: 6f2ec5c8b0d5f5a66572859faf192d36743536c2e9702ea58084b037) because it was the most recently scheduled task; to see more information about memory usage on this node, use `ray logs raylet.out -ip 10.0.62.231`. To see the logs of the worker, use `ray logs worker-6f2ec5c8b0d5f5a66572859faf192d36743536c2e9702ea58084b037*out -ip 10.0.62.231.` Top 10 memory users: PID MEM(GB) COMMAND 410728 8.47 510953 7.19 ray::allocate_memory 610952 6.15 ray::allocate_memory 711164 3.63 ray::allocate_memory 811156 3.63 ray::allocate_memory 911362 1.25 ray::allocate_memory 107230 0.09 python test.py --num-tasks 2011327 0.08 /home/ray/anaconda3/bin/python /home/ray/anaconda3/lib/python3.9/site-packages/ray/dashboard/dashboa... Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. The Ray memory monitor also periodically prints the aggregated out-of-memory killer summary to Ray drivers. .. code-block:: bash (raylet) [2023-04-09 07:23:59,445 E 395 395] (raylet) node_manager.cc:3049: 10 Workers (tasks / actors) killed due to memory pressure (OOM), 0 Workers crashed due to other reasons at node (ID: e5d953ef03e55e26f13973ea1b5a0fd0ecc729cd820bc89e4aa50451, IP: 10.0.62.231) over the last time period. To see more information about the Workers killed on this node, use `ray logs raylet.out -ip 10.0.62.231` (raylet) (raylet) Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero. Ray Dashboard's :ref:`metrics page ` and :ref:`event page ` also provide the out-of-memory killer-specific events and metrics. .. image:: ../../images/oom-metrics.png :align: center ..
image:: ../../images/oom-events.png :align: center .. _troubleshooting-out-of-memory-task-actor-mem-usage: Find per Task and Actor Memory Usage ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ If Tasks or Actors fail because of out-of-memory errors, they are retried based on :ref:`retry policies `. However, it is often preferred to find the root causes of memory issues and fix them instead of relying on fault tolerance mechanisms. This section explains how to debug out-of-memory errors in Ray. First, find the Tasks and Actors that have high memory usage. View the :ref:`per Task and Actor memory usage graph ` for more details. The memory usage from the per component graph uses RSS - SHR. See below for reasoning. Alternatively, you can also use the CLI command `htop `_. .. image:: ../../images/htop.png :align: center See the ``allocate_memory`` row. See two columns, RSS and SHR. SHR usage is typically the memory usage from the Ray object store. The Ray object store allocates 30% of host memory to the shared memory (``/dev/shm``, unless you specify ``--object-store-memory``). If Ray workers access the object inside the object store using ``ray.get``, SHR usage increases. Since the Ray object store supports the :ref:`zero-copy ` deserialization, several workers can access the same object without copying them to in-process memory. For example, if 8 workers access the same object inside the Ray object store, each process' ``SHR`` usage increases. However, they are not using 8 * SHR memory (there's only 1 copy in the shared memory). Also note that Ray object store triggers :ref:`object spilling ` when the object usage goes beyond the limit, which means the memory usage from the shared memory won't exceed 30% of the host memory. Out-of-memory issues from a host, are due to RSS usage from each worker. Calculate per process memory usage by RSS - SHR because SHR is for Ray object store as explained above. The total memory usage is typically ``SHR (object store memory usage, 30% of memory) + sum(RSS - SHR from each ray proc) + sum(RSS - SHR from system components. e.g., raylet, GCS. Usually small)``. .. _troubleshooting-out-of-memory-head: Head node out-of-Memory error ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ First, check the head node memory usage from the metrics page. Find the head node address from the cluster page. .. image:: ../../images/head-node-addr.png :align: center Then check the memory usage from the head node from the node memory usage view inside the Dashboard :ref:`metrics view `. .. image:: ../../images/metrics-node-view.png :align: center The Ray head node has more memory-demanding system components such as GCS or the dashboard. Also, the driver runs from a head node by default. If the head node has the same memory capacity as worker nodes and if you execute the same number of Tasks and Actors from a head node, it can easily have out-of-memory problems. In this case, do not run any Tasks and Actors on the head node by specifying ``--num-cpus=0`` when starting a head node by ``ray start --head``. If you use KubeRay, view :ref:`here `. .. _troubleshooting-out-of-memory-reduce-parallelism: Reduce Parallelism ~~~~~~~~~~~~~~~~~~ High parallelism can trigger out-of-memory errors. For example, if you have 8 training workers that perform the data preprocessing -> training. If you load too much data into each worker, the total memory usage (``training worker mem usage * 8``) can exceed the memory capacity. Verify the memory usage by looking at the :ref:`per Task and Actor memory usage graph ` and the Task metrics. 
First, see the memory usage of an ``allocate_memory`` task. The total is 18GB. At the same time, verify the 15 concurrent tasks that are running. .. image:: ../../images/component-memory.png :align: center .. image:: ../../images/tasks-graph.png :align: center Each task uses about 18GB / 15 == 1.2 GB. To reduce the parallelism: - `Limit the max number of running tasks `_. - Increase the ``num_cpus`` option for :func:`ray.remote`. Modern hardware typically has 4GB of memory per CPU, so you can choose the CPU requirements accordingly. This example specifies 1 CPU per ``allocate_memory`` Task. Doubling the CPU requirement runs only half (7) of the Tasks at the same time, and memory usage doesn't exceed 9GB. .. _troubleshooting-out-of-memory-profile: Profiling Task and Actor memory usage ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ It is also possible that tasks and actors use more memory than you expect. For example, actors or tasks can have a memory leak or make unnecessary copies. View the instructions below to learn how to memory profile individual actors and tasks. .. _memray-profiling: Memory Profiling Ray tasks and actors -------------------------------------- To memory profile Ray tasks or actors, use `memray `_. Note that you can also use other memory profiling tools if they support a similar API. First, install ``memray``. .. code-block:: bash pip install memray ``memray`` supports a Python context manager to enable memory profiling. You can write the ``memray`` profiling file wherever you want. In this example, write the files to `/tmp/ray/session_latest/logs` because the Ray dashboard allows you to download files inside the log folder. This also allows you to download profiling files from other nodes. .. tab-set:: .. tab-item:: Actors .. literalinclude:: ../../doc_code/memray_profiling.py :language: python :start-after: __memray_profiling_start__ :end-before: __memray_profiling_end__ .. tab-item:: Tasks Note that tasks have a shorter lifetime, so there could be lots of memory profiling files. .. literalinclude:: ../../doc_code/memray_profiling.py :language: python :start-after: __memray_profiling_task_start__ :end-before: __memray_profiling_task_end__ Once the task or actor runs, go to the :ref:`Logs view ` of the dashboard. Find and click the log file name. .. image:: ../../images/memory-profiling-files.png :align: center Click the download button. .. image:: ../../images/download-memory-profiling-files.png :align: center Now you have the memory profiling file. Run the following command to see the result of the memory profiling: .. code-block:: bash memray flamegraph --- .. _observability-general-debugging: Common Issues ======================= Distributed applications offer great power but also increased complexity. Some of Ray's behaviors may initially surprise users, but these design choices serve important purposes in distributed computing environments. This document outlines common issues encountered when running Ray in a cluster, highlighting key differences compared to running Ray locally. Environment variables aren't passed from the Driver process to Worker processes --------------------------------------------------------------------------------- **Issue:** When you set an environment variable on your Driver, it isn't propagated to the Worker processes. **Example:** Suppose you have a file ``baz.py`` in the directory where you run Ray, and you execute the following command: ..
literalinclude:: /ray-observability/doc_code/gotchas.py :language: python :start-after: __env_var_start__ :end-before: __env_var_end__ **Expected behavior:** Users may expect that setting environment variables on the Driver sends them to all Worker processes as if running on a single machine, but it doesn't. **Fix:** Enable Runtime Environments to explicitly pass environment variables. When you call ``ray.init(runtime_env=...)``, it sends the specified environment variables to the Workers. Alternatively, you can set the environment variables as part of your cluster setup configuration. .. literalinclude:: /ray-observability/doc_code/gotchas.py :language: python :start-after: __env_var_fix_start__ :end-before: __env_var_fix_end__ Filenames work sometimes and not at other times ----------------------------------------------- **Issue:** Referencing a file by its name in a Task or Actor may sometimes succeed and sometimes fail. This inconsistency arises because the Task or Actor finds the file when running on the Head Node, but the file might not exist on other machines. **Example:** Consider the following scenario: .. code-block:: bash % touch /tmp/foo.txt And this code: .. testcode:: import os import ray @ray.remote def check_file(): foo_exists = os.path.exists("/tmp/foo.txt") return foo_exists futures = [] for _ in range(1000): futures.append(check_file.remote()) print(ray.get(futures)) In this case, you might receive a mixture of True and False. If ``check_file()`` runs on the Head Node or locally, it finds the file; however, on a Worker Node, it doesn't. **Expected behavior:** Users generally expect file references to either work consistently or to reliably fail, rather than behaving inconsistently. **Fix:** — Use only shared file paths for such applications. For example, a network file system or S3 storage can provide the required consistency. — Avoid relying on local files to be consistent across machines. Placement Groups aren't composable ----------------------------------- **Issue:** If you schedule a new task from the tasks or actors running within a Placement Group, the system might fail to allocate resources properly, causing the operation to hang. **Example:** Imagine you are using Ray Tune (which creates Placement Groups) and want to apply it to an objective function that in turn uses Ray Tasks. For example: .. testcode:: import ray from ray import tune from ray.util.placement_group import PlacementGroupSchedulingStrategy def create_task_that_uses_resources(): @ray.remote(num_cpus=10) def sample_task(): print("Hello") return return ray.get([sample_task.remote() for i in range(10)]) def objective(config): create_task_that_uses_resources() tuner = tune.Tuner(objective, param_space={"a": 1}) tuner.fit() This code errors with the message: .. code-block:: ValueError: Cannot schedule create_task_that_uses_resources..sample_task with the placement group because the resource request {'CPU': 10} cannot fit into any bundles for the placement group, [{'CPU': 1.0}]. **Expected behavior:** The code executes successfully without resource allocation issues. **Fix:** Ensure that in the ``@ray.remote`` declaration of tasks called within ``create_task_that_uses_resources()``, you include the parameter ``scheduling_strategy=PlacementGroupSchedulingStrategy(placement_group=None)``. .. 
code-block:: diff def create_task_that_uses_resources(): + @ray.remote(num_cpus=10, scheduling_strategy=PlacementGroupSchedulingStrategy(placement_group=None)) - @ray.remote(num_cpus=10) Outdated Function Definitions ----------------------------- Because of Python's subtleties, redefining a remote function may not always update Ray to use the latest version. For example, suppose you define a remote function ``f`` and then redefine it; Ray should use the new definition: .. testcode:: import ray @ray.remote def f(): return 1 @ray.remote def f(): return 2 print(ray.get(f.remote())) # This should print 2. .. testoutput:: 2 However, there are cases where modifying a remote function doesn't take effect without restarting the cluster: — **Imported function issue:** If ``f`` is defined in an external file (e.g., ``file.py``), and you modify its definition, re-importing the file may be ignored because Python treats the subsequent import as a no-op. A solution is to use ``from importlib import reload; reload(file)`` instead of a second import. — **Helper function dependency:** If ``f`` depends on a helper function ``h`` defined in an external file, changes to ``h`` may not propagate. The easiest solution is to restart the Ray cluster. Alternatively, you can redefine ``f`` to reload ``file.py`` before invoking ``h``: .. testcode:: @ray.remote def f(): from importlib import reload reload(file) return file.h() This forces the external module to reload on the Workers. Note that in Python 3, you must use ``from importlib import reload``. Capture task and actor call sites --------------------------------- Ray captures and displays a stack trace when you invoke a task, create an actor, or call an actor method. To enable call site capture, set the environment variable ``RAY_record_task_actor_creation_sites=true``. When enabled: — Ray captures a stack trace when creating tasks, actors, or invoking actor methods. — The captured stack trace is available in the Ray Dashboard (under task and actor details), output of the state CLI command ``ray list task --detail``, and state API responses. Note that Ray turns off stack trace capture by default due to potential performance impacts. Enable it only when you need it for debugging. Example: .. NOTE(edoakes): test is skipped because it reinitializes Ray. .. testcode:: :skipif: True import ray # Enable stack trace capture ray.init(runtime_env={"env_vars": {"RAY_record_task_actor_creation_sites": "true"}}) @ray.remote def my_task(): return 42 # Capture the stack trace upon task invocation. future = my_task.remote() result = ray.get(future) @ray.remote class Counter: def __init__(self): self.value = 0 def increment(self): self.value += 1 return self.value # Capture the stack trace upon actor creation. counter = Counter.remote() # Capture the stack trace upon method invocation. counter.increment.remote() This document outlines common problems encountered when using Ray along with potential solutions. If you encounter additional issues, please report them. .. _`let us know`: https://github.com/ray-project/ray/issues --- .. _observability-optimize-performance: Optimizing Performance ====================== No speedup ---------- You just ran an application using Ray, but it wasn't as fast as you expected it to be. Or worse, perhaps it was slower than the serial version of the application! The most common reasons are the following. - **Number of cores:** How many cores is Ray using? 
When you start Ray, it will determine the number of CPUs on each machine with ``psutil.cpu_count()``. Ray usually will not schedule more tasks in parallel than the number of CPUs. So if the number of CPUs is 4, the most you should expect is a 4x speedup. - **Physical versus logical CPUs:** Do the machines you're running on have fewer **physical** cores than **logical** cores? You can check the number of logical cores with ``psutil.cpu_count()`` and the number of physical cores with ``psutil.cpu_count(logical=False)``. This is common on a lot of machines and especially on EC2. For many workloads (especially numerical workloads), you often cannot expect a greater speedup than the number of physical CPUs. - **Small tasks:** Are your tasks very small? Ray introduces some overhead for each task (the amount of overhead depends on the arguments that are passed in). You will be unlikely to see speedups if your tasks take less than ten milliseconds. For many workloads, you can easily increase the sizes of your tasks by batching them together. - **Variable durations:** Do your tasks have variable duration? If you run 10 tasks with variable duration in parallel, you shouldn't expect an N-fold speedup (because you'll end up waiting for the slowest task). In this case, consider using ``ray.wait`` to begin processing tasks that finish first. - **Multi-threaded libraries:** Are all of your tasks attempting to use all of the cores on the machine? If so, they are likely to experience contention and prevent your application from achieving a speedup. This is common with some versions of ``numpy``. To avoid contention, set an environment variable like ``MKL_NUM_THREADS`` (or the equivalent depending on your installation) to ``1``. For many - but not all - libraries, you can diagnose this by opening ``top`` while your application is running. If one process is using most of the CPUs, and the others are using a small amount, this may be the problem. The most common exception is PyTorch, which will appear to be using all the cores despite needing ``torch.set_num_threads(1)`` to be called to avoid contention. If you are still experiencing a slowdown, but none of the above problems apply, we'd really like to know! Create a `GitHub issue`_ and Submit a minimal code example that demonstrates the problem. .. _`Github issue`: https://github.com/ray-project/ray/issues This document discusses some common problems that people run into when using Ray as well as some known problems. If you encounter other problems, `let us know`_. .. _`let us know`: https://github.com/ray-project/ray/issues .. _ray-core-timeline: Visualizing Tasks with Ray Timeline ------------------------------------- View :ref:`how to use Ray Timeline in the Dashboard ` for more details. Instead of using Dashboard UI to download the tracing file, you can also export the tracing file as a JSON file by running ``ray timeline`` from the command line or ``ray.timeline`` from the Python API. .. testcode:: import ray ray.init() ray.timeline(filename="timeline.json") .. _dashboard-profiling: Python CPU profiling in the Dashboard ------------------------------------- The :ref:`Ray dashboard ` lets you profile Ray worker processes by clicking on the "Stack Trace" or "CPU Flame Graph" actions for active workers, actors, and jobs. .. image:: /images/profile.png :align: center :width: 80% Clicking "Stack Trace" returns the current stack trace sample using ``py-spy``. By default, only the Python stack trace is shown. 
To show native code frames, set the URL parameter ``native=1`` (only supported on Linux). .. image:: /images/stack.png :align: center :width: 60% Clicking "CPU Flame Graph" takes a number of stack trace samples and combines them into a flame graph visualization. This flame graph can be useful for understanding the CPU activity of that particular process. To adjust the duration of the flame graph, you can change the ``duration`` parameter in the URL. Similarly, you can change the ``native`` parameter to enable native profiling. .. image:: /images/flamegraph.png :align: center :width: 80% The profiling feature requires ``py-spy`` to be installed. If it is not installed, or if the ``py-spy`` binary does not have root permissions, the Dashboard prompts with instructions on how to set up ``py-spy`` correctly: .. code-block:: This command requires `py-spy` to be installed with root permissions. You can install `py-spy` and give it root permissions as follows: $ pip install py-spy $ sudo chown root:root `which py-spy` $ sudo chmod u+s `which py-spy` Alternatively, you can start Ray with passwordless sudo / root permissions. .. note:: You may run into permission errors when using py-spy in Docker containers. To fix the issue: * If you start Ray manually in a Docker container, follow the `py-spy documentation`_ to resolve it. * If you are a KubeRay user, follow the :ref:`guide to configure KubeRay ` and resolve it. .. _`py-spy documentation`: https://github.com/benfred/py-spy#how-do-i-run-py-spy-in-docker .. _dashboard-cprofile: Profiling using Python's cProfile --------------------------------- You can use Python's native cProfile `profiling module`_ to profile the performance of your Ray application. Rather than tracking your application code line by line, cProfile reports the total runtime of each loop function, as well as the number of calls made and the execution time of all function calls made within the profiled code. .. _`profiling module`: https://docs.python.org/3/library/profile.html#module-cProfile Unlike ``line_profiler`` above, this detailed list of profiled function calls **includes** internal function calls and function calls made within Ray. However, similar to ``line_profiler``, cProfile can be enabled with minimal changes to your application code (given that each section of the code you want to profile is defined as its own function). To use cProfile, add an import statement, then replace calls to the loop functions as follows: .. testcode:: :skipif: True import cProfile # Added import statement def ex1(): list1 = [] for i in range(5): list1.append(ray.get(func.remote())) def main(): ray.init() cProfile.run('ex1()') # Modified call to ex1 cProfile.run('ex2()') cProfile.run('ex3()') if __name__ == "__main__": main() Now, when you execute your Python script, a cProfile list of profiled function calls is printed on the terminal for each call made to ``cProfile.run()``. The very top of cProfile's output gives the total execution time for ``'ex1()'``: .. code-block:: bash 601 function calls (595 primitive calls) in 2.509 seconds The following is a snippet of profiled function calls for ``'ex1()'``. Most of these calls are quick and take around 0.000 seconds, so the functions of interest are the ones with non-zero execution times: .. code-block:: bash ncalls tottime percall cumtime percall filename:lineno(function) ...
1 0.000 0.000 2.509 2.509 your_script_here.py:31(ex1) 5 0.000 0.000 0.001 0.000 remote_function.py:103(remote) 5 0.000 0.000 0.001 0.000 remote_function.py:107(_remote) ... 10 0.000 0.000 0.000 0.000 worker.py:2459(__init__) 5 0.000 0.000 2.508 0.502 worker.py:2535(get) 5 0.000 0.000 0.000 0.000 worker.py:2695(get_global_worker) 10 0.000 0.000 2.507 0.251 worker.py:374(retrieve_and_deserialize) 5 0.000 0.000 2.508 0.502 worker.py:424(get_object) 5 0.000 0.000 0.000 0.000 worker.py:514(submit_task) ... The 5 separate calls to Ray's ``get``, taking the full 0.502 seconds each call, can be noticed at ``worker.py:2535(get)``. Meanwhile, the act of calling the remote function itself at ``remote_function.py:103(remote)`` only takes 0.001 seconds over 5 calls, and thus is not the source of the slow performance of ``ex1()``. Profiling Ray Actors with cProfile ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Considering that the detailed output of cProfile can be quite different depending on what Ray functionalities we use, let us see what cProfile's output might look like if our example involved Actors (for an introduction to Ray actors, see our :ref:`Actor documentation `). Now, instead of looping over five calls to a remote function like in ``ex1``, let's create a new example and loop over five calls to a remote function **inside an actor**. Our actor's remote function again just sleeps for 0.5 seconds: .. testcode:: # Our actor @ray.remote class Sleeper: def __init__(self): self.sleepValue = 0.5 # Equivalent to func(), but defined within an actor def actor_func(self): time.sleep(self.sleepValue) Recalling the suboptimality of ``ex1``, let's first see what happens if we attempt to perform all five ``actor_func()`` calls within a single actor: .. testcode:: def ex4(): # This is suboptimal in Ray, and should only be used for the sake of this example actor_example = Sleeper.remote() five_results = [] for i in range(5): five_results.append(actor_example.actor_func.remote()) # Wait until the end to call ray.get() ray.get(five_results) We enable cProfile on this example as follows: .. testcode:: :skipif: True def main(): ray.init() cProfile.run('ex4()') if __name__ == "__main__": main() Running our new Actor example, cProfile's abbreviated output is as follows: .. code-block:: bash 12519 function calls (11956 primitive calls) in 2.525 seconds ncalls tottime percall cumtime percall filename:lineno(function) ... 1 0.000 0.000 0.015 0.015 actor.py:546(remote) 1 0.000 0.000 0.015 0.015 actor.py:560(_remote) 1 0.000 0.000 0.000 0.000 actor.py:697(__init__) ... 1 0.000 0.000 2.525 2.525 your_script_here.py:63(ex4) ... 9 0.000 0.000 0.000 0.000 worker.py:2459(__init__) 1 0.000 0.000 2.509 2.509 worker.py:2535(get) 9 0.000 0.000 0.000 0.000 worker.py:2695(get_global_worker) 4 0.000 0.000 2.508 0.627 worker.py:374(retrieve_and_deserialize) 1 0.000 0.000 2.509 2.509 worker.py:424(get_object) 8 0.000 0.000 0.001 0.000 worker.py:514(submit_task) ... It turns out that the entire example still took 2.5 seconds to execute, or the time for five calls to ``actor_func()`` to run in serial. If you recall ``ex1``, this behavior was because we did not wait until after submitting all five remote function tasks to call ``ray.get()``, but we can verify on cProfile's output line ``worker.py:2535(get)`` that ``ray.get()`` was only called once at the end, for 2.509 seconds. What happened? It turns out Ray cannot parallelize this example, because we have only initialized a single ``Sleeper`` actor. 
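If you want to confirm this serialization without a profiler, timing the call directly works too. The following is a minimal sketch (not part of the original example) that assumes the single-actor ``ex4`` and the ``Sleeper`` actor defined above, with Ray already initialized; exact timings vary by machine:

.. code-block:: python

    import time

    start = time.time()
    ex4()  # Five calls on a single Sleeper actor.
    # Expect roughly 5 x 0.5 s, because the actor runs the calls serially.
    print(f"Elapsed: {time.time() - start:.2f} s")
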
Because each actor is a single, stateful worker, our entire code was submitted to and run on a single worker the whole time. To better parallelize the actors in ``ex4``, we can take advantage of the fact that each call to ``actor_func()`` is independent, and instead create five ``Sleeper`` actors. That way, we are creating five workers that can run in parallel, instead of creating a single worker that can only handle one call to ``actor_func()`` at a time. .. testcode:: def ex4(): # Modified to create five separate Sleepers five_actors = [Sleeper.remote() for i in range(5)] # Each call to actor_func now goes to a different Sleeper five_results = [] for actor_example in five_actors: five_results.append(actor_example.actor_func.remote()) ray.get(five_results) In total, our example now takes only about 1.5 seconds to run: .. code-block:: bash 1378 function calls (1363 primitive calls) in 1.567 seconds ncalls tottime percall cumtime percall filename:lineno(function) ... 5 0.000 0.000 0.002 0.000 actor.py:546(remote) 5 0.000 0.000 0.002 0.000 actor.py:560(_remote) 5 0.000 0.000 0.000 0.000 actor.py:697(__init__) ... 1 0.000 0.000 1.566 1.566 your_script_here.py:71(ex4) ... 21 0.000 0.000 0.000 0.000 worker.py:2459(__init__) 1 0.000 0.000 1.564 1.564 worker.py:2535(get) 25 0.000 0.000 0.000 0.000 worker.py:2695(get_global_worker) 3 0.000 0.000 1.564 0.521 worker.py:374(retrieve_and_deserialize) 1 0.000 0.000 1.564 1.564 worker.py:424(get_object) 20 0.001 0.000 0.001 0.000 worker.py:514(submit_task) ... .. _performance-debugging-gpu-profiling: GPU Profiling with PyTorch Profiler ----------------------------------- Here are the steps to use PyTorch Profiler during training with Ray Train or batch inference with Ray Data: * Follow the `PyTorch Profiler documentation `_ to record events in your PyTorch code. * Convert your PyTorch script to a :ref:`Ray Train training script ` or a :ref:`Ray Data batch inference script `. (No changes to your profiler-related code are needed.) * Run your training or batch inference script. * Collect the profiling results from all the nodes (compared to a single node in a non-distributed setting). * You may want to upload the results from each node to NFS or object storage such as S3 so that you don't have to fetch them from each node individually. * Visualize the results with tools like TensorBoard. GPU Profiling with Nsight System Profiler ------------------------------------------ GPU profiling is critical for ML training and inference. Ray allows users to run the Nsight System Profiler with Ray actors and tasks. :ref:`See for details `. Profiling for developers ------------------------ If you are developing Ray Core or debugging system-level failures, profiling Ray Core can help. In this case, see :ref:`Profiling for Ray developers `. --- .. _ray-debugger: Using the Ray Debugger ====================== Ray has a built-in debugger that allows you to debug your distributed applications. It lets you set breakpoints in your Ray tasks and actors, and when a breakpoint is hit you can drop into a PDB session that you can then use to: - Inspect variables in that context - Step within that task or actor - Move up or down the stack .. warning:: The Ray Debugger is deprecated. Use the :doc:`Ray Distributed Debugger <../../ray-distributed-debugger>` instead. Starting with Ray 2.39, the new debugger is the default and you need to set the environment variable `RAY_DEBUG=legacy` to use the old debugger (e.g., by using a runtime environment). Getting Started --------------- Take the following example: ..
testcode:: :skipif: True import ray ray.init(runtime_env={"env_vars": {"RAY_DEBUG": "legacy"}}) @ray.remote def f(x): breakpoint() return x * x futures = [f.remote(i) for i in range(2)] print(ray.get(futures)) Put the program into a file named ``debugging.py`` and execute it using: .. code-block:: bash python debugging.py Each of the 2 executed tasks will drop into a breakpoint when the line ``breakpoint()`` is executed. You can attach to the debugger by running the following command on the head node of the cluster: .. code-block:: bash ray debug The ``ray debug`` command will print an output like this: .. code-block:: text 2021-07-13 16:30:40,112 INFO scripts.py:216 -- Connecting to Ray instance at 192.168.2.61:6379. 2021-07-13 16:30:40,112 INFO worker.py:740 -- Connecting to existing Ray cluster at address: 192.168.2.61:6379 Active breakpoints: index | timestamp | Ray task | filename:lineno 0 | 2021-07-13 23:30:37 | ray::f() | debugging.py:6 1 | 2021-07-13 23:30:37 | ray::f() | debugging.py:6 Enter breakpoint index or press enter to refresh: You can now enter ``0`` and hit Enter to jump to the first breakpoint. You will be dropped into PDB at the break point and can use the ``help`` to see the available actions. Run ``bt`` to see a backtrace of the execution: .. code-block:: text (Pdb) bt /home/ubuntu/ray/python/ray/workers/default_worker.py(170)() -> ray.worker.global_worker.main_loop() /home/ubuntu/ray/python/ray/worker.py(385)main_loop() -> self.core_worker.run_task_loop() > /home/ubuntu/tmp/debugging.py(7)f() -> return x * x You can inspect the value of ``x`` with ``print(x)``. You can see the current source code with ``ll`` and change stack frames with ``up`` and ``down``. For now let us continue the execution with ``c``. After the execution is continued, hit ``Control + D`` to get back to the list of break points. Select the other break point and hit ``c`` again to continue the execution. The Ray program ``debugging.py`` now finished and should have printed ``[0, 1]``. Congratulations, you have finished your first Ray debugging session! Running on a Cluster -------------------- The Ray debugger supports setting breakpoints inside of tasks and actors that are running across your Ray cluster. In order to attach to these from the head node of the cluster using ``ray debug``, you'll need to make sure to pass in the ``--ray-debugger-external`` flag to ``ray start`` when starting the cluster (likely in your ``cluster.yaml`` file or k8s Ray cluster spec). Note that this flag will cause the workers to listen for PDB commands on an external-facing IP address, so this should *only* be used if your cluster is behind a firewall. Debugger Commands ----------------- The Ray debugger supports the `same commands as PDB `_. Stepping between Ray tasks -------------------------- You can use the debugger to step between Ray tasks. Let's take the following recursive function as an example: .. testcode:: :skipif: True import ray ray.init(runtime_env={"env_vars": {"RAY_DEBUG": "legacy"}}) @ray.remote def fact(n): if n == 1: return n else: n_ref = fact.remote(n - 1) return n * ray.get(n_ref) @ray.remote def compute(): breakpoint() result_ref = fact.remote(5) result = ray.get(result_ref) ray.get(compute.remote()) After running the program by executing the Python file and calling ``ray debug``, you can select the breakpoint by pressing ``0`` and enter. This will result in the following output: .. 
code-block:: shell Enter breakpoint index or press enter to refresh: 0 > /home/ubuntu/tmp/stepping.py(16)() -> result_ref = fact.remote(5) (Pdb) You can jump into the call with the ``remote`` command in Ray's debugger. Inside the function, print the value of `n` with ``p(n)``, resulting in the following output: .. code-block:: shell -> result_ref = fact.remote(5) (Pdb) remote *** Connection closed by remote host *** Continuing pdb session in different process... --Call-- > /home/ubuntu/tmp/stepping.py(5)fact() -> @ray.remote (Pdb) ll 5 -> @ray.remote 6 def fact(n): 7 if n == 1: 8 return n 9 else: 10 n_ref = fact.remote(n - 1) 11 return n * ray.get(n_ref) (Pdb) p(n) 5 (Pdb) Now step into the next remote call again with ``remote`` and print `n`. You an now either continue recursing into the function by calling ``remote`` a few more times, or you can jump to the location where ``ray.get`` is called on the result by using the ``get`` debugger command. Use ``get`` again to jump back to the original call site and use ``p(result)`` to print the result: .. code-block:: shell Enter breakpoint index or press enter to refresh: 0 > /home/ubuntu/tmp/stepping.py(14)() -> result_ref = fact.remote(5) (Pdb) remote *** Connection closed by remote host *** Continuing pdb session in different process... --Call-- > /home/ubuntu/tmp/stepping.py(5)fact() -> @ray.remote (Pdb) p(n) 5 (Pdb) remote *** Connection closed by remote host *** Continuing pdb session in different process... --Call-- > /home/ubuntu/tmp/stepping.py(5)fact() -> @ray.remote (Pdb) p(n) 4 (Pdb) get *** Connection closed by remote host *** Continuing pdb session in different process... --Return-- > /home/ubuntu/tmp/stepping.py(5)fact()->120 -> @ray.remote (Pdb) get *** Connection closed by remote host *** Continuing pdb session in different process... --Return-- > /home/ubuntu/tmp/stepping.py(14)()->None -> result_ref = fact.remote(5) (Pdb) p(result) 120 (Pdb) Post Mortem Debugging --------------------- Often we do not know in advance where an error happens, so we cannot set a breakpoint. In these cases, we can automatically drop into the debugger when an error occurs or an exception is thrown. This is called *post-mortem debugging*. Copy the following code into a file called ``post_mortem_debugging.py``. The flag ``RAY_DEBUG_POST_MORTEM=1`` will have the effect that if an exception happens, Ray will drop into the debugger instead of propagating it further. .. testcode:: :skipif: True import ray ray.init(runtime_env={"env_vars": {"RAY_DEBUG": "legacy", "RAY_DEBUG_POST_MORTEM": "1"}}) @ray.remote def post_mortem(x): x += 1 raise Exception("An exception is raised.") return x ray.get(post_mortem.remote(10)) Let's start the program: .. code-block:: bash python post_mortem_debugging.py Now run ``ray debug``. After we do that, we see an output like the following: .. 
code-block:: text Active breakpoints: index | timestamp | Ray task | filename:lineno 0 | 2024-11-01 20:14:00 | /Users/pcmoritz/ray/python/ray/_private/workers/default_worker.py --node-ip-address=127.0.0.1 --node-manager-port=49606 --object-store-name=/tmp/ray/session_2024-11-01_13-13-51_279910_8596/sockets/plasma_store --raylet-name=/tmp/ray/session_2024-11-01_13-13-51_279910_8596/sockets/raylet --redis-address=None --metrics-agent-port=58655 --runtime-env-agent-port=56999 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --runtime-env-agent-port=56999 --gcs-address=127.0.0.1:6379 --session-name=session_2024-11-01_13-13-51_279910_8596 --temp-dir=/tmp/ray --webui=127.0.0.1:8265 --cluster-id=6d341469ae0f85b6c3819168dde27cceda12e95c8efdfc256e0fd8ce --startup-token=12 --worker-launch-time-ms=1730492039955 --node-id=0d43573a606286125da39767a52ce45ad101324c8af02cc25a9fbac7 --runtime-env-hash=-1746935720 | /Users/pcmoritz/ray/python/ray/_private/worker.py:920 Traceback (most recent call last): File "python/ray/_raylet.pyx", line 1856, in ray._raylet.execute_task File "python/ray/_raylet.pyx", line 1957, in ray._raylet.execute_task File "python/ray/_raylet.pyx", line 1862, in ray._raylet.execute_task File "/Users/pcmoritz/ray-debugger-test/post_mortem_debugging.py", line 8, in post_mortem raise Exception("An exception is raised.") Exception: An exception is raised. Enter breakpoint index or press enter to refresh: We now press ``0`` and then Enter to enter the debugger. With ``ll`` we can see the context and with ``print(x)`` we an print the value of ``x``. In a similar manner as above, you can also debug Ray actors. Happy debugging! Debugging APIs -------------- See :ref:`package-ref-debugging-apis`. --- .. _ray-event-export: Ray Event Export ================ Starting from 2.49, Ray supports exporting structured events to a configured HTTP endpoint. Each node sends events to the endpoint through an HTTP POST request. Previously, Ray's :ref:`task events ` were only used internally by the Ray Dashboard and :ref:`State API ` for monitoring and debugging. With the new event export feature, you can now send these raw events to external systems for custom analytics, monitoring, and integration with third-party tools. .. note:: Ray Event Export is still in alpha. The way to configure event reporting and the format of the events is subject to change. Enable event reporting ---------------------- To enable event reporting, you need to set the ``RAY_enable_core_worker_ray_event_to_aggregator`` environment variable to ``1`` when starting each Ray worker node. To set the target HTTP endpoint, set the ``RAY_DASHBOARD_AGGREGATOR_AGENT_EVENTS_EXPORT_ADDR`` environment variable to a valid HTTP URL with the ``http://`` URL scheme. Event format ------------ Events are JSON objects in the POST request body. All events contain the same base fields and different event specific fields. See `src/ray/protobuf/public/events_base_event.proto `_ for the base fields. Task events ^^^^^^^^^^^ For each task, Ray exports two types of events: Task Definition Event and Task Execution Event. * Each task attempt generates one Task Definition Event which contains the metadata of the task. See `src/ray/protobuf/public/events_task_definition_event.proto `_ and `src/ray/protobuf/public/events_actor_task_definition_event.proto `_ for the event formats for normal tasks and actor tasks respectively. * Task Execution Events contain task state transition information and metadata generated during task execution. 
See `src/ray/protobuf/public/events_task_lifecycle_event.proto `_ for the event format. An example of a Task Definition Event and a Task Execution Event: .. code-block:: json // task definition event { "eventId":"N5n229xkwyjlZRFJDF2G1sh6ZNYlqChwJ4WPEQ==", "sourceType":"CORE_WORKER", "eventType":"TASK_DEFINITION_EVENT", "timestamp":"2025-09-03T18:52:14.467290Z", "severity":"INFO", "sessionName":"session_2025-09-03_11-52-12_635210_85618", "taskDefinitionEvent":{ "taskId":"yO9FzNARJXH///////////////8BAAAA", "taskFunc":{ "pythonFunctionDescriptor":{ "moduleName":"test-tasks", "functionName":"test_task", "functionHash":"37ddb110c0514b049bd4db5ab934627b", "className":"" } }, "taskName":"test_task", "requiredResources":{ "CPU":1.0 }, "serialized_runtime_env": "{}", "jobId":"AQAAAA==", "parentTaskId":"//////////////////////////8BAAAA", "placementGroupId":"////////////////////////", "taskAttempt":0, "taskType":"NORMAL_TASK", "language":"PYTHON", "refIds":{ } }, "message":"" } // task lifecycle event { "eventId":"vkIaAHlQC5KoppGosqs2kBq5k2WzsAAbawDDbQ==", "sourceType":"CORE_WORKER", "eventType":"TASK_LIFECYCLE_EVENT", "timestamp":"2025-09-03T18:52:14.469074Z", "severity":"INFO", "sessionName":"session_2025-09-03_11-52-12_635210_85618", "taskLifecycleEvent":{ "taskId":"yO9FzNARJXH///////////////8BAAAA", "stateTransitions": [ { "state":"PENDING_NODE_ASSIGNMENT", "timestamp":"2025-09-03T18:52:14.467402Z" }, { "state":"PENDING_ARGS_AVAIL", "timestamp":"2025-09-03T18:52:14.467290Z" }, { "state":"SUBMITTED_TO_WORKER", "timestamp":"2025-09-03T18:52:14.469074Z" } ], "nodeId":"ZvxTI6x9dlMFqMlIHErJpg5UEGK1INsKhW2zyg==", "workerId":"hMybCNYIFi+/yInYYhdc+qH8yMF65j/8+uCTmw==", "jobId":"AQAAAA==", "taskAttempt":0, "workerPid":0 }, "message":"" } Actor events ^^^^^^^^^^^^ For each actor, Ray exports two types of events: Actor Definition Events and Actor Lifecycle Events. * An Actor Definition Event contains the metadata of the actor when it is defined. See `src/ray/protobuf/public/events_actor_definition_event.proto `_ for the event format. * An Actor Lifecycle Event contains the actor state transition information and metadata associated with each transition. See `src/ray/protobuf/public/events_actor_lifecycle_event.proto `_ for the event format. .. 
code-block:: json // actor definition event { "eventId": "gsRtAfaWn5TZsjUPFm8nOXd/cKGz82FXdr3Lqg==", "sourceType": "GCS", "eventType": "ACTOR_DEFINITION_EVENT", "timestamp": "2025-10-24T21:12:10.742651Z", "severity": "INFO", "sessionName": "session_2025-10-24_14-12-05_804800_55420", "actorDefinitionEvent": { "actorId": "0AFtngcXtEoxwqmJAQAAAA==", "jobId": "AQAAAA==", "name": "actor-test", "rayNamespace": "bd2ad7f8-650b-495c-b709-55d4c8a7d09f", "serializedRuntimeEnv": "{}", "className": "test_ray_actor_events..A", "isDetached": false, "requiredResources": {}, "placementGroupId": "", "labelSelector": {} }, "message": "" } // actor lifecycle event { "eventId": "mOdfn5SRx3X0B05OvEDV0rcIOzqf/SGBJmrD/Q==", "sourceType": "GCS", "eventType": "ACTOR_LIFECYCLE_EVENT", "timestamp": "2025-10-24T21:12:10.742654Z", "severity": "INFO", "sessionName": "session_2025-10-24_14-12-05_804800_55420", "actorLifecycleEvent": { "actorId": "0AFtngcXtEoxwqmJAQAAAA==", "stateTransitions": [ { "timestamp": "2025-10-24T21:12:10.742654Z", "state": "ALIVE", "nodeId": "zpLG7coqThVMl8df9RYHnhK6thhJqrgPodtfjg==", "workerId": "nrBehSG3HXu0PvHZBkPl2kovmjzAaoCuVj2KHA==" } ] }, "message": "" } Driver job events ^^^^^^^^^^^^^^^^^^ For each driver job, Ray exports two types of events: Driver Job Definition Events and Driver Job Lifecycle Events. * A Driver Job Definition Event contains the metadata of the driver job when it is defined. See `src/ray/protobuf/public/events_driver_job_definition_event.proto `_ for the event format. * A Driver Job Lifecycle Event contains the driver job state transition information and metadata associated with each transition. See `src/ray/protobuf/public/events_driver_job_lifecycle_event.proto `_ for the event format. .. code-block:: json // driver job definition event { "eventId": "7YnwZPJr0KUC28T7KnzsvGyceEIrjNDTHuQfrg==", "sourceType": "GCS", "eventType": "DRIVER_JOB_DEFINITION_EVENT", "timestamp": "2025-10-24T21:17:07.316482Z", "severity": "INFO", "sessionName": "session_2025-10-24_14-17-05_575968_59360", "driverJobDefinitionEvent": { "jobId": "AQAAAA==", "driverPid": "59360", "driverNodeId": "9eHWUIruJWnMjQuPas0W+TRNUyjY5PwFpWUfjA==", "entrypoint": "...", "config": { "serializedRuntimeEnv": "{}", "metadata": {} } }, "message": "" } // driver job lifecycle event { "eventId": "0cmbCI/RQghYe4ZQiJ+HrnK1RiZH+cg8ltBx2w==", "sourceType": "GCS", "eventType": "DRIVER_JOB_LIFECYCLE_EVENT", "timestamp": "2025-10-24T21:17:07.316483Z", "severity": "INFO", "sessionName": "session_2025-10-24_14-17-05_575968_59360", "driverJobLifecycleEvent": { "jobId": "AQAAAA==", "stateTransitions": [ { "state": "CREATED", "timestamp": "2025-10-24T21:17:07.316483Z" } ] }, "message": "" } Node events ^^^^^^^^^^^ For each node, Ray exports two types of events: Node Definition Events and Node Lifecycle Events. * A Node Definition Event contains the metadata of the node when it is defined. See `src/ray/protobuf/public/events_node_definition_event.proto `_ for the event format. * A Node Lifecycle Event contains the node state transition information and metadata associated with each transition. See `src/ray/protobuf/public/events_node_lifecycle_event.proto `_ for the event format. .. 
code-block:: json // node definition event { "eventId": "l7r4gwq4UPhmZGFJYEym6mUkcxqafra60LB6/Q==", "sourceType": "GCS", "eventType": "NODE_DEFINITION_EVENT", "timestamp": "2025-10-24T21:19:14.063953Z", "severity": "INFO", "sessionName": "session_2025-10-24_14-19-12_675240_61141", "nodeDefinitionEvent": { "nodeId": "0yfRX1ex+VtcC+TFXjXcgesdpnEwM76+pEATrQ==", "nodeIpAddress": "127.0.0.1", "labels": { "ray.io/node-id": "d327d15f57b1f95b5c0be4c55e35dc81eb1da6713033bebea44013ad" }, "startTimestamp": "2025-10-24T21:19:14.063Z" }, "message": "" } // node lifecycle event { "eventId": "u3KTG8615MIKBH5PLcii0BMfGFWcvLuSOXM6zg==", "sourceType": "GCS", "eventType": "NODE_LIFECYCLE_EVENT", "timestamp": "2025-10-24T21:19:14.063955Z", "severity": "INFO", "sessionName": "session_2025-10-24_14-19-12_675240_61141", "nodeLifecycleEvent": { "nodeId": "0yfRX1ex+VtcC+TFXjXcgesdpnEwM76+pEATrQ==", "stateTransitions": [ { "timestamp": "2025-10-24T21:19:14.063955Z", "resources": {"node:__internal_head__": 1.0, "CPU": 1.0, "object_store_memory": 157286400.0, "node:127.0.0.1": 1.0, "memory": 42964287488.0}, "state": "ALIVE", "aliveSubState": "UNSPECIFIED" } ] }, "message": "" } High-level Architecture ----------------------- The following diagram shows the high-level architecture of Ray Event Export. .. image:: ../images/ray-event-export.png All Ray components send events to an aggregator agent through gRPC. There is an aggregator agent on each node. The aggregator agent collects all events on that node and sends the events to the configured HTTP endpoint. --- .. _ray-tracing: Tracing ======= To help debug and monitor Ray applications, Ray integrates with OpenTelemetry to facilitate exporting traces to external tracing stacks such as Jaeger. .. note:: Tracing is an Alpha feature and no longer under active development/being maintained. APIs are subject to change. Installation ------------ First, install OpenTelemetry: .. code-block:: shell pip install opentelemetry-api==1.34.1 pip install opentelemetry-sdk==1.34.1 pip install opentelemetry-exporter-otlp==1.34.1 Tracing startup hook -------------------- To enable tracing, you must provide a tracing startup hook with a function that sets up the :ref:`Tracer Provider `, :ref:`Remote Span Processors `, and :ref:`Additional Instruments `. The tracing startup hook is expected to be a function that is called with no args or kwargs. This hook needs to be available in the Python environment of all the worker processes. Below is an example tracing startup hook that sets up the default tracing provider, exports spans to files in ``/tmp/spans``, and does not have any additional instruments. .. testcode:: import ray import os from opentelemetry import trace from opentelemetry.sdk.trace import TracerProvider from opentelemetry.sdk.trace.export import ( ConsoleSpanExporter, SimpleSpanProcessor, ) def setup_tracing() -> None: # Creates /tmp/spans folder os.makedirs("/tmp/spans", exist_ok=True) # Sets the tracer_provider. This is only allowed once per execution # context and will log a warning if attempted multiple times. trace.set_tracer_provider(TracerProvider()) trace.get_tracer_provider().add_span_processor( SimpleSpanProcessor( ConsoleSpanExporter( out=open(f"/tmp/spans/{os.getpid()}.json", "a") ) ) ) For open-source users who want to experiment with tracing, Ray has a default tracing startup hook that exports spans to the folder ``/tmp/spans``. To run using this default hook, run the following code sample to set up tracing and trace a simple Ray Task. .. tab-set:: .. 
tab-item:: ray start .. code-block:: shell $ ray start --head --tracing-startup-hook=ray.util.tracing.setup_local_tmp_tracing:setup_tracing $ python ray.init() @ray.remote def my_function(): return 1 obj_ref = my_function.remote() .. tab-item:: ray.init() .. testcode:: ray.init(_tracing_startup_hook="ray.util.tracing.setup_local_tmp_tracing:setup_tracing") @ray.remote def my_function(): return 1 obj_ref = my_function.remote() If you want to provide your own custom tracing startup hook, provide it in the format of ``module:attribute`` where the attribute is the ``setup_tracing`` function to be run. .. _tracer-provider: Tracer provider ~~~~~~~~~~~~~~~ This configures how to collect traces. View the TracerProvider API `here `__. .. _remote-span-processors: Remote span processors ~~~~~~~~~~~~~~~~~~~~~~ This configures where to export traces to. View the SpanProcessor API `here `__. Users who want to experiment with tracing can configure their remote span processors to export spans to a local JSON file. Serious users developing locally can push their traces to Jaeger containers via the `Jaeger exporter `_. .. _additional-instruments: Additional instruments ~~~~~~~~~~~~~~~~~~~~~~ If you are using a library that has built-in tracing support, the ``setup_tracing`` function you provide should also patch those libraries. You can find more documentation for the instrumentation of these libraries `here `_. Custom traces ************* Add custom tracing in your programs. Within your program, get the tracer object with ``trace.get_tracer(__name__)`` and start a new span with ``tracer.start_as_current_span(...)``. See below for a simple example of adding custom tracing. .. testcode:: from opentelemetry import trace @ray.remote def my_func(): tracer = trace.get_tracer(__name__) with tracer.start_as_current_span("foo"): print("Hello world from OpenTelemetry Python!") --- .. _ref-overview-examples: Examples ======== .. toctree:: :maxdepth: 2 ./e2e-multimodal-ai-workloads/README.ipynb ./entity-recognition-with-llms/README.ipynb ./e2e-audio/README.ipynb ./e2e-xgboost/README.ipynb ./e2e-timeseries/README.ipynb ./object-detection/README.ipynb ./e2e-rag/README.ipynb ./mcp-ray-serve/README.ipynb ./langchain_agent_ray_serve/content/README.ipynb --- :orphan: :html_theme.sidebar_secondary.remove: .. _ref-ray-examples: Ray Examples ============ --- .. _installation: Installing Ray ============== .. raw:: html Run Quickstart on Anyscale

Ray currently officially supports x86_64, aarch64 (ARM) for Linux, and Apple silicon (M1) hardware. Ray on Windows is currently in beta. Official Releases ----------------- From Wheels ~~~~~~~~~~~ You can install the latest official version of Ray from PyPI on Linux, Windows, and macOS by choosing the option that best matches your use case. .. tab-set:: .. tab-item:: Recommended **For machine learning applications** .. code-block:: shell pip install -U "ray[data,train,tune,serve]" # For reinforcement learning support, install RLlib instead. # pip install -U "ray[rllib]" **For general Python applications** .. code-block:: shell pip install -U "ray[default]" # If you don't want Ray Dashboard or Cluster Launcher, install Ray with minimal dependencies instead. # pip install -U "ray" .. tab-item:: Advanced .. list-table:: :widths: 2 3 :header-rows: 1 * - Command - Installed components * - `pip install -U "ray"` - Core * - `pip install -U "ray[default]"` - Core, Dashboard, Cluster Launcher * - `pip install -U "ray[data]"` - Core, Data * - `pip install -U "ray[train]"` - Core, Train * - `pip install -U "ray[tune]"` - Core, Tune * - `pip install -U "ray[serve]"` - Core, Dashboard, Cluster Launcher, Serve * - `pip install -U "ray[serve-grpc]"` - Core, Dashboard, Cluster Launcher, Serve with gRPC support * - `pip install -U "ray[rllib]"` - Core, Tune, RLlib * - `pip install -U "ray[all]"` - Core, Dashboard, Cluster Launcher, Data, Train, Tune, Serve, RLlib. This option isn't recommended. Specify the extras you need as shown below instead. .. tip:: You can combine installation extras. For example, to install Ray with Dashboard, Cluster Launcher, and Train support, you can run: .. code-block:: shell pip install -U "ray[default,train]" .. _install-nightlies: Daily Releases (Nightlies) -------------------------- You can install the nightly Ray wheels via the following links. These daily releases are tested via automated tests but do not go through the full release process. To install these wheels, use the following ``pip`` command and wheels: .. code-block:: bash # Clean removal of previous install pip uninstall -y ray # Install Ray with support for the dashboard + cluster launcher pip install -U "ray[default] @ LINK_TO_WHEEL.whl" # Install Ray with minimal dependencies # pip install -U LINK_TO_WHEEL.whl .. tab-set:: .. tab-item:: Linux =============================================== ================================================ Linux (x86_64) Linux (arm64/aarch64) =============================================== ================================================ `Linux Python 3.10 (x86_64)`_ `Linux Python 3.10 (aarch64)`_ `Linux Python 3.11 (x86_64)`_ `Linux Python 3.11 (aarch64)`_ `Linux Python 3.12 (x86_64)`_ `Linux Python 3.12 (aarch64)`_ `Linux Python 3.13 (x86_64)`_ (beta) `Linux Python 3.13 (aarch64)`_ (beta) =============================================== ================================================ .. tab-item:: MacOS .. list-table:: :header-rows: 1 * - MacOS (arm64) * - `MacOS Python 3.10 (arm64)`_ * - `MacOS Python 3.11 (arm64)`_ * - `MacOS Python 3.12 (arm64)`_ * - `MacOS Python 3.13 (arm64)`_ (beta) .. tab-item:: Windows (beta) .. list-table:: :header-rows: 1 * - Windows (beta) * - `Windows Python 3.10 (amd64)`_ * - `Windows Python 3.11 (amd64)`_ * - `Windows Python 3.12 (amd64)`_ .. note:: On Windows, support for multi-node Ray clusters is currently experimental and untested. If you run into issues please file a report at https://github.com/ray-project/ray/issues. .. 
note:: :ref:`Usage stats ` collection is enabled by default (can be :ref:`disabled `) for nightly wheels including both local clusters started via ``ray.init()`` and remote clusters via cli. .. If you change the list of wheel links below, remember to update `get_wheel_filename()` in `https://github.com/ray-project/ray/blob/master/python/ray/_private/utils.py`. .. _`Linux Python 3.10 (x86_64)`: https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp310-cp310-manylinux2014_x86_64.whl .. _`Linux Python 3.11 (x86_64)`: https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp311-cp311-manylinux2014_x86_64.whl .. _`Linux Python 3.12 (x86_64)`: https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp312-cp312-manylinux2014_x86_64.whl .. _`Linux Python 3.13 (x86_64)`: https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp313-cp313-manylinux2014_x86_64.whl .. _`Linux Python 3.10 (aarch64)`: https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp310-cp310-manylinux2014_aarch64.whl .. _`Linux Python 3.11 (aarch64)`: https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp311-cp311-manylinux2014_aarch64.whl .. _`Linux Python 3.12 (aarch64)`: https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp312-cp312-manylinux2014_aarch64.whl .. _`Linux Python 3.13 (aarch64)`: https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp313-cp313-manylinux2014_aarch64.whl .. _`MacOS Python 3.10 (arm64)`: https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp310-cp310-macosx_12_0_arm64.whl .. _`MacOS Python 3.11 (arm64)`: https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp311-cp311-macosx_12_0_arm64.whl .. _`MacOS Python 3.12 (arm64)`: https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp312-cp312-macosx_12_0_arm64.whl .. _`MacOS Python 3.13 (arm64)`: https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp313-cp313-macosx_12_0_arm64.whl .. _`Windows Python 3.10 (amd64)`: https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp310-cp310-win_amd64.whl .. _`Windows Python 3.11 (amd64)`: https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp311-cp311-win_amd64.whl .. _`Windows Python 3.12 (amd64)`: https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp312-cp312-win_amd64.whl Installing from a specific commit --------------------------------- You can install the Ray wheels of any particular commit on ``master`` with the following template. You need to specify the commit hash, Ray version, Operating System, and Python version: .. code-block:: bash pip install https://s3-us-west-2.amazonaws.com/ray-wheels/master/{COMMIT_HASH}/ray-{RAY_VERSION}-{PYTHON_VERSION}-{PYTHON_VERSION}-{OS_VERSION}.whl For example, here are the Ray 3.0.0.dev0 wheels for Python 3.10, MacOS for commit ``4f2ec46c3adb6ba9f412f09a9732f436c4a5d0c9``: .. code-block:: bash pip install https://s3-us-west-2.amazonaws.com/ray-wheels/master/4f2ec46c3adb6ba9f412f09a9732f436c4a5d0c9/ray-3.0.0.dev0-cp310-cp310-macosx_12_0_arm64.whl There are minor variations to the format of the wheel filename; it's best to match against the format in the URLs listed in the :ref:`Nightlies section `. Here's a summary of the variations: * For MacOS x86_64, commits predating August 7, 2021 will have ``macosx_10_13`` in the filename instead of ``macosx_10_15``. 
* For MacOS x86_64, commits predating June 1, 2025 will have ``macosx_10_15`` in the filename instead of ``macosx_12_0``. .. _apple-silicon-support: M1 Mac (Apple Silicon) Support ------------------------------ Ray supports machines running Apple Silicon (such as M1 macs). Multi-node clusters are untested. To get started with local Ray development: #. Install `miniforge `_. * ``wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-MacOSX-arm64.sh`` * ``bash Miniforge3-MacOSX-arm64.sh`` * ``rm Miniforge3-MacOSX-arm64.sh # Cleanup.`` #. Ensure you're using the miniforge environment (you should see (base) in your terminal). * ``source ~/.bash_profile`` * ``conda activate`` #. Install Ray as you normally would. * ``pip install ray`` .. _windows-support: Windows Support --------------- Windows support is in Beta. Ray supports running on Windows with the following caveats (only the first is Ray-specific, the rest are true anywhere Windows is used): * Multi-node Ray clusters are untested. * Filenames are tricky on Windows and there still may be a few places where Ray assumes UNIX filenames rather than Windows ones. This can be true in downstream packages as well. * Performance on Windows is known to be slower since opening files on Windows is considerably slower than on other operating systems. This can affect logging. * Windows does not have a copy-on-write forking model, so spinning up new processes can require more memory. Submit any issues you encounter to `GitHub `_. Installing Ray on Arch Linux ---------------------------- Note: Installing Ray on Arch Linux is not tested by the Project Ray developers. Ray is available on Arch Linux via the Arch User Repository (`AUR`_) as ``python-ray``. You can manually install the package by following the instructions on the `Arch Wiki`_ or use an `AUR helper`_ like `yay`_ (recommended for ease of install) as follows: .. code-block:: bash yay -S python-ray To discuss any issues related to this package refer to the comments section on the AUR page of ``python-ray`` `here`_. .. _`AUR`: https://wiki.archlinux.org/index.php/Arch_User_Repository .. _`Arch Wiki`: https://wiki.archlinux.org/index.php/Arch_User_Repository#Installing_packages .. _`AUR helper`: https://wiki.archlinux.org/index.php/Arch_User_Repository#Installing_packages .. _`yay`: https://aur.archlinux.org/packages/yay .. _`here`: https://aur.archlinux.org/packages/python-ray .. _ray_anaconda: Installing From conda-forge --------------------------- Ray can also be installed as a conda package on Linux and Windows. .. code-block:: bash # also works with mamba conda create -c conda-forge python=3.10 -n ray conda activate ray # Install Ray with support for the dashboard + cluster launcher conda install -c conda-forge "ray-default" # Install Ray with minimal dependencies # conda install -c conda-forge ray To install Ray libraries, use ``pip`` as above or ``conda``/``mamba``. .. code-block:: bash conda install -c conda-forge "ray-data" # installs Ray + dependencies for Ray Data conda install -c conda-forge "ray-train" # installs Ray + dependencies for Ray Train conda install -c conda-forge "ray-tune" # installs Ray + dependencies for Ray Tune conda install -c conda-forge "ray-serve" # installs Ray + dependencies for Ray Serve conda install -c conda-forge "ray-rllib" # installs Ray + dependencies for Ray RLlib For a complete list of available ``ray`` libraries on Conda-forge, have a look at https://anaconda.org/conda-forge/ray-default .. 
note:: Ray conda packages are maintained by the community, not the Ray team. While using a conda environment, it is recommended to install Ray from PyPi using `pip install ray` in the newly created environment. Building Ray from Source ------------------------ Installing from ``pip`` should be sufficient for most Ray users. However, should you need to build from source, follow :ref:`these instructions for building ` Ray. .. _docker-images: Docker Source Images -------------------- Users can pull a Docker image from the ``rayproject/ray`` `Docker Hub repository `__. The images include Ray and all required dependencies. It comes with anaconda and various versions of Python. Images are `tagged` with the format ``{Ray version}[-{Python version}][-{Platform}]``. ``Ray version`` tag can be one of the following: .. list-table:: :widths: 25 50 :header-rows: 1 * - Ray version tag - Description * - latest - The most recent Ray release. * - x.y.z - A specific Ray release, e.g. 2.31.0 * - nightly - The most recent Ray development build (a recent commit from Github ``master``) The optional ``Python version`` tag specifies the Python version in the image. All Python versions supported by Ray are available, e.g. ``py310``, ``py311`` and ``py312``. If unspecified, the tag points to an image of the lowest Python version that the Ray version supports. The optional ``Platform`` tag specifies the platform where the image is intended for: .. list-table:: :widths: 16 40 :header-rows: 1 * - Platform tag - Description * - -cpu - These are based off of an Ubuntu image. * - -cuXX - These are based off of an NVIDIA CUDA image with the specified CUDA version. They require the NVIDIA Docker Runtime. * - -gpu - Aliases to a specific ``-cuXX`` tagged image. * - - Aliases to ``-cpu`` tagged images. Example: for the nightly image based on ``Python 3.10`` and without GPU support, the tag is ``nightly-py310-cpu``. If you want to tweak some aspects of these images and build them locally, refer to the following script: .. code-block:: bash cd ray ./build-docker.sh Review images by listing them: .. code-block:: bash docker images Output should look something like the following: .. code-block:: bash REPOSITORY TAG IMAGE ID CREATED SIZE rayproject/ray dev 7243a11ac068 2 days ago 1.11 GB rayproject/base-deps latest 5606591eeab9 8 days ago 512 MB ubuntu 22.04 1e4467b07108 3 weeks ago 73.9 MB Launch Ray in Docker ~~~~~~~~~~~~~~~~~~~~ Start out by launching the deployment container. .. code-block:: bash docker run --shm-size= -t -i rayproject/ray Replace ```` with a limit appropriate for your system, for example ``512M`` or ``2G``. A good estimate for this is to use roughly 30% of your available memory (this is what Ray uses internally for its Object Store). The ``-t`` and ``-i`` options here are required to support interactive use of the container. If you use a GPU version Docker image, remember to add ``--gpus all`` option. Replace ```` with your target ray version in the following command: .. code-block:: bash docker run --shm-size= -t -i --gpus all rayproject/ray:-gpu **Note:** Ray requires a **large** amount of shared memory because each object store keeps all of its objects in shared memory, so the amount of shared memory will limit the size of the object store. You should now see a prompt that looks something like: .. code-block:: bash root@ebc78f68d100:/ray# Test if the installation succeeded ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ To test if the installation was successful, try running some tests. 
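If you only want a quick sanity check that the installed wheel works, a minimal sketch like the following is usually enough; it assumes no other Ray instance is already running in the container:

.. code-block:: python

    import ray

    ray.init()  # Start a local, single-node Ray instance.

    @ray.remote
    def ping():
        return "pong"

    # A round trip through a remote task confirms that workers can start.
    assert ray.get(ping.remote()) == "pong"
    ray.shutdown()
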
Running the test suite assumes that you've cloned the git repository. .. code-block:: bash python -m pytest -v python/ray/tests/test_mini.py Installed Python dependencies ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Our Docker images ship with pre-installed Python dependencies required for Ray and its libraries. We publish the dependencies that are installed in our ``ray`` Docker images for Python 3.10. .. tab-set:: .. tab-item:: ray (Python 3.10) :sync: ray (Python 3.10) Ray version: nightly (`0ddb7ee `_) .. literalinclude:: ./pip_freeze_ray-py310-cpu.txt .. _ray-install-java: Install Ray Java with Maven --------------------------- .. note:: All Ray Java APIs are experimental and only supported by the community. Before installing Ray Java with Maven, you should install Ray Python with `pip install -U ray`. Note that the versions of Ray Java and Ray Python must match. Nightly Ray Python wheels are also required if you want to install the Ray Java snapshot version. Find the latest Ray Java release in the `central repository `__. To use the latest Ray Java release in your application, add the following entries in your ``pom.xml``: .. code-block:: xml <dependency> <groupId>io.ray</groupId> <artifactId>ray-api</artifactId> <version>${ray.version}</version> </dependency> <dependency> <groupId>io.ray</groupId> <artifactId>ray-runtime</artifactId> <version>${ray.version}</version> </dependency> The latest Ray Java snapshot can be found in the `sonatype repository `__. To use the latest Ray Java snapshot in your application, add the following entries in your ``pom.xml``: .. code-block:: xml <repositories> <repository> <id>sonatype</id> <url>https://oss.sonatype.org/content/repositories/snapshots/</url> <releases> <enabled>false</enabled> </releases> <snapshots> <enabled>true</enabled> </snapshots> </repository> </repositories> <dependency> <groupId>io.ray</groupId> <artifactId>ray-api</artifactId> <version>${ray.version}</version> </dependency> <dependency> <groupId>io.ray</groupId> <artifactId>ray-runtime</artifactId> <version>${ray.version}</version> </dependency> .. note:: When you run ``pip install`` to install Ray, Java jars are installed as well. The above dependencies are only used to build your Java code and to run your code in local mode. If you want to run your Java code in a multi-node Ray cluster, it's better to exclude Ray jars when packaging your code to avoid jar conflicts if the versions (the Ray installed with ``pip install`` and the Maven dependencies) don't match. .. _ray-install-cpp: Install Ray C++ --------------- .. note:: All Ray C++ APIs are experimental and only supported by the community. You can install and use the Ray C++ API as follows. .. code-block:: bash pip install -U "ray[cpp]" # Create a Ray C++ project template to start with. ray cpp --generate-bazel-project-template-to ray-template .. note:: If you build Ray from source, remove the build option ``build --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0"`` from the file ``cpp/example/.bazelrc`` before running your application. The related issue is `this `_. --- .. _ray-oss-list: The Ray Ecosystem ================= This page lists, in alphabetical order, libraries that have integrations with Ray for distributed execution. It's easy to add your own integration to this list. Simply open a pull request with a few lines of text; see the dropdown below for more information. .. dropdown:: Adding Your Integration To add an integration, add an entry to this file, using the same ``grid-item-card`` directive that the other examples use. .. grid:: 1 2 2 3 :gutter: 1 :class-container: container pb-3 .. grid-item-card:: .. figure:: ../images/rayai_logo.png :class: card-figure .. div:: .. image:: https://img.shields.io/github/stars/rayai-labs/agentic-ray?style=social :target: https://github.com/rayai-labs/agentic-ray Agentic-Ray enables agents built with any framework to use Ray as their runtime, distribute tool calls across a cluster, and provision sandbox environments for executing AI-generated code. +++ ..
button-link:: https://rayai.com :color: primary :outline: :expand: Agentic-Ray Integration .. grid-item-card:: .. figure:: ../images/airflow_logo_full.png :class: card-figure .. div:: .. image:: https://img.shields.io/github/stars/astronomer/astro-provider-ray?style=social :target: https://github.com/astronomer/astro-provider-ray Apache Airflow® is an open-source platform that enables users to programmatically author, schedule, and monitor workflows using directed acyclic graphs (DAGs). With the Ray provider, users can seamlessly orchestrate Ray jobs within Airflow DAGs. +++ .. button-link:: https://astronomer.github.io/astro-provider-ray/ :color: primary :outline: :expand: Apache Airflow Integration .. grid-item-card:: .. figure:: ../images/buildflow.png :class: card-figure .. div:: .. image:: https://img.shields.io/github/stars/launchflow/buildflow?style=social :target: https://github.com/launchflow/buildflow BuildFlow is a backend framework that allows you to build and manage complex cloud infrastructure using pure python. With BuildFlow's decorator pattern you can turn any function into a component of your backend system. +++ .. button-link:: https://docs.launchflow.com/buildflow/introduction :color: primary :outline: :expand: BuildFlow Integration .. grid-item-card:: .. figure:: ../images/classyvision.png :class: card-figure .. div:: .. image:: https://img.shields.io/github/stars/facebookresearch/ClassyVision?style=social :target: https://github.com/facebookresearch/ClassyVision Classy Vision is a new end-to-end, PyTorch-based framework for large-scale training of state-of-the-art image and video classification models. The library features a modular, flexible design that allows anyone to train machine learning models on top of PyTorch using very simple abstractions. +++ .. button-link:: https://github.com/facebookresearch/ClassyVision/blob/main/tutorials/ray_aws.ipynb :color: primary :outline: :expand: Classy Vision Integration .. grid-item-card:: .. figure:: ../images/daft.png :class: card-figure .. div:: .. image:: https://img.shields.io/github/stars/Eventual-Inc/Daft?style=social :target: https://github.com/Eventual-Inc/Daft Daft is a high-performance multimodal data engine that provides simple and reliable data processing for any modality - from structured tables to images, audio, video, and embeddings. Built with Python and Rust for modern AI workflows, Daft offers seamless scaling from local to `distributed clusters `_, enabling efficient batch inference, document processing, and multimodal ETL pipelines at scale. +++ .. button-link:: https://docs.daft.ai/en/stable/distributed/ray/ :color: primary :outline: :expand: Daft Integration .. grid-item-card:: .. figure:: ../images/dask.png :class: card-figure .. div:: .. image:: https://img.shields.io/github/stars/dask/dask?style=social :target: https://github.com/dask/dask Dask provides advanced parallelism for analytics, enabling performance at scale for the tools you love. Dask uses existing Python APIs and data structures to make it easy to switch between Numpy, Pandas, Scikit-learn to their Dask-powered equivalents. +++ .. button-ref:: dask-on-ray :color: primary :outline: :expand: Dask Integration .. grid-item-card:: .. figure:: ../images/data_juicer.png :class: card-figure .. div:: .. 
image:: https://img.shields.io/github/stars/modelscope/data-juicer?style=social :target: https://github.com/modelscope/data-juicer Data-Juicer is a one-stop multimodal data processing system to make data higher-quality, juicier, and more digestible for foundation models. It integrates with Ray for distributed data processing on large-scale datasets with over 100 multimodal operators and supports TB-size dataset deduplication. +++ .. button-link:: https://github.com/modelscope/data-juicer?tab=readme-ov-file#distributed-data-processing :color: primary :outline: :expand: Data-Juicer Integration .. grid-item-card:: .. figure:: ../images/flambe.png :class: card-figure .. div:: .. image:: https://img.shields.io/github/stars/asappresearch/flambe?style=social :target: https://github.com/asappresearch/flambe Flambé is a machine learning experimentation framework built to accelerate the entire research life cycle. Flambé’s main objective is to provide a unified interface for prototyping models, running experiments containing complex pipelines, monitoring those experiments in real-time, reporting results, and deploying a final model for inference. +++ .. button-link:: https://github.com/asappresearch/flambe :color: primary :outline: :expand: Flambé Integration .. grid-item-card:: .. figure:: ../images/flowdapt.png :class: card-figure .. div:: .. image:: https://img.shields.io/github/stars/emergentmethods/flowdapt?style=social :target: https://github.com/emergentmethods/flowdapt Flowdapt is a platform designed to help developers configure, debug, schedule, trigger, deploy and serve adaptive and reactive Artificial Intelligence workflows at large-scale. +++ .. button-link:: https://github.com/emergentmethods/flowdapt :color: primary :outline: :expand: Flowdapt Integration .. grid-item-card:: .. figure:: ../images/flyte.png :class: card-figure .. div:: .. image:: https://img.shields.io/github/stars/flyteorg/flyte?style=social :target: https://github.com/flyteorg/flyte Flyte is a Kubernetes-native workflow automation platform for complex, mission-critical data and ML processes at scale. It has been battle-tested at Lyft, Spotify, Freenome, and others and is truly open-source. +++ .. button-link:: https://flyte.org/ :color: primary :outline: :expand: Flyte Integration .. grid-item-card:: .. figure:: ../images/horovod.png :class: card-figure .. div:: .. image:: https://img.shields.io/github/stars/horovod/horovod?style=social :target: https://github.com/horovod/horovod Horovod is a distributed deep learning training framework for TensorFlow, Keras, PyTorch, and Apache MXNet. The goal of Horovod is to make distributed deep learning fast and easy to use. +++ .. button-link:: https://horovod.readthedocs.io/en/stable/ray_include.html :color: primary :outline: :expand: Horovod Integration .. grid-item-card:: .. figure:: ../images/hugging.png :class: card-figure .. div:: .. image:: https://img.shields.io/github/stars/huggingface/transformers?style=social :target: https://github.com/huggingface/transformers State-of-the-art Natural Language Processing for Pytorch and TensorFlow 2.0. It integrates with Ray for distributed hyperparameter tuning of transformer models. +++ .. button-link:: https://huggingface.co/transformers/master/main_classes/trainer.html#transformers.Trainer.hyperparameter_search :color: primary :outline: :expand: Hugging Face Transformers Integration .. grid-item-card:: .. figure:: ../images/zoo.png :class: card-figure .. div:: .. 
image:: https://img.shields.io/github/stars/intel-analytics/analytics-zoo?style=social :target: https://github.com/intel-analytics/analytics-zoo Analytics Zoo seamlessly scales TensorFlow, Keras and PyTorch to distributed big data (using Spark, Flink & Ray). +++ .. button-link:: https://analytics-zoo.github.io/master/#ProgrammingGuide/rayonspark/ :color: primary :outline: :expand: Intel Analytics Zoo Integration .. grid-item-card:: .. figure:: ../images/nlu.png :class: card-figure .. div:: .. image:: https://img.shields.io/github/stars/JohnSnowLabs/nlu?style=social :target: https://github.com/JohnSnowLabs/nlu The power of 350+ pre-trained NLP models, 100+ Word Embeddings, 50+ Sentence Embeddings, and 50+ Classifiers in 46 languages with 1 line of Python code. +++ .. button-link:: https://nlu.johnsnowlabs.com/docs/en/predict_api#modin-dataframe :color: primary :outline: :expand: NLU Integration .. grid-item-card:: .. figure:: ../images/ludwig.png :class: card-figure .. div:: .. image:: https://img.shields.io/github/stars/ludwig-ai/ludwig?style=social :target: https://github.com/ludwig-ai/ludwig Ludwig is a toolbox that allows users to train and test deep learning models without the need to write code. With Ludwig, you can train a deep learning model on Ray in zero lines of code, automatically leveraging Dask on Ray for data preprocessing, Horovod on Ray for distributed training, and Ray Tune for hyperparameter optimization. +++ .. button-link:: https://medium.com/ludwig-ai/ludwig-ai-v0-4-introducing-declarative-mlops-with-ray-dask-tabnet-and-mlflow-integrations-6509c3875c2e :color: primary :outline: :expand: Ludwig Integration .. grid-item-card:: .. figure:: ../images/mars.png :class: card-figure .. div:: .. image:: https://img.shields.io/github/stars/mars-project/mars?style=social :target: https://github.com/mars-project/mars Mars is a tensor-based unified framework for large-scale data computation which scales Numpy, Pandas and Scikit-learn. Mars can scale in to a single machine, and scale out to a cluster with thousands of machines. +++ .. button-ref:: mars-on-ray :color: primary :outline: :expand: MARS Integration .. grid-item-card:: .. figure:: ../images/modin.png :class: card-figure .. div:: .. image:: https://img.shields.io/github/stars/modin-project/modin?style=social :target: https://github.com/modin-project/modin Scale your pandas workflows by changing one line of code. Modin transparently distributes the data and computation so that all you need to do is continue using the pandas API as you were before installing Modin. +++ .. button-link:: https://github.com/modin-project/modin :color: primary :outline: :expand: Modin Integration .. grid-item-card:: .. figure:: ../images/prefect.png :class: card-figure .. div:: .. image:: https://img.shields.io/github/stars/PrefectHQ/prefect-ray?style=social :target: https://github.com/PrefectHQ/prefect-ray Prefect is an open source workflow orchestration platform in Python. It allows you to easily define, track and schedule workflows in Python. This integration makes it easy to run a Prefect workflow on a Ray cluster in a distributed way. +++ .. button-link:: https://github.com/PrefectHQ/prefect-ray :color: primary :outline: :expand: Prefect Integration .. grid-item-card:: .. figure:: ../images/pycaret.png :class: card-figure .. div:: .. 
image:: https://img.shields.io/github/stars/pycaret/pycaret?style=social :target: https://github.com/pycaret/pycaret PyCaret is an open source low-code machine learning library in Python that aims to reduce the hypothesis to insights cycle time in an ML experiment. It enables data scientists to perform end-to-end experiments quickly and efficiently. +++ .. button-link:: https://github.com/pycaret/pycaret :color: primary :outline: :expand: PyCaret Integration .. grid-item-card:: .. figure:: ../images/intel.png :class: card-figure .. div:: .. image:: https://img.shields.io/github/stars/Intel-bigdata/oap-raydp?style=social :target: https://github.com/Intel-bigdata/oap-raydp RayDP ("Spark on Ray") enables you to easily use Spark inside a Ray program. You can use Spark to read the input data, process the data using SQL, Spark DataFrame, or Pandas (via Koalas) API, extract and transform features using Spark MLlib, and use the RayDP Estimator API for distributed training on the preprocessed dataset. +++ .. button-link:: https://github.com/Intel-bigdata/oap-raydp :color: primary :outline: :expand: RayDP Integration .. grid-item-card:: .. figure:: ../images/raylight.png :class: card-figure .. div:: .. image:: https://img.shields.io/github/stars/komikndr/raylight?style=social :target: https://github.com/komikndr/raylight Raylight is an extension for ComfyUI that enables true multi-GPU capability using XDiT XFuser and FSDP managed by Ray. It is designed to scale diffusion models efficiently across multiple GPUs. Raylight provides sequence parallelism and optimized VRAM utilization, making it ideal for large video and image generation models. +++ .. button-link:: https://github.com/komikndr/raylight :color: primary :outline: :expand: Raylight Integration .. grid-item-card:: .. figure:: ../images/scikit.png :class: card-figure .. div:: .. image:: https://img.shields.io/github/stars/scikit-learn/scikit-learn?style=social :target: https://github.com/scikit-learn/scikit-learn Scikit-learn is a free software machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy. +++ .. button-link:: https://docs.ray.io/en/master/joblib.html :color: primary :outline: :expand: Scikit Learn Integration .. grid-item-card:: .. figure:: ../images/seldon.png :class: card-figure .. div:: .. image:: https://img.shields.io/github/stars/SeldonIO/alibi?style=social :target: https://github.com/SeldonIO/alibi Alibi is an open source Python library aimed at machine learning model inspection and interpretation. The focus of the library is to provide high-quality implementations of black-box, white-box, local and global explanation methods for classification and regression models. +++ .. button-link:: https://github.com/SeldonIO/alibi :color: primary :outline: :expand: Seldon Alibi Integration .. grid-item-card:: .. figure:: ../images/sematic.png :class: card-figure .. div:: .. image:: https://img.shields.io/github/stars/sematic-ai/sematic?style=social :target: https://github.com/sematic-ai/sematic Sematic is an open-source ML pipelining tool written in Python. It enables users to write end-to-end pipelines that can seamlessly transition between your laptop and the cloud, with rich visualizations, traceability, reproducibility, and usability as first-class citizens.
This integration enables dynamic allocation of Ray clusters within Sematic pipelines. +++ .. button-link:: https://docs.sematic.dev/integrations/ray :color: primary :outline: :expand: Sematic Integration .. grid-item-card:: .. figure:: ../images/spacy.png :class: card-figure .. div:: .. image:: https://img.shields.io/github/stars/explosion/spacy-ray?style=social :target: https://github.com/explosion/spacy-ray spaCy is a library for advanced Natural Language Processing in Python and Cython. It's built on the very latest research, and was designed from day one to be used in real products. +++ .. button-link:: https://pypi.org/project/spacy-ray/ :color: primary :outline: :expand: spaCy Integration .. grid-item-card:: .. figure:: ../images/xgboost_logo.png :class: card-figure .. div:: .. image:: https://img.shields.io/github/stars/ray-project/xgboost_ray?style=social :target: https://github.com/ray-project/xgboost_ray XGBoost is a popular gradient boosting library for classification and regression. It is one of the most popular tools in data science and the workhorse of many top-performing Kaggle kernels. +++ .. button-link:: https://github.com/ray-project/xgboost_ray :color: primary :outline: :expand: XGBoost Integration .. grid-item-card:: .. figure:: ../images/lightgbm_logo.png :class: card-figure .. div:: .. image:: https://img.shields.io/github/stars/ray-project/lightgbm_ray?style=social :target: https://github.com/ray-project/lightgbm_ray LightGBM is a high-performance gradient boosting library for classification and regression. It is designed to be distributed and efficient. +++ .. button-link:: https://github.com/ray-project/lightgbm_ray :color: primary :outline: :expand: LightGBM Integration .. grid-item-card:: .. figure:: ./images/volcano.png :class: card-figure .. div:: .. image:: https://img.shields.io/github/stars/volcano-sh/volcano?style=social :target: https://github.com/volcano-sh/volcano/ Volcano is a system for running high-performance workloads on Kubernetes. It features powerful batch scheduling capabilities required by ML and other data-intensive workloads. +++ .. button-link:: https://github.com/volcano-sh/volcano/releases/tag/v1.7.0 :color: primary :outline: :expand: Volcano Integration --- .. _ref-use-cases: Ray Use Cases ============= .. toctree:: :hidden: ../ray-air/getting-started This page indexes common Ray use cases for scaling ML. It contains highlighted references to blogs, examples, and tutorials also located elsewhere in the Ray documentation. .. _ref-use-cases-llm: LLMs and Gen AI --------------- Large language models (LLMs) and generative AI are rapidly changing industries, and demand compute at an astonishing pace. Ray provides a distributed compute framework for scaling these models, allowing developers to train and deploy models faster and more efficiently. With specialized libraries for data streaming, training, fine-tuning, hyperparameter tuning, and serving, Ray simplifies the process of developing and deploying large-scale AI models. .. figure:: /images/llm-stack.png .. query-param-ref:: ray-overview/examples :parameters: ?tags=llm :ref-type: doc :classes: example-gallery-link .. raw:: html Explore LLMs and Gen AI examples .. _ref-use-cases-batch-infer: Batch Inference --------------- Batch inference is the process of generating model predictions on a large "batch" of input data. Ray for batch inference works with any cloud provider and ML framework, and is fast and cheap for modern deep learning applications.
It scales from single machines to large clusters with minimal code changes. As a Python-first framework, you can easily express and interactively develop your inference workloads in Ray. To learn more about running batch inference with Ray, see the :ref:`batch inference guide`. .. figure:: ../data/images/batch_inference.png .. query-param-ref:: ray-overview/examples :parameters: ?tags=inference :ref-type: doc :classes: example-gallery-link .. raw:: html Explore batch inference examples .. _ref-use-cases-model-serving: Model Serving ------------- :ref:`Ray Serve ` is well suited for model composition, enabling you to build a complex inference service consisting of multiple ML models and business logic all in Python code. It supports complex `model deployment patterns `_ requiring the orchestration of multiple Ray actors, where different actors provide inference for different models. Serve handles both batch and online inference and can scale to thousands of models in production. .. figure:: /images/multi_model_serve.png Deployment patterns with Ray Serve. (Click image to enlarge.) Learn more about model serving with the following resources. - `[Talk] Productionizing ML at Scale with Ray Serve `_ - `[Blog] Simplify your MLOps with Ray & Ray Serve `_ - :doc:`[Guide] Getting Started with Ray Serve ` - :doc:`[Guide] Model Composition in Serve ` - :doc:`[Gallery] Serve Examples Gallery ` - `[Gallery] More Serve Use Cases on the Blog `_ .. _ref-use-cases-hyperparameter-tuning: Hyperparameter Tuning --------------------- The :ref:`Ray Tune ` library enables any parallel Ray workload to be run under a hyperparameter tuning algorithm. Running multiple hyperparameter tuning experiments is a pattern apt for distributed computing because each experiment is independent of one another. Ray Tune handles the hard bit of distributing hyperparameter optimization and makes available key features such as checkpointing the best result, optimizing scheduling, and specifying search patterns. .. figure:: /images/tuning_use_case.png Distributed tuning with distributed training per trial. Learn more about the Tune library with the following talks and user guides. - :doc:`[Guide] Getting Started with Ray Tune ` - `[Blog] How to distribute hyperparameter tuning with Ray Tune `_ - `[Talk] Simple Distributed Hyperparameter Optimization `_ - `[Blog] Hyperparameter Search with 🤗 Transformers `_ - :doc:`[Gallery] Ray Tune Examples Gallery ` - `More Tune use cases on the Blog `_ .. _ref-use-cases-distributed-training: Distributed Training -------------------- The :ref:`Ray Train ` library integrates many distributed training frameworks under a simple Trainer API, providing distributed orchestration and management capabilities out of the box. In contrast to training many models, model parallelism partitions a large model across many machines for training. Ray Train has built-in abstractions for distributing shards of models and running training in parallel. .. figure:: /images/model_parallelism.png Model parallelism pattern for distributed large model training. Learn more about the Train library with the following talks and user guides. - `[Talk] Ray Train, PyTorch, TorchX, and distributed deep learning `_ - `[Blog] Elastic Distributed Training with XGBoost on Ray `_ - :doc:`[Guide] Getting Started with Ray Train ` - :doc:`[Example] Fine-tune a 🤗 Transformers model ` - :doc:`[Gallery] Ray Train Examples Gallery ` - `[Gallery] More Train Use Cases on the Blog `_ .. 
_ref-use-cases-reinforcement-learning: Reinforcement Learning ---------------------- RLlib is an open-source library for reinforcement learning (RL), offering support for production-level, highly distributed RL workloads while maintaining unified and simple APIs for a large variety of industry applications. RLlib is used by industry leaders in many different verticals, such as climate control, industrial control, manufacturing and logistics, finance, gaming, automobile, robotics, boat design, and many others. .. figure:: /images/rllib_use_case.png Decentralized distributed proximal policy optimization (DD-PPO) architecture. Learn more about reinforcement learning with the following resources. - `[Course] Applied Reinforcement Learning with RLlib `_ - `[Blog] Intro to RLlib: Example Environments `_ - :doc:`[Guide] Getting Started with RLlib ` - `[Talk] Deep reinforcement learning at Riot Games `_ - :doc:`[Gallery] RLlib Examples Gallery ` - `[Gallery] More RL Use Cases on the Blog `_ .. _ref-use-cases-ml-platform: ML Platform ----------- Ray and its AI libraries provide a unified compute runtime for teams looking to simplify their ML platform. Ray's libraries such as Ray Train, Ray Data, and Ray Serve can be used to compose end-to-end ML workflows, providing features and APIs for data preprocessing as part of training, and transitioning from training to serving. Read more about building ML platforms with Ray in :ref:`this section `. .. https://docs.google.com/drawings/d/1PFA0uJTq7SDKxzd7RHzjb5Sz3o1WvP13abEJbD0HXTE/edit .. image:: /images/ray-air.svg End-to-End ML Workflows ----------------------- The following highlights examples utilizing Ray AI libraries to implement end-to-end ML workflows. - :doc:`[Example] Text classification with Ray ` - :doc:`[Example] Object detection with Ray ` - :doc:`[Example] Machine learning on tabular data ` Large Scale Workload Orchestration ---------------------------------- The following highlights feature projects leveraging Ray Core's distributed APIs to simplify the orchestration of large-scale workloads. - `[Blog] Highly Available and Scalable Online Applications on Ray at Ant Group `_ - `[Blog] Ray Forward 2022 Conference: Hyper-scale Ray Application Use Cases `_ - `[Blog] A new world record on the CloudSort benchmark using Ray `_ - :doc:`[Example] Speed up your web crawler by parallelizing it with Ray ` --- :orphan: FAQ ============== .. toctree:: :maxdepth: 1 :caption: Frequently Asked Questions ./../tune/faq.rst Further Questions or Issues? ----------------------------- .. include:: /_includes/_help.rst --- .. _ray_glossary: Ray Glossary ============ On this page you'll find a list of important terminology used throughout the Ray documentation, sorted alphabetically. .. glossary:: Action space Property of an RL environment. The shape(s) and datatype(s) that actions within an RL environment are allowed to have. Examples: An RL environment in which an agent can move up, down, left, or right might have an action space of ``Discrete(4)`` (integer values of 0, 1, 2, or 3). In an RL environment in which an agent can apply a torque between -1.0 and 1.0 to a joint, the action space might be ``Box(-1.0, 1.0, (1,), float32)`` (single float values between -1.0 and 1.0). Actor A Ray actor is a remote instance of a class, which is essentially a stateful service. :ref:`Learn more about Ray actors`. Actor task An invocation of a Ray actor method. Sometimes we just call it a task. Ray Agent Daemon process running on each Ray node.
It has several functionalities like collecting metrics on the local node and installing runtime environments. Agent An acting entity inside an RL environment. One RL environment might contain one (single-agent RL) or more (multi-agent RL) acting agents. Different agents within the same environment might have different observation- and action-spaces, different reward functions, and act at different time-steps. Algorithm A class that holds the who/when/where/how for training one or more RL agent(s). The user interacts with an Algorithm instance directly to train their agents (it is the top-most user facing API of RLlib). Asynchronous execution An execution model where a later task can begin executing in parallel, without waiting for an earlier task to finish. Ray tasks and actor tasks are all executed asynchronously. Asynchronous sampling Sampling is the process of rolling out (playing) episodes within an RL environment and thereby collecting the training data (observations, actions and rewards). In an asynchronous sampling setup, Ray actors run sampling in the background and send collected samples back to a main driver script. The driver checks for such “ready” data frequently and then triggers central model learning updates. Hence, sampling and learning happen at the same time. Note that because of this, the policy/ies used for creating the samples (action computations) might be slightly behind the centrally learned policy model(s), even in an on-policy Algorithm. Autoscaler A Ray component that scales up and down the Ray cluster by adding and removing Ray nodes according to the resources requested by applications running on the cluster. Autoscaling The process of scaling up and down the Ray cluster automatically. Backend A class containing the initialization and teardown logic for a specific deep learning framework (e.g., Torch, TensorFlow), used to set up distributed data-parallel training for :ref:`Ray Train’s built-in trainers`. Batch format The way Ray Data represents batches of data. The batch format is independent from how Ray Data stores the underlying blocks, so you can use any batch format regardless of the internal block representation. Set ``batch_format`` in methods like :meth:`Dataset.iter_batches() ` and :meth:`Dataset.map_batches() ` to specify the batch type. .. doctest:: >>> import ray >>> dataset = ray.data.range(15) >>> next(iter(dataset.iter_batches(batch_format="numpy", batch_size=5))) {'id': array([0, 1, 2, 3, 4])} >>> next(iter(dataset.iter_batches(batch_format="pandas", batch_size=5))) id 0 0 1 1 2 2 3 3 4 4 >>> next(iter(dataset.iter_batches(batch_format="pyarrow", batch_size=5))) pyarrow.Table id: int64 ---- id: [[0],[1],...,[3],[4]] To learn more about batch formats, read :ref:`Configuring batch formats `. Batch size A batch size in the context of model training is the number of data points used to compute and apply one gradient update to the model weights. Block A processing unit of data. A :class:`~ray.data.Dataset` consists of a collection of blocks. Under the hood, Ray Data partitions rows into a set of distributed data blocks. This allows it to perform operations in parallel. Unlike a batch, which is a user-facing object, a block is an internal abstraction. Placement Group Bundle A collection of resources that must be reserved on a single Ray node. :ref:`Learn more`. Checkpoint A Ray Train Checkpoint is a common interface for accessing data and models across different Ray components and libraries. 
A Checkpoint can have its data represented as a directory on local (on-disk) storage, as a directory on an external storage (e.g., cloud storage), and as an in-memory dictionary. :class:`Learn more `. .. TODO: How does this relate to RLlib checkpoints etc.? Be clear here Ray Client The Ray Client is an API that connects a Python script to a remote Ray cluster. Effectively, it allows you to leverage a remote Ray cluster just like you would with Ray running on your local machine. :ref:`Learn more`. Ray Cluster A Ray cluster is a set of worker nodes connected to a common Ray head node. Ray clusters can be fixed-size, or they can autoscale up and down according to the resources requested by applications running on the cluster. .. TODO: Add "Concurrency" here, or try to avoid this in docs. Connector A connector performs transformations on data that comes out of a dataset or an RL environment and is about to be passed to a model. Connectors are flexible components and can be swapped out such that models are easily reusable and do not have to be retrained for different data transformations. Tune Config This is the set of hyperparameters corresponding to a Tune trial. Sampling from a hyperparameter search space will produce a config. .. TODO: DAG Ray Dashboard Ray’s built-in dashboard is a web interface that provides metrics, charts, and other features that help Ray users to understand and debug Ray applications. .. TODO: Data Shuffling Dataset (object) A class that produces a sequence of distributed data blocks. :class:`~ray.data.Dataset` exposes methods to read, transform, and consume data at scale. To learn more about Datasets and the operations they support, read the :ref:`Datasets API Reference `. Deployment A deployment is a group of actors that can handle traffic in Ray Serve. Deployments are defined as a single class with a number of options, including the number of “replicas” of the deployment, each of which will map to a Ray actor at runtime. Requests to a deployment are load balanced across its replicas. Ingress Deployment In Ray Serve, the “ingress” deployment is the one that receives and responds to inbound user traffic. It handles HTTP parsing and response formatting. In the case of model composition, it would also fan out requests to other deployments to do things like preprocessing and a forward pass of an ML model. Driver "Driver" is the name of the process running the main script that starts all other processes. For Python, this is usually the script you start with ``python ...``. Tune Driver The Tune driver is the main event loop that’s happening on the node that launched the Tune experiment. This event loop schedules trials given the cluster resources, executes training on remote Trainable actors, and processes results and checkpoints from those actors. Distributed Data-Parallel A distributed data-parallel (DDP) training job scales machine learning training to happen on multiple nodes, where each node processes one shard of the full dataset. Every worker holds a copy of the model weights, and a common strategy for updating weights is a “mirrored strategy”, where each worker will hold the exact same weights at all times, and computed gradients are averaged then applied across all workers. With N worker nodes and a dataset of size D, each worker is responsible for only ``D / N`` datapoints. If each worker node computes the gradient on a batch of size ``B``, then the effective batch size of the DDP training is ``N * B``. .. 
TODO: Entrypoint Environment The world or simulation in which one or more reinforcement learning agents have to learn to behave optimally with respect to a given reward function. An environment consists of an observation space, a reward function, an action space, a state transition function, and a distribution over initial states (after a reset). Episodes consisting of one or more time-steps are played through an environment in order to generate and collect samples for learning. These samples contain one 4-tuple of ``[observation, action, reward, next observation]`` per timestep. Episode A series of subsequent RL environment timesteps, each of which is a 4-tuple: ``[observation, action, reward, next observation]``. Episodes can end with the terminated or truncated flags being True. An episode generally spans multiple time-steps for one or more agents. The Episode is an important concept in RL as "optimal agent behavior" is defined as choosing actions that maximize the sum of individual rewards over the course of an episode. Trial Executor An internal :ref:`Ray Tune component` that handles the resource management and execution of each trial’s corresponding remote Trainable actor. The trial executor’s responsibilities include launching training, checkpointing, and restoring remote tasks. Experiment A Ray Tune or Ray Train experiment is a collection of one or more training jobs that may correspond to different hyperparameter configurations. These experiments are launched via the :ref:`Tuner API` and the :ref:`Trainer API`. .. TODO: Event Fault tolerance Fault tolerance in Ray Train and Tune consists of experiment-level and trial-level restoration. Experiment-level restoration refers to resuming all trials, in the event that an experiment is interrupted in the middle of training due to a cluster-level failure. Trial-level restoration refers to resuming individual trials, in the event that a trial encounters a runtime error such as OOM. .. TODO: more on fault tolerance in Core Framework The deep-learning framework used for the model(s), loss(es), and optimizer(s) inside an RLlib Algorithm. RLlib currently supports PyTorch and TensorFlow. GCS / Global Control Service Centralized metadata server for a Ray cluster. It runs on the Ray head node and has functions like managing node membership and the actor directory. It’s also known as the Global Control Store. Head node A node that runs extra cluster-level processes like GCS and API server in addition to those processes running on a worker node. A Ray cluster only has one head node. HPO Hyperparameter optimization (HPO) is the process of choosing a set of optimal hyperparameters for a learning algorithm. A hyperparameter can be a parameter whose value is used to control the learning process (e.g., learning rate), define the model architecture (e.g., number of hidden layers), or influence data pre-processing. In the case of Ray Train, hyperparameters can also include compute processing scale-out parameters such as the number of distributed training workers. .. TODO: Inference Job A Ray job is a packaged Ray application that can be executed on a (remote) Ray cluster. :ref:`Learn more`. Lineage For Ray objects, this is the set of tasks that was originally executed to produce the object. If an object’s value is lost due to node failure, Ray may attempt to recover the value by re-executing the object’s lineage. .. TODO: Logs .. TODO: Metrics Model A function approximator with trainable parameters (e.g.
a neural network) that can be trained by an algorithm on available data or collected data from an RL environment. The parameters are usually initialized at random (unlearned state). During the training process, checkpoints of the model can be created such that - after the learning process is shut down or crashes - training can resume from the latest weights rather than having to re-learn from scratch. After the training process is completed, models can be deployed into production for inference using Ray Serve. Multi-agent Denotes an RL environment setup, in which several (more than one) agents act in the same environment and learn either the same or different optimal behaviors. The relationship between the different agents in a multi-agent setup might be adversarial (playing against each other), cooperative (trying to reach a common goal) or neutral (the agents don’t really care about other agents’ actions). The NN model architectures that can be used for multi-agent training range from "independent" (each agent trains its own separate model), over "partially shared" (i.e. some agents might share their value function, because they have a common goal), to "identical" (all agents train on the same model). Namespace A namespace is a logical grouping of jobs and named actors. When an actor is named, its name must be unique within the namespace. When a namespace is not specified, Ray will place your job in an anonymous namespace. Node A Ray node is a physical or virtual machine that is part of a Ray cluster. See also :term:`Head node`. Object An application value. These are values that are returned by a task or created through ``ray.put``. Object ownership Ownership is the concept used to decide where metadata for a certain ``ObjectRef`` (and the task that creates the value) should be stored. If a worker calls ``foo.remote()`` or ``ray.put()``, it owns the metadata for the returned ``ObjectRef``, e.g., ref count and location information. If an object’s owner dies and another worker tries to get the value, it will receive an ``OwnerDiedError`` exception. Object reference A pointer to an application value, which can be stored anywhere in the cluster. Can be created by calling ``foo.remote()`` or ``ray.put()``. If using ``foo.remote()``, then the returned ``ObjectRef`` is also a future. Object store A distributed in-memory data store for storing Ray objects. Object spilling Objects in the object store are spilled to external storage once the capacity of the object store is used up. This enables out-of-core data processing for memory-intensive distributed applications. It comes with a performance penalty since data needs to be written to disk. .. TODO: Observability Observation The full or partial state of an RL environment, which an agent sees (has access to) at each timestep. A fully observable environment produces observations that contain all the information to sufficiently infer the current underlying state of the environment. Such states are also called “Markovian”. Examples for environments with Markovian observations are chess or 2D games, in which the player can see with each frame the entirety of the game’s state. A partially observable (or non-Markovian) environment produces observations that do not contain sufficient information to infer the exact underlying state. An example here would be a robot with a camera on its head facing forward. The robot walks around in a maze, but from a single camera frame might not know what’s currently behind it. 
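To make the Action space and Observation entries above concrete, here is a minimal sketch that inspects the spaces of a simple, fully observable environment (this assumes the ``gymnasium`` package is installed; the environment choice is arbitrary):

.. code-block:: python

    import gymnasium as gym

    # Create a simple, fully observable environment.
    env = gym.make("CartPole-v1")

    # A Box space of shape (4,): cart position/velocity and pole angle/velocity.
    print(env.observation_space)
    # A Discrete(2) space: push the cart left or right.
    print(env.action_space)

    # One timestep: reset, then apply a sampled action to get the
    # next observation and the reward for that action.
    obs, info = env.reset()
    next_obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
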
Offline data Data collected in an RL environment up-front and stored in some data format (e.g. JSON). Offline data can be used to train an RL agent. The data might have been generated by a non-RL/ML system, such as a simple decision-making script. Also, when training from offline data, the RL algorithm will not be able to explore new actions in new situations as all interactions with the environment already happened in the past (were recorded prior to training). Offline RL A sub-field of reinforcement learning (RL), in which specialized offline RL Algorithms learn how to compute optimal actions for an agent inside an environment without the ability to interact live with that environment. Instead, the data used for training has already been collected up-front (maybe even by a non-RL/ML system). This is very similar to a supervised learning setup. Examples of offline RL algorithms are MARWIL, CQL, and CRR. Off-Policy A type of RL Algorithm. In an off-policy algorithm, the policy used to compute the actions inside an RL environment (to generate the training data) might be different from the one that is being optimized. Examples of off-policy Algorithms are DQN, SAC, and DDPG. On-Policy A type of RL Algorithm. In an on-policy algorithm, the policy used to compute the actions inside an RL environment (to generate the training data) must be the exact same (matching NN weights at all times) as the one that's being optimized. Examples of on-policy Algorithms are PPO, APPO, and IMPALA. OOM (Out of Memory) Ray may run out of memory if the application is using too much memory on a single node. In this case the :ref:`Ray OOM killer` will kick in and kill worker processes to free up memory. Placement group Placement groups allow users to atomically reserve groups of resources across multiple nodes (i.e., gang scheduling). They can then be used to schedule Ray tasks and actors packed as close as possible for locality (PACK), or spread apart (SPREAD). Placement groups are generally used for gang-scheduling actors, but also support tasks. :ref:`Learn more`. Policy A (neural network) model that maps an RL environment observation of some agent to its next action inside an RL environment. .. TODO: Policy evaluation Predictor :class:`An interface for performing inference` (prediction) on input data with a trained model. Preprocessor :ref:`An interface used to preprocess a Dataset` for training and inference (prediction). Preprocessors can be stateful, as they can be fitted on the training dataset before being used to transform the training and evaluation datasets. .. TODO: Process Ray application A collection of Ray tasks, actors, and objects that originate from the same script. .. TODO: Ray Timeline Raylet A system process that runs on each Ray node. It’s responsible for scheduling and object management. Replica A replica is a single actor that handles requests to a given Serve deployment. A deployment may consist of many replicas, either statically-configured via ``num_replicas`` or dynamically configured using auto-scaling. Resource (logical and physical) Ray resources are logical resources (e.g. CPU, GPU) used by tasks and actors. They don't necessarily map 1-to-1 to the physical resources of the machines on which the Ray cluster runs. :ref:`Learn more`. Reward A single floating point value that each agent within an RL environment receives after each action taken. An agent is defined to be acting optimally inside the RL environment when the sum over all received rewards within an episode is maximized.
Note that rewards might be delayed (not immediately telling the agent whether an action was good or bad) or sparse (often having a value of zero), making it harder for the agent to learn. Rollout The process of advancing through an episode in an RL environment (with one or more RL agents) by taking sequential actions. During rollouts, the algorithm should collect the environment-produced 4-tuples [observations, actions, rewards, next observations] in order to (later or simultaneously) learn how to behave more optimally from this data. Rollout Worker Component within an RLlib Algorithm responsible for advancing and collecting observations and rewards in an RL environment. Actions for the different agent(s) within the environment are computed by the Algorithm's policy models. A distributed algorithm might have several replicas of Rollout Workers running as Ray actors in order to scale the data collection process for faster RL training. .. START ROLLOUT WORKER RolloutWorkers are used as ``@ray.remote`` actors to collect and return samples from environments or offline files in parallel. An RLlib :py:class:`~ray.rllib.algorithms.algorithm.Algorithm` usually has ``num_workers`` :py:class:`~ray.rllib.env.env_runner.EnvRunner` instances plus a single "local" :py:class:`~ray.rllib.env.env_runner.EnvRunner` (not ``@ray.remote``) in its :py:class:`~ray.rllib.env.env_runner_group.EnvRunnerGroup` under ``self.workers``. Depending on its evaluation config settings, an additional :py:class:`~ray.rllib.env.env_runner_group.EnvRunnerGroup` with :py:class:`~ray.rllib.env.env_runner.EnvRunner` instances for evaluation may be present in the :py:class:`~ray.rllib.algorithms.algorithm.Algorithm` under ``self.evaluation_workers``. .. END ROLLOUT WORKER .. TODO: Runtime Runtime environment A runtime environment defines dependencies such as files, packages, and environment variables needed for a Python script to run. It is installed dynamically on the cluster at runtime, and can be specified for a Ray job, or for specific actors and tasks. :ref:`Learn more`. Remote Function See :term:`Task`. Remote Class See :term:`Actor`. (Ray) Scheduler A Ray component that assigns execution units (Task/Actor) to Ray nodes. Search Space The definition of the possible values for hyperparameters. Can be composed out of constants, discrete values, or distributions of functions. This is also referred to as the “parameter space” (``param_space`` in the ``Tuner``). Search algorithm Search algorithms suggest new hyperparameter configurations to be evaluated by Tune. The default search algorithm is random search, where each new configuration is independent of the previous one. More sophisticated search algorithms such as ones using Bayesian optimization will fit a model to predict the hyperparameter configuration that will produce the best model, while also exploring the space of possible hyperparameters. Many popular search algorithms are built into Tune, most of which are integrations with other libraries. Serve application An application is the unit of upgrade in a Serve cluster. An application consists of one or more deployments. One of these deployments is considered the “ingress” deployment, which is where all inbound traffic is handled. Applications can be called via HTTP at their configured ``route_prefix``. DeploymentHandle DeploymentHandle is the Python API for making requests to Serve deployments. A handle is defined by passing one bound Serve deployment to the constructor of another.
Then at runtime that reference can be used to make requests. This is used to combine multiple deployments for model composition. Session - A Ray Train/Tune session: Tune session at the experiment execution layer and Train session at the Data Parallel training layer if running data-parallel distributed training with Ray Train. The session allows access to metadata, such as which trial is being run, information about the total number of workers, as well as the rank of the current worker. The session is also the interface through which an individual Trainable can interact with the Tune experiment as a whole. This includes uses such as reporting an individual trial’s metrics, saving/loading checkpoints, and retrieving the corresponding dataset shards for each Train worker. - A Ray cluster: in some cases the session also means a :term:`Ray Cluster`. For example, logs of a Ray cluster are stored under ``session_xxx/logs/``. Spillback A task caller schedules a task by first sending a resource request to the preferred raylet for that request. If the preferred raylet chooses not to grant the resources locally, it may also “Spillback” and respond to the caller with the address of a remote raylet at which the caller should retry the resource request. State State of the environment an RL agent interacts with. Synchronous execution Two tasks A and B are executed synchronously if A must finish before B can start. For example, if you call ``ray.get`` immediately after launching a remote task with ``task.remote()``, you’ll be running with synchronous execution, since this will wait for the task to finish before the program continues. Synchronous sampling Sampling workers work in synchronous steps. All of them must finish collecting a new batch of samples before training can proceed to the next iteration. Task A remote function invocation. This is a single function invocation that executes on a process different from the caller, and potentially on a different machine. A task can be stateless (a ``@ray.remote`` function) or stateful (a method of a ``@ray.remote`` class - see Actor below). A task is executed asynchronously with the caller: the ``.remote()`` call immediately returns one or more ``ObjectRefs`` (futures) that can be used to retrieve the return value(s). See :term:`Actor task`. Trainable A :ref:`Trainable` is the interface that Ray Tune uses to perform custom training logic. User-defined Trainables take in a configuration as an input and can run user-defined training code as well as custom metric reporting and checkpointing. There are many types of trainables. Most commonly used is the function trainable API, which is simply a Python function that contains model training logic and metric reporting. Tune also exposes a class trainable API, which allows you to implement training, checkpointing, and restoring as different methods. Ray Tune associates each trial with its own Trainable – the Trainable is the one actually doing training. The Trainable is a remote actor that can be placed on any node in a Ray cluster. Trainer A Trainer is the top-level API to configure a single distributed training job. :ref:`There are built-in Trainers for different frameworks`, like PyTorch, Tensorflow, and XGBoost. Each trainer shares a common interface and otherwise defines framework-specific configurations and entrypoints. The main job of a trainer is to coordinate N distributed training workers and set up the communication backends necessary for these workers to communicate (e.g., for sharing computed gradients). 
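As a rough sketch of the Trainer pattern described above (this assumes Ray Train's ``TorchTrainer`` and uses a placeholder training function; the other built-in trainers follow the same shape):

.. code-block:: python

    from ray import train
    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer


    def train_loop_per_worker(config):
        # Placeholder training logic: each data-parallel worker runs this
        # function and reports its metrics back to the Trainer.
        train.report({"loss": 0.0})


    trainer = TorchTrainer(
        train_loop_per_worker,
        # Coordinate two distributed training workers, CPU-only.
        scaling_config=ScalingConfig(num_workers=2, use_gpu=False),
    )
    result = trainer.fit()
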
Trainer configuration A Trainer can be configured in various ways. Some configurations are shared across all trainers, like the RunConfig, which configures things like the experiment storage, and ScalingConfig, which configures the number of training workers as well as resources needed per worker. Other configurations are specific to the trainer framework. Training iteration A partial training pass of input data up to a pre-defined yield point (e.g., time or data consumed) for checkpointing of long-running training jobs. A full training epoch can consist of multiple training iterations. .. TODO: RLlib Training epoch A full training pass of the input dataset. Typically, model training iterates through the full dataset in batches of size B, where gradients are calculated on each batch and then applied as an update to the model weights. Training jobs can consist of multiple epochs by training through the same dataset multiple times. Training step An RLlib-specific method of the Algorithm class which includes the core logic of an RL algorithm. Commonly includes gathering of experiences (either through sampling or from offline data), optimization steps, and redistribution of learned model weights. The particularities of this method are specific to algorithms and configurations. Transition A tuple of (observation, action, reward, next observation). A transition represents one step of an agent in an environment. Trial One training run within a Ray Tune experiment. If you run multiple trials, each trial usually corresponds to a different config (a set of hyperparameters). Trial scheduler When running a Ray Tune job, the scheduler will decide how to allocate resources to trials. In the most common case, this resource is time - the trial scheduler decides which trials to run at what time. Certain built-in schedulers like Asynchronous Hyperband (ASHA) perform early stopping of under-performing trials, while others like Population Based Training (PBT) will make under-performing trials copy the hyperparameter config and model weights of top-performing trials and continue training. Tuner The Tuner is the top-level Ray Tune API used to configure and run an experiment with many trials. .. TODO: Tunable .. TODO: (Ray) Workflow .. TODO: WorkerGroup .. TODO: Worker heap .. TODO: Worker node / worker node pod Worker process / worker The process that runs user-defined tasks and actors. --- .. include:: /_includes/rllib/we_are_hiring.rst .. _rllib-algo-configuration-docs: AlgorithmConfig API =================== .. include:: /_includes/rllib/new_api_stack.rst RLlib's :py:class:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig` API is the auto-validated and type-safe gateway into configuring and building an RLlib :py:class:`~ray.rllib.algorithms.algorithm.Algorithm`. In essence, you first create an instance of :py:class:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig` and then call some of its methods to set various configuration options. RLlib uses the following `black `__-compliant format in all parts of its code. Note that you can chain together more than one method call, including the constructor: .. testcode:: from ray.rllib.algorithms.algorithm_config import AlgorithmConfig config = ( # Create an `AlgorithmConfig` instance. AlgorithmConfig() # Change the learning rate. .training(lr=0.0005) # Change the number of Learner actors. .learners(num_learners=2) ) ..
hint:: For value checking and type-safety reasons, you should never set attributes in your :py:class:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig` directly, but always go through the proper methods: .. testcode:: # WRONG! config.env = "CartPole-v1" # <- don't set attributes directly # CORRECT! config.environment(env="CartPole-v1") # call the proper method Algorithm specific config classes --------------------------------- You don't use the base ``AlgorithmConfig`` class directly in practice, but always its algorithm-specific subclasses, such as :py:class:`~ray.rllib.algorithms.ppo.ppo.PPOConfig`. Each subclass comes with its own set of additional arguments to the :py:meth:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig.training` method. Normally, you should pick the specific :py:class:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig` subclass that matches the :py:class:`~ray.rllib.algorithms.algorithm.Algorithm` you would like to run your learning experiments with. For example, if you would like to use :ref:`IMPALA ` as your algorithm, you should import its specific config class: .. testcode:: from ray.rllib.algorithms.impala import IMPALAConfig config = ( # Create an `IMPALAConfig` instance. IMPALAConfig() # Specify the RL environment. .environment("CartPole-v1") # Change the learning rate. .training(lr=0.0004) ) To change algorithm-specific settings, here for ``IMPALA``, also use the :py:meth:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig.training` method: .. testcode:: # Change an IMPALA-specific setting (the entropy coefficient). config.training(entropy_coeff=0.01) You can build the :py:class:`~ray.rllib.algorithms.impala.IMPALA` instance directly from the config object by calling the :py:meth:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig.build_algo` method: .. testcode:: # Build the algorithm instance. impala = config.build_algo() .. testcode:: :hide: impala.stop() The config object stored inside any built :py:class:`~ray.rllib.algorithms.algorithm.Algorithm` instance is a copy of your original config. This allows you to further alter your original config object and build another algorithm instance without affecting the previously built one: .. testcode:: # Further alter the config without affecting the previously built IMPALA object ... config.training(lr=0.00123) # ... and build a new IMPALA from it. another_impala = config.build_algo() .. testcode:: :hide: another_impala.stop() If you are working with `Ray Tune `__, pass your :py:class:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig` instance into the constructor of the :py:class:`~ray.tune.tuner.Tuner`: .. code-block:: python from ray import tune tuner = tune.Tuner( "IMPALA", param_space=config, # <- your RLlib AlgorithmConfig object .. ) # Run the experiment with Ray Tune. results = tuner.fit() .. _rllib-algo-configuration-generic-settings: Generic config settings ----------------------- Most config settings are generic and apply to all of RLlib's :py:class:`~ray.rllib.algorithms.algorithm.Algorithm` classes. The following sections walk you through the most important config settings that users should pay close attention to before diving further into other config settings and before starting hyperparameter fine-tuning. RL Environment ~~~~~~~~~~~~~~ To configure which :ref:`RL environment ` your algorithm trains against, use the ``env`` argument to the :py:meth:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig.environment` method: ..
testcode:: config.environment("Humanoid-v5") See this :ref:`RL environment guide ` for more details. .. tip:: Install both `Atari `__ and `MuJoCo `__ to be able to run all of RLlib's :ref:`tuned examples `: .. code-block:: bash pip install "gymnasium[atari,accept-rom-license,mujoco]" Learning rate `lr` ~~~~~~~~~~~~~~~~~~ Set the learning rate for updating your models through the ``lr`` argument to the :py:meth:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig.training` method: .. testcode:: config.training(lr=0.0001) .. _rllib-algo-configuration-train-batch-size: Train batch size ~~~~~~~~~~~~~~~~ Set the train batch size, per Learner actor, through the ``train_batch_size_per_learner`` argument to the :py:meth:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig.training` method: .. testcode:: config.training(train_batch_size_per_learner=256) .. note:: You can compute the total effective train batch size by multiplying ``train_batch_size_per_learner`` by ``(num_learners or 1)``. Alternatively, check the value of your config's :py:attr:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig.total_train_batch_size` property: .. testcode:: config.training(train_batch_size_per_learner=256) config.learners(num_learners=2) print(config.total_train_batch_size) # expect: 512 = 256 * 2 Discount factor `gamma` ~~~~~~~~~~~~~~~~~~~~~~~ Set the `RL discount factor `__ through the ``gamma`` argument to the :py:meth:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig.training` method: .. testcode:: config.training(gamma=0.995) Scaling with `num_env_runners` and `num_learners` ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. todo (sven): link to scaling guide, once separated out in its own rst. Set the number of :py:class:`~ray.rllib.env.env_runner.EnvRunner` actors used to collect training samples through the ``num_env_runners`` argument to the :py:meth:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig.env_runners` method: .. testcode:: config.env_runners(num_env_runners=4) # Also use `num_envs_per_env_runner` to vectorize your environment on each EnvRunner actor. # Note that this option is only available in single-agent setups. # The Ray Team is working on a solution for this restriction. config.env_runners(num_envs_per_env_runner=10) Set the number of :py:class:`~ray.rllib.core.learner.learner.Learner` actors used to update your models through the ``num_learners`` argument to the :py:meth:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig.learners` method. This should correspond to the number of GPUs you have available for training. .. testcode:: config.learners(num_learners=2) Disable `explore` behavior ~~~~~~~~~~~~~~~~~~~~~~~~~~ Switch exploratory behavior on or off through the ``explore`` argument to the :py:meth:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig.env_runners` method. To compute actions, the :py:class:`~ray.rllib.env.env_runner.EnvRunner` calls `forward_exploration()` on the RLModule when ``explore=True`` and `forward_inference()` when ``explore=False``. The default value is ``explore=True``. .. testcode:: # Disable exploration behavior. # When False, the EnvRunner calls `forward_inference()` on the RLModule to compute # actions instead of `forward_exploration()`. config.env_runners(explore=False) Rollout length ~~~~~~~~~~~~~~ Set the number of timesteps that each :py:class:`~ray.rllib.env.env_runner.EnvRunner` steps through with each of its RL environment copies using the ``rollout_fragment_length`` argument.
Pass this argument to the :py:meth:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig.env_runners` method. Note that some algorithms, like :py:class:`~ray.rllib.algorithms.ppo.PPO`, set this value automatically, based on the :ref:`train batch size `, number of :py:class:`~ray.rllib.env.env_runner.EnvRunner` actors and number of envs per :py:class:`~ray.rllib.env.env_runner.EnvRunner`. .. testcode:: config.env_runners(rollout_fragment_length=50) All available methods and their settings ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Besides the previously described most common settings, the :py:class:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig` class and its algo-specific subclasses come with many more configuration options. To structure things more semantically, :py:class:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig` groups its various config settings into the following categories, each represented by its own method: - :ref:`Config settings for the RL environment ` - :ref:`Config settings for training behavior (including algo-specific settings) ` - :ref:`Config settings for EnvRunners ` - :ref:`Config settings for Learners ` - :ref:`Config settings for adding callbacks ` - :ref:`Config settings for multi-agent setups ` - :ref:`Config settings for offline RL ` - :ref:`Config settings for evaluating policies ` - :ref:`Config settings for the DL framework ` - :ref:`Config settings for reporting and logging behavior ` - :ref:`Config settings for checkpointing ` - :ref:`Config settings for debugging ` - :ref:`Experimental config settings ` To familiarize yourself with the vast number of RLlib's different config options, you can browse through `RLlib's examples folder `__ or take a look at this :ref:`examples folder overview page `. Each example script usually introduces a new config setting or shows you how to implement specific customizations through a combination of setting certain config options and adding custom code to your experiment. --- .. include:: /_includes/rllib/we_are_hiring.rst .. _rllib-checkpoints-docs: Checkpointing ============= .. include:: /_includes/rllib/new_api_stack.rst RLlib offers a powerful checkpointing system for all its major classes, allowing you to save the states of :py:class:`~ray.rllib.algorithms.algorithm.Algorithm` instances and their subcomponents to local disk or cloud storage, and restore previously run experiment states and individual subcomponents. This system allows you to continue training models from a previous state or deploy bare-bones PyTorch models into production. .. figure:: images/checkpointing/save_and_restore.svg :width: 500 :align: left **Saving to and restoring from disk or cloud storage**: Use the :py:meth:`~ray.rllib.utils.checkpoints.Checkpointable.save_to_path` method to write the current state of any :py:meth:`~ray.rllib.utils.checkpoints.Checkpointable` component or your entire Algorithm to disk or cloud storage. To load a saved state back into a running component or into your Algorithm, use the :py:meth:`~ray.rllib.utils.checkpoints.Checkpointable.restore_from_path` method. A checkpoint is a directory on disk or some `PyArrow `__-supported cloud location, like `gcs `__ or `S3 `__. It contains architecture information, such as the class and the constructor arguments for creating a new instance, a ``pickle`` or ``msgpack`` file with state information, and a human readable ``metadata.json`` file with information about the Ray version, git commit, and checkpoint version. 
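The following is a minimal sketch of this save-and-restore round trip (assuming a small PPO setup similar to the examples further below; the checkpoint location is whatever ``save_to_path()`` returns, and a PyArrow-supported cloud URI would work as well):

.. code-block:: python

    from ray.rllib.algorithms.ppo import PPOConfig

    # Build and briefly train an algorithm.
    algo = PPOConfig().environment("CartPole-v1").build_algo()
    algo.train()

    # Write the algorithm's full state to a checkpoint directory.
    checkpoint_dir = algo.save_to_path()

    # Load that state back into the (same or another) Algorithm instance.
    algo.restore_from_path(checkpoint_dir)
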
You can generate a new :py:class:`~ray.rllib.algorithms.algorithm.Algorithm` instance or other subcomponent, like an :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule`, from an existing checkpoint using the :py:meth:`~ray.rllib.utils.checkpoints.Checkpointable.from_checkpoint` method. For example, you can deploy a previously trained :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule`, without any of the other RLlib components, into production. .. figure:: images/checkpointing/from_checkpoint.svg :width: 750 :align: left **Creating a new instance directly from a checkpoint**: Use the ``classmethod`` :py:meth:`~ray.rllib.utils.checkpoints.Checkpointable.from_checkpoint` to instantiate objects directly from a checkpoint. RLlib first uses the saved metadata to create a bare-bones instance of the originally checkpointed object, and then restores its state from the state information in the checkpoint dir. Another possibility is to load only a certain subcomponent's state into the containing higher-level object. For example, you may want to load only the state of your :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule`, located inside your :py:class:`~ray.rllib.algorithms.algorithm.Algorithm`, but leave all the other components as-is. Checkpointable API ------------------ RLlib manages checkpointing through the :py:class:`~ray.rllib.utils.checkpoints.Checkpointable` API, which exposes the following three main methods: - :py:meth:`~ray.rllib.utils.checkpoints.Checkpointable.save_to_path` for creating a new checkpoint - :py:meth:`~ray.rllib.utils.checkpoints.Checkpointable.restore_from_path` for loading a state from a checkpoint into a running object - :py:meth:`~ray.rllib.utils.checkpoints.Checkpointable.from_checkpoint` for creating a new object from a checkpoint The RLlib classes that currently support the :py:class:`~ray.rllib.utils.checkpoints.Checkpointable` API are: - :py:class:`~ray.rllib.algorithms.algorithm.Algorithm` - :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` (and :py:class:`~ray.rllib.core.rl_module.multi_rl_module.MultiRLModule`) - :py:class:`~ray.rllib.env.env_runner.EnvRunner` (thus, also :py:class:`~ray.rllib.env.single_agent_env_runner.SingleAgentEnvRunner` and :py:class:`~ray.rllib.env.multi_agent_env_runner.MultiAgentEnvRunner`) - :py:class:`~ray.rllib.connectors.connector_v2.ConnectorV2` (thus, also :py:class:`~ray.rllib.connectors.connector_pipeline_v2.ConnectorPipelineV2`) - :py:class:`~ray.rllib.core.learner.learner_group.LearnerGroup` - :py:class:`~ray.rllib.core.learner.learner.Learner` .. _rllib-checkpoints-save-to-path: Creating a new checkpoint with `save_to_path()` ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ You create a new checkpoint from an instantiated RLlib object through the :py:meth:`~ray.rllib.utils.checkpoints.Checkpointable.save_to_path` method. The following are two examples, single- and multi-agent, using the :py:class:`~ray.rllib.algorithms.algorithm.Algorithm` class, showing how to create checkpoints: .. tab-set:: .. tab-item:: Single-agent setup .. testcode:: from ray.rllib.algorithms.ppo import PPOConfig # Configure and build an initial algorithm. config = ( PPOConfig() .environment("Pendulum-v1") ) ppo = config.build() # Train for one iteration, then save to a checkpoint. print(ppo.train()) checkpoint_dir = ppo.save_to_path() print(f"saved algo to {checkpoint_dir}") .. testcode:: :hide: _weights_check = ppo.get_module("default_policy").get_state() ppo.stop() .. tab-item:: Multi-agent setup ..
testcode:: from ray.rllib.algorithms.ppo import PPOConfig from ray.rllib.examples.envs.classes.multi_agent import MultiAgentPendulum from ray.tune import register_env register_env("multi-pendulum", lambda cfg: MultiAgentPendulum({"num_agents": 2})) # Configure and build an initial algorithm. multi_agent_config = ( PPOConfig() .environment("multi-pendulum") .multi_agent( policies={"p0", "p1"}, # Agent IDs are 0 and 1 -> map to p0 and p1, respectively. policy_mapping_fn=lambda aid, eps, **kw: f"p{aid}" ) ) ppo = multi_agent_config.build() # Train for one iteration, then save to a checkpoint. print(ppo.train()) multi_agent_checkpoint_dir = ppo.save_to_path() print(f"saved multi-agent algo to {multi_agent_checkpoint_dir}") .. testcode:: :hide: ppo.stop() .. note:: When running your experiments with `Ray Tune `__, Tune calls the :py:meth:`~ray.rllib.utils.checkpoints.Checkpointable.save_to_path` method automatically on the :py:class:`~ray.rllib.algorithms.algorithm.Algorithm` instance, whenever the training iteration matches the checkpoint frequency configured through Tune. The default location where Tune creates these checkpoints is ``~/ray_results/[your experiment name]/[Tune trial name]/checkpoint_[sequence number]``. Checkpoint versions +++++++++++++++++++ RLlib uses a checkpoint versioning system to figure out how to restore an Algorithm or any subcomponent from a given directory. From Ray 2.40 on, you can find the checkpoint version in the human readable ``metadata.json`` file inside all checkpoint directories. Also starting from `Ray 2.40`, RLlib checkpoints are backward compatible. This means that a checkpoint created with Ray `2.x` can be read and handled by `Ray 2.x+n`, as long as `x >= 40`. The Ray team ensures backward compatibility with `comprehensive CI tests on checkpoints taken with previous Ray versions `__. .. _rllib-checkpoints-structure-of-checkpoint-dir: Structure of a checkpoint directory +++++++++++++++++++++++++++++++++++ After saving your PPO's state in the ``checkpoint_dir`` directory, or somewhere in ``~/ray_results/`` if you use Ray Tune, the directory looks like the following: .. code-block:: shell $ cd [your algo checkpoint dir] $ ls -la . .. env_runner/ learner_group/ algorithm_state.pkl class_and_ctor_args.pkl metadata.json Subdirectories inside a checkpoint dir, like ``env_runner/``, hint at a subcomponent's own checkpoint data. For example, an :py:class:`~ray.rllib.algorithms.algorithm.Algorithm` always also saves its :py:class:`~ray.rllib.env.env_runner.EnvRunner` state and :py:class:`~ray.rllib.core.learner.learner_group.LearnerGroup` state. .. note:: Each of the subcomponent directories contains a ``metadata.json`` file, a ``class_and_ctor_args.pkl`` file, and a ``pickle`` or ``msgpack`` state file, all serving the same purpose as their counterparts in the main algorithm checkpoint directory. For example, inside the ``learner_group/`` subdirectory, you would find the :py:class:`~ray.rllib.core.learner.learner_group.LearnerGroup`'s own architecture, state, and meta information: .. code-block:: shell $ cd learner_group/ $ ls -la . .. state.pkl class_and_ctor_args.pkl metadata.json See :ref:`RLlib component tree ` for details. The ``metadata.json`` file exists for your convenience only and RLlib doesn't need it. .. note:: The ``metadata.json`` file contains information about the Ray version used to create the checkpoint, the Ray commit, the RLlib checkpoint version, and the names of the state- and constructor-information files in the same directory. ..
code-block:: shell $ more metadata.json { "class_and_ctor_args_file": "class_and_ctor_args.pkl", "state_file": "state", "ray_version": .., "ray_commit": .., "checkpoint_version": "2.1" } The ``class_and_ctor_args.pkl`` file stores meta information needed to construct a "fresh" object, without any particular state. This information, as the filename suggests, contains the class of the saved object and its constructor arguments and keyword arguments. RLlib uses this file to create the initial new object when calling :py:meth:`~ray.rllib.utils.checkpoints.Checkpointable.from_checkpoint`. Finally, the ``.._state.[pkl|msgpack]`` file contains the pickled or msgpacked state dict of the saved object. RLlib obtains this state dict, when saving a checkpoint, through calling the object's :py:meth:`~ray.rllib.utils.checkpoints.Checkpointable.get_state` method. .. note:: Support for ``msgpack`` based checkpoints is experimental, but might become the default in the future. Unlike ``pickle``, ``msgpack`` has the advantage of being independent of the python-version, thus allowing users to recover experiment and model states from old checkpoints they have generated with older python versions. The Ray team is working on completely separating state from architecture within checkpoints, meaning all state information should go into the ``state.msgpack`` file, which is python-version independent, whereas all architecture information should go into the ``class_and_ctor_args.pkl`` file, which still depends on the python version. At the time of loading from checkpoint, the user would have to provide the latter/architecture part of the checkpoint. `See here for an example that illustrates this in more detail `__. .. _rllib-checkpoints-component-tree: RLlib component tree +++++++++++++++++++++++ The following is the structure of the RLlib component tree, showing under which name you can access a subcomponent's own checkpoint within the higher-level checkpoint. At the highest level is the :py:class:`~ray.rllib.algorithms.algorithm.Algorithm` class: .. code-block:: shell algorithm/ learner_group/ learner/ rl_module/ default_policy/ # <- single-agent case [module ID 1]/ # <- multi-agent case [module ID 2]/ # ... env_runner/ env_to_module_connector/ module_to_env_connector/ .. note:: The ``env_runner/`` subcomponent currently doesn't hold a copy of the :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` checkpoint because it's already saved under ``learner/``. The Ray team is working on resolving this issue, probably through soft-linking to avoid duplicate files and unnecessary disk usage. .. _rllib-checkpoints-from-checkpoint: Creating instances from a checkpoint with `from_checkpoint` ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Once you have a checkpoint of either a trained :py:class:`~ray.rllib.algorithms.algorithm.Algorithm` or any of its :ref:`subcomponents `, you can recreate new objects directly from this checkpoint. The following are two examples: .. tab-set:: .. tab-item:: Create a new Algorithm from a checkpoint To recreate an entire :py:class:`~ray.rllib.algorithms.algorithm.Algorithm` instance from a checkpoint, you can do the following: .. testcode:: # Import the correct class to create from scratch using the checkpoint. from ray.rllib.algorithms.algorithm import Algorithm # Use the already existing checkpoint in `checkpoint_dir`. new_ppo = Algorithm.from_checkpoint(checkpoint_dir) # Confirm the `new_ppo` matches the originally checkpointed one. 
assert new_ppo.config.env == "Pendulum-v1" # Continue training. new_ppo.train() .. testcode:: :hide: new_ppo.stop() .. tab-item:: Create a new RLModule from an Algorithm checkpoint Creating a new RLModule from an Algorithm checkpoint is useful when deploying trained models into production or evaluating them in a separate process while training is ongoing. To recreate only the :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` from the algorithm's checkpoint, you can do the following. .. testcode:: from pathlib import Path import torch # Import the correct class to create from scratch using the checkpoint. from ray.rllib.core.rl_module.rl_module import RLModule # Use the already existing checkpoint in `checkpoint_dir`, but go further down # into its subdirectory for the single RLModule. # See the preceding section on "RLlib component tree" for the various elements in the RLlib # component tree. rl_module_checkpoint_dir = Path(checkpoint_dir) / "learner_group" / "learner" / "rl_module" / "default_policy" # Now that you have the correct subdirectory, create the actual RLModule. rl_module = RLModule.from_checkpoint(rl_module_checkpoint_dir) # Run a forward pass to compute action logits. # Use a dummy Pendulum observation tensor (3d) and add a batch dim (B=1). results = rl_module.forward_inference( {"obs": torch.tensor([0.5, 0.25, -0.3]).unsqueeze(0).float()} ) print(results) See this `example of how to run policy inference after training `__ and this `example of how to run policy inference with an LSTM `__. .. hint:: Because your :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` is also a `PyTorch Module `__, you can easily export your model to `ONNX `__, `IREE `__, or other deployment-friendly formats. See this `example script supporting ONNX `__ for more details. Restoring state from a checkpoint with `restore_from_path` ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Normally, the :py:meth:`~ray.rllib.utils.checkpoints.Checkpointable.save_to_path` and :py:meth:`~ray.rllib.utils.checkpoints.Checkpointable.from_checkpoint` methods are all you need to create checkpoints and re-create instances from them. However, sometimes, you already have an instantiated object up and running and would like to "load" another state into it. For example, consider training two :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` networks through multi-agent training, playing against each other in a self-play fashion. After a while, you would like to swap out, without interrupting your experiment, one of the ``RLModules`` with a third one that you have saved to disk or cloud storage a while back. This is where the :py:meth:`~ray.rllib.utils.checkpoints.Checkpointable.restore_from_path` method comes in handy. It loads a state into an already running object, for example your Algorithm, or into a subcomponent of that object, for example a particular :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` within your :py:class:`~ray.rllib.algorithms.algorithm.Algorithm`. .. tab-set:: .. tab-item:: Continue training When using RLlib directly, meaning without Ray Tune, the problem of loading a state into a running instance is straightforward: .. testcode:: # Recreate the preceding PPO from the config. new_ppo = config.build() # Load the state stored previously in `checkpoint_dir` into the # running algorithm instance. new_ppo.restore_from_path(checkpoint_dir) # Run another training iteration. new_ppo.train() .. testcode:: :hide: new_ppo.stop() .. 
tab-item:: Continue training with Ray Tune However, when running through Ray Tune, you don't have direct access to the Algorithm object or any of its subcomponents. You can use :ref:`RLlib's callbacks APIs ` to inject custom code and work around this limitation. Also, see here for an `example on how to continue training with a different config `__. .. testcode:: from ray import tune # Reuse the preceding PPOConfig (`config`). # Inject custom callback code that runs right after the algorithm's initialization. config.callbacks( on_algorithm_init=( lambda algorithm, _dir=checkpoint_dir, **kw: algorithm.restore_from_path(_dir) ), ) # Run the experiment, continuing from the checkpoint, through Ray Tune. results = tune.Tuner( config.algo_class, param_space=config, run_config=tune.RunConfig(stop={"num_env_steps_sampled_lifetime": 8000}) ).fit() .. tab-item:: Swap out one RLModule and continue multi-agent training In the :ref:`preceding section on save_to_path `, you created a single-agent checkpoint with the ``default_policy`` ModuleID, and a multi-agent checkpoint with two ModuleIDs, ``p0`` and ``p1``. Here is how you can continue training the multi-agent experiment, but swap out ``p1`` with the state of the ``default_policy`` from the single-agent experiment. You can use :ref:`RLlib's callbacks APIs ` to inject custom code into a Ray Tune experiment: .. testcode:: # Reuse the preceding multi-agent PPOConfig (`multi_agent_config`). # But swap out ``p1`` with the state of the ``default_policy`` from the # single-agent run, using a callback and the correct path through the # RLlib component tree: multi_rl_module_component_tree = "learner_group/learner/rl_module" # Inject custom callback code that runs right after the algorithm's initialization. def _on_algo_init(algorithm, **kwargs): algorithm.restore_from_path( # Checkpoint was single-agent (has "default_policy" subdir). path=Path(checkpoint_dir) / multi_rl_module_component_tree / "default_policy", # Algo is multi-agent (has "p0" and "p1" subdirs). component=multi_rl_module_component_tree + "/p1", ) # Inject callback. multi_agent_config.callbacks(on_algorithm_init=_on_algo_init) # Run the experiment through Ray Tune. results = tune.Tuner( multi_agent_config.algo_class, param_space=multi_agent_config, run_config=tune.RunConfig(stop={"num_env_steps_sampled_lifetime": 8000}) ).fit() .. testcode:: :hide: from ray.rllib.utils.test_utils import check _weights_check_2 = multi_agent_config.build().get_module("p1").get_state() check(_weights_check, _weights_check_2) --- .. include:: /_includes/rllib/we_are_hiring.rst .. _connector-v2-docs: .. grid:: 1 2 3 4 :gutter: 1 :class-container: container pb-3 .. grid-item-card:: :img-top: /rllib/images/connector_v2/connector_generic.svg :class-img-top: pt-2 w-75 d-block mx-auto fixed-height-img .. button-ref:: connector-v2-docs ConnectorV2 overview (this page) .. grid-item-card:: :img-top: /rllib/images/connector_v2/env_to_module_connector.svg :class-img-top: pt-2 w-75 d-block mx-auto fixed-height-img .. button-ref:: env-to-module-pipeline-docs Env-to-module pipelines .. grid-item-card:: :img-top: /rllib/images/connector_v2/learner_connector.svg :class-img-top: pt-2 w-75 d-block mx-auto fixed-height-img .. button-ref:: learner-pipeline-docs Learner connector pipelines ConnectorV2 and ConnectorV2 pipelines ===================================== .. toctree:: :hidden: env-to-module-connector learner-connector ..
include:: /_includes/rllib/new_api_stack.rst RLlib stores and transports all trajectory data in the form of :py:class:`~ray.rllib.env.single_agent_episode.SingleAgentEpisode` or :py:class:`~ray.rllib.env.multi_agent_episode.MultiAgentEpisode` objects. **Connector pipelines** are the components that translate this episode data into tensor batches readable by neural network models right before the model forward pass. .. figure:: images/connector_v2/generic_connector_pipeline.svg :width: 1000 :align: left **Generic ConnectorV2 Pipeline**: All pipelines consist of one or more :py:class:`~ray.rllib.connectors.connector_v2.ConnectorV2` pieces. When calling the pipeline, you pass in a list of Episodes, the :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` instance, and a batch, which initially might be an empty dict. Each :py:class:`~ray.rllib.connectors.connector_v2.ConnectorV2` piece in the pipeline takes its predecessor's output, starting on the left side with the batch, performs some transformations on the episodes, the batch, or both, and passes everything on to the next piece. Thereby, all :py:class:`~ray.rllib.connectors.connector_v2.ConnectorV2` pieces can read from and write to the provided episodes, add any data from these episodes to the batch, or change the data that's already in the batch. The pipeline then returns the output batch of the last piece. .. note:: The batch output of the pipeline lives only as long as the succeeding :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` forward pass or `Env.step()` call. RLlib discards the data afterwards. The list of episodes, however, may persist longer. For example, if an env-to-module pipeline reads an observation from an episode, mutates that observation, and then writes it back into the episode, the subsequent module-to-env pipeline is able to see the changed observation. Also, the Learner pipeline operates on the same episodes that have already passed through both env-to-module and module-to-env pipelines and thus might have undergone changes. Three ConnectorV2 pipeline types -------------------------------- There are three different types of connector pipelines in RLlib: 1) :ref:`Env-to-module pipeline `, which creates tensor batches for action-computing forward passes. 2) Module-to-env pipeline (documentation pending), which translates a model's output into RL environment actions. 3) :ref:`Learner connector pipeline `, which creates the train batch for a model update. The :py:class:`~ray.rllib.connectors.connector_v2.ConnectorV2` API is an extremely powerful tool for customizing your RLlib experiments and algorithms. It allows you to take full control over accessing, changing, and re-assembling the episode data collected from your RL environments or your offline RL input files as well as controlling the exact nature and shape of the tensor batches that RLlib feeds into your models for computing actions or losses. .. figure:: images/connector_v2/location_of_connector_pipelines_in_rllib.svg :width: 900 :align: left **ConnectorV2 Pipelines**: Connector pipelines convert episodes into batched data, which your model can process (env-to-module and Learner) or convert your model's output into action batches, which your possibly vectorized RL environment needs for stepping (module-to-env).
The env-to-module pipeline, located on an :py:class:`~ray.rllib.env.env_runner.EnvRunner`, takes a list of episodes as input and outputs a batch for an :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` forward pass that computes the next action. The module-to-env pipeline on the same :py:class:`~ray.rllib.env.env_runner.EnvRunner` takes the output of that :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` and converts it into actions for the next call to your RL environment's `step()` method. Lastly, a Learner connector pipeline, located on a :py:class:`~ray.rllib.core.learner.learner.Learner` worker, converts a list of episodes into a train batch for the next :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` update. The succeeding pages discuss the three pipeline types in more detail. However, all three have the following in common: * All connector pipelines are sequences of one or more :py:class:`~ray.rllib.connectors.connector_v2.ConnectorV2` pieces. You can nest these as well, meaning some of the pieces may be connector pipelines themselves. * All connector pieces and pipelines are Python callables, overriding the :py:meth:`~ray.rllib.connectors.connector_v2.ConnectorV2.__call__` method. * The call signatures are uniform across the different pipeline types. The main, mandatory arguments are the list of episodes, the batch to be built, and the :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` instance. See the :py:meth:`~ray.rllib.connectors.connector_v2.ConnectorV2.__call__` method for more details. * All connector pipelines can read from and write to the provided list of episodes as well as the batch and thereby perform data transforms as required. Batch construction phases and formats ------------------------------------- When you push a list of input episodes through a connector pipeline, the pipeline constructs a batch from the given data. This batch always starts as an empty Python dictionary and undergoes different formats and phases while passing through the different pieces of the pipeline. The following applies to all :ref:`env-to-module ` and learner connector pipelines (documentation in progress). .. figure:: images/connector_v2/pipeline_batch_phases_single_agent.svg :width: 1000 :align: left **Batch construction phases and formats**: In the standard single-agent case, where only one ModuleID (``DEFAULT_MODULE_ID``) exists, the batch starts as an empty dictionary (left) and then undergoes a "collect data" phase, in which connector pieces add individual items to the batch by storing them under a) the column name, for example ``obs`` or ``rewards``, and b) the episode ID from which they extracted the item. In most cases, your custom connector pieces operate during this phase. Once all custom pieces have performed their data insertions and transforms, the :py:class:`~ray.rllib.connectors.common.agent_to_module_mapping.AgentToModuleMapping` default piece performs a "reorganize by ModuleID" operation (center), during which the batch's dictionary hierarchy changes to having the ModuleID (``DEFAULT_MODULE_ID``) at the top level and the column names thereunder. On the lowest level in the batch, data items still reside in Python lists. Finally, the :py:class:`~ray.rllib.connectors.common.batch_individual_items.BatchIndividualItems` default piece creates NumPy arrays out of the Python lists, thereby batching all data (right).
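The following is a minimal, schematic sketch of these three phases for the single-agent case, using plain Python dictionaries and NumPy arrays. The episode IDs, observation values, and exact nesting are illustrative only and simplify RLlib's internal batch format described in the preceding figure.

.. code-block:: python

    import numpy as np

    # Two dummy 4D observations, as if extracted from two different episodes.
    obs_ep1 = np.array([0.01, -0.02, 0.03, 0.04], dtype=np.float32)
    obs_ep2 = np.array([0.05, 0.06, -0.07, 0.08], dtype=np.float32)

    # Phase 1, "collect data": items keyed by column name and (hypothetical) episode ID.
    batch_collect_phase = {
        "obs": {
            "episode_1": [obs_ep1],
            "episode_2": [obs_ep2],
        },
    }

    # Phase 2, "reorganize by ModuleID": the ModuleID moves to the top level,
    # column names underneath; items still live in plain Python lists.
    batch_by_module_id = {
        "default_policy": {
            "obs": [obs_ep1, obs_ep2],
        },
    }

    # Phase 3, batching: the Python lists become NumPy arrays whose
    # 0th axis is the batch axis.
    batch_final = {
        "default_policy": {
            "obs": np.stack([obs_ep1, obs_ep2]),  # shape: (2, 4)
        },
    }
    print(batch_final["default_policy"]["obs"].shape)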
For multi-agent setups, where there is more than one ModuleID, the :py:class:`~ray.rllib.connectors.common.agent_to_module_mapping.AgentToModuleMapping` default connector piece makes sure that the constructed output batch maps module IDs to the respective module's forward batch: .. figure:: images/connector_v2/pipeline_batch_phases_multi_agent.svg :width: 1100 :align: left **Batch construction for multi-agent**: In a multi-agent setup, the default :py:class:`~ray.rllib.connectors.common.agent_to_module_mapping.AgentToModuleMapping` connector piece reorganizes the batch by ``ModuleID``, then column names, such that a :py:class:`~ray.rllib.core.rl_module.multi_rl_module.MultiRLModule` can loop through its sub-modules and provide each with a batch for the forward pass. RLlib's :py:class:`~ray.rllib.core.rl_module.multi_rl_module.MultiRLModule` can split up the forward passes into individual submodules' forward passes using the individual batches under the respective ``ModuleIDs``. See :ref:`here for how to write your own multi-module or multi-agent forward logic ` and override this default behavior of :py:class:`~ray.rllib.core.rl_module.multi_rl_module.MultiRLModule`. Finally, if you have a stateful :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule`, for example an LSTM, RLlib adds two additional default connector pieces to the pipeline, :py:class:`~ray.rllib.connectors.common.add_time_dim_to_batch_and_zero_pad.AddTimeDimToBatchAndZeroPad` and :py:class:`~ray.rllib.connectors.common.add_states_from_episodes_to_batch.AddStatesFromEpisodesToBatch`: .. figure:: images/connector_v2/pipeline_batch_phases_single_agent_w_states.svg :width: 900 :align: left **Batch construction for stateful models**: For stateful :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` instances, RLlib automatically adds two additional default connector pieces to the pipeline. The :py:class:`~ray.rllib.connectors.common.add_time_dim_to_batch_and_zero_pad.AddTimeDimToBatchAndZeroPad` piece converts all lists of individual data items on the lowest batch level into sequences of a fixed length (``max_seq_len``, see note below for how to set this) and automatically zero-pads these if it encounters an episode end. The :py:class:`~ray.rllib.connectors.common.add_states_from_episodes_to_batch.AddStatesFromEpisodesToBatch` piece adds the previously generated ``state_out`` values of your :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` under the ``state_in`` column name to the batch. Note that RLlib only adds the ``state_in`` values for the first timestep in each sequence and therefore also doesn't add a time dimension to the data in the ``state_in`` column. .. note:: To change the zero-padded sequence length for the :py:class:`~ray.rllib.connectors.common.add_time_dim_to_batch_and_zero_pad.AddTimeDimToBatchAndZeroPad` connector, set ``max_seq_len`` in your config. For custom models: .. code-block:: python config.rl_module(model_config={"max_seq_len": ...}) And for RLlib's default models: .. code-block:: python from ray.rllib.core.rl_module.default_model_config import DefaultModelConfig config.rl_module(model_config=DefaultModelConfig(max_seq_len=...)) .. Debugging ConnectorV2 Pipelines .. =============================== .. TODO (sven): Move the following to the "how to contribute to RLlib" page and rename that page "how to develop, debug and contribute to RLlib?" .. You can debug your custom ConnectorV2 pipelines (and any RLlib component in general) through the following simple steps: ..
Run without any remote :py:class:`~ray.rllib.env.env_runner.EnvRunner` workers. After defining your :py:class:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig` object, do: `config.env_runners(num_env_runners=0)`. .. Run without any remote :py:class:`~ray.rllib.core.learner.learner.Learner` workers. After defining your :py:class:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig` object, do: `config.learners(num_learners=0)`. .. Switch off Ray Tune, if applicable. After defining your :py:class:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig` object, do: `algo = config.build()`, then `while True: algo.train()`. .. Set a breakpoint in the ConnectorV2 piece (or any other RLlib component) you would like to debug and start the experiment script in your favorite IDE in debugging mode. .. .. figure:: images/debugging_rllib_in_ide.png --- .. include:: /_includes/rllib/we_are_hiring.rst .. _env-to-module-pipeline-docs: .. grid:: 1 2 3 4 :gutter: 1 :class-container: container pb-3 .. grid-item-card:: :img-top: /rllib/images/connector_v2/connector_generic.svg :class-img-top: pt-2 w-75 d-block mx-auto fixed-height-img .. button-ref:: connector-v2-docs ConnectorV2 overview .. grid-item-card:: :img-top: /rllib/images/connector_v2/env_to_module_connector.svg :class-img-top: pt-2 w-75 d-block mx-auto fixed-height-img .. button-ref:: env-to-module-pipeline-docs Env-to-module pipelines (this page) .. grid-item-card:: :img-top: /rllib/images/connector_v2/learner_connector.svg :class-img-top: pt-2 w-75 d-block mx-auto fixed-height-img .. button-ref:: learner-pipeline-docs Learner pipelines Env-to-module pipelines ======================= .. include:: /_includes/rllib/new_api_stack.rst On each :py:class:`~ray.rllib.env.env_runner.EnvRunner` resides one env-to-module pipeline responsible for handling the data flow from the `gymnasium.Env `__ to the :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule`. .. figure:: images/connector_v2/env_runner_connector_pipelines.svg :width: 1000 :align: left **EnvRunner ConnectorV2 Pipelines**: Both env-to-module and module-to-env pipelines are located on the :py:class:`~ray.rllib.env.env_runner.EnvRunner` workers. The env-to-module pipeline sits between the RL environment, a `gymnasium.Env `__, and the :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule`, and translates ongoing episodes into batches for the model's `forward_...()` methods. .. The module-to-env pipeline serves the other direction, converting the output of the :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule`, such as action logits and action distribution parameters, to actual actions understandable by the `gymnasium.Env `__ and used in the env's next `step()` call. The env-to-module pipeline, when called, performs transformations from a list of ongoing :ref:`Episode objects ` to an ``RLModule``-readable tensor batch and RLlib passes this generated batch as the first argument into the :py:meth:`~ray.rllib.core.rl_module.rl_module.RLModule.forward_inference` or :py:meth:`~ray.rllib.core.rl_module.rl_module.RLModule.forward_exploration` methods of the :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule`, depending on your exploration settings. .. hint:: Set `config.exploration(explore=True)` in your :py:class:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig` to have RLlib call the :py:meth:`~ray.rllib.core.rl_module.rl_module.RLModule.forward_exploration` method with the connector's output. 
Otherwise, RLlib calls :py:meth:`~ray.rllib.core.rl_module.rl_module.RLModule.forward_inference`. Note also that these two methods normally only differ in that actions are sampled when ``explore=True`` and greedily picked when ``explore=False``. However, the exact behavior in each case depends on your :ref:`RLModule's implementation `. .. _default-env-to-module-pipeline: Default env-to-module behavior ------------------------------ By default, RLlib populates every env-to-module pipeline with the following built-in connector pieces: * :py:class:`~ray.rllib.connectors.common.add_observations_from_episodes_to_batch.AddObservationsFromEpisodesToBatch`: Places the most recent observation from each ongoing episode into the batch. The column name is ``obs``. Note that if you have a vector of ``N`` environments per :py:class:`~ray.rllib.env.env_runner.EnvRunner`, your batch size is also ``N``. * *Relevant for stateful models only:* :py:class:`~ray.rllib.connectors.common.add_time_dim_to_batch_and_zero_pad.AddTimeDimToBatchAndZeroPad`: If the :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` is stateful, adds a second, single-timestep axis to all data to make it sequential. * *Relevant for stateful models only:* :py:class:`~ray.rllib.connectors.common.add_states_from_episodes_to_batch.AddStatesFromEpisodesToBatch`: If the :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` is stateful, places the most recent state outputs of the module as new state inputs into the batch. The column name is ``state_in`` and the values don't have a time dimension. * *For multi-agent only:* :py:class:`~ray.rllib.connectors.common.agent_to_module_mapping.AgentToModuleMapping`: Maps per-agent data to the respective per-module data depending on your defined agent-to-module mapping function. * :py:class:`~ray.rllib.connectors.common.batch_individual_items.BatchIndividualItems`: Converts all data in the batch, which thus far are lists of individual items, into batched structures, meaning NumPy arrays whose 0th axis is the batch axis. * :py:class:`~ray.rllib.connectors.common.numpy_to_tensor.NumpyToTensor`: Converts all NumPy arrays in the batch into framework-specific tensors and moves these to the GPU, if required. You can disable all the preceding default connector pieces by setting `config.env_runners(add_default_connectors_to_env_to_module_pipeline=False)` in your :ref:`algorithm config `. Note that the order of these transforms matters for the correct functioning of the pipeline. See :ref:`here on how to write and add your own connector pieces ` to the pipeline. Constructing an env-to-module connector --------------------------------------- Normally, you wouldn't have to construct the env-to-module connector pipeline yourself. RLlib's :py:class:`~ray.rllib.env.env_runner.EnvRunner` actors initially perform this operation. However, if you would like to test or debug either the default pipeline or a custom one, use the following code snippet as a starting point: .. testcode:: import gymnasium as gym from ray.rllib.algorithms.ppo import PPOConfig from ray.rllib.env.single_agent_episode import SingleAgentEpisode # Start with an algorithm config. config = ( PPOConfig() .environment("CartPole-v1") ) # Create an env to generate some episode data. env = gym.make("CartPole-v1") # Build the env-to-module connector through the config object.
env_to_module = config.build_env_to_module_connector(env=env, spaces=None) Alternatively, if no ``env`` object is available, pass in the ``spaces`` argument instead. RLlib requires either of these pieces of information to compute the correct output observation space of the pipeline, so that the :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` can receive the correct input space for its own setup procedure. The structure of the `spaces` argument should ideally be: .. code-block:: python spaces = { "__env__": ([env observation space], [env action space]), # <- may be vectorized "__env_single__": ([env observation space], [env action space]), # <- never vectorized! "[module ID, e.g. 'default_policy']": ([module observation space], [module action space]), ... # <- more modules in multi-agent case } However, for single-agent cases, it may be enough to provide the non-vectorized, single observation- and action spaces only: .. testcode:: # No `env` available? Use `spaces` instead: env_to_module = config.build_env_to_module_connector( env=None, spaces={ # At minimum, pass in a 2-tuple of the single, non-vectorized # observation- and action spaces: "__env_single__": (env.observation_space, env.action_space), }, ) To test the actual behavior of the created pipeline, look at these code snippets for the stateless and stateful :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` cases: .. tab-set:: .. tab-item:: Stateless RLModule .. testcode:: from ray.rllib.env.single_agent_episode import SingleAgentEpisode # Create two SingleAgentEpisode instances. You pass these to the connector pipeline # as input. episode1 = SingleAgentEpisode() episode2 = SingleAgentEpisode() # Fill episodes with some data, as if we were currently stepping through them # to collect samples. # - episode 1 (two timesteps) obs, _ = env.reset() episode1.add_env_reset(observation=obs) action = 0 obs, _, _, _, _ = env.step(action) episode1.add_env_step(observation=obs, action=action, reward=1.0) # - episode 2 (just one timestep) obs, _ = env.reset() episode2.add_env_reset(observation=obs) # Call the connector on the two running episodes. batch = {} batch = env_to_module( episodes=[episode1, episode2], batch=batch, rl_module=None, # in stateless case, RLModule is not strictly required explore=True, ) # Print out the resulting batch. print(batch) .. tab-item:: Stateful RLModule (RNN) .. testcode:: from ray.rllib.core.rl_module.default_model_config import DefaultModelConfig from ray.rllib.env.single_agent_episode import SingleAgentEpisode # Alter the config to use the default LSTM model of RLlib. config.rl_module(model_config=DefaultModelConfig(use_lstm=True)) # For stateful RLModules, we do need to pass in the RLModule to every call to the # connector, so construct an instance here. rl_module_spec = config.get_rl_module_spec(env=env) rl_module = rl_module_spec.build() # Create a SingleAgentEpisode instance. You pass this to the connector pipeline # as input. episode = SingleAgentEpisode() # Initialize episode with first (reset) observation. obs, _ = env.reset() episode.add_env_reset(observation=obs) # Call the connector on the running episode. batch = {} batch = env_to_module( episodes=[episode], batch=batch, rl_module=rl_module, # in stateful case, RLModule is required explore=True, ) # Print out the resulting batch. print(batch) You can see that the pipeline extracted the current observations from the two running episodes and placed them under the ``obs`` column into the forward batch.
The batch has a size of two, because we had two episodes, and should look similar to this: .. code-block:: text {'obs': tensor([[ 0.0212, -0.1996, -0.0414, 0.2848], [ 0.0292, 0.0259, -0.0322, -0.0004]])} In the stateful case, you can also expect the ``STATE_IN`` columns to be present. Note that because of the LSTM layer, the internal state of the module consists of two components, ``c`` and ``h``: .. code-block:: text { 'obs': tensor( [[ 0.0212, -0.1996, -0.0414, 0.2848], [ 0.0292, 0.0259, -0.0322, -0.0004]] ), 'state_in': { # Note: The shape of each state tensor here is # (B=2, [num LSTM-layers=1], [LSTM cell size]). 'h': tensor([[[0., 0., .., 0.]]]), 'c': tensor([[[0., 0., ... 0.]]]), }, } .. hint:: You are free to design the internal states of your custom :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` classes however you like. You only need to override the :py:meth:`~ray.rllib.core.rl_module.rl_module.RLModule.get_initial_state` method and make sure you return a new state of any nested structure and shape from your `forward_..()` methods under the fixed ``state_out`` key. See `here for an example `__ of an RLModule class with a custom LSTM layer in it. .. _writing_custom_env_to_module_connectors: Writing custom env-to-module connectors --------------------------------------- You can customize the default env-to-module pipeline that RLlib creates through specifying a function in your :py:class:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig`, which takes an optional RL environment object (`env`) and an optional `spaces` dictionary as input arguments and returns a single :py:class:`~ray.rllib.connectors.connector_v2.ConnectorV2` piece or a list thereof. RLlib prepends these :py:class:`~ray.rllib.connectors.connector_v2.ConnectorV2` instances to the :ref:`default env-to-module pipeline ` in the order returned, unless you set `add_default_connectors_to_env_to_module_pipeline=False` in your config, in which case RLlib exclusively uses the provided :py:class:`~ray.rllib.connectors.connector_v2.ConnectorV2` pieces without any automatically added default behavior. For example, to prepend a custom ConnectorV2 piece to the env-to-module pipeline, you can do this in your config: .. testcode:: :skipif: True # Your builder function must accept an optional `gymnasium.Env` and an optional `spaces` dict # as arguments. config.env_runners( env_to_module_connector=lambda env, spaces, device: MyEnvToModuleConnector(..), ) If you want to add multiple custom pieces to the pipeline, return them as a list: .. testcode:: :skipif: True # Return a list of connector pieces to make RLlib add all of them to your # env-to-module pipeline. config.env_runners( env_to_module_connector=lambda env, spaces, device: [ MyEnvToModuleConnector(..), MyOtherEnvToModuleConnector(..), AndOneMoreConnector(..), ], ) RLlib adds the connector pieces returned by your function to the beginning of the env-to-module pipeline, before the previously described default connector pieces that RLlib provides automatically: .. figure:: images/connector_v2/custom_pieces_in_env_to_module_pipeline.svg :width: 1000 :align: left **Inserting custom ConnectorV2 pieces into the env-to-module pipeline**: RLlib inserts custom connector pieces, such as observation preprocessors, before the default pieces. 
This way, if your custom connectors alter the input episodes in any way, for example by changing the observations as in an :ref:`ObservationPreprocessor `, the trailing default pieces automatically add these changed observations to the batch. .. _observation-preprocessors: Observation preprocessors ~~~~~~~~~~~~~~~~~~~~~~~~~ The simplest way of customizing an env-to-module pipeline is to write your own :py:class:`~ray.rllib.connectors.env_to_module.observation_preprocessor.SingleAgentObservationPreprocessor` subclass, implement two methods, and point your config to the new class: .. testcode:: import gymnasium as gym import numpy as np from ray.rllib.connectors.env_to_module.observation_preprocessor import SingleAgentObservationPreprocessor class IntObservationToOneHotTensor(SingleAgentObservationPreprocessor): """Converts int observations (Discrete) into one-hot tensors (Box).""" def recompute_output_observation_space(self, in_obs_space, in_act_space): # Based on the input observation space, either from the preceding connector piece or # directly from the environment, return the output observation space of this connector # piece. # Implementing this method is crucial for the pipeline to know its output # spaces, which are an important piece of information to construct the succeeding # RLModule. return gym.spaces.Box(0.0, 1.0, (in_obs_space.n,), np.float32) def preprocess(self, observation, episode): # Convert an input observation (int) into a one-hot (float) tensor. # Note that 99% of all connectors in RLlib operate on NumPy arrays. new_obs = np.zeros(shape=self.observation_space.shape, dtype=np.float32) new_obs[observation] = 1.0 return new_obs Note that any observation preprocessor changes the underlying episode objects in place and doesn't contribute anything directly to the batch under construction. Because RLlib always inserts any user-defined preprocessor (and other custom :py:class:`~ray.rllib.connectors.connector_v2.ConnectorV2` pieces) before the default pieces, the :py:class:`~ray.rllib.connectors.common.add_observations_from_episodes_to_batch.AddObservationsFromEpisodesToBatch` default piece then automatically takes care of adding the preprocessed and updated observation from the episode to the batch. Now you can use the custom preprocessor in environments with integer observations, for example the `FrozenLake `__ RL environment: .. testcode:: from ray.rllib.algorithms.ppo import PPOConfig config = ( PPOConfig() # Configure a simple 2x2 grid-world. # ____ # |S | <- S=start position # | G| <- G=goal position # ---- .environment("FrozenLake-v1", env_config={"desc": ["SF", "FG"]}) # Plug your custom connector piece into the env-to-module pipeline. .env_runners( env_to_module_connector=( lambda env, spaces, device: IntObservationToOneHotTensor() ), ) ) algo = config.build() # Train one iteration. print(algo.train()) .. _observation-preprocessors-adding-rewards-to-obs: Example: Adding recent rewards to the batch ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Assume you wrote a custom :ref:`RLModule ` that requires the last three received rewards as input in the calls to any of its `forward_..()` methods. You can use the same :py:class:`~ray.rllib.connectors.env_to_module.observation_preprocessor.SingleAgentObservationPreprocessor` API to achieve this. In the following example, you extract the last three rewards from the ongoing episode and concatenate them with the observation to form a new observation tensor.
Note that you also have to change the observation space returned by the connector, since there are now three more values in each observation: .. testcode:: import gymnasium as gym import numpy as np from ray.rllib.connectors.env_to_module.observation_preprocessor import SingleAgentObservationPreprocessor class AddPastThreeRewards(SingleAgentObservationPreprocessor): """Extracts last three rewards from episode and concatenates them to the observation tensor.""" def recompute_output_observation_space(self, in_obs_space, in_act_space): # Based on the input observation space, return the output observation # space. Implementing this method is crucial for the pipeline to know its output # spaces, which are an important piece of information to construct the succeeding # RLModule. assert isinstance(in_obs_space, gym.spaces.Box) and len(in_obs_space.shape) == 1 return gym.spaces.Box(-100.0, 100.0, (in_obs_space.shape[0] + 3,), np.float32) def preprocess(self, observation, episode): # Extract the last 3 rewards from the ongoing episode using a Python `slice` object. # Alternatively, you can also pass in a list of indices, [-3, -2, -1]. past_3_rewards = episode.get_rewards(indices=slice(-3, None)) # Concatenate the rewards to the actual observation. new_observation = np.concatenate([ observation, np.array(past_3_rewards, np.float32) ]) # Return the new observation. return new_observation .. note:: The preceding example should work without any further action required on your model, whether it's a custom one or a default one provided by RLlib, as long as the model determines its input layer's size through its own ``self.observation_space`` attribute. The connector pipeline correctly captures the observation space changes, from the environment's 1D Box to the reward-enhanced, larger 1D Box, and passes this new observation space to your RLModule's :py:meth:`~ray.rllib.core.rl_module.rl_module.RLModule.setup` method. Example: Preprocessing observations in multi-agent setups ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ In multi-agent setups, you have two options for preprocessing your agents' individual observations by customizing your env-to-module pipeline: 1) Agent-by-agent: Using the same API as in the previous examples, :py:class:`~ray.rllib.connectors.env_to_module.observation_preprocessor.SingleAgentObservationPreprocessor`, you can apply a single preprocessing logic across all agents. However, if you need a distinct preprocessing logic per ``AgentID``, look up the agent information from the provided ``episode`` argument in the :py:meth:`~ray.rllib.connectors.env_to_module.observation_preprocessor.SingleAgentObservationPreprocessor.preprocess` method: .. testcode:: :skipif: True def recompute_output_observation_space(self, in_obs_space, in_act_space): # `in_obs_space` is a `Dict` space, mapping agent IDs to individual agents' spaces. # Alter this dict according to which agents you want to preprocess observations for # and return the new `Dict` space. # For example: return gym.spaces.Dict({ "some_agent_id": [obs space], "other_agent_id": [another obs space], ... }) def preprocess(self, observation, episode): # Skip preprocessing for certain agent ID(s). if episode.agent_id != "some_agent_id": return observation # Preprocess other agents' observations. ...
2) Multi-agent preprocessor with access to the entire multi-agent observation dict: Alternatively, you can subclass the :py:class:`~ray.rllib.connectors.env_to_module.observation_preprocessor.MultiAgentObservationPreprocessor` API and override the same two methods, ``recompute_output_observation_space`` and ``preprocess``. See here for a `2-agent observation preprocessor example `__ showing how to enhance each agent's observations by adding information from the respective other agent to the observations. Use :py:class:`~ray.rllib.connectors.env_to_module.observation_preprocessor.MultiAgentObservationPreprocessor` whenever you need to preprocess observations of an agent by looking up information from other agents, for example their own observations, but also rewards and previous actions. Example: Adding new columns to the batch ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ So far, you have altered the observations in the input episodes, either by :ref:`manipulating them directly ` or :ref:`adding additional information like rewards to them `. RLlib's default env-to-module connectors add the observations found in the episodes to the batch under the ``obs`` column. If you would like to create a new column in the batch, you can subclass :py:class:`~ray.rllib.connectors.connector_v2.ConnectorV2` directly and implement its :py:meth:`~ray.rllib.connectors.connector_v2.ConnectorV2.__call__` method. This way, if you have an :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` that requires certain custom columns to be present in the input batch, write a custom connector piece following this example: .. testcode:: import numpy as np from ray.rllib.connectors.connector_v2 import ConnectorV2 class AddNewColumnToBatch(ConnectorV2): def __init__( self, input_observation_space=None, input_action_space=None, *, col_name: str = "last_3_rewards_mean", ): super().__init__(input_observation_space, input_action_space) self.col_name = col_name def __call__(self, *, episodes, batch, rl_module, explore, shared_data, **kwargs): # Use the convenience `single_agent_episode_iterator` to loop through given episodes. # Even if `episodes` is a list of MultiAgentEpisodes, RLlib splits them up into # their single-agent subcomponents. for sa_episode in self.single_agent_episode_iterator(episodes): # Compute some example new-data item for your `batch` (to be added # under a new column). # Here, we compile the average over the last 3 rewards. last_3_rewards = sa_episode.get_rewards( indices=[-3, -2, -1], fill=0.0, # at beginning of episode, fill with 0s ) new_data_item = np.mean(last_3_rewards) # Use the convenience utility `add_batch_item` to add a new value to # a new or existing column. self.add_batch_item( batch=batch, column=self.col_name, item_to_add=new_data_item, single_agent_episode=sa_episode, ) # Return the altered batch (with the new column in it). return batch ..
testcode:: :hide: config = ( PPOConfig() .environment("CartPole-v1") .env_runners( env_to_module_connector=lambda env, spaces, device: AddNewColumnToBatch() ) ) env = gym.make("CartPole-v1") env_to_module = config.build_env_to_module_connector(env=env, spaces=None) episode = SingleAgentEpisode() obs, _ = env.reset() episode.add_env_reset(observation=obs) action = 0 obs, _, _, _, _ = env.step(action) episode.add_env_step(observation=obs, action=action, reward=1.0) batch = {} batch = env_to_module( episodes=[episode], batch=batch, rl_module=None, # in stateless case, RLModule is not strictly required explore=True, ) # Print out the resulting batch. print(batch) After running the episodes through this connector piece, you should see the new column in the batch. Note, though, that if your :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` also requires the new information in the train batch, you would also need to add the same custom connector piece to your Algorithm's :py:class:`~ray.rllib.connectors.learner.learner_connector_pipeline.LearnerConnectorPipeline`. See :ref:`the Learner connector pipeline documentation ` for more details on how to customize it. --- .. include:: /_includes/rllib/we_are_hiring.rst .. _rllib-external-env-setups-doc: External Environments and Applications ====================================== .. include:: /_includes/rllib/new_api_stack.rst In many situations, it doesn't make sense for an RL environment to be "stepped" by RLlib. For example, if you train a policy inside a complex simulator that operates its own execution loop, like a game engine or a robotics simulation. A natural and user-friendly approach is to flip this setup around and - instead of RLlib "stepping" the env - allow the agents in the simulation to fully control their own stepping. An external RLlib-powered service would be available for either querying individual actions or for accepting batched sample data. The service would cover the task of training the policies, but wouldn't pose any restrictions on when and how often per second the simulation should step. .. figure:: images/envs/external_env_setup_client_inference.svg :align: left :width: 600 **External application with client-side inference**: An external simulator (for example a game engine) connects to RLlib, which runs as a server through a tcp-capable, custom EnvRunner. The simulator sends batches of data from time to time to the server and in turn receives weights updates. For better performance, actions are computed locally on the client side. RLlib provides an `external messaging protocol `__ called :ref:`RLlink ` for this purpose as well as the option to customize your :py:class:`~ray.rllib.env.env_runner.EnvRunner` class toward communicating through :ref:`RLlink ` with one or more clients. An example `tcp-based EnvRunner implementation with RLlink is available here `__. It also contains a dummy (CartPole) client that can be used for testing and as a template for how your external application or simulator should utilize the :ref:`RLlink ` protocol. .. note:: External application support is still work-in-progress on RLlib's new API stack. The Ray team is working on more examples for custom EnvRunner implementations (besides `the already available tcp-based one `__) as well as various client-side, non-Python RLlib adapters, for example for popular game engines and other simulation software. ..
_rllink-protocol-docs: The RLlink Protocol ------------------- RLlink is a simple, stateful protocol designed for communication between a reinforcement learning (RL) server (for example, RLlib) and an external client acting as an environment simulator. The protocol enables seamless exchange of RL-specific data such as episodes, configuration, and model weights, while also facilitating on-policy training workflows. Key Features ~~~~~~~~~~~~ - **Stateful Design**: The protocol maintains some state through sequences of message exchanges (for example, request-response pairs like `GET_CONFIG` -> `SET_CONFIG`). - **Strict Request-Response Design**: The protocol is strictly (client) request -> (server) response based. Due to the necessity to let the client simulation run in its own execution loop, the server side refrains from sending any unsolicited messages to the clients. - **RL-Specific Capabilities**: Tailored for RL workflows, including episode handling, model weight updates, and configuration management. - **Flexible Sampling**: Supports both on-policy and off-policy data collection modes. - **JSON**: For reasons of better debugging and faster iterations, the first versions of RLlink are entirely JSON-based, non-encrypted, and non-secure. Message Structure ~~~~~~~~~~~~~~~~~ RLlink messages consist of a header and a body: - **Header**: An 8-byte length field indicating the size of the body, for example `00000016` for a body of length 16. The total message size is therefore the 8 header bytes plus the body length. - **Body**: JSON-encoded content with a `type` field indicating the message type. Example Messages: PING and EPISODES_AND_GET_STATE +++++++++++++++++++++++++++++++++++++++++++++++++ Here is a complete example of the `PING` message. Note the 8-byte header encoding the length of the following body (`16`), followed by the message body with the mandatory "type" field. .. code-block:: 00000016{"type": "PING"} The client should send the `PING` message after initiating a new connection. The server then responds with: .. code-block:: 00000016{"type": "PONG"} Here is an example of an `EPISODES_AND_GET_STATE` message sent by the client to the server and carrying a batch of sampling data. With the same message, the client asks the server to send back the updated model weights. .. _example-rllink-episode-and-get-state-msg: .. code-block:: javascript { "type": "EPISODES_AND_GET_STATE", "episodes": [ { "obs": [[...]], // List of observations "actions": [...], // List of actions "rewards": [...], // List of rewards "is_terminated": false, "is_truncated": false } ], "env_steps": 128 } Overview of all Message Types ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Requests: Client → Server +++++++++++++++++++++++++ - **``PING``** - Example: ``{"type": "PING"}`` - Purpose: Initial handshake to establish communication. - Expected Response: ``{"type": "PONG"}``. - **``GET_CONFIG``** - Example: ``{"type": "GET_CONFIG"}`` - Purpose: Request the relevant configuration (for example, how many timesteps to collect for a single `EPISODES_AND_GET_STATE` message; see below). - Expected Response: ``{"type": "SET_CONFIG", "env_steps_per_sample": 500, "force_on_policy": true}``. - **``EPISODES_AND_GET_STATE``** - Example: :ref:`See here for an example message ` - Purpose: Combine ``EPISODES`` and ``GET_STATE`` into a single request. This is useful for workflows requiring on-policy (synchronous) updates to model weights after data collection.
- Body: - ``episodes``: A list of JSON objects (dicts), each with mandatory keys "obs" (list of observations in the episode), "actions" (list of actions in the episode), "rewards" (list of rewards in the episode), "is_terminated" (bool), and "is_truncated" (bool). Note that the "obs" list has one more item than the lists for "actions" and "rewards" due to the initial "reset" observation. - ``weights_seq_no``: Sequence number for the model weights version, ensuring synchronization. - Expected Response: ``{"type": "SET_STATE", "weights_seq_no": 123, "onnx_file": ".. [base64 encoded ONNX model file] .."}``. Responses: Server → Client ++++++++++++++++++++++++++ - **``PONG``** - Example: ``{"type": "PONG"}`` - Purpose: Acknowledgment of the ``PING`` request to confirm connectivity. - **``SET_STATE``** - Example: ``{"type": "SET_STATE", "weights_seq_no": 123, "onnx_file": "... [base64 encoded ONNX file] ..."}`` - Purpose: Provide the client with the current state (for example, model weights). - Body: - ``onnx_file``: Base64-encoded, compressed ONNX model file. - ``weights_seq_no``: Sequence number for the model weights, ensuring synchronization. - **``SET_CONFIG``** - Purpose: Send relevant configuration details to the client. - Body: - ``env_steps_per_sample``: Number of total env steps collected for one ``EPISODES_AND_GET_STATE`` message. - ``force_on_policy``: Whether on-policy sampling is enforced. If true, the client should wait after sending the ``EPISODES_AND_GET_STATE`` message for the ``SET_STATE`` response before continuing to collect the next round of samples. Workflow Examples +++++++++++++++++ **Initial Handshake** 1. Client sends ``PING``. 2. Server responds with ``PONG``. **Configuration Request** 1. Client sends ``GET_CONFIG``. 2. Server responds with ``SET_CONFIG``. **Training (on-policy)** 1. Client collects on-policy data and sends ``EPISODES_AND_GET_STATE``. 2. Server processes the episodes and responds with ``SET_STATE``. .. note:: This protocol is an initial draft of an attempt to develop a widely adopted protocol for communication between an external client and a remote RL service. Expect many changes, enhancements, and upgrades as it moves toward maturity, including adding a safety layer and compression. For now, however, it offers a lightweight, simple, yet powerful interface for integrating external environments with RL frameworks. Example: External client connecting to tcp-based EnvRunner ---------------------------------------------------------- An example `tcp-based EnvRunner implementation with RLlink is available here `__. See `here for the full end-to-end example `__. Feel free to alter the underlying logic of your custom EnvRunner; for example, you could implement a shared-memory-based communication layer instead of the tcp-based one. --- .. include:: /_includes/rllib/we_are_hiring.rst .. _rllib-getting-started: Getting Started =============== .. include:: /_includes/rllib/new_api_stack.rst .. _rllib-in-60min: RLlib in 60 minutes ------------------- .. figure:: images/rllib-index-header.svg In this tutorial, you learn how to design, customize, and run an end-to-end RLlib learning experiment from scratch.
This includes picking and configuring an :py:class:`~ray.rllib.algorithms.algorithm.Algorithm`, running a couple of training iterations, saving the state of your :py:class:`~ray.rllib.algorithms.algorithm.Algorithm` from time to time, running a separate evaluation loop, and finally utilizing one of the checkpoints to deploy your trained model to an environment outside of RLlib and compute actions. You also learn how to customize your :ref:`RL environment ` and your :ref:`neural network model `. Installation ~~~~~~~~~~~~ First, install RLlib, `PyTorch `__, and `Farama Gymnasium `__ as shown below: .. code-block:: bash pip install "ray[rllib]" torch "gymnasium[atari,accept-rom-license,mujoco]" .. _rllib-python-api: Python API ~~~~~~~~~~ RLlib's Python API provides all the flexibility required for applying the library to any type of RL problem. You manage RLlib experiments through an instance of the :py:class:`~ray.rllib.algorithms.algorithm.Algorithm` class. An :py:class:`~ray.rllib.algorithms.algorithm.Algorithm` typically holds a neural network for computing actions, called ``policy``, the :ref:`RL environment ` that you want to optimize against, a loss function, an optimizer, and some code describing the algorithm's execution logic, like determining when to collect samples, when to update your model, etc.. In :ref:`multi-agent training `, :py:class:`~ray.rllib.algorithms.algorithm.Algorithm` manages the querying and optimization of multiple policies at once. Through the algorithm's interface, you can train the policy, compute actions, or store your algorithm's state through checkpointing. Configure and build the algorithm +++++++++++++++++++++++++++++++++ You first create an :py:class:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig` instance and change some default settings through the config object's various methods. For example, you can set the :ref:`RL environment ` you want to use by calling the config's :py:meth:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig.environment` method: .. testcode:: from ray.rllib.algorithms.ppo import PPOConfig # Create a config instance for the PPO algorithm. config = ( PPOConfig() .environment("Pendulum-v1") ) To scale your setup and define how many :py:class:`~ray.rllib.env.env_runner.EnvRunner` actors you want to leverage, you can call the :py:meth:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig.env_runners` method. ``EnvRunners`` are used to collect samples for training updates from your :ref:`environment `. .. testcode:: config.env_runners(num_env_runners=2) For training-related settings or any algorithm-specific settings, use the :py:meth:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig.training` method: .. testcode:: config.training( lr=0.0002, train_batch_size_per_learner=2000, num_epochs=10, ) Finally, you build the actual :py:class:`~ray.rllib.algorithms.algorithm.Algorithm` instance through calling your config's :py:meth:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig.build_algo` method. .. testcode:: # Build the Algorithm (PPO). ppo = config.build_algo() .. note:: See here to learn about all the :ref:`methods you can use to configure your Algorithm `. Run the algorithm +++++++++++++++++ After you built your :ref:`PPO ` from its configuration, you can ``train`` it for a number of iterations through calling the :py:meth:`~ray.rllib.algorithms.algorithm.Algorithm.train` method, which returns a result dictionary that you can pretty-print for debugging purposes: .. 
testcode:: from pprint import pprint for _ in range(4): pprint(ppo.train()) Checkpoint the algorithm ++++++++++++++++++++++++ To save the current state of your :py:class:`~ray.rllib.algorithms.algorithm.Algorithm`, create a checkpoint through calling its :py:meth:`~ray.rllib.algorithms.algorithm.Algorithm.save_to_path` method, which returns the directory of the saved checkpoint. If you don't pass any arguments to this call, the :py:class:`~ray.rllib.algorithms.algorithm.Algorithm` decides where to save the checkpoint. Alternatively, you can provide a checkpoint directory yourself: .. testcode:: checkpoint_path = ppo.save_to_path() # OR: # ppo.save_to_path([a checkpoint location of your choice]) Evaluate the algorithm ++++++++++++++++++++++ RLlib supports setting up a separate :py:class:`~ray.rllib.env.env_runner_group.EnvRunnerGroup` for the sole purpose of evaluating your model from time to time on the :ref:`RL environment `. Use your config's :py:meth:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig.evaluation` method to set up the details. By default, RLlib doesn't perform evaluation during training and only reports the results of collecting training samples with its "regular" :py:class:`~ray.rllib.env.env_runner_group.EnvRunnerGroup`. .. testcode:: :hide: ppo.stop() .. testcode:: config.evaluation( # Run one evaluation round every iteration. evaluation_interval=1, # Create 2 eval EnvRunners in the extra EnvRunnerGroup. evaluation_num_env_runners=2, # Run evaluation for exactly 10 episodes. Note that because you have # 2 EnvRunners, each one runs through 5 episodes. evaluation_duration_unit="episodes", evaluation_duration=10, ) # Rebuild the PPO, but with the extra evaluation EnvRunnerGroup. ppo_with_evaluation = config.build_algo() for _ in range(3): pprint(ppo_with_evaluation.train()) .. testcode:: :hide: ppo_with_evaluation.stop() .. _rllib-with-ray-tune: RLlib with Ray Tune +++++++++++++++++++ All online RLlib :py:class:`~ray.rllib.algorithms.algorithm.Algorithm` classes are compatible with the :ref:`Ray Tune API `. .. note:: The offline RL algorithms, like :ref:`BC `, :ref:`CQL `, and :ref:`MARWIL `, require more work on :ref:`Tune ` and :ref:`Ray Data ` to add Ray Tune support. This integration allows for utilizing your configured :py:class:`~ray.rllib.algorithms.algorithm.Algorithm` in :ref:`Ray Tune ` experiments. For example, the following code performs a hyper-parameter sweep of your :ref:`PPO `, creating three ``Trials``, one for each of the configured learning rates: .. testcode:: from ray import tune from ray.rllib.algorithms.ppo import PPOConfig config = ( PPOConfig() .environment("Pendulum-v1") # Specify a simple tune hyperparameter sweep. .training( lr=tune.grid_search([0.001, 0.0005, 0.0001]), ) ) # Create a Tuner instance to manage the trials. tuner = tune.Tuner( config.algo_class, param_space=config, # Specify a stopping criterion. Note that the criterion has to match one of the # pretty printed result metrics from the results returned previously by # ``.train()``. Also note that -1100 is not a good episode return for # Pendulum-v1; it's used here only to shorten the experiment time. run_config=tune.RunConfig( stop={"env_runners/episode_return_mean": -1100.0}, ), ) # Run the Tuner and capture the results.
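    # `fit()` runs all configured trials until the stopping criterion triggers and
    # then returns a `ResultGrid` object holding the results and checkpoints of all
    # trials (see below).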
results = tuner.fit() Note that each :py:class:`~ray.tune.trial.Trial` creates a separate :py:class:`~ray.rllib.algorithms.algorithm.Algorithm` instance as a :ref:`Ray actor `, assigns compute resources to each ``Trial``, and runs them in parallel, if possible, on your Ray cluster: .. code-block:: text Trial status: 3 RUNNING Current time: 2025-01-17 18:47:33. Total running time: 3min 0s Logical resource usage: 9.0/12 CPUs, 0/0 GPUs ╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────╮ │ Trial name status lr iter total time (s) episode_return_mean .._sampled_lifetime │ ├───────────────────────────────────────────────────────────────────────────────────────────────────────────────┤ │ PPO_Pendulum-v1_b5c41_00000 RUNNING 0.001 29 86.2426 -998.449 108000 │ │ PPO_Pendulum-v1_b5c41_00001 RUNNING 0.0005 25 74.4335 -997.079 100000 │ │ PPO_Pendulum-v1_b5c41_00002 RUNNING 0.0001 20 60.0421 -960.293 80000 │ ╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ ``Tuner.fit()`` returns a ``ResultGrid`` object that allows for a detailed analysis of the training process and for retrieving the :ref:`checkpoints ` of the trained algorithms and their models: .. testcode:: # Get the best result of the final iteration, based on a particular metric. best_result = results.get_best_result( metric="env_runners/episode_return_mean", mode="max", scope="last", ) # Get the best checkpoint corresponding to the best result # from the preceding experiment. best_checkpoint = best_result.checkpoint Deploy a trained model for inference ++++++++++++++++++++++++++++++++++++ After training, you might want to deploy your models into a new environment, for example to run inference in production. For this purpose, you can use the checkpoint directory created in the preceding example. To read more about checkpoints, model deployments, and restoring algorithm state, see this :ref:`page on checkpointing ` here. Here is how you would create a new model instance from the checkpoint and run inference through a single episode of your RL environment. Note in particular the use of the :py:meth:`~ray.rllib.utils.checkpoints.Checkpointable.from_checkpoint` method to create the model and the :py:meth:`~ray.rllib.core.rl_module.rl_module.RLModule.forward_inference` method to compute actions: .. testcode:: from pathlib import Path import gymnasium as gym import numpy as np import torch from ray.rllib.core.rl_module import RLModule # Create only the neural network (RLModule) from our algorithm checkpoint. # See here (https://docs.ray.io/en/master/rllib/checkpoints.html) # to learn more about checkpointing and the specific "path" used. rl_module = RLModule.from_checkpoint( Path(best_checkpoint.path) / "learner_group" / "learner" / "rl_module" / "default_policy" ) # Create the RL environment to test against (same as was used for training earlier). env = gym.make("Pendulum-v1", render_mode="human") episode_return = 0.0 done = False # Reset the env to get the initial observation. obs, info = env.reset() while not done: # Uncomment this line to render the env. # env.render() # Compute the next action from a batch (B=1) of observations. obs_batch = torch.from_numpy(obs).unsqueeze(0) # add batch B=1 dimension model_outputs = rl_module.forward_inference({"obs": obs_batch}) # Extract the action distribution parameters from the output and dissolve batch dim. 
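        # For Pendulum-v1's 1D continuous (Box) action space, RLlib's default action
        # distribution is a diagonal Gaussian, so `action_dist_inputs` holds one mean
        # and one log-stddev value per action dimension.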
action_dist_params = model_outputs["action_dist_inputs"][0].numpy() # We have continuous actions -> take the mean (max likelihood). greedy_action = np.clip( action_dist_params[0:1], # 0=mean, 1=log(stddev), [0:1]=use mean, but keep shape=(1,) a_min=env.action_space.low[0], a_max=env.action_space.high[0], ) # For discrete actions, you should take the argmax over the logits: # greedy_action = np.argmax(action_dist_params) # Send the action to the environment for the next step. obs, reward, terminated, truncated, info = env.step(greedy_action) # Perform env-loop bookkeeping. episode_return += reward done = terminated or truncated print(f"Reached episode return of {episode_return}.") Alternatively, if you still have an :py:class:`~ray.rllib.algorithms.algorithm.Algorithm` instance up and running in your script, you can get the :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` through the :py:meth:`~ray.rllib.algorithms.algorithm.Algorithm.get_module` method: .. code-block:: python rl_module = ppo.get_module("default_policy") # Equivalent to `rl_module = ppo.get_module()` Customizing your RL environment ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ In the preceding examples, your :ref:`RL environment ` was a `Farama gymnasium `__ pre-registered one, like ``Pendulum-v1`` or ``CartPole-v1``. However, if you would like to run your experiments against a custom one, see the dropdown below for a less-than-50-line example. See here for an :ref:`in-depth guide on how to set up RL environments in RLlib ` and how to customize them. .. dropdown:: Quickstart: Custom RL environment :animate: fade-in-slide-down .. testcode:: import gymnasium as gym import numpy as np from ray.rllib.algorithms.ppo import PPOConfig # Define your custom env class by subclassing gymnasium.Env: class ParrotEnv(gym.Env): """Environment in which the agent learns to repeat the seen observations. Observations are float numbers indicating the to-be-repeated values, e.g. -1.0, 5.1, or 3.2. The action space is the same as the observation space. Rewards are `r=-abs([observation] - [action])`, for all steps. """ def __init__(self, config=None): # Since actions should repeat observations, their spaces must be the same. self.observation_space = config.get( "obs_act_space", gym.spaces.Box(-1.0, 1.0, (1,), np.float32), ) self.action_space = self.observation_space self._cur_obs = None self._episode_len = 0 def reset(self, *, seed=None, options=None): """Resets the environment, starting a new episode.""" # Reset the episode len. self._episode_len = 0 # Sample a random number from our observation space. self._cur_obs = self.observation_space.sample() # Return initial observation. return self._cur_obs, {} def step(self, action): """Takes a single step in the episode given `action`.""" # Set `terminated` and `truncated` flags to True after 10 steps. self._episode_len += 1 terminated = truncated = self._episode_len >= 10 # Compute the reward: `r = -abs([obs] - [action])` reward = -sum(abs(self._cur_obs - action)) # Set a new observation (random sample). self._cur_obs = self.observation_space.sample() return self._cur_obs, reward, terminated, truncated, {} # Point your config to your custom env class: config = ( PPOConfig() .environment( ParrotEnv, # Add `env_config={"obs_act_space": [some Box space]}` to customize. ) ) # Build a PPO algorithm and train it. ppo_w_custom_env = config.build_algo() ppo_w_custom_env.train() ..
testcode:: :hide: ppo_w_custom_env.stop() Customizing your models ~~~~~~~~~~~~~~~~~~~~~~~ In the preceding examples, because you didn't specify anything in your :py:class:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig`, RLlib provided a default neural network model. If you would like to either reconfigure the type and size of RLlib's default models, for example define the number of hidden layers and their activation functions, or even write your own custom models from scratch using PyTorch, see here for a :ref:`detailed guide on the RLModule class `. See the dropdown below for a 30-line example. .. dropdown:: Quickstart: Custom RLModule :animate: fade-in-slide-down .. testcode:: import torch from ray.rllib.core.columns import Columns from ray.rllib.core.rl_module.torch import TorchRLModule # Define your custom RLModule class by subclassing `TorchRLModule`: class CustomTorchRLModule(TorchRLModule): def setup(self): # You have access here to the following already set attributes: # self.observation_space # self.action_space # self.inference_only # self.model_config # <- a dict with custom settings input_dim = self.observation_space.shape[0] hidden_dim = self.model_config["hidden_dim"] output_dim = self.action_space.n # Define and assign your torch subcomponents. self._policy_net = torch.nn.Sequential( torch.nn.Linear(input_dim, hidden_dim), torch.nn.ReLU(), torch.nn.Linear(hidden_dim, output_dim), ) def _forward(self, batch, **kwargs): # Push the observations from the batch through our `self._policy_net`. action_logits = self._policy_net(batch[Columns.OBS]) # Return parameters for the default action distribution, which is # `TorchCategorical` (due to our action space being `gym.spaces.Discrete`). return {Columns.ACTION_DIST_INPUTS: action_logits} --- .. include:: /_includes/rllib/we_are_hiring.rst .. _rllib-hierarchical-environments-doc: Hierarchical Environments ========================= .. include:: /_includes/rllib/new_api_stack.rst You can implement hierarchical training as a special case of multi-agent RL. For example, consider a two-level hierarchy of policies, where a top-level policy issues high-level tasks that are executed at a finer timescale by one or more low-level policies. The following timeline shows one step of the top-level policy, which corresponds to four low-level actions: .. code-block:: text top-level: action_0 -------------------------------------> action_1 -> low-level: action_0 -> action_1 -> action_2 -> action_3 -> action_4 -> Alternatively, you could implement an environment in which the two agent types don't act at the same time. Instead, the low-level agents wait for the high-level agent to issue an action, then act n times before handing back control to the high-level agent: .. code-block:: text top-level: action_0 -----------------------------------> action_1 -> low-level: ---------> action_0 -> action_1 -> action_2 ------------> You can implement any of these hierarchical action patterns as a multi-agent environment with various types of agents, for example a high-level agent and a low-level agent. When set up using the correct agent-to-module mapping functions, from RLlib's perspective, the problem becomes a simple independent multi-agent problem with different types of policies. Your configuration might look something like the following: ..
testcode:: from ray.rllib.algorithms.ppo import PPOConfig config = ( PPOConfig() .multi_agent( policies={"top_level", "low_level"}, policy_mapping_fn=( lambda aid, eps, **kw: "low_level" if aid.startswith("low_level") else "top_level" ), policies_to_train=["top_level"], ) ) In this setup, the appropriate rewards at any hierarchy level should be provided by the multi-agent env implementation. The environment class is also responsible for routing between agents, for example conveying `goals `__ from higher-level agents to lower-level agents as part of the lower-level agent observation. See `this runnable example of a hierarchical env `__. --- .. include:: /_includes/rllib/we_are_hiring.rst .. sphinx_rllib_readme_begin .. _rllib-index: RLlib: Industry-Grade, Scalable Reinforcement Learning ====================================================== .. include:: /_includes/rllib/new_api_stack.rst .. image:: images/rllib-logo.png :align: center .. sphinx_rllib_readme_end .. todo (sven): redo toctree: suggestion: getting-started key-concepts rllib-env (single-agent) ... <- multi-agent ... <- external ... <- hierarchical algorithm-configs rllib-algorithms (overview of all available algos) dev-guide (replaces user-guides) debugging scaling-guide fault-tolerance checkpoints callbacks metrics-logger rllib-advanced-api algorithm (general description of how algos work) rl-modules rllib-offline single-agent-episode multi-agent-episode connector-v2 rllib-learner env-runners rllib-examples new-api-stack-migration-guide package_ref/index .. toctree:: :hidden: getting-started key-concepts rllib-env algorithm-config rllib-algorithms user-guides rllib-examples new-api-stack-migration-guide package_ref/index .. sphinx_rllib_readme_2_begin **RLlib** is an open source library for reinforcement learning (**RL**), offering support for production-level, highly scalable, and fault-tolerant RL workloads, while maintaining simple and unified APIs for a large variety of industry applications. Whether training policies in a **multi-agent** setup, from historic **offline** data, or using **externally connected simulators**, RLlib offers simple solutions for each of these autonomous decision making needs and enables you to start running your experiments within hours. Industry leaders use RLlib in production in many different verticals, such as `gaming `_, `robotics `_, `finance `_, `climate- and industrial control `_, `manufacturing and logistics `_, `automobile `_, and `boat design `_. RLlib in 60 seconds ------------------- .. figure:: images/rllib-index-header.svg It only takes a few steps to get your first RLlib workload up and running on your laptop. Install RLlib and `PyTorch `__, as shown below: .. code-block:: bash pip install "ray[rllib]" torch .. note:: For installation on computers running Apple Silicon, such as M1, `follow instructions here. `_ .. note:: To be able to run the Atari or MuJoCo examples, you also need to do: .. code-block:: bash pip install "gymnasium[atari,accept-rom-license,mujoco]" This is all, you can now start coding against RLlib. Here is an example for running the :ref:`PPO Algorithm ` on the `Taxi domain `__. You first create a `config` for the algorithm, which defines the :ref:`RL environment ` and any other needed settings and parameters. .. testcode:: from ray.rllib.algorithms.ppo import PPOConfig from ray.rllib.connectors.env_to_module import FlattenObservations # Configure the algorithm. 
config = ( PPOConfig() .environment("Taxi-v3") .env_runners( num_env_runners=2, # Observations are discrete (ints) -> We need to flatten (one-hot) them. env_to_module_connector=lambda env: FlattenObservations(), ) .evaluation(evaluation_num_env_runners=1) ) Next, ``build`` the algorithm and ``train`` it for a total of five iterations. One training iteration includes parallel, distributed sample collection by the :py:class:`~ray.rllib.env.env_runner.EnvRunner` actors, followed by loss calculation on the collected data, and a model update step. .. testcode:: from pprint import pprint # Build the algorithm. algo = config.build_algo() # Train it for 5 iterations ... for _ in range(5): pprint(algo.train()) At the end of your script, you evaluate the trained Algorithm and release all its resources: .. testcode:: # ... and evaluate it. pprint(algo.evaluate()) # Release the algo's resources (remote actors, like EnvRunners and Learners). algo.stop() You can use any `Farama-Foundation Gymnasium `__ registered environment with the ``env`` argument. In ``config.env_runners()`` you can specify - amongst many other things - the number of parallel :py:class:`~ray.rllib.env.env_runner.EnvRunner` actors to collect samples from the environment. You can also change the NN architecture by tweaking RLlib's :py:class:`~ray.rllib.core.rl_module.default_model_config.DefaultModelConfig`, as well as set up a separate config for the evaluation :py:class:`~ray.rllib.env.env_runner.EnvRunner` actors through the ``config.evaluation()`` method. :ref:`See here ` if you want to learn more about the RLlib training APIs. Also, `see here `__ for a simple example of how to write an action inference loop after training. If you want to get a quick preview of which **algorithms** and **environments** RLlib supports, click the dropdowns below: ..
dropdown:: **RLlib Algorithms** :animate: fade-in-slide-down +-------------------------------------------------------------------------+----------------+---------------+-------------+------------+-------------+------------------------+ | **On-Policy** | +-------------------------------------------------------------------------+----------------+---------------+-------------+------------+-------------+------------------------+ | :ref:`PPO (Proximal Policy Optimization) ` | |single_agent| | |multi_agent| | |discr_act| | |cont_act| | |multi_gpu| | |multi_node_multi_gpu| | +-------------------------------------------------------------------------+----------------+---------------+-------------+------------+-------------+------------------------+ | **Off-Policy** | +-------------------------------------------------------------------------+----------------+---------------+-------------+------------+-------------+------------------------+ | :ref:`SAC (Soft Actor Critic) ` | |single_agent| | |multi_agent| | | |cont_act| | |multi_gpu| | |multi_node_multi_gpu| | +-------------------------------------------------------------------------+----------------+---------------+-------------+------------+-------------+------------------------+ | :ref:`DQN/Rainbow (Deep Q Networks) ` | |single_agent| | |multi_agent| | |discr_act| | | |multi_gpu| | |multi_node_multi_gpu| | +-------------------------------------------------------------------------+----------------+---------------+-------------+------------+-------------+------------------------+ | **High-throughput Architectures** | +-------------------------------------------------------------------------+----------------+---------------+-------------+------------+-------------+------------------------+ | :ref:`APPO (Asynchronous Proximal Policy Optimization) ` | |single_agent| | |multi_agent| | |discr_act| | |cont_act| | |multi_gpu| | |multi_node_multi_gpu| | +-------------------------------------------------------------------------+----------------+---------------+-------------+------------+-------------+------------------------+ | :ref:`IMPALA (Importance Weighted Actor-Learner Architecture) ` | |single_agent| | |multi_agent| | |discr_act| | | |multi_gpu| | |multi_node_multi_gpu| | +-------------------------------------------------------------------------+----------------+---------------+-------------+------------+-------------+------------------------+ | **Model-based RL** | +-------------------------------------------------------------------------+----------------+---------------+-------------+------------+-------------+------------------------+ | :ref:`DreamerV3 ` | |single_agent| | | |discr_act| | |cont_act| | |multi_gpu| | |multi_node_multi_gpu| | +-------------------------------------------------------------------------+----------------+---------------+-------------+------------+-------------+------------------------+ | **Offline RL and Imitation Learning** | +-------------------------------------------------------------------------+----------------+---------------+-------------+------------+-------------+------------------------+ | :ref:`BC (Behavior Cloning) ` | |single_agent| | | |discr_act| | |cont_act| | | | +-------------------------------------------------------------------------+----------------+---------------+-------------+------------+-------------+------------------------+ | :ref:`CQL (Conservative Q-Learning) ` | |single_agent| | | | |cont_act| | | | 
+-------------------------------------------------------------------------+----------------+---------------+-------------+------------+-------------+------------------------+ | :ref:`MARWIL (Advantage Re-Weighted Imitation Learning) ` | |single_agent| | | |discr_act| | |cont_act| | | | +-------------------------------------------------------------------------+----------------+---------------+-------------+------------+-------------+------------------------+ .. dropdown:: **RLlib Environments** :animate: fade-in-slide-down +-------------------------------------------------------------------------------------------+ | **Farama-Foundation Environments** | +-------------------------------------------------------------------------------------------+ | `gymnasium `__ |single_agent| | | | | .. code-block:: bash | | | | pip install "gymnasium[atari,accept-rom-license,mujoco]"`` | | | | .. code-block:: python | | | | config.environment("CartPole-v1") # Classic Control | | config.environment("ale_py:ALE/Pong-v5") # Atari | | config.environment("Hopper-v5") # MuJoCo | +-------------------------------------------------------------------------------------------+ | `PettingZoo `__ |multi_agent| | | | | .. code-block:: bash | | | | pip install "pettingzoo[all]" | | | | .. code-block:: python | | | | from ray.tune.registry import register_env | | from ray.rllib.env.wrappers.pettingzoo_env import PettingZooEnv | | from pettingzoo.sisl import waterworld_v4 | | register_env("env", lambda _: PettingZooEnv(waterworld_v4.env())) | | config.environment("env") | +-------------------------------------------------------------------------------------------+ | **RLlib Multi-Agent** | +-------------------------------------------------------------------------------------------+ | `RLlib's MultiAgentEnv API `__ |multi_agent| | | | | .. code-block:: python | | | | from ray.rllib.examples.envs.classes.multi_agent import MultiAgentCartPole | | from ray import tune | | tune.register_env("env", lambda cfg: MultiAgentCartPole(cfg)) | | config.environment("env", env_config={"num_agents": 2}) | | config.multi_agent( | | policies={"p0", "p1"}, | | policy_mapping_fn=lambda aid, *a, **kw: f"p{aid}", | | ) | +-------------------------------------------------------------------------------------------+ Why chose RLlib? ---------------- .. dropdown:: **Scalable and Fault-Tolerant** :animate: fade-in-slide-down RLlib workloads scale along various axes: - The number of :py:class:`~ray.rllib.env.env_runner.EnvRunner` actors to use. This is configurable through ``config.env_runners(num_env_runners=...)`` and allows you to scale the speed of your (simulator) data collection step. This `EnvRunner` axis is fully **fault tolerant**, meaning you can train against custom environments that are unstable or frequently stall execution and even place all your `EnvRunner` actors on spot machines. - The number of :py:class:`~ray.rllib.core.learner.Learner` actors to use for **multi-GPU training**. This is configurable through ``config.learners(num_learners=...)`` and you normally set this to the number of GPUs available (make sure you then also set ``config.learners(num_gpus_per_learner=1)``) or - if you do not have GPUs - you can use this setting for **DDP-style learning on CPUs** instead. .. dropdown:: **Multi-Agent Reinforcement Learning (MARL)** :animate: fade-in-slide-down RLlib natively supports multi-agent reinforcement learning (MARL), thereby allowing you to run in any complex configuration. 
- **Independent** multi-agent learning (the default): Every agent collects data for updating its own policy network, interpreting other agents as part of the environment. - **Collaborative** training: Train a team of agents that either all share the same policy (shared parameters) or in which some agents have their own policy network(s). You can also share value functions between all members of the team or some of them, as you see fit, thus allowing for global vs local objectives to be optimized. - **Adversarial** training: Have agents play against other agents in competitive environments. Use self-play, or league based self-play to train your agents to learn how to play throughout various stages of ever increasing difficulty. - **Any combination of the above!** Yes, you can train teams of arbitrary sizes of agents playing against other teams where the agents in each team might have individual sub-objectives and there are groups of neutral agents not participating in any competition. .. dropdown:: **Offline RL and Behavior Cloning** :animate: fade-in-slide-down **Ray.Data** has been integrated into RLlib, enabling **large-scale data ingestion** for offline RL and behavior cloning (BC) workloads. See here for a basic `tuned example for the behavior cloning algo `__ and here for how to `pre-train a policy with BC, then finetuning it with online PPO `__. .. dropdown:: **Support for External Env Clients** :animate: fade-in-slide-down **Support for externally connecting RL environments** is achieved through customizing the :py:class:`~ray.rllib.env.env_runner.EnvRunner` logic from RLlib-owned, internal gymnasium envs to external, TCP-connected Envs that act independently and may even perform their own action inference, e.g. through ONNX. See here for an example of `RLlib acting as a server with connecting external env TCP-clients `__. Learn More ---------- .. grid:: 1 2 3 3 :gutter: 1 :class-container: container pb-4 .. grid-item-card:: **RLlib Key Concepts** ^^^ Learn more about the core concepts of RLlib, such as Algorithms, environments, models, and learners. +++ .. button-ref:: rllib-key-concepts :color: primary :outline: :expand: Key Concepts .. grid-item-card:: **RL Environments** ^^^ Get started with environments supported by RLlib, such as Farama foundation's Gymnasium, Petting Zoo, and many custom formats for vectorized and multi-agent environments. +++ .. button-ref:: rllib-environments-doc :color: primary :outline: :expand: Environments .. grid-item-card:: **Models (RLModule)** ^^^ Learn how to configure RLlib's default models and implement your own custom models through the RLModule APIs, which support arbitrary architectures with PyTorch, complex multi-model setups, and multi-agent models with components shared between agents. +++ .. button-ref:: rlmodule-guide :color: primary :outline: :expand: Models (RLModule) .. grid-item-card:: **Algorithms** ^^^ See the many available RL algorithms of RLlib for on-policy and off-policy training, offline- and model-based RL, multi-agent RL, and more. +++ .. button-ref:: rllib-algorithms-doc :color: primary :outline: :expand: Algorithms Customizing RLlib ----------------- RLlib provides powerful, yet easy to use APIs for customizing all aspects of your experimental- and production training-workflows. 
For example, you may code your own `environments `__ in python using the `Farama Foundation's gymnasium `__ or DeepMind's OpenSpiel, provide custom `PyTorch models `_, write your own `optimizer setups and loss definitions `__, or define custom `exploratory behavior `_. .. figure:: images/rllib-new-api-stack-simple.svg :align: left :width: 850 **RLlib's API stack:** Built on top of Ray, RLlib offers off-the-shelf, distributed and fault-tolerant algorithms and loss functions, PyTorch default models, multi-GPU training, and multi-agent support. Users customize their experiments by subclassing the existing abstractions. .. sphinx_rllib_readme_2_end .. sphinx_rllib_readme_3_begin Citing RLlib ------------ If RLlib helps with your academic research, the Ray RLlib team encourages you to cite these papers: .. code-block:: @inproceedings{liang2021rllib, title={{RLlib} Flow: Distributed Reinforcement Learning is a Dataflow Problem}, author={ Wu, Zhanghao and Liang, Eric and Luo, Michael and Mika, Sven and Gonzalez, Joseph E. and Stoica, Ion }, booktitle={Conference on Neural Information Processing Systems ({NeurIPS})}, year={2021}, url={https://proceedings.neurips.cc/paper/2021/file/2bce32ed409f5ebcee2a7b417ad9beed-Paper.pdf} } @inproceedings{liang2018rllib, title={{RLlib}: Abstractions for Distributed Reinforcement Learning}, author={ Eric Liang and Richard Liaw and Robert Nishihara and Philipp Moritz and Roy Fox and Ken Goldberg and Joseph E. Gonzalez and Michael I. Jordan and Ion Stoica, }, booktitle = {International Conference on Machine Learning ({ICML})}, year={2018}, url={https://arxiv.org/pdf/1712.09381} } .. sphinx_rllib_readme_3_end .. sigils used on this page .. |single_agent| image:: /rllib/images/sigils/single-agent.svg :class: inline-figure :width: 72 .. |multi_agent| image:: /rllib/images/sigils/multi-agent.svg :class: inline-figure :width: 72 .. |discr_act| image:: /rllib/images/sigils/discr-actions.svg :class: inline-figure :width: 72 .. |cont_act| image:: /rllib/images/sigils/cont-actions.svg :class: inline-figure :width: 72 .. |multi_gpu| image:: /rllib/images/sigils/multi-gpu.svg :class: inline-figure :width: 72 .. |multi_node_multi_gpu| image:: /rllib/images/sigils/multi-node-multi-gpu.svg :class: inline-figure :alt: Only on the Anyscale Platform! :width: 72 --- .. include:: /_includes/rllib/we_are_hiring.rst .. _rllib-key-concepts: Key concepts ============ .. include:: /_includes/rllib/new_api_stack.rst To help you get a high-level understanding of how the library works, on this page, you learn about the key concepts and general architecture of RLlib. .. figure:: images/rllib_key_concepts.svg :width: 750 :align: left **RLlib overview:** The central component of RLlib is the :py:class:`~ray.rllib.algorithms.algorithm.Algorithm` class, acting as a runtime for executing your RL experiments. Your gateway into using an :ref:`Algorithm ` is the :py:class:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig` (cyan) class, allowing you to manage available configuration settings, for example learning rate or model architecture. Most :py:class:`~ray.rllib.algorithms.algorithm.Algorithm` objects have :py:class:`~ray.rllib.env.env_runner.EnvRunner` actors (blue) to collect training samples from the :ref:`RL environment ` and :py:class:`~ray.rllib.core.learner.learner.Learner` actors (yellow) to compute gradients and update your :ref:`models `. The algorithm synchronizes model weights after an update. .. 
_rllib-key-concepts-algorithms: AlgorithmConfig and Algorithm ----------------------------- .. todo (sven): Change the following link to the actual algorithm and algorithm-config page, once done. Right now, it's pointing to the algos-overview page, instead! .. tip:: The following is a quick overview of **RLlib AlgorithmConfigs and Algorithms**. See here for a :ref:`detailed description of the Algorithm class `. The RLlib :py:class:`~ray.rllib.algorithms.algorithm.Algorithm` class serves as a runtime for your RL experiments, bringing together all components required for learning an optimal solution to your :ref:`RL environment `. It exposes powerful Python APIs for controlling your experiment runs. The gateways into using the various RLlib :py:class:`~ray.rllib.algorithms.algorithm.Algorithm` types are the respective :py:class:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig` classes, allowing you to configure available settings in a checked and type-safe manner. For example, to configure a :py:class:`~ray.rllib.algorithms.ppo.ppo.PPO` ("Proximal Policy Optimization") algorithm instance, you use the :py:class:`~ray.rllib.algorithms.ppo.ppo.PPOConfig` class. During its construction, the :py:class:`~ray.rllib.algorithms.algorithm.Algorithm` first sets up its :py:class:`~ray.rllib.env.env_runner_group.EnvRunnerGroup`, containing ``n`` :py:class:`~ray.rllib.env.env_runner.EnvRunner` `actors `__, and its :py:class:`~ray.rllib.core.learner.learner_group.LearnerGroup`, containing ``m`` :py:class:`~ray.rllib.core.learner.learner.Learner` `actors `__. This way, you can scale up sample collection and training, respectively, from a single core to many thousands of cores in a cluster. .. todo: Separate out our scaling guide into its own page in new PR See this :ref:`scaling guide ` for more details here. You have two ways to interact with and run an :py:class:`~ray.rllib.algorithms.algorithm.Algorithm`: - You can create and manage an instance of it directly through the Python API. - Because the :py:class:`~ray.rllib.algorithms.algorithm.Algorithm` class is a subclass of the :ref:`Tune Trainable API `, you can use `Ray Tune `__ to more easily manage your experiment and tune hyperparameters. The following examples demonstrate this on RLlib's :py:class:`~ray.rllib.algorithms.ppo.PPO` ("Proximal Policy Optimization") algorithm: .. tab-set:: .. tab-item:: Manage Algorithm instance directly .. testcode:: from ray.rllib.algorithms.ppo import PPOConfig # Configure. config = ( PPOConfig() .environment("CartPole-v1") .training( train_batch_size_per_learner=2000, lr=0.0004, ) ) # Build the Algorithm. algo = config.build() # Train for one iteration, which is 2000 timesteps (1 train batch). print(algo.train()) .. testcode:: :hide: algo.stop() .. tab-item:: Run Algorithm through Ray Tune .. testcode:: from ray import tune from ray.rllib.algorithms.ppo import PPOConfig # Configure. config = ( PPOConfig() .environment("CartPole-v1") .training( train_batch_size_per_learner=2000, lr=0.0004, ) ) # Train through Ray Tune. results = tune.Tuner( "PPO", param_space=config, # Train for 4000 timesteps (2 iterations). run_config=tune.RunConfig(stop={"num_env_steps_sampled_lifetime": 4000}), ).fit() .. _rllib-key-concepts-environments: RL environments --------------- .. tip:: The following is a quick overview of **RL environments**. See :ref:`here for a detailed description of how to use RL environments in RLlib `. 
A reinforcement learning (RL) environment is a structured space, like a simulator or a controlled section of the real world, in which one or more agents interact and learn to achieve specific goals. The environment defines an observation space, which is the structure and shape of observable tensors at each timestep, an action space, which defines the available actions for the agents at each timestep, a reward function, and the rules that govern environment transitions when applying actions. .. figure:: images/envs/env_loop_concept.svg :width: 900 :align: left A simple **RL environment** where an agent starts with an initial observation returned by the ``reset()`` method. The agent, possibly controlled by a neural network policy, sends actions, like ``right`` or ``jump``, to the environment's ``step()`` method, which returns a reward. Here, the reward values are +5 for reaching the goal and 0 otherwise. The environment also returns a boolean flag indicating whether the episode is complete. Environments may vary in complexity, from simple tasks, like navigating a grid world, to highly intricate systems, like autonomous driving simulators, robotic control environments, or multi-agent games. RLlib interacts with the environment by playing through many :ref:`episodes ` during a training iteration to collect data, such as the observations made, actions taken, rewards received, and ``done`` flags (see preceding figure). It then converts this episode data into a train batch for model updating. The goal of these model updates is to change the agents' behaviors such that they lead to the maximum possible sum of rewards over the agents' lifetimes. .. _rllib-key-concepts-rl-modules: RLModules --------- .. tip:: The following is a quick overview of **RLlib RLModules**. See :ref:`here for a detailed description of the RLModule class `. `RLModules `__ are deep-learning framework-specific neural network wrappers. RLlib's :ref:`EnvRunners ` use them for computing actions when stepping through the :ref:`RL environment ` and RLlib's :ref:`Learners ` use :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` instances for computing losses and gradients before updating them. .. figure:: images/rl_modules/rl_module_overview.svg :width: 750 :align: left **RLModule overview**: *(left)* A minimal :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` contains a neural network and defines its forward exploration-, inference- and training logic. *(right)* In more complex setups, a :py:class:`~ray.rllib.core.rl_module.multi_rl_module.MultiRLModule` contains many submodules, each itself an :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` instance and identified by a ``ModuleID``, allowing you to implement arbitrarily complex multi-model and multi-agent algorithms. In a nutshell, an :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` carries the neural network models and defines how to use them during the three phases of its RL lifecycle: **exploration**, for collecting training data, **inference**, for computing actions during evaluation or in production, and **training**, for computing the loss function inputs. You can choose to use :ref:`RLlib's built-in default models and configure these ` as needed, for example for changing the number of layers or the activation functions, or :ref:`write your own custom models in PyTorch `, allowing you to implement any architecture and computation logic. ..
figure:: images/rl_modules/rl_module_in_env_runner.svg :width: 450 :align: left **An RLModule inside an EnvRunner actor**: The :py:class:`~ray.rllib.env.env_runner.EnvRunner` operates on its own copy of an inference-only version of the :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule`, using it only to compute actions. Each :py:class:`~ray.rllib.env.env_runner.EnvRunner` actor, managed by the :py:class:`~ray.rllib.env.env_runner_group.EnvRunnerGroup` of the Algorithm, has a copy of the user's :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule`. Also, each :py:class:`~ray.rllib.core.learner.learner.Learner` actor, managed by the :py:class:`~ray.rllib.core.learner.learner_group.LearnerGroup` of the Algorithm, has an :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` copy. The :py:class:`~ray.rllib.env.env_runner.EnvRunner` copy is normally in its ``inference_only`` version, meaning that components not required for bare action computation, for example a value function estimate, are omitted to save memory. .. figure:: images/rl_modules/rl_module_in_learner.svg :width: 400 :align: left **An RLModule inside a Learner actor**: The :py:class:`~ray.rllib.core.learner.learner.Learner` operates on its own copy of an :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule`, computing the loss function inputs, the loss itself, and the model's gradients, then updating the :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` through the :py:class:`~ray.rllib.core.learner.learner.Learner`'s optimizers. .. _rllib-key-concepts-episodes: Episodes -------- .. tip:: The following is a quick overview of **Episodes**. See :ref:`here for a detailed description of the Episode classes `. RLlib sends all training data around in the form of :ref:`Episodes `. The :py:class:`~ray.rllib.env.single_agent_episode.SingleAgentEpisode` class describes single-agent trajectories. The :py:class:`~ray.rllib.env.multi_agent_episode.MultiAgentEpisode` class contains several such single-agent episodes and describes the stepping times and patterns of the individual agents with respect to each other. Both ``Episode`` classes store the entire trajectory data generated while stepping through an :ref:`RL environment `. This data includes the observations, info dicts, actions, rewards, termination signals, and any model computations along the way, like recurrent states, action logits, or action log probabilities. .. tip:: See here for `RLlib's standardized column names `__. Note that episodes conveniently don't have to store any ``next obs`` information as it always overlaps with the information under ``obs``. This design saves almost 50% of memory, because observations are often the largest piece in a trajectory. The same is true for ``state_in`` and ``state_out`` information for stateful networks. RLlib only keeps the ``state_out`` key in the episodes. Typically, RLlib generates episode chunks of size ``config.rollout_fragment_length`` through the :ref:`EnvRunner ` actors in the Algorithm's :ref:`EnvRunnerGroup `, and sends as many episode chunks to each :ref:`Learner ` actor as required to build one training batch of exactly ``config.train_batch_size_per_learner`` in size. A typical :py:class:`~ray.rllib.env.single_agent_episode.SingleAgentEpisode` object roughly looks as follows: .. code-block:: python # A SingleAgentEpisode of length 20 has roughly the following schematic structure.
# Note that after these 20 steps, you have 20 actions and rewards, but 21 observations and info dicts # due to the initial "reset" observation/infos. episode = { 'obs': np.ndarray((21, 4), dtype=float32), # 21 due to additional reset obs 'infos': [{}, {}, {}, {}, .., {}, {}], # infos are always lists of dicts 'actions': np.ndarray((20,), dtype=int64), # Discrete(4) action space 'rewards': np.ndarray((20,), dtype=float32), 'extra_model_outputs': { 'action_dist_inputs': np.ndarray((20, 4), dtype=float32), # Discrete(4) action space }, 'is_terminated': False, # <- single bool 'is_truncated': True, # <- single bool } For complex observations, for example ``gym.spaces.Dict``, the episode holds all observations in a struct entirely analogous to the observation space, with NumPy arrays at the leaves of that dict. For example: .. code-block:: python episode_w_complex_observations = { 'obs': { "camera": np.ndarray((21, 64, 64, 3), dtype=float32), # RGB images "sensors": { "front": np.ndarray((21, 15), dtype=float32), # 1D tensors "rear": np.ndarray((21, 5), dtype=float32), # another batch of 1D tensors }, }, ... Because RLlib keeps all values in NumPy arrays, this allows for efficient encoding and transmission across the network. In `multi-agent mode `__, the :py:class:`~ray.rllib.env.env_runner_group.EnvRunnerGroup` produces :py:class:`~ray.rllib.env.multi_agent_episode.MultiAgentEpisode` instances. .. note:: The Ray team is working on a detailed description of the :py:class:`~ray.rllib.env.multi_agent_episode.MultiAgentEpisode` class. .. _rllib-key-concepts-env-runners: EnvRunner: Combining RL environment and RLModule ------------------------------------------------ Given the :ref:`RL environment ` and an :ref:`RLModule `, an :py:class:`~ray.rllib.env.env_runner.EnvRunner` produces lists of :ref:`Episodes `. It does so by executing a classic environment interaction loop. Efficient sample collection can be burdensome to get right, especially when leveraging environment vectorization or stateful recurrent neural networks, or when operating in a multi-agent setting. RLlib provides two built-in :py:class:`~ray.rllib.env.env_runner.EnvRunner` classes, :py:class:`~ray.rllib.env.single_agent_env_runner.SingleAgentEnvRunner` and :py:class:`~ray.rllib.env.multi_agent_env_runner.MultiAgentEnvRunner`, which automatically handle these complexities. RLlib picks the correct type based on your configuration, in particular the `config.environment()` and `config.multi_agent()` settings. .. tip:: Call the :py:meth:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig.is_multi_agent` method to find out whether your config is multi-agent or not. RLlib bundles several :py:class:`~ray.rllib.env.env_runner.EnvRunner` actors through the :py:class:`~ray.rllib.env.env_runner_group.EnvRunnerGroup` API. You can also use an :py:class:`~ray.rllib.env.env_runner.EnvRunner` standalone to produce lists of Episodes by calling its :py:meth:`~ray.rllib.env.env_runner.EnvRunner.sample` method. Here is an example of creating a set of remote :py:class:`~ray.rllib.env.env_runner.EnvRunner` actors and using them to gather experiences in parallel: .. testcode:: import tree # pip install dm_tree import ray from ray.rllib.algorithms.ppo import PPOConfig from ray.rllib.env.single_agent_env_runner import SingleAgentEnvRunner # Configure the EnvRunners. config = ( PPOConfig() .environment("Acrobot-v1") .env_runners(num_env_runners=2, num_envs_per_env_runner=1) ) # Create the EnvRunner actors.
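    # `ray.remote(SingleAgentEnvRunner)` turns the EnvRunner class into a Ray actor
    # class; calling `.remote(config=config)` then starts one remote actor process
    # per configured EnvRunner.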
env_runners = [ ray.remote(SingleAgentEnvRunner).remote(config=config) for _ in range(config.num_env_runners) ] # Gather lists of `SingleAgentEpisode`s (each EnvRunner actor returns one # such list with exactly three episodes in it). episodes = ray.get([ er.sample.remote(num_episodes=3) for er in env_runners ]) # Two remote EnvRunners used. assert len(episodes) == 2 # Each EnvRunner returns three episodes. assert all(len(eps_list) == 3 for eps_list in episodes) # Report the returns of all episodes collected. for episode in tree.flatten(episodes): print("R=", episode.get_return()) .. testcode:: :hide: for er in env_runners: er.stop.remote() .. _rllib-key-concepts-learners: Learner: Combining RLModule, loss function and optimizer -------------------------------------------------------- .. tip:: The following is a quick overview of **RLlib Learners**. See :ref:`here for a detailed description of the Learner class `. Given the :ref:`RLModule ` and one or more optimizers and loss functions, a :py:class:`~ray.rllib.core.learner.learner.Learner` computes losses and gradients, then updates the :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule`. The input data for such an update step comes in as a list of :ref:`episodes `, which either the Learner's own connector pipeline or an external one converts into the final train batch. .. note:: :py:class:`~ray.rllib.connectors.connector_v2.ConnectorV2` documentation is work in progress. The Ray team will link to the correct documentation page here once that work is complete. :py:class:`~ray.rllib.core.learner.learner.Learner` instances are algorithm-specific, mostly due to the various loss functions used by different RL algorithms. RLlib always bundles several :py:class:`~ray.rllib.core.learner.learner.Learner` actors through the :py:class:`~ray.rllib.core.learner.learner_group.LearnerGroup` API, automatically applying distributed data parallelism (``DDP``) on the training data. You can also use a :py:class:`~ray.rllib.core.learner.learner.Learner` standalone to update your RLModule with a list of Episodes. Here is an example of creating a remote :py:class:`~ray.rllib.core.learner.learner.Learner` actor and calling its :py:meth:`~ray.rllib.core.learner.learner.Learner.update` method. .. testcode:: import gymnasium as gym import ray from ray.rllib.algorithms.ppo import PPOConfig from ray.rllib.core.rl_module.default_model_config import DefaultModelConfig # Configure the Learner. config = ( PPOConfig() .environment("Acrobot-v1") .training(lr=0.0001) .rl_module(model_config=DefaultModelConfig(fcnet_hiddens=[64, 32])) ) # Get the Learner class. ppo_learner_class = config.get_default_learner_class() # Create the Learner actor. learner_actor = ray.remote(ppo_learner_class).remote( config=config, module_spec=config.get_multi_rl_module_spec(env=gym.make("Acrobot-v1")), ) # Build the Learner. ray.get(learner_actor.build.remote()) # Perform an update from the list of episodes collected by the `EnvRunners` above. learner_results = ray.get(learner_actor.update.remote( episodes=tree.flatten(episodes) )) print(learner_results["default_policy"]["policy_loss"])
grid-item-card:: :img-top: /rllib/images/connector_v2/env_to_module_connector.svg :class-img-top: pt-2 w-75 d-block mx-auto fixed-height-img .. button-ref:: env-to-module-pipeline-docs Env-to-module pipelines .. grid-item-card:: :img-top: /rllib/images/connector_v2/learner_connector.svg :class-img-top: pt-2 w-75 d-block mx-auto fixed-height-img .. button-ref:: learner-pipeline-docs Learner connector pipelines (this page) Learner connector pipelines =========================== .. include:: /_includes/rllib/new_api_stack.rst On each :py:class:`~ray.rllib.core.learner.learner.Learner` actor resides a single Learner connector pipeline (see figure below) responsible for compiling the train batch for the :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` from a list of episodes. .. figure:: images/connector_v2/learner_connector_pipeline.svg :width: 1000 :align: left **Learner ConnectorV2 Pipelines**: A learner connector pipeline sits between the input training data, a list of episodes, and the :py:class:`~ray.rllib.core.learner.learner.Learner` actor's :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule`. The pipeline transforms this input data into a train batch readable by the :py:meth:`~ray.rllib.core.rl_module.rl_module.RLModule.forward_train` method of the :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule`. When calling the Learner connector pipeline, a transformation from a list of :ref:`Episode objects ` to an ``RLModule``-readable tensor batch, also referred to as the "train batch", takes place and the :py:class:`~ray.rllib.core.learner.learner.Learner` actor sends the output of the pipeline directly into the :py:meth:`~ray.rllib.core.rl_module.rl_module.RLModule.forward_train` method of the :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule`. .. _default-learner-pipeline: Default Learner pipeline behavior --------------------------------- By default RLlib populates every Learner connector pipeline with the following built-in connector pieces. * :py:class:`~ray.rllib.connectors.common.add_observations_from_episodes_to_batch.AddObservationsFromEpisodesToBatch`: Places all observations from the incoming episodes into the batch. The column name is ``obs``. For example, if you have two incoming episodes of length 10 and 20, your resulting train batch size is 30. * :py:class:`~ray.rllib.connectors.learner.add_columns_from_episodes_to_batch.AddColumnsFromEpisodesToBatch`: Places all other columns, like rewards, actions, and termination flags, from the incoming episodes into the batch. * *Relevant for stateful models only:* :py:class:`~ray.rllib.connectors.common.add_time_dim_to_batch_and_zero_pad.AddTimeDimToBatchAndZeroPad`: If the :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` is stateful, adds a time-dimension of size `max_seq_len` at axis=1 to all data in the batch and (right) zero-pads in cases where episodes end at timesteps non-dividable by `max_seq_len`. You can change `max_seq_len` through your RLModule's `model_config_dict` (call `config.rl_module(model_config_dict={'max_seq_len': ...})` on your :py:class:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig` object). * *Relevant for stateful models only:* :py:class:`~ray.rllib.connectors.common.add_states_from_episodes_to_batch.AddStatesFromEpisodesToBatch`: If the :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` is stateful, places the most recent state outputs of the module as new state inputs into the batch. The column name is ``state_in`` and the values don't have a time-dimension. 
* *For multi-agent only:* :py:class:`~ray.rllib.connectors.common.agent_to_module_mapping.AgentToModuleMapping`: Maps per-agent data to the respective per-module data depending on the already determined agent-to-module mapping stored in each multi-agent episode. * :py:class:`~ray.rllib.connectors.common.batch_individual_items.BatchIndividualItems`: Converts all data in the batch, which thus far are lists of individual items, into batched structures meaning NumPy arrays, whose 0th axis is the batch axis. * :py:class:`~ray.rllib.connectors.common.numpy_to_tensor.NumpyToTensor`: Converts all NumPy arrays in the batch into framework specific tensors and moves these to the GPU, if required. You can disable all the preceding default connector pieces by setting `config.learners(add_default_connectors_to_learner_pipeline=False)` in your :ref:`algorithm config `. Note that the order of these transforms is very relevant for the functionality of the pipeline. .. _writing_custom_learner_connectors: Writing custom Learner connectors --------------------------------- You can customize the Learner connector pipeline through specifying a function in your :py:class:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig`, which takes the observation- and action spaces as input arguments and returns a single :py:class:`~ray.rllib.connectors.connector_v2.ConnectorV2` piece or a list thereof. RLlib prepends these :py:class:`~ray.rllib.connectors.connector_v2.ConnectorV2` instances to the :ref:`default Learner pipeline ` in the order returned, unless you set `add_default_connectors_to_learner_pipeline=False` in your config, in which case RLlib exclusively uses the provided :py:class:`~ray.rllib.connectors.connector_v2.ConnectorV2` pieces without any automatically added default behavior. For example, to prepend a custom :py:class:`~ray.rllib.connectors.connector_v2.ConnectorV2` piece to the :py:class:`~ray.rllib.core.learner.learner.Learner` connector pipeline, you can do this in your config: .. testcode:: :skipif: True config.learners( learner_connector=lambda obs_space, act_space: MyLearnerConnector(..), ) If you want to add multiple custom pieces to the pipeline, return them as a list: .. testcode:: :skipif: True # Return a list of connector pieces to make RLlib add all of them to your # Learner pipeline. config.learners( learner_connector=lambda obs_space, act_space: [ MyLearnerConnector(..), MyOtherLearnerConnector(..), AndOneMoreConnector(..), ], ) RLlib adds the connector pieces returned by your function to the beginning of the Learner pipeline, before the previously described default connector pieces that RLlib provides automatically: .. figure:: images/connector_v2/custom_pieces_in_learner_pipeline.svg :width: 1000 :align: left **Inserting custom ConnectorV2 pieces into the Learner pipeline**: RLlib inserts custom connector pieces, such as intrinsic reward computation, before the default pieces. This way, if your custom connectors alter the input episodes in any way, for example by changing the rewards as in the succeeding example, the default pieces at the end of the pipeline automatically add these changed rewards to the batch. Example: Reward shaping prior to loss computation ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ A good example of when to write a custom Learner ConnectorV2 piece is reward shaping before computing your algorithm's loss. 
The Learner connector's :py:meth:`~ray.rllib.connectors.connector_v2.ConnectorV2.__call__` has full access to the entire episode data, including observations, actions, other agents' data in multi-agent scenarios, and all rewards. Here are the most important code snippets for setting up a simple, count-based intrinsic reward signal. The custom connector computes the intrinsic reward as the inverse number of times an agent has already seen a specific observation. Thus, the more the agent visits a certain state, the lower the computed intrinsic reward for that state, motivating the agent to visit new states and show better exploratory behavior. See `here for the full count-based intrinsic reward example script `__. You can write the custom Learner connector by subclassing :py:class:`~ray.rllib.connectors.connector_v2.ConnectorV2` and overriding the :py:meth:`~ray.rllib.connectors.connector_v2.ConnectorV2.__call__` method: .. testcode:: from collections import Counter from ray.rllib.connectors.connector_v2 import ConnectorV2 class CountBasedIntrinsicRewards(ConnectorV2): def __init__(self, **kwargs): super().__init__(**kwargs) # Observation counter to compute state visitation frequencies. self._counts = Counter() In the :py:meth:`~ray.rllib.connectors.connector_v2.ConnectorV2.__call__` method, you then loop through all single-agent episodes and change the reward stored in these to: ``r(t) = re(t) + 1 / N(o(t))``, where ``re(t)`` is the extrinsic reward from the RL environment and ``N(o(t))`` is the number of times the agent has already visited observation ``o(t)``. .. testcode:: def __call__( self, *, rl_module, batch, episodes, explore=None, shared_data=None, **kwargs, ): for sa_episode in self.single_agent_episode_iterator( episodes=episodes, agents_that_stepped_only=False ): # Loop through all observations, except the last one. observations = sa_episode.get_observations(slice(None, -1)) # Get all respective extrinsic rewards. rewards = sa_episode.get_rewards() for i, (obs, rew) in enumerate(zip(observations, rewards)): # Add 1 to obs counter. obs = tuple(obs) self._counts[obs] += 1 # Compute the count-based intrinsic reward and add it to the extrinsic # reward. rew += 1 / self._counts[obs] # Store the new reward back to the episode (under the correct # timestep/index). sa_episode.set_rewards(new_data=rew, at_indices=i) return batch If you plug this custom :py:class:`~ray.rllib.connectors.connector_v2.ConnectorV2` piece into the pipeline through the algorithm config (`config.learners(learner_connector=lambda obs_space, act_space: CountBasedIntrinsicRewards())`), your loss function should receive the altered reward signals in the ``rewards`` column of the incoming batch. .. note:: Your custom logic writes the new rewards right back into the given episodes instead of placing them into the train batch. Writing the data you pulled from the episodes back into those same episodes makes sure that from this point on, only the changed data is visible to the subsequent connector pieces. The batch remains unchanged at first. However, one of the subsequent :ref:`default Learner connector pieces `, :py:class:`~ray.rllib.connectors.learner.add_columns_from_episodes_to_batch.AddColumnsFromEpisodesToBatch`, fills the batch with rewards data from the episodes. Therefore, RLlib automatically adds to the train batch any changes you make to the episode objects.
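If you want to try the piece outside of the full example script, the following is a minimal sketch that wires ``CountBasedIntrinsicRewards`` (the class defined in the preceding snippets) into a PPO config. The choice of ``CartPole-v1`` is arbitrary and only for illustration; the ``config.learners(learner_connector=...)`` call is the same one used inline above:

.. code-block:: python

    from ray.rllib.algorithms.ppo import PPOConfig

    config = (
        PPOConfig()
        .environment("CartPole-v1")  # Any env with array observations works here.
        .learners(
            # Prepend the custom piece to the default Learner connector pipeline.
            learner_connector=lambda obs_space, act_space: CountBasedIntrinsicRewards(),
        )
    )

    algo = config.build()
    # The Learner's loss now sees rewards of the form r(t) = re(t) + 1 / N(o(t))
    # in the `rewards` column of its train batch.
    print(algo.train())

Because the bonus is added on the Learner side only, the episode returns that the EnvRunners report remain the pure extrinsic returns from the environment.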
Example: Stacking the N most recent observations ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Another application of the Learner connector API, in combination with a :ref:`custom env-to-module connector piece `, is efficient observation frame stacking, without having to deduplicate the stacked, overlapping observation data and without having to store these additional, overlapping observations in your episodes or send them through the network for inter-actor communication: .. figure:: images/connector_v2/frame_stacking_connector_setup.svg :width: 1000 :align: left **ConnectorV2 setup for observation frame-stacking**: An env-to-module connector pipeline, inside an :py:class:`~ray.rllib.env.env_runner.EnvRunner`, and a Learner connector pipeline, inside a :py:class:`~ray.rllib.core.learner.learner.Learner` actor, both of which contain a custom :py:class:`~ray.rllib.connectors.connector_v2.ConnectorV2` piece, which stacks the last four observations from either the ongoing (``EnvRunner``) or already collected episodes (``Learner``) and places these in the batch. Note that you should use dummy, zero-filled observations (in the batch, in red) where the stacking happens close to the beginning of the episode. Because you aren't overriding the original, non-stacked observations in the collected episodes, you have to apply the same batch construction logic responsible for the observation stacking twice, once for the action computation on the :py:class:`~ray.rllib.env.env_runner.EnvRunner` actors and also for the loss computation on the :py:class:`~ray.rllib.core.learner.learner.Learner` actors. For better clarity, it may help to remember that batches produced by a connector pipeline are ephemeral and RLlib discards them right after the :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` forward pass. Thus, if frame stacking happens directly on the batch under construction, because you don't want to overload the episodes with deduplicated, stacked observations, you have to apply the stacking logic twice (in the :ref:`env-to-module pipeline ` and the Learner connector pipeline): The following is an example for implementing such a frame-stacking mechanism using the :py:class:`~ray.rllib.connectors.connector_v2.ConnectorV2` APIs with an RL environment, in which observations are plain 1D tensors. See here for a `more complex end-to-end Atari example for PPO `__. You can write a single :py:class:`~ray.rllib.connectors.connector_v2.ConnectorV2` class to cover both the env-to-module as well as the Learner custom connector part: .. testcode:: import gymnasium as gym import numpy as np from ray.rllib.connectors.connector_v2 import ConnectorV2 from ray.rllib.core.columns import Columns class StackFourObservations(ConnectorV2): """A connector piece that stacks the previous four observations into one. Works both as Learner connector as well as env-to-module connector. """ def recompute_output_observation_space( self, input_observation_space, input_action_space, ): # Assume the input observation space is a Box of shape (x,). assert ( isinstance(input_observation_space, gym.spaces.Box) and len(input_observation_space.shape) == 1 ) # This connector concatenates the last four observations at axis=0, so the # output space has a shape of (4*x,). 
return gym.spaces.Box( low=input_observation_space.low, high=input_observation_space.high, shape=(input_observation_space.shape[0] * 4,), dtype=input_observation_space.dtype, ) def __init__( self, input_observation_space=None, input_action_space=None, *, as_learner_connector=False, **kwargs, ): # The defaults allow constructing this piece without any arguments on the # env-to-module side; the pipeline fills in the spaces later. super().__init__(input_observation_space, input_action_space, **kwargs) self._as_learner_connector = as_learner_connector def __call__(self, *, rl_module, batch, episodes, **kwargs): # Loop through all (single-agent) episodes. for sa_episode in self.single_agent_episode_iterator(episodes): # Get the four most recent observations from the episodes. last_4_obs = sa_episode.get_observations( indices=[-4, -3, -2, -1], fill=0.0, # Left-zero-fill in case you reach beginning of episode. ) # Concatenate all stacked observations. new_obs = np.concatenate(last_4_obs, axis=0) # Add the stacked observations to the `batch` using the # `ConnectorV2.add_batch_item()` utility. # Note that you don't change the episode here, which means, if `self` is # the env-to-module connector piece (as opposed to the Learner connector # piece), the episode collected still has only single, non-stacked # observations, which the Learner pipeline must stack again for the # `forward_train()` pass through the model. self.add_batch_item( batch=batch, column=Columns.OBS, item_to_add=new_obs, single_agent_episode=sa_episode, ) # Return batch (with stacked observations). return batch Then, add these lines to your :py:class:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig`: .. testcode:: :hide: from ray.rllib.algorithms.ppo import PPOConfig config = PPOConfig() .. testcode:: # Enable frame-stacking on the EnvRunner side. config.env_runners( env_to_module_connector=lambda env, spaces, device: StackFourObservations(), ) # And again on the Learner side. config.training( learner_connector=lambda obs_space, act_space: StackFourObservations( as_learner_connector=True ), ) Your :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` automatically receives the correct, adjusted observation space in its :py:meth:`~ray.rllib.core.rl_module.rl_module.RLModule.setup` method. The :py:class:`~ray.rllib.env.env_runner.EnvRunner` and its :ref:`env-to-module connector pipeline ` conveniently compute this information for you through the :py:meth:`~ray.rllib.connectors.connector_v2.ConnectorV2.recompute_output_observation_space` methods. Make sure your :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` supports stacked observations rather than individual ones. Note that you don't have to concatenate observations along the original observation dimension as in the preceding implementation of the :py:meth:`~ray.rllib.connectors.connector_v2.ConnectorV2.__call__` method; you may also stack along a new observation dimension, as long as your :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` knows how to handle the altered observation shape. .. tip:: The preceding code is for demonstration and explanation purposes only. There already exists an off-the-shelf :py:class:`~ray.rllib.connectors.connector_v2.ConnectorV2` piece in RLlib, which performs the task of stacking the last `N` observations in both env-to-module and Learner connector pipelines and also supports multi-agent cases. Add these lines to your config to switch on observation frame stacking: .. testcode:: from ray.rllib.connectors.common.frame_stacking import FrameStacking N = 4 # number of frames to stack # Frame stacking on the EnvRunner side.
config.env_runners( env_to_module_connector=lambda env, spaces, device: FrameStacking(num_frames=N), ) # Then again on the Learner side. config.training( learner_connector=lambda obs_space, act_space: FrameStacking(num_frames=N, as_learner_connector=True), ) --- .. include:: /_includes/rllib/we_are_hiring.rst .. _rllib-metric-logger-docs: MetricsLogger API ================== .. include:: /_includes/rllib/new_api_stack.rst :py:class:`~ray.rllib.utils.metrics.metrics_logger.MetricsLogger` lets RLlib experiments keep track of metrics. Most components in RLlib (for example, :py:class:`~ray.rllib.env.env_runner.EnvRunner` and :py:class:`~ray.rllib.core.learner.learner.Learner`) keep an instance of MetricsLogger that can be logged to. Any logged metrics are aggregated toward the root MetricsLogger, which lives inside the :py:class:`~ray.rllib.algorithms.algorithm.Algorithm` object and is used to report metrics to the user or to Ray Tune. When a subcomponent reports metrics down the hierarchy, it "reduces" the logged results before sending them. For example, when reducing by summation, subcomponents calculate sums before sending them to the parent component. We recommend using this API for any metrics that you want RLlib to report, especially if they should be reported to Ray Tune or WandB. To quickly see how RLlib uses MetricsLogger, check out :py:class:`~ray.rllib.env.env_runner.EnvRunner`-based :ref:`callbacks `, a `custom loss function `__, or a custom `training_step `__ implementation. If your goal is to communicate data between RLlib components (for example, communicate a loss from Learners to EnvRunners), we recommend passing such values around through callbacks or by overriding RLlib components' attributes. This is mainly because MetricsLogger is designed to aggregate metrics, but not to make them available everywhere and at any time, so querying logged metrics from it can lead to unexpected results. .. figure:: images/metrics_logger_hierarchy.svg :width: 750 :align: left **RLlib's MetricsLogger aggregation overview**: The diagram illustrates how metrics that are logged in parallel components are aggregated towards the root MetricsLogger. Parallel subcomponents of :py:class:`~ray.rllib.algorithms.algorithm.Algorithm` have their own :py:class:`~ray.rllib.utils.metrics.metrics_logger.MetricsLogger` instance and use it to locally log values. When a component completes a distinct task, for example, an :py:class:`~ray.rllib.env.env_runner.EnvRunner` finishing a sampling request, the local metrics of the subcomponent (``EnvRunner`` or ``Learner``) are "reduced" and sent downstream towards the root component (``Algorithm``). The parent component merges the received results into its own :py:class:`~ray.rllib.utils.metrics.metrics_logger.MetricsLogger`. Once ``Algorithm`` has completed its own cycle (:py:meth:`~ray.rllib.algorithms.algorithm.Algorithm.step` returns), it "reduces" as well for final reporting to the user or to Ray Tune. Features of MetricsLogger ------------------------- The :py:class:`~ray.rllib.utils.metrics.metrics_logger.MetricsLogger` API offers the following functionalities: - Log scalar values over time, such as losses, individual rewards, or episode returns. - Configure different reduction types, in particular ``ema``, ``mean``, ``min``, ``max``, or ``sum``. Also, users can choose not to reduce at all by using ``item`` or ``item_series``, leaving the logged values untouched.
- Specify sliding windows, over which reductions take place, for example ``window=100`` to average over the last 100 logged values per parallel component. Alternatively, specify exponential moving average (EMA) coefficients - Log execution times for distinct code blocks through convenient ``with MetricsLogger.log_time(...)`` blocks. - Add up lifetime sums by setting ``reduce="lifetime_sum"`` when logging values. - For sums and lifetime sums, you can also compute the corresponding throughput metrics per second along the way. Built-in usages of MetricsLogger -------------------------------- RLlib uses the :py:class:`~ray.rllib.utils.metrics.metrics_logger.MetricsLogger` API extensively in the existing code-base. The following is an overview of a typical information flow resulting from this: #. The :py:class:`~ray.rllib.algorithms.algorithm.Algorithm` sends parallel sample requests to its ``n`` :py:class:`~ray.rllib.env.env_runner.EnvRunner` actors. #. Each :py:class:`~ray.rllib.env.env_runner.EnvRunner` collects training data by stepping through its :ref:`RL environment ` and logs standard stats to its :py:class:`~ray.rllib.utils.metrics.metrics_logger.MetricsLogger`, such as episode return or episode length. #. Each :py:class:`~ray.rllib.env.env_runner.EnvRunner` reduces all collected metrics and returns them to the Algorithm. #. The :py:class:`~ray.rllib.algorithms.algorithm.Algorithm` aggregates the ``n`` chunks of metrics from the EnvRunners (this depends on the reduce method chosen, an example is averaging if reduce="mean"). #. The :py:class:`~ray.rllib.algorithms.algorithm.Algorithm` sends parallel update requests to its ``m`` :py:class:`~ray.rllib.core.learner.learner.Learner` actors. #. Each :py:class:`~ray.rllib.core.learner.learner.Learner` performs a model update while logging metrics to its :py:class:`~ray.rllib.utils.metrics.metrics_logger.MetricsLogger`, such as total loss or mean gradients. #. Each :py:class:`~ray.rllib.core.learner.learner.Learner` reduces all collected metrics and returns them to the Algorithm. #. The :py:class:`~ray.rllib.algorithms.algorithm.Algorithm` aggregates the ``m`` chunks of metrics from the Learners (this again depends on the reduce method chosen, an example is summing if reduce="sum"). #. The :py:class:`~ray.rllib.algorithms.algorithm.Algorithm` may add standard metrics to its own :py:class:`~ray.rllib.utils.metrics.metrics_logger.MetricsLogger` instance, for example the average time of a parallel sample request. #. The :py:class:`~ray.rllib.algorithms.algorithm.Algorithm` reduces all collected metrics and returns them to the user or Ray Tune. .. warning:: **Don't call the reduce() method yourself** Anytime RLlib reduces metrics, it does so by calling :py:meth:`~ray.rllib.utils.metrics.metrics_logger.MetricsLogger.reduce` on the MetricsLogger instance. Doing so clears metrics from the MetricsLogger instance. This is why :py:meth:`~ray.rllib.utils.metrics.metrics_logger.MetricsLogger.reduce` should not be called by your custom code. The MetricsLogger APIs in detail -------------------------------- .. figure:: images/metrics_logger_api.svg :width: 750 :align: left **RLlib's MetricsLogger API**: This is how RLlib uses the MetricsLogger API to log and aggregate metrics. We use the methods :py:meth:`~ray.rllib.utils.metrics.metrics_logger.MetricsLogger.log_time` and :py:meth:`~ray.rllib.utils.metrics.metrics_logger.MetricsLogger.log_value` to log metrics. 
Metrics then get reduced with the method :py:meth:`~ray.rllib.utils.metrics.metrics_logger.MetricsLogger.reduce`. Reduced metrics get aggregated with the method :py:meth:`~ray.rllib.utils.metrics.metrics_logger.MetricsLogger.aggregate`. All metrics are finally reduced to be reported to the user or Ray Tune by the :py:class:`~ray.rllib.algorithms.algorithm.Algorithm` object. Logging scalar values ~~~~~~~~~~~~~~~~~~~~~ To log a scalar value under some string key in your :py:class:`~ray.rllib.utils.metrics.metrics_logger.MetricsLogger`, use the :py:meth:`~ray.rllib.utils.metrics.metrics_logger.MetricsLogger.log_value` method: .. testcode:: from ray.rllib.utils.metrics.metrics_logger import MetricsLogger logger = MetricsLogger() # Log a scalar float value under the `loss` key. By default, all logged # values under that key are averaged, once `reduce()` is called. logger.log_value("loss", 0.01, reduce="mean", window=2) By default, :py:class:`~ray.rllib.utils.metrics.metrics_logger.MetricsLogger` reduces values through averaging them (``reduce="mean"``). Other available reduction methods can be found in the dictionary ``ray.rllib.utils.metrics.metrics_logger.DEFAULT_STATS_CLS_LOOKUP``. .. note:: You can also provide your own reduction methods by extending ``ray.rllib.utils.metrics.metrics_logger.DEFAULT_STATS_CLS_LOOKUP`` and passing it to :py:meth:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig.reporting`. These new reduction methods will then be available by their key when logging values during runtime. For example, you can use ``reduce="my_custom_reduce_method"`` when when extending the dictionary with a key ``"my_custom_reduce_method"`` and passing it to :py:meth:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig.reporting`. Specifying a ``window`` causes the reduction to take place over the last ``window`` logged values. For example, you can continue logging new values under the ``loss`` key: .. testcode:: logger.log_value("loss", 0.02, reduce="mean", window=2) logger.log_value("loss", 0.03, reduce="mean", window=2) logger.log_value("loss", 0.04, reduce="mean", window=2) logger.log_value("loss", 0.05, reduce="mean", window=2) Because you specified a window of 2, :py:class:`~ray.rllib.utils.metrics.metrics_logger.MetricsLogger` only uses the last 2 values to compute the reduced result. You can ``peek()`` at the currently reduced result through the :py:meth:`~ray.rllib.utils.metrics.metrics_logger.MetricsLogger.peek` method: .. testcode:: # Peek at the current, reduced value. # Note that in the underlying structure, the internal values list still # contains all logged values: 0.01, 0.02, 0.03, 0.04, and 0.05. print(logger.peek("loss")) # Expect: 0.045, which is the average over the last 2 values The :py:meth:`~ray.rllib.utils.metrics.metrics_logger.MetricsLogger.peek` method allows you to check the current underlying reduced result for some key, without actually having to call :py:meth:`~ray.rllib.utils.metrics.metrics_logger.MetricsLogger.reduce`. .. warning:: A limitation of peeking metrics is that you can often not meaningfully peek metrics if they are aggregated downstream. For example, if you log the number of steps you trained on each call to :py:meth:`~ray.rllib.core.learner.learner.Learner.update`, these will be reduced and aggregated by the Algorithm's MetricsLogger and peeking them inside :py:class:`~ray.rllib.core.learner.learner.Learner` will not give you the aggregated result. 
Instead of providing a flat key, you can also log a value under some nested key by passing in a tuple: .. testcode:: # Log a value under a deeper nested key. logger.log_value(("some", "nested", "key"), -1.0) print(logger.peek(("some", "nested", "key"))) # expect: -1.0 To use reduce methods other than "mean", specify the ``reduce`` argument in :py:meth:`~ray.rllib.utils.metrics.metrics_logger.MetricsLogger.log_value`: .. testcode:: # Log a maximum value. logger.log_value(key="max_value", value=0.0, reduce="max") The maximum value will be reset after each ``reduce()`` operation. .. testcode:: for i in range(1000, 0, -1): logger.log_value(key="max_value", value=float(i)) logger.peek("max_value") # Expect: 1000.0, the max over all values logged since the last `reduce()` (infinite window) You can also choose not to reduce at all, but to simply collect individual values, for example a set of images you receive from your environment over time, which it doesn't make sense to reduce in any way. Use the ``reduce="item"`` or ``reduce="item_series"`` argument to achieve this. However, use your best judgement for what you are logging because RLlib will report all logged values unless you clean them up yourself. .. testcode:: logger.log_value("some_items", value="a", reduce="item_series") logger.log_value("some_items", value="b", reduce="item_series") logger.log_value("an_item", value="c", reduce="item") logger.log_value("an_item", value="d", reduce="item") logger.peek("some_items") # expect a list: ["a", "b"] logger.peek("an_item") # expect a string: "d" logger.reduce() logger.peek("some_items") # expect an empty list: [] logger.peek("an_item") # expect: None Logging non-scalar data ~~~~~~~~~~~~~~~~~~~~~~~ .. warning:: You may be tempted to use MetricsLogger as a vehicle to get data from one place in RLlib to another. For example, to store data between EnvRunner callbacks, or to move videos captured from the environment from EnvRunners to the Algorithm object. Handle these cases with a lot of caution; we generally advise finding other solutions. For example, callbacks can create custom attributes on EnvRunners, and you probably don't want your videos to be treated like metrics. MetricsLogger is designed and treated by RLlib as a vehicle to collect metrics from parallel components and aggregate them. It's supposed to handle metrics, and these are supposed to flow in one direction: from the parallel components to the root component. If your use case doesn't fit this pattern, consider finding another way than using MetricsLogger. :py:class:`~ray.rllib.utils.metrics.metrics_logger.MetricsLogger` isn't limited to scalar values. If you decide that you still want to use MetricsLogger to get data from one place in RLlib to another, you can use it to log images, videos, or any other complex data. For example, to log three consecutive image frames from a ``CartPole`` environment, use the ``reduce="item_series"`` argument: .. testcode:: import gymnasium as gym # `render_mode="rgb_array"` makes `env.render()` return image frames. env = gym.make("CartPole-v1", render_mode="rgb_array") # Log three consecutive render frames from the env. env.reset() logger.log_value("some_images", value=env.render(), reduce="item_series") env.step(0) logger.log_value("some_images", value=env.render(), reduce="item_series") env.step(1) logger.log_value("some_images", value=env.render(), reduce="item_series") Timers ~~~~~~ You can use :py:class:`~ray.rllib.utils.metrics.metrics_logger.MetricsLogger` as a context manager to log timer results.
You can time all your code blocks inside your custom code through a single ``with MetricsLogger.log_time(...)`` line: .. testcode:: import time from ray.rllib.utils.metrics.metrics_logger import MetricsLogger logger = MetricsLogger() # First delta measurement: with logger.log_time("my_block_to_be_timed", reduce="ema", ema_coeff=0.1): time.sleep(1.0) # EMA should be ~1sec. assert 1.1 > logger.peek("my_block_to_be_timed") > 0.9 # Second delta measurement: with logger.log_time("my_block_to_be_timed"): time.sleep(2.0) # EMA should be ~1.1sec. assert 1.15 > logger.peek("my_block_to_be_timed") > 1.05 Counters ~~~~~~~~ In case you want to count things, for example the number of environment steps taken in a sample phase, and add up those counts either over the lifetime or over some particular phase, use the ``reduce="sum"`` or ``reduce="lifetime_sum"`` argument in the call to :py:meth:`~ray.rllib.utils.metrics.metrics_logger.MetricsLogger.log_value`. .. testcode:: from ray.rllib.utils.metrics.metrics_logger import MetricsLogger logger = MetricsLogger() logger.log_value("my_counter", 50, reduce="sum") logger.log_value("my_counter", 25, reduce="sum") logger.peek("my_counter") # expect: 75 logger.reduce() logger.peek("my_counter") # expect: 0 (upon reduction, all values are cleared) If you log lifetime metrics with ``reduce="lifetime_sum"``, these get summed up over the lifetime of the experiment, even after resuming from a checkpoint. Note that you cannot meaningfully peek ``lifetime_sum`` values outside of the root MetricsLogger. Also note that the lifetime sum is summed up at the root MetricsLogger, whereas parallel components only keep the most recent values, which are cleared each time they reduce. Throughput measurements ++++++++++++++++++++++++ A metric logged with the settings ``reduce="sum"`` or ``reduce="lifetime_sum"`` can also measure throughput. The throughput is calculated once per metrics reporting cycle. This means that the throughput is always relative to the speed of the metrics reduction cycle. You can use the :py:meth:`~ray.rllib.utils.metrics.metrics_logger.MetricsLogger.peek` method to access the throughput value by passing the ``throughput=True`` flag. .. testcode:: import time from ray.rllib.utils.metrics.metrics_logger import MetricsLogger logger = MetricsLogger(root=True) for _ in range(3): logger.log_value("lifetime_sum", 5, reduce="sum", with_throughput=True) time.sleep(1.0) # Expect the throughput to be roughly 5/sec (a value of 5 logged about once per second). print(logger.peek("lifetime_sum", throughput=True)) Example 1: How to use MetricsLogger in EnvRunner callbacks ---------------------------------------------------------- To demonstrate how to use the :py:class:`~ray.rllib.utils.metrics.metrics_logger.MetricsLogger` on an :py:class:`~ray.rllib.env.env_runner.EnvRunner`, take a look at the following end-to-end example, which makes use of the :py:class:`~ray.rllib.callbacks.callbacks.RLlibCallback` API to inject custom code into the RL environment loop. The example computes the average "first-joint angle" of the `Acrobot-v1 RL environment `__ and logs the results through the :py:class:`~ray.rllib.utils.metrics.metrics_logger.MetricsLogger` API. Note that this example is :ref:`identical to the one described here `, but the focus has shifted to explain only the :py:class:`~ray.rllib.utils.metrics.metrics_logger.MetricsLogger` aspects of the code. ..
testcode:: import math import numpy as np from ray.rllib.algorithms.ppo import PPOConfig from ray.rllib.callbacks.callbacks import RLlibCallback # Define a custom RLlibCallback. class LogAcrobotAngle(RLlibCallback): def on_episode_created(self, *, episode, **kwargs): # Initialize an empty list in the `custom_data` property of `episode`. episode.custom_data["theta1"] = [] def on_episode_step(self, *, episode, env, **kwargs): # Compute the angle at every episode step and store it temporarily in episode: state = env.envs[0].unwrapped.state deg_theta1 = math.degrees(math.atan2(state[1], state[0])) episode.custom_data["theta1"].append(deg_theta1) def on_episode_end(self, *, episode, metrics_logger, **kwargs): theta1s = episode.custom_data["theta1"] avg_theta1 = np.mean(theta1s) # Log the resulting average angle - per episode - to the MetricsLogger. # Report with a sliding window of 50. metrics_logger.log_value("theta1_mean", avg_theta1, reduce="mean", window=50) config = ( PPOConfig() .environment("Acrobot-v1") .callbacks( callbacks_class=LogAcrobotAngle, ) ) ppo = config.build() # Train n times. Expect `theta1_mean` to be found in the results under: # `env_runners/theta1_mean` for i in range(10): results = ppo.train() print( f"iter={i} " f"theta1_mean={results['env_runners']['theta1_mean']} " f"R={results['env_runners']['episode_return_mean']}" ) Also take a look at this more complex example on `how to generate and log a PacMan heatmap (image) to WandB `__ here. Example 2: How to use MetricsLogger in a custom loss function ------------------------------------------------------------- You can log metrics inside your custom loss functions. Use the Learner's own ``Learner.metrics`` attribute for this. .. code-block:: @override(TorchLearner) def compute_loss_for_module(self, *, module_id, config, batch, fwd_out): ... loss_xyz = ... # Log a specific loss term. # Each learner will sum up the loss_xyz value and send it to the root MetricsLogger. self.metrics.log_value("special_loss_term", reduce="sum", value=loss_xyz) total_loss = loss_abc + loss_xyz return total_loss Take a look at this running `end-to-end example for logging custom values inside a loss function `__ here. Example 3: How to use MetricsLogger in a custom Algorithm --------------------------------------------------------- You can log metrics inside your custom Algorithm :py:meth:`~ray.rllib.algorithms.algorithm.Algorithm.training_step` method. Use the Algorithm's own ``Algorithm.metrics`` attribute for this. .. code-block:: @override(Algorithm) def training_step(self) -> None: ... # Log some value. self.metrics.log_value("some_mean_result", 1.5, reduce="mean", window=5) ... with self.metrics.log_time(("timers", "some_code")): ... # time some code See this running `end-to-end example for logging inside training_step() `__. Migrating to Ray 2.53 --------------------- If you have been using the MetricsLogger API before Ray 2.52, the following needs your attention: Most importantly: - **Metrics are now cleared once per MetricsLogger.reduce() call. Peeking them thereafter returns the zero-element for the respective reduce type (np.nan, None or an empty list).** - Control flow should be based on other variables, rather than peeking metrics. For MetricsLogger's logging methods (log_value, log_time, etc.): - The ``clear_on_reduce`` argument is deprecated. (see point above) - Using ``reduce="sum"`` and ``clear_on_reduce=False`` is now equivalent to ``reduce="lifetime_sum"``. 
- The ``throughput_ema_coeff`` argument is deprecated (we don't use EMA for throughputs anymore). - The ``reduce_per_index_on_aggregate`` argument is deprecated. All metrics are now aggregated over all values collected from leaves of any reduction cycle. Other changes: - Many metrics look noisier after upgrading to 2.52. This is mostly because they are not smoothed anymore. Smoothing should happen downstream if desired. - :py:meth:`~ray.rllib.utils.metrics.metrics_logger.MetricsLogger.aggregate` is now the only way to aggregate metrics. - You can now pass a custom stats class through ``AlgorithmConfig.reporting(custom_stats_cls_lookup={...})``. This enables you to write your own stats class with its own reduction logic. If your own stats class constitutes a fix or a valuable addition to RLlib, please consider contributing it to the project through a PR. - When aggregating metrics, we can now peek only the ones that were merged in the most recent reduction cycle with the ``latest_merged_only=True`` argument in :py:meth:`~ray.rllib.utils.metrics.metrics_logger.MetricsLogger.peek`. --- .. include:: /_includes/rllib/we_are_hiring.rst .. _rllib-multi-agent-environments-doc: Multi-Agent Environments ======================== .. include:: /_includes/rllib/new_api_stack.rst In a multi-agent environment, multiple "agents" act simultaneously, in a turn-based sequence, or through an arbitrary combination of both. For instance, in a traffic simulation, there might be multiple "car" and "traffic light" agents interacting simultaneously, whereas in a board game, two or more agents may act in a turn-based sequence. Several different policy networks may be used to control the various agents, with each agent in the environment mapping to exactly one particular policy. This mapping is determined by a user-provided function, called the "mapping function". Note that if there are ``N`` agents mapping to ``M`` policies, ``N`` is always larger than or equal to ``M``, allowing any policy to control more than one agent. .. figure:: images/envs/multi_agent_setup.svg :width: 600 :align: left **Multi-agent setup:** ``N`` agents live in the environment and take actions computed by ``M`` policy networks. The mapping from agent to policy is flexible and determined by a user-provided mapping function. Here, `agent_1` and `agent_3` both map to `policy_1`, whereas `agent_2` maps to `policy_2`. RLlib's MultiAgentEnv API ------------------------- .. hint:: This paragraph describes RLlib's own :py:class:`~ray.rllib.env.multi_agent_env.MultiAgentEnv` API, which is the recommended way of defining your own multi-agent environment logic. However, if you are already using a third-party multi-agent API, RLlib offers wrappers for :ref:`Farama's PettingZoo API ` as well as :ref:`DeepMind's OpenSpiel API `. The :py:class:`~ray.rllib.env.multi_agent_env.MultiAgentEnv` API of RLlib closely follows the conventions and APIs of `Farama's gymnasium (single-agent) `__ envs and even subclasses from `gymnasium.Env`. However, instead of returning individual observations, rewards, and termination/truncation flags from `reset()` and `step()`, a custom :py:class:`~ray.rllib.env.multi_agent_env.MultiAgentEnv` implementation returns separate dictionaries for observations, rewards, etc., where each dictionary maps agent IDs to the corresponding values for each agent. Here is a first draft of an example :py:class:`~ray.rllib.env.multi_agent_env.MultiAgentEnv` implementation: ..
code-block:: from ray.rllib.env.multi_agent_env import MultiAgentEnv class MyMultiAgentEnv(MultiAgentEnv): def __init__(self, config=None): super().__init__() ... def reset(self, *, seed=None, options=None): ... # return observation dict and infos dict. return {"agent_1": [obs of agent_1], "agent_2": [obs of agent_2]}, {} def step(self, action_dict): # return observation dict, rewards dict, termination/truncation dicts, and infos dict return {"agent_1": [obs of agent_1]}, {...}, ... Agent Definitions ~~~~~~~~~~~~~~~~~ The number of agents in your environment and their IDs are entirely controlled by your :py:class:`~ray.rllib.env.multi_agent_env.MultiAgentEnv` code. Your env decides which agents start after an episode reset, which agents enter the episode at a later point, which agents terminate the episode early, and which agents stay in the episode until the entire episode ends. To define which agent IDs might ever show up in your episodes, set the `self.possible_agents` attribute to a list of all possible agent IDs. .. code-block:: def __init__(self, config=None): super().__init__() ... # Define all agent IDs that might ever show up in your episodes. self.possible_agents = ["agent_1", "agent_2"] ... In case your environment only starts with a subset of agent IDs and/or terminates some agent IDs before the end of the episode, you also need to keep adjusting the `self.agents` attribute accordingly throughout the course of your episode. If, on the other hand, all agent IDs are static throughout your episodes, you can set `self.agents` to be the same as `self.possible_agents` and never change its value throughout the rest of your code: .. code-block:: def __init__(self, config=None): super().__init__() ... # If your agents never change throughout the episode, set # `self.agents` to the same list as `self.possible_agents`. self.agents = self.possible_agents = ["agent_1", "agent_2"] # Otherwise, you will have to adjust `self.agents` in `reset()` and `step()` to whatever the # currently "alive" agents are. ... Observation- and Action Spaces ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Next, you should set the observation and action spaces of each (possible) agent ID in your constructor. Use the `self.observation_spaces` and `self.action_spaces` attributes to define dictionaries mapping agent IDs to the individual agents' spaces. For example: .. code-block:: import gymnasium as gym import numpy as np ... def __init__(self, config=None): super().__init__() ... self.observation_spaces = { "agent_1": gym.spaces.Box(-1.0, 1.0, (4,), np.float32), "agent_2": gym.spaces.Box(-1.0, 1.0, (3,), np.float32), } self.action_spaces = { "agent_1": gym.spaces.Discrete(2), "agent_2": gym.spaces.Box(0.0, 1.0, (1,), np.float32), } ... In case your episodes host a lot of agents, some of which share the same observation or action spaces, and you don't want to create very large space dicts, you can also override the :py:meth:`~ray.rllib.env.multi_agent_env.MultiAgentEnv.get_observation_space` and :py:meth:`~ray.rllib.env.multi_agent_env.MultiAgentEnv.get_action_space` methods and implement the mapping logic from agent ID to space yourself. For example: ..
code-block:: def get_observation_space(self, agent_id): if agent_id.startswith("robot_"): return gym.spaces.Box(0, 255, (84, 84, 3), np.uint8) elif agent_id.startswith("decision_maker"): return gym.spaces.Discrete(2) else: raise ValueError(f"bad agent id: {agent_id}!") Observation-, Reward-, and Termination Dictionaries ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The remaining two things you need to implement in your custom :py:class:`~ray.rllib.env.multi_agent_env.MultiAgentEnv` are the `reset()` and `step()` methods. As with a single-agent `gymnasium.Env `__, you have to return observations and infos from `reset()`, and observations, rewards, termination/truncation flags, and infos from `step()`. However, instead of individual values, these all have to be dictionaries mapping agent IDs to the respective individual agents' values. Let's take a look at an example `reset()` implementation first: .. code-block:: def reset(self, *, seed=None, options=None): ... return { "agent_1": np.array([0.0, 1.0, 0.0, 0.0]), "agent_2": np.array([0.0, 0.0, 1.0]), }, {} # <- empty info dict Here, your episode starts with both agents in it, and both are expected to compute and send actions for the following `step()` call. In general, the returned observations dict must contain those agents (and only those agents) that should act next. Agent IDs that should NOT act in the next `step()` call must NOT have their observations in the returned observations dict. .. figure:: images/envs/multi_agent_episode_simultaneous.svg :width: 600 :align: left **Env with simultaneously acting agents:** Both agents receive their observations at each time step, including right after `reset()`. Note that an agent must compute and send an action into the next `step()` call whenever an observation is present for that agent in the returned observations dict. Note that the rule that observation dicts determine the exact order of agent moves doesn't apply to reward dicts or termination/truncation dicts, all of which may contain any agent ID at any time step, regardless of whether that agent ID is expected to act in the next `step()` call. This is so that an action taken by agent A may trigger a reward for agent B, even though agent B currently isn't acting itself. The same is true for termination flags: Agent A may act in a way that terminates agent B from the episode without agent B having acted itself. .. note:: Use the special agent ID `__all__` in the termination dicts and/or truncation dicts to indicate that the episode should end for all agent IDs, regardless of which agents are still active at that point. RLlib automatically terminates all agents in this case and ends the episode. In summary, the exact order and synchronization of agent actions in your multi-agent episode is determined by the agent IDs contained in (or missing from) your observations dicts. Only those agent IDs that are expected to compute and send actions into the next `step()` call must be part of the returned observation dict. .. figure:: images/envs/multi_agent_episode_turn_based.svg :width: 600 :align: left **Env with agents taking turns:** The two agents act by taking alternating turns. `agent_1` receives the first observation after the `reset()` and thus has to compute and send an action first. Upon receiving this action, the env responds with an observation for `agent_2`, who now has to act. After receiving the action for `agent_2`, the env returns the next observation for `agent_1`, and so on.
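Before moving on, here is a minimal, hypothetical sketch of such a turn-based env. The class name ``CoinFlipDuel`` and its trivial game logic are made up purely to illustrate the observation-dict rule and aren't part of RLlib:

.. code-block:: python

    import gymnasium as gym
    import numpy as np

    from ray.rllib.env.multi_agent_env import MultiAgentEnv


    class CoinFlipDuel(MultiAgentEnv):
        """Two players alternate guessing a coin flip (illustration only)."""

        def __init__(self, config=None):
            super().__init__()
            self.agents = self.possible_agents = ["player1", "player2"]
            # Each player observes the outcome of the previous flip (0 or 1) ...
            self.observation_spaces = {aid: gym.spaces.Discrete(2) for aid in self.agents}
            # ... and guesses the next flip (0 or 1).
            self.action_spaces = {aid: gym.spaces.Discrete(2) for aid in self.agents}
            self.current_player = "player1"
            self.last_flip = 0
            self.num_moves = 0

        def reset(self, *, seed=None, options=None):
            self.current_player = "player1"
            self.last_flip = 0
            self.num_moves = 0
            # Only `player1` is in the observation dict, so only `player1` acts next.
            return {"player1": self.last_flip}, {}

        def step(self, action_dict):
            acting = self.current_player
            guess = action_dict[acting]
            self.last_flip = int(np.random.randint(2))
            self.num_moves += 1
            # Reward dicts may mention any agent at any time step; here, only the
            # acting player receives a reward for a correct or wrong guess.
            rewards = {acting: 1.0 if guess == self.last_flip else -1.0}
            # Flip the turn: only the next player's observation goes into the dict,
            # so only that player computes an action in the next `step()` call.
            self.current_player = "player2" if acting == "player1" else "player1"
            obs = {self.current_player: self.last_flip}
            terminateds = {"__all__": self.num_moves >= 10}
            truncateds = {"__all__": False}
            return obs, rewards, terminateds, truncateds, {}

The only thing that controls who acts next is which agent IDs appear in the returned observations dict.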
This simple rule allows you to design any type of multi-agent environment, from turn-based games to environments where all agents always act simultaneously, to any arbitrarily complex combination of these two patterns: .. figure:: images/envs/multi_agent_episode_complex_order.svg :width: 600 :align: left **Env with a complex order of turns:** Three agents act in a seemingly chaotic order. `agent_1` and `agent_3` receive their initial observation after the `reset()` and thus has to compute and send actions first. Upon receiving these two actions, the env responds with an observation for `agent_1` and `agent_2`, who now have to act simultaneously. After receiving the actions for `agent_1` and `agent_2`, observations for `agent_2` and `agent_3` are returned and so on and so forth. Let's take a look at two specific, complete :py:class:`~ray.rllib.env.multi_agent_env.MultiAgentEnv` example implementations, one where agents always act simultaneously and one where agents act in a turn-based sequence. Example: Environment with Simultaneously Stepping Agents ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ A good and simple example for a multi-agent env, in which all agents always step simultaneously is the Rock-Paper-Scissors game, in which two agents have to play N moves altogether, each choosing between the actions "Rock", "Paper", or "Scissors". After each move, the action choices are compared. Rock beats Scissors, Paper beats Rock, and Scissors beats Paper. The player winning the move receives a +1 reward, the losing player -1. Here is the initial class scaffold for your Rock-Paper-Scissors Game: .. literalinclude:: ../../../rllib/examples/envs/classes/multi_agent/rock_paper_scissors.py :language: python :start-after: __sphinx_doc_1_begin__ :end-before: __sphinx_doc_1_end__ .. literalinclude:: ../../../rllib/examples/envs/classes/multi_agent/rock_paper_scissors.py :language: python :start-after: __sphinx_doc_2_begin__ :end-before: __sphinx_doc_2_end__ Next, you can implement the constructor of your class: .. literalinclude:: ../../../rllib/examples/envs/classes/multi_agent/rock_paper_scissors.py :language: python :start-after: __sphinx_doc_3_begin__ :end-before: __sphinx_doc_3_end__ Note that we specify `self.agents = self.possible_agents` in the constructor to indicate that the agents don't change over the course of an episode and stay fixed at `[player1, player2]`. The `reset` logic is to simply add both players in the returned observations dict (both players are expected to act simultaneously in the next `step()` call) and reset a `num_moves` counter that keeps track of the number of moves being played in order to terminate the episode after exactly 10 timesteps (10 actions by either player): .. literalinclude:: ../../../rllib/examples/envs/classes/multi_agent/rock_paper_scissors.py :language: python :start-after: __sphinx_doc_4_begin__ :end-before: __sphinx_doc_4_end__ Finally, your `step` method should handle the next observations (each player observes the action the opponent just chose), the rewards (+1 or -1 according to the winner/loser rules explained above), and the termination dict (you set the special `__all__` agent ID to `True` iff the number of moves has reached 10). The truncateds- and infos dicts always remain empty: .. 
literalinclude:: ../../../rllib/examples/envs/classes/multi_agent/rock_paper_scissors.py :language: python :start-after: __sphinx_doc_5_begin__ :end-before: __sphinx_doc_5_end__ `See here `__ for a complete end-to-end example script showing how to run a multi-agent RLlib setup against your `RockPaperScissors` env. Example: Turn-Based Environments ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Let's now walk through another multi-agent env example implementation, but this time you implement a turn-based game, in which you have two players (A and B), where A starts the game, then B makes a move, then again A, and so on and so forth. We implement the famous Tic-Tac-Toe game (with one slight aberration), played on a 3x3 field. Each player adds one of their pieces to the field at a time. Pieces can't be moved once placed. The player that first completes one row (horizontal, diagonal, or vertical) wins the game and receives +1 reward. The losing player receives a -1 reward. To make the implementation easier, the aberration from the original game is that trying to place a piece on an already occupied field results in the board not changing at all, but the moving player receiving a -5 reward as a penalty (in the original game, this move is simply not allowed and therefore can never happen). Here is your initial class scaffold for the Tic-Tac-Toe game: .. literalinclude:: ../../../rllib/examples/envs/classes/multi_agent/tic_tac_toe.py :language: python :start-after: __sphinx_doc_1_begin__ :end-before: __sphinx_doc_1_end__ In your constructor, make sure you define all possible agent IDs that can ever show up in your game ("player1" and "player2"), the currently active agent IDs (same as all possible agents), and each agent's observation- and action space. .. literalinclude:: ../../../rllib/examples/envs/classes/multi_agent/tic_tac_toe.py :language: python :start-after: __sphinx_doc_2_begin__ :end-before: __sphinx_doc_2_end__ Now let's implement your `reset()` method, in which you empty the board (set it to all 0s), pick a random start player, and return this start player's first observation. Note that you don't return the other player's observation as this player isn't acting next. .. literalinclude:: ../../../rllib/examples/envs/classes/multi_agent/tic_tac_toe.py :language: python :start-after: __sphinx_doc_3_begin__ :end-before: __sphinx_doc_3_end__ From here on, in each `step()`, you always flip between the two agents (you use the `self.current_player` attribute for keeping track) and return only the current agent's observation, because that's the player you want to act next. You also compute the both agents' rewards based on three criteria: Did the current player win (the opponent lost)? Did the current player place a piece on an already occupied field (gets penalized)? Is the game done because the board is full (both agents receive 0 reward)? .. literalinclude:: ../../../rllib/examples/envs/classes/multi_agent/tic_tac_toe.py :language: python :start-after: __sphinx_doc_4_begin__ :end-before: __sphinx_doc_4_end__ Grouping Agents ~~~~~~~~~~~~~~~ It is common to have groups of agents in multi-agent RL, where each group is treated like a single agent with Tuple action- and observation spaces (one item in the tuple for each individual agent in the group). Such a group of agents can then be assigned to a single policy for centralized execution, or to specialized multi-agent policies that implement centralized training, but decentralized execution. 
You can use the :py:meth:`~ray.rllib.env.multi_agent_env.MultiAgentEnv.with_agent_groups` method to define these groups: .. literalinclude:: ../../../rllib/env/multi_agent_env.py :language: python :start-after: __grouping_doc_begin__ :end-before: __grouping_doc_end__ For environments with multiple groups, or mixtures of agent groups and individual agents, you can use grouping in conjunction with the policy mapping API described in prior sections. Third Party Multi-Agent Env APIs -------------------------------- Besides RLlib's own :py:class:`~ray.rllib.env.multi_agent_env.MultiAgentEnv` API, you can also use various third-party APIs and libraries to implement custom multi-agent envs. .. _farama-pettingzoo-api: Farama PettingZoo ~~~~~~~~~~~~~~~~~ `PettingZoo `__ offers a repository of over 50 diverse multi-agent environments, directly compatible with RLlib through the built-in :py:class:`~ray.rllib.env.wrappers.pettingzoo_env.PettingZooEnv` wrapper: .. testcode:: from pettingzoo.butterfly import pistonball_v6 from ray.rllib.algorithms.ppo import PPOConfig from ray.rllib.env.wrappers.pettingzoo_env import PettingZooEnv from ray.tune.registry import register_env register_env( "pistonball", lambda cfg: PettingZooEnv(pistonball_v6.env(n_pistons=cfg.get("n_pistons", 20))), ) config = ( PPOConfig() .environment("pistonball", env_config={"n_pistons": 30}) ) See `this example script here `__ for an end-to-end example with the `water world env `__. Also, `see here for an example on the pistonball env with RLlib `__. .. _deepmind-openspiel-api: DeepMind OpenSpiel ~~~~~~~~~~~~~~~~~~ The `OpenSpiel API by DeepMind `__ is a comprehensive framework designed for research and development in multi-agent reinforcement learning, game theory, and decision-making. The API is directly compatible with RLlib through the built-in :py:class:`~ray.rllib.env.wrappers.open_spiel.OpenSpielEnv` wrapper: .. testcode:: import pyspiel # pip install open_spiel from ray.rllib.algorithms.ppo import PPOConfig from ray.rllib.env.wrappers.open_spiel import OpenSpielEnv from ray.tune.registry import register_env register_env( "open_spiel_env", lambda cfg: OpenSpielEnv(pyspiel.load_game("connect_four")), ) config = PPOConfig().environment("open_spiel_env") See here for an `end-to-end example with the Connect-4 env `__ of OpenSpiel trained by an RLlib algorithm, using a self-play strategy. Running actual Training Experiments with a MultiAgentEnv -------------------------------------------------------- If all agents use the same algorithm class to train their policies, configure multi-agent training as follows: .. code-block:: python import random from ray.rllib.algorithms.ppo import PPOConfig from ray.rllib.core.rl_module.multi_rl_module import MultiRLModuleSpec from ray.rllib.core.rl_module.rl_module import RLModuleSpec config = ( PPOConfig() .environment(env="my_multiagent_env") .multi_agent( policy_mapping_fn=lambda agent_id, episode, **kwargs: ( "traffic_light" if agent_id.startswith("traffic_light_") else random.choice(["car1", "car2"]) ), algorithm_config_overrides_per_module={ "car1": PPOConfig.overrides(gamma=0.85), "car2": PPOConfig.overrides(lr=0.00001), }, ) .rl_module( rl_module_spec=MultiRLModuleSpec(rl_module_specs={ "car1": RLModuleSpec(), "car2": RLModuleSpec(), "traffic_light": RLModuleSpec(), }), ) ) algo = config.build() print(algo.train()) To exclude certain policies from being updated, use the ``config.multi_agent(policies_to_train=[..])`` config setting.
This allows running in multi-agent environments with a mix of non-learning and learning policies: .. code-block:: python from ray.rllib.examples.rl_modules.classes.random_rlm import RandomRLModule def policy_mapping_fn(agent_id, episode, **kwargs): agent_idx = int(agent_id[-1]) - 1 # 0 (player1) or 1 (player2) return "learning_policy" if hash(episode.id_) % 2 == agent_idx else "random_policy" config = ( PPOConfig() .environment(env="two_player_game") .multi_agent( policy_mapping_fn=policy_mapping_fn, policies_to_train=["learning_policy"], ) .rl_module( rl_module_spec=MultiRLModuleSpec(rl_module_specs={ "learning_policy": RLModuleSpec(), "random_policy": RLModuleSpec(module_class=RandomRLModule), }), ) ) algo = config.build() print(algo.train()) RLlib creates the policies and routes each agent's decisions to its policy based on the provided ``policy_mapping_fn``. Training statistics for each policy are reported separately in the result dict returned by ``train()``. The example scripts `rock_paper_scissors_heuristic_vs_learned.py `__ and `rock_paper_scissors_learned_vs_learned.py `__ demonstrate competing policies with heuristic and learned strategies. Scaling to Many MultiAgentEnvs per EnvRunner ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. note:: Unlike for single-agent environments, multi-agent setups are not vectorizable yet. The Ray team is working on a solution for this restriction by utilizing the `gymnasium >= 1.x` custom vectorization feature. Variable-Sharing Between Policies ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ RLlib supports variable-sharing across policies. See the `PettingZoo parameter sharing example `__ for details. --- .. include:: /_includes/rllib/we_are_hiring.rst .. _rllib-new-api-stack-migration-guide: .. testcode:: :hide: from ray.rllib.algorithms.ppo import PPOConfig config = PPOConfig() New API stack migration guide ============================= .. include:: /_includes/rllib/new_api_stack.rst This page explains, step by step, how to convert and translate your existing old API stack RLlib classes and code to RLlib's new API stack. What's the new API stack? -------------------------- The new API stack is the result of re-writing the core RLlib APIs from scratch and reducing user-facing classes from more than a dozen critical ones down to only a handful of classes, without any loss of features. When designing these new interfaces, the Ray Team strictly applied the following principles: * Classes must be usable outside of RLlib. * Separation of concerns. Try to answer: "**What** should get done **when** and **by whom**?" and give each class as few non-overlapping and clearly defined tasks as possible. * Offer fine-grained modularity, full interoperability, and frictionless pluggability of classes. * Use widely accepted third-party standards and APIs wherever possible. Applying the preceding principles, the Ray Team reduced the important **must-know** classes for the average RLlib user from eight on the old stack, to only five on the new stack.
The **core** new API stack classes are: * :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule`, which replaces ``ModelV2`` and ``PolicyMap`` APIs * :py:class:`~ray.rllib.core.learner.learner.Learner`, which replaces ``RolloutWorker`` and some of ``Policy`` * :py:class:`~ray.rllib.env.single_agent_episode.SingleAgentEpisode` and :py:class:`~ray.rllib.env.multi_agent_episode.MultiAgentEpisode`, which replace ``ViewRequirement``, ``SampleCollector``, ``Episode``, and ``EpisodeV2`` * :py:class:`~ray.rllib.connector.connector_v2.ConnectorV2`, which replaces ``Connector`` and some of ``RolloutWorker`` and ``Policy`` The :py:class:`~ray.rllib.algorithm.algorithm_config.AlgorithmConfig` and :py:class:`~ray.rllib.algorithm.algorithm.Algorithm` APIs remain as-is. These classes are already established APIs on the old stack. .. note:: Even though the new API stack still provides rudimentary support for `TensorFlow `__, RLlib supports a single deep learning framework, the `PyTorch `__ framework, dropping TensorFlow support entirely. Note, though, that the Ray team continues to design RLlib to be framework-agnostic and may add support for additional frameworks in the future. Check your AlgorithmConfig -------------------------- RLlib turns on the new API stack by default for all RLlib algorithms. .. note:: To **deactivate** the new API stack and switch back to the old one, use the `api_stack()` method in your `AlgorithmConfig` object like so: .. testcode:: config.api_stack( enable_rl_module_and_learner=False, enable_env_runner_and_connector_v2=False, ) Note that there are a few other differences between configuring an old API stack algorithm and its new stack counterpart. Go through the following sections and make sure you're translating the respective settings. Remove settings that the new stack doesn't support or need. AlgorithmConfig.framework() ~~~~~~~~~~~~~~~~~~~~~~~~~~~ Even though the new API stack still provides rudimentary support for `TensorFlow `__, RLlib supports a single deep learning framework, the `PyTorch `__ framework. The new API stack deprecates the following framework-related settings: .. testcode:: # Make sure you always set the framework to "torch"... config.framework("torch") # ... and drop all tf-specific settings. config.framework( eager_tracing=True, eager_max_retraces=20, tf_session_args={}, local_tf_session_args={}, ) AlgorithmConfig.resources() ~~~~~~~~~~~~~~~~~~~~~~~~~~~ The Ray team deprecated the ``num_gpus`` and ``_fake_gpus`` settings. To place your RLModule on one or more GPUs on the Learner side, do the following: .. testcode:: # The following setting is equivalent to the old stack's `config.resources(num_gpus=2)`. config.learners( num_learners=2, num_gpus_per_learner=1, ) .. hint:: The `num_learners` setting determines how many remote :py:class:`~ray.rllib.core.learner.learner.Learner` workers there are in your Algorithm's :py:class:`~ray.rllib.core.learner.learner_group.LearnerGroup`. If you set this parameter to ``0``, your LearnerGroup only contains a **local** Learner that runs on the main process and shares its compute resources, typically 1 CPU. For asynchronous algorithms like IMPALA or APPO, this setting should therefore always be >0. `See here for an example on how to train with fractional GPUs `__. Also note that for fractional GPUs, you should always set `num_learners` to 0 or 1. If GPUs aren't available, but you want to learn with more than one :py:class:`~ray.rllib.core.learner.learner.Learner` in a multi-**CPU** fashion, you can do the following: .. 
testcode::

    config.learners(
        num_learners=2,  # or >2
        num_cpus_per_learner=1,  # <- default
        num_gpus_per_learner=0,  # <- default
    )

The Ray team renamed the setting ``num_cpus_for_local_worker`` to ``num_cpus_for_main_process``.

.. testcode::

    config.resources(num_cpus_for_main_process=0)  # default is 1

AlgorithmConfig.training()
~~~~~~~~~~~~~~~~~~~~~~~~~~

Train batch size
................

Due to the new API stack's :py:class:`~ray.rllib.core.learner.learner.Learner` worker architecture, training may happen in distributed fashion over ``n`` :py:class:`~ray.rllib.core.learner.learner.Learner` workers, so provide the train batch size per individual :py:class:`~ray.rllib.core.learner.learner.Learner`. Don't use the ``train_batch_size`` setting any longer:

.. testcode::

    config.training(
        train_batch_size_per_learner=512,
    )

You don't need to change this setting, even when increasing the number of :py:class:`~ray.rllib.core.learner.learner.Learner` workers through `config.learners(num_learners=...)`. Note that a good rule of thumb for scaling on the learner axis is to keep the `train_batch_size_per_learner` value constant with a growing number of Learners and to increase the learning rate as follows:

`lr = [original_lr] * ([num_learners] ** 0.5)`

Neural network configuration
............................

The old stack's `config.training(model=...)` is no longer supported on the new API stack. Instead, use the new :py:meth:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig.rl_module` method to configure RLlib's default :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` or specify and configure a custom :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule`. See :ref:`RLModules API `, a general guide that also explains the use of the `config.rl_module()` method. If you have an old stack `ModelV2` and want to migrate the entire NN logic to the new stack, see :ref:`ModelV2 to RLModule ` for migration instructions.

Learning rate and coefficient schedules
.......................................

If you're using schedules for the learning rate or other coefficients, for example, the `entropy_coeff` setting in PPO, provide the scheduling information directly in the respective setting. Scheduling behavior doesn't require a specific, separate setting anymore. When defining a schedule, provide a list of 2-tuples, where the first item is the global timestep (*num_env_steps_sampled_lifetime* in the reported metrics) and the second item is the value that the setting should reach at that timestep. Always start the first 2-tuple with timestep 0. Note that RLlib linearly interpolates values between two provided timesteps.

For example, to create a learning rate schedule that starts with a value of 1e-5, then increases over 1M timesteps to 1e-4 and stays constant after that, do the following:

.. testcode::

    config.training(
        lr=[
            [0, 1e-5],  # <- initial value at timestep 0
            [1000000, 1e-4],  # <- final value at 1M timesteps
        ],
    )

In the preceding example, the value after 500k timesteps is roughly `5e-5` from linear interpolation.

For another example, to create an entropy coefficient schedule that starts with a value of 0.05, increases to 0.1 over 1M timesteps, and then suddenly drops to 0.0 right after the 1Mth timestep, do the following:
.. testcode::

    config.training(
        entropy_coeff=[
            [0, 0.05],  # <- initial value at timestep 0
            [1000000, 0.1],  # <- value at 1M timesteps
            [1000001, 0.0],  # <- sudden drop to 0.0 right after 1M timesteps
        ]
    )

In case you need to configure a more complex learning rate scheduling behavior or chain different schedulers into a pipeline, you can use the experimental `_torch_lr_schedule_classes` config property. See `this example script `__ for more details. Note that this example only covers learning rate schedules, but not any other coefficients.

AlgorithmConfig.learners()
~~~~~~~~~~~~~~~~~~~~~~~~~~

This method isn't used on the old API stack because the old stack doesn't use Learner workers. It allows you to specify:

#. the number of `Learner` workers through `.learners(num_learners=...)`.
#. the resources per Learner; use `.learners(num_gpus_per_learner=1)` for GPU training and `.learners(num_gpus_per_learner=0)` for CPU training.
#. the custom Learner class you want to use. See this `example `__ for more details.
#. a config dict you would like to set for your custom Learner: `.learners(learner_config_dict={...})`. Note that every `Learner` has access to the entire `AlgorithmConfig` object through `self.config`, but setting the `learner_config_dict` is a convenient way to avoid having to create an entirely new `AlgorithmConfig` subclass only to support a few extra settings for your custom `Learner` class.

AlgorithmConfig.env_runners()
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. testcode::

    # RolloutWorkers have been replaced by EnvRunners. EnvRunners are more efficient and offer
    # a better separation-of-concerns design and cleaner code.
    config.env_runners(
        num_env_runners=2,  # use this instead of `num_workers`
    )

    # The following `env_runners` settings are deprecated and should no longer be explicitly
    # set on the new stack:
    config.env_runners(
        create_env_on_local_worker=False,
        sample_collector=None,
        enable_connectors=True,
        remote_worker_envs=False,
        remote_env_batch_wait_ms=0,
        preprocessor_pref="deepmind",
        enable_tf1_exec_eagerly=False,
        sampler_perf_stats_ema_coef=None,
    )

.. hint::

    If you want to IDE-debug what's going on inside your `EnvRunners`, set `num_env_runners=0` and make sure you are running your experiment locally and not through Ray Tune. In order to do this with any of RLlib's `example `__ or `tuned_example `__ scripts, simply set the command line args: `--no-tune --num-env-runners=0`.

In case you were using the `observation_filter` setting, perform the following translations:

.. testcode::

    # For `observation_filter="NoFilter"`, don't set anything in particular. This is the default.

    # For `observation_filter="MeanStdFilter"`, do the following:
    from ray.rllib.connectors.env_to_module import MeanStdFilter

    config.env_runners(
        env_to_module_connector=lambda env: MeanStdFilter(multi_agent=False),  # <- or True
    )

.. hint::

    The main switch for whether to explore or not during sample collection has moved to the :py:meth:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig.env_runners` method. See :ref:`here for more details `.

..
_rllib-algo-config-exploration-docs: AlgorithmConfig.exploration() ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The main switch for whether to explore or not during sample collection has moved from the deprecated ``AlgorithmConfig.exploration()`` method to :py:meth:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig.env_runners`: It determines whether the method your :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` calls inside the :py:class:`~ray.rllib.env.env_runner.EnvRunner` is either :py:meth:`~ray.rllib.core.rl_module.rl_module.RLModule._forward_exploration`, in the case `explore=True`, or :py:meth:`~ray.rllib.core.rl_module.rl_module.RLModule._forward_inference`, in the case `explore=False`. .. testcode:: config.env_runners(explore=True) # <- or False The Ray team has deprecated the ``exploration_config`` setting. Instead, define the exact exploratory behavior, for example, sample an action from a distribution, inside the overridden :py:meth:`~ray.rllib.core.rl_module.rl_module.RLModule._forward_exploration` method of your :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule`. Custom callbacks ---------------- If you're using custom callbacks on the old API stack, you're subclassing the ``DefaultCallbacks`` class, which the Ray team renamed to :py:class:`~ray.rllib.callbacks.callbacks.RLlibCallback`. You can continue this approach with the new API stack and pass your custom subclass to your config like the following: .. testcode:: # config.callbacks(YourCallbacksClass) However, if you're overriding those methods that triggered on the :py:class:`~ray.rllib.env.env_runner.EnvRunner` side, for example, ``on_episode_start/stop/step/etc...``, you may have to translate some call arguments. The following is a one-to-one translation guide for these types of :py:class:`~ray.rllib.callbacks.callbacks.RLlibCallback` methods: .. testcode:: from ray.rllib.callbacks.callbacks import RLlibCallback class YourCallbacksClass(RLlibCallback): def on_episode_start( self, *, episode, env_runner, metrics_logger, env, env_index, rl_module, # Old API stack args; don't use or access these inside your method code. worker=None, base_env=None, policies=None, **kwargs, ): # The `SingleAgentEpisode` or `MultiAgentEpisode` that RLlib has just started. # See https://docs.ray.io/en/latest/rllib/single-agent-episode.html for more details: print(episode) # The `EnvRunner` class that collects the episode in question. # This class used to be a `RolloutWorker`. On the new stack, this class is either a # `SingleAgentEnvRunner` or a `MultiAgentEnvRunner` holding the gymnasium Env, # the RLModule, and the 2 connector pipelines, env-to-module and module-to-env. print(env_runner) # The MetricsLogger object on the EnvRunner (documentation is a WIP). print(metrics_logger.peek("episode_return_mean", default=0.0)) # The gymnasium env that sample collection uses. Note that this env may be a # gymnasium.vector.VectorEnv. print(env) # The env index, in case of a vector env, that handles the `episode`. print(env_index) # The RL Module that this EnvRunner uses. Note that this module may be a "plain", single-agent # `RLModule`, or a `MultiRLModule` in the multi-agent case. print(rl_module) # Change similarly: # on_episode_created() # on_episode_step() # on_episode_end() The following callback methods are no longer available on the new API stack: * ``on_sub_environment_created()``: The new API stack uses `Farama's gymnasium `__ vector Envs leaving no control for RLlib to call a callback on each individual env-index's creation. 
* ``on_create_policy()``: This method is no longer available on the new API stack because only ``RolloutWorker`` calls it. * ``on_postprocess_trajectory()``: The new API stack no longer triggers and calls this method because :py:class:`~ray.rllib.connectors.connector_v2.ConnectorV2` pipelines handle trajectory processing entirely. The documentation for :py:class:`~ray.rllib.connectors.connector_v2.ConnectorV2` is under development. .. See :ref:`` for a detailed description of RLlib callback APIs. TODO (sven): ref doesn't work for some weird reason. Getting: undefined label: '' .. _rllib-modelv2-to-rlmodule: ModelV2 to RLModule ------------------- If you're using a custom ``ModelV2`` class and want to translate the entire NN architecture and possibly action distribution logic to the new API stack, see :ref:`RL Modules ` in addition to this section. Also, see these example scripts on `how to write a custom CNN-containing RL Module `__ and `how to write a custom LSTM-containing RL Module `__. There are various options for translating an existing, custom ``ModelV2`` from the old API stack, to the new API stack's :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule`: #. Move your ModelV2 code to a new, custom `RLModule` class. See :ref:`RL Modules ` for details). #. Use an Algorithm checkpoint or a Policy checkpoint that you have from an old API stack training run and use this checkpoint with the `new stack RL Module convenience wrapper `__. #. Use an existing :py:class:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig` object from an old API stack training run, with the `new stack RL Module convenience wrapper `__. In more complex scenarios, you might've implemented custom policies, such that you could modify the behavior of constructing models and distributions. Translating Policy.compute_actions_from_input_dict ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ This old API stack method, as well as ``compute_actions`` and ``compute_single_action``, directly translate to :py:meth:`~ray.rllib.core.rl_module.rl_module.RLModule._forward_inference` and :py:meth:`~ray.rllib.core.rl_module.rl_module.RLModule._forward_exploration`. :ref:`The RLModule guide explains how to implement this method `. Translating Policy.action_distribution_fn ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ To translate ``action_distribution_fn``, write the following custom RLModule code: .. tab-set:: .. tab-item:: Same action dist. class .. testcode:: :skipif: True from ray.rllib.models.torch.torch_distributions import YOUR_DIST_CLASS class MyRLModule(TorchRLModule): def setup(self): ... # Set the following attribute at the end of your custom `setup()`. self.action_dist_cls = YOUR_DIST_CLASS .. tab-item:: Different action dist. classes .. testcode:: :skipif: True from ray.rllib.models.torch.torch_distributions import ( YOUR_INFERENCE_DIST_CLASS, YOUR_EXPLORATION_DIST_CLASS, YOUR_TRAIN_DIST_CLASS, ) def get_inference_action_dist_cls(self): return YOUR_INFERENCE_DIST_CLASS def get_exploration_action_dist_cls(self): return YOUR_EXPLORATION_DIST_CLASS def get_train_action_dist_cls(self): return YOUR_TRAIN_DIST_CLASS Translating Policy.action_sampler_fn ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ To translate ``action_sampler_fn``, write the following custom RLModule code: .. testcode:: :skipif: True from ray.rllib.models.torch.torch_distributions import YOUR_DIST_CLASS class MyRLModule(TorchRLModule): def _forward_exploration(self, batch): computation_results = ... 
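        # (Illustrative assumption) `computation_results` are the inputs to your
        # distribution class, for example, logits your own network computes from the
        # observations, e.g. `computation_results = self._net(batch[Columns.OBS])`,
        # where `self._net` is a network you defined in `setup()`.
        # `Columns` used below comes from `ray.rllib.core.columns`.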
        my_dist = YOUR_DIST_CLASS(computation_results)
        actions = my_dist.sample()
        return {Columns.ACTIONS: actions}

    # Maybe for inference, you would like to sample from the deterministic version
    # of your distribution:
    def _forward_inference(self, batch):
        computation_results = ...
        my_dist = YOUR_DIST_CLASS(computation_results)
        greedy_actions = my_dist.to_deterministic().sample()
        return {Columns.ACTIONS: greedy_actions}

Policy.compute_log_likelihoods
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Implement your custom RLModule's :py:meth:`~ray.rllib.core.rl_module.rl_module.RLModule._forward_train` method and return the ``Columns.ACTION_LOGP`` key together with the corresponding action log probabilities to pass this information to your loss functions, which your code calls after `forward_train()`. The loss logic can then access `Columns.ACTION_LOGP`.

Custom loss functions and policies
----------------------------------

If you're using one or more custom loss functions or custom (PyTorch) optimizers to train your models, instead of doing these customizations inside the old stack's Policy class, you need to move the logic into the new API stack's :py:class:`~ray.rllib.core.learner.learner.Learner` class. See :ref:`Learner ` for details on how to write a custom Learner.

The following example scripts show how to write:

- `a simple custom loss function `__
- `a custom Learner with 2 optimizers and different learning rates for each `__.

Note that the new API stack doesn't support the Policy class. In the old stack, this class holds a neural network, which the :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` replaces in the new API stack, an old stack connector, which the :py:class:`~ray.rllib.connectors.connector_v2.ConnectorV2` replaces, and one or more optimizers and losses, which the :py:class:`~ray.rllib.core.learner.learner.Learner` class now manages.

The RL Module API is much more flexible than the old stack's Policy API and provides a cleaner separation-of-concerns experience. Things related to action inference run on the EnvRunners, and things related to updating run on the Learner workers. It also provides superior scalability, allowing training in a multi-GPU setup in any Ray cluster, and multi-node with multi-GPU training on the `Anyscale `__ platform.

Custom connectors
-----------------

If you're using custom connectors from the old API stack, move your logic into the new :py:class:`~ray.rllib.connectors.connector_v2.ConnectorV2` API. Translate your agent connectors into env-to-module ConnectorV2 pieces and your action connectors into module-to-env ConnectorV2 pieces. The :py:class:`~ray.rllib.connectors.connector_v2.ConnectorV2` documentation is under development.

The following are some examples of how to write ConnectorV2 pieces for the different pipelines:

#. `Observation frame-stacking `__.
#. `Add the most recent action and reward to the RL Module's input `__.
#. `Mean-std filtering on all observations `__.
#. `Flatten any complex observation space to a 1D space `__.

---

.. include:: /_includes/rllib/we_are_hiring.rst

.. _algorithm-config-reference-docs:

Algorithm Configuration API
===========================

.. include:: /_includes/rllib/new_api_stack.rst

.. currentmodule:: ray.rllib.algorithms.algorithm_config

Constructor
-----------

.. autosummary::
    :nosignatures:
    :toctree: doc/

    ~AlgorithmConfig

Builder methods
---------------

..
autosummary:: :nosignatures: :toctree: doc/ ~AlgorithmConfig.build_algo ~AlgorithmConfig.build_learner_group ~AlgorithmConfig.build_learner Properties ---------- .. autosummary:: :nosignatures: :toctree: doc/ ~AlgorithmConfig.is_multi_agent ~AlgorithmConfig.is_offline ~AlgorithmConfig.learner_class ~AlgorithmConfig.model_config ~AlgorithmConfig.rl_module_spec ~AlgorithmConfig.total_train_batch_size Getter methods -------------- .. autosummary:: :nosignatures: :toctree: doc/ ~AlgorithmConfig.get_default_learner_class ~AlgorithmConfig.get_default_rl_module_spec ~AlgorithmConfig.get_evaluation_config_object ~AlgorithmConfig.get_multi_rl_module_spec ~AlgorithmConfig.get_multi_agent_setup ~AlgorithmConfig.get_rollout_fragment_length Public methods -------------- .. autosummary:: :nosignatures: :toctree: doc/ ~AlgorithmConfig.copy ~AlgorithmConfig.validate ~AlgorithmConfig.freeze .. _rllib-algorithm-config-methods: Configuration methods --------------------- .. _rllib-config-env: Configuring the RL Environment ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. automethod:: ray.rllib.algorithms.algorithm_config.AlgorithmConfig.environment :noindex: .. _rllib-config-training: Configuring training behavior ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. automethod:: ray.rllib.algorithms.algorithm_config.AlgorithmConfig.training :noindex: .. _rllib-config-env-runners: Configuring `EnvRunnerGroup` and `EnvRunner` actors ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. automethod:: ray.rllib.algorithms.algorithm_config.AlgorithmConfig.env_runners :noindex: .. _rllib-config-learners: Configuring `LearnerGroup` and `Learner` actors ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. automethod:: ray.rllib.algorithms.algorithm_config.AlgorithmConfig.learners :noindex: .. _rllib-config-callbacks: Configuring custom callbacks ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. automethod:: ray.rllib.algorithms.algorithm_config.AlgorithmConfig.callbacks :noindex: .. _rllib-config-multi_agent: Configuring multi-agent specific settings ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. automethod:: ray.rllib.algorithms.algorithm_config.AlgorithmConfig.multi_agent :noindex: .. _rllib-config-offline_data: Configuring offline RL specific settings ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. automethod:: ray.rllib.algorithms.algorithm_config.AlgorithmConfig.offline_data :noindex: .. _rllib-config-evaluation: Configuring evaluation settings ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. automethod:: ray.rllib.algorithms.algorithm_config.AlgorithmConfig.evaluation :noindex: .. _rllib-config-framework: Configuring deep learning framework settings ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. automethod:: ray.rllib.algorithms.algorithm_config.AlgorithmConfig.framework :noindex: .. _rllib-config-reporting: Configuring reporting settings ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. automethod:: ray.rllib.algorithms.algorithm_config.AlgorithmConfig.reporting :noindex: .. _rllib-config-checkpointing: Configuring checkpointing settings ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. automethod:: ray.rllib.algorithms.algorithm_config.AlgorithmConfig.checkpointing :noindex: .. _rllib-config-debugging: Configuring debugging settings ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. automethod:: ray.rllib.algorithms.algorithm_config.AlgorithmConfig.debugging :noindex: .. _rllib-config-experimental: Configuring experimental settings ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. automethod:: ray.rllib.algorithms.algorithm_config.AlgorithmConfig.experimental :noindex: --- .. include:: /_includes/rllib/we_are_hiring.rst .. 
_algorithm-reference-docs:

Algorithms
==========

.. include:: /_includes/rllib/new_api_stack.rst

The :py:class:`~ray.rllib.algorithms.algorithm.Algorithm` class is the highest-level API in RLlib, responsible for the **WHEN** and **WHAT** of RL algorithms: for example, **WHEN** the algorithm should sample, **WHEN** it should perform a neural network update, and so on. RLlib delegates the **HOW** to components such as ``RolloutWorker``. The Algorithm is the main entry point for RLlib users to interact with RLlib's algorithms. It allows you to train and evaluate policies, save an experiment's progress, and restore from a prior saved experiment when continuing an RL run. :py:class:`~ray.rllib.algorithms.algorithm.Algorithm` is a sub-class of :py:class:`~ray.tune.trainable.Trainable` and thus fully supports distributed hyperparameter tuning for RL.

.. https://docs.google.com/drawings/d/1J0nfBMZ8cBff34e-nSPJZMM1jKOuUL11zFJm6CmWtJU/edit

.. figure:: ../images/trainer_class_overview.svg
    :align: left

    **A typical RLlib Algorithm object:** An Algorithm normally consists of N ``RolloutWorkers``, orchestrated through an :py:class:`~ray.rllib.env.env_runner_group.EnvRunnerGroup` object. Each worker owns its own set of ``Policy`` objects and their NN models, plus a :py:class:`~ray.rllib.env.base_env.BaseEnv` instance.

Building Custom Algorithm Classes
---------------------------------

.. warning::

    As of Ray >= 1.9, it is no longer recommended to use the `build_trainer()` utility function for creating custom Algorithm sub-classes. Instead, follow the simple guidelines here for directly sub-classing from :py:class:`~ray.rllib.algorithms.algorithm.Algorithm`.

In order to create a custom Algorithm, sub-class the :py:class:`~ray.rllib.algorithms.algorithm.Algorithm` class and override one or more of its methods. Those are in particular:

* :py:meth:`~ray.rllib.algorithms.algorithm.Algorithm.setup`
* :py:meth:`~ray.rllib.algorithms.algorithm.Algorithm.get_default_config`
* :py:meth:`~ray.rllib.algorithms.algorithm.Algorithm.get_default_policy_class`
* :py:meth:`~ray.rllib.algorithms.algorithm.Algorithm.training_step`

`See here for an example on how to override Algorithm `_.

.. _rllib-algorithm-api:

Algorithm API
-------------

.. currentmodule:: ray.rllib.algorithms.algorithm

Construction and setup
~~~~~~~~~~~~~~~~~~~~~~

.. autosummary::
    :nosignatures:
    :toctree: doc/

    ~Algorithm
    ~Algorithm.setup
    ~Algorithm.get_default_config
    ~Algorithm.env_runner
    ~Algorithm.eval_env_runner

Training
~~~~~~~~

.. autosummary::
    :nosignatures:
    :toctree: doc/

    ~Algorithm.train
    ~Algorithm.training_step

Saving and restoring
~~~~~~~~~~~~~~~~~~~~

.. autosummary::
    :nosignatures:
    :toctree: doc/

    ~Algorithm.save_to_path
    ~Algorithm.restore_from_path
    ~Algorithm.from_checkpoint
    ~Algorithm.get_state
    ~Algorithm.set_state

Evaluation
~~~~~~~~~~

.. autosummary::
    :nosignatures:
    :toctree: doc/

    ~Algorithm.evaluate

Multi Agent
~~~~~~~~~~~

.. autosummary::
    :nosignatures:
    :toctree: doc/

    ~Algorithm.get_module
    ~Algorithm.add_policy
    ~Algorithm.remove_policy

---

.. include:: /_includes/rllib/we_are_hiring.rst

.. _rllib-callback-reference-docs:

Callback APIs
=============

.. include:: /_includes/rllib/new_api_stack.rst

Callback APIs enable you to inject code into an experiment, an Algorithm, and the subcomponents of an Algorithm.
You can either subclass :py:class:`~ray.rllib.callbacks.callbacks.RLlibCallback` and implement one or more of its methods, like :py:meth:`~ray.rllib.callbacks.callbacks.RLlibCallback.on_algorithm_init`, or pass respective arguments to the :py:meth:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig.callbacks` method of an Algorithm's config, like ``config.callbacks(on_algorithm_init=lambda algorithm, **kw: print('algo initialized!'))``. .. tab-set:: .. tab-item:: Subclass RLlibCallback .. testcode:: from ray.rllib.algorithms.dqn import DQNConfig from ray.rllib.callbacks.callbacks import RLlibCallback class MyCallback(RLlibCallback): def on_algorithm_init(self, *, algorithm, metrics_logger, **kwargs): print(f"Algorithm {algorithm} has been initialized!") config = ( DQNConfig() .callbacks(MyCallback) ) .. testcode:: :hide: config.validate() .. tab-item:: Pass individual callables to ``config.callbacks()`` .. testcode:: from ray.rllib.algorithms.dqn import DQNConfig config = ( DQNConfig() .callbacks( on_algorithm_init=( lambda algorithm, **kwargs: print(f"Algorithm {algorithm} has been initialized!") ) ) ) .. testcode:: :hide: config.validate() See :ref:`Callbacks ` for more details on how to write and configure callbacks. Methods to implement for custom behavior ---------------------------------------- .. note:: RLlib only invokes callbacks in :py:class:`~ray.rllib.algorithms.algorithm.Algorithm` and :py:class:`~ray.rllib.env.env_runner.EnvRunner` actors. The Ray team is considering expanding callbacks onto :py:class:`~ray.rllib.core.learner.learner.Learner` actors and possibly :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` instances as well. .. currentmodule:: ray.rllib.callbacks.callbacks RLlibCallback ------------- .. autosummary:: :nosignatures: :toctree: doc/ ~RLlibCallback .. _rllib-callback-reference-algorithm-bound: Callbacks invoked in Algorithm ------------------------------ The main Algorithm process always executes the following callback methods: .. autosummary:: :nosignatures: :toctree: doc/ ~RLlibCallback.on_algorithm_init ~RLlibCallback.on_sample_end ~RLlibCallback.on_train_result ~RLlibCallback.on_evaluate_start ~RLlibCallback.on_evaluate_end ~RLlibCallback.on_env_runners_recreated ~RLlibCallback.on_checkpoint_loaded .. _rllib-callback-reference-env-runner-bound: Callbacks invoked in EnvRunner ------------------------------ The EnvRunner actors always execute the following callback methods: .. autosummary:: :nosignatures: :toctree: doc/ ~RLlibCallback.on_environment_created ~RLlibCallback.on_episode_created ~RLlibCallback.on_episode_start ~RLlibCallback.on_episode_step ~RLlibCallback.on_episode_end --- .. include:: /_includes/rllib/we_are_hiring.rst .. _connector-v2-reference-docs: ConnectorV2 API =============== .. include:: /_includes/rllib/new_api_stack.rst .. currentmodule:: ray.rllib.connectors.connector_v2 rllib.connectors.connector_v2.ConnectorV2 ----------------------------------------- .. autoclass:: ray.rllib.connectors.connector_v2.ConnectorV2 :special-members: __call__ :members: rllib.connectors.connector_pipeline_v2.ConnectorPipelineV2 ---------------------------------------------------------- .. autoclass:: ray.rllib.connectors.connector_pipeline_v2.ConnectorPipelineV2 :members: Observation preprocessors ========================= .. 
currentmodule:: ray.rllib.connectors.env_to_module.observation_preprocessor rllib.connectors.env_to_module.observation_preprocessor.SingleAgentObservationPreprocessor ------------------------------------------------------------------------------------------ .. autoclass:: ray.rllib.connectors.env_to_module.observation_preprocessor.SingleAgentObservationPreprocessor .. automethod:: recompute_output_observation_space .. automethod:: preprocess rllib.connectors.env_to_module.observation_preprocessor.MultiAgentObservationPreprocessor ----------------------------------------------------------------------------------------- .. autoclass:: ray.rllib.connectors.env_to_module.observation_preprocessor.MultiAgentObservationPreprocessor .. automethod:: recompute_output_observation_space .. automethod:: preprocess --- .. include:: /_includes/rllib/we_are_hiring.rst .. _rllib-distributions-reference-docs: Distribution API ================ .. include:: /_includes/rllib/new_api_stack.rst .. currentmodule:: ray.rllib.models.distributions Base Distribution class ----------------------- .. autosummary:: :nosignatures: :toctree: doc/ ~Distribution ~Distribution.from_logits ~Distribution.sample ~Distribution.rsample ~Distribution.logp ~Distribution.kl --- .. include:: /_includes/rllib/we_are_hiring.rst .. _env-reference-docs: Environments ============ .. include:: /_includes/rllib/new_api_stack.rst RLlib mainly supports the `Farama gymnasium API `__ for single-agent environments, and RLlib's own :py:class:`~ray.rllib.env.multi_agent_env.MultiAgentEnv` API for multi-agent setups. Env Vectorization ----------------- For single-agent setups, RLlib automatically vectorizes your provided `gymnasium.Env `__ using gymnasium's own `vectorization feature `__. Use the `config.env_runners(num_envs_per_env_runner=..)` setting to vectorize your env beyond 1 env copy. External Envs ------------- .. note:: External Env support is under development on the new API stack. The recommended way to implement your own external env connection logic, for example through TCP or shared memory, is to write your own :py:class:`~ray.rllib.env.env_runner.EnvRunner` subclass. See this an end-to-end example of an `external CartPole (client) env `__ connecting to RLlib through a custom, TCP-capable :py:class:`~ray.rllib.env.env_runner.EnvRunner` server. Environment API Reference ------------------------- .. toctree:: :maxdepth: 1 env/env_runner.rst env/single_agent_env_runner.rst env/single_agent_episode.rst env/multi_agent_env.rst env/multi_agent_env_runner.rst env/multi_agent_episode.rst env/external.rst env/utils.rst --- .. include:: /_includes/rllib/we_are_hiring.rst .. _rllib-reference-docs: Ray RLlib API ============= .. include:: /_includes/rllib/new_api_stack.rst .. tip:: We'd love to hear your feedback on using RLlib - `sign up to our forum and start asking questions `_! This section contains an overview of RLlib's package- and API reference. If you think there is anything missing, please open an issue on `GitHub`_. .. _`GitHub`: https://github.com/ray-project/ray/issues .. toctree:: :maxdepth: 2 algorithm-config.rst algorithm.rst callback.rst env.rst rl_modules.rst distributions.rst learner.rst offline.rst connector-v2.rst replay-buffers.rst utils.rst --- .. include:: /_includes/rllib/we_are_hiring.rst .. _learner-reference-docs: LearnerGroup API ================ .. include:: /_includes/rllib/new_api_stack.rst Configuring a LearnerGroup and Learner actors --------------------------------------------- .. 
currentmodule:: ray.rllib.algorithms.algorithm_config .. autosummary:: :nosignatures: :toctree: doc/ AlgorithmConfig.learners Constructing a LearnerGroup --------------------------- .. autosummary:: :nosignatures: :toctree: doc/ AlgorithmConfig.build_learner_group .. currentmodule:: ray.rllib.core.learner.learner_group .. autosummary:: :nosignatures: :toctree: doc/ LearnerGroup Learner API =========== Constructing a Learner ---------------------- .. currentmodule:: ray.rllib.algorithms.algorithm_config .. autosummary:: :nosignatures: :toctree: doc/ AlgorithmConfig.build_learner .. currentmodule:: ray.rllib.core.learner.learner .. autosummary:: :nosignatures: :toctree: doc/ Learner Learner.build Learner._check_is_built Learner._make_module Implementing a custom RLModule to fit a Learner ---------------------------------------------------- .. autosummary:: :nosignatures: :toctree: doc/ Learner.rl_module_required_apis Learner.rl_module_is_compatible Performing updates ------------------ .. autosummary:: :nosignatures: :toctree: doc/ Learner.update Learner.before_gradient_based_update Learner.after_gradient_based_update Computing losses ---------------- .. autosummary:: :nosignatures: :toctree: doc/ Learner.compute_losses Learner.compute_loss_for_module Configuring optimizers ---------------------- .. autosummary:: :nosignatures: :toctree: doc/ Learner.configure_optimizers_for_module Learner.configure_optimizers Learner.register_optimizer Learner.get_optimizers_for_module Learner.get_optimizer Learner.get_parameters Learner.get_param_ref Learner.filter_param_dict_for_optimizer Learner._check_registered_optimizer Learner._set_optimizer_lr Gradient computation -------------------- .. autosummary:: :nosignatures: :toctree: doc/ Learner.compute_gradients Learner.postprocess_gradients Learner.postprocess_gradients_for_module Learner.apply_gradients Learner._get_clip_function Saving and restoring -------------------- .. autosummary:: :nosignatures: :toctree: doc/ Learner.save_to_path Learner.restore_from_path Learner.from_checkpoint Learner.get_state Learner.set_state Adding and removing modules --------------------------- .. autosummary:: :nosignatures: :toctree: doc/ Learner.add_module Learner.remove_module --- .. include:: /_includes/rllib/we_are_hiring.rst .. _new-api-offline-reference-docs: Offline RL API ============== .. include:: /_includes/rllib/new_api_stack.rst Configuring Offline RL ---------------------- .. currentmodule:: ray.rllib.algorithms.algorithm_config .. autosummary:: :nosignatures: :toctree: doc/ AlgorithmConfig.offline_data AlgorithmConfig.learners Configuring Offline Recording EnvRunners ---------------------------------------- .. autosummary:: :nosignatures: :toctree: doc/ AlgorithmConfig.env_runners Constructing a Recording EnvRunner ---------------------------------- .. currentmodule:: ray.rllib.offline.offline_env_runner .. autosummary:: :nosignatures: :toctree: doc/ OfflineSingleAgentEnvRunner Constructing OfflineData ------------------------ .. currentmodule:: ray.rllib.offline.offline_data .. autosummary:: :nosignatures: :toctree: doc/ OfflineData OfflineData.__init__ Sampling from Offline Data -------------------------- .. autosummary:: :nosignatures: :toctree: doc/ OfflineData.sample OfflineData.default_map_batches_kwargs OfflineData.default_iter_batches_kwargs Constructing an OfflinePreLearner --------------------------------- .. currentmodule:: ray.rllib.offline.offline_prelearner .. 
autosummary:: :nosignatures: :toctree: doc/ OfflinePreLearner OfflinePreLearner.__init__ Transforming Data with an OfflinePreLearner ------------------------------------------- .. autosummary:: :nosignatures: :toctree: doc/ SCHEMA OfflinePreLearner.__call__ OfflinePreLearner._map_to_episodes OfflinePreLearner._map_sample_batch_to_episode OfflinePreLearner._should_module_be_updated OfflinePreLearner.default_prelearner_buffer_class OfflinePreLearner.default_prelearner_buffer_kwargs --- .. include:: /_includes/rllib/we_are_hiring.rst .. _replay-buffer-api-reference-docs: Replay Buffer API ================= .. include:: /_includes/rllib/new_api_stack.rst The following classes don't take into account the separation of experiences from different policies, multi-agent replay buffers will be explained further below. Replay Buffer Base Classes -------------------------- .. currentmodule:: ray.rllib.utils.replay_buffers .. autosummary:: :nosignatures: :toctree: doc/ ~replay_buffer.StorageUnit ~replay_buffer.ReplayBuffer ~prioritized_replay_buffer.PrioritizedReplayBuffer ~reservoir_replay_buffer.ReservoirReplayBuffer Public Methods -------------- .. currentmodule:: ray.rllib.utils.replay_buffers.replay_buffer .. autosummary:: :nosignatures: :toctree: doc/ ~ReplayBuffer.sample ~ReplayBuffer.add ~ReplayBuffer.get_state ~ReplayBuffer.set_state Multi Agent Buffers ------------------- The following classes use the above, "single-agent", buffers as underlying buffers to facilitate splitting up experiences between the different agents' policies. In multi-agent RL, more than one agent exists in the environment and not all of these agents may utilize the same policy (mapping M agents to N policies, where M <= N). This leads to the need for MultiAgentReplayBuffers that store the experiences of different policies separately. .. currentmodule:: ray.rllib.utils.replay_buffers .. autosummary:: :nosignatures: :toctree: doc/ ~multi_agent_replay_buffer.MultiAgentReplayBuffer ~multi_agent_prioritized_replay_buffer.MultiAgentPrioritizedReplayBuffer Utility Methods --------------- .. autosummary:: :nosignatures: :toctree: doc/ ~utils.update_priorities_in_replay_buffer ~utils.sample_min_n_steps_from_buffer --- .. include:: /_includes/rllib/we_are_hiring.rst .. _rlmodule-reference-docs: RLModule APIs ============= .. include:: /_includes/rllib/new_api_stack.rst RLModule specifications and configurations ------------------------------------------- Single RLModuleSpec +++++++++++++++++++ .. currentmodule:: ray.rllib.core.rl_module.rl_module .. autosummary:: :nosignatures: :toctree: doc/ RLModuleSpec RLModuleSpec.build RLModuleSpec.module_class RLModuleSpec.observation_space RLModuleSpec.action_space RLModuleSpec.inference_only RLModuleSpec.learner_only RLModuleSpec.model_config MultiRLModuleSpec +++++++++++++++++ .. currentmodule:: ray.rllib.core.rl_module.multi_rl_module .. autosummary:: :nosignatures: :toctree: doc/ MultiRLModuleSpec MultiRLModuleSpec.build .. autoattribute:: ray.rllib.core.rl_module.multi_rl_module.MultiRLModuleSpec.multi_rl_module_class :no-index: .. autoattribute:: ray.rllib.core.rl_module.multi_rl_module.MultiRLModuleSpec.observation_space :no-index: .. autoattribute:: ray.rllib.core.rl_module.multi_rl_module.MultiRLModuleSpec.action_space :no-index: .. autoattribute:: ray.rllib.core.rl_module.multi_rl_module.MultiRLModuleSpec.inference_only :no-index: .. autoattribute:: ray.rllib.core.rl_module.multi_rl_module.MultiRLModuleSpec.model_config :no-index: .. 
autoattribute:: ray.rllib.core.rl_module.multi_rl_module.MultiRLModuleSpec.rl_module_specs :no-index: DefaultModelConfig ++++++++++++++++++ .. currentmodule:: ray.rllib.core.rl_module.default_model_config .. autosummary:: :nosignatures: :toctree: doc/ DefaultModelConfig RLModule API ------------ .. currentmodule:: ray.rllib.core.rl_module.rl_module Construction and setup ++++++++++++++++++++++ .. autosummary:: :nosignatures: :toctree: doc/ RLModule RLModule.observation_space RLModule.action_space RLModule.inference_only RLModule.model_config RLModule.setup RLModule.as_multi_rl_module Forward methods +++++++++++++++ Use the following three forward methods when you use RLModule from inside other classes and components. However, do NOT override them and leave them as-is in your custom subclasses. For defining your own forward behavior, override the private methods ``_forward`` (generic forward behavior for all phases) or, for more granularity, use ``_forward_exploration``, ``_forward_inference``, and ``_forward_train``. .. autosummary:: :nosignatures: :toctree: doc/ ~RLModule.forward_exploration ~RLModule.forward_inference ~RLModule.forward_train Override these private methods to define your custom model's forward behavior. - ``_forward``: generic forward behavior for all phases - ``_forward_exploration``: for training sample collection - ``_forward_inference``: for production deployments, greedy acting - `_forward_train``: for computing loss function inputs .. autosummary:: :nosignatures: :toctree: doc/ ~RLModule._forward ~RLModule._forward_exploration ~RLModule._forward_inference ~RLModule._forward_train Saving and restoring ++++++++++++++++++++ .. autosummary:: :nosignatures: :toctree: doc/ ~RLModule.save_to_path ~RLModule.restore_from_path ~RLModule.from_checkpoint ~RLModule.get_state ~RLModule.set_state MultiRLModule API ----------------- .. currentmodule:: ray.rllib.core.rl_module.multi_rl_module Constructor +++++++++++ .. autosummary:: :nosignatures: :toctree: doc/ MultiRLModule MultiRLModule.setup MultiRLModule.as_multi_rl_module Modifying the underlying RLModules ++++++++++++++++++++++++++++++++++ .. autosummary:: :nosignatures: :toctree: doc/ ~MultiRLModule.add_module ~MultiRLModule.remove_module Saving and restoring ++++++++++++++++++++ .. autosummary:: :nosignatures: :toctree: doc/ ~MultiRLModule.save_to_path ~MultiRLModule.restore_from_path ~MultiRLModule.from_checkpoint ~MultiRLModule.get_state ~MultiRLModule.set_state Additional RLModule APIs ------------------------ .. currentmodule:: ray.rllib.core.rl_module.apis InferenceOnlyAPI ++++++++++++++++ .. autoclass:: ray.rllib.core.rl_module.apis.inference_only_api.InferenceOnlyAPI .. automethod:: get_non_inference_attributes QNetAPI +++++++ .. autoclass:: ray.rllib.core.rl_module.apis.q_net_api.QNetAPI .. automethod:: compute_q_values .. automethod:: compute_advantage_distribution SelfSupervisedLossAPI +++++++++++++++++++++ .. autoclass:: ray.rllib.core.rl_module.apis.self_supervised_loss_api.SelfSupervisedLossAPI .. automethod:: compute_self_supervised_loss TargetNetworkAPI ++++++++++++++++ .. autoclass:: ray.rllib.core.rl_module.apis.target_network_api.TargetNetworkAPI .. automethod:: make_target_networks .. automethod:: get_target_network_pairs .. automethod:: forward_target ValueFunctionAPI ++++++++++++++++ .. autoclass:: ray.rllib.core.rl_module.apis.value_function_api.ValueFunctionAPI .. automethod:: compute_values --- .. include:: /_includes/rllib/we_are_hiring.rst .. 
_utils-reference-docs:

RLlib Utilities
===============

.. include:: /_includes/rllib/new_api_stack.rst

Here is a list of all the utilities available in RLlib.

MetricsLogger API
-----------------

RLlib uses the MetricsLogger API to log stats and metrics for its various components. Users can also use this API to log their own custom values. For example:

.. testcode::

    from ray.rllib.utils.metrics.metrics_logger import MetricsLogger

    logger = MetricsLogger()

    # Log a scalar float value under the `loss` key. By default, all logged
    # values under that key are averaged, once `reduce()` is called.
    logger.log_value("loss", 0.05, reduce="mean", window=2)
    logger.log_value("loss", 0.1)
    logger.log_value("loss", 0.2)

    logger.peek("loss")  # expect: 0.15 (mean of last 2 values: 0.1 and 0.2)

.. currentmodule:: ray.rllib.utils.metrics.metrics_logger

.. autosummary::
    :nosignatures:
    :toctree: doc/

    MetricsLogger
    MetricsLogger.peek
    MetricsLogger.log_value
    MetricsLogger.log_dict
    MetricsLogger.aggregate
    MetricsLogger.log_time

Scheduler API
-------------

RLlib uses the Scheduler API to set scheduled values for variables, in Python or PyTorch, dependent on an int timestep input. The type of the schedule is always a ``PiecewiseSchedule``, which defines a list of increasing timesteps, starting at 0, associated with values to be reached at these particular timesteps. ``PiecewiseSchedule`` interpolates values for all intermediate timesteps. The computed values are usually float32 types. For example:

.. testcode::

    from ray.rllib.utils.schedules.scheduler import Scheduler

    scheduler = Scheduler([[0, 0.1], [50, 0.05], [60, 0.001]])
    print(scheduler.get_current_value())  # <- expect 0.1

    # Up the timestep.
    scheduler.update(timestep=45)
    print(scheduler.get_current_value())  # <- expect 0.055

    # Up the timestep.
    scheduler.update(timestep=100)
    print(scheduler.get_current_value())  # <- expect 0.001 (keep final value)

.. currentmodule:: ray.rllib.utils.schedules.scheduler

.. autosummary::
    :nosignatures:
    :toctree: doc/

    Scheduler
    Scheduler.validate
    Scheduler.get_current_value
    Scheduler.update
    Scheduler._create_tensor_variable

Framework Utilities
-------------------

Import utilities
~~~~~~~~~~~~~~~~

.. currentmodule:: ray.rllib.utils.framework

.. autosummary::
    :nosignatures:
    :toctree: doc/

    ~try_import_torch

Torch utilities
~~~~~~~~~~~~~~~

.. currentmodule:: ray.rllib.utils.torch_utils

.. autosummary::
    :nosignatures:
    :toctree: doc/

    ~clip_gradients
    ~compute_global_norm
    ~convert_to_torch_tensor
    ~explained_variance
    ~flatten_inputs_to_1d_tensor
    ~global_norm
    ~one_hot
    ~reduce_mean_ignore_inf
    ~sequence_mask
    ~set_torch_seed
    ~softmax_cross_entropy_with_logits
    ~update_target_network

Numpy utilities
~~~~~~~~~~~~~~~

.. currentmodule:: ray.rllib.utils.numpy

.. autosummary::
    :nosignatures:
    :toctree: doc/

    ~aligned_array
    ~concat_aligned
    ~convert_to_numpy
    ~fc
    ~flatten_inputs_to_1d_tensor
    ~make_action_immutable
    ~huber_loss
    ~l2_loss
    ~lstm
    ~one_hot
    ~relu
    ~sigmoid
    ~softmax

Checkpoint utilities
--------------------

.. currentmodule:: ray.rllib.utils.checkpoints

.. autosummary::
    :nosignatures:
    :toctree: doc/

    try_import_msgpack
    Checkpointable

---

.. include:: /_includes/rllib/we_are_hiring.rst

.. include:: /_includes/rllib/new_api_stack.rst

.. _rlmodule-guide:

RL Modules
==========

The :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` class in RLlib's new API stack allows you to write custom models, including highly complex multi-network setups often found in multi-agent or model-based algorithms.
:py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` is the main neural network class and exposes three public methods, each corresponding to a distinct phase in the reinforcement learning cycle: - :py:meth:`~ray.rllib.core.rl_module.rl_module.RLModule.forward_exploration` handles the computation of actions during data collection if RLlib uses the data for a succeeding training step, balancing exploration and exploitation. - :py:meth:`~ray.rllib.core.rl_module.rl_module.RLModule.forward_inference` computes actions for evaluation and production, which often need to be greedy or less stochastic. - :py:meth:`~ray.rllib.core.rl_module.rl_module.RLModule.forward_train` manages the training phase, performing calculations required to compute losses, such as Q-values in a DQN model, value function predictions in a PG-style setup, or world-model predictions in model-based algorithms. .. figure:: images/rl_modules/rl_module_overview.svg :width: 700 :align: left **RLModule overview**: (*left*) A plain :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` contains the neural network RLlib uses for computations, for example, a policy network written in `PyTorch `__, and exposes the three forward methods: :py:meth:`~ray.rllib.core.rl_module.rl_module.RLModule.forward_exploration` for sample collection, :py:meth:`~ray.rllib.core.rl_module.rl_module.RLModule.forward_inference` for production/deployment, and :py:meth:`~ray.rllib.core.rl_module.rl_module.RLModule.forward_train` for computing loss function inputs when training. (*right*) A :py:class:`~ray.rllib.core.rl_module.multi_rl_module.MultiRLModule` may contain one or more sub-RLModules, each identified by a `ModuleID`, allowing you to implement arbitrarily complex multi-network or multi-agent architectures and algorithms. Enabling the RLModule API in the AlgorithmConfig ------------------------------------------------ In the new API stack, activated by default, RLlib exclusively uses RLModules. If you're working with a legacy config or want to migrate ``ModelV2`` or ``Policy`` classes to the new API stack, see the :ref:`new API stack migration guide ` for more information. If you configured the :py:class:`~ray.rllib.algorithms.algorithm.Algorithm` to the old API stack, use the :py:meth:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig.api_stack` method to switch: .. testcode:: from ray.rllib.algorithms.algorithm_config import AlgorithmConfig config = ( AlgorithmConfig() .api_stack( enable_rl_module_and_learner=True, enable_env_runner_and_connector_v2=True, ) ) .. _rllib-default-rl-modules-docs: Default RLModules ----------------- If you don't specify module-related settings in the :py:class:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig`, RLlib uses the respective algorithm's default RLModule, which is an appropriate choice for initial experimentation and benchmarking. All of the default RLModules support 1D-tensor and image observations (``[width] x [height] x [channels]``). .. note:: For discrete or more complex input observation spaces like dictionaries, use the :py:class:`~ray.rllib.connectors.env_to_module.flatten_observations.FlattenObservations` connector piece as follows: .. testcode:: from ray.rllib.algorithms.ppo import PPOConfig from ray.rllib.connectors.env_to_module import FlattenObservations config = ( PPOConfig() # FrozenLake has a discrete observation space (ints). .environment("FrozenLake-v1") # `FlattenObservations` converts int observations to one-hot. 
.env_runners(env_to_module_connector=lambda env: FlattenObservations()) ) .. TODO (sven): Link here to the connector V2 page and preprocessors once that page is done. Furthermore, all default models offer configurable architecture choices with respect to the number and size of the layers used (``Dense`` or ``Conv2D``), their activations and initializations, and automatic LSTM-wrapping behavior. Use the :py:class:`~ray.rllib.core.rl_module.default_model_config.DefaultModelConfig` datadict class to configure any default model in RLlib. Note that you should only use this class for configuring default models. When writing your own custom RLModules, use plain python dicts to define the model configurations. For how to write and configure your custom RLModules, see :ref:`Implementing custom RLModules `. Configuring default MLP nets ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ To train a simple multi layer perceptron (MLP) policy, which only contains dense layers, with PPO and the default RLModule, configure your experiment as follows: .. testcode:: from ray.rllib.algorithms.ppo import PPOConfig from ray.rllib.core.rl_module.default_model_config import DefaultModelConfig config = ( PPOConfig() .environment("CartPole-v1") .rl_module( # Use a non-default 32,32-stack with ReLU activations. model_config=DefaultModelConfig( fcnet_hiddens=[32, 32], fcnet_activation="relu", ) ) ) .. testcode:: :hide: test = config.build() test.train() test.stop() The following is the compete list of all supported ``fcnet_..`` options: .. literalinclude:: ../../../rllib/core/rl_module/default_model_config.py :language: python :start-after: __sphinx_doc_default_model_config_fcnet_begin__ :end-before: __sphinx_doc_default_model_config_fcnet_end__ Configuring default CNN nets ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ For image-based environments like `Atari `__, use the ``conv_..`` fields in :py:class:`~ray.rllib.core.rl_module.default_model_config.DefaultModelConfig` to configure the convolutional neural network (CNN) stack. You may have to check whether your CNN configuration works with the incoming observation image dimensions. For example, for an `Atari `__ environment, you can use RLlib's Atari wrapper utility, which performs resizing (default 64x64) and gray scaling (default True), frame stacking (default None), frame skipping (default 4), normalization (from uint8 to float32), and applies up to 30 "noop" actions after a reset, which aren't part of the episode: .. testcode:: import gymnasium as gym # `pip install gymnasium[atari,accept-rom-license]` from ray.rllib.algorithms.ppo import PPOConfig from ray.rllib.env.wrappers.atari_wrappers import wrap_atari_for_new_api_stack from ray.rllib.core.rl_module.default_model_config import DefaultModelConfig from ray.tune import register_env register_env( "image_env", lambda _: wrap_atari_for_new_api_stack( gym.make("ale_py:ALE/Pong-v5"), dim=64, # resize original observation to 64x64x3 framestack=4, ) ) config = ( PPOConfig() .environment("image_env") .rl_module( model_config=DefaultModelConfig( # Use a DreamerV3-style CNN stack for 64x64 images. conv_filters=[ [16, 4, 2], # 1st CNN layer: num_filters, kernel, stride(, padding)? [32, 4, 2], # 2nd CNN layer [64, 4, 2], # etc.. [128, 4, 2], ], conv_activation="silu", # After the last CNN, the default model flattens, then adds an optional MLP. head_fcnet_hiddens=[256], ) ) ) .. testcode:: :hide: test = config.build() test.train() test.stop() The following is the compete list of all supported ``conv_..`` options: .. 
literalinclude:: ../../../rllib/core/rl_module/default_model_config.py :language: python :start-after: __sphinx_doc_default_model_config_conv_begin__ :end-before: __sphinx_doc_default_model_config_conv_end__ Other default model settings ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ For LSTM-based configurations and specific settings for continuous action output layers, see :py:class:`~ray.rllib.core.rl_module.default_model_config.DefaultModelConfig`. .. note:: To auto-wrap your default encoder with an extra LSTM layer and allow your model to learn in non-Markovian, partially observable environments, you can try the convenience ``DefaultModelConfig.use_lstm`` setting in combination with the ``DefaultModelConfig.lstm_cell_size`` and ``DefaultModelConfig.max_seq_len`` settings. See here for a tuned `example that uses a default RLModule with an LSTM layer `__. .. TODO: mention attention example once done Constructing RLModule instances ------------------------------- To maintain consistency and usability, RLlib offers a standardized approach for constructing :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` instances for both single-module and multi-module use cases. An example of a single-module use case is a single-agent experiment. Examples of multi-module use cases are multi-agent learning or other multi-NN setups. .. _rllib-constructing-rlmodule-w-class-constructor: Construction through the class constructor ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The most direct way to construct your :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` is through its constructor: .. testcode:: import gymnasium as gym from ray.rllib.algorithms.bc.torch.default_bc_torch_rl_module import DefaultBCTorchRLModule # Create an env object to know the spaces. env = gym.make("CartPole-v1") # Construct the actual RLModule object. rl_module = DefaultBCTorchRLModule( observation_space=env.observation_space, action_space=env.action_space, # A custom dict that's accessible inside your class as `self.model_config`. model_config={"fcnet_hiddens": [64]}, ) .. note:: If you have a checkpoint of an `py:class:`~ray.rllib.algorithms.algorithm.Algorithm` or an individual :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule`, see :ref:`Creating instances with from_checkpoint ` for how to recreate your :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` from disk. Construction through RLModuleSpecs ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Because RLlib is a distributed RL library and needs to create more than one copy of your :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule`, you can use :py:class:`~ray.rllib.core.rl_module.rl_module.RLModuleSpec` objects to define how RLlib should construct each copy during the algorithm's setup process. The algorithm passes the spec to all subcomponents that need to have a copy of your RLModule. Creating an :py:class:`~ray.rllib.core.rl_module.rl_module.RLModuleSpec` is straightforward and analogous to the :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` constructor: .. tab-set:: .. tab-item:: RLModuleSpec (single model) .. testcode:: import gymnasium as gym from ray.rllib.algorithms.bc.torch.default_bc_torch_rl_module import DefaultBCTorchRLModule from ray.rllib.core.rl_module.rl_module import RLModuleSpec # Create an env object to know the spaces. env = gym.make("CartPole-v1") # First construct the spec. 
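    # Note: the spec only describes how to construct the RLModule. The actual
    # neural network isn't created until `spec.build()` runs, either called by
    # you directly or by RLlib internally during the algorithm's setup.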
spec = RLModuleSpec( module_class=DefaultBCTorchRLModule, observation_space=env.observation_space, action_space=env.action_space, # A custom dict that's accessible inside your class as `self.model_config`. model_config={"fcnet_hiddens": [64]}, ) # Then, build the RLModule through the spec's `build()` method. rl_module = spec.build() .. tab-item:: MultiRLModuleSpec (multi model) .. testcode:: import gymnasium as gym from ray.rllib.algorithms.bc.torch.default_bc_torch_rl_module import DefaultBCTorchRLModule from ray.rllib.core.rl_module.rl_module import RLModuleSpec from ray.rllib.core.rl_module.multi_rl_module import MultiRLModuleSpec # First construct the MultiRLModuleSpec. spec = MultiRLModuleSpec( rl_module_specs={ "module_1": RLModuleSpec( module_class=DefaultBCTorchRLModule, # Define the spaces for only this sub-module. observation_space=gym.spaces.Box(low=-1, high=1, shape=(10,)), action_space=gym.spaces.Discrete(2), # A custom dict that's accessible inside your class as # `self.model_config`. model_config={"fcnet_hiddens": [32]}, ), "module_2": RLModuleSpec( module_class=DefaultBCTorchRLModule, # Define the spaces for only this sub-module. observation_space=gym.spaces.Box(low=-1, high=1, shape=(5,)), action_space=gym.spaces.Discrete(2), # A custom dict that's accessible inside your class as # `self.model_config`. model_config={"fcnet_hiddens": [16]}, ), }, ) # Construct the actual MultiRLModule instance with .build(): multi_rl_module = spec.build() You can pass the :py:class:`~ray.rllib.core.rl_module.rl_module.RLModuleSpec` instances to your :py:class:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig` to tell RLlib to use the particular module class and constructor arguments: .. tab-set:: .. tab-item:: Single-Module (like single-agent) .. code-block:: python from ray.rllib.algorithms.ppo import PPOConfig from ray.rllib.core.rl_module.rl_module import RLModuleSpec config = ( PPOConfig() .environment("CartPole-v1") .rl_module( rl_module_spec=RLModuleSpec( module_class=MyRLModuleClass, model_config={"some_key": "some_setting"}, ), ) ) ppo = config.build() print(ppo.get_module()) .. note:: Often when creating an :py:class:`~ray.rllib.core.rl_module.rl_module.RLModuleSpec` , you don't have to define attributes like ``observation_space`` or ``action_space`` because RLlib automatically infers these attributes based on the used environment or other configuration parameters. .. tab-item:: Multi-Agent (shared policy net) .. code-block:: python from ray.rllib.algorithms.ppo import PPOConfig from ray.rllib.core.rl_module.rl_module import RLModuleSpec from ray.rllib.core.rl_module.multi_rl_module import MultiRLModuleSpec from ray.rllib.examples.envs.classes.multi_agent import MultiAgentCartPole config = ( PPOConfig() .environment(MultiAgentCartPole, env_config={"num_agents": 2}) .rl_module( rl_module_spec=MultiRLModuleSpec( # All agents (0 and 1) use the same (single) RLModule. rl_module_specs=RLModuleSpec( module_class=MyRLModuleClass, model_config={"some_key": "some_setting"}, ) ), ) ) ppo = config.build() print(ppo.get_module()) .. tab-item:: Multi-Agent (two or more policy nets) .. 
code-block:: python from ray.rllib.algorithms.ppo import PPOConfig from ray.rllib.core.rl_module.rl_module import RLModuleSpec from ray.rllib.core.rl_module.multi_rl_module import MultiRLModuleSpec from ray.rllib.examples.envs.classes.multi_agent import MultiAgentCartPole config = ( PPOConfig() .environment(MultiAgentCartPole, env_config={"num_agents": 2}) .multi_agent( policies={"p0", "p1"}, # Agent IDs of `MultiAgentCartPole` are 0 and 1, mapping to # "p0" and "p1", respectively. policy_mapping_fn=lambda agent_id, episode, **kw: f"p{agent_id}" ) .rl_module( rl_module_spec=MultiRLModuleSpec( # Agents (0 and 1) use different (single) RLModules. rl_module_specs={ "p0": RLModuleSpec( module_class=MyRLModuleClass, # Small network. model_config={"fcnet_hiddens": [32, 32]}, ), "p1": RLModuleSpec( module_class=MyRLModuleClass, # Large network. model_config={"fcnet_hiddens": [128, 128]}, ), }, ), ) ) ppo = config.build() print(ppo.get_module()) .. _rllib-implementing-custom-rl-modules: Implementing custom RLModules ----------------------------- To implement your own neural network architecture and computation logic, subclass :py:class:`~ray.rllib.core.rl_module.torch_rl_module.TorchRLModule` for any single-agent learning experiment or for independent multi-agent learning. For more advanced multi-agent use cases like ones with shared communication between agents, or any multi-model use cases, subclass the :py:class:`~ray.rllib.core.rl_module.multi_rl_module.MultiRLModule` class, instead. .. note:: An alternative to subclassing :py:class:`~ray.rllib.core.rl_module.torch_rl_module.TorchRLModule` is to directly subclass your Algorithm's default RLModule. For example, to use PPO, you can subclass :py:class:`~ray.rllib.algorithms.ppo.torch.default_ppo_torch_rl_module.DefaultPPOTorchRLModule`. You should carefully study the existing default model in this case to understand how to override the :py:meth:`~ray.rllib.core.rl_module.rl_module.RLModule.setup`, the ``_forward_()`` methods, and possibly some algo-specific API methods. See :ref:`Algorithm-specific RLModule APIs ` for how to determine which APIs your algorithm requires you to implement. .. _rllib-implementing-custom-rl-modules-setup: The setup() method ~~~~~~~~~~~~~~~~~~ You should first implement the :py:meth:`~ray.rllib.core.rl_module.rl_module.RLModule.setup` method, in which you add needed NN subcomponents and assign these to class attributes of your choice. Note that you should call ``super().setup()`` in your implementation. You also have access to the following attributes anywhere in the class, including in :py:meth:`~ray.rllib.core.rl_module.rl_module.RLModule.setup`: #. ``self.observation_space`` #. ``self.action_space`` #. ``self.inference_only`` #. ``self.model_config`` (a dict with any custom config settings) .. testcode:: import torch from ray.rllib.core.rl_module.torch.torch_rl_module import TorchRLModule class MyTorchPolicy(TorchRLModule): def setup(self): # You have access here to the following already set attributes: # self.observation_space # self.action_space # self.inference_only # self.model_config # <- a dict with custom settings # Use the observation space (if a Box) to infer the input dimension. input_dim = self.observation_space.shape[0] # Use the model_config dict to extract the hidden dimension. hidden_dim = self.model_config["fcnet_hiddens"][0] # Use the action space to infer the number of output nodes. 
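# (This assumes a Discrete action space; for a Box action space, you would
# use the space's shape instead.)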
output_dim = self.action_space.n # Build all the layers and subcomponents here that you need for the # RLModule's forward passes. self._pi_head = torch.nn.Sequential( torch.nn.Linear(input_dim, hidden_dim), torch.nn.ReLU(), torch.nn.Linear(hidden_dim, output_dim), ) .. _rllib-implementing-custom-rl-modules-forward: Forward methods ~~~~~~~~~~~~~~~~~~~ To implement the forward computation logic, you can either define a generic forward behavior by overriding the private :py:meth:`~ray.rllib.core.rl_module.rl_module.RLModule._forward` method, which RLlib then uses everywhere in the model's lifecycle, or, if you require more granularity, define the following three private methods: - :py:meth:`~ray.rllib.core.rl_module.rl_module.RLModule._forward_exploration`: Forward pass for computing exploration actions for collecting training data. - :py:meth:`~ray.rllib.core.rl_module.rl_module.RLModule._forward_inference`: Forward pass for action inference, like greedy. - :py:meth:`~ray.rllib.core.rl_module.rl_module.RLModule._forward_train`: Forward pass for computing loss function inputs for a training update. For custom :py:meth:`~ray.rllib.core.rl_module.rl_module.RLModule._forward`, :py:meth:`~ray.rllib.core.rl_module.rl_module.RLModule._forward_inference`, and :py:meth:`~ray.rllib.core.rl_module.rl_module.RLModule._forward_exploration` methods, you must return a dictionary that contains the key ``actions`` and/or the key ``action_dist_inputs``. If you return the ``actions`` key from your forward method: - RLlib uses the provided actions as-is. - In case you also return the ``action_dist_inputs`` key, RLlib creates a :py:class:`~ray.rllib.models.distributions.Distribution` instance from the parameters under that key. In the case of :py:meth:`~ray.rllib.core.rl_module.rl_module.RLModule.forward_exploration`, RLlib also computes action probabilities and log probabilities for the given actions automatically. See :ref:`Custom action distributions ` for more information on custom action distribution classes. If you don't return the ``actions`` key from your forward method: - You must return the ``action_dist_inputs`` key from your :py:meth:`~ray.rllib.core.rl_module.rl_module.RLModule._forward_exploration` and :py:meth:`~ray.rllib.core.rl_module.rl_module.RLModule._forward_inference` methods. - RLlib creates a :py:class:`~ray.rllib.models.distributions.Distribution` instance from the parameters under that key and samples actions from that distribution. See :ref:`here for more information on custom action distribution classes `. - For :py:meth:`~ray.rllib.core.rl_module.rl_module.RLModule._forward_exploration`, RLlib also computes action probability and log probability values from the sampled actions automatically. .. note:: In the case of :py:meth:`~ray.rllib.core.rl_module.rl_module.RLModule._forward_inference`, RLlib always first makes the distribution generated from the returned ``action_dist_inputs`` key deterministic, through the :py:meth:`~ray.rllib.models.distributions.Distribution.to_deterministic` utility, before a possible action sampling step. For example, RLlib reduces the sampling from a Categorical distribution to selecting the ``argmax`` actions from the distribution logits or probabilities. If you return the "actions" key, RLlib skips that sampling step. .. tab-set:: .. tab-item:: Returning "actions" key .. code-block:: python from ray.rllib.core import Columns, TorchRLModule class MyTorchPolicy(TorchRLModule): ... def _forward_inference(self, batch): ... return { Columns.ACTIONS: ...
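# For example, greedy/argmax actions computed by your own logic.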
# RLlib uses these actions as-is } def _forward_exploration(self, batch): ... return { Columns.ACTIONS: ..., # RLlib uses these actions as-is (no sampling step!) Columns.ACTION_DIST_INPUTS: ... # If provided, RLlib uses these dist inputs to compute probs and logp. } .. tab-item:: Not returning "actions" key .. code-block:: python from ray.rllib.core import Columns, TorchRLModule class MyTorchPolicy(TorchRLModule): ... def _forward_inference(self, batch): ... return { # RLlib: # - Generates distribution from ACTION_DIST_INPUTS parameters. # - Converts distribution to a deterministic equivalent. # - Samples from the deterministic distribution. Columns.ACTION_DIST_INPUTS: ... } def _forward_exploration(self, batch): ... return { # RLlib: # - Generates distribution from ACTION_DIST_INPUTS parameters. # - Samples from the stochastic distribution. # - Computes action probs and logs automatically using the sampled # actions and the distribution. Columns.ACTION_DIST_INPUTS: ... } Never override the constructor (``__init__``), however, note that the :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` class's constructor requires the following arguments and also receives these properly when you call a spec's ``build()`` method: - :py:attr:`~ray.rllib.core.rl_module.rl_module.RLModule.observation_space`: The observation space after having passed all connectors; this observation space is the actual input space for the model after all preprocessing steps. - :py:attr:`~ray.rllib.core.rl_module.rl_module.RLModule.action_space`: The action space of the environment. - :py:attr:`~ray.rllib.core.rl_module.rl_module.RLModule.inference_only`: Whether RLlib should build the RLModule in inference-only mode, dropping subcomponents that it only needs for learning. - :py:attr:`~ray.rllib.core.rl_module.rl_module.RLModule.model_config`: The model config, which is either a custom dictionary for custom RLModules or a :py:class:`~ray.rllib.core.rl_module.default_model_config.DefaultModelConfig` dataclass object, which is only for RLlib's default models. Define model hyper-parameters such as number of layers, type of activation, etc. in this object. See :ref:`Construction through the class constructor ` for more details on how to create an RLModule through the constructor. .. _rllib-algo-specific-rl-module-apis-docs: Algorithm-specific RLModule APIs ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The algorithm that you choose to use with your RLModule affects to some extent the structure of the final custom module. Each Algorithm class has a fixed set of APIs that all RLModules trained by that algorithm, need to implement. To find out, what APIs your Algorithms require, do the following: .. testcode:: # Import the config of the algorithm of your choice. from ray.rllib.algorithms.sac import SACConfig # Print out the abstract APIs, you need to subclass from and whose # abstract methods you need to implement, besides the ``setup()`` and ``_forward_..()`` # methods. print( SACConfig() .get_default_learner_class() .rl_module_required_apis() ) .. note:: You didn't implement any APIs in the preceding example module, because you hadn't considered training it with any particular algorithm yet. 
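For example, assuming PPO's Learner reports the :py:class:`~ray.rllib.core.rl_module.apis.value_function_api.ValueFunctionAPI` among its required APIs and that this API asks for a ``compute_values()`` method, a rough sketch (not an official RLlib example) of extending the earlier ``MyTorchPolicy`` could look like this:

.. code-block:: python

    import torch

    from ray.rllib.core import Columns
    from ray.rllib.core.rl_module.apis import ValueFunctionAPI
    from ray.rllib.core.rl_module.torch.torch_rl_module import TorchRLModule


    class MyValueAwareTorchPolicy(TorchRLModule, ValueFunctionAPI):
        def setup(self):
            input_dim = self.observation_space.shape[0]
            hidden_dim = self.model_config["fcnet_hiddens"][0]
            output_dim = self.action_space.n
            # Policy head, as in the preceding setup() example.
            self._pi_head = torch.nn.Sequential(
                torch.nn.Linear(input_dim, hidden_dim),
                torch.nn.ReLU(),
                torch.nn.Linear(hidden_dim, output_dim),
            )
            # Separate value head, added to satisfy the value-function API.
            self._vf_head = torch.nn.Sequential(
                torch.nn.Linear(input_dim, hidden_dim),
                torch.nn.ReLU(),
                torch.nn.Linear(hidden_dim, 1),
            )

        def _forward(self, batch, **kwargs):
            # Generic forward pass: return action distribution parameters.
            return {Columns.ACTION_DIST_INPUTS: self._pi_head(batch[Columns.OBS])}

        def compute_values(self, batch, embeddings=None):
            # Value estimates, with the trailing unit dimension squeezed out.
            return self._vf_head(batch[Columns.OBS]).squeeze(-1)

Double-check the exact class and method names against the output of ``rl_module_required_apis()`` for your algorithm and RLlib version.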
You can find examples of custom :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` classes implementing the :py:class:`~ray.rllib.core.rl_module.apis.value_function_api.ValueFunctionAPI` and thus ready to train with :py:class:`~ray.rllib.algorithms.ppo.PPO` in the `tiny_atari_cnn_rlm example `__ and in the `lstm_containing_rlm example `__. You can mix supervised losses into any RLlib algorithm through the :py:class:`~ray.rllib.core.rl_module.apis.self_supervised_loss_api.SelfSupervisedLossAPI`. Your Learner actors automatically call the implemented :py:meth:`~ray.rllib.core.rl_module.apis.self_supervised_loss_api.SelfSupervisedLossAPI.compute_self_supervised_loss` method to compute the model's own loss, passing it the outputs of the :py:meth:`~ray.rllib.core.rl_module.rl_module.RLModule.forward_train` call. See here for an `example script utilizing a self-supervised loss RLModule `__. You can define losses over either policy evaluation inputs or data read from `offline storage `__. Note that you may want to set the :py:attr:`~ray.rllib.core.rl_module.rl_module.RLModuleSpec.learner_only` attribute to ``True`` in your custom :py:class:`~ray.rllib.core.rl_module.rl_module.RLModuleSpec` if you don't need the self-supervised model for collecting samples in your :py:class:`~ray.rllib.env.env_runner.EnvRunner` actors. You may also need an extra Learner connector piece in this case to make sure your :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` receives the data it needs to learn. End-to-end example ~~~~~~~~~~~~~~~~~~~~~~~ Putting together the elements of your custom :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` that you implemented, a working end-to-end example is as follows: .. literalinclude:: ../../../rllib/examples/rl_modules/classes/vpg_torch_rlm.py :language: python .. _rllib-rl-module-w-custom-action-dists: Custom action distributions ~~~~~~~~~~~~~~~~~~~~~~~~~~~ The preceding examples rely on the :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` using the correct action distribution with the computed ``ACTION_DIST_INPUTS`` returned by the forward methods. RLlib picks a default distribution class based on the action space, which is :py:class:`~ray.rllib.models.torch.torch_distributions.TorchCategorical` for ``Discrete`` action spaces and :py:class:`~ray.rllib.models.torch.torch_distributions.TorchDiagGaussian` for ``Box`` action spaces. To use a different distribution class and return parameters for this distribution's constructor from your :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` forward methods, you can set the :py:attr:`~ray.rllib.core.rl_module.rl_module.RLModule.action_dist_cls` attribute inside the :py:meth:`~ray.rllib.core.rl_module.rl_module.RLModule.setup` method of your :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule`. See here for an `example script introducing a temperature parameter on top of a Categorical distribution `__. If you need more granularity and want to specify different distribution classes for the different forward methods of your :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule`, override the following methods in your :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` implementation and return different distribution classes from these: - :py:meth:`~ray.rllib.core.rl_module.rl_module.RLModule.get_inference_action_dist_cls` - :py:meth:`~ray.rllib.core.rl_module.rl_module.RLModule.get_exploration_action_dist_cls` - and :py:meth:`~ray.rllib.core.rl_module.rl_module.RLModule.get_train_action_dist_cls` ..
note:: If you only return ``ACTION_DIST_INPUTS`` from your forward methods, RLlib automatically uses the :py:meth:`~ray.rllib.models.distributions.Distribution.to_deterministic` method of the distribution returned by your :py:meth:`~ray.rllib.core.rl_module.rl_module.RLModule.get_inference_action_dist_cls`. See `torch_distributions.py `__ for common distribution implementations. Auto-regressive action distributions ++++++++++++++++++++++++++++++++++++ In an action space with multiple components, for example ``Tuple(a1, a2)``, you may want to condition the sampling of ``a2`` on the sampled value of ``a1``, such that ``a2_sampled ~ P(a2 | a1_sampled, obs)``. Note that in the default, non-autoregressive case, RLlib would use a default model in combination with an independent :py:class:`~ray.rllib.models.torch.torch_distributions.TorchMultiDistribution` and thus sample ``a1`` and ``a2`` independently. This makes it impossible to learn in environments in which one action component should be sampled depending on another, already sampled, action component. See an `example for a "correlated actions" environment `__ here. To write a custom :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` that samples the various action components as previously described, you need to carefully implement its forward logic. Find an `example of such an autoregressive action model `__ here. You implement the main action sampling logic in the ``_forward_...()`` methods: .. literalinclude:: ../../../rllib/examples/rl_modules/classes/autoregressive_actions_rlm.py :language: python :dedent: 4 :start-after: __sphinx_begin__ :end-before: __sphinx_end__ .. TODO: Move this parametric paragraph back in here, once we have the example translated to the new API stack Variable-length / Parametric Action Spaces ++++++++++++++++++++++++++++++++++++++++++ Custom models can be used to work with environments where (1) the set of valid actions `varies per step `__, and/or (2) the number of valid actions is `very large `__. The general idea is that the meaning of actions can be completely conditioned on the observation, i.e., the ``a`` in ``Q(s, a)`` becomes just a token in ``[0, MAX_AVAIL_ACTIONS)`` that only has meaning in the context of ``s``. This works with algorithms in the `DQN and policy-gradient families `__ and can be implemented as follows: 1. The environment should return a mask and/or list of valid action embeddings as part of the observation for each step. To enable batching, the number of actions can be allowed to vary from 1 to some max number: .. code-block:: python class MyParamActionEnv(gym.Env): def __init__(self, max_avail_actions): self.action_space = Discrete(max_avail_actions) self.observation_space = Dict({ "action_mask": Box(0, 1, shape=(max_avail_actions, )), "avail_actions": Box(-1, 1, shape=(max_avail_actions, action_embedding_sz)), "real_obs": ..., }) 2. A custom model can be defined that interprets the ``action_mask`` and ``avail_actions`` portions of the observation. Here the model computes the action logits via the dot product of some network output and each action embedding. Invalid actions can be masked out of the softmax by scaling the probability to zero: ..
code-block:: python class ParametricActionsModel(TFModelV2): def __init__(self, obs_space, action_space, num_outputs, model_config, name, true_obs_shape=(4,), action_embed_size=2): super(ParametricActionsModel, self).__init__( obs_space, action_space, num_outputs, model_config, name) self.action_embed_model = FullyConnectedNetwork(...) def forward(self, input_dict, state, seq_lens): # Extract the available actions tensor from the observation. avail_actions = input_dict["obs"]["avail_actions"] action_mask = input_dict["obs"]["action_mask"] # Compute the predicted action embedding action_embed, _ = self.action_embed_model({ "obs": input_dict["obs"]["cart"] }) # Expand the model output to [BATCH, 1, EMBED_SIZE]. Note that the # avail actions tensor is of shape [BATCH, MAX_ACTIONS, EMBED_SIZE]. intent_vector = tf.expand_dims(action_embed, 1) # Batch dot product => shape of logits is [BATCH, MAX_ACTIONS]. action_logits = tf.reduce_sum(avail_actions * intent_vector, axis=2) # Mask out invalid actions (use tf.float32.min for stability) inf_mask = tf.maximum(tf.log(action_mask), tf.float32.min) return action_logits + inf_mask, state Depending on your use case it may make sense to use |just the masking|_, |just action embeddings|_, or |both|_. For a runnable example of "just action embeddings" in code, check out `examples/parametric_actions_cartpole.py `__. .. |just the masking| replace:: just the **masking** .. _just the masking: https://github.com/ray-project/ray/blob/master/rllib/examples/_old_api_stack/models/action_mask_model.py .. |just action embeddings| replace:: just action **embeddings** .. _just action embeddings: https://github.com/ray-project/ray/blob/master/rllib/examples/parametric_actions_cartpole.py .. |both| replace:: **both** .. _both: https://github.com/ray-project/ray/blob/master/rllib/examples/_old_api_stack/models/parametric_actions_model.py Note that since masking introduces ``tf.float32.min`` values into the model output, this technique might not work with all algorithm options. For example, algorithms might crash if they incorrectly process the ``tf.float32.min`` values. The cartpole example has working configurations for DQN (must set ``hiddens=[]``), PPO (must disable running mean and set ``model.vf_share_layers=True``), and several other algorithms. Not all algorithms support parametric actions; see the `algorithm overview `__. .. _implementing-custom-multi-rl-modules: Implementing custom MultiRLModules ---------------------------------- For multi-module setups, RLlib provides the :py:class:`~ray.rllib.core.rl_module.multi_rl_module.MultiRLModule` class, whose default implementation is a dictionary of individual :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` objects. one for each submodule and identified by a ``ModuleID``. The base-class :py:class:`~ray.rllib.core.rl_module.multi_rl_module.MultiRLModule` implementation works for most of the use cases that need to define independent neural networks. However, for any complex, multi-network or multi-agent use case, where agents share one or more neural networks, you should inherit from this class and override the default implementation. The following code snippets create a custom multi-agent RL module with two simple "policy head" modules, which share the same encoder, the third network in the MultiRLModule. The encoder receives the raw observations from the env and outputs embedding vectors that then serve as input for the two policy heads to compute the agents' actions. .. 
_rllib-rlmodule-guide-implementing-custom-multi-rl-modules: .. tab-set:: .. tab-item:: MultiRLModule (w/ two policy nets and one encoder) .. literalinclude:: ../../../rllib/examples/rl_modules/classes/vpg_using_shared_encoder_rlm.py :language: python :start-after: __sphinx_doc_mrlm_begin__ :end-before: __sphinx_doc_mrlm_end__ .. literalinclude:: ../../../rllib/examples/rl_modules/classes/vpg_using_shared_encoder_rlm.py :language: python :start-after: __sphinx_doc_mrlm_2_begin__ :end-before: __sphinx_doc_mrlm_2_end__ .. tab-item:: Policy RLModule Within the MultiRLModule, you need to have two policy sub-RLModules. They may be of the same class, which you can implement as follows: .. literalinclude:: ../../../rllib/examples/rl_modules/classes/vpg_using_shared_encoder_rlm.py :language: python :start-after: __sphinx_doc_policy_begin__ :end-before: __sphinx_doc_policy_end__ .. literalinclude:: ../../../rllib/examples/rl_modules/classes/vpg_using_shared_encoder_rlm.py :language: python :start-after: __sphinx_doc_policy_2_begin__ :end-before: __sphinx_doc_policy_2_end__ .. tab-item:: Shared encoder RLModule Finally, the shared encoder RLModule should look similar to this: .. literalinclude:: ../../../rllib/examples/rl_modules/classes/vpg_using_shared_encoder_rlm.py :language: python :start-after: __sphinx_doc_encoder_begin__ :end-before: __sphinx_doc_encoder_end__ To plug in the :ref:`custom MultiRLModule ` from the first tab, into your algorithm's config, create a :py:class:`~ray.rllib.core.rl_module.multi_rl_module.MultiRLModuleSpec` with the new class and its constructor settings. Also, create one :py:class:`~ray.rllib.core.rl_module.rl_module.RLModuleSpec` for each agent and the shared encoder RLModule, because RLlib requires their observation and action spaces and their model hyper-parameters: .. literalinclude:: ../../../rllib/examples/rl_modules/classes/vpg_using_shared_encoder_rlm.py :language: python :start-after: __sphinx_doc_how_to_run_begin__ :end-before: __sphinx_doc_how_to_run_end__ .. note:: In order to properly learn with the preceding setup, you should write and use a specific multi-agent :py:class:`~ray.rllib.core.learner.learner.Learner`, capable of handling the shared encoder. This Learner should only have a single optimizer updating all three submodules, which are the encoder and the two policy nets, to stabilize learning. When using the standard "one-optimizer-per-module" Learners, however, the two optimizers for policy 1 and 2 would take turns updating the same shared encoder, which would lead to learning instabilities. .. _rllib-checkpoints-rl-modules-docs: Checkpointing RLModules ----------------------- You can checkpoint :py:class:`~ray.rllib.core.rl_module.rl_module.RLModules` instances with their :py:meth:`~ray.rllib.core.rl_module.rl_module.RLModule.save_to_path` method. If you already have an instantiated RLModule and would like to load new model weights into it from an existing checkpoint, use the :py:meth:`~ray.rllib.core.rl_module.rl_module.RLModule.restore_from_path` method. The following examples show how you can use these methods outside of, or in conjunction with, an RLlib Algorithm. Creating an RLModule checkpoint ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. testcode:: import tempfile import gymnasium as gym from ray.rllib.algorithms.ppo.torch.default_ppo_torch_rl_module import DefaultPPOTorchRLModule from ray.rllib.core.rl_module.default_model_config import DefaultModelConfig env = gym.make("CartPole-v1") # Create an RLModule to later checkpoint. 
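# (Any RLModule works here; the default PPO torch module and the small [32]
# net are just examples.)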
rl_module = DefaultPPOTorchRLModule( observation_space=env.observation_space, action_space=env.action_space, model_config=DefaultModelConfig(fcnet_hiddens=[32]), ) # Finally, write the RLModule checkpoint. module_ckpt_path = tempfile.mkdtemp() rl_module.save_to_path(module_ckpt_path) Creating an RLModule from an (RLModule) checkpoint ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ If you have an RLModule checkpoint saved and would like to create a new RLModule directly from it, use the :py:meth:`~ray.rllib.core.rl_module.rl_module.RLModule.from_checkpoint` method: .. testcode:: from ray.rllib.core.rl_module.rl_module import RLModule # Create a new RLModule from the checkpoint. new_module = RLModule.from_checkpoint(module_ckpt_path) Loading an RLModule checkpoint into a running Algorithm ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. testcode:: from ray.rllib.algorithms.ppo import PPOConfig # Create a new Algorithm (with the changed module config: 32 units instead of the # default 256; otherwise loading the state of ``module`` fails due to a shape # mismatch). config = ( PPOConfig() .environment("CartPole-v1") .rl_module(model_config=DefaultModelConfig(fcnet_hiddens=[32])) ) ppo = config.build() Now you can load the saved RLModule state from the preceding ``module.save_to_path()``, directly into the running Algorithm RLModules. Note that all RLModules within the algorithm get updated, the ones in the Learner workers and the ones in the EnvRunners. .. testcode:: ppo.restore_from_path( module_ckpt_path, # <- NOT an Algorithm checkpoint, but single-agent RLModule one. # Therefore, we have to provide the exact path (of RLlib components) down # to the individual RLModule within the algorithm, which is: component="learner_group/learner/rl_module/default_policy", ) .. testcode:: :hide: import shutil ppo.stop() shutil.rmtree(module_ckpt_path) --- .. include:: /_includes/rllib/new_api_stack.rst .. _rllib-advanced-api-doc: Advanced Python APIs -------------------- Custom training workflows ~~~~~~~~~~~~~~~~~~~~~~~~~ In the `basic training example `__, Tune will call ``train()`` on your algorithm once per training iteration and report the new training results. Sometimes, it's desirable to have full control over training, but still run inside Tune. Tune supports using :ref:`custom trainable functions ` to implement `custom training workflows (example) `__. Curriculum learning ~~~~~~~~~~~~~~~~~~~ In curriculum learning, you can set the environment to different difficulties throughout the training process. This setting allows the algorithm to learn how to solve the actual and final problem incrementally, by interacting with and exploring in more and more difficult phases. Normally, such a curriculum starts with setting the environment to an easy level and then - as training progresses - transitions more toward a harder-to-solve difficulty. See `Reverse Curriculum Generation for Reinforcement Learning Agents `_ blog post for another example of doing curriculum learning. RLlib's Algorithm and custom callbacks APIs allow for implementing any arbitrary curricula. This `example script `__ introduces the basic concepts you need to understand. First, define some env options. This example uses the `FrozenLake-v1` environment, a grid world, whose map you can fully customize. RLlib represents three tasks of different env difficulties with slightly different maps that the agent has to navigate. .. 
literalinclude:: ../../../rllib/examples/curriculum/curriculum_learning.py :language: python :start-after: __curriculum_learning_example_env_options__ :end-before: __END_curriculum_learning_example_env_options__ Then, define the central piece controlling the curriculum, which is a custom callbacks class overriding the :py:meth:`~ray.rllib.callbacks.callbacks.RLlibCallback.on_train_result`. .. TODO move to doc_code and make it use algo configs. .. code-block:: python import ray from ray import tune from ray.rllib.callbacks.callbacks import RLlibCallback class MyCallbacks(RLlibCallback): def on_train_result(self, algorithm, result, **kwargs): if result["env_runners"]["episode_return_mean"] > 200: task = 2 elif result["env_runners"]["episode_return_mean"] > 100: task = 1 else: task = 0 algorithm.env_runner_group.foreach_worker( lambda ev: ev.foreach_env( lambda env: env.set_task(task))) ray.init() tune.Tuner( "PPO", param_space={ "env": YourEnv, "callbacks": MyCallbacks, }, ).fit() Global coordination ~~~~~~~~~~~~~~~~~~~ Sometimes, you need to coordinate between pieces of code that live in different processes managed by RLlib. For example, it can be useful to maintain a global average of a certain variable, or centrally control a hyperparameter that policies use. Ray provides a general way to achieve this coordination through *named actors*. See :ref:`Ray actors ` to learn more. RLlib assigns these actors a global name. You can retrieve handles to them using these names. As an example, consider maintaining a shared global counter that's environments increment and read periodically from the driver program: .. literalinclude:: ./doc_code/advanced_api.py :language: python :start-after: __rllib-adv_api_counter_begin__ :end-before: __rllib-adv_api_counter_end__ Ray actors provide high levels of performance. In more complex cases you can use them to implement communication patterns such as parameter servers and all-reduce. Visualizing custom metrics ~~~~~~~~~~~~~~~~~~~~~~~~~~ Access and visualize custom metrics like any other training result: .. image:: images/custom_metric.png .. _exploration-api: Customizing exploration behavior ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ RLlib offers a unified top-level API to configure and customize an agent’s exploration behavior, including the decisions, like how and whether, to sample actions from distributions, stochastically or deterministically. Set up the behavior using built-in Exploration classes. See `this package `__), which you specify and further configure inside ``AlgorithmConfig().env_runners(..)``. Besides using one of the available classes, you can sub-class any of these built-ins, add custom behavior to it, and use that new class in the config instead. Every policy has an Exploration object, which RLlib creates from the AlgorithmConfig’s ``.env_runners(exploration_config=...)`` method. The method specifies the class to use through the special “type” key, as well as constructor arguments through all other keys. For example: .. literalinclude:: ./doc_code/advanced_api.py :language: python :start-after: __rllib-adv_api_explore_begin__ :end-before: __rllib-adv_api_explore_end__ The following table lists all built-in Exploration sub-classes and the agents that use them by default: .. View table below at: https://docs.google.com/drawings/d/1dEMhosbu7HVgHEwGBuMlEDyPiwjqp_g6bZ0DzCMaoUM/edit?usp=sharing .. 
image:: images/rllib-exploration-api-table.svg An Exploration class implements the ``get_exploration_action`` method, in which you define the exact exploratory behavior. It takes the model’s output, the action distribution class, the model itself, a timestep, like the global env-sampling steps already taken, and an ``explore`` switch. It outputs a tuple of a) action and b) log-likelihood: .. literalinclude:: ../../../rllib/utils/exploration/exploration.py :language: python :start-after: __sphinx_doc_begin_get_exploration_action__ :end-before: __sphinx_doc_end_get_exploration_action__ At the highest level, the ``Algorithm.compute_actions`` and ``Policy.compute_actions`` methods have a boolean ``explore`` switch, which RLlib passes into ``Exploration.get_exploration_action``. If ``explore=None``, RLlib uses the value of ``Algorithm.config[“explore”]``, which serves as a main switch for exploratory behavior, allowing for example turning off of any exploration easily for evaluation purposes. See :ref:`CustomEvaluation`. The following are example excerpts from different Algorithms' configs to setup different exploration behaviors from ``rllib/algorithms/algorithm.py``: .. TODO move to doc_code and make it use algo configs. .. code-block:: python # All of the following configs go into Algorithm.config. # 1) Switching *off* exploration by default. # Behavior: Calling `compute_action(s)` without explicitly setting its `explore` # param results in no exploration. # However, explicitly calling `compute_action(s)` with `explore=True` # still(!) results in exploration (per-call overrides default). "explore": False, # 2) Switching *on* exploration by default. # Behavior: Calling `compute_action(s)` without explicitly setting its # explore param results in exploration. # However, explicitly calling `compute_action(s)` with `explore=False` # results in no(!) exploration (per-call overrides default). "explore": True, # 3) Example exploration_config usages: # a) DQN: see rllib/algorithms/dqn/dqn.py "explore": True, "exploration_config": { # Exploration sub-class by name or full path to module+class # (e.g., “ray.rllib.utils.exploration.epsilon_greedy.EpsilonGreedy”) "type": "EpsilonGreedy", # Parameters for the Exploration class' constructor: "initial_epsilon": 1.0, "final_epsilon": 0.02, "epsilon_timesteps": 10000, # Timesteps over which to anneal epsilon. }, # b) DQN Soft-Q: In order to switch to Soft-Q exploration, do instead: "explore": True, "exploration_config": { "type": "SoftQ", # Parameters for the Exploration class' constructor: "temperature": 1.0, }, # c) All policy-gradient algos and SAC: see rllib/algorithms/algorithm.py # Behavior: The algo samples stochastically from the # model-parameterized distribution. This is the global Algorithm default # setting defined in algorithm.py and used by all PG-type algos (plus SAC). "explore": True, "exploration_config": { "type": "StochasticSampling", "random_timesteps": 0, # timesteps at beginning, over which to act uniformly randomly }, .. _CustomEvaluation: Customized evaluation during training ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ RLlib reports online training rewards, however in some cases you may want to compute rewards with different settings. For example, with exploration turned off, or on a specific set of environment configurations. You can activate evaluating policies during training (``Algorithm.train()``) by setting the ``evaluation_interval`` to a positive integer. 
This value specifies how many ``Algorithm.train()`` calls should occur each time RLlib runs an "evaluation step": .. literalinclude:: ./doc_code/advanced_api.py :language: python :start-after: __rllib-adv_api_evaluation_1_begin__ :end-before: __rllib-adv_api_evaluation_1_end__ An evaluation step runs, using its own ``EnvRunner`` instances, for ``evaluation_duration`` episodes or time-steps, depending on the ``evaluation_duration_unit`` setting, which can take values of either ``"episodes"``, which is the default, or ``"timesteps"``. .. literalinclude:: ./doc_code/advanced_api.py :language: python :start-after: __rllib-adv_api_evaluation_2_begin__ :end-before: __rllib-adv_api_evaluation_2_end__ Note that when using ``evaluation_duration_unit=timesteps`` and the ``evaluation_duration`` setting isn't divisible by the number of evaluation workers, RLlib rounds up the number of time-steps specified to the nearest whole number of time-steps that's divisible by the number of evaluation workers. Also, when using ``evaluation_duration_unit=episodes`` and the ``evaluation_duration`` setting isn't divisible by the number of evaluation workers, RLlib runs the remainder of episodes on the first n evaluation EnvRunners and leave the remaining workers idle for that time. You can configure evaluation workers with ``evaluation_num_env_runners``. For example: .. literalinclude:: ./doc_code/advanced_api.py :language: python :start-after: __rllib-adv_api_evaluation_3_begin__ :end-before: __rllib-adv_api_evaluation_3_end__ Before each evaluation step, RLlib synchronizes weights from the main model to all evaluation workers. By default, RLlib runs the evaluation step, provided one exists in the current iteration, immediately **after** the respective training step. For example, for ``evaluation_interval=1``, the sequence of events is: ``train(0->1), eval(1), train(1->2), eval(2), train(2->3), ...``. The indices show the version of neural network weights RLlib used. ``train(0->1)`` is an update step that changes the weights from version 0 to version 1 and ``eval(1)`` then uses weights version 1. Weights index 0 represents the randomly initialized weights of the neural network. The following is another example. For ``evaluation_interval=2``, the sequence is: ``train(0->1), train(1->2), eval(2), train(2->3), train(3->4), eval(4), ...``. Instead of running ``train`` and ``eval`` steps in sequence, you can also run them in parallel with the ``evaluation_parallel_to_training=True`` config setting. In this case, RLlib runs both training and evaluation steps at the same time using multi-threading. This parallelization can speed up the evaluation process significantly, but leads to a 1-iteration delay between reported training and evaluation results. The evaluation results are behind in this case because they use slightly outdated model weights, which RLlib synchronizes after the previous training step. For example, for ``evaluation_parallel_to_training=True`` and ``evaluation_interval=1``, the sequence is: ``train(0->1) + eval(0), train(1->2) + eval(1), train(2->3) + eval(2)``, where ``+`` connects phases happening at the same time. Note that the change in the weights indices are with respect to the non-parallel examples. The evaluation weights indices are now "one behind" the resulting train weights indices (``train(1->**2**) + eval(**1**)``). When running with the ``evaluation_parallel_to_training=True`` setting, RLlib supports a special "auto" value for ``evaluation_duration``. 
Use this auto setting to make the evaluation step take roughly as long as the concurrently ongoing training step: .. literalinclude:: ./doc_code/advanced_api.py :language: python :start-after: __rllib-adv_api_evaluation_4_begin__ :end-before: __rllib-adv_api_evaluation_4_end__ The ``evaluation_config`` key allows you to override any config settings for the evaluation workers. For example, to switch off exploration in the evaluation steps, do the following: .. literalinclude:: ./doc_code/advanced_api.py :language: python :start-after: __rllib-adv_api_evaluation_5_begin__ :end-before: __rllib-adv_api_evaluation_5_end__ .. note:: Policy gradient algorithms are able to find the optimal policy, even if this is a stochastic one. Setting "explore=False" results in the evaluation workers not using this stochastic policy. RLlib determines the level of parallelism within the evaluation step by the ``evaluation_num_env_runners`` setting. Set this parameter to a larger value if you want the desired evaluation episodes or time-steps to run as much in parallel as possible. For example, if ``evaluation_duration=10``, ``evaluation_duration_unit=episodes``, and ``evaluation_num_env_runners=10``, each evaluation ``EnvRunner`` only has to run one episode in each evaluation step. In case you observe occasional failures in the evaluation EnvRunners during evaluation, for example an environment that sometimes crashes or stalls, use the following combination of settings, to minimize the negative effects of that environment behavior: .. todo (sven): Add link here to new fault-tolerance page, once done. :ref:`fault tolerance settings `, such as Note that with or without parallel evaluation, RLlib respects all fault tolerance settings, such as ``ignore_env_runner_failures`` or ``restart_failed_env_runners``, and applies them to the failed evaluation workers. The following is an example: .. literalinclude:: ./doc_code/advanced_api.py :language: python :start-after: __rllib-adv_api_evaluation_6_begin__ :end-before: __rllib-adv_api_evaluation_6_end__ This example runs the parallel sampling of all evaluation EnvRunners, such that if one of the workers takes too long to run through an episode and return data or fails entirely, the other evaluation EnvRunners still complete the job. If you want to entirely customize the evaluation step, set ``custom_eval_function`` in the config to a callable, which takes the Algorithm object and an :py:class:`~ray.rllib.env.env_runner_group.EnvRunnerGroup` object, the Algorithm's ``self.evaluation_workers`` :py:class:`~ray.rllib.env.env_runner_group.EnvRunnerGroup` instance, and returns a metrics dictionary. See `algorithm.py `__ for further documentation. This end-to-end example shows how to set up a custom online evaluation in `custom_evaluation.py `__. Note that if you only want to evaluate your policy at the end of training, set ``evaluation_interval: [int]``, where ``[int]`` should be the number of training iterations before stopping. Below are some examples of how RLlib reports the custom evaluation metrics nested under the ``evaluation`` key of normal training results: .. TODO make sure these outputs are still valid. .. code-block:: bash ------------------------------------------------------------------------ Sample output for `python custom_evaluation.py --no-custom-eval` ------------------------------------------------------------------------ INFO algorithm.py:623 -- Evaluating current policy for 10 episodes. 
INFO algorithm.py:650 -- Running round 0 of parallel evaluation (2/10 episodes) INFO algorithm.py:650 -- Running round 1 of parallel evaluation (4/10 episodes) INFO algorithm.py:650 -- Running round 2 of parallel evaluation (6/10 episodes) INFO algorithm.py:650 -- Running round 3 of parallel evaluation (8/10 episodes) INFO algorithm.py:650 -- Running round 4 of parallel evaluation (10/10 episodes) Result for PG_SimpleCorridor_2c6b27dc: ... evaluation: env_runners: custom_metrics: {} episode_len_mean: 15.864661654135338 episode_return_max: 1.0 episode_return_mean: 0.49624060150375937 episode_return_min: 0.0 episodes_this_iter: 133 .. code-block:: bash ------------------------------------------------------------------------ Sample output for `python custom_evaluation.py` ------------------------------------------------------------------------ INFO algorithm.py:631 -- Running custom eval function Update corridor length to 4 Update corridor length to 7 Custom evaluation round 1 Custom evaluation round 2 Custom evaluation round 3 Custom evaluation round 4 Result for PG_SimpleCorridor_0de4e686: ... evaluation: env_runners: custom_metrics: {} episode_len_mean: 9.15695067264574 episode_return_max: 1.0 episode_return_mean: 0.9596412556053812 episode_return_min: 0.0 episodes_this_iter: 223 foo: 1 Rewriting trajectories ~~~~~~~~~~~~~~~~~~~~~~ In the ``on_postprocess_traj`` callback you have full access to the trajectory batch (``post_batch``) and other training state. You can use this information to rewrite the trajectory, which has a number of uses including: * Backdating rewards to previous time steps, for example, based on values in ``info``. * Adding model-based curiosity bonuses to rewards. You can train the model with a `custom model supervised loss `__. To access the policy or model (``policy.model``) in the callbacks, note that ``info['pre_batch']`` returns a tuple where the first element is a policy and the second one is the batch itself. You can also access all the rollout worker state using the following call: .. TODO move to doc_code and make it use algo configs. .. code-block:: python from ray.rllib.evaluation.rollout_worker import get_global_worker # You can use this call from any callback to get a reference to the # RolloutWorker running in the process. The RolloutWorker has references to # all the policies, etc. See rollout_worker.py for more info. rollout_worker = get_global_worker() RLlib defines policy losses over the ``post_batch`` data, so you can mutate that in the callbacks to change what data the policy loss function sees. --- .. include:: /_includes/rllib/we_are_hiring.rst .. _rllib-algorithms-doc: Algorithms ========== .. include:: /_includes/rllib/new_api_stack.rst The following table is an overview of all available algorithms in RLlib. Note that all algorithms support multi-GPU training on a single (GPU) node in `Ray (open-source) `__ (|multi_gpu|) as well as multi-GPU training on multi-node (GPU) clusters when using the `Anyscale platform `__ (|multi_node_multi_gpu|). 
+-----------------------------------------------------------------------------+------------------------------+------------------------------------+--------------------------------+ | **Algorithm** | **Single- and Multi-agent** | **Multi-GPU (multi-node)** | **Action Spaces** | +-----------------------------------------------------------------------------+------------------------------+------------------------------------+--------------------------------+ | **On-Policy** | +-----------------------------------------------------------------------------+------------------------------+------------------------------------+--------------------------------+ | :ref:`PPO (Proximal Policy Optimization) ` | |single_agent| |multi_agent| | |multi_gpu| |multi_node_multi_gpu| | |cont_actions| |discr_actions| | +-----------------------------------------------------------------------------+------------------------------+------------------------------------+--------------------------------+ | **Off-Policy** | +-----------------------------------------------------------------------------+------------------------------+------------------------------------+--------------------------------+ | :ref:`DQN/Rainbow (Deep Q Networks) ` | |single_agent| |multi_agent| | |multi_gpu| |multi_node_multi_gpu| | |discr_actions| | +-----------------------------------------------------------------------------+------------------------------+------------------------------------+--------------------------------+ | :ref:`SAC (Soft Actor Critic) ` | |single_agent| |multi_agent| | |multi_gpu| |multi_node_multi_gpu| | |cont_actions| |discr_actions| | +-----------------------------------------------------------------------------+------------------------------+------------------------------------+--------------------------------+ | **High-throughput on- and off policy** | +-----------------------------------------------------------------------------+------------------------------+------------------------------------+--------------------------------+ | :ref:`APPO (Asynchronous Proximal Policy Optimization) ` | |single_agent| |multi_agent| | |multi_gpu| |multi_node_multi_gpu| | |cont_actions| |discr_actions| | +-----------------------------------------------------------------------------+------------------------------+------------------------------------+--------------------------------+ | :ref:`IMPALA (Importance Weighted Actor-Learner Architecture) ` | |single_agent| |multi_agent| | |multi_gpu| |multi_node_multi_gpu| | |discr_actions| | +-----------------------------------------------------------------------------+------------------------------+------------------------------------+--------------------------------+ | **Model-based RL** | +-----------------------------------------------------------------------------+------------------------------+------------------------------------+--------------------------------+ | :ref:`DreamerV3 ` | |single_agent| | |multi_gpu| |multi_node_multi_gpu| | |cont_actions| |discr_actions| | +-----------------------------------------------------------------------------+------------------------------+------------------------------------+--------------------------------+ | **Offline RL and Imitation Learning** | +-----------------------------------------------------------------------------+------------------------------+------------------------------------+--------------------------------+ | :ref:`BC (Behavior Cloning) ` | |single_agent| | |multi_gpu| |multi_node_multi_gpu| | |cont_actions| |discr_actions| | 
+-----------------------------------------------------------------------------+------------------------------+------------------------------------+--------------------------------+ | :ref:`CQL (Conservative Q-Learning) ` | |single_agent| | |multi_gpu| |multi_node_multi_gpu| | |cont_actions| | +-----------------------------------------------------------------------------+------------------------------+------------------------------------+--------------------------------+ | :ref:`IQL (Implicit Q-Learning) ` | |single_agent| | |multi_gpu| |multi_node_multi_gpu| | |cont_actions| | +-----------------------------------------------------------------------------+------------------------------+------------------------------------+--------------------------------+ | :ref:`MARWIL (Monotonic Advantage Re-Weighted Imitation Learning) ` | |single_agent| | |multi_gpu| |multi_node_multi_gpu| | |cont_actions| |discr_actions| | +-----------------------------------------------------------------------------+------------------------------+------------------------------------+--------------------------------+ | **Algorithm Extensions and -Plugins** | +-----------------------------------------------------------------------------+------------------------------+------------------------------------+--------------------------------+ | :ref:`Curiosity-driven Exploration by Self-supervised Prediction ` | |single_agent| | |multi_gpu| |multi_node_multi_gpu| | |cont_actions| |discr_actions| | +-----------------------------------------------------------------------------+------------------------------+------------------------------------+--------------------------------+ On-policy ~~~~~~~~~ .. _ppo: Proximal Policy Optimization (PPO) ---------------------------------- `[paper] `__ `[implementation] `__ .. figure:: images/algos/ppo-architecture.svg :width: 750 :align: left **PPO architecture:** In a training iteration, PPO performs three major steps: 1. Sampling a set of episodes or episode fragments 1. Converting these into a train batch and updating the model using a clipped objective and multiple SGD passes over this batch 1. Syncing the weights from the Learners back to the EnvRunners PPO scales out on both axes, supporting multiple EnvRunners for sample collection and multiple GPU- or CPU-based Learners for updating the model. **Tuned examples:** `Pong-v5 `__, `CartPole-v1 `__. `Pendulum-v1 `__. **PPO-specific configs** (see also :ref:`generic algorithm settings `): .. autoclass:: ray.rllib.algorithms.ppo.ppo.PPOConfig :members: training Off-Policy ~~~~~~~~~~ .. _dqn: Deep Q Networks (DQN, Rainbow, Parametric DQN) ---------------------------------------------- `[paper] `__ `[implementation] `__ .. figure:: images/algos/dqn-architecture.svg :width: 650 :align: left **DQN architecture:** DQN uses a replay buffer to temporarily store episode samples that RLlib collects from the environment. Throughout different training iterations, these episodes and episode fragments are re-sampled from the buffer and re-used for updating the model, before eventually being discarded when the buffer has reached capacity and new samples keep coming in (FIFO). This reuse of training data makes DQN very sample-efficient and off-policy. DQN scales out on both axes, supporting multiple EnvRunners for sample collection and multiple GPU- or CPU-based Learners for updating the model. All of the DQN improvements evaluated in `Rainbow `__ are available, though not all are enabled by default. See also how to use `parametric-actions in DQN `__. 
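As a minimal sketch, and not a tuned configuration, a DQN setup with a few of the options from the Rainbow hint further below switched on could look like this (assuming a current new-API-stack RLlib version):

.. code-block:: python

    from ray.rllib.algorithms.dqn import DQNConfig

    config = (
        DQNConfig()
        .environment("CartPole-v1")
        .training(
            # Use double-Q learning and a dueling head.
            double_q=True,
            dueling=True,
            # A few Rainbow-style options (see the hint below).
            n_step=3,
            noisy=True,
            num_atoms=51,
            # Set v_min/v_max to the expected return range of the env
            # (CartPole-v1 returns lie between 0 and 500).
            v_min=0.0,
            v_max=500.0,
        )
    )
    dqn = config.build()
    print(dqn.train())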
**Tuned examples:** `PongDeterministic-v4 `__, `Rainbow configuration `__, `{BeamRider,Breakout,Qbert,SpaceInvaders}NoFrameskip-v4 `__, `with Dueling and Double-Q `__, `with Distributional DQN `__. .. hint:: For a complete `rainbow `__ setup, make the following changes to the default DQN config: ``"n_step": [between 1 and 10], "noisy": True, "num_atoms": [more than 1], "v_min": -10.0, "v_max": 10.0`` (set ``v_min`` and ``v_max`` according to your expected range of returns). **DQN-specific configs** (see also :ref:`generic algorithm settings `): .. autoclass:: ray.rllib.algorithms.dqn.dqn.DQNConfig :members: training .. _sac: Soft Actor Critic (SAC) ------------------------ `[original paper] `__, `[follow-up paper] `__, `[implementation] `__. .. figure:: images/algos/sac-architecture.svg :width: 750 :align: left **SAC architecture:** SAC uses a replay buffer to temporarily store episode samples that RLlib collects from the environment. Throughout different training iterations, these episodes and episode fragments are re-sampled from the buffer and re-used for updating the model, before eventually being discarded when the buffer has reached capacity and new samples keep coming in (FIFO). This reuse of training data makes SAC very sample-efficient and off-policy. SAC scales out on both axes, supporting multiple EnvRunners for sample collection and multiple GPU- or CPU-based Learners for updating the model. **Tuned examples:** `Pendulum-v1 `__, `HalfCheetah-v3 `__. **SAC-specific configs** (see also :ref:`generic algorithm settings `): .. autoclass:: ray.rllib.algorithms.sac.sac.SACConfig :members: training High-Throughput On- and Off-Policy ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. _appo: Asynchronous Proximal Policy Optimization (APPO) ------------------------------------------------ .. tip:: APPO was originally `published under the name "IMPACT" `__. RLlib's APPO exactly matches the algorithm described in the paper. `[paper] `__ `[implementation] `__ .. figure:: images/algos/appo-architecture.svg :width: 750 :align: left **APPO architecture:** APPO is an asynchronous variant of :ref:`Proximal Policy Optimization (PPO) ` based on the IMPALA architecture, but using a surrogate policy loss with clipping, allowing for multiple SGD passes per collected train batch. In a training iteration, APPO requests samples from all EnvRunners asynchronously and the collected episode samples are returned to the main algorithm process as Ray references rather than actual objects available on the local process. APPO then passes these episode references to the Learners for asynchronous updates of the model. RLlib doesn't always sync back the weights to the EnvRunners right after a new model version is available. To account for the EnvRunners being off-policy, APPO uses a procedure called v-trace, `described in the IMPALA paper `__. APPO scales out on both axes, supporting multiple EnvRunners for sample collection and multiple GPU- or CPU-based Learners for updating the model. **Tuned examples:** `Pong-v5 `__, `HalfCheetah-v4 `__ **APPO-specific configs** (see also :ref:`generic algorithm settings `): .. autoclass:: ray.rllib.algorithms.appo.appo.APPOConfig :members: training .. _impala: Importance Weighted Actor-Learner Architecture (IMPALA) ------------------------------------------------------- `[paper] `__ `[implementation] `__ ..
figure:: images/algos/impala-architecture.svg :width: 750 :align: left **IMPALA architecture:** In a training iteration, IMPALA requests samples from all EnvRunners asynchronously and the collected episodes are returned to the main algorithm process as Ray references rather than actual objects available on the local process. IMPALA then passes these episode references to the Learners for asynchronous updates of the model. RLlib doesn't always sync back the weights to the EnvRunners right after a new model version is available. To account for the EnvRunners being off-policy, IMPALA uses a procedure called v-trace, `described in the paper `__. IMPALA scales out on both axes, supporting multiple EnvRunners for sample collection and multiple GPU- or CPU-based Learners for updating the model. Tuned examples: `PongNoFrameskip-v4 `__, `vectorized configuration `__, `multi-gpu configuration `__, `{BeamRider,Breakout,Qbert,SpaceInvaders}NoFrameskip-v4 `__. .. figure:: images/impala.png :width: 650 Multi-GPU IMPALA scales up to solve PongNoFrameskip-v4 in ~3 minutes using a pair of V100 GPUs and 128 CPU workers. The maximum training throughput reached is ~30k transitions per second (~120k environment frames per second). **IMPALA-specific configs** (see also :ref:`generic algorithm settings `): .. autoclass:: ray.rllib.algorithms.impala.impala.IMPALAConfig :members: training Model-based RL ~~~~~~~~~~~~~~ .. _dreamerv3: DreamerV3 --------- `[paper] `__ `[implementation] `__ `[RLlib readme] `__ Also see `this README here for more details on how to run experiments `__ with DreamerV3. .. figure:: images/algos/dreamerv3-architecture.svg :width: 850 :align: left **DreamerV3 architecture:** DreamerV3 trains a recurrent WORLD_MODEL in supervised fashion using real environment interactions sampled from a replay buffer. The world model's objective is to correctly predict the transition dynamics of the RL environment: next observation, reward, and a boolean continuation flag. DreamerV3 trains the actor- and critic-networks on synthesized trajectories only, which are "dreamed" by the WORLD_MODEL. The algorithm scales out on both axes, supporting multiple :py:class:`~ray.rllib.env.env_runner.EnvRunner` actors for sample collection and multiple GPU- or CPU-based :py:class:`~ray.rllib.core.learner.learner.Learner` actors for updating the model. It can also be used in different environment types, including those with image- or vector based observations, continuous- or discrete actions, as well as sparse or dense reward functions. **Tuned examples:** `Atari 100k `__, `Atari 200M `__, `DeepMind Control Suite `__ **Pong-v5 results (1, 2, and 4 GPUs)**: .. figure:: images/dreamerv3/pong_1_2_and_4gpus.svg Episode mean rewards for the Pong-v5 environment (with the "100k" setting, in which only 100k environment steps are allowed): Note that despite the stable sample efficiency - shown by the constant learning performance per env step - the wall time improves almost linearly as we go from 1 to 4 GPUs. **Left**: Episode reward over environment timesteps sampled. **Right**: Episode reward over wall-time. **Atari 100k results (1 vs 4 GPUs)**: .. figure:: images/dreamerv3/atari100k_1_vs_4gpus.svg Episode mean rewards for various Atari 100k tasks on 1 vs 4 GPUs. **Left**: Episode reward over environment timesteps sampled. **Right**: Episode reward over wall-time. **DeepMind Control Suite (vision) results (1 vs 4 GPUs)**: .. 
figure:: images/dreamerv3/dmc_1_vs_4gpus.svg Episode mean rewards for various DeepMind Control Suite (vision) tasks on 1 vs 4 GPUs. **Left**: Episode reward over environment timesteps sampled. **Right**: Episode reward over wall-time. Offline RL and Imitation Learning ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. _bc: Behavior Cloning (BC) --------------------- `[paper] `__ `[implementation] `__ .. figure:: images/algos/bc-architecture.svg :width: 750 :align: left **BC architecture:** RLlib's behavioral cloning (BC) uses Ray Data to tap into its parallel data processing capabilities. In one training iteration, BC reads episodes in parallel from offline files, for example `parquet `__, using the n DataWorkers. Connector pipelines then preprocess these episodes into train batches and send these as data iterators directly to the n Learners for updating the model. RLlib's BC implementation is directly derived from its `MARWIL`_ implementation, with the only difference being the ``beta`` parameter (set to 0.0). This makes BC try to match the behavior policy, which generated the offline data, disregarding any resulting rewards. **Tuned examples:** `CartPole-v1 `__, `Pendulum-v1 `__ **BC-specific configs** (see also :ref:`generic algorithm settings `): .. autoclass:: ray.rllib.algorithms.bc.bc.BCConfig :members: training .. _cql: Conservative Q-Learning (CQL) ----------------------------- `[paper] `__ `[implementation] `__ .. figure:: images/algos/cql-architecture.svg :width: 750 :align: left **CQL architecture:** CQL (Conservative Q-Learning) is an offline RL algorithm that mitigates the overestimation of Q-values outside the dataset distribution through a conservative critic estimate. It adds a simple Q regularizer loss to the standard Bellman update loss, ensuring that the critic doesn't output overly optimistic Q-values. The `SACLearner` adds this conservative correction term to the TD-based Q-learning loss. **Tuned examples:** `Pendulum-v1 `__ **CQL-specific configs** (see also :ref:`generic algorithm settings `): .. autoclass:: ray.rllib.algorithms.cql.cql.CQLConfig :members: training .. _iql: Implicit Q-Learning (IQL) ------------------------- `[paper] `__ `[implementation] `__ **IQL architecture:** IQL (Implicit Q-Learning) is an offline RL algorithm that never needs to evaluate actions outside of the dataset, but still enables the learned policy to improve substantially over the best behavior in the data through generalization. Instead of standard TD-error minimization, it introduces a value function trained through expectile regression, which yields a conservative estimate of returns. This allows policy improvement through advantage-weighted behavior cloning, ensuring safer generalization without explicit exploration. The `IQLLearner` replaces the usual TD-based value loss with an expectile regression loss, and trains the policy to imitate high-advantage actions, enabling substantial performance gains over the behavior policy using only in-dataset actions. **Tuned examples:** `Pendulum-v1 `__ **IQL-specific configs** (see also :ref:`generic algorithm settings `): .. autoclass:: ray.rllib.algorithms.iql.iql.IQLConfig :members: training .. _marwil: Monotonic Advantage Re-Weighted Imitation Learning (MARWIL) ----------------------------------------------------------- `[paper] `__ `[implementation] `__ .. figure:: images/algos/marwil-architecture.svg :width: 750 :align: left **MARWIL architecture:** MARWIL is a hybrid imitation learning and policy gradient algorithm suitable for training on batched historical data.
When the ``beta`` hyperparameter is set to zero, the MARWIL objective reduces to plain imitation learning (see `BC`_). MARWIL uses Ray Data to tap into its parallel data processing capabilities. In one training iteration, MARWIL reads episodes in parallel from offline files, for example `parquet `__, using the n DataWorkers. Connector pipelines preprocess these episodes into train batches and send these as data iterators directly to the n Learners for updating the model. **Tuned examples:** `CartPole-v1 `__ **MARWIL-specific configs** (see also :ref:`generic algorithm settings `): .. autoclass:: ray.rllib.algorithms.marwil.marwil.MARWILConfig :members: training Algorithm Extensions and Plugins ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. _icm: Curiosity-driven Exploration by Self-supervised Prediction ---------------------------------------------------------- `[paper] `__ `[implementation] `__ .. figure:: images/algos/curiosity-architecture.svg :width: 850 :align: left **Intrinsic Curiosity Model (ICM) architecture:** The main idea behind ICM is to train a world-model (in parallel to the "main" policy) to predict the environment's dynamics. The loss of the world model is the intrinsic reward that the `ICMLearner` adds to the env's (extrinsic) reward. This ensures that in regions of the environment that are relatively unknown (where the world model predicts poorly what happens next), the artificial intrinsic reward is large and the agent is motivated to explore these unknown regions. RLlib's curiosity implementation works with any of RLlib's algorithms. See these example implementations on top of `PPO and DQN `__. ICM uses the chosen Algorithm's `training_step()` as-is, but executes the following additional steps during `LearnerGroup.update`: it duplicates the train batch of the "main" policy and uses it for a self-supervised update of the ICM, then computes the intrinsic rewards with the ICM and adds them to the extrinsic (env) rewards, and finally continues updating the "main" policy. **Tuned examples:** `12x12 FrozenLake-v1 `__ .. |single_agent| image:: images/sigils/single-agent.svg :class: inline-figure :width: 84 .. |multi_agent| image:: images/sigils/multi-agent.svg :class: inline-figure :width: 84 .. |multi_gpu| image:: images/sigils/multi-gpu.svg :class: inline-figure :width: 84 .. |multi_node_multi_gpu| image:: images/sigils/multi-node-multi-gpu.svg :class: inline-figure :width: 84 .. |discr_actions| image:: images/sigils/discr-actions.svg :class: inline-figure :width: 84 .. |cont_actions| image:: images/sigils/cont-actions.svg :class: inline-figure :width: 84 --- .. include:: /_includes/rllib/we_are_hiring.rst .. _rllib-callback-docs: Callbacks ========= .. include:: /_includes/rllib/new_api_stack.rst Callbacks are the most straightforward way to inject code into experiments. You can define the code to execute at certain events and pass it to your :py:class:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig`. The following is an example of defining a simple lambda that prints out an episode's return after the episode terminates: .. testcode:: from ray.rllib.algorithms.ppo import PPOConfig ppo = ( PPOConfig() .environment("CartPole-v1") .callbacks( on_episode_end=( lambda episode, **kw: print(f"Episode done. R={episode.get_return()}") ) ) .build() ) ppo.train() ..
testcode:: :hide: ppo.stop() Callback lambdas versus stateful RLlibCallback ---------------------------------------------- There are two ways to define custom code for various callback events to execute. Callback lambdas ~~~~~~~~~~~~~~~~ If the injected code is rather simple and doesn't need to store temporary information for reuse in succeeding event calls, you can use a lambda and pass it to the :py:meth:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig.callbacks` method as previously shown. See ref:`Callback events ` for a complete list. The names of the events always match the argument names for the :py:meth:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig.callbacks` method. Stateful RLlibCallback ~~~~~~~~~~~~~~~~~~~~~~ If the injected code is stateful and temporarily stores results for reuse in succeeding calls triggered by the same or a different event, you need to subclass the :py:class:`~ray.rllib.callbacks.callbacks.RLlibCallback` API and then implement one or more methods, for example :py:meth:`~ray.rllib.callbacks.callbacks.RLlibCallback.on_algorithm_init`: The following is the same example that prints out a terminated episode's return, but uses a subclass of :py:class:`~ray.rllib.callbacks.callbacks.RLlibCallback`. .. testcode:: from ray.rllib.algorithms.ppo import PPOConfig from ray.rllib.callbacks.callbacks import RLlibCallback class EpisodeReturn(RLlibCallback): def __init__(self): super().__init__() # Keep some global state in between individual callback events. self.overall_sum_of_rewards = 0.0 def on_episode_end(self, *, episode, **kwargs): self.overall_sum_of_rewards += episode.get_return() print(f"Episode done. R={episode.get_return()} Global SUM={self.overall_sum_of_rewards}") ppo = ( PPOConfig() .environment("CartPole-v1") .callbacks(EpisodeReturn) .build() ) ppo.train() .. testcode:: :hide: ppo.stop() .. _rllib-callback-event-overview: Callback events --------------- During a training iteration, the Algorithm normally walks through the following event tree, a high-level overview of all supported events in RLlib's callbacks system: .. code-block:: text Algorithm .__init__() `on_algorithm_init` - After algorithm construction and setup. .train() `on_train_result` - After a training iteration. .evaluate() `on_evaluate_start` - Before evaluation starts using the eval ``EnvRunnerGroup``. `on_evaluate_end` - After evaluation is finished. .restore_from_path() `on_checkpoint_loaded` - After a checkpoint's new state has been loaded. EnvRunner .__init__() `on_environment_created` - After the RL environment has been created. .sample() `on_episode_created` - After a new episode object has been created. `on_episode_start` - After an episode object has started (after ``env.reset()``). `on_episode_step` - After an episode object has stepped (after ``env.step()``). `on_episode_end` - After an episode object has terminated (or truncated). `on_sample_end` - At the end of the ``EnvRunner.sample()`` call. Note that some of the events in the tree happen simultaneously, on different processes through Ray actors. For example an EnvRunner actor may trigger its ``on_episode_start`` event while at the same time another EnvRunner actor may trigger its ``on_sample_end`` event and the main Algorithm process triggers ``on_train_result``. .. note:: RLlib only invokes callbacks in :py:class:`~ray.rllib.algorithms.algorithm.Algorithm` and :py:class:`~ray.rllib.env.env_runner.EnvRunner` actors. 
The Ray team is considering expanding callbacks onto :py:class:`~ray.rllib.core.learner.learner.Learner` actors and possibly :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` instances as well. .. currentmodule:: ray.rllib.callbacks.callbacks .. dropdown:: Algorithm-bound methods of ``RLlibCallback`` .. autosummary:: RLlibCallback.on_algorithm_init RLlibCallback.on_evaluate_start RLlibCallback.on_evaluate_end RLlibCallback.on_env_runners_recreated RLlibCallback.on_checkpoint_loaded .. dropdown:: EnvRunner-bound methods of ``RLlibCallback`` .. autosummary:: RLlibCallback.on_environment_created RLlibCallback.on_episode_created RLlibCallback.on_episode_start RLlibCallback.on_episode_step RLlibCallback.on_episode_end RLlibCallback.on_sample_end Chaining callbacks ------------------ You can define more than one :py:class:`~ray.rllib.callbacks.callbacks.RLlibCallback` class and send them in a list to the :py:meth:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig.callbacks` method. You can also send lists of callables, instead of a single callable, to the different arguments of that method. For example, if you already wrote a subclass of :py:class:`~ray.rllib.callbacks.callbacks.RLlibCallback` and want to reuse it in different experiments. Because one of your experiments requires some debug callback code, you want to inject it only temporarily for a couple of runs. Resolution order of chained callbacks ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ RLlib resolves all available callback methods and callables for a given event as follows: Subclasses of :py:class:`~ray.rllib.callbacks.callbacks.RLlibCallback` take precedence over individual or lists of callables that you provide through the various arguments of the :py:meth:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig.callbacks` method. For example, assume the callback event is ``on_train_result``, which fires at the end of a training iteration and inside the algorithm's process: - RLlib loops through the list of all given :py:class:`~ray.rllib.callbacks.callbacks.RLlibCallback` subclasses and calls their ``on_train_result`` method. Thereby, it keeps the exact order the user provided in the list. - RLlib then loops through the list of all defined ``on_train_result`` callables. You configured these by calling the :py:meth:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig.callbacks` method and defining the ``on_train_result`` argument in this call. .. code-block:: python class MyCallbacks(RLlibCallback): def on_train_result(self, *, algorithm, metrics_logger, result, **kwargs): print("RLlibCallback subclass") class MyDebugCallbacks(RLlibCallback): def on_train_result(self, *, algorithm, metrics_logger, result, **kwargs): print("debug subclass") # Define the callbacks order through the config. # Subclasses first, then individual `on_train_result` (or other events) callables: config.callbacks( callbacks_class=[MyDebugCallbacks, MyCallbacks], # <- note: debug class first on_train_result=[ lambda algorithm, **kw: print('lambda 1'), lambda algorithm, **kw: print('lambda 2'), ], ) # When training the algorithm, after each training iteration, you should see # something like: # > debug subclass # > RLlibCallback subclass # > lambda 1 # > lambda 2 Examples -------- The following are two examples showing you how to setup custom callbacks on the :ref:`Algorithm ` process as well as on the :ref:`EnvRunner ` processes. .. 
_rllib-callback-example-on-train-result: Example 1: `on_train_result` ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The following example demonstrates how to implement a simple custom function writing the replay buffer contents to disk from time to time. You normally don't want to write the contents of buffers along with your :ref:`Algorithm checkpoints `, so writing less often, in a more controlled fashion through a custom callback could be a good compromise. .. testcode:: import ormsgpack from ray.rllib.algorithms.dqn import DQNConfig def _write_buffer_if_necessary(algorithm, metrics_logger, result): # Write the buffer contents only every ith iteration. if algorithm.training_iteration % 2 == 0: # python dict buffer_contents = algorithm.local_replay_buffer.get_state() # binary msgpacked = ormsgpack.packb( buffer_contents, option=ormsgpack.OPT_SERIALIZE_NUMPY, ) # Open some file and write the buffer contents into it using `ormsgpack`. with open("replay_buffer_contents.msgpack", "wb") as f: f.write(msgpacked) config = ( DQNConfig() .environment("CartPole-v1") .callbacks( on_train_result=_write_buffer_if_necessary, ) ) dqn = config.build() # Train n times. Expect RLlib to write buffer every ith iteration. for _ in range(4): print(dqn.train()) See :ref:`Callbacks invoked in Algorithm ` for the exact call signatures of all available callbacks and the argument types that they expect. .. _rllib-callback-example-on-episode-step-and-end: Example 2: `on_episode_step` and `on_episode_end` ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The following example demonstrates how to implement a custom :py:class:`~ray.rllib.callbacks.callbacks.RLlibCallback` class computing the average "first-joint angle" of the `Acrobot-v1 RL environment `__: .. figure:: images/acrobot-v1.png :width: 150 :align: left **The Acrobot-v1 environment**: The env code describes the angle you are about to compute and log through your custom callback as: .. code-block:: text `theta1` is the angle of the first joint, where an angle of 0.0 indicates that the first link is pointing directly downwards. This example utilizes RLlib's :py:class:`~ray.rllib.utils.metrics.metrics_logger.MetricsLogger` API to log the custom computations of the injected code. See :ref:`rllib-metric-logger-docs` for more details about the MetricsLogger API. Also, see this more complex example that `generates and logs a PacMan heatmap (image) to WandB `__. .. testcode:: import math import numpy as np from ray.rllib.algorithms.ppo import PPOConfig from ray.rllib.callbacks.callbacks import RLlibCallback class LogAcrobotAngle(RLlibCallback): def on_episode_created(self, *, episode, **kwargs): # Initialize an empty list in the `custom_data` property of `episode`. episode.custom_data["theta1"] = [] def on_episode_step(self, *, episode, env, **kwargs): # First get the angle from the env (note that `env` is a VectorEnv). # See https://github.com/Farama-Foundation/Gymnasium/blob/main/gymnasium/envs/classic_control/acrobot.py # for the env source code. cos_theta1, sin_theta1 = env.envs[0].unwrapped.state[0], env.envs[0].unwrapped.state[1] # Convert cos/sin/tan into degree. deg_theta1 = math.degrees(math.atan2(sin_theta1, cos_theta1)) # Log the theta1 degree value in the episode object, temporarily. episode.custom_data["theta1"].append(deg_theta1) def on_episode_end(self, *, episode, metrics_logger, **kwargs): # Get all the logged theta1 degree values and average them. 
theta1s = episode.custom_data["theta1"] avg_theta1 = np.mean(theta1s) # Log the final result - per episode - to the MetricsLogger. # Report with a sliding/smoothing window of 50. metrics_logger.log_value("theta1_mean", avg_theta1, reduce="mean", window=50) config = ( PPOConfig() .environment("Acrobot-v1") .callbacks( callbacks_class=LogAcrobotAngle, ) ) ppo = config.build() # Train n times. Expect to find `theta1_mean` in the results under: # `env_runners/theta1_mean` for i in range(10): results = ppo.train() print( f"iter={i} " f"theta1_mean={results['env_runners']['theta1_mean']} " f"R={results['env_runners']['episode_return_mean']}" ) .. tip:: You can base your custom logic on whether the calling EnvRunner is a regular "training" EnvRunner, used to collect training samples, or an evaluation EnvRunner, used to play through episodes for evaluation only. Access the ``env_runner.config.in_evaluation`` boolean flag, which is True on evaluation ``EnvRunner`` actors and False on ``EnvRunner`` actors used to collect training data. See :ref:`Callbacks invoked in Algorithm ` for the exact call signatures of all available callbacks and the argument types they expect. --- .. include:: /_includes/rllib/we_are_hiring.rst .. include:: /_includes/rllib/new_api_stack.rst Install RLlib for Development ============================= You can develop RLlib locally without needing to compile Ray by using the `setup-dev.py script `__. This sets up symlinks between the ``ray/rllib`` dir in your local git clone and the respective directory bundled with the pip-installed ``ray`` package. This way, every change you make in the source files in your local git clone will immediately be reflected in your installed ``ray`` as well. However if you have installed ray from source using `these instructions `__ then don't use this, as these steps should have already created the necessary symlinks. When using the `setup-dev.py script `__, make sure that your git branch is in sync with the installed Ray binaries, meaning you are up-to-date on `master `__ and have the latest `wheel `__ installed. .. code-block:: bash # Clone your fork onto your local machine, e.g.: git clone https://github.com/[your username]/ray.git cd ray # Only enter 'Y' at the first question on linking RLlib. # This leads to the most stable behavior and you won't have to re-install ray as often. # If you anticipate making changes to e.g. Tune or Train quite often, consider also symlinking Ray Tune or Train here # (say 'Y' when asked by the script about creating the Tune or Train symlinks). python python/ray/setup-dev.py Contributing to RLlib ===================== Contributing Fixes and Enhancements ----------------------------------- Feel free to file new RLlib-related PRs through `Ray's github repo `__. The RLlib team is very grateful for any external help they can get from the open-source community. If you are unsure about how to structure your bug-fix or enhancement-PRs, create a small PR first, then ask us questions within its conversation section. `See here for an example of a good first community PR `__. Contributing Algorithms ----------------------- These are the guidelines for merging new algorithms into RLlib. We distinguish between two levels of contributions: As an `example script `__ (possibly with additional classes in other files) or as a fully-integrated RLlib Algorithm in `rllib/algorithms `__. 
* Example Algorithms: - must subclass Algorithm and implement the ``training_step()`` method - must include the main example script, in which the algo is demoed, in a CI test, which proves that the algo is learning a certain task. - should offer functionality not present in existing algorithms * Fully integrated Algorithms have the following additional requirements: - must offer substantial new functionality not possible to add to other algorithms - should support custom RLModules - should use RLlib abstractions and support distributed execution - should include at least one `tuned hyperparameter example `__, testing of which is part of the CI Both integrated and contributed algorithms ship with the ``ray`` PyPI package, and are tested as part of Ray's automated tests. New Features ------------ New feature developments, discussions, and upcoming priorities are tracked on the `GitHub issues page `__ (note that this may not include all development efforts). API Stability ============= API Decorators in the Codebase ------------------------------ Objects and methods annotated with ``@PublicAPI`` (new API stack), ``@DeveloperAPI`` (new API stack), or ``@OldAPIStack`` (old API stack) have the following API compatibility guarantees: .. autofunction:: ray.util.annotations.PublicAPI :noindex: .. autofunction:: ray.util.annotations.DeveloperAPI :noindex: .. autofunction:: ray.rllib.utils.annotations.OldAPIStack :noindex: Benchmarks ========== A number of training run results are available in the `rl-experiments repo `__, and there is also a list of working hyperparameter configurations in `tuned_examples `__, sorted by algorithm. Benchmark results are extremely valuable to the community, so if you happen to have results that may be of interest, consider making a pull request to either repo. Debugging RLlib =============== Finding Memory Leaks In Workers ------------------------------- Keeping the memory usage of long running workers stable can be challenging. The ``MemoryTrackingCallbacks`` class can be used to track memory usage of workers. .. autoclass:: ray.rllib.callbacks.callbacks.MemoryTrackingCallbacks The objects with the top 20 memory usage in the workers are added as custom metrics. These can then be monitored using tensorboard or other metrics integrations like Weights & Biases: .. image:: images/MemoryTrackingCallbacks.png Troubleshooting --------------- If you encounter errors like `blas_thread_init: pthread_create: Resource temporarily unavailable` when using many workers, try setting ``OMP_NUM_THREADS=1``. Similarly, check configured system limits with `ulimit -a` for other resource limit errors. For debugging unexpected hangs or performance problems, you can run ``ray stack`` to dump the stack traces of all Ray workers on the current node, ``ray timeline`` to dump a timeline visualization of tasks to a file, and ``ray memory`` to list all object references in the cluster. --- .. include:: /_includes/rllib/we_are_hiring.rst .. _rllib-environments-doc: Environments ============ .. toctree:: :hidden: multi-agent-envs hierarchical-envs external-envs .. include:: /_includes/rllib/new_api_stack.rst .. grid:: 1 2 3 4 :gutter: 1 :class-container: container pb-3 .. grid-item-card:: :img-top: /rllib/images/envs/single_agent_env_logo.svg :class-img-top: pt-2 w-75 d-block mx-auto fixed-height-img .. button-ref:: rllib-single-agent-env-doc Single-Agent Environments (this page) .. 
grid-item-card:: :img-top: /rllib/images/envs/multi_agent_env_logo.svg :class-img-top: pt-2 w-75 d-block mx-auto fixed-height-img .. button-ref:: rllib-multi-agent-environments-doc Multi-Agent Environments .. grid-item-card:: :img-top: /rllib/images/envs/external_env_logo.svg :class-img-top: pt-2 w-75 d-block mx-auto fixed-height-img .. button-ref:: rllib-external-env-setups-doc External Environments and Applications .. grid-item-card:: :img-top: /rllib/images/envs/hierarchical_env_logo.svg :class-img-top: pt-2 w-75 d-block mx-auto fixed-height-img .. button-ref:: rllib-hierarchical-environments-doc Hierarchical Environments .. _rllib-single-agent-env-doc: In online reinforcement learning (RL), an algorithm trains a policy neural network by collecting data on-the-fly using an RL environment or simulator. The agent navigates within the environment choosing actions governed by this policy and collecting the environment's observations and rewards. The goal of the algorithm is to train the policy on the collected data such that the policy's action choices eventually maximize the cumulative reward over the agent's lifetime. .. figure:: images/envs/single_agent_setup.svg :width: 600 :align: left **Single-agent setup:** One agent lives in the environment and takes actions computed by a single policy. The mapping from agent to policy is fixed ("default_agent" maps to "default_policy"). See :ref:`Multi-Agent Environments ` for how this setup generalizes in the multi-agent case. .. _gymnasium: Farama Gymnasium ---------------- RLlib relies on `Farama's Gymnasium API `__ as its main RL environment interface for **single-agent** training (:ref:`see here for multi-agent `). To implement custom logic with `gymnasium` and integrate it into an RLlib config, see this `SimpleCorridor example `__. .. tip:: Not all action spaces are compatible with all RLlib algorithms. See the `algorithm overview `__ for details. In particular, pay attention to which algorithms support discrete and which support continuous action spaces or both. For more details on building a custom `Farama Gymnasium `__ environment, see the `gymnasium.Env class definition `__. For **multi-agent** training, see :ref:`RLlib's multi-agent API and supported third-party APIs `. .. _configuring-environments: Configuring Environments ------------------------ To specify which RL environment to train against, you can provide either a string name or a Python class that has to subclass `gymnasium.Env `__. Specifying by String ~~~~~~~~~~~~~~~~~~~~ RLlib interprets string values as `registered gymnasium environment names `__ by default. For example: .. testcode:: from ray.rllib.algorithms.ppo import PPOConfig config = ( PPOConfig() # Configure the RL environment to use as a string (by name), which # is registered with Farama's gymnasium. .environment("Acrobot-v1") ) algo = config.build() print(algo.train()) .. testcode:: :hide: algo.stop() .. tip:: For all supported environment names registered with Farama, refer to these resources (by env category): * `Toy Text `__ * `Classic Control `__ * `Atari `__ * `MuJoCo `__ * `Box2D `__ Specifying by Subclass of gymnasium.Env ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ If you're using a custom subclass of `gymnasium.Env class `__, you can pass the class itself rather than a registered string. Your subclass must accept a single ``config`` argument in its constructor (which may default to `None`). For example: .. 
testcode:: import gymnasium as gym import numpy as np from ray.rllib.algorithms.ppo import PPOConfig class MyDummyEnv(gym.Env): # Write the constructor and provide a single `config` arg, # which may be set to None by default. def __init__(self, config=None): # As per gymnasium standard, provide observation and action spaces in your # constructor. self.observation_space = gym.spaces.Box(-1.0, 1.0, (1,), np.float32) self.action_space = gym.spaces.Discrete(2) def reset(self, seed=None, options=None): # Return (reset) observation and info dict. return np.array([1.0]), {} def step(self, action): # Return next observation, reward, terminated, truncated, and info dict. return np.array([1.0]), 1.0, False, False, {} config = ( PPOConfig() .environment( MyDummyEnv, env_config={}, # `config` to pass to your env class ) ) algo = config.build() print(algo.train()) .. testcode:: :hide: algo.stop() Specifying by Tune-Registered Lambda ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ A third option for providing environment information to your config is to register an environment creator function (or lambda) with Ray Tune. The creator function must take a single ``config`` parameter and return a single non-vectorized `gymnasium.Env `__ instance. For example: .. testcode:: from ray.tune.registry import register_env def env_creator(config): return MyDummyEnv(config) # Return a gymnasium.Env instance. register_env("my_env", env_creator) config = ( PPOConfig() .environment("my_env") # <- Tune registered string pointing to your custom env creator. ) algo = config.build() print(algo.train()) .. testcode:: :hide: algo.stop() For a complete example using a custom environment, see the `custom_gym_env.py example script `__. .. warning:: Due to Ray's distributed nature, gymnasium's own registry is incompatible with Ray. Always use the registration method documented here to ensure remote Ray actors can access your custom environments. In the preceding example, the ``env_creator`` function takes a ``config`` argument. This config is primarily a dictionary containing required settings. However, you can also access additional properties within the ``config`` variable. For example, use ``config.worker_index`` to get the remote EnvRunner index or ``config.num_workers`` for the total number of EnvRunners used. This approach can help customize environments within an ensemble and make environments running on some EnvRunners behave differently from those running on other EnvRunners. For example: .. code-block:: python class EnvDependingOnWorkerAndVectorIndex(gym.Env): def __init__(self, config): # Pick the actual env based on the worker and vector indexes # (`choose_env_for` is a user-defined helper, not part of RLlib). self.env = gym.make( choose_env_for(config.worker_index, config.vector_index) ) self.action_space = self.env.action_space self.observation_space = self.env.observation_space def reset(self, seed=None, options=None): return self.env.reset(seed=seed, options=options) def step(self, action): return self.env.step(action) register_env("multi_env", lambda config: EnvDependingOnWorkerAndVectorIndex(config)) .. tip:: When using logging within an environment, the configuration must be done inside the environment (running within Ray workers). Pre-Ray logging configurations will be ignored. Use the following code to connect to Ray's logging instance: .. testcode:: import logging logger = logging.getLogger("ray.rllib") Performance and Scaling ----------------------- ..
figure:: images/envs/env_runners.svg :width: 600 :align: left **EnvRunner with gym.Env setup:** Environments in RLlib are located within the :py:class:`~ray.rllib.envs.env_runner.EnvRunner` actors, whose number (`n`) you can scale through the `config.env_runners(num_env_runners=..)` setting. Each :py:class:`~ray.rllib.envs.env_runner.EnvRunner` actor can hold more than one `gymnasium `__ environment (vectorized). You can set the number of individual environment copies per EnvRunner through `config.env_runners(num_envs_per_env_runner=..)`. There are two methods to scale sample collection with RLlib and `gymnasium `__ environments. You can use both in combination. 1. **Distribute across multiple processes:** RLlib creates multiple :py:class:`~ray.rllib.envs.env_runner.EnvRunner` instances, each a Ray actor, for experience collection, controlled through your :py:class:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig`: ``config.env_runners(num_env_runners=..)``. #. **Vectorization within a single process:** Many environments achieve high frame rates per core but are limited by policy inference latency. To address this limitation, create multiple environments per process to batch the policy forward pass across these vectorized environments. Set ``config.env_runners(num_envs_per_env_runner=..)`` to create more than one environment copy per :py:class:`~ray.rllib.envs.env_runner.EnvRunner` actor. Additionally, you can make the individual sub-environments within a vector independent processes through Python's multiprocessing used by gymnasium. Set `config.env_runners(remote_worker_envs=True)` to create individual subenvironments as separate processes and step them in parallel. .. note:: Multi-agent setups aren't vectorizable yet. The Ray team is working on a solution for this restriction by using the `gymnasium >= 1.x` custom vectorization feature. .. tip:: See the :ref:`scaling guide ` for more on RLlib training at scale. Expensive Environments ~~~~~~~~~~~~~~~~~~~~~~ Some environments may require substantial resources to initialize and run. If your environments require more than 1 CPU per :py:class:`~ray.rllib.envs.env_runner.EnvRunner`, you can provide more resources for each actor by setting the following config options: ``config.env_runners(num_cpus_per_env_runner=.., num_gpus_per_env_runner=..)`` --- .. include:: /_includes/rllib/we_are_hiring.rst .. _rllib-examples-overview-docs: Examples ======== .. include:: /_includes/rllib/new_api_stack.rst This page contains an index of all the python scripts in the `examples folder `__ of RLlib, demonstrating the different use cases and features of the library. .. note:: RLlib is currently in a transition state from old- to new API stack. The Ray team has translated most of the example scripts to the new stack and tag those still on the old stack with this comment line on top: ``# @OldAPIStack``. The moving of all example scripts over to the new stack is work in progress. .. note:: If you find any new API stack example broken, or if you'd like to add an example to this page, create an issue in the `RLlib GitHub repository `__. Folder structure ---------------- The `examples folder `__ has several sub-directories described in detail below. How to run an example script ---------------------------- Most of the example scripts are self-executable, meaning you can ``cd`` into the respective directory and run the script as-is with python: .. 
code-block:: bash $ cd ray/rllib/examples/multi_agent $ python multi_agent_pendulum.py --num-agents=2 Use the `--help` command line argument to have each script print out its supported command line options. Most of the scripts share a common subset of generally applicable command line arguments, for example `--num-env-runners`, to scale the number of EnvRunner actors, `--no-tune`, to switch off running with Ray Tune, `--wandb-key`, to log to WandB, or `--verbose`, to control log chattiness. All example sub-folders ----------------------- Actions +++++++ .. _rllib-examples-overview-autoregressive-actions: - `Auto-regressive actions `__: Configures an RL module that generates actions in an autoregressive manner, where the second component of an action depends on the previously sampled first component of the same action. - `Custom action distribution class `__: Demonstrates how to write a custom action distribution class, taking an additional temperature parameter on top of a Categorical distribution, and how to configure this class inside your :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` implementation. Further explains how to define different such classes for the different forward methods of your :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` in case you need more granularity. - `Nested Action Spaces `__: Sets up an environment with nested action spaces using custom single- or multi-agent configurations. This example demonstrates how RLlib manages complex action structures, such as multi-dimensional or hierarchical action spaces. Algorithms ++++++++++ - `Custom implementation of the Model-Agnostic Meta-Learning (MAML) algorithm `__: Shows how to stably train a model in an "infinite-task" environment, where each task corresponds to a sinusoidal function with randomly sampled amplitude and phase. Because each new task introduces a shift in data distribution, traditional learning algorithms would fail to generalize. - `Custom "vanilla policy gradient" (VPG) algorithm `__: Shows how to write a very simple policy gradient :py:class:`~ray.rllib.algorithms.algorithm.Algorithm` from scratch, including a matching :py:class:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig`, a matching :py:class:`~ray.rllib.core.learner.learner.Learner` which defines the loss function, and the Algorithm's :py:meth:`~ray.rllib.algorithms.algorithm.Algorithm.training_step` implementation. - `Custom algorithm with a global, shared data actor for sending manipulated rewards from EnvRunners to Learners `__: Shows how to write a custom shared data actor accessible from any of the Algorithm's other actors, like :py:class:`~ray.rllib.env.env_runner.EnvRunner` and :py:class:`~ray.rllib.core.learner.learner.Learner` actors. The new actor stores manipulated rewards from sampled episodes under unique, per-episode keys and then serves this information to the :py:class:`~ray.rllib.core.learner.learner.Learner` for adding these rewards to the train batch. Checkpoints +++++++++++ - `Checkpoint by custom criteria `__: Shows how to create checkpoints based on custom criteria, giving users control over when to save model snapshots during training. - `Continue training from checkpoint `__: Illustrates resuming training from a saved checkpoint, useful for extending training sessions or recovering from interruptions. 
- `Restore 1 out of N agents from checkpoint `__: Restores one specific agent from a multi-agent checkpoint, allowing selective loading for environments where only certain agents need to resume training. Connectors ++++++++++ .. note:: RLlib's Connector API has been re-written from scratch for the new API stack. Connector-pieces and -pipelines are now referred to as :py:class:`~ray.rllib.connectors.connector_v2.ConnectorV2` to distinguish against the ``Connector`` class, which only continue to work on the old API stack. - `Flatten and one-hot observations `__: Demonstrates how to one-hot discrete observation spaces and/or flatten complex observations, Dict or Tuple, allowing RLlib to process arbitrary observation data as flattened 1D vectors. Useful for environments with complex, discrete, or hierarchical observations. - `Observation frame-stacking `__: Implements frame stacking, where N consecutive frames stack together to provide temporal context to the agent. This technique is common in environments with continuous state changes, like video frames in Atari games. Using connectors for frame stacking is more efficient as it avoids having to send large observation tensors through ray remote calls. - `Mean/Std filtering `__: Adds mean and standard deviation normalization for observations, shifting by the mean and dividing by std-dev. This type of filtering can improve learning stability in environments with highly variable state magnitudes by scaling observations to a normalized range. - `Multi-agent observation preprocessor enhancing non-Markovian observations to Markovian ones `__: A multi-agent preprocessor enhances the per-agent observations of a multi-agent env, which by themselves are non-Markovian, partial observations and converts them into Markovian observations by adding information from the respective other agent. A policy can only be trained optimally through this additional information. - `Prev-actions, prev-rewards connector `__: Augments observations with previous actions and rewards, giving the agent a short-term memory of past events, which can improve decision-making in partially observable or sequentially dependent tasks. - `Single-agent observation preprocessor `__: A connector alters the CartPole-v1 environment observations from the Markovian 4-tuple (x-pos, angular-pos, x-velocity, angular-velocity) to a non-Markovian, simpler 2-tuple (only x-pos and angular-pos). The resulting problem can only be solved through a memory/stateful model, for example an LSTM. Curiosity +++++++++ - `Count-based curiosity `__: Implements count-based intrinsic motivation to encourage exploration of less visited states. Using curiosity is beneficial in sparse-reward environments where agents may struggle to find rewarding paths. However, count-based methods are only feasible for environments with small observation spaces. - `Euclidean distance-based curiosity `__: Uses Euclidean distance between states and the initial state to measure novelty, encouraging exploration by rewarding the agent for reaching "far away" regions of the environment. Suitable for sparse-reward tasks, where diverse exploration is key to success. - `Intrinsic-curiosity-model (ICM) Based Curiosity `__: Adds an `Intrinsic Curiosity Model (ICM) `__ that learns to predict the next state as well as the action in between two states to measure novelty. The higher the loss of the ICM, the higher the "novelty" and thus the intrinsic reward. 
Ideal for complex environments with large observation spaces where reward signals are sparse. Curriculum learning +++++++++++++++++++ - `Curriculum learning `__: Demonstrates curriculum learning, where the environment difficulty increases as the agent improves. This approach enables gradual learning, allowing agents to master simpler tasks before progressing to more challenging ones, ideal for environments with hierarchical or staged difficulties. Also see the :doc:`curriculum learning how-to ` from the documentation. - `Curriculum learning for Atari Pong `__: Demonstrates curriculum learning for Atari Pong using `frameskip` to increase the difficulty of the task. This approach enables gradual learning, allowing agents to master slower reactions (lower `frameskip`) before progressing to faster ones (higher `frameskip`). Also see the :doc:`curriculum learning how-to ` from the documentation. Debugging +++++++++ - `Deterministic sampling and training `__: Demonstrates how to seed an experiment through the algorithm config. RLlib passes the seed through to all components that have a copy of the :ref:`RL environment ` and the :ref:`RLModule ` and thus makes sure these components behave deterministically. When using a seed, train results should become repeatable. Note that some algorithms, such as :ref:`APPO `, which rely on asynchronous sampling in combination with Ray network communication, always behave stochastically, regardless of whether you set a seed. Environments ++++++++++++ - `Async gym vectorization, parallelizing sub-environments `__: Shows how the `gym_env_vectorize_mode` config setting can significantly speed up your :py:class:`~ray.rllib.env.env_runner.EnvRunner` actors, if your RL environment is slow and you're using `num_envs_per_env_runner > 1`. The reason for the performance gain is that each sub-environment runs in its own process. - `Custom env rendering method `__: Demonstrates how to add a custom `render()` method to a (custom) environment, allowing visualizations of agent interactions. - `Custom gymnasium env `__: Implements a custom `gymnasium `__ environment from scratch, showing how to define observation and action spaces, arbitrary reward functions, as well as step- and reset logic. - `Env connecting to RLlib through a TCP client `__: An external environment, running outside of RLlib and acting as a client, connects to RLlib as a server. The external env performs its own action inference using an ONNX model, sends collected data back to RLlib for training, and receives model updates from time to time from RLlib. - `Env rendering and recording `__: Illustrates environment rendering and recording setups within RLlib, capturing visual outputs for later review (for example, on WandB), which is essential for tracking agent behavior in training. - `Env with protobuf observations `__: Uses Protobuf for observations, demonstrating an advanced way of handling serialized data in environments. This approach is useful for integrating complex external data sources as observations. Evaluation ++++++++++ - `Custom evaluation `__: Configures custom evaluation metrics for agent performance, allowing users to define specific success criteria beyond standard RLlib evaluation metrics. - `Evaluation parallel to training `__: Runs evaluation episodes in parallel with training, reducing training time by offloading evaluation to separate processes. This method is beneficial when you require frequent evaluation without interrupting learning. See the configuration sketch after this list for an illustration of such a setup.
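As referenced in the evaluation examples above, the following is a minimal sketch of what a parallel-evaluation setup could look like through ``AlgorithmConfig.evaluation()``. The values are illustrative only; refer to the linked example script for a tested configuration.

.. code-block:: python

    from ray.rllib.algorithms.ppo import PPOConfig

    config = (
        PPOConfig()
        .environment("CartPole-v1")
        .evaluation(
            # Evaluate after every training iteration ...
            evaluation_interval=1,
            # ... on dedicated evaluation EnvRunner actors ...
            evaluation_num_env_runners=1,
            # ... running in parallel to the ongoing training iteration.
            evaluation_parallel_to_training=True,
            # Each evaluation run covers 10 episodes.
            evaluation_duration=10,
            evaluation_duration_unit="episodes",
        )
    )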
Fault tolerance +++++++++++++++ - `Crashing and stalling env `__: Simulates an environment that randomly crashes or stalls, allowing users to test RLlib's fault-tolerance mechanisms. This script is useful for evaluating how RLlib handles interruptions and recovers from unexpected failures during training. GPUs for training and sampling ++++++++++++++++++++++++++++++ - `Float16 training and inference `__: Configures a setup for float16 training and inference, optimizing performance by reducing memory usage and speeding up computation. This is especially useful for large-scale models on compatible GPUs. - `Fractional GPUs per Learner `__: Demonstrates allocating fractional GPUs to individual learners, enabling finer resource allocation in multi-model setups. Useful for saving resources when training smaller models, many of which can fit on a single GPU. - `Mixed precision training and float16 inference `__: Uses mixed precision, float32 and float16, for training, while switching to float16 precision for inference, balancing stability during training with performance improvements during evaluation. - `Using GPUs on EnvRunners `__: Demos how :py:class:`~ray.rllib.env.env_runner.EnvRunner` instances, single- or multi-agent, can request GPUs through the `config.env_runners(num_gpus_per_env_runner=..)` setting. Hierarchical training +++++++++++++++++++++ - `Hierarchical RL training `__: Showcases a hierarchical RL setup inspired by automatic subgoal discovery and subpolicy specialization. A high-level policy selects subgoals and assigns one of three specialized low-level policies to achieve them within a time limit, encouraging specialization and efficient task-solving. The agent has to navigate a complex grid-world environment. The example highlights the advantages of hierarchical learning over flat approaches by demonstrating significantly improved learning performance in challenging, goal-oriented tasks. Inference of models or policies +++++++++++++++++++++++++++++++ - `Policy inference after training `__: Demonstrates performing inference using a checkpointed :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` or an `ONNX runtime `__. First trains the :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule`, creates a checkpoint, then re-loads the module from this checkpoint or ONNX file, and computes actions in a simulated environment. - `Policy inference after training, with ConnectorV2 `__: Runs inference with a trained, LSTM-based :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` or an `ONNX runtime `__. Two connector pipelines, env-to-module and module-to-env, preprocess observations and LSTM-states and postprocess model outputs into actions, allowing for very modular and flexible inference setups. Learners ++++++++ - `Custom loss function, simple `__: Implements a custom loss function for training, demonstrating how users can define tailored loss objectives for specific environments or behaviors. - `Custom torch learning rate schedulers `__: Adds learning rate scheduling to PPO, showing how to adjust the learning rate dynamically using PyTorch schedulers for improved training stability. - `Separate learning rate and optimizer for value function `__: Configures a separate learning rate and a separate optimizer for the value function vs the policy network, enabling differentiated training dynamics between policy and value estimation in RL algorithms. 
Metrics +++++++ - `Logging custom metrics in Algorithm.training_step `__: Shows how to log custom metrics inside a custom :py:class:`~ray.rllib.algorithms.algorithm.Algorithm` through overriding the :py:meth:`` method and making calls to the :py:meth:`~ray.rllib.utils.metrics.metrics_logger.MetricsLogger.log_value` method of the :py:class:`~ray.rllib.utils.metrics.metrics_logger.MetricsLogger` instance. - `Logging custom metrics in EnvRunners `__: Demonstrates adding custom metrics to :py:class:`~ray.rllib.env.env_runner.EnvRunner` actors, providing a way to track specific performance- and environment indicators beyond the standard RLlib metrics. Multi-agent RL ++++++++++++++ - `Custom heuristic policy `__: Demonstrates running a hybrid policy setup within the `MultiAgentCartPole` environment, where one agent follows a hand-coded random policy while another agent trains with PPO. This example highlights integrating static and dynamic policies, suitable for environments with a mix of fixed-strategy and adaptive agents. - `Different observation- and action spaces for different agents `__: Configures agents with differing observation and action spaces within the same environment, showcasing RLlib's support for heterogeneous agents with varying space requirements in a single multi-agent environment. Another example, which also makes use of connectors, and that covers the same topic, agents having different spaces, can be found `here `__. - `Grouped agents, two-step game `__: Implements a multi-agent, grouped setup within a two-step game environment from the `QMIX paper `__. N agents form M teams in total, where N >= M, and agents in each team share rewards and one policy. This example demonstrates RLlib's ability to manage collective objectives and interactions among grouped agents. - `Multi-agent CartPole `__: Runs a multi-agent version of the CartPole environment with each agent independently learning to balance its pole. This example serves as a foundational test for multi-agent reinforcement learning scenarios in simple, independent tasks. - `Multi-agent Pendulum `__: Extends the classic Pendulum environment into a multi-agent setting, where multiple agents attempt to balance their respective pendulums. This example highlights RLlib's support for environments with replicated dynamics but distinct agent policies. - `PettingZoo independent learning `__: Integrates RLlib with `PettingZoo `__ to facilitate independent learning among multiple agents. Each agent independently optimizes its policy within a shared environment. - `PettingZoo parameter sharing `__: Uses `PettingZoo `__ for an environment where all agents share a single policy. - `PettingZoo shared value function `__: Also using PettingZoo, this example explores shared value functions among agents. It demonstrates collaborative learning scenarios where agents collectively estimate a value function rather than individual policies. - `Rock-paper-scissors heuristic vs learned `__: Simulates a rock-paper-scissors game with one heuristic-driven agent and one learning agent. It provides insights into performance when combining fixed and adaptive strategies in adversarial games. - `Rock-paper-scissors learned vs learned `__: Sets up a rock-paper-scissors game where you train both agents to learn strategies on how to play against each other. Useful for evaluating performance in simple adversarial settings. 
- `Self-play, league-based, with OpenSpiel `__: Uses OpenSpiel to demonstrate league-based self-play, where agents play against various versions of themselves, frozen or in-training, to improve through competitive interaction. - `Self-play with Footsies and PPO algorithm `__: Implements self-play with the Footsies environment (two player zero-sum game). This example highlights RLlib's capabilities in connecting to the external binaries running the game engine, as well as setting up a multi-agent self-play training scenario. - `Self-play with OpenSpiel `__: Similar to the league-based self-play, but simpler. This script leverages OpenSpiel for two-player games, allowing agents to improve through direct self-play without building a complex, structured league. Offline RL ++++++++++ - `Train with behavioral cloning (BC), Finetune with PPO `__: Combines behavioral cloning pre-training with PPO fine-tuning, providing a two-phase training strategy. Offline imitation learning as a first step followed by online reinforcement learning. Ray Serve and RLlib +++++++++++++++++++ - `Using Ray Serve with RLlib `__: Integrates RLlib with `Ray Serve `__, showcasing how to deploy trained :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` instances as RESTful services. This setup is ideal for deploying models in production environments with API-based interactions. Ray Tune and RLlib ++++++++++++++++++ - `Custom experiment `__: Configures a custom experiment with `Ray Tune `__, demonstrating advanced options for custom training- and evaluation phases - `Custom logger `__: Shows how to implement a custom logger within `Ray Tune `__, allowing users to define specific logging behaviors and outputs during training. - `Custom progress reporter `__: Demonstrates a custom progress reporter in `Ray Tune `__, which enables tracking and displaying specific training metrics or status updates in a customized format. RLModules +++++++++ - `Action masking `__: Implements an :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` with action masking, where certain disallowed actions are masked based on parts of the observation dict, useful for environments with conditional action availability. - `Auto-regressive actions `__: :ref:`See here for more details `. - `Custom CNN-based RLModule `__: Demonstrates a custom CNN architecture realized as an :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule`, enabling convolutional feature extraction tailored to the environment's visual observations. - `Custom LSTM-based RLModule `__: Uses a custom LSTM within an :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule`, allowing for temporal sequence processing, beneficial for partially observable environments with sequential dependencies. - `Migrate ModelV2 to RLModule by config `__: Shows how to migrate a ModelV2-based setup (old API stack) to the new API stack's :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule`, using an (old API stack) :py:class:`~ray.rllib.algorithm.algorithm_config.AlgorithmConfig` instance. - `Migrate ModelV2 to RLModule by Policy Checkpoint `__: Migrates a ModelV2 (old API stack) to the new API stack's :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` by directly loading a policy checkpoint, enabling smooth transitions to the new API stack while preserving learned parameters. 
- `Pretrain single-agent policy, then train in multi-agent Env `__: Demonstrates pretraining a single-agent model and transferring it to a multi-agent setting, useful for initializing multi-agent scenarios with pre-trained policies. .. _rllib-tuned-examples-docs: Tuned examples -------------- The `tuned examples `__ folder contains python config files that you can execute analogously to all other example scripts described here to run tuned learning experiments for the different algorithms and environment types. For example, see this `tuned Atari example for PPO `__, which learns to solve the Pong environment in roughly 5 minutes. You can run it as follows on a single g5.24xlarge or g6.24xlarge machine with 4 GPUs and 96 CPUs: .. code-block:: bash $ cd ray/rllib/tuned_examples/ppo $ python atari_ppo.py --env=ale_py:ALE/Pong-v5 --num-learners=4 --num-env-runners=95 Note that RLlib's daily or weekly release tests use some of the files in this folder as well. Community examples ------------------ .. note:: The community examples listed here all refer to the old API stack of RLlib. - `Arena AI `__: A General Evaluation Platform and Building Toolkit for Single/Multi-Agent Intelligence with RLlib-generated baselines. - `CARLA `__: Example of training autonomous vehicles with RLlib and `CARLA `__ simulator. - `The Emergence of Adversarial Communication in Multi-Agent Reinforcement Learning `__: Using Graph Neural Networks and RLlib to train multiple cooperative and adversarial agents to solve the "cover the area"-problem, thereby learning how to best communicate or - in the adversarial case - how to disturb communication (`code `__). - `Flatland `__: A dense traffic simulating environment with RLlib-generated baselines. - `GFootball `__: Example of setting up a multi-agent version of `GFootball `__ with RLlib. - `mobile-env `__: An open, minimalist Gymnasium environment for autonomous coordination in wireless mobile networks. Includes an example notebook using Ray RLlib for multi-agent RL with mobile-env. - `Neural MMO `__: A multiagent AI research environment inspired by Massively Multiplayer Online (MMO) role playing games – self-contained worlds featuring thousands of agents per persistent macrocosm, diverse skilling systems, local and global economies, complex emergent social structures, and ad-hoc high-stakes single and team based conflict. - `NeuroCuts `__: Example of building packet classification trees using RLlib / multi-agent in a bandit-like setting. - `NeuroVectorizer `__: Example of learning optimal LLVM vectorization compiler pragmas for loops in C and C++ codes using RLlib. - `Roboschool / SageMaker `__: Example of training robotic control policies in SageMaker with RLlib. - `Sequential Social Dilemma Games `__: Example of using the multi-agent API to model several `social dilemma games `__. - `Simple custom environment for single RL with Ray and RLlib `__: Create a custom environment and train a single agent RL using Ray 2.0 with Tune. - `StarCraft2 `__: Example of training in StarCraft2 maps with RLlib / multi-agent. - `Traffic Flow `__: Example of optimizing mixed-autonomy traffic simulations with RLlib / multi-agent. Blog posts ---------- .. note:: The blog posts listed here all refer to the old API stack of RLlib. - `Attention Nets and More with RLlib’s Trajectory View API `__: Blog describing RLlib's new "trajectory view API" and how it enables implementations of GTrXL attention net architectures. 
- `Lessons from Implementing 12 Deep RL Algorithms in TF and PyTorch `__: Discussion on how the Ray Team ported 12 of RLlib's algorithms from TensorFlow to PyTorch and the lessons learned. - `Scaling Multi-Agent Reinforcement Learning `__: Blog post with a brief tutorial on multi-agent RL and its design in RLlib. - `Functional RL with Keras and TensorFlow Eager `__: Exploration of a functional paradigm for implementing reinforcement learning (RL) algorithms. --- .. include:: /_includes/rllib/we_are_hiring.rst .. include:: /_includes/rllib/new_api_stack.rst Fault Tolerance And Elastic Training ==================================== RLlib handles common failure modes, such as machine failures, spot instance preemption, network outages, or Ray cluster failures. There are three main areas for RLlib fault tolerance support: * Worker recovery * Environment fault tolerance * Experiment-level fault tolerance with Ray Tune Worker Recovery --------------- RLlib supports self-recovering and elastic :py:class:`~ray.rllib.env.env_runner_group.EnvRunnerGroup` for both training and evaluation EnvRunner workers. This provides fault tolerance at the worker level. This means that if you have n :py:class:`~ray.rllib.env.env_runner.EnvRunner` workers sitting on different machines and a machine is preempted, RLlib can continue training and evaluation with minimal interruption. The two properties that RLlib supports here are self-recovery and elasticity: * **Elasticity**: RLlib continues training even when it removes an :py:class:`~ray.rllib.env.env_runner.EnvRunner`. For example, if an RLlib trial uses spot instances, Ray may remove nodes from the cluster, potentially resulting in Ray not scheduling a subset of workers. In this case, RLlib continues with whatever healthy :py:class:`~ray.rllib.env.env_runner.EnvRunner` instances remain, at a reduced speed. * **Self-Recovery**: When possible, RLlib attempts to restore any :py:class:`~ray.rllib.env.env_runner.EnvRunner` that it previously removed. During restoration, RLlib syncs the latest state over to the restored :py:class:`~ray.rllib.env.env_runner.EnvRunner` before sampling new episodes. You can turn on worker fault tolerance by setting ``config.fault_tolerance(restart_failed_env_runners=True)``. RLlib achieves this by utilizing a `state-aware and fault tolerant actor manager `__. Under the hood, RLlib relies on Ray Core :ref:`actor fault tolerance ` to automatically recover failed worker actors. Env Fault Tolerance ------------------- In addition to worker fault tolerance, RLlib offers fault tolerance at the environment level as well. Rollout or evaluation workers often run multiple environments in parallel to take advantage of, for example, the parallel computing power that GPUs offer. You can control this parallelism with the ``num_envs_per_env_runner`` config. It would then be wasteful if RLlib had to reconstruct the entire worker because of errors from a single environment. In that case, RLlib offers the capability to restart individual environments without bubbling the errors up to higher-level components. You can do that by turning on the ``restart_failed_sub_environments`` config. .. note:: Environment restarts are blocking. A rollout worker waits until the environment comes back and finishes initialization. So for on-policy algorithms, it may be better to recover at the worker level to make sure training progresses with an elastic worker set while RLlib reconstructs the environments.
More specifically, use the settings ``num_envs_per_env_runner=1``, ``restart_failed_sub_environments=False``, and ``restart_failed_env_runners=True``. Fault Tolerance and Recovery Provided by Ray Tune ------------------------------------------------- Ray Tune provides fault tolerance and recovery at the experiment trial level. When using Ray Tune with RLlib, you can enable :ref:`periodic checkpointing `, which saves the state of the experiment to a user-specified persistent storage location. If a trial fails, Ray Tune automatically restarts it from the latest :ref:`checkpointed ` state. Other Miscellaneous Considerations ---------------------------------- By default, RLlib runs health checks during initial worker construction. The whole job errors out if RLlib can't establish a completely healthy worker fleet at the start of a training run. If an environment is by nature flaky, you may want to turn off this feature by setting the ``validate_env_runners_after_construction`` config to ``False``. Lastly, in an extreme case where no healthy workers remain for training, RLlib waits a certain number of iterations for some of the workers to recover before the entire training job fails. You can configure the number of iterations RLlib waits with the config ``num_consecutive_env_runner_failures_tolerance``. .. TODO(jungong) : move fault tolerance related options into a separate AlgorithmConfig group and update the doc here. --- .. include:: /_includes/rllib/we_are_hiring.rst .. include:: /_includes/rllib/new_api_stack.rst .. |tensorflow| image:: images/tensorflow.png :class: inline-figure :width: 16 .. |pytorch| image:: images/pytorch.png :class: inline-figure :width: 16 .. _learner-guide: Learner (Alpha) =============== :py:class:`~ray.rllib.core.learner.learner.Learner` allows you to abstract the training logic of RLModules. It supports both gradient-based and non-gradient-based updates (for example, Polyak averaging). The API enables you to distribute the Learner using data-distributed parallel (DDP) training. The Learner achieves the following: (1) Facilitates gradient-based updates on :ref:`RLModule `. (2) Provides abstractions for non-gradient-based updates, such as Polyak averaging. (3) Reports training statistics. (4) Checkpoints the modules and optimizer states for durable training. The :py:class:`~ray.rllib.core.learner.learner.Learner` class supports data-distributed-parallel style training using the :py:class:`~ray.rllib.core.learner.learner_group.LearnerGroup` API. Under this paradigm, the :py:class:`~ray.rllib.core.learner.learner_group.LearnerGroup` maintains multiple copies of the same :py:class:`~ray.rllib.core.learner.learner.Learner` with identical parameters and hyperparameters. Each of these :py:class:`~ray.rllib.core.learner.learner.Learner` instances computes the loss and gradients on a shard of a sample batch and then accumulates the gradients across the :py:class:`~ray.rllib.core.learner.learner.Learner` instances. Learn more about data-distributed parallel learning in `this article `_. :py:class:`~ray.rllib.core.learner.learner_group.LearnerGroup` also allows for asynchronous training and (distributed) checkpointing for durability during training. Enabling Learner API in RLlib experiments ========================================= Adjust the amount of resources for training using the `num_gpus_per_learner`, `num_cpus_per_learner`, and `num_learners` arguments in the :py:class:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig`. ..
testcode:: :hide: from ray.rllib.algorithms.ppo.ppo import PPOConfig .. testcode:: config = ( PPOConfig() .learners( num_learners=0, # Set this to greater than 1 to allow for DDP-style updates. num_gpus_per_learner=0, # Set this to 1 to enable GPU training. num_cpus_per_learner=1, ) ) .. testcode:: :hide: config = config.environment("CartPole-v1") config.build() # test that the algorithm can be built with the given resources .. note:: This feature is in alpha. If you migrate to the new API stack, enable the feature via `AlgorithmConfig.api_stack(enable_rl_module_and_learner=True, enable_env_runner_and_connector_v2=True)`. The following algorithms support :py:class:`~ray.rllib.core.learner.learner.Learner` out of the box. Implement an algorithm with a custom :py:class:`~ray.rllib.core.learner.learner.Learner` to leverage this API for other algorithms. .. list-table:: :header-rows: 1 :widths: 60 60 * - Algorithm - Supported Framework * - **PPO** - |pytorch| |tensorflow| * - **IMPALA** - |pytorch| |tensorflow| * - **APPO** - |pytorch| |tensorflow| Basic usage =========== Use the :py:class:`~ray.rllib.core.learner.learner_group.LearnerGroup` utility to interact with multiple learners. Construction ------------ If you enable the :ref:`RLModule ` and :py:class:`~ray.rllib.core.learner.learner.Learner` APIs via the :py:class:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig`, then calling :py:meth:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig.build_algo` constructs a :py:class:`~ray.rllib.core.learner.learner_group.LearnerGroup` for you, but if you're using these APIs standalone, you can construct the :py:class:`~ray.rllib.core.learner.learner_group.LearnerGroup` as follows. .. testcode:: :hide: # imports for the examples import gymnasium as gym import numpy as np import ray from ray.rllib.algorithms.ppo import PPOConfig from ray.rllib.core.rl_module.rl_module import RLModuleSpec from ray.rllib.core.learner.learner_group import LearnerGroup .. tab-set:: .. tab-item:: Constructing a LearnerGroup .. testcode:: env = gym.make("CartPole-v1") # Create an AlgorithmConfig object from which we can build the # LearnerGroup. config = ( PPOConfig() # Number of Learner workers (Ray actors). # Use 0 for no actors, only create a local Learner. # Use >=1 to create n DDP-style Learner workers (Ray actors). .learners(num_learners=1) # Specify the learner's hyperparameters. .training( use_kl_loss=True, kl_coeff=0.01, kl_target=0.05, clip_param=0.2, vf_clip_param=0.2, entropy_coeff=0.05, vf_loss_coeff=0.5 ) ) # Construct a new LearnerGroup using our config object. learner_group = config.build_learner_group(env=env) .. tab-item:: Constructing a Learner .. testcode:: env = gym.make("CartPole-v1") # Create an AlgorithmConfig object from which we can build the # Learner. config = ( PPOConfig() # Specify the Learner's hyperparameters. .training( use_kl_loss=True, kl_coeff=0.01, kl_target=0.05, clip_param=0.2, vf_clip_param=0.2, entropy_coeff=0.05, vf_loss_coeff=0.5 ) ) # Construct a new Learner using our config object. learner = config.build_learner(env=env) # Needs to be called on the learner before calling any functions. learner.build() Updates ------- ..
testcode:: :hide: import time from ray.rllib.core import DEFAULT_MODULE_ID from ray.rllib.evaluation.postprocessing import Postprocessing from ray.rllib.policy.sample_batch import SampleBatch, MultiAgentBatch DUMMY_BATCH = { SampleBatch.OBS: np.array( [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8], [0.9, 1.0, 1.1, 1.2]], dtype=np.float32, ), SampleBatch.NEXT_OBS: np.array( [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8], [0.9, 1.0, 1.1, 1.2]], dtype=np.float32, ), SampleBatch.ACTIONS: np.array([0, 1, 1]), SampleBatch.PREV_ACTIONS: np.array([0, 1, 1]), SampleBatch.REWARDS: np.array([1.0, -1.0, 0.5], dtype=np.float32), SampleBatch.PREV_REWARDS: np.array([1.0, -1.0, 0.5], dtype=np.float32), SampleBatch.TERMINATEDS: np.array([False, False, True]), SampleBatch.TRUNCATEDS: np.array([False, False, False]), SampleBatch.VF_PREDS: np.array([0.5, 0.6, 0.7], dtype=np.float32), SampleBatch.ACTION_DIST_INPUTS: np.array( [[-2.0, 0.5], [-3.0, -0.3], [-0.1, 2.5]], dtype=np.float32 ), SampleBatch.ACTION_LOGP: np.array([-0.5, -0.1, -0.2], dtype=np.float32), SampleBatch.EPS_ID: np.array([0, 0, 0]), SampleBatch.AGENT_INDEX: np.array([0, 0, 0]), Postprocessing.ADVANTAGES: np.array([0.1, 0.2, 0.3], dtype=np.float32), Postprocessing.VALUE_TARGETS: np.array([0.5, 0.6, 0.7], dtype=np.float32), } default_batch = SampleBatch(DUMMY_BATCH) DUMMY_BATCH = default_batch.as_multi_agent() # Make sure, we convert the batch to the correct framework (here: torch). DUMMY_BATCH = learner._convert_batch_type(DUMMY_BATCH) .. tab-set:: .. tab-item:: Updating a LearnerGroup .. testcode:: TIMESTEPS = {"num_env_steps_sampled_lifetime": 250} # This is a blocking update. results = learner_group.update(batch=DUMMY_BATCH, timesteps=TIMESTEPS) # This is a non-blocking update. The results are returned in a future # call to `update(..., async_update=True)` _ = learner_group.update(batch=DUMMY_BATCH, async_update=True, timesteps=TIMESTEPS) # Artificially wait for async request to be done to get the results # in the next call to # `LearnerGroup.update(..., async_update=True)`. time.sleep(5) results = learner_group.update( batch=DUMMY_BATCH, async_update=True, timesteps=TIMESTEPS ) # `results` is a list of n result dicts from various Learner actors. assert isinstance(results, list), results assert isinstance(results[0], dict), results When updating a :py:class:`~ray.rllib.core.learner.learner_group.LearnerGroup` you can perform blocking or async updates on batches of data. Async updates are necessary for implementing async algorithms such as APPO/IMPALA. .. tab-item:: Updating a Learner .. testcode:: # This is a blocking update (given a training batch). result = learner.update(batch=DUMMY_BATCH, timesteps=TIMESTEPS) When updating a :py:class:`~ray.rllib.core.learner.learner.Learner` you can only perform blocking updates on batches of data. You can perform non-gradient based updates before or after the gradient-based ones by overriding :py:meth:`~ray.rllib.core.learner.learner.Learner.before_gradient_based_update` and :py:meth:`~ray.rllib.core.learner.learner.Learner.after_gradient_based_update`. Getting and setting state ------------------------- .. tab-set:: .. tab-item:: Getting and Setting State for a LearnerGroup .. testcode:: # Get the LearnerGroup's RLModule weights and optimizer states. state = learner_group.get_state() learner_group.set_state(state) # Only get the RLModule weights. 
weights = learner_group.get_weights() learner_group.set_weights(weights) Set or get the state dict of all learners through `LearnerGroup.set_state` or `LearnerGroup.get_state`. This includes the neural network weights and the optimizer states on each learner. For example, an Adam optimizer's state has momentum information based on recently computed gradients. If you only want to get or set the weights of the RLModules (neural networks) of all Learners, you can do so through the LearnerGroup APIs `LearnerGroup.get_weights` and `LearnerGroup.set_weights`. .. tab-item:: Getting and Setting State for a Learner .. testcode:: from ray.rllib.core import COMPONENT_RL_MODULE # Get the Learner's RLModule weights and optimizer states. state = learner.get_state() # Note that `state` is now a dict: # { # COMPONENT_RL_MODULE: [RLModule's state], # COMPONENT_OPTIMIZER: [Optimizer states], # } learner.set_state(state) # Only get the RLModule weights (as numpy, not torch/tf). rl_module_only_state = learner.get_state(components=COMPONENT_RL_MODULE) # Note that `rl_module_only_state` is now a dict: # {COMPONENT_RL_MODULE: [RLModule's state]} learner.module.set_state(rl_module_only_state) You can set and get the entire state of a :py:class:`~ray.rllib.core.learner.learner.Learner` using :py:meth:`~ray.rllib.core.learner.learner.Learner.set_state` and :py:meth:`~ray.rllib.core.learner.learner.Learner.get_state`. For getting only the RLModule's weights (without optimizer states), use the `components=COMPONENT_RL_MODULE` arg in :py:meth:`~ray.rllib.core.learner.learner.Learner.get_state` (see code above). For setting only the RLModule's weights (without touching the optimizer states), use :py:meth:`~ray.rllib.core.learner.learner.Learner.set_state` and pass in a dict: `{COMPONENT_RL_MODULE: [RLModule's state]}` (see code above). .. testcode:: :hide: import tempfile LEARNER_CKPT_DIR = tempfile.mkdtemp() LEARNER_GROUP_CKPT_DIR = tempfile.mkdtemp() Checkpointing ------------- .. tab-set:: .. tab-item:: Checkpointing a LearnerGroup .. testcode:: learner_group.save_to_path(LEARNER_GROUP_CKPT_DIR) learner_group.restore_from_path(LEARNER_GROUP_CKPT_DIR) Checkpoint the state of all learners in the :py:class:`~ray.rllib.core.learner.learner_group.LearnerGroup` through :py:meth:`~ray.rllib.core.learner.learner_group.LearnerGroup.save_to_path` and restore the state of a saved :py:class:`~ray.rllib.core.learner.learner_group.LearnerGroup` through :py:meth:`~ray.rllib.core.learner.learner_group.LearnerGroup.restore_from_path`. A LearnerGroup's state includes the neural network weights and all optimizer states. Note that since the state of all of the :py:class:`~ray.rllib.core.learner.learner.Learner` instances is identical, only the states from the first :py:class:`~ray.rllib.core.learner.learner.Learner` are saved. .. tab-item:: Checkpointing a Learner .. testcode:: learner.save_to_path(LEARNER_CKPT_DIR) learner.restore_from_path(LEARNER_CKPT_DIR) Checkpoint the state of a :py:class:`~ray.rllib.core.learner.learner.Learner` through :py:meth:`~ray.rllib.core.learner.learner.Learner.save_to_path` and restore the state of a saved :py:class:`~ray.rllib.core.learner.learner.Learner` through :py:meth:`~ray.rllib.core.learner.learner.Learner.restore_from_path`. A Learner's state includes the neural network weights and all optimizer states.
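For durable training, you typically pair checkpointing with a later restore. The following minimal sketch, reusing the ``config``, ``env``, and ``LEARNER_GROUP_CKPT_DIR`` objects from the preceding examples, shows how a freshly built :py:class:`~ray.rllib.core.learner.learner_group.LearnerGroup` can pick up the saved state:

.. code-block:: python

    # Build a brand-new LearnerGroup from the same config and restore the
    # previously saved neural network weights and optimizer states into it.
    fresh_learner_group = config.build_learner_group(env=env)
    fresh_learner_group.restore_from_path(LEARNER_GROUP_CKPT_DIR)

The restored group continues training from exactly the state that the original group saved.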
Implementation ============== :py:class:`~ray.rllib.core.learner.learner.Learner` has many APIs for flexible implementation; however, the core ones that you need to implement are: .. list-table:: :widths: 60 60 :header-rows: 1 * - Method - Description * - :py:meth:`~ray.rllib.core.learner.learner.Learner.configure_optimizers_for_module()` - Set up any optimizers for an RLModule. * - :py:meth:`~ray.rllib.core.learner.learner.Learner.compute_loss_for_module()` - Calculate the loss for a gradient-based update to a module. * - :py:meth:`~ray.rllib.core.learner.learner.Learner.before_gradient_based_update()` - Do any non-gradient-based updates to an RLModule before(!) the gradient-based ones, for example, add noise to your network. * - :py:meth:`~ray.rllib.core.learner.learner.Learner.after_gradient_based_update()` - Do any non-gradient-based updates to an RLModule after(!) the gradient-based ones, for example, update a loss coefficient based on some schedule. Starter Example --------------- A :py:class:`~ray.rllib.core.learner.learner.Learner` that implements behavior cloning could look like the following: .. testcode:: :hide: from typing import Any, Dict, DefaultDict import torch from ray.rllib.algorithms.algorithm_config import AlgorithmConfig from ray.rllib.core.learner.learner import Learner from ray.rllib.core.learner.torch.torch_learner import TorchLearner from ray.rllib.policy.sample_batch import SampleBatch from ray.rllib.utils.annotations import override from ray.rllib.utils.numpy import convert_to_numpy from ray.rllib.utils.typing import ModuleID, TensorType .. testcode:: class BCTorchLearner(TorchLearner): @override(Learner) def compute_loss_for_module( self, *, module_id: ModuleID, config: AlgorithmConfig = None, batch: Dict[str, Any], fwd_out: Dict[str, TensorType], ) -> TensorType: # Standard behavior cloning loss. action_dist_inputs = fwd_out[SampleBatch.ACTION_DIST_INPUTS] action_dist_class = self._module[module_id].get_train_action_dist_cls() action_dist = action_dist_class.from_logits(action_dist_inputs) loss = -torch.mean(action_dist.logp(batch[SampleBatch.ACTIONS])) return loss --- .. include:: /_includes/rllib/we_are_hiring.rst Working with offline data ========================= .. include:: /_includes/rllib/new_api_stack.rst RLlib's offline RL API enables you to work with experiences read from offline storage (for example, disk, cloud storage, streaming systems, or the Hadoop Distributed File System (HDFS)). For example, you might want to read experiences saved from previous training runs, collected from experts, or gathered from policies deployed in `web applications `__. You can also log new agent experiences produced during online training for future use. RLlib represents trajectory sequences (for example, ``(s, a, r, s', ...)`` tuples) with :py:class:`~ray.rllib.env.single_agent_episode.SingleAgentEpisode` objects (multi-agent offline training is currently not supported). Using this episode format allows for efficient encoding and compression of experiences, rewriting trajectories, and user-friendly data access through getter methods. During online training, RLlib uses :py:class:`~ray.rllib.env.single_agent_env_runner.SingleAgentEnvRunner` actors to generate episodes of experiences in parallel using the current policy. However, RLlib uses this same episode format for reading experiences from and writing experiences to offline storage (see :py:class:`~ray.rllib.offline.offline_env_runner.OfflineSingleAgentEnvRunner`).
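To get a feel for the episode format, the following minimal sketch builds a tiny :py:class:`~ray.rllib.env.single_agent_episode.SingleAgentEpisode` by hand (all values are made up for illustration) and reads its data back through the getter methods mentioned above:

.. code-block:: python

    import numpy as np
    from ray.rllib.env.single_agent_episode import SingleAgentEpisode

    # A hand-built, two-timestep episode: one more observation than
    # actions and rewards.
    episode = SingleAgentEpisode(
        observations=[
            np.array([0.1, 0.2, 0.3, 0.4], dtype=np.float32),
            np.array([0.2, 0.3, 0.4, 0.5], dtype=np.float32),
            np.array([0.3, 0.4, 0.5, 0.6], dtype=np.float32),
        ],
        actions=[0, 1],
        rewards=[1.0, 1.0],
        # No lookback buffer for this hand-built episode.
        len_lookback_buffer=0,
    )

    print(len(episode))                  # Number of env steps: 2.
    print(episode.get_observations(-1))  # Most recent observation.
    print(episode.get_actions())         # All actions taken so far.
    print(episode.get_rewards())         # All rewards received so far.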
You can store experiences either directly in RLlib's episode format or in table (columns) format. You should use the episode format when #. You need experiences grouped by their trajectory and ordered in time (for example, to train stateful modules). #. You want to use recorded experiences exclusively within RLlib (for example for offline RL or behavior cloning). On the contrary, you should prefer the table (columns) format, if #. You need to read the data easily with other data tools or ML libraries. .. note:: RLlib's new API stack incorporates principles that support standalone applications. Consequently, the :py:class:`~ray.rllib.env.single_agent_episode.SingleAgentEpisode` class is usable outside of an RLlib context. To enable faster access through external data tools (for example, for data transformations), it's recommended to use the table record format. Most importantly, RLlib's offline RL API builds on top of :ref:`Ray Data ` and therefore supports all of its read and write methods (for example :py:class:`~ray.data.read_parquet`, :py:class:`~ray.data.read_json`, etc.) with :py:class:`~ray.data.read_parquet` and :py:class:`~ray.data.Dataset.write_parquet` being the default read and write methods. A core design principle of the API is to apply as many data transformations as possible on-the-fly prior to engaging the learner, allowing the latter to focus exclusively on model updates. .. hint:: During the transition phase from old- to new API stack you can use the new offline RL API also with your :py:class:`~ray.rllib.policy.sample_batch.SampleBatch` data recorded with the old API stack. To enable this feature set ``config.offline_data(input_read_sample_batches=True)``. Example: Training an expert policy ---------------------------------- In this example you train a PPO agent on the ``CartPole-v1`` environment until it reaches an episode mean return of ``450.0``. You checkpoint this agent and then use its policy to record expert data to local disk. .. testsetup:: # Define a shared variable to store the path to the # best checkpoint. best_checkpoint = None # Define a shared variable to store the path to the # recorded data. data_path = None # Define another shared variable to store the path to # the tabular recording data. tabular_data_path = None .. code-block:: from ray.rllib.algorithms.ppo import PPOConfig from ray.rllib.core.rl_module.default_model_config import DefaultModelConfig from ray.rllib.utils.metrics import ( ENV_RUNNER_RESULTS, EVALUATION_RESULTS, EPISODE_RETURN_MEAN, ) from ray import tune # Configure the PPO algorithm. config = ( PPOConfig() .environment("CartPole-v1") .training( lr=0.0003, # Run 6 SGD minibatch iterations on a batch. num_epochs=6, # Weigh the value function loss smaller than # the policy loss. vf_loss_coeff=0.01, ) .rl_module( model_config=DefaultModelConfig( fcnet_hiddens=[32], fcnet_activation="linear", # Share encoder layers between value network # and policy. vf_share_layers=True, ), ) ) # Define the metric to use for stopping. metric = f"{EVALUATION_RESULTS}/{ENV_RUNNER_RESULTS}/{EPISODE_RETURN_MEAN}" # Define the Tuner. tuner = tune.Tuner( "PPO", param_space=config, run_config=tune.RunConfig( stop={ metric: 450.0, }, name="docs_rllib_offline_pretrain_ppo", verbose=2, checkpoint_config=tune.CheckpointConfig( checkpoint_frequency=1, checkpoint_at_end=True, ), ), ) results = tuner.fit() # Store the best checkpoint to use it later for recording # an expert policy. 
best_checkpoint = ( results .get_best_result( metric=metric, mode="max" ) .checkpoint.path ) In this example, you saved a checkpoint from an agent that has become an expert at playing ``CartPole-v1``. You use this checkpoint in the next example to record expert data to disk, which is later utilized for offline training to clone another agent. Example: Record expert data to local disk ----------------------------------------- After you train an expert policy to play `CartPole-v1`, you load its policy here to record expert data during evaluation. You use ``5`` :py:class:`~ray.rllib.offline.offline_env_runner.OfflineSingleAgentEnvRunner` instances to collect ``50`` complete episodes per `sample()` call. In this example you store experiences directly in RLlib's :py:class:`~ray.rllib.env.single_agent_episode.SingleAgentEpisode` objects with no more than ``25`` episode objects per Parquet file. Altogether you run 10 evaluation runs, which should result in ``500`` recorded episodes from the expert policy. You use this data in the next example to train a new policy through Offline RL that should reach a return of ``450.0`` when playing ``CartPole-v1``. .. code-block:: python from ray.rllib.algorithms.ppo import PPOConfig from ray.rllib.core import ( COMPONENT_LEARNER_GROUP, COMPONENT_LEARNER, COMPONENT_RL_MODULE, DEFAULT_MODULE_ID, ) from ray.rllib.core.rl_module import RLModuleSpec from ray.rllib.core.rl_module.default_model_config import DefaultModelConfig # Store recording data under the following path. data_path = "/tmp/docs_rllib_offline_recording" # Configure the algorithm for recording. config = ( PPOConfig() # The environment needs to be specified. .environment( env="CartPole-v1", ) # Make sure to sample complete episodes because # you want to record RLlib's episode objects. .env_runners( batch_mode="complete_episodes", ) # Set up 5 evaluation `EnvRunners` for recording. # Sample 50 episodes in each evaluation rollout. .evaluation( evaluation_num_env_runners=5, evaluation_duration=50, evaluation_duration_unit="episodes", ) # Use the checkpointed expert policy from the preceding PPO training. # Note, we have to use the same `model_config` as # the one with which the expert policy was trained, otherwise # the module state can't be loaded. .rl_module( model_config=DefaultModelConfig( fcnet_hiddens=[32], fcnet_activation="linear", # Share encoder layers between value network # and policy. vf_share_layers=True, ), ) # Define the output path and format. In this example you # want to store data directly in RLlib's episode objects. # Each Parquet file should hold no more than 25 episodes. .offline_data( output=data_path, output_write_episodes=True, output_max_rows_per_file=25, ) ) # Build the algorithm. algo = config.build() # Now load the PPO-trained `RLModule` to use in recording. algo.restore_from_path( best_checkpoint, # Load only the `RLModule` component here. component=COMPONENT_RL_MODULE, ) # Run 10 evaluation iterations and record the data. for i in range(10): print(f"Iteration {i + 1}") eval_results = algo.evaluate() print(eval_results) # Stop the algorithm. Note, this is important when defining # `output_max_rows_per_file`. Otherwise, remaining episodes in the # `EnvRunner` buffers aren't written to disk. algo.stop() .. note:: RLlib stores the episode data in a ``binary`` format. Each episode is converted into its dictionary representation and serialized using ``msgpack-numpy``, ensuring version compatibility.
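To double-check the recording before training on it, you can read the output folder back in with Ray Data. This is a quick sketch; the episodes are stored as serialized blobs, so it only verifies row counts and the schema:

.. code-block:: python

    from ray import data

    # Read the recording folder back in (the folder is read recursively).
    ds = data.read_parquet(data_path)

    print(ds.count())   # Expect roughly 500 rows, one serialized episode each.
    print(ds.schema())  # msgpack-numpy encoded episode blobs.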
RLlib's recording process is efficient because it utilizes multiple :py:class:`~ray.rllib.offline.offline_env_runner.OfflineSingleAgentEnvRunner` instances during evaluation, enabling parallel data writing. You can explore the folder to review the stored Parquet data: .. code-block:: text $ ls -la /tmp/docs_rllib_offline_recording/cartpole-v1 drwxr-xr-x. 22 user user 440 21. Nov 17:23 . drwxr-xr-x. 3 user user 60 21. Nov 17:23 .. drwxr-xr-x. 2 user user 540 21. Nov 17:23 run-000001-00004 drwxr-xr-x. 2 user user 540 21. Nov 17:23 run-000001-00009 drwxr-xr-x. 2 user user 540 21. Nov 17:23 run-000001-00012 drwxr-xr-x. 2 user user 540 21. Nov 17:23 run-000001-00016 drwxr-xr-x. 2 user user 540 21. Nov 17:23 run-000002-00004 drwxr-xr-x. 2 user user 540 21. Nov 17:23 run-000002-00007 .. hint:: RLlib stores records under a folder named by the RL environment. Therein, you see one folder of Parquet files for each :py:class:`~ray.rllib.offline.offline_env_runner.OfflineSingleAgentEnvRunner` and write operation. The write operation count is given in the second numbering. For example: above, env-runner 1 has sampled 25 episodes at its 4th :py:meth:`~ray.rllib.offline.offline_env_runner.OfflineSingleAgentEnvRunner.sample` call and writes then (because ``output_max_rows_per_file=25``) all sampled episodes to disk into file ``run-000001-00004``. .. note:: The number of write operations per worker may vary because policy rollouts aren't evenly distributed. Faster workers collect more episodes, leading to differences in write operation counts. As a result, the second numbering may differ across files generated by different env-runner instances. Example: Training on previously saved experiences ------------------------------------------------- In this example you are using behavior cloning with the previously recorded Parquet data from your expert policy playing ``CartPole-v1``. The data needs to be linked in the configuration of the algorithm (through the ``input_`` attribute). .. code-block:: python from ray import tune from ray.rllib.algorithms.bc import BCConfig # Setup the config for behavior cloning. config = ( BCConfig() .environment( # Use the `CartPole-v1` environment from which the # data was recorded. This is merely for receiving # action and observation spaces and to use it during # evaluation. env="CartPole-v1", ) .learners( # Use a single learner. num_learners=0, ) .training( # This has to be defined in the new offline RL API. train_batch_size_per_learner=1024, ) .offline_data( # Link the data. input_=[data_path], # You want to read in RLlib's episode format b/c this # is how you recorded data. input_read_episodes=True, # Read smaller batches from the data than the learner # trains on. Note, each batch element is an episode # with multiple timesteps. input_read_batch_size=512, # Create exactly 2 `DataWorkers` that transform # the data on-the-fly. Give each of them a single # CPU. map_batches_kwargs={ "concurrency": 2, "num_cpus": 1, }, # When iterating over the data, prefetch two batches # to improve the data pipeline. Don't shuffle the # buffer (the data is too small). iter_batches_kwargs={ "prefetch_batches": 2, "local_shuffle_buffer_size": None, }, # You must set this for single-learner setups. dataset_num_iters_per_learner=1, ) .evaluation( # Run evaluation to see how well the learned policy # performs. Run every 3rd training iteration an evaluation. evaluation_interval=3, # Use a single `EnvRunner` for evaluation. 
evaluation_num_env_runners=1, # In each evaluation rollout, collect 5 episodes of data. evaluation_duration=5, # Evaluate the policy parallel to training. evaluation_parallel_to_training=True, ) ) # Set the stopping metric to be the evaluation episode return mean. metric = f"{EVALUATION_RESULTS}/{ENV_RUNNER_RESULTS}/{EPISODE_RETURN_MEAN}" # Configure Ray Tune. tuner = tune.Tuner( "BC", param_space=config, run_config=tune.RunConfig( name="docs_rllib_offline_bc", # Stop behavior cloning when we reach 450 in return. stop={metric: 450.0}, checkpoint_config=tune.CheckpointConfig( # Only checkpoint at the end to be faster. checkpoint_frequency=0, checkpoint_at_end=True, ), verbose=2, ) ) # Run the experiment. analysis = tuner.fit() Behavior cloning in RLlib is highly performant, completing a single training iteration in approximately 2 milliseconds. The experiment's results should resemble the following: .. image:: images/offline/docs_rllib_offline_bc_episode_return_mean.svg :alt: Episode mean return over the course of BC training. :width: 500 :align: left It should take you around ``98`` seconds (``456`` iterations) to achieve the same episode return mean as the PPO agent. While this may not seem impressive compared to the PPO training time, it's important to note that ``CartPole-v1`` is a very simple environment to learn. In more complex environments, which require more sophisticated agents and significantly longer training times, pre-training through behavior cloning can be highly beneficial. Combining behavior cloning with subsequent fine-tuning using a reinforcement learning algorithm can substantially reduce training time, resource consumption, and associated costs. Using external expert experiences --------------------------------- Your expert data is often already available, either recorded from an operational system or directly provided by human experts. Typically, you might store this data in a tabular (columnar) format. RLlib's new Offline RL API simplifies the use of such data by allowing direct ingestion through a specified schema that organizes the expert data. The API default schema for reading data is provided in :py:data:`~ray.rllib.offline.offline_prelearner.SCHEMA`. Lets consider a simple example in which your expert data is stored with the schema: ``(o_t, a_t, r_t, o_tp1, d_t, i_t, logprobs_t)``. In this case you provide this schema as follows: .. code-block:: python from ray.rllib.algorithms.bc import BCConfig from ray.rllib.core.columns import Columns config = ( BCConfig() ... .offline_data( input_=[], # Provide the schema of your data (map to column names known to RLlib). input_read_schema={ Columns.OBS: "o_t", Columns.ACTIONS: "a_t", Columns.REWARDS: "r_t", Columns.NEXT_OBS: "o_tp1", Columns.INFOS: "i_t", "done": "d_t", }, ) ) .. note:: Internally, the legacy ``gym``'s ``done`` signals are mapped to ``gymnasium``'s ``terminated`` signals, with ``truncated`` values defaulting to ``False``. RLlib's :py:class:`~ray.rllib.env.single_agent_episode.SingleAgentEpisode` structures align with ``gymnasium``, adhering to the updated environment API standards in reinforcement learning. Converting tabular data to RLlib's episode format ------------------------------------------------- While the tabular format is widely compatible and seamlessly integrates with RLlib's new Offline RL API, there are cases where you may prefer to use RLlib's native episode format. As briefly mentioned earlier, such scenarios typically arise when full expert trajectories are required. .. 
note:: RLlib processes tabular data in batches, converting each row into a *single-step episode*. This approach is primarily for procedural simplicity, as data can't generally be assumed to arrive in time-ordered rows grouped by episodes, though this may occasionally be the case (however, knowledge of such a structure resides with the user, as RLlib can't easily infer it automatically). While it's possible to concatenate consecutive :py:class:`~ray.rllib.env.single_agent_episode.SingleAgentEpisode` chunks, this can't be done with chunks arriving in some scrambled order. If you require full trajectories, you can transform your tabular data into :py:class:`~ray.rllib.env.single_agent_episode.SingleAgentEpisode` objects and store these in Parquet format. The next example shows how to do this. First, you store experiences of the preceding trained expert policy in tabular format (note the `output_write_episodes=False` setting below to activate tabular data output): .. code-block:: python from ray.rllib.algorithms.ppo import PPOConfig from ray.rllib.core import ( COMPONENT_LEARNER_GROUP, COMPONENT_LEARNER, COMPONENT_RL_MODULE, DEFAULT_MODULE_ID, ) from ray.rllib.core.rl_module import RLModuleSpec from ray.rllib.core.rl_module.default_model_config import DefaultModelConfig # Set up a path for the tabular data records. tabular_data_path = "/tmp/docs_rllib_offline_recording_tabular" # Configure the algorithm for recording. config = ( PPOConfig() # The environment needs to be specified. .environment( env="CartPole-v1", ) # Make sure to sample complete episodes because # you want to record RLlib's episode objects. .env_runners( batch_mode="complete_episodes", ) # Set up 5 evaluation `EnvRunners` for recording. # Sample 50 episodes in each evaluation rollout. .evaluation( evaluation_num_env_runners=5, evaluation_duration=50, ) # Use the checkpointed expert policy from the preceding PPO training. # Note, we have to use the same `model_config` as # the one with which the expert policy was trained, otherwise # the module state can't be loaded. .rl_module( model_config=DefaultModelConfig( fcnet_hiddens=[32], fcnet_activation="linear", # Share encoder layers between value network # and policy. vf_share_layers=True, ), ) # Define the output path and format. In this example you # want to store the data in tabular (columns) format. .offline_data( output=tabular_data_path, # Store tabular data for this example. output_write_episodes=False, ) ) # Build the algorithm. algo = config.build() # Load the PPO-trained `RLModule` to use in recording. algo.restore_from_path( best_checkpoint, # Load only the `RLModule` component here. component=COMPONENT_RL_MODULE, ) # Run 10 evaluation iterations and record the data. for i in range(10): print(f"Iteration {i + 1}") res_eval = algo.evaluate() print(res_eval) # Stop the algorithm. This makes sure that any remaining episodes # in the `EnvRunner` buffers are written to disk. algo.stop() You may have noticed that recording data in tabular format takes significantly longer than recording in episode format. This slower performance is due to the additional post-processing required to convert episode data into a columnar format. To confirm that the recorded data is now in columnar format, you can print its schema: .. code-block:: python from ray import data # Read the tabular data into a Ray dataset. ds = data.read_parquet(tabular_data_path) # Now, print its schema.
print("Tabular data schema of expert experiences:\n") print(ds.schema()) # Column Type # ------ ---- # eps_id string # agent_id null # module_id null # obs ArrowTensorTypeV2(shape=(4,), dtype=float) # actions int32 # rewards double # new_obs ArrowTensorTypeV2(shape=(4,), dtype=float) # terminateds bool # truncateds bool # action_dist_inputs ArrowTensorTypeV2(shape=(2,), dtype=float) # action_logp float # weights_seq_no int64 .. note:: ``infos`` aren't stored to disk when they're all empty. If your expert data is given in columnar format and you need to train on full expert trajectories you can follow the code in the following example to convert your own data into RLlib's :py:class:`~ray.rllib.env.single_agent_episode.SingleAgentEpisode` objects: .. code-block:: python import gymnasium as gym import msgpack import msgpack_numpy as mnp from collections import defaultdict from ray import data from ray.rllib.env.single_agent_episode import SingleAgentEpisode # Load the dataset with the tabular data. ds = data.read_parquet(tabular_data_path) # Build the environment from which the data was sampled to get the # spaces. env = gym.make("CartPole-v1") # Define buffers for episode data. eps_obs = [] eps_actions = [] eps_rewards = [] # Note, extra-model-outputs needs to be a dictionary with list # values. eps_extra_model_outputs = defaultdict(list) # Define a buffer for unwritten episodes. episodes = [] # Start iterating over the rows of your experience data. for i, row in enumerate(ds.iter_rows(prefetch_batches=10)): # If the episode isn't terminated nor truncated, buffer the data. if not row["terminateds"] and not row["truncateds"]: eps_obs.append(row["obs"]) eps_actions.append(row["actions"]) eps_rewards.append(row["rewards"]) eps_extra_model_outputs["action_dist_inputs"].append(row["action_dist_inputs"]) eps_extra_model_outputs["action_logp"].append(row["action_logp"]) # Otherwise, build the episode. else: eps_obs.append(row["new_obs"]) episode = SingleAgentEpisode( id_=row["eps_id"], agent_id=row["agent_id"], module_id=row["module_id"], observations=eps_obs, # Use the spaces from the environment. observation_space=env.observation_space, action_space=env.action_space, actions=eps_actions, rewards=eps_rewards, # Set the starting timestep to zero. t_started=0, # You don't want to have a lookback buffer. len_lookback_buffer=0, terminated=row["terminateds"], truncated=row["truncateds"], extra_model_outputs=eps_extra_model_outputs, ) # Store the ready-to-write episode to the episode buffer. episodes.append(msgpack.packb(episode.get_state(), default=mnp.encode)) # Clear all episode data buffers. eps_obs.clear() eps_actions.clear() eps_rewards.clear() eps_extra_model_outputs = defaultdict(list) # Write episodes to disk when the episode buffer holds 50 episodes. if len(episodes) > 49: # Generate a Ray dataset from episodes. episodes_ds = data.from_items(episodes) # Write the Parquet data and compress it. episodes_ds.write_parquet( f"/tmp/test_converting/file-{i}".zfill(6), compression="gzip", ) # Delete the dataset in memory and clear the episode buffer. del episodes_ds episodes.clear() # If we are finished and have unwritten episodes, write them now. 
if len(episodes) > 0: episodes_ds = data.from_items(episodes) episodes_ds.write_parquet( f"/tmp/test_converting/file-{i:06d}", compression="gzip", ) del episodes_ds episodes.clear() Using old API stack ``SampleBatch`` recordings ---------------------------------------------- If you have expert data previously recorded using RLlib's old API stack, it can be seamlessly utilized in the new stack's Offline RL API by setting ``input_read_sample_batches=True``. Alternatively, you can convert your ``SampleBatch`` recordings into :py:class:`~ray.rllib.env.single_agent_episode.SingleAgentEpisode` format using RLlib's :py:class:`~ray.rllib.offline.offline_prelearner.OfflinePreLearner` as demonstrated below: .. code-block:: python import msgpack import msgpack_numpy as mnp from ray import data from ray.rllib.offline.offline_prelearner import OfflinePreLearner # Set up the data path to your `SampleBatch` expert data. data_path = ... # Set up the write path for the Parquet episode data. output_data_path = "/tmp/sample_batch_data" # Load the `SampleBatch` recordings. ds = data.read_json(data_path) # Iterate over batches (of `SampleBatch`es) and convert them to episodes. for i, batch in enumerate(ds.iter_batches(batch_size=100, prefetch_batches=2)): # Use RLlib's `OfflinePreLearner` to convert `SampleBatch`es to episodes. episodes = OfflinePreLearner._map_sample_batch_to_episode(False, batch)["episodes"] # Create a dataset from the episodes. Note, for storing episodes you need to # serialize them through `msgpack-numpy`. episode_ds = data.from_items([msgpack.packb(eps.get_state(), default=mnp.encode) for eps in episodes]) # Write the batch of episodes to local disk, zero-padding the file index. episode_ds.write_parquet(output_data_path + f"/file-{i:06d}", compression="gzip") print("Finished converting `SampleBatch` data to episode data.") .. note:: RLlib considers your :py:class:`~ray.rllib.policy.sample_batch.SampleBatch` to represent a terminated or truncated episode and builds its :py:class:`~ray.rllib.env.single_agent_episode.SingleAgentEpisode` according to this assumption. Pre-processing, filtering and post-processing --------------------------------------------- During recording, your expert policy may utilize pre-processing techniques for observations, such as *frame-stacking*, or filtering methods like *mean-std filtering*. Similarly, actions may undergo pre-processing, such as *action sampling* or *scaling*. In its ``EnvRunner`` instances, RLlib applies such pre-processing and filtering (through the *env-to-module* connector pipeline) **before** observations are passed to the ``RLModule``. However, raw observations (as received directly from the environment) are stored in the episodes. Likewise, actions are recorded in their raw form (as output directly from the ``RLModule``) while undergoing pre-processing (through RLlib's *module-to-env* connectors) before being sent to the environment. It's crucial to carefully consider the pre-processing and filtering applied during the recording of experiences, as they significantly influence how the expert policy learns and subsequently performs in the environment. For example, if the expert policy uses *mean-std filtering* for observations, it learns a strategy based on the filtered observations, where the filter itself is highly dependent on the experiences collected during training. When deploying this expert policy, it's essential to use the exact same filter during evaluation to avoid performance degradation.
Similarly, a policy trained through behavior cloning may also require a *mean-std filter* for observations to accurately replicate the behavior of the expert policy. Scaling I/O throughput ---------------------- Just as online training can be scaled, offline recording I/O throughput can also be increased by configuring the number of RLlib env-runners. Use the ``num_env_runners`` setting to scale recording during training or ``evaluation_num_env_runners`` for scaling during evaluation-only recording. Each worker operates independently, writing experiences in parallel, enabling linear scaling of I/O throughput for write operations. Within each :py:class:`~ray.rllib.offline.offline_env_runner.OfflineSingleAgentEnvRunner`, episodes are sampled and serialized before being written to disk. Offline RL training in RLlib is highly parallelized, encompassing data reading, post-processing, and, if applicable, updates. When training on offline data, scalability is achieved by increasing the number of ``DataWorker`` instances used to transform offline experiences into a learner-compatible format (:py:class:`~ray.rllib.policy.sample_batch.MultiAgentBatch`). Ray Data optimizes reading operations under the hood by leveraging file metadata, predefined concurrency settings for batch post-processing, and available system resources. It's strongly recommended not to override these defaults, as doing so may disrupt this optimization process. Data processing in RLlib involves three key layers, all of which are highly scalable: #. **Read Operations:** This layer handles data ingestion from files in a specified folder. It's automatically optimized by Ray Data and shouldn't be manually scaled or adjusted. #. **Post-processing (PreLearner):** In this stage, batches are converted, if necessary, into RLlib's :py:class:`~ray.rllib.env.single_agent_episode.SingleAgentEpisode` format and passed through the *learner connector pipeline*. The processed data is then transformed into :py:class:`~ray.rllib.policy.sample_batch.MultiAgentBatch` objects for updating. You can scale this layer by increasing the number of ``DataWorker`` instances. #. **Updating (Learner):** This stage involves updating the policy and associated modules. Scalability is achieved by increasing the number of learners (``num_learners``), enabling parallel processing of batches during updates. The diagram below illustrates the layers and their scalability: .. image:: images/offline/key_layers.svg :width: 500 :alt: Key layers of RLlib's fully scalable Offline RL API. **Read operations** are executed exclusively on the CPU and are primarily scaled by allocating additional resources (see :ref:`How to tune performance ` for details), as they're fully managed by Ray Data. **Post-processing** can be scaled by increasing the concurrency level specified in the keyword arguments for the mapping operation: .. code-block:: python config = ( AlgorithmConfig() .offline_data( map_batches_kwargs={ "concurrency": 10, "num_cpus": 4, } ) ) This initiates an actor pool with 10 ``DataWorker`` instances, each running an instance of RLlib's callable :py:class:`~ray.rllib.offline.offline_prelearner.OfflinePreLearner` class to post-process batches for updating the :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule`. .. note:: The ``num_cpus`` (and similarly the ``num_gpus``) attribute defines the resources **allocated to each** ``DataWorker``, not to the full actor pool.
You scale the number of learners in RLlib's :py:meth:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig.learners` configuration block: .. code-block:: python config = ( AlgorithmConfig() .learners( num_learners=4, num_gpus_per_learner=1, ) ) With this configuration, you start an application with 4 remote :py:class:`~ray.rllib.core.learner.learner.Learner` instances (see :ref:`Learner (Alpha)` for more details about RLlib's learners), each of them using a single GPU. Using cloud storage ------------------- Unlike RLlib's previous stack, the new Offline RL API is cloud-agnostic and fully integrates with PyArrow. You can utilize any available cloud storage path or PyArrow-compatible filesystem. If using a PyArrow or compatible filesystem, ensure that your ``input_`` path is a relative path within this filesystem. Similar to Ray Data, you can also use placeholders, lists of files or folders, or simply specify a single folder to read recursively from. For example, to read from a storage bucket in GCS, you can specify the folder location as follows: .. code-block:: python config = ( AlgorithmConfig() .offline_data( input_="gs:///dir1", ) ) This configuration allows RLlib to read data recursively from any folder beneath the specified path. If you are using a filesystem for GCS (for instance, due to authentication requirements), use the following syntax: .. code-block:: python from datetime import timedelta import pyarrow.fs # Define the PyArrow filesystem. gcs = pyarrow.fs.GcsFileSystem( # This is needed to resolve the hostname for public buckets. anonymous=True, retry_time_limit=timedelta(seconds=15) ) # Define the configuration. config = ( AlgorithmConfig() .offline_data( # NOTE: Use a relative file path now. input_="/dir1", input_filesystem=gcs, ) ) You can learn more about PyArrow's filesystems, particularly regarding cloud filesystems and required authentication, in `PyArrow Filesystem Interface `__. Using cloud storage for recording ********************************* You can use cloud storage in a similar way when recording experiences from an expert policy: .. code-block:: python config = ( AlgorithmConfig() .offline_data( output="gs:///dir1", ) ) RLlib then writes directly into the folder in the cloud storage and creates it if it doesn't already exist in the bucket. The only difference from reading is that you can't use multiple paths for writing. So something like .. code-block:: python config = ( AlgorithmConfig() .offline_data( output=["gs:///dir1", "gs:///dir2"], ) ) wouldn't work. If the storage requires special permissions for creating folders and/or writing files, ensure that the cluster user is granted the necessary permissions. Failure to do so results in denied write access, causing the recording process to stop. .. note:: When using cloud storage, Ray Data typically streams data, meaning it's consumed in chunks. This allows postprocessing and training to begin after a brief warmup phase. More specifically, even if your cloud storage is large, the same amount of space isn't required on the nodes running RLlib. .. _how-to-tune-performance: How to tune performance ----------------------- In RLlib's Offline RL API, the various key layers are managed by distinct modules and configurations, making it non-trivial to scale these layers effectively. It's important to understand the specific parameters and their respective impact on system performance. ..
_how-to-tune-reading-operations: How to tune reading operations ****************************** As noted earlier, the **Reading Operations** layer is automatically handled and dynamically optimized by :ref:`Ray Data `. It's strongly recommended to avoid modifying this process. However, there are certain parameters that can enhance performance on this layer to some extent, including: #. Available resources (dedicated to the job). #. Data locality. #. Data sharding. #. Data pruning. Available resources ~~~~~~~~~~~~~~~~~~~ The scheduling strategy employed by :ref:`Ray Data ` operates independently of any existing placement group, scheduling tasks and actors separately. Consequently, it's essential to reserve adequate resources for other tasks and actors within your job. To optimize :ref:`Ray Data `'s scalability for read operations and improve reading performance, consider increasing the available resources in your cluster while preserving the resource allocation for existing tasks and actors. The key resources to monitor and provision are CPUs and object store memory. Insufficient object store memory, especially under heavy backpressure, may lead to objects being spilled to disk, which can severely impact application performance. Bandwidth is a crucial factor influencing the throughput within your cluster. In some cases, scaling the number of nodes can increase bandwidth, thereby enhancing the flow of data from storage to consuming processes. Scenarios where this approach is beneficial include: - Independent connections to the network backbone: Nodes utilize dedicated bandwidth, avoiding shared uplinks and potential bottlenecks (see, for example, `here `__ for AWS and `here `__ for GCP network bandwidth documentation). - Optimized cloud access: Employing features like `S3 Transfer Acceleration `__, `Google Cloud Storage FUSE `__, or parallel and accelerated data transfer methods to enhance performance. Data locality ~~~~~~~~~~~~~ Data locality is a critical factor in achieving fast data processing. For instance, if your data resides on GCP, running a Ray cluster on AWS or on a local machine inevitably results in low transfer rates and slow data processing. To ensure optimal performance, storing data with the same cloud provider and in the same region and zone as the Ray cluster is generally sufficient to enable efficient streaming for RLlib's Offline RL API. Additional adjustments to consider include: - Multi-Region Buckets: Use multi-region storage to improve data availability and potentially enhance access speeds for distributed systems. - Storage class optimization within buckets: Use **standard storage** for frequent access and low-latency streaming. Avoid archival storage classes like AWS Glacier or GCP Archive for streaming workloads due to high retrieval times. Data sharding ~~~~~~~~~~~~~ Data sharding improves the efficiency of fetching, transferring, and reading data by balancing chunk sizes. If chunks are too large, they can cause delays during transfer and processing, leading to bottlenecks. Conversely, chunks that are too small can result in high metadata fetching overhead, slowing down overall performance. Finding an optimal chunk size is critical for balancing these trade-offs and maximizing throughput. - As a rule of thumb, keep data file sizes between 64 MiB and 256 MiB. Data pruning ~~~~~~~~~~~~ If your data is in **Parquet** format (the recommended offline data format for RLlib), you can leverage data pruning to optimize performance.
:ref:`Ray Data ` supports pruning in its :py:meth:`~ray.data.read_parquet` method through projection pushdown (column filtering) and filter pushdown (row filtering). These filters are applied directly during file scans, reducing the amount of unnecessary data loaded into memory. For instance, if you only require specific columns from your offline data (for example, to avoid loading the ``infos`` column):

.. code-block:: python

    from ray.rllib.algorithms.algorithm_config import AlgorithmConfig
    from ray.rllib.core.columns import Columns

    config = (
        AlgorithmConfig()
        .offline_data(
            input_read_method_kwargs={
                "columns": [
                    Columns.EPS_ID,
                    Columns.AGENT_ID,
                    Columns.OBS,
                    Columns.NEXT_OBS,
                    Columns.REWARDS,
                    Columns.ACTIONS,
                    Columns.TERMINATED,
                    Columns.TRUNCATED,
                ],
            },
        )
    )

Similarly, if you only require specific rows from your dataset, you can apply pushdown filters as shown below:

.. code-block:: python

    import pyarrow.dataset

    from ray.rllib.algorithms.algorithm_config import AlgorithmConfig
    from ray.rllib.core.columns import Columns

    config = (
        AlgorithmConfig()
        .offline_data(
            input_read_method_kwargs={
                "filter": pyarrow.dataset.field(Columns.AGENT_ID) == "agent_1",
            },
        )
    )

How to tune post-processing (PreLearner)
****************************************

When enabling high throughput in **Read Operations**, it's essential to ensure sufficient processing capacity in the **Post-Processing (Pre-Learner)** stage. Insufficient capacity in this stage can cause backpressure, leading to increased memory usage and, in severe cases, object spilling to disk or even Out-Of-Memory errors (see :ref:`Out-Of-Memory Prevention `).

Tuning the **Post-Processing (Pre-Learner)** layer is generally more straightforward than optimizing the **Read Operations** layer. The following parameters can be adjusted to optimize its performance:

- Actor Pool Size
- Allocated Resources
- Read Batch and Buffer Sizes.

Actor pool size
~~~~~~~~~~~~~~~

Internally, the **Post-Processing (PreLearner)** layer is defined by a :py:meth:`~ray.data.Dataset.map_batches` operation that starts an :py:class:`~ray.data._internal.execution.operators.actor_pool_map_operator._ActorPool`. Each actor in this pool runs an :py:class:`~ray.rllib.offline.offline_prelearner.OfflinePreLearner` instance to transform batches on their way from disk to RLlib's :py:class:`~ray.rllib.core.learner.learner.Learner`. The size of this :py:class:`~ray.data._internal.execution.operators.actor_pool_map_operator._ActorPool` therefore defines the throughput of this layer and needs to be fine-tuned against the previous layer's throughput to avoid backpressure. You can use the ``concurrency`` key in RLlib's ``map_batches_kwargs`` parameter to define this pool size:

.. code-block:: python

    from ray.rllib.algorithms.algorithm_config import AlgorithmConfig

    config = (
        AlgorithmConfig()
        .offline_data(
            map_batches_kwargs={
                "concurrency": 4,
            },
        )
    )

With the preceding code you enable :ref:`Ray Data ` to start up to ``4`` parallel :py:class:`~ray.rllib.offline.offline_prelearner.OfflinePreLearner` actors that can post-process your data for training.

.. note::
    :ref:`Ray Data ` dynamically adjusts its read operations based on the parallelism of your **Post-Processing (Pre-Learner)** layer. It scales read operations up or down depending on the backpressure in the **Post-Processing (Pre-Learner)** stage.
    This means the throughput of your entire streaming pipeline is determined by the performance of the downstream tasks and the resources allocated to the **Reading Operations** layer (see :ref:`How to tune reading operations `). However, due to the overhead associated with scaling reading operations up or down, backpressure - and in severe cases, object spilling or Out-Of-Memory (OOM) errors - can't always be entirely avoided.

You can also enable auto-scaling in your **Post-Processing (PreLearner)** by providing an interval instead of a fixed number:

.. code-block:: python

    from ray.rllib.algorithms.algorithm_config import AlgorithmConfig

    config = (
        AlgorithmConfig()
        .offline_data(
            map_batches_kwargs={
                "concurrency": (4, 8),
            },
        )
    )

This allows :ref:`Ray Data ` to start up to ``8`` post-processing actors to move data downstream faster, for example in case of backpressure.

.. note::
    Implementing an autoscaled actor pool in the **Post-Processing (Pre-Learner)** layer doesn't guarantee the elimination of backpressure. Adding more :py:class:`~ray.rllib.offline.offline_prelearner.OfflinePreLearner` instances introduces additional overhead to the system. RLlib's offline RL pipeline is optimized for streaming data, which typically exhibits stable throughput and resource usage, except in cases of imbalances between upstream and downstream tasks. As a rule of thumb, consider using autoscaling only under the following conditions: (1) throughput is expected to be highly variable, (2) cluster resources are subject to fluctuations (for example, in shared or dynamic environments), and/or (3) workload characteristics are highly unpredictable.

Allocated resources
~~~~~~~~~~~~~~~~~~~

Besides the number of post-processing actors, you can tune performance of the **Post-Processing (PreLearner)** layer by defining the resources allocated to each :py:class:`~ray.rllib.offline.offline_prelearner.OfflinePreLearner` in the actor pool. You can define such resources either through ``num_cpus`` and ``num_gpus`` or in the ``ray_remote_args``.

.. note::
    Typically, increasing the number of CPUs is sufficient for performance tuning in the post-processing stage of your pipeline. GPUs are only needed in specialized cases, such as in customized pipelines. For example, RLlib's :py:class:`~ray.rllib.algorithms.marwil.marwil.MARWIL` implementation uses the :py:class:`~ray.rllib.connectors.learner.general_advantage_estimation.GeneralAdvantageEstimation` connector in its :py:class:`~ray.rllib.connectors.connector_pipeline_v2.ConnectorPipelineV2` to apply `General Advantage Estimation `__ on experience batches. In these calculations, the value model of the algorithm's :py:class:`~ray.rllib.core.rl_module.RLModule` is applied, which you can accelerate by running on a GPU.

For example, to provide each of your ``4`` :py:class:`~ray.rllib.offline.offline_prelearner.OfflinePreLearner` instances in the **Post-Processing (PreLearner)** with ``2`` CPUs, use the following syntax:

.. code-block:: python

    from ray.rllib.algorithms.algorithm_config import AlgorithmConfig

    config = (
        AlgorithmConfig()
        .offline_data(
            map_batches_kwargs={
                "concurrency": 4,
                "num_cpus": 2,
            },
        )
    )

.. warning::
    Don't override the ``batch_size`` in RLlib's ``map_batches_kwargs``. This usually leads to significant performance degradation.
    Note that this ``batch_size`` differs from the ``train_batch_size_per_learner``: the former specifies the batch size used in transformations of the streaming pipeline, while the latter defines the batch size used for training within each :py:class:`~ray.rllib.core.learner.learner.Learner` (the batch size of the actual model forward and backward passes performed for training).

Read batch and buffer sizes
~~~~~~~~~~~~~~~~~~~~~~~~~~~

When working with data in :py:class:`~ray.rllib.env.single_agent_episode.SingleAgentEpisode` or the legacy :py:class:`~ray.rllib.policy.sample_batch.SampleBatch` format, fine-tuning the ``input_read_batch_size`` parameter provides additional optimization opportunities. This parameter controls the size of batches retrieved from data files. Its effectiveness is particularly notable when handling episodic or legacy :py:class:`~ray.rllib.policy.sample_batch.SampleBatch` data because the streaming pipeline uses an :py:class:`~ray.rllib.utils.replay_buffers.episode_replay_buffer.EpisodeReplayBuffer` for this data to handle the multiple timesteps contained in each data row. All incoming data is converted into :py:class:`~ray.rllib.env.single_agent_episode.SingleAgentEpisode` instances - if not already in this format - and stored in an episode replay buffer, which precisely manages the sampling of ``train_batch_size_per_learner`` timesteps for training.

.. image:: images/offline/docs_rllib_offline_prelearner.svg
    :alt: The OfflinePreLearner converts and buffers episodes before sampling the batches used in learning.
    :width: 500
    :align: left

Achieving an optimal balance between data ingestion efficiency and sampling variation in your streaming pipeline is crucial. Consider the following example: suppose each :py:class:`~ray.rllib.env.single_agent_episode.SingleAgentEpisode` has a length of ``100`` timesteps, and your ``train_batch_size_per_learner`` is configured to be ``1000``. Each :py:class:`~ray.rllib.utils.replay_buffers.episode_replay_buffer.EpisodeReplayBuffer` instance is set with a capacity of ``1000``:

.. code-block:: python

    from ray.rllib.algorithms.algorithm_config import AlgorithmConfig

    config = (
        AlgorithmConfig()
        .training(
            # Train on a batch of 1000 timesteps each iteration.
            train_batch_size_per_learner=1000,
        )
        .offline_data(
            # Read in RLlib's new stack `SingleAgentEpisode` data.
            input_read_episodes=True,
            # Define an input read batch size of 10 episodes.
            input_read_batch_size=10,
            # Set the replay buffer in the `OfflinePreLearner`
            # to 1,000 timesteps.
            prelearner_buffer_kwargs={
                "capacity": 1000,
            },
        )
    )

If you configure ``input_read_batch_size`` to ``10`` as shown in the code, all ``10`` :py:class:`~ray.rllib.env.single_agent_episode.SingleAgentEpisode` instances fit into the buffer, enabling sampling across a wide variety of timesteps from multiple episodes. This results in high sampling variation. Now, consider the case where the buffer capacity is reduced to ``500``:

.. code-block:: python

    from ray.rllib.algorithms.algorithm_config import AlgorithmConfig

    config = (
        AlgorithmConfig()
        .training(
            # Train on a batch of 1000 timesteps each iteration.
            train_batch_size_per_learner=1000,
        )
        .offline_data(
            # Read in RLlib's new stack `SingleAgentEpisode` data.
            input_read_episodes=True,
            # Define an input read batch size of 10 episodes.
            input_read_batch_size=10,
            # Set the replay buffer in the `OfflinePreLearner`
            # to 500 timesteps.
            prelearner_buffer_kwargs={
                "capacity": 500,
            },
        )
    )

With the same ``input_read_batch_size``, only ``5`` :py:class:`~ray.rllib.env.single_agent_episode.SingleAgentEpisode` instances can be buffered at a time, causing inefficiencies as more data is read than can be retained for sampling. In another scenario, if each :py:class:`~ray.rllib.env.single_agent_episode.SingleAgentEpisode` still has a length of ``100`` timesteps and the ``train_batch_size_per_learner`` is set to ``4000`` timesteps as in the code below, the buffer holds ``10`` :py:class:`~ray.rllib.env.single_agent_episode.SingleAgentEpisode` instances. This configuration results in lower sampling variation because many timesteps are repeatedly sampled, reducing diversity across training batches. These examples highlight the importance of tuning these parameters to balance data ingestion and sampling diversity in your offline streaming pipeline effectively.

.. code-block:: python

    from ray.rllib.algorithms.algorithm_config import AlgorithmConfig

    config = (
        AlgorithmConfig()
        .training(
            # Train on a batch of 4000 timesteps each iteration.
            train_batch_size_per_learner=4000,
        )
        .offline_data(
            # Read in RLlib's new stack `SingleAgentEpisode` data.
            input_read_episodes=True,
            # Define an input read batch size of 10 episodes.
            input_read_batch_size=10,
            # Set the replay buffer in the `OfflinePreLearner`
            # to 1,000 timesteps.
            prelearner_buffer_kwargs={
                "capacity": 1000,
            },
        )
    )

.. tip::
    To choose an adequate ``input_read_batch_size``, take a look at the length of your recorded episodes. In some cases each single episode is long enough to fulfill the ``train_batch_size_per_learner``, and you could choose an ``input_read_batch_size`` of ``1``. Most of the time it isn't, and you need to consider how many episodes should be buffered to balance the amount of data ingested from the read input against the variation of data sampled from the :py:class:`~ray.rllib.utils.replay_buffers.episode_replay_buffer.EpisodeReplayBuffer` instances in the :py:class:`~ray.rllib.offline.offline_prelearner.OfflinePreLearner`.

How to tune updating (Learner)
******************************

**Updating (Learner)** is the final downstream task in RLlib's Offline RL pipeline, and its consumption speed determines the overall throughput of the data pipeline. If the learning process is slow, it can cause backpressure in upstream layers, potentially leading to object spilling or Out-Of-Memory (OOM) errors. Therefore, it's essential to fine-tune this layer in coordination with the upstream components. Several parameters can be adjusted to optimize the learning speed in your offline algorithm:

- Actor Pool Size
- Allocated Resources
- Scheduling Strategy
- Batch Sizing
- Batch Prefetching
- Learner Iterations.

.. _actor-pool-size:

Actor pool size
~~~~~~~~~~~~~~~

RLlib supports scaling :py:class:`~ray.rllib.core.learner.learner.Learner` instances through the parameter ``num_learners``. When this value is ``0``, RLlib uses a :py:class:`~ray.rllib.core.learner.learner.Learner` instance in the local process, whereas for values ``>0``, RLlib scales out using a :py:class:`~ray.train._internal.backend_executor.BackendExecutor`. This executor spawns your specified number of :py:class:`~ray.rllib.core.learner.learner.Learner` instances, manages distributed training, and aggregates intermediate results across :py:class:`~ray.rllib.core.learner.learner.Learner` actors.
:py:class:`~ray.rllib.core.learner.learner.Learner` scaling increases training throughput, but you should only apply it if the upstream components in your Offline Data pipeline can supply data at a rate sufficient to match the increased training capacity. RLlib's Offline API offers powerful scalability at its final layer by utilizing :py:class:`~ray.data.Dataset.streaming_split`. This functionality divides the data stream into multiple substreams, which are then processed by individual :py:class:`~ray.rllib.core.learner.learner.Learner` instances, enabling efficient parallel consumption and enhancing overall throughput. For example, to set the number of learners to ``4``, you use the following syntax:

.. code-block:: python

    from ray.rllib.algorithms.algorithm_config import AlgorithmConfig

    config = (
        AlgorithmConfig()
        .learners(num_learners=4)
    )

.. tip::
    For performance optimization, choose deliberately between a single local :py:class:`~ray.rllib.core.learner.learner.Learner` and multiple remote :py:class:`~ray.rllib.core.learner.learner.Learner` instances. If your dataset is small, use scaling of :py:class:`~ray.rllib.core.learner.learner.Learner` instances with caution, as it produces significant overhead and splits the data pipeline into multiple streams.

Allocated resources
~~~~~~~~~~~~~~~~~~~

Just as with the Post-Processing (Pre-Learner) layer, allocating additional resources can help address slow training issues. The primary resource to leverage is the GPU, as training involves forward and backward passes through the :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule`, which GPUs can accelerate significantly. If your training already utilizes GPUs and performance still remains an issue, consider scaling up by either adding more GPUs to each :py:class:`~ray.rllib.core.learner.learner.Learner` to increase GPU memory and computational capacity (set `config.learners(num_gpus_per_learner=...)`), or by adding additional :py:class:`~ray.rllib.core.learner.learner.Learner` workers to further distribute the workload (by setting `config.learners(num_learners=...)`). Additionally, ensure that data throughput and upstream components are optimized to keep the learners fully utilized, as insufficient upstream capacity can bottleneck the training process.

.. warning::
    Currently, you can't set both `num_gpus_per_learner` and `num_cpus_per_learner` due to placement group (PG) fragmentation in Ray.

To provide your learners with more compute, use ``num_gpus_per_learner`` or ``num_cpus_per_learner`` as follows:

.. code-block:: python

    from ray.rllib.algorithms.algorithm_config import AlgorithmConfig

    config = (
        AlgorithmConfig()
        .learners(num_learners=4, num_gpus_per_learner=2)
    )

.. tip::
    If you experience backpressure in the **Post-Processing (Pre-Learner)** stage of your pipeline, consider enabling GPU training before scaling up the number of your :py:class:`~ray.rllib.core.learner.learner.Learner` instances.

Scheduling strategy
~~~~~~~~~~~~~~~~~~~

The scheduling strategy in Ray plays a key role in task and actor placement by attempting to distribute them across multiple nodes in a cluster, thereby maximizing resource utilization and fault tolerance. When running on a single-node cluster (that is, one large head node), the scheduling strategy has little to no noticeable impact. However, in a multi-node cluster, scheduling can significantly influence the performance of your Offline Data pipeline due to the importance of data locality.
Data processing occurs across all nodes, and maintaining data locality during training can enhance performance. In such scenarios, you can improve data locality by changing RLlib's default scheduling strategy from ``"PACK"`` to ``"SPREAD"``. This strategy distributes the :py:class:`~ray.rllib.core.learner.learner.Learner` actors across the cluster, allowing `Ray Data ` to take advantage of locality-aware bundle selection, which can improve efficiency. Here is an example of how you can change the scheduling strategy:

.. code-block:: python

    """Just for show-casing, don't run."""
    import os

    from ray import data
    from ray.rllib.algorithms.algorithm_config import AlgorithmConfig

    # Configure a "SPREAD" scheduling strategy for learners.
    os.environ["TRAIN_ENABLE_WORKER_SPREAD_ENV"] = "1"

    # Get the current data context.
    data_context = data.DataContext.get_current()
    # Set the execution options such that Ray Data tries to match
    # the locality of an output stream with where learners are located.
    data_context.execution_options = data.ExecutionOptions(
        locality_with_output=True,
    )

    # Build the config.
    config = (
        AlgorithmConfig()
        .learners(
            # Scale the learners.
            num_learners=4,
            num_gpus_per_learner=2,
        )
        .offline_data(
            ...,
            # Run 20 iterations per learner in each RLlib training
            # iteration (each of them with `train_batch_size_per_learner`).
            dataset_num_iters_per_learner=20,
        )
    )

    # Build the algorithm from the config.
    algo = config.build()

    # Train for 10 iterations.
    for _ in range(10):
        res = algo.train()

.. warning::
    Changing scheduling strategies in RLlib's Offline RL API is experimental; use with caution.

Batch size
~~~~~~~~~~

Batch size is one of the simplest parameters to adjust for optimizing performance in RLlib's new Offline RL API. Small batch sizes may under-utilize hardware, leading to inefficiencies, while overly large batch sizes can exceed memory limits. In a streaming pipeline, the selected batch size impacts how data is partitioned and processed across parallel workers. Larger batch sizes reduce the overhead of frequent task coordination, but if they exceed hardware constraints, they can slow down the entire pipeline. You can configure the training batch size using the `train_batch_size_per_learner` attribute as shown below.

.. code-block:: python

    from ray.rllib.algorithms.algorithm_config import AlgorithmConfig

    config = (
        AlgorithmConfig()
        .training(
            train_batch_size_per_learner=1024,
        )
    )

.. tip::
    A good starting point for batch size tuning is ``2048``. In `Ray Data `, it's common practice to use batch sizes that are powers of two. However, you are free to select any integer value for the batch size based on your needs.

Batch prefetching
~~~~~~~~~~~~~~~~~

Batch prefetching allows you to control data consumption on the downstream side of your offline data pipeline. The primary goal is to ensure that learners remain active, maintaining a continuous flow of data. This is achieved by preparing the next batch while the learner processes the current one. Prefetching determines how many batches are kept ready for learners and should be tuned based on the time required to produce the next batch and the learner's update speed. Prefetching too many batches can lead to memory inefficiencies and, in some cases, backpressure in upstream tasks.

.. tip::
    The default in RLlib's Offline RL API is to prefetch ``2`` batches per learner instance, which works well with most tested applications.

You can configure batch prefetching in the `iter_batches_kwargs`:
.. code-block:: python

    from ray.rllib.algorithms.algorithm_config import AlgorithmConfig

    config = (
        AlgorithmConfig()
        .offline_data(
            iter_batches_kwargs={
                "prefetch_batches": 2,
            }
        )
    )

.. warning::
    Don't override the ``batch_size`` in RLlib's `iter_batches_kwargs`. This usually leads to significant performance degradation. Note that this ``batch_size`` differs from the `train_batch_size_per_learner`: the former specifies the batch size used when iterating over the output data of the streaming pipeline, while the latter defines the batch size used for training within each :py:class:`~ray.rllib.core.learner.learner.Learner`.

Learner iterations
~~~~~~~~~~~~~~~~~~

This tuning parameter is available only when using multiple instances of :py:class:`~ray.rllib.core.learner.learner.Learner`. In distributed learning, each :py:class:`~ray.rllib.core.learner.learner.Learner` instance processes a sub-stream of the offline streaming pipeline, iterating over batches from that sub-stream. You can control the number of iterations each :py:class:`~ray.rllib.core.learner.learner.Learner` instance runs per RLlib training iteration. Result reporting occurs after each RLlib training iteration. Setting this parameter too low results in inefficiencies, while excessively high values can hinder training monitoring and, in some cases - such as in RLlib's :py:class:`~ray.rllib.algorithms.marwil.marwil.MARWIL` implementation - lead to stale training data. This happens because some data transformations rely on the same :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` that the :py:class:`~ray.rllib.core.learner.learner.Learner` instances are training on. The number of iterations per sub-stream is controlled by the attribute :py:class:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig.dataset_num_iters_per_learner`, which has a default value of ``None``, meaning each learner runs one epoch on its sub-stream. You can modify this value as follows:

.. code-block:: python

    from ray.rllib.algorithms.algorithm_config import AlgorithmConfig

    config = (
        AlgorithmConfig()
        .offline_data(
            # Train on 20 batches from the substream in each learner.
            dataset_num_iters_per_learner=20,
        )
    )

.. note::
    The default value of :py:class:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig.dataset_num_iters_per_learner` is ``None``, which allows each :py:class:`~ray.rllib.core.learner.learner.Learner` instance to process a full epoch on its data substream. While this setting works well for small datasets, it may not be suitable for larger datasets. It's important to tune this parameter according to the size of your dataset to ensure optimal performance.

Customization
-------------

Customization of the Offline RL components in RLlib, such as the :py:class:`~ray.rllib.algorithms.algorithm.Algorithm`, :py:class:`~ray.rllib.core.learner.learner.Learner`, or :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule`, follows a similar process to that of their Online RL counterparts. For detailed guidance, refer to the documentation on :ref:`Algorithms `, :ref:`Learners `, and RLlib's :ref:`RLModule `. The new stack Offline RL streaming pipeline in RLlib supports customization at various levels and locations within the dataflow, allowing for tailored solutions to meet the specific requirements of your offline RL algorithm.

- Connector Level
- PreLearner Level
- Pipeline Level.
Connector level *************** Small data transformations on instances of :py:class:`~ray.rllib.env.single_agent_episode.SingleAgentEpisode` can be easily implemented by modifying the :py:class:`~ray.rllib.connectors.connector_pipeline_v2.ConnectorPipelineV2`, which is part of the :py:class:`~ray.rllib.offline.offline_prelearner.OfflinePreLearner` and prepares episodes for training. You can leverage any connector from RLlib's library (see `RLlib's default connectors `__) or create a custom connector (see `RLlib's ConnectorV2 examples `__) to integrate into the :py:class:`~ray.rllib.core.learner.learner.Learner`'s :py:class:`~ray.rllib.connectors.connector_pipeline_v2.ConnectorPipelineV2`. Careful consideration must be given to the order in which :py:class:`~ray.rllib.connectors.connector_v2.ConnectorV2` instances are applied, as demonstrated in the implementation of `RLlib's MARWIL algorithm `__ (see the `MARWIL paper `__). The `MARWIL algorithm `__ computes a loss that extends beyond behavior cloning by improving the expert's strategy during training using advantages. These advantages are calculated through `General Advantage Estimation (GAE) `__ using a value model. GAE is computed on-the-fly through the :py:class:`~ray.rllib.connectors.learner.general_advantage_estimation.GeneralAdvantageEstimation` connector. This connector has specific requirements: it processes a list of :py:class:`~ray.rllib.env.single_agent_episode.SingleAgentEpisode` instances and must be one of the final components in the :py:class:`~ray.rllib.connectors.connector_pipeline_v2.ConnectorPipelineV2`. This is because it relies on fully prepared batches containing `OBS`, `REWARDS`, `NEXT_OBS`, `TERMINATED`, and `TRUNCATED` fields. Additionally, the incoming :py:class:`~ray.rllib.env.single_agent_episode.SingleAgentEpisode` instances must already include one artificially elongated timestep. To meet these requirements, the pipeline must include the following sequence of :py:class:`~ray.rllib.connectors.connector_v2.ConnectorV2` instances: 1. :py:class:`ray.rllib.connectors.learner.add_one_ts_to_episodes_and_truncate.AddOneTsToEpisodesAndTruncate` ensures the :py:class:`~ray.rllib.env.single_agent_episode.SingleAgentEpisode` objects are elongated by one timestep. 2. :py:class:`ray.rllib.connectors.common.add_observations_from_episodes_to_batch.AddObservationsFromEpisodesToBatch` incorporates the observations (`OBS`) into the batch. 3. :py:class:`ray.rllib.connectors.learner.add_next_observations_from_episodes_to_train_batch.AddNextObservationsFromEpisodesToTrainBatch` adds the next observations (`NEXT_OBS`). 4. Finally, the :py:class:`ray.rllib.connectors.learner.general_advantage_estimation.GeneralAdvantageEstimation` connector piece is applied. Below is the example code snippet from `RLlib's MARWIL algorithm `__ demonstrating this setup: .. code-block:: python @override(AlgorithmConfig) def build_learner_connector( self, input_observation_space, input_action_space, device=None, ): pipeline = super().build_learner_connector( input_observation_space=input_observation_space, input_action_space=input_action_space, device=device, ) # Before anything, add one ts to each episode (and record this in the loss # mask, so that the computations at this extra ts aren't used to compute # the loss). pipeline.prepend(AddOneTsToEpisodesAndTruncate()) # Prepend the "add-NEXT_OBS-from-episodes-to-train-batch" connector piece (right # after the corresponding "add-OBS-..." default piece). 
        pipeline.insert_after(
            AddObservationsFromEpisodesToBatch,
            AddNextObservationsFromEpisodesToTrainBatch(),
        )

        # At the end of the pipeline (when the batch is already completed), add the
        # GAE connector, which performs a vf forward pass, then computes the GAE
        # computations, and puts the results of this (advantages, value targets)
        # directly back in the batch. This is then the batch used for
        # `forward_train` and `compute_losses`.
        pipeline.append(
            GeneralAdvantageEstimation(gamma=self.gamma, lambda_=self.lambda_)
        )

        return pipeline

Define a primer LearnerConnector pipeline
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

There are multiple ways to customize the :py:class:`~ray.rllib.connectors.learner.learner_connector_pipeline.LearnerConnectorPipeline`. One approach, as demonstrated above, is to override the `build_learner_connector` method of the :py:class:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig`. Alternatively, you can directly add a custom :py:class:`~ray.rllib.connectors.connector_v2.ConnectorV2` piece to the :py:class:`~ray.rllib.connectors.learner.learner_connector_pipeline.LearnerConnectorPipeline` by using the `learner_connector` attribute:

.. code-block:: python

    def _make_learner_connector(input_observation_space, input_action_space):
        # Create the learner connector.
        return CustomLearnerConnector(
            parameter_1=0.3,
            parameter_2=100,
        )

    config = (
        AlgorithmConfig()
        .training(
            # Add the connector pipeline as the starting point for
            # the learner connector pipeline.
            learner_connector=_make_learner_connector,
        )
    )

As noted in the comments, this approach to adding a :py:class:`~ray.rllib.connectors.connector_v2.ConnectorV2` piece to the :py:class:`~ray.rllib.connectors.learner.learner_connector_pipeline.LearnerConnectorPipeline` is suitable only if you intend to manipulate raw episodes, as your :py:class:`~ray.rllib.connectors.connector_v2.ConnectorV2` piece serves as the foundation for building the remainder of the pipeline (including batching and other processing steps). If your goal is to modify data further along in the :py:class:`~ray.rllib.connectors.learner.learner_connector_pipeline.LearnerConnectorPipeline`, you should either override the :py:class:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig`'s `build_learner_connector` method or consider the third option: overriding the entire :py:class:`~ray.rllib.offline.offline_prelearner.OfflinePreLearner`.

PreLearner level
****************

If you need to perform data transformations at a deeper level - before your data reaches the :py:class:`~ray.rllib.env.single_agent_episode.SingleAgentEpisode` stage - consider overriding the :py:class:`~ray.rllib.offline.offline_prelearner.OfflinePreLearner`. This class orchestrates the complete data transformation pipeline, converting raw input data into :py:class:`~ray.rllib.policy.sample_batch.MultiAgentBatch` objects ready for training. For instance, if your data is stored in specialized formats requiring pre-parsing and restructuring (for example, XML, HTML, Protobuf, images, or videos), you may need to handle these custom formats directly. You can leverage tools such as `Ray Data's custom datasources ` (for example, :py:meth:`~ray.data.read_binary_files`) to manage the ingestion process. To ensure this data is appropriately structured and sorted into :py:class:`~ray.rllib.env.single_agent_episode.SingleAgentEpisode` objects, you can override the :py:meth:`~ray.rllib.offline.offline_prelearner.OfflinePreLearner._map_to_episodes` static method.
For more extensive customization, you can rewrite the `__call__` method to define custom transformation steps, implement a unique :py:class:`~ray.rllib.connectors.learner.learner_connector_pipeline.LearnerConnectorPipeline`, and construct :py:class:`~ray.rllib.policy.sample_batch.MultiAgentBatch` instances for the :py:class:`~ray.rllib.core.learner.learner.Learner`. The following example demonstrates how to use a custom :py:class:`~ray.rllib.offline.offline_prelearner.OfflinePreLearner` to process text data and construct training batches: .. testcode:: import gymnasium as gym import numpy as np import uuid from typing import Any, Dict, List, Optional, Union from ray import data from ray.rllib.env.single_agent_episode import SingleAgentEpisode from ray.rllib.offline.offline_prelearner import OfflinePreLearner, SCHEMA from ray.rllib.utils.annotations import override from ray.rllib.utils.typing import EpisodeType class TextOfflinePreLearner(OfflinePreLearner): @staticmethod @override(OfflinePreLearner) def _map_to_episodes( is_multi_agent: bool, batch: Dict[str, Union[list, np.ndarray]], schema: Dict[str, str] = SCHEMA, to_numpy: bool = False, input_compress_columns: Optional[List[str]] = None, observation_space: gym.Space = None, action_space: gym.Space = None, vocabulary: Dict[str, Any] = None, **kwargs: Dict[str, Any], ) -> Dict[str, List[EpisodeType]]: # If we have no vocabulary raise an error. if not vocabulary: raise ValueError( "No `vocabulary`. It needs a vocabulary in form of dictionary ", "mapping tokens to their IDs." ) # Define container for episodes. episodes = [] # Data comes in batches of string arrays under the `"text"` key. for text in batch["text"]: # Split the text and tokenize. tokens = text.split(" ") # Encode tokens. encoded = [vocabulary[token] for token in tokens] one_hot_vectors = np.zeros((len(tokens), len(vocabulary), 1, 1)) for i, token in enumerate(tokens): if token in vocabulary: one_hot_vectors[i][vocabulary[token] - 1] = 1.0 # Build the `SingleAgentEpisode`. episode = SingleAgentEpisode( # Generate a unique ID. id_=uuid.uuid4().hex, # agent_id="default_policy", # module_id="default_policy", # We use the starting token with all added tokens as observations. observations=[ohv for ohv in one_hot_vectors], observation_space=observation_space, # Actions are defined to be the "chosen" follow-up token after # given the observation. actions=encoded[1:], action_space=action_space, # Rewards are zero until the end of a sequence. rewards=[0.0 for i in range(len(encoded) - 2)] + [1.0], # The episode is always terminated (as sentences in the dataset are). terminated=True, truncated=False, # No lookback. You want the episode to start at timestep zero. len_lookback_buffer=0, t_started=0, ) # If episodes should be numpy'ized. Some connectors need this. if to_numpy: episode.to_numpy() # Append the episode to the list of episodes. episodes.append(episode) # Return a batch with key `"episodes"`. return {"episodes": episodes} # Define the dataset. ds = data.read_text("s3://anonymous@ray-example-data/this.txt") # Create a vocabulary. tokens = [] for b in ds.iter_rows(): tokens.extend(b["text"].split(" ")) vocabulary = {token: idx for idx, token in enumerate(set(tokens), start=1)} # Take a small batch of 10 from the dataset. batch = ds.take_batch(10) # Now use your `OfflinePreLearner`. 
episodes = TextOfflinePreLearner._map_to_episodes( is_multi_agent=False, batch=batch, to_numpy=True, schema=None, input_compress_columns=False, action_space=None, observation_space=None, vocabulary=vocabulary, ) # Show the constructed episodes. print(f"Episodes: {episodes}") The preceding example illustrates the flexibility of RLlib's Offline RL API for custom data transformation. In this case, a customized :py:class:`~ray.rllib.offline.offline_prelearner.OfflinePreLearner` processes a batch of text data - organized as sentences - and converts each sentence into a :py:class:`~ray.rllib.env.single_agent_episode.SingleAgentEpisode`. The static method returns a dictionary containing a list of these :py:class:`~ray.rllib.env.single_agent_episode.SingleAgentEpisode` instances. Similarly, you can extend this functionality by overriding the :py:meth:`~ray.rllib.offline.offline_prelearner.OfflinePreLearner.__call__` method. For instance, you could implement a :py:class:`ray.rllib.connectors.learner.learner_connector_pipeline.LearnerConnectorPipeline` that stacks multiple observations (for example, tokens) together. This can be achieved using RLlib's :py:class:`~ray.rllib.connectors.learner.frame_stacking.FrameStackingLearner` and is shown in the example below. .. testcode:: import gymnasium as gym import numpy as np import uuid from typing import Any, Dict, List, Optional, Tuple, Union from ray import data from ray.actor import ActorHandle from ray.rllib.algorithms.algorithm_config import AlgorithmConfig from ray.rllib.algorithms.bc.bc_catalog import BCCatalog from ray.rllib.algorithms.bc.torch.default_bc_torch_rl_module import DefaultBCTorchRLModule from ray.rllib.connectors.common import AddObservationsFromEpisodesToBatch, BatchIndividualItems, NumpyToTensor, AgentToModuleMapping from ray.rllib.connectors.learner.add_columns_from_episodes_to_train_batch import AddColumnsFromEpisodesToTrainBatch from ray.rllib.connectors.learner.frame_stacking import FrameStackingLearner from ray.rllib.connectors.learner.learner_connector_pipeline import LearnerConnectorPipeline from ray.rllib.core.learner.learner import Learner from ray.rllib.core.rl_module.default_model_config import DefaultModelConfig from ray.rllib.core.rl_module.multi_rl_module import MultiRLModuleSpec from ray.rllib.core.rl_module.rl_module import RLModuleSpec from ray.rllib.env.single_agent_episode import SingleAgentEpisode from ray.rllib.policy.sample_batch import MultiAgentBatch, SampleBatch from ray.rllib.offline.offline_prelearner import OfflinePreLearner, SCHEMA from ray.rllib.utils.annotations import override from ray.rllib.utils.typing import EpisodeType, ModuleID class TextOfflinePreLearner(OfflinePreLearner): @override(OfflinePreLearner) def __init__( self, config: "AlgorithmConfig", learner: Union[Learner, List[ActorHandle]] = None, locality_hints: Optional[List[str]] = None, spaces: Optional[Tuple[gym.Space, gym.Space]] = None, module_spec: Optional[MultiRLModuleSpec] = None, module_state: Optional[Dict[ModuleID, Any]] = None, vocabulary: Dict[str, Any] = None, **kwargs: Dict[str, Any], ): self.config = config self.spaces = spaces self.vocabulary = vocabulary self.vocabulary_size = len(self.vocabulary) # Build the `RLModule`. self._module = module_spec.build() if module_state: self._module.set_state(module_state) # Build the learner connector pipeline. 
self._learner_connector = LearnerConnectorPipeline( connectors=[ FrameStackingLearner( num_frames=4, ) ], input_action_space=module_spec.action_space, input_observation_space=module_spec.observation_space, ) self._learner_connector.append( AddObservationsFromEpisodesToBatch(as_learner_connector=True), ) self._learner_connector.append( AddColumnsFromEpisodesToTrainBatch(), ) self._learner_connector.append( BatchIndividualItems(multi_agent=False), ) # Let us run exclusively on CPU, then we can convert here to Tensor. self._learner_connector.append( NumpyToTensor(as_learner_connector=True), ) @override(OfflinePreLearner) def __call__(self, batch: Dict[str, np.ndarray]) -> Dict[str, List[EpisodeType]]: # Convert raw data to episodes. episodes = TextOfflinePreLearner._map_to_episodes( is_multi_agent=False, batch=batch, to_numpy=True, schema=None, input_compress_columns=False, action_space=self.spaces[0], observation_space=self.spaces[1], vocabulary=self.vocabulary, )["episodes"] # Run the learner connector pipeline with the # `FrameStackLearner` piece. batch = self._learner_connector( rl_module=self._module, batch={}, episodes=episodes, shared_data={}, ) # Convert to `MultiAgentBatch` for the learner. batch = MultiAgentBatch( { module_id: SampleBatch(module_data) for module_id, module_data in batch.items() }, # TODO (simon): This can be run once for the batch and the # metrics, but we run it twice: here and later in the learner. env_steps=sum(e.env_steps() for e in episodes), ) # Return the `MultiAgentBatch` under the `"batch"` key. return {"batch": batch} @staticmethod @override(OfflinePreLearner) def _map_to_episodes( is_multi_agent: bool, batch: Dict[str, Union[list, np.ndarray]], schema: Dict[str, str] = SCHEMA, to_numpy: bool = False, input_compress_columns: Optional[List[str]] = None, observation_space: gym.Space = None, action_space: gym.Space = None, vocabulary: Dict[str, Any] = None, **kwargs: Dict[str, Any], ) -> Dict[str, List[EpisodeType]]: # If we have no vocabulary raise an error. if not vocabulary: raise ValueError( "No `vocabulary`. It needs a vocabulary in form of dictionary ", "mapping tokens to their IDs." ) # Define container for episodes. episodes = [] # Data comes in batches of string arrays under the `"text"` key. for text in batch["text"]: # Split the text and tokenize. tokens = text.split(" ") # Encode tokens. encoded = [vocabulary[token] for token in tokens] one_hot_vectors = np.zeros((len(tokens), len(vocabulary), 1, 1)) for i, token in enumerate(tokens): if token in vocabulary: one_hot_vectors[i][vocabulary[token] - 1] = 1.0 # Build the `SingleAgentEpisode`. episode = SingleAgentEpisode( # Generate a unique ID. id_=uuid.uuid4().hex, # agent_id="default_policy", # module_id="default_policy", # We use the starting token with all added tokens as observations. observations=[ohv for ohv in one_hot_vectors], observation_space=observation_space, # Actions are defined to be the "chosen" follow-up token after # given the observation. actions=encoded[1:], action_space=action_space, # Rewards are zero until the end of a sequence. rewards=[0.0 for i in range(len(encoded) - 2)] + [1.0], # The episode is always terminated (as sentences in the dataset are). terminated=True, truncated=False, # No lookback. You want the episode to start at timestep zero. len_lookback_buffer=0, t_started=0, ) # If episodes should be numpy'ized. Some connectors need this. if to_numpy: episode.to_numpy() # Append the episode to the list of episodes. 
                episodes.append(episode)

            # Return a batch with key `"episodes"`.
            return {"episodes": episodes}

    # Define dataset on sample data.
    ds = data.read_text("s3://anonymous@ray-example-data/this.txt")

    # Create a vocabulary.
    tokens = []
    for b in ds.iter_rows():
        tokens.extend(b["text"].split(" "))
    vocabulary = {token: idx for idx, token in enumerate(set(tokens), start=1)}

    # Specify an `RLModule` and wrap it with a `MultiRLModuleSpec`. Note that
    # on the `Learner` side, any `RLModule` is wrapped in a `MultiRLModule`.
    module_spec = MultiRLModuleSpec(
        rl_module_specs={
            "default_policy": RLModuleSpec(
                model_config=DefaultModelConfig(
                    conv_filters=[[16, 4, 2], [32, 4, 2], [64, 4, 2], [128, 4, 2]],
                    conv_activation="relu",
                ),
                inference_only=False,
                module_class=DefaultBCTorchRLModule,
                catalog_class=BCCatalog,
                action_space=gym.spaces.Discrete(len(vocabulary)),
                observation_space=gym.spaces.Box(0.0, 1.0, (len(vocabulary), 1, 1), np.float32),
            ),
        },
    )

    # Take a small batch.
    batch = ds.take_batch(10)

    # Build an instance of your `OfflinePreLearner`.
    oplr = TextOfflinePreLearner(
        config=AlgorithmConfig(),
        spaces=(
            gym.spaces.Discrete(len(vocabulary)),
            gym.spaces.Box(0.0, 1.0, (len(vocabulary), 1, 1), np.float32),
        ),
        module_spec=module_spec,
        vocabulary=vocabulary,
    )

    # Run your `OfflinePreLearner`.
    transformed = oplr(batch)

    # Show the generated batch.
    print(f"Batch: {transformed}")

The ability to fully customize the :py:class:`~ray.rllib.offline.offline_prelearner.OfflinePreLearner` empowers you to design tailored data transformation workflows. This includes defining a specific learner connector pipeline and implementing raw data mapping, enabling multi-step processing of text data from its raw format to a :py:class:`~ray.rllib.policy.sample_batch.MultiAgentBatch`. To integrate your custom :py:class:`~ray.rllib.offline.offline_prelearner.OfflinePreLearner`, simply specify it within your :py:class:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig`:

.. code-block:: python

    from ray.rllib.algorithms.algorithm_config import AlgorithmConfig

    config = (
        AlgorithmConfig()
        .offline_data(
            # Provide your custom `OfflinePreLearner`.
            prelearner_class=TextOfflinePreLearner,
            # Provide special keyword arguments your `OfflinePreLearner` needs.
            prelearner_kwargs={
                "vocabulary": vocabulary,
            },
        )
    )

If these customization capabilities still don't meet your requirements, consider moving to the **Pipeline Level** for even greater flexibility.

Pipeline level
**************

On this level of RLlib's Offline RL API, you can redefine your complete pipeline, from data reading to batch iteration, by overriding the :py:class:`~ray.rllib.offline.offline_data.OfflineData` class. In most cases, however, the other two levels should be sufficient for your requirements. Manipulating the complete pipeline requires careful handling because it can significantly degrade the performance of your pipeline. Study the :py:class:`~ray.rllib.offline.offline_data.OfflineData` class carefully to gain a good understanding of how the default pipeline works before programming your own. There are mainly two methods that define this pipeline:

- The :py:meth:`~ray.rllib.offline.offline_data.OfflineData.__init__` method that defines the data reading process (see the reading sketch after this list).
- The :py:meth:`~ray.rllib.offline.offline_data.OfflineData.sample` method that defines the data mapping and batch iteration.
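Before overriding :py:meth:`~ray.rllib.offline.offline_data.OfflineData.__init__`, it can help to prototype the raw reading step with Ray Data directly. The following is a minimal, hypothetical sketch - not one of the example scripts referenced below - that reuses the example text file from above with :py:meth:`~ray.data.read_binary_files`; the decoding step is only an illustration of the kind of logic you might later move into an overridden ``__init__``:

.. code-block:: python

    import numpy as np

    from ray import data

    # Read raw files as binary blobs; each row holds a "bytes" column.
    ds = data.read_binary_files("s3://anonymous@ray-example-data/this.txt")

    # Decode each blob into the structure your `OfflinePreLearner` expects,
    # here simply a numpy array of the raw byte values.
    ds = ds.map(lambda row: {"obs": np.frombuffer(row["bytes"], dtype=np.uint8)})

    # Peek at a small batch.
    print(ds.take_batch(2))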
For example, consider overriding the :py:meth:`~ray.rllib.offline.offline_data.OfflineData.__init__` method if you have some foundational data transformations, such as transforming image files into numpy arrays.

.. literalinclude:: ../../../rllib/examples/offline_rl/classes/image_offline_data.py
    :language: python

In the code example provided, you define a custom :py:class:`~ray.rllib.offline.offline_data.OfflineData` class to handle the reading and preprocessing of image data, converting it from a binary encoding format into `numpy` arrays. Additionally, you implement a custom :py:class:`~ray.rllib.offline.offline_prelearner.OfflinePreLearner` to process this data further, transforming it into a learner-ready :py:class:`~ray.rllib.policy.sample_batch.MultiAgentBatch` format.

.. literalinclude:: ../../../rllib/examples/offline_rl/classes/image_offline_prelearner.py
    :language: python

This demonstrates how the entire Offline Data Pipeline can be customized with your own logic. You can run the example by using the following code:

.. literalinclude:: ../../../rllib/examples/offline_rl/offline_rl_with_image_data.py
    :language: python

.. tip::
    Consider this approach carefully: in many cases, fully transforming your data into a suitable format before engaging RLlib's offline RL API can be more efficient. For instance, in the example above, you could preprocess the entire image dataset into `numpy` arrays beforehand and utilize RLlib's default :py:class:`~ray.rllib.offline.offline_data.OfflineData` class for subsequent steps.

Monitoring
----------

To effectively monitor your offline data pipeline, leverage :ref:`Ray Data's built-in monitoring capabilities `. Focus on ensuring that all stages of your offline data streaming pipeline are actively processing data. Additionally, keep an eye on the Learner instance, particularly the `learner_update_timer`, which should maintain low values - around `0.02` for small models - to indicate efficient data processing and model updates.

.. note::
    RLlib doesn't include :ref:`Ray Data ` metrics in its results or display them in `TensorBoard` through :ref:`Ray Tune `'s :py:class:`~ray.tune.logger.tensorboardx.TBXLoggerCallback`. It's strongly recommended to enable the :ref:`Ray Dashboard `, accessible at `127.0.0.1:8265`, for comprehensive monitoring and insights.

Input API
---------

You can configure experience input for an agent using the following options:

.. code-block:: python

    def offline_data( self, *, # Specify how to generate experiences: # - A local directory or file glob expression (for example "/tmp/*.json"). # - A cloud storage path or file glob expression (for example "gs://rl/"). # - A list of individual file paths/URIs (for example ["/tmp/1.json", # "s3://bucket/2.json"]). # - A file or directory path in a given `input_filesystem`. input_: Optional[Union[str, Callable[[IOContext], InputReader]]], # Read method for the `ray.data.Dataset` to read in the # offline data from `input_`. The default is `read_parquet` for Parquet # files. See https://docs.ray.io/en/latest/data/api/input_output.html for # more info about available read methods in `ray.data`. input_read_method: Optional[Union[str, Callable]], # Keyword args for `input_read_method`. These # are passed into the read method without checking. Use these # keyword args together with `map_batches_kwargs` and # `iter_batches_kwargs` to tune the performance of the data pipeline.
It # is strongly recommended to rely on Ray Data's automatic read performance # tuning input_read_method_kwargs: Optional[Dict], # Table schema for converting offline data to episodes. # This schema maps the offline data columns to # `ray.rllib.core.columns.Columns`: # `{Columns.OBS: 'o_t', Columns.ACTIONS: 'a_t', ...}`. Columns in # the data set that aren't mapped through this schema are sorted into # episodes' `extra_model_outputs`. If no schema is passed in the default # schema used is `ray.rllib.offline.offline_data.SCHEMA`. If your data set # contains already the names in this schema, no `input_read_schema` is # needed. The same applies, if the offline data is in RLlib's # `EpisodeType` or old `SampleBatch` format input_read_schema: Optional[Dict[str, str]], # Whether offline data is already stored in RLlib's # `EpisodeType` format, i.e. `ray.rllib.env.SingleAgentEpisode` (multi # -agent is planned but not supported, yet). Reading episodes directly # avoids additional transform steps and is usually faster and # therefore the recommended format when your application remains fully # inside of RLlib's schema. The other format is a columnar format and is # agnostic to the RL framework used. Use the latter format, if you are # unsure when to use the data or in which RL framework. The default is # to read column data, i.e. `False`. `input_read_episodes` and # `input_read_sample_batches` can't be `True` at the same time. See # also `output_write_episodes` to define the output data format when # recording. input_read_episodes: Optional[bool], # Whether offline data is stored in RLlib's old # stack `SampleBatch` type. This is usually the case for older data # recorded with RLlib in JSON line format. Reading in `SampleBatch` # data needs extra transforms and might not concatenate episode chunks # contained in different `SampleBatch`es in the data. If possible avoid # to read `SampleBatch`es and convert them in a controlled form into # RLlib's `EpisodeType` (i.e. `SingleAgentEpisode`). The default is # `False`. `input_read_episodes` and `input_read_sample_batches` can't # be True at the same time. input_read_sample_batches: Optional[bool], # Batch size to pull from the data set. This could # differ from the `train_batch_size_per_learner`, if a dataset holds # `EpisodeType` (i.e. `SingleAgentEpisode`) or `SampleBatch`, or any # other data type that contains multiple timesteps in a single row of the # dataset. In such cases a single batch of size # `train_batch_size_per_learner` potentially pulls a multiple of # `train_batch_size_per_learner` timesteps from the offline dataset. The # default is `None` in which the `train_batch_size_per_learner` is pulled. input_read_batch_size: Optional[int], # A cloud filesystem to handle access to cloud storage when # reading experiences. Can be "gcs" for Google Cloud Storage, "s3" for AWS # S3 buckets, "abs" for Azure Blob Storage, or any filesystem supported # by PyArrow. In general the file path is sufficient for accessing data # from public or local storage systems. See # https://arrow.apache.org/docs/python/filesystems.html for details. input_filesystem: Optional[str], # A dictionary holding the kwargs for the filesystem # given by `input_filesystem`. See `gcsfs.GCSFilesystem` for GCS, # `pyarrow.fs.S3FileSystem`, for S3, and `ablfs.AzureBlobFilesystem` for # ABS filesystem arguments. input_filesystem_kwargs: Optional[Dict], # What input columns are compressed with LZ4 in the # input data. 
If data is stored in RLlib's `SingleAgentEpisode` ( # `MultiAgentEpisode` not supported, yet). Note the providing # `rllib.core.columns.Columns.OBS` also tries to decompress # `rllib.core.columns.Columns.NEXT_OBS`. input_compress_columns: Optional[List[str]], # Whether the raw data should be materialized in memory. # This boosts performance, but requires enough memory to avoid an OOM, so # make sure that your cluster has the resources available. For very large # data you might want to switch to streaming mode by setting this to # `False` (default). If your algorithm doesn't need the RLModule in the # Learner connector pipeline or all (learner) connectors are stateless # you should consider setting `materialize_mapped_data` to `True` # instead (and set `materialize_data` to `False`). If your data doesn't # fit into memory and your Learner connector pipeline requires an RLModule # or is stateful, set both `materialize_data` and # `materialize_mapped_data` to `False`. materialize_data: Optional[bool], # Whether the data should be materialized after # running it through the Learner connector pipeline (i.e. after running # the `OfflinePreLearner`). This improves performance, but should only be # used in case the (learner) connector pipeline doesn't require an # RLModule and the (learner) connector pipeline is stateless. For example, # MARWIL's Learner connector pipeline requires the RLModule for value # function predictions and training batches would become stale after some # iterations causing learning degradation or divergence. Also ensure that # your cluster has enough memory available to avoid an OOM. If set to # `True`, make sure that `materialize_data` is set to `False` to # avoid materialization of two datasets. If your data doesn't fit into # memory and your Learner connector pipeline requires an RLModule or is # stateful, set both `materialize_data` and `materialize_mapped_data` to # `False`. materialize_mapped_data: Optional[bool], # Keyword args for the `map_batches` method. These are # passed into the `ray.data.Dataset.map_batches` method when sampling # without checking. If no arguments passed in the default arguments # `{'concurrency': max(2, num_learners), 'zero_copy_batch': True}` is # used. Use these keyword args together with `input_read_method_kwargs` # and `iter_batches_kwargs` to tune the performance of the data pipeline. map_batches_kwargs: Optional[Dict], # Keyword args for the `iter_batches` method. These are # passed into the `ray.data.Dataset.iter_batches` method when sampling # without checking. If no arguments are passed in, the default argument # `{'prefetch_batches': 2}` is used. Use these keyword args # together with `input_read_method_kwargs` and `map_batches_kwargs` to # tune the performance of the data pipeline. iter_batches_kwargs: Optional[Dict], # An optional `OfflinePreLearner` class that's used to # transform data batches in `ray.data.map_batches` used in the # `OfflineData` class to transform data from columns to batches that can # be used in the `Learner.update...()` methods. Override the # `OfflinePreLearner` class and pass your derived class in here, if you # need to make some further transformations specific for your data or # loss. The default is `None`` which uses the base `OfflinePreLearner` # defined in `ray.rllib.offline.offline_prelearner`. prelearner_class: Optional[Type], # An optional `EpisodeReplayBuffer` class is # used to buffer experiences when data is in `EpisodeType` or # RLlib's previous `SampleBatch` type format. 
In this case, a single # data row may contain multiple timesteps and the buffer serves two # purposes: (a) to store intermediate data in memory, and (b) to ensure # that exactly `train_batch_size_per_learner` experiences are sampled # per batch. The default is RLlib's `EpisodeReplayBuffer`. prelearner_buffer_class: Optional[Type], # Optional keyword arguments for initializing the # `EpisodeReplayBuffer`. In most cases this is simply the `capacity` # for the default buffer used (`EpisodeReplayBuffer`), but it may # differ if the `prelearner_buffer_class` uses a custom buffer. prelearner_buffer_kwargs: Optional[Dict], # Number of updates to run in each learner # during a single training iteration. If None, each learner runs a # complete epoch over its data block (the dataset is partitioned into # at least as many blocks as there are learners). The default is `None`. # This must be set to `1`, if a single (local) learner is used. dataset_num_iters_per_learner: Optional[int], ) Output API ---------- You can configure experience output for an agent using the following options: .. code-block:: python def offline_data( # Specify where experiences should be saved: # - None: don't save any experiences # - a path/URI to save to a custom output directory (for example, "s3://bckt/") output: Optional[str], # What sample batch columns to LZ4 compress in the output data. # Note that providing `rllib.core.columns.Columns.OBS` also # compresses `rllib.core.columns.Columns.NEXT_OBS`. output_compress_columns: Optional[List[str]], # Max output file size (in bytes) before rolling over to a new # file. output_max_file_size: Optional[float], # Max output row numbers before rolling over to a new file. output_max_rows_per_file: Optional[int], # Write method for the `ray.data.Dataset` to write the # offline data to `output`. The default is `read_parquet` for Parquet # files. See https://docs.ray.io/en/latest/data/api/input_output.html for # more info about available read methods in `ray.data`. output_write_method: Optional[str], # Keyword arguments for the `output_write_method`. These are # passed into the write method without checking. output_write_method_kwargs: Optional[Dict], # A cloud filesystem to handle access to cloud storage when # writing experiences. Can be "gcs" for Google Cloud Storage, "s3" for AWS # S3 buckets, "abs" for Azure Blob Storage, or any filesystem supported # by PyArrow. In general the file path is sufficient for accessing data # from public or local storage systems. See # https://arrow.apache.org/docs/python/filesystems.html for details. output_filesystem: Optional[str], # A dictionary holding the keyword arguments for the filesystem # given by `output_filesystem`. See `gcsfs.GCSFilesystem` for GCS, # `pyarrow.fs.S3FileSystem`, for S3, and `ablfs.AzureBlobFilesystem` for # ABS filesystem arguments. output_filesystem_kwargs: Optional[Dict], # If data should be recorded in RLlib's `EpisodeType` # format (i.e. `SingleAgentEpisode` objects). Use this format, if you # need data to be ordered in time and directly grouped by episodes for # example to train stateful modules or if you plan to use recordings # exclusively in RLlib. Otherwise data is recorded in tabular (columnar) # format. Default is `True`. output_write_episodes: Optional[bool], --- .. include:: /_includes/rllib/we_are_hiring.rst .. include:: /_includes/rllib/new_api_stack.rst .. 
_replay-buffer-reference-docs: ############## Replay Buffers ############## Quick Intro to Replay Buffers in RL ===================================== When we talk about replay buffers in reinforcement learning, we generally mean a buffer that stores and replays experiences collected from interactions of our agent(s) with the environment. In Python, a simple buffer can be implemented as a list to which elements are added and later sampled from. Such buffers are used mostly in off-policy learning algorithms. This makes sense intuitively because these algorithms can learn from experiences that are stored in the buffer, but were produced by a previous version of the policy (or even a completely different "behavior policy"). Sampling Strategy ----------------- When sampling from a replay buffer, we choose which experiences to train our agent with. A straightforward strategy that has proven effective for many algorithms is to pick these samples uniformly at random. A more advanced strategy (proven better in many cases) is `Prioritized Experience Replay (PER) `__. In PER, single items in the buffer are assigned a (scalar) priority value, which denotes their significance, or in simpler terms, how much we expect to learn from these items. Experiences with a higher priority are more likely to be sampled. Eviction Strategy ----------------- A buffer is naturally limited in its capacity to hold experiences. In the course of running an algorithm, a buffer will eventually reach its capacity, and to make room for new experiences, we need to delete (evict) older ones. This is generally done on a first-in-first-out basis. For your algorithms this means that buffers with a high capacity give the opportunity to learn from older samples, while smaller buffers make the learning process more on-policy. An exception to this strategy is made in buffers that implement reservoir sampling. Replay Buffers in RLlib ======================= RLlib comes with a set of extendable replay buffers built in. All of them support the two basic methods ``add()`` and ``sample()``. We provide a base :py:class:`~ray.rllib.utils.replay_buffers.replay_buffer.ReplayBuffer` class from which you can build your own buffer. In most algorithms, we require :py:class:`~ray.rllib.utils.replay_buffers.multi_agent_replay_buffer.MultiAgentReplayBuffer`\s. This is because we want them to generalize to the multi-agent case. Therefore, these buffers' ``add()`` and ``sample()`` methods require a ``policy_id`` to handle experiences per policy. Have a look at the :py:class:`~ray.rllib.utils.replay_buffers.multi_agent_replay_buffer.MultiAgentReplayBuffer` to get a sense of how it extends our base class. You can find buffer types and arguments to modify their behaviour as part of RLlib's default parameters. They are part of the ``replay_buffer_config``. Basic Usage ----------- You will rarely have to define your own replay buffer sub-class when running an experiment; instead, you typically configure existing buffers. The following example is `from RLlib's examples section `__ and runs the R2D2 algorithm with `PER `__, which R2D2 doesn't use by default. The highlighted lines focus on the PER configuration. .. dropdown:: **Executable example script** :animate: fade-in-slide-down ..
literalinclude:: ../../../rllib/examples/_old_api_stack/replay_buffer_api.py :emphasize-lines: 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70 :language: python :start-after: __sphinx_doc_replay_buffer_api_example_script_begin__ :end-before: __sphinx_doc_replay_buffer_api_example_script_end__ .. tip:: Because of its prevalence, most Q-learning algorithms support PER. The required priority update step is embedded into their training iteration functions. .. warning:: If your custom buffer requires extra interaction, you will have to change the training iteration function, too! Specifying a buffer type works the same way as specifying an exploration type. Here are three ways of specifying a type: .. dropdown:: **Changing a replay buffer configuration** :animate: fade-in-slide-down .. literalinclude:: doc_code/replay_buffer_demo.py :language: python :start-after: __sphinx_doc_replay_buffer_type_specification__begin__ :end-before: __sphinx_doc_replay_buffer_type_specification__end__ Apart from the ``type``, you can also specify the ``capacity`` and other parameters. These parameters are mostly constructor arguments for the buffer. The following categories exist: #. Parameters that define how algorithms interact with replay buffers. e.g. ``worker_side_prioritization`` to decide where to compute priorities #. Constructor arguments to instantiate the replay buffer. e.g. ``capacity`` to limit the buffer's size #. Call arguments for underlying replay buffer methods. e.g. ``prioritized_replay_beta`` is used by the :py:class:`~ray.rllib.utils.replay_buffers.multi_agent_prioritized_replay_buffer.MultiAgentPrioritizedReplayBuffer` to call the ``sample()`` method of every underlying :py:class:`~ray.rllib.utils.replay_buffers.prioritized_replay_buffer.PrioritizedReplayBuffer` .. tip:: Most of the time, only 1. and 2. are of interest. 3. is an advanced feature that supports use cases where a :py:class:`~ray.rllib.utils.replay_buffers.multi_agent_replay_buffer.MultiAgentReplayBuffer` instantiates underlying buffers that need constructor or default call arguments. ReplayBuffer Base Class ----------------------- The base :py:class:`~ray.rllib.utils.replay_buffers.replay_buffer.ReplayBuffer` class only supports storing and replaying experiences in different :py:class:`~ray.rllib.utils.replay_buffers.replay_buffer.StorageUnit`\s. You can add data to the buffer's storage with the ``add()`` method and replay it with the ``sample()`` method. Advanced buffer types add functionality while trying to retain compatibility through inheritance. The following is an example of the most basic scheme of interaction with a :py:class:`~ray.rllib.utils.replay_buffers.replay_buffer.ReplayBuffer`. .. literalinclude:: doc_code/replay_buffer_demo.py :language: python :start-after: __sphinx_doc_replay_buffer_basic_interaction__begin__ :end-before: __sphinx_doc_replay_buffer_basic_interaction__end__ Building your own ReplayBuffer ------------------------------ Here is a toy example of how to implement your own ReplayBuffer class and make SimpleQ use it: .. literalinclude:: doc_code/replay_buffer_demo.py :language: python :start-after: __sphinx_doc_replay_buffer_own_buffer__begin__ :end-before: __sphinx_doc_replay_buffer_own_buffer__end__ For a full implementation, you should consider other methods like ``get_state()`` and ``set_state()``. A more extensive example is `our implementation `__ of reservoir sampling, the :py:class:`~ray.rllib.utils.replay_buffers.reservoir_replay_buffer.ReservoirReplayBuffer`.
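As a rough, hedged sketch of how such a custom buffer typically gets wired into an algorithm's ``replay_buffer_config`` (the class name and the DQN setup below are illustrative assumptions, not taken from the example script above):

.. code-block:: python

    from ray.rllib.algorithms.dqn import DQNConfig
    from ray.rllib.utils.replay_buffers.replay_buffer import ReplayBuffer


    class MyToyReplayBuffer(ReplayBuffer):
        """Illustrative sub-class; a real one would change how items are sampled."""

        def sample(self, num_items, **kwargs):
            # Insert your own sampling logic here. This sketch just defers
            # to the base class's uniform sampling.
            return super().sample(num_items, **kwargs)


    config = (
        DQNConfig()
        .environment("CartPole-v1")
        .training(
            replay_buffer_config={
                # Pass the custom class itself (a registered string name also
                # works for RLlib's built-in buffer types).
                "type": MyToyReplayBuffer,
                # Constructor argument of the buffer (category 2. above).
                "capacity": 50_000,
            }
        )
    )

Any keys you leave out of the dict typically fall back to the algorithm's default ``replay_buffer_config`` values.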
Advanced Usage ============== In RLlib, all replay buffers implement the :py:class:`~ray.rllib.utils.replay_buffers.replay_buffer.ReplayBuffer` interface. Therefore, they support, whenever possible, different :py:class:`~ray.rllib.utils.replay_buffers.replay_buffer.StorageUnit`\s. The storage_unit constructor argument of a replay buffer defines how experiences are stored, and therefore the unit in which they are sampled. When later calling the ``sample()`` method, num_items will relate to said storage_unit. Here is a full example of how to modify the storage_unit and interact with a custom buffer: .. literalinclude:: doc_code/replay_buffer_demo.py :language: python :start-after: __sphinx_doc_replay_buffer_advanced_usage_storage_unit__begin__ :end-before: __sphinx_doc_replay_buffer_advanced_usage_storage_unit__end__ As noted above, RLlib's :py:class:`~ray.rllib.utils.replay_buffers.multi_agent_replay_buffer.MultiAgentReplayBuffer`\s support modification of underlying replay buffers. Under the hood, the :py:class:`~ray.rllib.utils.replay_buffers.multi_agent_replay_buffer.MultiAgentReplayBuffer` stores experiences per policy in separate underlying replay buffers. You can modify their behaviour by specifying an underlying ``replay_buffer_config`` that works the same way as the parent's config. Here is an example of how to create an :py:class:`~ray.rllib.utils.replay_buffers.multi_agent_replay_buffer.MultiAgentReplayBuffer` with an alternative underlying :py:class:`~ray.rllib.utils.replay_buffers.replay_buffer.ReplayBuffer`. The :py:class:`~ray.rllib.utils.replay_buffers.multi_agent_replay_buffer.MultiAgentReplayBuffer` can stay the same. We only need to specify our own buffer along with a default call argument: .. literalinclude:: doc_code/replay_buffer_demo.py :language: python :start-after: __sphinx_doc_replay_buffer_advanced_usage_underlying_buffers__begin__ :end-before: __sphinx_doc_replay_buffer_advanced_usage_underlying_buffers__end__ --- .. include:: /_includes/rllib/we_are_hiring.rst .. _rllib-scaling-guide: RLlib scaling guide =================== .. include:: /_includes/rllib/new_api_stack.rst RLlib is a distributed and scalable RL library, based on `Ray `__. An RLlib :py:class:`~ray.rllib.algorithms.algorithm.Algorithm` uses `Ray actors `__ wherever parallelization of its sub-components can speed up sample and learning throughput. .. figure:: images/scaling_axes_overview.svg :width: 600 :align: left **Scalable axes in RLlib**: Three scaling axes are available across all RLlib :py:class:`~ray.rllib.algorithms.algorithm.Algorithm` classes: - The number of :py:class:`~ray.rllib.env.env_runner.EnvRunner` actors in the :py:class:`~ray.rllib.env.env_runner_group.EnvRunnerGroup`, settable through ``config.env_runners(num_env_runners=n)``. - The number of vectorized sub-environments on each :py:class:`~ray.rllib.env.env_runner.EnvRunner` actor, settable through ``config.env_runners(num_envs_per_env_runner=p)``. - The number of :py:class:`~ray.rllib.core.learner.learner.Learner` actors in the :py:class:`~ray.rllib.core.learner.learner_group.LearnerGroup`, settable through ``config.learners(num_learners=m)``. 
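As a minimal sketch of how these three axes appear together in code (the PPO setup and the concrete numbers are only illustrative):

.. code-block:: python

    from ray.rllib.algorithms.ppo import PPOConfig

    config = (
        PPOConfig()
        .environment("CartPole-v1")
        # Axes 1 and 2: number of remote EnvRunner actors and number of
        # vectorized sub-environments on each of them.
        .env_runners(num_env_runners=4, num_envs_per_env_runner=8)
        # Axis 3: number of remote Learner actors.
        .learners(num_learners=2)
    )

The following sections discuss each of these axes in more detail.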
Scaling the number of EnvRunner actors -------------------------------------- You can control the degree of parallelism for the sampling machinery of the :py:class:`~ray.rllib.algorithms.algorithm.Algorithm` by increasing the number of remote :py:class:`~ray.rllib.env.env_runner.EnvRunner` actors in the :py:class:`~ray.rllib.env.env_runner_group.EnvRunnerGroup` through the config as follows. .. testcode:: from ray.rllib.algorithms.ppo import PPOConfig config = ( PPOConfig() # Use 4 EnvRunner actors (default is 2). .env_runners(num_env_runners=4) ) To assign resources to each :py:class:`~ray.rllib.env.env_runner.EnvRunner`, use these config settings: .. code-block:: python config.env_runners( num_cpus_per_env_runner=.., num_gpus_per_env_runner=.., ) See this `example of an EnvRunner and RL environment requiring a GPU resource `__. The number of GPUs may be fractional quantities, for example 0.5, to allocate only a fraction of a GPU per :py:class:`~ray.rllib.env.env_runner.EnvRunner`. Note that there's always one "local" :py:class:`~ray.rllib.env.env_runner.EnvRunner` in the :py:class:`~ray.rllib.env.env_runner_group.EnvRunnerGroup`. If you only want to sample using this local :py:class:`~ray.rllib.env.env_runner.EnvRunner`, set ``num_env_runners=0``. This local :py:class:`~ray.rllib.env.env_runner.EnvRunner` directly sits in the main :py:class:`~ray.rllib.algorithms.algorithm.Algorithm` process. .. hint:: The Ray team may decide to deprecate the local :py:class:`~ray.rllib.env.env_runner.EnvRunner` some time in the future. It still exists for historical reasons. Its usefulness is still under debate. Scaling the number of envs per EnvRunner actor ---------------------------------------------- RLlib vectorizes :ref:`RL environments ` on :py:class:`~ray.rllib.env.env_runner.EnvRunner` actors through gymnasium's `VectorEnv `__ API. To create more than one environment copy per :py:class:`~ray.rllib.env.env_runner.EnvRunner`, set the following in your config: .. testcode:: from ray.rllib.algorithms.ppo import PPOConfig config = ( PPOConfig() # Use 10 sub-environments (vector) per EnvRunner. .env_runners(num_envs_per_env_runner=10) ) .. note:: Unlike single-agent environments, RLlib can't vectorize multi-agent setups yet. The Ray team is working on a solution for this restriction by utilizing the custom vectorization feature of `gymnasium >= 1.x`. Doing so allows the :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` on the :py:class:`~ray.rllib.env.env_runner.EnvRunner` to run inference on a batch of data and thus compute actions for all sub-environments in parallel. By default, the individual sub-environments in a vector ``step`` and ``reset`` in sequence, making only the action computation of the RL environment loop parallel, because observations can move through the model in a batch. However, `gymnasium `__ supports an asynchronous vectorization setting, in which each sub-environment receives its own Python process. This way, the vector environment can ``step`` or ``reset`` in parallel. Activate this asynchronous vectorization behavior through: .. testcode:: import gymnasium as gym config.env_runners( gym_env_vectorize_mode=gym.envs.registration.VectorizeMode.ASYNC, # default is `SYNC` ) This setting can speed up the sampling process significantly in combination with ``num_envs_per_env_runner > 1``, especially when your RL environment's stepping process is time-consuming. See this `example script `__ that demonstrates a massive speedup with async vectorization.
Scaling the number of Learner actors ------------------------------------ Learning updates happen in the :py:class:`~ray.rllib.core.learner.learner_group.LearnerGroup`, which manages either a single, local :py:class:`~ray.rllib.core.learner.learner.Learner` instance or any number of remote :py:class:`~ray.rllib.core.learner.learner.Learner` actors. Set the number of remote :py:class:`~ray.rllib.core.learner.learner.Learner` actors through: .. testcode:: from ray.rllib.algorithms.ppo import PPOConfig config = ( PPOConfig() # Use 2 remote Learner actors (default is 0) for distributed data parallelism. # Choosing 0 creates a local Learner instance on the main Algorithm process. .learners(num_learners=2) ) Typically, you use as many :py:class:`~ray.rllib.core.learner.learner.Learner` actors as you have GPUs available for training. Make sure to set the number of GPUs per :py:class:`~ray.rllib.core.learner.learner.Learner` to 1: .. testcode:: config.learners(num_gpus_per_learner=1) .. warning:: For some algorithms, such as IMPALA and APPO, the performance of a single remote :py:class:`~ray.rllib.core.learner.learner.Learner` actor (``num_learners=1``) compared to a single local :py:class:`~ray.rllib.core.learner.learner.Learner` instance (``num_learners=0``) depends on whether you have a GPU available or not. If exactly one GPU is available, you should run these two algorithms with ``num_learners=0, num_gpus_per_learner=1``; if no GPU is available, set ``num_learners=1, num_gpus_per_learner=0``. If more than 1 GPU is available, set ``num_learners=.., num_gpus_per_learner=1``. The number of GPUs may be fractional quantities, for example 0.5, to allocate only a fraction of a GPU per :py:class:`~ray.rllib.core.learner.learner.Learner`. For example, you can pack five :py:class:`~ray.rllib.algorithms.algorithm.Algorithm` instances onto one GPU by setting ``num_learners=1, num_gpus_per_learner=0.2``. See this `fractional GPU example `__ for details. .. note:: If you specify ``num_gpus_per_learner > 0`` and your machine doesn't have the required number of GPUs available, the experiment may stall until the Ray autoscaler brings up enough machines to fulfill the resource request. If your cluster has autoscaling turned off, this setting then results in a seemingly hanging experiment run. On the other hand, if you set ``num_gpus_per_learner=0``, RLlib builds the :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` instances solely on CPUs, even if GPUs are available on the cluster. Outlook: More RLlib elements that should scale ---------------------------------------------- There are other components and aspects in RLlib that should be able to scale up. For example, the model size is limited to whatever fits on a single GPU, due to "distributed data parallel" (DDP) being the only way in which RLlib scales :py:class:`~ray.rllib.core.learner.learner.Learner` actors. The Ray team is working on closing these gaps. In particular, future areas of improvement are: - Enable **training very large models**, such as a "large language model" (LLM). The team is actively working on a "Reinforcement Learning from Human Feedback" (RLHF) prototype setup. The main problems to solve are the model-parallel and tensor-parallel distribution across multiple GPUs, as well as a reasonably fast transfer of weights between Ray actors. - Enable training with **thousands of multi-agent policies**.
A possible solution for this scaling problem could be to split up the :py:class:`~ray.rllib.core.rl_module.multi_rl_module.MultiRLModule` into manageable groups of individual policies across the various :py:class:`~ray.rllib.env.env_runner.EnvRunner` and :py:class:`~ray.rllib.core.learner.learner.Learner` actors. - Enable **vector envs for multi-agent**. --- .. include:: /_includes/rllib/we_are_hiring.rst .. _single-agent-episode-docs: Episodes ======== .. include:: /_includes/rllib/new_api_stack.rst RLlib stores and transports all trajectory data in the form of `Episodes`, in particular :py:class:`~ray.rllib.env.single_agent_episode.SingleAgentEpisode` for single-agent setups and :py:class:`~ray.rllib.env.multi_agent_episode.MultiAgentEpisode` for multi-agent setups. The data is translated from this `Episode` format to tensor batches (including a possible move to the GPU) only immediately before a neural network forward pass by so-called :ref:`connector pipelines `. .. figure:: images/episodes/usage_of_episodes.svg :width: 750 :align: left **Episodes** are the main vehicle to store and transport trajectory data across the different components of RLlib (for example from `EnvRunner` to `Learner` or from `ReplayBuffer` to `Learner`). One of the main design principles of RLlib's new API stack is that all trajectory data is kept in such episodic form for as long as possible. Only immediately before the neural network passes, :ref:`connector pipelines ` translate lists of Episodes into tensor batches. See the section on :ref:`Connectors and Connector pipelines here ` for more details. The main advantage of collecting and moving around data in such a trajectory-as-a-whole format (as opposed to tensor batches) is that it offers 360° visibility and full access to the RL environment's history. This means users can extract arbitrary pieces of information from episodes to be further processed by their custom components. Think of a transformer model requiring not only the most recent observation to compute the next action, but instead the whole sequence of the last n observations. Using :py:meth:`~ray.rllib.env.single_agent_episode.SingleAgentEpisode.get_observations`, a user can easily extract this information inside their custom :py:class:`~ray.rllib.connectors.connector_v2.ConnectorV2` pipeline and add the data to the neural network batch. Another advantage of episodes over batches is their more efficient memory footprint. For example, an algorithm like DQN needs to have both observations and next observations (to compute the TD error-based loss) in the train batch, thereby duplicating an already large observation tensor. Using episode objects most of the time reduces the memory footprint to a single observation track, which contains all observations from reset to terminal. This page explains in detail what working with RLlib's Episode APIs looks like. SingleAgentEpisode ================== This page describes the single-agent case only. .. note:: The Ray team is working on a detailed description of the multi-agent case, analogous to this page here, but for :py:class:`~ray.rllib.env.multi_agent_episode.MultiAgentEpisode`. Creating a SingleAgentEpisode ----------------------------- RLlib usually takes care of creating :py:class:`~ray.rllib.env.single_agent_episode.SingleAgentEpisode` instances and moving them around, for example from :py:class:`~ray.rllib.env.env_runner.EnvRunner` to :py:class:`~ray.rllib.core.learner.learner.Learner`.
However, here is how to manually generate and fill an initially empty episode with dummy data: .. literalinclude:: doc_code/sa_episode.py :language: python :start-after: rllib-sa-episode-01-begin :end-before: rllib-sa-episode-01-end The :py:class:`~ray.rllib.env.single_agent_episode.SingleAgentEpisode` constructed and filled in the preceding code should now roughly look like this: .. figure:: images/episodes/sa_episode.svg :width: 750 :align: left **(Single-agent) Episode**: The episode starts with a single observation (the "reset observation"), then continues on each timestep with a 3-tuple of `(observation, action, reward)`. Note that because of the reset observation, every episode - at each timestep - always contains one more observation than it contains actions or rewards. Important additional properties of an Episode are its `id_` (str) and `terminated/truncated` (bool) flags. See further below for a detailed description of the :py:class:`~ray.rllib.env.single_agent_episode.SingleAgentEpisode` APIs exposed to the user. Using the getter APIs of SingleAgentEpisode ------------------------------------------- Now that there is a :py:class:`~ray.rllib.env.single_agent_episode.SingleAgentEpisode` to work with, one can explore and extract information from this episode using its different "getter" methods: .. figure:: images/episodes/sa_episode_getters.svg :width: 750 :align: left **SingleAgentEpisode getter APIs**: "getter" methods exist for all of the Episode's fields, which are `observations`, `actions`, `rewards`, `infos`, and `extra_model_outputs`. For simplicity, only the getters for observations, actions, and rewards are shown here. Their behavior is intuitive, returning a single item when provided with a single index and returning a list of items (in the non-numpy'ized case; see further below) when provided with a list of indices or a slice of indices. Note that for `extra_model_outputs`, the getter is slightly more complicated as there exist sub-keys in this data (for example: `action_logp`). See :py:meth:`~ray.rllib.env.single_agent_episode.SingleAgentEpisode.get_extra_model_outputs` for more information. The following code snippet summarizes the various capabilities of the different getter methods: .. literalinclude:: doc_code/sa_episode.py :language: python :start-after: rllib-sa-episode-02-begin :end-before: rllib-sa-episode-02-end Numpy'ized and non-numpy'ized Episodes -------------------------------------- The data in a :py:class:`~ray.rllib.env.single_agent_episode.SingleAgentEpisode` can exist in two states: non-numpy'ized and numpy'ized. A non-numpy'ized episode stores its data items in plain Python lists and appends new timestep data to these. In a numpy'ized episode, these lists have been converted into possibly complex structures that have NumPy arrays at their leaves. Note that a numpy'ized episode doesn't necessarily have to be terminated or truncated yet in the sense that the underlying RL environment declared the episode to be over or has reached some maximum number of timesteps. .. figure:: images/episodes/sa_episode_non_finalized_vs_finalized.svg :width: 900 :align: left :py:class:`~ray.rllib.env.single_agent_episode.SingleAgentEpisode` objects start in the non-numpy'ized state, in which data is stored in Python lists, making it very fast to append data from an ongoing episode: ..
literalinclude:: doc_code/sa_episode.py :language: python :start-after: rllib-sa-episode-03-begin :end-before: rllib-sa-episode-03-end To illustrate the differences between the data stored in a non-numpy'ized episode vs. the same data stored in a numpy'ized one, take a look at this complex observation example here, showing the exact same observation data in two episodes (one non-numpy'ized, the other numpy'ized): .. figure:: images/episodes/sa_episode_non_finalized.svg :width: 800 :align: left **Complex observations in a non-numpy'ized episode**: Each individual observation is a (complex) dict matching the gymnasium environment's observation space. There are three such observation items stored in the episode so far. .. figure:: images/episodes/sa_episode_finalized.svg :width: 600 :align: left **Complex observations in a numpy'ized episode**: The entire observation record is a single complex dict matching the gymnasium environment's observation space. At the leaves of the structure are `NDArrays` holding the individual values of the leaf. Note that these `NDArrays` have an extra batch dim (axis=0), whose length matches the length of the episode stored (here 3). Episode.cut() and lookback buffers ---------------------------------- During sample collection from an RL environment, the :py:class:`~ray.rllib.env.env_runner.EnvRunner` sometimes has to stop appending data to an ongoing (non-terminated) :py:class:`~ray.rllib.env.single_agent_episode.SingleAgentEpisode` to return the data collected thus far. The `EnvRunner` then calls :py:meth:`~ray.rllib.env.single_agent_episode.SingleAgentEpisode.cut` on the :py:class:`~ray.rllib.env.single_agent_episode.SingleAgentEpisode` object, which returns a new episode chunk, with which collection can continue in the next round of sampling. .. literalinclude:: doc_code/sa_episode.py :language: python :start-after: rllib-sa-episode-04-begin :end-before: rllib-sa-episode-04-end Note that a "lookback" mechanism exists to allow for connectors to look back into the `H` previous timesteps of the cut episode from within the continuation chunk, where `H` is a configurable parameter. .. figure:: images/episodes/sa_episode_cut_and_lookback.svg :width: 800 :align: left The default lookback horizon (`H`) is 1. This means you can - after a `cut()` - still access the most recent action (`get_actions(-1)`), the most recent reward (`get_rewards(-1)`), and the two most recent observations (`get_observations([-2, -1])`). If you would like to be able to access data further in the past, change this setting in your :py:class:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig`: .. testcode:: :hide: from ray.rllib.algorithms.algorithm_config import AlgorithmConfig .. testcode:: config = AlgorithmConfig() # Change the lookback horizon setting, in case your connector pipelines need # to access data further in the past. config.env_runners(episode_lookback_horizon=10) Lookback Buffers and getters in more Detail ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The following code demonstrates more options available to users of the :py:class:`~ray.rllib.env.single_agent_episode.SingleAgentEpisode` getter APIs to access information further in the past (inside the lookback buffers). Imagine having to write a connector piece that has to add the last 5 rewards to the tensor batch used by your model's action-computing forward pass: ..
literalinclude:: doc_code/sa_episode.py :language: python :start-after: rllib-sa-episode-05-begin :end-before: rllib-sa-episode-05-end Another useful getter argument (besides `fill`) is the `neg_index_as_lookback` boolean argument. If set to True, negative indices are not interpreted as "from the end", but as "into the lookback buffer". This allows you to loop over a range of global timesteps while looking back a certain amount of timesteps from each of these global timesteps: .. literalinclude:: doc_code/sa_episode.py :language: python :start-after: rllib-sa-episode-06-begin :end-before: rllib-sa-episode-06-end --- .. include:: /_includes/rllib/we_are_hiring.rst .. _rllib-guides: User Guides =========== .. include:: /_includes/rllib/new_api_stack.rst .. toctree:: :hidden: rllib-advanced-api rllib-callback checkpoints metrics-logger single-agent-episode connector-v2 rllib-replay-buffers rllib-offline rl-modules rllib-learner rllib-fault-tolerance rllib-dev scaling-guide .. _rllib-feature-guide: RLlib Feature Guides -------------------- .. grid:: 1 2 3 4 :gutter: 1 :class-container: container pb-3 .. grid-item-card:: :img-top: /rllib/images/rllib-logo.svg :class-img-top: pt-2 w-75 d-block mx-auto fixed-height-img .. button-ref:: rllib-advanced-api-doc Advanced features of the RLlib python API .. grid-item-card:: :img-top: /rllib/images/rllib-logo.svg :class-img-top: pt-2 w-75 d-block mx-auto fixed-height-img .. button-ref:: rllib-callback Injecting custom code into RLlib through callbacks .. grid-item-card:: :img-top: /rllib/images/rllib-logo.svg :class-img-top: pt-2 w-75 d-block mx-auto fixed-height-img .. button-ref:: metrics-logger Logging metrics and statistics from custom code .. grid-item-card:: :img-top: /rllib/images/rllib-logo.svg :class-img-top: pt-2 w-75 d-block mx-auto fixed-height-img .. button-ref:: checkpoints Checkpointing your experiments and models .. grid-item-card:: :img-top: /rllib/images/rllib-logo.svg :class-img-top: pt-2 w-75 d-block mx-auto fixed-height-img .. button-ref:: single-agent-episode How to process trajectories through episodes .. grid-item-card:: :img-top: /rllib/images/rllib-logo.svg :class-img-top: pt-2 w-75 d-block mx-auto fixed-height-img .. button-ref:: connector-v2 How To Use Connectors and Connector pipelines? .. grid-item-card:: :img-top: /rllib/images/rllib-logo.svg :class-img-top: pt-2 w-75 d-block mx-auto fixed-height-img .. button-ref:: rllib-offline Offline RL with offline datasets .. grid-item-card:: :img-top: /rllib/images/rllib-logo.svg :class-img-top: pt-2 w-75 d-block mx-auto fixed-height-img .. button-ref:: rllib-replay-buffers Working with replay buffers .. grid-item-card:: :img-top: /rllib/images/rllib-logo.svg :class-img-top: pt-2 w-75 d-block mx-auto fixed-height-img .. button-ref:: rllib-dev Contribute to RLlib .. grid-item-card:: :img-top: /rllib/images/rllib-logo.svg :class-img-top: pt-2 w-75 d-block mx-auto fixed-height-img .. button-ref:: scaling-guide How to run RLlib experiments at scale --- .. _train-api: Ray Train API ============= .. currentmodule:: ray .. important:: These API references are for the revamped Ray Train V2 implementation that is available starting from Ray 2.43 by enabling the environment variable ``RAY_TRAIN_V2_ENABLED=1``. These APIs assume that the environment variable has been enabled. See :ref:`train-deprecated-api` for the old API references and the `Ray Train V2 Migration Guide `_. PyTorch Ecosystem ----------------- .. 
autosummary:: :nosignatures: :toctree: doc/ ~train.torch.TorchTrainer ~train.torch.TorchConfig ~train.torch.xla.TorchXLAConfig .. _train-pytorch-integration: PyTorch ~~~~~~~ .. autosummary:: :nosignatures: :toctree: doc/ ~train.torch.get_device ~train.torch.get_devices ~train.torch.prepare_model ~train.torch.prepare_data_loader ~train.torch.enable_reproducibility .. _train-lightning-integration: PyTorch Lightning ~~~~~~~~~~~~~~~~~ .. autosummary:: :nosignatures: :toctree: doc/ ~train.lightning.prepare_trainer ~train.lightning.RayLightningEnvironment ~train.lightning.RayDDPStrategy ~train.lightning.RayFSDPStrategy ~train.lightning.RayDeepSpeedStrategy ~train.lightning.RayTrainReportCallback .. _train-transformers-integration: Hugging Face Transformers ~~~~~~~~~~~~~~~~~~~~~~~~~ .. autosummary:: :nosignatures: :toctree: doc/ ~train.huggingface.transformers.prepare_trainer ~train.huggingface.transformers.RayTrainReportCallback More Frameworks --------------- TensorFlow/Keras ~~~~~~~~~~~~~~~~ .. autosummary:: :nosignatures: :toctree: doc/ ~train.tensorflow.TensorflowTrainer ~train.tensorflow.TensorflowConfig ~train.tensorflow.prepare_dataset_shard ~train.tensorflow.keras.ReportCheckpointCallback XGBoost ~~~~~~~ .. autosummary:: :nosignatures: :toctree: doc/ ~train.xgboost.XGBoostTrainer ~train.xgboost.RayTrainReportCallback LightGBM ~~~~~~~~ .. autosummary:: :nosignatures: :toctree: doc/ ~train.lightgbm.LightGBMTrainer ~train.lightgbm.get_network_params ~train.lightgbm.RayTrainReportCallback JAX ~~~ .. autosummary:: :nosignatures: :toctree: doc/ ~train.v2.jax.JaxTrainer .. _ray-train-configs-api: Ray Train Configuration ----------------------- .. autosummary:: :nosignatures: :toctree: doc/ ~train.CheckpointConfig ~train.DataConfig ~train.FailureConfig ~train.RunConfig ~train.ScalingConfig .. _train-loop-api: Ray Train Utilities ------------------- **Classes** .. autosummary:: :nosignatures: :toctree: doc/ ~train.Checkpoint ~train.CheckpointUploadMode ~train.CheckpointConsistencyMode ~train.TrainContext **Functions** .. autosummary:: :nosignatures: :toctree: doc/ ~train.get_all_reported_checkpoints ~train.get_checkpoint ~train.get_context ~train.get_dataset_shard ~train.report **Collective** .. autosummary:: :nosignatures: :toctree: doc/ ~train.collective.barrier ~train.collective.broadcast_from_rank_zero Ray Train Output ---------------- .. autosummary:: :nosignatures: :template: autosummary/class_without_autosummary.rst :toctree: doc/ ~train.ReportedCheckpoint ~train.Result Ray Train Errors ---------------- .. autosummary:: :nosignatures: :template: autosummary/class_without_autosummary.rst :toctree: doc/ ~train.ControllerError ~train.WorkerGroupError ~train.TrainingFailedError Ray Tune Integration Utilities ------------------------------ .. autosummary:: :nosignatures: :toctree: doc/ tune.integration.ray_train.TuneReportCallback Ray Train Developer APIs ------------------------ Trainer Base Class ~~~~~~~~~~~~~~~~~~ .. autosummary:: :nosignatures: :toctree: doc/ ~train.v2.api.data_parallel_trainer.DataParallelTrainer Train Backend Base Classes ~~~~~~~~~~~~~~~~~~~~~~~~~~ .. _train-backend: .. _train-backend-config: .. autosummary:: :nosignatures: :toctree: doc/ :template: autosummary/class_without_autosummary.rst ~train.backend.Backend ~train.backend.BackendConfig Trainer Callbacks ~~~~~~~~~~~~~~~~~ .. autosummary:: :nosignatures: :toctree: doc/ ~train.UserCallback --- :orphan: .. _train-deprecated-api: Ray Train V1 API ================ .. currentmodule:: ray .. 
important:: Ray Train V2 is an overhaul of Ray Train's implementation and select APIs, which can be enabled by setting the environment variable ``RAY_TRAIN_V2_ENABLED=1`` starting in Ray 2.43. This page contains the deprecated V1 API references. See :ref:`train-api` for the new V2 API references and the `Ray Train V2 Migration Guide `_. PyTorch Ecosystem ----------------- .. autosummary:: :nosignatures: :toctree: doc/ ~train.torch.torch_trainer.TorchTrainer ~train.torch.TorchConfig ~train.torch.xla.TorchXLAConfig PyTorch ~~~~~~~ .. autosummary:: :nosignatures: :toctree: doc/ ~train.torch.get_device ~train.torch.get_devices ~train.torch.prepare_model ~train.torch.prepare_data_loader ~train.torch.enable_reproducibility PyTorch Lightning ~~~~~~~~~~~~~~~~~ .. autosummary:: :nosignatures: :toctree: doc/ ~train.lightning.prepare_trainer ~train.lightning.RayLightningEnvironment ~train.lightning.RayDDPStrategy ~train.lightning.RayFSDPStrategy ~train.lightning.RayDeepSpeedStrategy ~train.lightning.RayTrainReportCallback Hugging Face Transformers ~~~~~~~~~~~~~~~~~~~~~~~~~ .. autosummary:: :nosignatures: :toctree: doc/ ~train.huggingface.transformers.prepare_trainer ~train.huggingface.transformers.RayTrainReportCallback More Frameworks --------------- TensorFlow/Keras ~~~~~~~~~~~~~~~~ .. autosummary:: :nosignatures: :toctree: doc/ ~train.tensorflow.tensorflow_trainer.TensorflowTrainer ~train.tensorflow.TensorflowConfig ~train.tensorflow.prepare_dataset_shard ~train.tensorflow.keras.ReportCheckpointCallback Horovod ~~~~~~~ .. autosummary:: :nosignatures: :toctree: doc/ ~train.horovod.HorovodTrainer ~train.horovod.HorovodConfig XGBoost ~~~~~~~ .. autosummary:: :nosignatures: :toctree: doc/ ~train.xgboost.xgboost_trainer.XGBoostTrainer ~train.xgboost.RayTrainReportCallback LightGBM ~~~~~~~~ .. autosummary:: :nosignatures: :toctree: doc/ ~train.lightgbm.lightgbm_trainer.LightGBMTrainer ~train.lightgbm.RayTrainReportCallback Ray Train Configuration ----------------------- .. autosummary:: :nosignatures: :toctree: doc/ ~air.config.ScalingConfig ~air.config.RunConfig ~air.config.FailureConfig ~train.CheckpointConfig ~train.DataConfig ~train.SyncConfig Ray Train Utilities ------------------- **Classes** .. autosummary:: :nosignatures: :toctree: doc/ ~train.Checkpoint ~train.context.TrainContext **Functions** .. autosummary:: :nosignatures: :toctree: doc/ ~train._internal.session.get_checkpoint ~train.context.get_context ~train._internal.session.get_dataset_shard ~train._internal.session.report Ray Train Output ---------------- .. autosummary:: :nosignatures: :toctree: doc/ ~train.Result Ray Train Errors ---------------- .. autosummary:: :nosignatures: :template: autosummary/class_without_autosummary.rst :toctree: doc/ ~train.error.SessionMisuseError ~train.base_trainer.TrainingFailedError Ray Train Developer APIs ------------------------ Trainer Base Classes ~~~~~~~~~~~~~~~~~~~~ .. autosummary:: :nosignatures: :toctree: doc/ ~train.trainer.BaseTrainer ~train.data_parallel_trainer.DataParallelTrainer Train Backend Base Classes ~~~~~~~~~~~~~~~~~~~~~~~~~~ .. autosummary:: :nosignatures: :toctree: doc/ :template: autosummary/class_without_autosummary.rst ~train.backend.Backend ~train.backend.BackendConfig --- .. _train-benchmarks: Ray Train Benchmarks ==================== Below we document key performance benchmarks for common Ray Train tasks and workflows. .. 
_pytorch_gpu_training_benchmark: GPU image training ------------------ This task uses the TorchTrainer module to train different amounts of data using a PyTorch ResNet model. We test the performance across different cluster sizes and data sizes. - `GPU image training script`_ - `GPU training small cluster configuration`_ - `GPU training large cluster configuration`_ .. note:: For multi-host distributed training, on AWS we need to ensure EC2 instances are in the same VPC and all ports are open in the security group. .. list-table:: * - **Cluster Setup** - **Data Size** - **Performance** - **Command** * - 1 g3.8xlarge node (1 worker) - 1 GB (1623 images) - 79.76 s (2 epochs, 40.7 images/sec) - `python pytorch_training_e2e.py --data-size-gb=1` * - 1 g3.8xlarge node (1 worker) - 20 GB (32460 images) - 1388.33 s (2 epochs, 46.76 images/sec) - `python pytorch_training_e2e.py --data-size-gb=20` * - 4 g3.16xlarge nodes (16 workers) - 100 GB (162300 images) - 434.95 s (2 epochs, 746.29 images/sec) - `python pytorch_training_e2e.py --data-size-gb=100 --num-workers=16` .. _pytorch-training-parity: PyTorch training parity ----------------------- This task checks the performance parity between native PyTorch Distributed and Ray Train's distributed TorchTrainer. We demonstrate that the performance is similar (within 2.5\%) between the two frameworks. Performance may vary greatly across different model, hardware, and cluster configurations. The reported times are for the raw training times. There is an unreported constant setup overhead of a few seconds for both methods that is negligible for longer training runs. - `PyTorch comparison training script`_ - `PyTorch comparison CPU cluster configuration`_ - `PyTorch comparison GPU cluster configuration`_ .. list-table:: * - **Cluster Setup** - **Dataset** - **Performance** - **Command** * - 4 m5.2xlarge nodes (4 workers) - FashionMNIST - 196.64 s (vs 194.90 s PyTorch) - `python workloads/torch_benchmark.py run --num-runs 3 --num-epochs 20 --num-workers 4 --cpus-per-worker 8` * - 4 m5.2xlarge nodes (16 workers) - FashionMNIST - 430.88 s (vs 475.97 s PyTorch) - `python workloads/torch_benchmark.py run --num-runs 3 --num-epochs 20 --num-workers 16 --cpus-per-worker 2` * - 4 g4dn.12xlarge nodes (16 workers) - FashionMNIST - 149.80 s (vs 146.46 s PyTorch) - `python workloads/torch_benchmark.py run --num-runs 3 --num-epochs 20 --num-workers 16 --cpus-per-worker 4 --use-gpu` .. _tf-training-parity: TensorFlow training parity -------------------------- This task checks the performance parity between native TensorFlow Distributed and Ray Train's distributed TensorflowTrainer. We demonstrate that the performance is similar (within 1\%) between the two frameworks. Performance may vary greatly across different model, hardware, and cluster configurations. The reported times are for the raw training times. There is an unreported constant setup overhead of a few seconds for both methods that is negligible for longer training runs. .. note:: The batch size and number of epochs are different for the GPU benchmark, resulting in a longer runtime. - `TensorFlow comparison training script`_ - `TensorFlow comparison CPU cluster configuration`_ - `TensorFlow comparison GPU cluster configuration`_ ..
list-table:: * - **Cluster Setup** - **Dataset** - **Performance** - **Command** * - 4 m5.2xlarge nodes (4 workers) - FashionMNIST - 78.81 s (versus 79.67 s TensorFlow) - `python workloads/tensorflow_benchmark.py run --num-runs 3 --num-epochs 20 --num-workers 4 --cpus-per-worker 8` * - 4 m5.2xlarge nodes (16 workers) - FashionMNIST - 64.57 s (versus 67.45 s TensorFlow) - `python workloads/tensorflow_benchmark.py run --num-runs 3 --num-epochs 20 --num-workers 16 --cpus-per-worker 2` * - 4 g4dn.12xlarge nodes (16 workers) - FashionMNIST - 465.16 s (versus 461.74 s TensorFlow) - `python workloads/tensorflow_benchmark.py run --num-runs 3 --num-epochs 200 --num-workers 16 --cpus-per-worker 4 --batch-size 64 --use-gpu` .. _xgboost-benchmark: XGBoost training ---------------- This task uses the XGBoostTrainer module to train on different sizes of data with different amounts of parallelism to show near-linear scaling from distributed data parallelism. XGBoost parameters were kept as defaults for ``xgboost==1.7.6`` this task. - `XGBoost Training Script`_ - `XGBoost Cluster Configuration`_ .. list-table:: * - **Cluster Setup** - **Number of distributed training workers** - **Data Size** - **Performance** - **Command** * - 1 m5.4xlarge node with 16 CPUs - 1 training worker using 12 CPUs, leaving 4 CPUs for Ray Data tasks - 10 GB (26M rows) - 310.22 s - `python train_batch_inference_benchmark.py "xgboost" --size=10GB` * - 10 m5.4xlarge nodes - 10 training workers (one per node), using 10x12 CPUs, leaving 10x4 CPUs for Ray Data tasks - 100 GB (260M rows) - 326.86 s - `python train_batch_inference_benchmark.py "xgboost" --size=100GB` .. _`GPU image training script`: https://github.com/ray-project/ray/blob/cec82a1ced631525a4d115e4dc0c283fa4275a7f/release/air_tests/air_benchmarks/workloads/pytorch_training_e2e.py#L95-L106 .. _`GPU training small cluster configuration`: https://github.com/ray-project/ray/blob/master/release/air_tests/air_benchmarks/compute_gpu_1_aws.yaml#L6-L24 .. _`GPU training large cluster configuration`: https://github.com/ray-project/ray/blob/master/release/air_tests/air_benchmarks/compute_gpu_4x4_aws.yaml#L5-L25 .. _`PyTorch comparison training script`: https://github.com/ray-project/ray/blob/master/release/air_tests/air_benchmarks/workloads/torch_benchmark.py .. _`PyTorch comparison CPU cluster configuration`: https://github.com/ray-project/ray/blob/master/release/air_tests/air_benchmarks/compute_cpu_4_aws.yaml .. _`PyTorch comparison GPU cluster configuration`: https://github.com/ray-project/ray/blob/master/release/air_tests/air_benchmarks/compute_gpu_4x4_aws.yaml .. _`TensorFlow comparison training script`: https://github.com/ray-project/ray/blob/master/release/air_tests/air_benchmarks/workloads/tensorflow_benchmark.py .. _`TensorFlow comparison CPU cluster configuration`: https://github.com/ray-project/ray/blob/master/release/air_tests/air_benchmarks/compute_cpu_4_aws.yaml .. _`TensorFlow comparison GPU cluster configuration`: https://github.com/ray-project/ray/blob/master/release/air_tests/air_benchmarks/compute_gpu_4x4_aws.yaml .. _`XGBoost Training Script`: https://github.com/ray-project/ray/blob/9ac58f4efc83253fe63e280106f959fe317b1104/release/train_tests/xgboost_lightgbm/train_batch_inference_benchmark.py .. 
_`XGBoost Cluster Configuration`: https://github.com/ray-project/ray/tree/9ac58f4efc83253fe63e280106f959fe317b1104/release/train_tests/xgboost_lightgbm --- Configure scale and GPUs ------------------------ Outside of your training function, create a :class:`~ray.train.ScalingConfig` object to configure: 1. :class:`num_workers ` - The number of distributed training worker processes. 2. :class:`use_gpu ` - Whether each worker should use a GPU (or CPU). .. testcode:: from ray.train import ScalingConfig scaling_config = ScalingConfig(num_workers=2, use_gpu=True) For more details, see :ref:`train_scaling_config`. Configure persistent storage ---------------------------- Create a :class:`~ray.train.RunConfig` object to specify the path where results (including checkpoints and artifacts) will be saved. .. testcode:: from ray.train import RunConfig # Local path (/some/local/path/unique_run_name) run_config = RunConfig(storage_path="/some/local/path", name="unique_run_name") # Shared cloud storage URI (s3://bucket/unique_run_name) run_config = RunConfig(storage_path="s3://bucket", name="unique_run_name") # Shared NFS path (/mnt/nfs/unique_run_name) run_config = RunConfig(storage_path="/mnt/nfs", name="unique_run_name") .. warning:: Specifying a *shared storage location* (such as cloud storage or NFS) is *optional* for single-node clusters, but it is **required for multi-node clusters.** Using a local path will :ref:`raise an error ` during checkpointing for multi-node clusters. For more details, see :ref:`persistent-storage-guide`. Launch a training job --------------------- Tying this all together, you can now launch a distributed training job with a :class:`~ray.train.torch.TorchTrainer`. .. testcode:: :hide: from ray.train import ScalingConfig train_func = lambda: None scaling_config = ScalingConfig(num_workers=1) run_config = None .. testcode:: from ray.train.torch import TorchTrainer trainer = TorchTrainer( train_func, scaling_config=scaling_config, run_config=run_config ) result = trainer.fit() Access training results ----------------------- After training completes, a :class:`~ray.train.Result` object is returned which contains information about the training run, including the metrics and checkpoints reported during training. .. testcode:: result.metrics # The metrics reported during training. result.checkpoint # The latest checkpoint reported during training. result.path # The path where logs are stored. result.error # The exception that was raised, if training failed. For more usage examples, see :ref:`train-inspect-results`. --- First, update your training code to support distributed training. Begin by wrapping your code in a :ref:`training function `: .. testcode:: :skipif: True def train_func(): # Your model training code here. ... Each distributed training worker executes this function. You can also specify the input argument for `train_func` as a dictionary via the Trainer's `train_loop_config`. For example: .. testcode:: python :skipif: True def train_func(config): lr = config["lr"] num_epochs = config["num_epochs"] config = {"lr": 1e-4, "num_epochs": 10} trainer = ray.train.torch.TorchTrainer(train_func, train_loop_config=config, ...) .. warning:: Avoid passing large data objects through `train_loop_config` to reduce the serialization and deserialization overhead. Instead, it's preferred to initialize large objects (e.g. datasets, models) directly in `train_func`. .. code-block:: diff def load_dataset(): # Return a large in-memory dataset ... 
def load_model(): # Return a large in-memory model instance ... -config = {"data": load_dataset(), "model": load_model()} def train_func(config): - data = config["data"] - model = config["model"] + data = load_dataset() + model = load_model() ... trainer = ray.train.torch.TorchTrainer(train_func, train_loop_config=config, ...) --- .. _train-deepspeed: Get Started with DeepSpeed ========================== The :class:`~ray.train.torch.TorchTrainer` can help you easily launch your `DeepSpeed `_ training across a distributed Ray cluster. DeepSpeed is an optimization library that enables efficient large-scale model training through techniques like ZeRO (Zero Redundancy Optimizer). Benefits of Using Ray Train with DeepSpeed ------------------------------------------ - **Simplified Distributed Setup**: Ray Train handles all the distributed environment setup for you - **Multi-Node Scaling**: Easily scale to multiple nodes with minimal code changes - **Checkpoint Management**: Built-in checkpoint saving and loading across distributed workers - **Seamless Integration**: Works with your existing DeepSpeed code Code example ------------ You can use your existing DeepSpeed training code with Ray Train's TorchTrainer. The integration is minimal and preserves your familiar DeepSpeed workflow: .. testcode:: :skipif: True import deepspeed from deepspeed.accelerator import get_accelerator def train_func(): # Instantiate your model and dataset model = ... train_dataset = ... eval_dataset = ... deepspeed_config = {...} # Your DeepSpeed config # Prepare everything for distributed training model, optimizer, train_dataloader, lr_scheduler = deepspeed.initialize( model=model, model_parameters=model.parameters(), training_data=tokenized_datasets["train"], collate_fn=collate_fn, config=deepspeed_config, ) # Define the GPU device for the current worker device = get_accelerator().device_name(model.local_rank) # Start training for epoch in range(num_epochs): # Training logic ... # Report metrics to Ray Train ray.train.report(metrics={"loss": loss}) from ray.train.torch import TorchTrainer from ray.train import ScalingConfig trainer = TorchTrainer( train_func, scaling_config=ScalingConfig(...), # If running in a multi-node cluster, this is where you # should configure the run's persistent storage that is accessible # across all worker nodes. # run_config=ray.train.RunConfig(storage_path="s3://..."), ... ) result = trainer.fit() Complete Examples ----------------- Below are complete examples of ZeRO-3 training with DeepSpeed. Each example shows a full implementation of fine-tuning a Bidirectional Encoder Representations from Transformers (BERT) model on the Microsoft Research Paraphrase Corpus (MRPC) dataset. Install the requirements: .. code-block:: bash pip install deepspeed torch datasets transformers torchmetrics "ray[train]" .. tab-set:: .. tab-item:: Example with Ray Data .. dropdown:: Show Code .. literalinclude:: /../../python/ray/train/examples/deepspeed/deepspeed_torch_trainer.py :language: python :start-after: __deepspeed_torch_basic_example_start__ :end-before: __deepspeed_torch_basic_example_end__ .. tab-item:: Example with PyTorch DataLoader .. dropdown:: Show Code .. literalinclude:: /../../python/ray/train/examples/deepspeed/deepspeed_torch_trainer_no_raydata.py :language: python :start-after: __deepspeed_torch_basic_example_no_raydata_start__ :end-before: __deepspeed_torch_basic_example_no_raydata_end__ .. 
tip:: To run DeepSpeed with pure PyTorch, you **don't need to** provide any additional Ray Train utilities like :meth:`~ray.train.torch.prepare_model` or :meth:`~ray.train.torch.prepare_data_loader` in your training function. Instead, keep using `deepspeed.initialize() `_ as usual to prepare everything for distributed training. Fine-tune LLMs with DeepSpeed ----------------------------- See this step-by-step guide for how to fine-tune large language models (LLMs) with Ray Train and DeepSpeed: :doc:`Fine-tune an LLM with Ray Train and DeepSpeed `. Run DeepSpeed with Other Frameworks ----------------------------------- Many deep learning frameworks have integrated with DeepSpeed, including Lightning, Transformers, Accelerate, and more. You can run all these combinations in Ray Train. Check the examples below for more details: .. list-table:: :header-rows: 1 * - Framework - Example * - Accelerate (:ref:`User Guide `) - `Fine-tune Llama-2 series models with DeepSpeed, Accelerate, and Ray Train. `_ * - Transformers (:ref:`User Guide `) - :doc:`Fine-tune GPT-J-6b with DeepSpeed and Hugging Face Transformers ` * - Lightning (:ref:`User Guide `) - :doc:`Fine-tune vicuna-13b with DeepSpeed and PyTorch Lightning ` For more information about DeepSpeed configuration options, refer to the `official DeepSpeed documentation `_. --- :orphan: .. _train-fault-tolerance-deprecated-api: Handling Failures and Node Preemption (Deprecated API) ====================================================== .. important:: This user guide covers deprecated fault tolerance APIs. See :ref:`train-fault-tolerance` for the new API user guide. Please see :ref:`here ` for information about the deprecation and migration. Automatically Recover from Train Worker Failures ------------------------------------------------ Ray Train has built-in fault tolerance to recover from worker failures (i.e. ``RayActorError``\s). When a failure is detected, the workers will be shut down and new workers will be added. The training function will be restarted, but progress from the previous execution can be resumed through checkpointing. .. tip:: In order to retain progress upon recovery, your training function **must** implement logic for both :ref:`saving ` *and* :ref:`loading checkpoints `. Each instance of recovery from a worker failure is considered a retry. The number of retries is configurable through the ``max_failures`` attribute of the :class:`~ray.train.FailureConfig` argument set in the :class:`~ray.train.RunConfig` passed to the ``Trainer``: .. literalinclude:: ../doc_code/fault_tolerance.py :language: python :start-after: __failure_config_start__ :end-before: __failure_config_end__ Which checkpoint will be restored? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Ray Train will automatically resume training from the latest available :ref:`checkpoint reported to Ray Train `. This will be the last checkpoint passed to :func:`train.report() `. Restore a Ray Train Experiment ------------------------------ At the experiment level, Trainer restoration allows you to resume a previously interrupted experiment from where it left off. A Train experiment may be interrupted due to one of the following reasons: - The experiment was manually interrupted (e.g., Ctrl+C or a pre-empted head node instance). - The head node crashed (e.g., OOM or some other runtime error). - The entire cluster went down (e.g., network error affecting all nodes).
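If any of these happens, you can resume from the experiment directory the run was writing to. As a rough, hedged sketch only (the path is a placeholder; the runnable examples follow below):

.. code-block:: python

    from ray.train.torch import TorchTrainer

    # Placeholder path; use the directory configured through `RunConfig`.
    experiment_path = "~/ray_results/my_experiment"

    # Restore the interrupted run, re-supplying anything that can't be
    # loaded from the experiment directory (for example, datasets).
    trainer = TorchTrainer.restore(experiment_path)
    result = trainer.fit()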
Trainer restoration is possible for all of Ray Train's built-in trainers, but we use ``TorchTrainer`` in the examples for demonstration. We also use ``Trainer`` to refer to methods that are shared across all built-in trainers. Let's say your initial Train experiment is configured as follows. The actual training loop is just for demonstration purposes: the important detail is that :ref:`saving ` *and* :ref:`loading checkpoints ` have been implemented. .. literalinclude:: ../doc_code/dl_guide.py :language: python :start-after: __ft_initial_run_start__ :end-before: __ft_initial_run_end__ The results and checkpoints of the experiment are saved to the path configured by :class:`~ray.train.RunConfig`. If the experiment has been interrupted due to one of the reasons listed above, use this path to resume: .. literalinclude:: ../doc_code/dl_guide.py :language: python :start-after: __ft_restored_run_start__ :end-before: __ft_restored_run_end__ .. tip:: You can also restore from a remote path (e.g., from an experiment directory stored in an S3 bucket). .. literalinclude:: ../doc_code/dl_guide.py :language: python :dedent: :start-after: __ft_restore_from_cloud_initial_start__ :end-before: __ft_restore_from_cloud_initial_end__ .. literalinclude:: ../doc_code/dl_guide.py :language: python :dedent: :start-after: __ft_restore_from_cloud_restored_start__ :end-before: __ft_restore_from_cloud_restored_end__ .. note:: Different trainers may allow more parameters to be optionally re-specified on restore. Only **datasets** are required to be re-specified on restore, if they were supplied originally. `TorchTrainer.restore`, `TensorflowTrainer.restore`, and `HorovodTrainer.restore` can take in the same parameters as their parent class's :meth:`DataParallelTrainer.restore `. Unless otherwise specified, other trainers will accept the same parameters as :meth:`BaseTrainer.restore `. Auto-resume ~~~~~~~~~~~ Adding the branching logic below will allow you to run the same script after the interrupt, picking up training from where you left off on the previous run. Notice that we use the :meth:`Trainer.can_restore ` utility method to determine the existence and validity of the given experiment directory. .. literalinclude:: ../doc_code/dl_guide.py :language: python :start-after: __ft_autoresume_start__ :end-before: __ft_autoresume_end__ .. seealso:: See the :meth:`BaseTrainer.restore ` docstring for a full example. .. note:: `Trainer.restore` is different from :class:`Trainer(..., resume_from_checkpoint=...) `. `resume_from_checkpoint` is meant to be used to start a *new* Train experiment, which writes results to a new directory and starts over from iteration 0. `Trainer.restore` is used to continue an existing experiment, where new results will continue to be appended to existing logs. --- :orphan: .. _train-tune-deprecated-api: Hyperparameter Tuning with Ray Tune (Deprecated API) ==================================================== .. important:: This user guide covers the deprecated Train + Tune integration. See :ref:`train-tune` for the new API user guide. Please see :ref:`here ` for information about the deprecation and migration. Hyperparameter tuning with :ref:`Ray Tune ` is natively supported with Ray Train. .. https://docs.google.com/drawings/d/1yMd12iMkyo6DGrFoET1TIlKfFnXX9dfh2u3GSdTz6W4/edit .. figure:: ../images/train-tuner.svg :align: center The `Tuner` will take in a `Trainer` and execute multiple training runs, each with different hyperparameter configurations.
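As a quick, hedged sketch of what this looks like with the deprecated integration (the training function, search space, and metric are placeholders; the runnable version is in the basic usage example below):

.. code-block:: python

    import ray.train
    from ray import tune
    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer
    from ray.tune import Tuner


    def train_func(config):
        # Placeholder training loop that reads the tuned value and
        # reports a dummy metric.
        lr = config["lr"]
        ray.train.report({"loss": lr})


    trainer = TorchTrainer(
        train_func,
        train_loop_config={"lr": 1e-3},
        scaling_config=ScalingConfig(num_workers=2),
    )

    tuner = Tuner(
        trainer,
        # The Tuner samples `lr` and overrides the Trainer's `train_loop_config`.
        param_space={"train_loop_config": {"lr": tune.loguniform(1e-4, 1e-1)}},
        tune_config=tune.TuneConfig(num_samples=8, metric="loss", mode="min"),
    )
    results = tuner.fit()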
Key Concepts ------------ There are a number of key concepts when doing hyperparameter optimization with a :class:`~ray.tune.Tuner`: * A set of hyperparameters you want to tune in a *search space*. * A *search algorithm* to effectively optimize your parameters and optionally use a *scheduler* to stop searches early and speed up your experiments. * The *search space*, *search algorithm*, *scheduler*, and *Trainer* are passed to a Tuner, which runs the hyperparameter tuning workload by evaluating multiple hyperparameters in parallel. * Each individual hyperparameter evaluation run is called a *trial*. * The Tuner returns its results as a :class:`~ray.tune.ResultGrid`. .. note:: Tuners can also be used to launch hyperparameter tuning without using Ray Train. See :ref:`the Ray Tune documentation ` for more guides and examples. Basic usage ----------- You can take an existing :class:`Trainer ` and simply pass it into a :class:`~ray.tune.Tuner`. .. literalinclude:: ../doc_code/tuner.py :language: python :start-after: __basic_start__ :end-before: __basic_end__ How to configure a Tuner? ------------------------- There are two main configuration objects that can be passed into a Tuner: the :class:`TuneConfig ` and the :class:`ray.tune.RunConfig`. The :class:`TuneConfig ` contains tuning specific settings, including: - the tuning algorithm to use - the metric and mode to rank results - the amount of parallelism to use Here are some common configurations for `TuneConfig`: .. literalinclude:: ../doc_code/tuner.py :language: python :start-after: __tune_config_start__ :end-before: __tune_config_end__ See the :class:`TuneConfig API reference ` for more details. The :class:`ray.tune.RunConfig` contains configurations that are more generic than tuning specific settings. This includes: - failure/retry configurations - verbosity levels - the name of the experiment - the logging directory - checkpoint configurations - custom callbacks - integration with cloud storage Below we showcase some common configurations of :class:`ray.tune.RunConfig`. .. literalinclude:: ../doc_code/tuner.py :language: python :start-after: __run_config_start__ :end-before: __run_config_end__ Search Space configuration -------------------------- A `Tuner` takes in a `param_space` argument where you can define the search space from which hyperparameter configurations will be sampled. Depending on the model and dataset, you may want to tune: - The training batch size - The learning rate for deep learning training (e.g., image classification) - The maximum depth for tree-based models (e.g., XGBoost) You can use a Tuner to tune most arguments and configurations for Ray Train, including but not limited to: - Ray :class:`Datasets ` - :class:`~ray.train.ScalingConfig` - and other hyperparameters. Read more about :ref:`Tune search spaces here `. Train - Tune gotchas -------------------- There are a couple gotchas about parameter specification when using Tuners with Trainers: - By default, configuration dictionaries and config objects will be deep-merged. - Parameters that are duplicated in the Trainer and Tuner will be overwritten by the Tuner ``param_space``. - **Exception:** all arguments of the :class:`ray.tune.RunConfig` and :class:`ray.tune.TuneConfig` are inherently un-tunable. See :doc:`/tune/tutorials/tune_get_data_in_and_out` for an example. Advanced Tuning --------------- Tuners also offer the ability to tune over different data preprocessing steps and different training/validation datasets, as shown in the following snippet. .. 
literalinclude:: ../doc_code/tuner.py :language: python :start-after: __tune_dataset_start__ :end-before: __tune_dataset_end__ --- .. _train-tensorflow-overview: Get Started with Distributed Training using TensorFlow/Keras ============================================================ Ray Train's `TensorFlow `__ integration enables you to scale your TensorFlow and Keras training functions to many machines and GPUs. On a technical level, Ray Train schedules your training workers and configures ``TF_CONFIG`` for you, allowing you to run your ``MultiWorkerMirroredStrategy`` training script. See `Distributed training with TensorFlow `_ for more information. Most of the examples in this guide use TensorFlow with Keras, but Ray Train also works with vanilla TensorFlow. Quickstart ----------- .. literalinclude:: ./doc_code/tf_starter.py :language: python :start-after: __tf_train_start__ :end-before: __tf_train_end__ Update your training function ----------------------------- First, update your :ref:`training function ` to support distributed training. .. note:: The current TensorFlow implementation supports ``MultiWorkerMirroredStrategy`` (and ``MirroredStrategy``). If there are other strategies you wish to see supported by Ray Train, submit a `feature request on GitHub `_. These instructions closely follow TensorFlow's `Multi-worker training with Keras `_ tutorial. One key difference is that Ray Train handles the environment variable setup for you. **Step 1:** Wrap your model in ``MultiWorkerMirroredStrategy``. The `MultiWorkerMirroredStrategy `_ enables synchronous distributed training. You *must* build and compile the ``Model`` within the scope of the strategy. .. testcode:: :skipif: True with tf.distribute.MultiWorkerMirroredStrategy().scope(): model = ... # build model model.compile() **Step 2:** Update your ``Dataset`` batch size to the *global* batch size. Set ``batch_size`` appropriately because `batch `_ splits evenly across worker processes. .. code-block:: diff -batch_size = worker_batch_size +batch_size = worker_batch_size * train.get_context().get_world_size() .. warning:: Ray doesn't automatically set any environment variables or configuration related to local parallelism or threading :ref:`aside from "OMP_NUM_THREADS" `. If you want greater control over TensorFlow threading, use the ``tf.config.threading`` module (e.g., ``tf.config.threading.set_inter_op_parallelism_threads(num_cpus)``) at the beginning of your ``train_loop_per_worker`` function. Create a TensorflowTrainer -------------------------- ``Trainer``\s are the primary Ray Train classes for managing state and executing training. For distributed TensorFlow, use a :class:`~ray.train.tensorflow.TensorflowTrainer` that you can set up like this: .. testcode:: :hide: train_func = lambda: None .. testcode:: from ray.train import ScalingConfig from ray.train.tensorflow import TensorflowTrainer # For GPU Training, set `use_gpu` to True. use_gpu = False trainer = TensorflowTrainer( train_func, scaling_config=ScalingConfig(use_gpu=use_gpu, num_workers=2) ) To customize the backend setup, you can pass a :class:`~ray.train.tensorflow.TensorflowConfig`: .. testcode:: :skipif: True from ray.train import ScalingConfig from ray.train.tensorflow import TensorflowTrainer, TensorflowConfig trainer = TensorflowTrainer( train_func, tensorflow_backend=TensorflowConfig(...), scaling_config=ScalingConfig(num_workers=2), ) For more configurability, see the :py:class:`~ray.train.data_parallel_trainer.DataParallelTrainer` API.
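For orientation, the pieces from the preceding steps combine into a training function along the following lines. This is a minimal sketch with a toy model and synthetic data rather than a drop-in implementation; the parts that matter are building the model inside the strategy scope and scaling the batch size by the world size.

.. testcode::
    :skipif: True

    import tensorflow as tf
    from ray import train

    def train_func(config: dict):
        # Scale the per-worker batch size up to the *global* batch size.
        global_batch_size = (
            config.get("batch_size", 64) * train.get_context().get_world_size()
        )

        # Build and compile the model inside the strategy scope.
        strategy = tf.distribute.MultiWorkerMirroredStrategy()
        with strategy.scope():
            model = tf.keras.Sequential(
                [tf.keras.layers.Dense(1, input_shape=(4,))]
            )
            model.compile(optimizer="adam", loss="mse")

        # Toy in-memory dataset; replace with your own data loading.
        dataset = tf.data.Dataset.from_tensor_slices(
            (tf.random.normal((256, 4)), tf.random.normal((256, 1)))
        ).batch(global_batch_size)

        model.fit(dataset, epochs=config.get("epochs", 3))

You would pass a function like this as ``train_func`` to the ``TensorflowTrainer`` constructed above, optionally supplying ``batch_size`` and ``epochs`` through ``train_loop_config``.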
Run a training function ----------------------- With a distributed training function and a Ray Train ``Trainer``, you are now ready to start training. .. testcode:: :skipif: True trainer.fit() Load and preprocess data ------------------------ TensorFlow by default uses its own internal dataset sharding policy, as described `in the guide `__. If your TensorFlow dataset is compatible with distributed loading, you don't need to change anything. If you require more advanced preprocessing, you may want to consider using Ray Data for distributed data ingest. See :ref:`Ray Data with Ray Train `. The main difference is that you may want to convert your Ray Data dataset shard to a TensorFlow dataset in your training function so that you can use the Keras API for model training. `See this example `__ for distributed data loading. The relevant parts are: .. testcode:: import tensorflow as tf from ray import train from ray.train.tensorflow import prepare_dataset_shard def train_func(config: dict): # ... # Get dataset shard from Ray Train dataset_shard = train.get_context().get_dataset_shard("train") # Define a helper function to build a TensorFlow dataset def to_tf_dataset(dataset, batch_size): def to_tensor_iterator(): for batch in dataset.iter_tf_batches( batch_size=batch_size, dtypes=tf.float32 ): yield batch["image"], batch["label"] output_signature = ( tf.TensorSpec(shape=(None, 784), dtype=tf.float32), tf.TensorSpec(shape=(None, 784), dtype=tf.float32), ) tf_dataset = tf.data.Dataset.from_generator( to_tensor_iterator, output_signature=output_signature ) # Call prepare_dataset_shard to disable automatic sharding # (since the dataset is already sharded) return prepare_dataset_shard(tf_dataset) for epoch in range(epochs): # Call our helper function to build the dataset tf_dataset = to_tf_dataset( dataset=dataset_shard, batch_size=64, ) history = multi_worker_model.fit(tf_dataset) Report results -------------- During training, the training loop should report intermediate results and checkpoints to Ray Train. This reporting logs the results to the console output and appends them to local log files. The logging also triggers :ref:`checkpoint bookkeeping `. The easiest way to report your results with Keras is by using the :class:`~ray.train.tensorflow.keras.ReportCheckpointCallback`: .. testcode:: from ray.train.tensorflow.keras import ReportCheckpointCallback def train_func(config: dict): # ... for epoch in range(epochs): model.fit(dataset, callbacks=[ReportCheckpointCallback()]) This callback automatically forwards all results and checkpoints from the Keras training function to Ray Train. Aggregate results ~~~~~~~~~~~~~~~~~ TensorFlow Keras automatically aggregates metrics from all workers. If you wish to have more control over that, consider implementing a `custom training loop `__. Save and load checkpoints ------------------------- You can save :class:`Checkpoints ` by calling ``train.report(metrics, checkpoint=Checkpoint(...))`` in the training function. This call saves the checkpoint state from the distributed workers on the ``Trainer``, where you executed your python script. You can access the latest saved checkpoint through the ``checkpoint`` attribute of the :py:class:`~ray.train.Result`, and access the best saved checkpoints with the ``best_checkpoints`` attribute. These concrete examples demonstrate how Ray Train appropriately saves checkpoints, model weights but not models, in distributed training. .. 
testcode:: import json import os import tempfile from ray import train from ray.train import Checkpoint, ScalingConfig from ray.train.tensorflow import TensorflowTrainer import numpy as np def train_func(config): import tensorflow as tf n = 100 # create a toy dataset # data : X - dim = (n, 4) # target : Y - dim = (n, 1) X = np.random.normal(0, 1, size=(n, 4)) Y = np.random.uniform(0, 1, size=(n, 1)) strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy() with strategy.scope(): # toy neural network : 1-layer model = tf.keras.Sequential([tf.keras.layers.Dense(1, activation="linear", input_shape=(4,))]) model.compile(optimizer="Adam", loss="mean_squared_error", metrics=["mse"]) for epoch in range(config["num_epochs"]): history = model.fit(X, Y, batch_size=20) with tempfile.TemporaryDirectory() as temp_checkpoint_dir: model.save(os.path.join(temp_checkpoint_dir, "model.keras")) checkpoint_dict = os.path.join(temp_checkpoint_dir, "checkpoint.json") with open(checkpoint_dict, "w") as f: json.dump({"epoch": epoch}, f) checkpoint = Checkpoint.from_directory(temp_checkpoint_dir) train.report({"loss": history.history["loss"][0]}, checkpoint=checkpoint) trainer = TensorflowTrainer( train_func, train_loop_config={"num_epochs": 5}, scaling_config=ScalingConfig(num_workers=2), ) result = trainer.fit() print(result.checkpoint) By default, checkpoints persist to local disk in the :ref:`log directory ` of each run. Load checkpoints ~~~~~~~~~~~~~~~~ .. testcode:: import json import os import tempfile from ray import train from ray.train import Checkpoint, ScalingConfig from ray.train.tensorflow import TensorflowTrainer import numpy as np def train_func(config): import tensorflow as tf n = 100 # create a toy dataset # data : X - dim = (n, 4) # target : Y - dim = (n, 1) X = np.random.normal(0, 1, size=(n, 4)) Y = np.random.uniform(0, 1, size=(n, 1)) strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy() with strategy.scope(): # toy neural network : 1-layer checkpoint = train.get_checkpoint() if checkpoint: with checkpoint.as_directory() as checkpoint_dir: model = tf.keras.models.load_model( os.path.join(checkpoint_dir, "model.keras") ) else: model = tf.keras.Sequential( [tf.keras.layers.Dense(1, activation="linear", input_shape=(4,))] ) model.compile(optimizer="Adam", loss="mean_squared_error", metrics=["mse"]) for epoch in range(config["num_epochs"]): history = model.fit(X, Y, batch_size=20) with tempfile.TemporaryDirectory() as temp_checkpoint_dir: model.save(os.path.join(temp_checkpoint_dir, "model.keras")) extra_json = os.path.join(temp_checkpoint_dir, "checkpoint.json") with open(extra_json, "w") as f: json.dump({"epoch": epoch}, f) checkpoint = Checkpoint.from_directory(temp_checkpoint_dir) train.report({"loss": history.history["loss"][0]}, checkpoint=checkpoint) trainer = TensorflowTrainer( train_func, train_loop_config={"num_epochs": 5}, scaling_config=ScalingConfig(num_workers=2), ) result = trainer.fit() print(result.checkpoint) Further reading --------------- See :ref:`User Guides ` to explore more topics: - :ref:`Experiment tracking ` - :ref:`Fault tolerance and training on spot instances ` - :ref:`Hyperparameter optimization ` --- :orphan: Distributed Training with Hugging Face Accelerate ================================================= .. raw:: html Run on Anyscale

This example does distributed data parallel training with Hugging Face Accelerate, Ray Train, and Ray Data. It fine-tunes a BERT model and is adapted from https://github.com/huggingface/accelerate/blob/main/examples/nlp_example.py Code example ------------ .. literalinclude:: /../../python/ray/train/examples/accelerate/accelerate_torch_trainer.py See also -------- * :ref:`Get Started with Hugging Face Accelerate ` for a tutorial on using Ray Train and HF Accelerate * :doc:`Ray Train Examples <../../examples>` for more use cases --- :orphan: Distributed fine-tuning of Llama 3.1 8B on AWS Trainium with Ray and PyTorch Lightning ====================================================================================== .. raw:: html Run on Anyscale

This example demonstrates how to fine-tune the `Llama 3.1 8B `__ model on `AWS Trainium `__ instances using Ray Train, PyTorch Lightning, and AWS Neuron SDK. AWS Trainium is the machine learning (ML) chip that AWS built for deep learning (DL) training of 100B+ parameter models. `AWS Neuron SDK `__ helps developers train models on Trainium accelerators. Prepare the environment ----------------------- See `Setup EKS cluster and tools `__ for setting up an Amazon EKS cluster leveraging AWS Trainium instances. Create a Docker image --------------------- When the EKS cluster is ready, create an Amazon ECR repository for building and uploading the Docker image containing artifacts for fine-tuning a Llama3.1 8B model: 1. Clone the repo. :: git clone https://github.com/aws-neuron/aws-neuron-eks-samples.git 2. Go to the ``llama3.1_8B_finetune_ray_ptl_neuron`` directory. :: cd aws-neuron-eks-samples/llama3.1_8B_finetune_ray_ptl_neuron 3. Trigger the script. :: chmod +x 0-kuberay-trn1-llama3-finetune-build-image.sh ./0-kuberay-trn1-llama3-finetune-build-image.sh 4. Enter the zone your cluster is running in, for example: us-east-2. 5. Verify in the AWS console that the Amazon ECR service has the newly created ``kuberay_trn1_llama3.1_pytorch2`` repository. 6. Update the ECR image ARN in the manifest file used for creating the Ray cluster. Replace the and placeholders with actual values in the ``1-llama3-finetune-trn1-create-raycluster.yaml`` file using commands below to reflect the ECR image ARN created above: :: export AWS_ACCOUNT_ID= # for ex: 111222333444 export REGION= # for ex: us-east-2 sed -i "s//$AWS_ACCOUNT_ID/g" 1-llama3-finetune-trn1-create-raycluster.yaml sed -i "s//$REGION/g" 1-llama3-finetune-trn1-create-raycluster.yaml Configuring Ray Cluster ----------------------- The ``llama3.1_8B_finetune_ray_ptl_neuron`` directory in the AWS Neuron samples repository simplifies the Ray configuration. KubeRay provides a manifest that you can apply to the cluster to set up the head and worker pods. Run the following command to set up the Ray cluster: :: kubectl apply -f 1-llama3-finetune-trn1-create-raycluster.yaml Accessing Ray Dashboard ----------------------- Port forward from the cluster to see the state of the Ray dashboard and then view it on `http://localhost:8265 `__. Run it in the background with the following command: :: kubectl port-forward service/kuberay-trn1-head-svc 8265:8265 & Launching Ray Jobs ------------------ The Ray cluster is now ready to handle workloads. Initiate the data preparation and fine-tuning Ray jobs: 1. Launch the Ray job for downloading the dolly-15k dataset and the Llama3.1 8B model artifacts: :: kubectl apply -f 2-llama3-finetune-trn1-rayjob-create-data.yaml 2. When the job has executed successfully, run the following fine-tuning job: :: kubectl apply -f 3-llama3-finetune-trn1-rayjob-submit-finetuning-job.yaml 3. Monitor the jobs via the Ray Dashboard For detailed information on each of the steps above, see the `AWS documentation link `__. --- :orphan: Train with DeepSpeed ZeRO-3 and Ray Train ========================================= .. raw:: html Run on Anyscale

This is an intermediate example that shows how to do distributed training with DeepSpeed ZeRO-3 and Ray Train. It demonstrates how to use :ref:`Ray Data ` with DeepSpeed ZeRO-3 and Ray Train. If you just want to quickly convert your existing TorchTrainer scripts into Ray Train, you can refer to the :ref:`Train with DeepSpeed `. Code example ------------ .. literalinclude:: /../../python/ray/train/examples/deepspeed/deepspeed_torch_trainer.py See also -------- * :doc:`Ray Train Examples <../../examples>` for more use cases. * :ref:`Get Started with DeepSpeed ` for a tutorial. --- :orphan: Run Horovod Distributed Training with PyTorch and Ray Train =========================================================== .. raw:: html Run on Anyscale

This basic example demonstrates how to run Horovod distributed training with PyTorch and Ray Train. Code example ------------ .. literalinclude:: /../../python/ray/train/examples/horovod/horovod_example.py See also -------- * :ref:`Get Started with Horovod ` for a tutorial on using Horovod with Ray Train * :doc:`Ray Train Examples <../../examples>` for more use cases --- :orphan: Fine-tuning of Stable Diffusion with DreamBooth and Ray Train ============================================================= .. raw:: html Run on Anyscale

This is an intermediate example that shows how to do DreamBooth fine-tuning of a Stable Diffusion model using Ray Train. It demonstrates how to use :ref:`Ray Data ` with PyTorch Lightning in Ray Train. See the original `DreamBooth project homepage `_ for more details on what this fine-tuning method achieves. .. image:: https://dreambooth.github.io/DreamBooth_files/high_level.png :target: https://dreambooth.github.io :alt: DreamBooth fine-tuning overview This example builds on `this Hugging Face 🤗 tutorial `_. See the Hugging Face tutorial for useful explanations and suggestions on hyperparameters. **Adapting this example to Ray Train allows you to easily scale up the fine-tuning to an arbitrary number of distributed training workers.** **Compute requirements:** * Because of the large model sizes, you need a machine with at least 1 A10G GPU. * Each training worker uses 1 GPU. You can use multiple GPUs or workers to leverage data-parallel training to speed up training time. This example fine-tunes both the ``text_encoder`` and ``unet`` models used in the stable diffusion process, with respect to a prior preserving loss. .. image:: /templates/05_dreambooth_finetuning/dreambooth/images/dreambooth_example.png :alt: DreamBooth overview Find the full code repository at `https://github.com/ray-project/ray/tree/master/doc/source/templates/05_dreambooth_finetuning `_ How it works ------------ This example uses Ray Data for data loading and Ray Train for distributed training. Data loading ^^^^^^^^^^^^ .. note:: Find the latest version of the code at `dataset.py `_ The latest version might differ slightly from the code presented here. Use Ray Data for data loading. The code has three interesting parts. First, load two datasets using :func:`ray.data.read_images`: .. literalinclude:: /templates/05_dreambooth_finetuning/dreambooth/dataset.py :language: python :start-at: instance_dataset = read :end-at: class_dataset = read :dedent: 4 Then, tokenize the prompt that generated these images: .. literalinclude:: /templates/05_dreambooth_finetuning/dreambooth/dataset.py :language: python :start-at: tokenizer = AutoTokenizer :end-at: instance_prompt_ids = _tokenize :dedent: 4 And lastly, apply a ``torchvision`` preprocessing pipeline to the images: .. literalinclude:: /templates/05_dreambooth_finetuning/dreambooth/dataset.py :language: python :start-after: START: image preprocessing :end-before: END: image preprocessing :dedent: 4 Apply all three parts in a final step: .. literalinclude:: /templates/05_dreambooth_finetuning/dreambooth/dataset.py :language: python :start-after: START: Apply preprocessing :end-before: END: Apply preprocessing :dedent: 4 Distributed training ^^^^^^^^^^^^^^^^^^^^ .. note:: Find the latest version of the code at `train.py `_ The latest version might differ slightly from the code presented here. The central part of the training code is the :ref:`training function `. This function accepts a configuration dict that contains the hyperparameters. It then defines a regular PyTorch training loop. You interact with the Ray Train API in only a few locations, which follow in-line comments in the snippet below. Remember that you want to do data-parallel training for all the models. #. Load the data shard for each worker with `session.get_dataset_shard("train")`` #. Iterate over the dataset with `train_dataset.iter_torch_batches()`` #. Report results to Ray Train with `session.report(results)`` The code is compacted for brevity. The `full code `_ is more thoroughly annotated. .. 
literalinclude:: /templates/05_dreambooth_finetuning/dreambooth/train.py :language: python :start-at: def train_fn(config) :end-before: END: Training loop You can then run this training function with Ray Train's TorchTrainer: .. literalinclude:: /templates/05_dreambooth_finetuning/dreambooth/train.py :language: python :start-at: args = train_arguments :end-at: trainer.fit() :dedent: 4 Configure the scale ^^^^^^^^^^^^^^^^^^^ In the TorchTrainer, you can easily configure the scale. The preceding example uses the ``num_workers`` argument to specify the number of workers. This argument defaults to 2 workers with 1 GPU each, totalling to 2 GPUs. To run the example on 4 GPUs, set the number of workers to 4 using ``--num-workers=4``. Or you can change the scaling config directly: .. code-block:: diff scaling_config=ScalingConfig( use_gpu=True, - num_workers=args.num_workers, + num_workers=4, ) If you're running multi-node training, make sure that all nodes have access to a shared storage like NFS or EFS. In the following example script, you can adjust the location with the ``DATA_PREFIX`` environment variable. Training throughput ~~~~~~~~~~~~~~~~~~~ Compare throughput of the preceding training runs that used 1, 2, and 4 workers or GPUs. Consider the following setup: * 1 GCE g2-standard-48-nvidia-l4-4 instance with 4 GPUs * Model as configured below * Data from this example * 200 regularization images * Training for 4 epochs (local batch size = 2) * 3 runs per configuration You expect that the training time should benefit from scale and decreases when running with more workers and GPUs. .. image:: /templates/05_dreambooth_finetuning/dreambooth/images/dreambooth_training.png :alt: DreamBooth training times .. list-table:: :header-rows: 1 * - Number of workers/GPUs - Training time (seconds) * - 1 - 802.14 * - 2 - 487.82 * - 4 - 313.25 While the training time decreases linearly with the amount of workers/GPUs, you can observe some penalty. Specifically, with double the amount of workers you don't get half of the training time. This penalty is most likely due to additional communication between processes and the transfer of large model weights. You are also only training with a batch size of one because of the GPU memory limitation. On larger GPUs with higher batch sizes you would expect a greater benefit from scaling out. Run the example --------------- First, download the pre-trained Stable Diffusion model as a starting point. Then train this model with a few images of a subject. To achieve this, choose a non-word as an identifier, such as ``unqtkn``. When fine-tuning the model with this subject, you teach the model that the prompt is ``A photo of a unqtkn ``. After fine-tuning you can run inference with this specific prompt. For instance: ``A photo of a unqtkn `` creates an image of the subject. Similarly, ``A photo of a unqtkn at the beach`` creates an image of the subject at the beach. Step 0: Preparation ^^^^^^^^^^^^^^^^^^^ Clone the Ray repository, go to the example directory, and install dependencies. .. code-block:: bash git clone https://github.com/ray-project/ray.git cd doc/source/templates/05_dreambooth_finetuning pip install -Ur dreambooth/requirements.txt Prepare some directories and environment variables. .. 
literalinclude:: /templates/05_dreambooth_finetuning/dreambooth_run.sh :language: bash :start-after: __preparation_start__ :end-before: __preparation_end__ Step 1: Download the pre-trained model ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Download and cache a pre-trained Stable Diffusion model locally. .. literalinclude:: /templates/05_dreambooth_finetuning/dreambooth_run.sh :language: bash :start-after: __cache_model_start__ :end-before: __cache_model_end__ You can access the downloaded model checkpoint at the ``$ORIG_MODEL_PATH``. Step 2: Supply images of your subject ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Use one of the sample datasets, like `dog` or `lego car`, or provide your own directory of images, and specify the directory with the ``$INSTANCE_DIR`` environment variable. Then, copy these images to ``$IMAGES_OWN_DIR``. .. literalinclude:: /templates/05_dreambooth_finetuning/dreambooth_run.sh :language: bash :start-after: __supply_own_images_start__ :end-before: __supply_own_images_end__ The ``$CLASS_NAME`` should be the general category of your subject. The images produced by the prompt ``photo of a unqtkn `` should be diverse images that are different enough from the subject in order for generated images to clearly show the effect of fine-tuning. Step 3: Create the regularization images ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Create a regularization image set for a class of subjects using the pre-trained Stable Diffusion model. This regularization set ensures that the model still produces decent images for random images of the same class, rather than just optimize for producing good images of the subject. .. literalinclude:: /templates/05_dreambooth_finetuning/dreambooth_run.sh :language: bash :start-after: Step 3: START :end-before: Step 3: END Use Ray Data to do batch inference with 4 workers, to generate more images in parallel. Step 4: Fine-tune the model ^^^^^^^^^^^^^^^^^^^^^^^^^^^ Save a few, like 4 to 5, images of the subject being fine-tuned in a local directory. Then launch the training job with: .. literalinclude:: /templates/05_dreambooth_finetuning/dreambooth_run.sh :language: bash :start-after: Step 4: START :end-before: Step 4: END Step 5: Generate images of the subject ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Try your model with the same command line as Step 2, but point to your own model this time. .. literalinclude:: /templates/05_dreambooth_finetuning/dreambooth_run.sh :language: bash :start-after: Step 5: START :end-before: Step 5: END Next, try replacing the prompt with something more interesting. For example, for the dog subject, you can try: - "photo of a unqtkn dog in a bucket" - "photo of a unqtkn dog sleeping" - "photo of a unqtkn dog in a doghouse" See also -------- * :doc:`Ray Train Examples <../../examples>` for more use cases * :ref:`Ray Train User Guides ` for how-to guides --- :orphan: .. _train-pytorch-fashion-mnist: Train a PyTorch model on Fashion MNIST ====================================== .. raw:: html Run on Anyscale

This example runs distributed training of a PyTorch model on Fashion MNIST with Ray Train. Code example ------------ .. literalinclude:: /../../python/ray/train/examples/pytorch/torch_fashion_mnist_example.py See also -------- * :ref:`Get Started with PyTorch ` for a tutorial on using Ray Train and PyTorch * :doc:`Ray Train Examples <../../examples>` for more use cases --- :orphan: torch_regression_example ======================== .. raw:: html Run on Anyscale

.. literalinclude:: /../../python/ray/train/examples/pytorch/torch_regression_example.py --- :orphan: Training with TensorFlow and Ray Train ====================================== .. raw:: html Run on Anyscale

This basic example runs distributed training of a TensorFlow model on MNIST with Ray Train. Code example ------------ .. literalinclude:: /../../python/ray/train/examples/tf/tensorflow_mnist_example.py See also -------- * :doc:`Ray Train Examples <../../examples>` for more use cases. * :ref:`Distributed Tensorflow & Keras ` for a tutorial. --- :orphan: tensorflow_regression_example ============================= .. raw:: html Run on Anyscale

.. literalinclude:: /../../python/ray/train/examples/tf/tensorflow_regression_example.py --- :orphan: .. _transformers_torch_trainer_basic_example: Fine-tune a Text Classifier with Hugging Face Transformers ========================================================== .. raw:: html Run on Anyscale

This basic example of distributed training with Ray Train and Hugging Face (HF) Transformers fine-tunes a text classifier on the Yelp review dataset using HF Transformers and Ray Train. Code example ------------ .. literalinclude:: /../../python/ray/train/examples/transformers/transformers_torch_trainer_basic.py See also -------- * :ref:`Get Started with Hugging Face Transformers ` for a tutorial * :doc:`Ray Train Examples <../../examples>` for more use cases --- .. _train-jax: Get Started with Distributed Training using JAX =============================================== This guide provides an overview of the `JaxTrainer` in Ray Train. What is JAX? ------------ `JAX `_ is a Python library for accelerator-oriented array computation and program transformation, designed for high-performance numerical computing and large-scale machine learning. JAX provides an extensible system for transforming numerical functions like `jax.grad`, `jax.jit`, and `jax.vmap`, utilizing the XLA compiler to create highly optimized code that scales efficiently on accelerators like GPUs and TPUs. The core power of JAX lies in its composability, allowing these transformations to be combined to build complex, high-performance numerical programs for distributed execution. What are TPUs? -------------- Tensor Processing Units (TPUs), are custom-designed accelerators created by Google to optimize machine learning workloads. Unlike general-purpose CPUs or parallel-processing GPUs, TPUs are highly specialized for the massive matrix and tensor computations involved in deep learning, making them exceptionally efficient. The primary advantage of TPUs is performance at scale, as they are designed to be connected into large, multi-host configurations called “PodSlices” via a high-speed ICI interconnect, making them ideal for training large models that are unable to fit on a single node. To learn more about configuring TPUs with KubeRay, see :ref:`kuberay-tpu`. JaxTrainer API -------------- The :class:`~ray.train.v2.jax.JaxTrainer` is the core component for orchestrating distributed JAX training in Ray Train with TPUs. It follows the Single-Program, Multi-Data (SPMD) paradigm, where your training code is executed simultaneously across multiple workers, each running on a separate TPU virtual machine within a TPU slice. Ray automatically handles atomically reserving a TPU multi-host slice. The `JaxTrainer` is initialized with your training logic, defined in a `train_loop_per_worker` function, and a `ScalingConfig` that specifies the distributed hardware layout. The `JaxTrainer` currently only supports TPU accelerator types. Configuring Scale and TPU ------------------------- For TPU training, the `ScalingConfig` is where you define the specifics of your hardware slice. Key fields include: * `use_tpu`: This is a new field added in Ray 2.49.0 to the V2 `ScalingConfig`. This boolean flag explicitly tells Ray Train to initialize the JAX backend for TPU execution. * `topology`: This is a new field added in Ray 2.49.0 to the V2 `ScalingConfig`. Topology is a string defining the physical arrangement of the TPU chips (e.g., "4x4"). This is required for multi-host training and ensures Ray places workers correctly across the slice. For a list of supported TPU topologies by generation, see the `GKE documentation `_. * `num_workers`: Set to the number of VMs in your TPU slice. For a v4-32 slice with a 2x2x4 topology, this would be 4. * `resources_per_worker`: A dictionary specifying the resources each worker needs. 
For TPUs, you typically request the number of chips per VM (Ex: {"TPU": 4}). * `accelerator_type`: For TPUs, `accelerator_type` specifies the TPU generation you are using (e.g., "TPU-V6E"), ensuring your workload is scheduled on the desired TPU slice. Together, these configurations provide a declarative API for defining your entire distributed JAX training environment, allowing Ray Train to handle the complex task of launching and coordinating workers across a TPU slice. Quickstart ---------- For reference, the final code is as follows: .. testcode:: :skipif: True from ray.train.v2.jax import JaxTrainer from ray.train import ScalingConfig def train_func(): # Your JAX training code here. scaling_config = ScalingConfig(num_workers=4, use_tpu=True, topology="4x4", accelerator_type="TPU-V6E") trainer = JaxTrainer(train_func, scaling_config=scaling_config) result = trainer.fit() 1. `train_func` is the Python code that executes on each distributed training worker. 2. :class:`~ray.train.ScalingConfig` defines the number of distributed training workers and whether to use TPUs. 3. :class:`~ray.train.v2.jax.JaxTrainer` launches the distributed training job. Compare a JAX training script with and without Ray Train. .. tab-set:: .. tab-item:: JAX + Ray Train .. testcode:: :skipif: True import jax import jax.numpy as jnp import optax import ray.train from ray.train.v2.jax import JaxTrainer from ray.train import ScalingConfig def train_func(): """This function is run on each distributed worker.""" key = jax.random.PRNGKey(jax.process_index()) X = jax.random.normal(key, (100, 1)) noise = jax.random.normal(key, (100, 1)) * 0.1 y = 2 * X + 1 + noise def linear_model(params, x): return x @ params['w'] + params['b'] def loss_fn(params, x, y): preds = linear_model(params, x) return jnp.mean((preds - y) ** 2) @jax.jit def train_step(params, opt_state, x, y): loss, grads = jax.value_and_grad(loss_fn)(params, x, y) updates, opt_state = optimizer.update(grads, opt_state) params = optax.apply_updates(params, updates) return params, opt_state, loss # Initialize parameters and optimizer. key, w_key, b_key = jax.random.split(key, 3) params = {'w': jax.random.normal(w_key, (1, 1)), 'b': jax.random.normal(b_key, (1,))} optimizer = optax.adam(learning_rate=0.01) opt_state = optimizer.init(params) # Training loop epochs = 100 for epoch in range(epochs): params, opt_state, loss = train_step(params, opt_state, X, y) # Report metrics back to Ray Train. ray.train.report({"loss": float(loss), "epoch": epoch}) # Define the hardware configuration for your distributed job. scaling_config = ScalingConfig( num_workers=4, use_tpu=True, topology="4x4", accelerator_type="TPU-V6E", placement_strategy="SPREAD" ) # Define and run the JaxTrainer. trainer = JaxTrainer( train_loop_per_worker=train_func, scaling_config=scaling_config, ) result = trainer.fit() print(f"Training finished. Final loss: {result.metrics['loss']:.4f}") .. tab-item:: JAX .. This snippet isn't tested because it doesn't use any Ray code. .. testcode:: :skipif: True import jax import jax.numpy as jnp import optax # In a non-Ray script, you would manually initialize the # distributed environment for multi-host training. # import jax.distributed # jax.distributed.initialize() # Generate synthetic data. key = jax.random.PRNGKey(0) X = jax.random.normal(key, (100, 1)) noise = jax.random.normal(key, (100, 1)) * 0.1 y = 2 * X + 1 + noise # Model and loss function are standard JAX. 
def linear_model(params, x): return x @ params['w'] + params['b'] def loss_fn(params, x, y): preds = linear_model(params, x) return jnp.mean((preds - y) ** 2) @jax.jit def train_step(params, opt_state, x, y): loss, grads = jax.value_and_grad(loss_fn)(params, x, y) updates, opt_state = optimizer.update(grads, opt_state) params = optax.apply_updates(params, updates) return params, opt_state, loss # Initialize parameters and optimizer. key, w_key, b_key = jax.random.split(key, 3) params = {'w': jax.random.normal(w_key, (1, 1)), 'b': jax.random.normal(b_key, (1,))} optimizer = optax.adam(learning_rate=0.01) opt_state = optimizer.init(params) # Training loop epochs = 100 print("Starting training...") for epoch in range(epochs): params, opt_state, loss = train_step(params, opt_state, X, y) if epoch % 10 == 0: print(f"Epoch {epoch}, Loss: {loss:.4f}") print("Training finished.") print(f"Learned parameters: w={params['w'].item():.4f}, b={params['b'].item():.4f}") Set up a training function -------------------------- Ray Train automatically initializes the JAX distributed environment on each TPU worker. To adapt your existing JAX code, you simply need to wrap your training logic in a Python function that can be passed to the `JaxTrainer`. This function is the entry point that Ray will execute on each remote worker. .. code-block:: diff +from ray.train.v2.jax import JaxTrainer +from ray.train import ScalingConfig, report -def main_logic() +def train_func(): """This function is run on each distributed worker.""" # ... (JAX model, data, and training step definitions) ... # Training loop for epoch in range(epochs): params, opt_state, loss = train_step(params, opt_state, X, y) - print(f"Epoch {epoch}, Loss: {loss:.4f}") + # In Ray Train, you can report metrics back to the trainer + report({"loss": float(loss), "epoch": epoch}) -if __name__ == "__main__": - main_logic() +# Define the hardware configuration for your distributed job. +scaling_config = ScalingConfig( + num_workers=4, + use_tpu=True, + topology="4x4", + accelerator_type="TPU-V6E", + placement_strategy="SPREAD" +) + +# Define and run the JaxTrainer, which executes `train_func`. +trainer = JaxTrainer( + train_loop_per_worker=train_func, + scaling_config=scaling_config +) +result = trainer.fit() Configure persistent storage ---------------------------- Create a :class:`~ray.train.RunConfig` object to specify the path where results (including checkpoints and artifacts) will be saved. .. testcode:: from ray.train import RunConfig # Local path (/some/local/path/unique_run_name) run_config = RunConfig(storage_path="/some/local/path", name="unique_run_name") # Shared cloud storage URI (s3://bucket/unique_run_name) run_config = RunConfig(storage_path="s3://bucket", name="unique_run_name") # Shared NFS path (/mnt/nfs/unique_run_name) run_config = RunConfig(storage_path="/mnt/nfs", name="unique_run_name") .. warning:: Specifying a *shared storage location* (such as cloud storage or NFS) is *optional* for single-node clusters, but it is **required for multi-node clusters.** Using a local path will :ref:`raise an error ` during checkpointing for multi-node clusters. For more details, see :ref:`persistent-storage-guide`. Launch a training job --------------------- Tying it all together, you can now launch a distributed training job with a :class:`~ray.train.v2.jax.JaxTrainer`. .. 
testcode:: :skipif: True from ray.train import ScalingConfig train_func = lambda: None scaling_config = ScalingConfig(num_workers=4, use_tpu=True, topology="4x4", accelerator_type="TPU-V6E") run_config = None .. testcode:: :skipif: True from ray.train.v2.jax import JaxTrainer trainer = JaxTrainer( train_func, scaling_config=scaling_config, run_config=run_config ) result = trainer.fit() Access training results ----------------------- After training completes, a :class:`~ray.train.Result` object is returned which contains information about the training run, including the metrics and checkpoints reported during training. .. testcode:: :skipif: True result.metrics # The metrics reported during training. result.checkpoint # The latest checkpoint reported during training. result.path # The path where logs are stored. result.error # The exception that was raised, if training failed. For more usage examples, see :ref:`train-inspect-results`. Next steps ---------- After you have converted your JAX training script to use Ray Train: * See :ref:`User Guides ` to learn more about how to perform specific tasks. * Browse the :doc:`Examples ` for end-to-end examples of how to use Ray Train. * Consult the :ref:`API Reference ` for more details on the classes and methods from this tutorial. --- .. _train-lightgbm: Get Started with Distributed Training using LightGBM ==================================================== This tutorial walks through the process of converting an existing LightGBM script to use Ray Train. Learn how to: 1. Configure a :ref:`training function ` to report metrics and save checkpoints. 2. Configure :ref:`scaling ` and CPU or GPU resource requirements for a training job. 3. Launch a distributed training job with a :class:`~ray.train.lightgbm.LightGBMTrainer`. Quickstart ---------- For reference, the final code will look something like this: .. testcode:: :skipif: True import ray.train from ray.train.lightgbm import LightGBMTrainer def train_func(): # Your LightGBM training code here. ... scaling_config = ray.train.ScalingConfig(num_workers=2, resources_per_worker={"CPU": 4}) trainer = LightGBMTrainer(train_func, scaling_config=scaling_config) result = trainer.fit() 1. `train_func` is the Python code that executes on each distributed training worker. 2. :class:`~ray.train.ScalingConfig` defines the number of distributed training workers and whether to use GPUs. 3. :class:`~ray.train.lightgbm.LightGBMTrainer` launches the distributed training job. Compare a LightGBM training script with and without Ray Train. .. tab-set:: .. tab-item:: LightGBM + Ray Train .. literalinclude:: ./doc_code/lightgbm_quickstart.py :language: python :start-after: __lightgbm_ray_start__ :end-before: __lightgbm_ray_end__ .. tab-item:: LightGBM .. literalinclude:: ./doc_code/lightgbm_quickstart.py :language: python :start-after: __lightgbm_start__ :end-before: __lightgbm_end__ Set up a training function -------------------------- First, update your training code to support distributed training. Begin by wrapping your `native `_ or `scikit-learn estimator `_ LightGBM training code in a :ref:`training function `: .. testcode:: :skipif: True def train_func(): # Your native LightGBM training code here. train_set = ... lightgbm.train(...) Each distributed training worker executes this function. You can also specify the input argument for `train_func` as a dictionary via the Trainer's `train_loop_config`. For example: .. 
testcode:: python :skipif: True def train_func(config): label_column = config["label_column"] num_boost_round = config["num_boost_round"] ... config = {"label_column": "target", "num_boost_round": 100} trainer = ray.train.lightgbm.LightGBMTrainer(train_func, train_loop_config=config, ...) .. warning:: Avoid passing large data objects through `train_loop_config` to reduce the serialization and deserialization overhead. Instead, initialize large objects (e.g. datasets, models) directly in `train_func`. .. code-block:: diff def load_dataset(): # Return a large in-memory dataset ... def load_model(): # Return a large in-memory model instance ... -config = {"data": load_dataset(), "model": load_model()} def train_func(config): - data = config["data"] - model = config["model"] + data = load_dataset() + model = load_model() ... trainer = ray.train.lightgbm.LightGBMTrainer(train_func, train_loop_config=config, ...) Configure distributed training parameters ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ To enable distributed LightGBM training, add network communication parameters to your training configuration using :func:`ray.train.lightgbm.get_network_params`. This function automatically configures the necessary network settings for worker communication: .. code-block:: diff def train_func(): ... params = { # Your LightGBM training parameters here ... + "tree_learner": "data_parallel", + "pre_partition": True, + **ray.train.lightgbm.get_network_params(), } model = lightgbm.train( params, ... ) ... .. note:: Make sure to set ``tree_learner`` to enable distributed training. See the `LightGBM documentation `_ for more details. You should also set ``pre_partition=True`` if using Ray Data to load and shard your dataset, as shown in the quickstart example. Report metrics and save checkpoints ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ To persist your checkpoints and monitor training progress, add a :class:`ray.train.lightgbm.RayTrainReportCallback` utility callback to your Trainer: .. testcode:: python :skipif: True import lightgbm from ray.train.lightgbm import RayTrainReportCallback def train_func(): ... bst = lightgbm.train( ..., callbacks=[ RayTrainReportCallback( metrics=["eval-multi_logloss"], frequency=1 ) ], ) ... Reporting metrics and checkpoints to Ray Train enables :ref:`fault-tolerant training ` and the integration with Ray Tune. Loading data ------------ When running distributed LightGBM training, each worker should use a different shard of the dataset. .. testcode:: python :skipif: True def get_train_dataset(world_rank: int) -> lightgbm.Dataset: # Define logic to get the Dataset shard for this worker rank ... def get_eval_dataset(world_rank: int) -> lightgbm.Dataset: # Define logic to get the Dataset for each worker ... def train_func(): rank = ray.train.get_world_rank() train_set = get_train_dataset(rank) eval_set = get_eval_dataset(rank) ... A common way to do this is to pre-shard the dataset and then assign each worker a different set of files to read. Pre-sharding the dataset is not very flexible to changes in the number of workers, since some workers may be assigned more data than others. For more flexibility, Ray Data provides a solution for sharding the dataset at runtime. Use Ray Data to shard the dataset ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ :ref:`Ray Data ` is a distributed data processing library that allows you to easily shard and distribute your data across multiple workers. First, load your **entire** dataset as a Ray Data Dataset. 
Reference the :ref:`Ray Data Quickstart ` for more details on how to load and preprocess data from different sources. .. testcode:: python :skipif: True train_dataset = ray.data.read_parquet("s3://path/to/entire/train/dataset/dir") eval_dataset = ray.data.read_parquet("s3://path/to/entire/eval/dataset/dir") In the training function, you can access the dataset shards for this worker using :meth:`ray.train.get_dataset_shard`. Convert this into a native `lightgbm.Dataset `_. .. testcode:: python :skipif: True def get_dataset(dataset_name: str) -> lightgbm.Dataset: shard = ray.train.get_dataset_shard(dataset_name) df = shard.materialize().to_pandas() X, y = df.drop("target", axis=1), df["target"] return lightgbm.Dataset(X, label=y) def train_func(): train_set = get_dataset("train") eval_set = get_dataset("eval") ... Finally, pass the dataset to the Trainer. This will automatically shard the dataset across the workers. These keys must match the keys used when calling ``get_dataset_shard`` in the training function. .. testcode:: python :skipif: True trainer = LightGBMTrainer(..., datasets={"train": train_dataset, "eval": eval_dataset}) trainer.fit() For more details, see :ref:`data-ingest-torch`. Configure scale and GPUs ------------------------ Outside of your training function, create a :class:`~ray.train.ScalingConfig` object to configure: 1. :class:`num_workers ` - The number of distributed training worker processes. 2. :class:`use_gpu ` - Whether each worker should use a GPU (or CPU). 3. :class:`resources_per_worker ` - The number of CPUs or GPUs per worker. .. testcode:: from ray.train import ScalingConfig # 4 nodes with 8 CPUs each. scaling_config = ScalingConfig(num_workers=4, resources_per_worker={"CPU": 8}) .. note:: When using Ray Data with Ray Train, be careful not to request all available CPUs in your cluster with the `resources_per_worker` parameter. Ray Data needs CPU resources to execute data preprocessing operations in parallel. If all CPUs are allocated to training workers, Ray Data operations may be bottlenecked, leading to reduced performance. A good practice is to leave some portion of CPU resources available for Ray Data operations. For example, if your cluster has 8 CPUs per node, you might allocate 6 CPUs to training workers and leave 2 CPUs for Ray Data: .. testcode:: # Allocate 6 CPUs per worker, leaving resources for Ray Data operations scaling_config = ScalingConfig(num_workers=4, resources_per_worker={"CPU": 6}) In order to use GPUs, you will need to set the `use_gpu` parameter to `True` in your :class:`~ray.train.ScalingConfig` object. This will request and assign a single GPU per worker. .. testcode:: # 1 node with 8 CPUs and 4 GPUs each. scaling_config = ScalingConfig(num_workers=4, use_gpu=True) # 4 nodes with 8 CPUs and 4 GPUs each. scaling_config = ScalingConfig(num_workers=16, use_gpu=True) When using GPUs, you will also need to update your training function to use the assigned GPU. This can be done by setting the `"device"` parameter as `"gpu"`. For more details on LightGBM's GPU support, see the `LightGBM GPU documentation `__. .. code-block:: diff def train_func(): ... params = { ..., + "device": "gpu", } bst = lightgbm.train( params, ... ) Configure persistent storage ---------------------------- Create a :class:`~ray.train.RunConfig` object to specify the path where results (including checkpoints and artifacts) will be saved. .. 
testcode:: from ray.train import RunConfig # Local path (/some/local/path/unique_run_name) run_config = RunConfig(storage_path="/some/local/path", name="unique_run_name") # Shared cloud storage URI (s3://bucket/unique_run_name) run_config = RunConfig(storage_path="s3://bucket", name="unique_run_name") # Shared NFS path (/mnt/nfs/unique_run_name) run_config = RunConfig(storage_path="/mnt/nfs", name="unique_run_name") .. warning:: Specifying a *shared storage location* (such as cloud storage or NFS) is *optional* for single-node clusters, but it is **required for multi-node clusters.** Using a local path will :ref:`raise an error ` during checkpointing for multi-node clusters. For more details, see :ref:`persistent-storage-guide`. Launch a training job --------------------- Tying it all together, you can now launch a distributed training job with a :class:`~ray.train.lightgbm.LightGBMTrainer`. .. testcode:: :hide: from ray.train import ScalingConfig train_func = lambda: None scaling_config = ScalingConfig(num_workers=1) run_config = None .. testcode:: from ray.train.lightgbm import LightGBMTrainer trainer = LightGBMTrainer( train_func, scaling_config=scaling_config, run_config=run_config ) result = trainer.fit() Access training results ----------------------- After training completes, a :class:`~ray.train.Result` object is returned which contains information about the training run, including the metrics and checkpoints reported during training. .. testcode:: result.metrics # The metrics reported during training. result.checkpoint # The latest checkpoint reported during training. result.path # The path where logs are stored. result.error # The exception that was raised, if training failed. For more usage examples, see :ref:`train-inspect-results`. Next steps ---------- After you have converted your LightGBM training script to use Ray Train: * See :ref:`User Guides ` to learn more about how to perform specific tasks. * Browse the :doc:`Examples ` for end-to-end examples of how to use Ray Train. * Consult the :ref:`API Reference ` for more details on the classes and methods from this tutorial. --- .. _train-pytorch-lightning: Get Started with Distributed Training using PyTorch Lightning ============================================================= This tutorial walks through the process of converting an existing PyTorch Lightning script to use Ray Train. Learn how to: 1. Configure the Lightning Trainer so that it runs distributed with Ray and on the correct CPU or GPU device. 2. Configure :ref:`training function ` to report metrics and save checkpoints. 3. Configure :ref:`scaling ` and CPU or GPU resource requirements for a training job. 4. Launch a distributed training job with a :class:`~ray.train.torch.TorchTrainer`. Quickstart ---------- For reference, the final code is as follows: .. testcode:: :skipif: True from ray.train.torch import TorchTrainer from ray.train import ScalingConfig def train_func(): # Your PyTorch Lightning training code here. scaling_config = ScalingConfig(num_workers=2, use_gpu=True) trainer = TorchTrainer(train_func, scaling_config=scaling_config) result = trainer.fit() 1. `train_func` is the Python code that executes on each distributed training worker. 2. :class:`~ray.train.ScalingConfig` defines the number of distributed training workers and whether to use GPUs. 3. :class:`~ray.train.torch.TorchTrainer` launches the distributed training job. Compare a PyTorch Lightning training script with and without Ray Train. .. tab-set:: .. 
tab-item:: PyTorch Lightning + Ray Train .. code-block:: python :emphasize-lines: 11-12, 38, 52-57, 59, 63, 66-73 import os import tempfile import torch from torch.utils.data import DataLoader from torchvision.models import resnet18 from torchvision.datasets import FashionMNIST from torchvision.transforms import ToTensor, Normalize, Compose import lightning.pytorch as pl import ray.train.lightning from ray.train.torch import TorchTrainer # Model, Loss, Optimizer class ImageClassifier(pl.LightningModule): def __init__(self): super(ImageClassifier, self).__init__() self.model = resnet18(num_classes=10) self.model.conv1 = torch.nn.Conv2d( 1, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False ) self.criterion = torch.nn.CrossEntropyLoss() def forward(self, x): return self.model(x) def training_step(self, batch, batch_idx): x, y = batch outputs = self.forward(x) loss = self.criterion(outputs, y) self.log("loss", loss, on_step=True, prog_bar=True) return loss def configure_optimizers(self): return torch.optim.Adam(self.model.parameters(), lr=0.001) def train_func(): # Data transform = Compose([ToTensor(), Normalize((0.28604,), (0.32025,))]) data_dir = os.path.join(tempfile.gettempdir(), "data") train_data = FashionMNIST(root=data_dir, train=True, download=True, transform=transform) train_dataloader = DataLoader(train_data, batch_size=128, shuffle=True) # Training model = ImageClassifier() # [1] Configure PyTorch Lightning Trainer. trainer = pl.Trainer( max_epochs=10, devices="auto", accelerator="auto", strategy=ray.train.lightning.RayDDPStrategy(), plugins=[ray.train.lightning.RayLightningEnvironment()], callbacks=[ray.train.lightning.RayTrainReportCallback()], # [1a] Optionally, disable the default checkpointing behavior # in favor of the `RayTrainReportCallback` above. enable_checkpointing=False, ) trainer = ray.train.lightning.prepare_trainer(trainer) trainer.fit(model, train_dataloaders=train_dataloader) # [2] Configure scaling and resource requirements. scaling_config = ray.train.ScalingConfig(num_workers=2, use_gpu=True) # [3] Launch distributed training job. trainer = TorchTrainer( train_func, scaling_config=scaling_config, # [3a] If running in a multi-node cluster, this is where you # should configure the run's persistent storage that is accessible # across all worker nodes. # run_config=ray.train.RunConfig(storage_path="s3://..."), ) result: ray.train.Result = trainer.fit() # [4] Load the trained model. with result.checkpoint.as_directory() as checkpoint_dir: model = ImageClassifier.load_from_checkpoint( os.path.join( checkpoint_dir, ray.train.lightning.RayTrainReportCallback.CHECKPOINT_NAME, ), ) .. tab-item:: PyTorch Lightning .. This snippet isn't tested because it doesn't use any Ray code. .. 
testcode:: :skipif: True import torch from torchvision.models import resnet18 from torchvision.datasets import FashionMNIST from torchvision.transforms import ToTensor, Normalize, Compose from torch.utils.data import DataLoader import lightning.pytorch as pl # Model, Loss, Optimizer class ImageClassifier(pl.LightningModule): def __init__(self): super(ImageClassifier, self).__init__() self.model = resnet18(num_classes=10) self.model.conv1 = torch.nn.Conv2d( 1, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False ) self.criterion = torch.nn.CrossEntropyLoss() def forward(self, x): return self.model(x) def training_step(self, batch, batch_idx): x, y = batch outputs = self.forward(x) loss = self.criterion(outputs, y) self.log("loss", loss, on_step=True, prog_bar=True) return loss def configure_optimizers(self): return torch.optim.Adam(self.model.parameters(), lr=0.001) # Data transform = Compose([ToTensor(), Normalize((0.28604,), (0.32025,))]) train_data = FashionMNIST(root='./data', train=True, download=True, transform=transform) train_dataloader = DataLoader(train_data, batch_size=128, shuffle=True) # Training model = ImageClassifier() trainer = pl.Trainer(max_epochs=10) trainer.fit(model, train_dataloaders=train_dataloader) Set up a training function -------------------------- .. include:: ./common/torch-configure-train_func.rst Ray Train sets up your distributed process group on each worker. You only need to make a few changes to your Lightning Trainer definition. .. code-block:: diff import lightning.pytorch as pl -from pl.strategies import DDPStrategy -from pl.plugins.environments import LightningEnvironment +import ray.train.lightning def train_func(): ... model = MyLightningModule(...) datamodule = MyLightningDataModule(...) trainer = pl.Trainer( - devices=[0, 1, 2, 3], - strategy=DDPStrategy(), - plugins=[LightningEnvironment()], + devices="auto", + accelerator="auto", + strategy=ray.train.lightning.RayDDPStrategy(), + plugins=[ray.train.lightning.RayLightningEnvironment()] ) + trainer = ray.train.lightning.prepare_trainer(trainer) trainer.fit(model, datamodule=datamodule) The following sections discuss each change. Configure the distributed strategy ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Ray Train offers several sub-classed distributed strategies for Lightning. These strategies retain the same argument list as their base strategy classes. Internally, they configure the root device and the distributed sampler arguments. - :class:`~ray.train.lightning.RayDDPStrategy` - :class:`~ray.train.lightning.RayFSDPStrategy` - :class:`~ray.train.lightning.RayDeepSpeedStrategy` .. code-block:: diff import lightning.pytorch as pl -from pl.strategies import DDPStrategy +import ray.train.lightning def train_func(): ... trainer = pl.Trainer( ... - strategy=DDPStrategy(), + strategy=ray.train.lightning.RayDDPStrategy(), ... ) ... Configure the Ray cluster environment plugin ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Ray Train also provides a :class:`~ray.train.lightning.RayLightningEnvironment` class as a specification for the Ray Cluster. This utility class configures the worker's local, global, and node rank and world size. .. code-block:: diff import lightning.pytorch as pl -from pl.plugins.environments import LightningEnvironment +import ray.train.lightning def train_func(): ... trainer = pl.Trainer( ... - plugins=[LightningEnvironment()], + plugins=[ray.train.lightning.RayLightningEnvironment()], ... ) ... 
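If you want to double-check what ``RayLightningEnvironment`` reports to Lightning, you can print the rank information that Ray Train assigns to each worker. The following is an optional, minimal sketch using :func:`ray.train.get_context`; it isn't required for the integration to work.

.. code-block:: python

    import ray.train

    def train_func():
        # Ray Train exposes the rank and world size values that
        # RayLightningEnvironment forwards to Lightning.
        ctx = ray.train.get_context()
        print(
            f"world_rank={ctx.get_world_rank()}, "
            f"local_rank={ctx.get_local_rank()}, "
            f"node_rank={ctx.get_node_rank()}, "
            f"world_size={ctx.get_world_size()}"
        )
        ...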
Configure parallel devices ^^^^^^^^^^^^^^^^^^^^^^^^^^ Ray Train's TorchTrainer already configures the correct ``CUDA_VISIBLE_DEVICES`` for each worker. Always use all available GPUs by setting ``devices="auto"`` and ``accelerator="auto"``. .. code-block:: diff import lightning.pytorch as pl def train_func(): ... trainer = pl.Trainer( ... - devices=[0,1,2,3], + devices="auto", + accelerator="auto", ... ) ... Report checkpoints and metrics ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ To persist your checkpoints and monitor training progress, add a :class:`ray.train.lightning.RayTrainReportCallback` utility callback to your Trainer. .. code-block:: diff import lightning.pytorch as pl from ray.train.lightning import RayTrainReportCallback def train_func(): ... trainer = pl.Trainer( ... - callbacks=[...], + callbacks=[..., RayTrainReportCallback()], ) ... Reporting metrics and checkpoints to Ray Train enables you to support :ref:`fault-tolerant training ` and :ref:`hyperparameter optimization `. Note that the :class:`ray.train.lightning.RayTrainReportCallback` class only provides a simple implementation, and can be :ref:`further customized `. Prepare your Lightning Trainer ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Finally, pass your Lightning Trainer into :meth:`~ray.train.lightning.prepare_trainer` to validate your configurations. .. code-block:: diff import lightning.pytorch as pl import ray.train.lightning def train_func(): ... trainer = pl.Trainer(...) + trainer = ray.train.lightning.prepare_trainer(trainer) ... .. include:: ./common/torch-configure-run.rst Next steps ---------- After you have converted your PyTorch Lightning training script to use Ray Train: * See :ref:`User Guides ` to learn more about how to perform specific tasks. * Browse the :doc:`Examples ` for end-to-end examples of how to use Ray Train. * Consult the :ref:`API Reference ` for more details on the classes and methods from this tutorial. Version Compatibility --------------------- Ray Train is tested with `pytorch_lightning` versions `1.6.5` and `2.1.2`. For full compatibility, use ``pytorch_lightning>=1.6.5``. Earlier versions aren't prohibited but may result in unexpected issues. If you run into any compatibility issues, consider upgrading your PyTorch Lightning version or `file an issue `_. .. note:: If you are using Lightning 2.x, use the import path `lightning.pytorch.xxx` instead of `pytorch_lightning.xxx`. .. _lightning-trainer-migration-guide: LightningTrainer Migration Guide -------------------------------- Ray 2.4 introduced the `LightningTrainer`, and exposed a `LightningConfigBuilder` to define configurations for `pl.LightningModule` and `pl.Trainer`. It then instantiates the model and trainer objects and runs a pre-defined training function in a black box. This version of the LightningTrainer API was constraining and limited your ability to manage the training functionality. Ray 2.7 introduced the newly unified :class:`~ray.train.torch.TorchTrainer` API, which offers enhanced transparency, flexibility, and simplicity. This API is more aligned with standard PyTorch Lightning scripts, giving you better control over your native Lightning code. .. tab-set:: .. tab-item:: (Deprecating) LightningTrainer .. This snippet isn't tested because it raises a hard deprecation warning. ..
testcode:: :skipif: True from ray.train.lightning import LightningConfigBuilder, LightningTrainer config_builder = LightningConfigBuilder() # [1] Collect model configs config_builder.module(cls=MyLightningModule, lr=1e-3, feature_dim=128) # [2] Collect checkpointing configs config_builder.checkpointing(monitor="val_accuracy", mode="max", save_top_k=3) # [3] Collect pl.Trainer configs config_builder.trainer( max_epochs=10, accelerator="gpu", log_every_n_steps=100, ) # [4] Build datasets on the head node datamodule = MyLightningDataModule(batch_size=32) config_builder.fit_params(datamodule=datamodule) # [5] Execute the internal training function in a black box ray_trainer = LightningTrainer( lightning_config=config_builder.build(), scaling_config=ScalingConfig(num_workers=4, use_gpu=True), run_config=RunConfig( checkpoint_config=CheckpointConfig( num_to_keep=3, checkpoint_score_attribute="val_accuracy", checkpoint_score_order="max", ), ) ) result = ray_trainer.fit() # [6] Load the trained model from an opaque Lightning-specific checkpoint. lightning_checkpoint = result.checkpoint model = lightning_checkpoint.get_model(MyLightningModule) .. tab-item:: (New API) TorchTrainer .. This snippet isn't tested because it runs with 4 GPUs, and CI is only run with 1. .. testcode:: :skipif: True import os import lightning.pytorch as pl import ray.train from ray.train.torch import TorchTrainer from ray.train.lightning import ( RayDDPStrategy, RayLightningEnvironment, RayTrainReportCallback, prepare_trainer ) def train_func(): # [1] Create a Lightning model model = MyLightningModule(lr=1e-3, feature_dim=128) # [2] Report Checkpoint with callback ckpt_report_callback = RayTrainReportCallback() # [3] Create a Lighting Trainer trainer = pl.Trainer( max_epochs=10, log_every_n_steps=100, # New configurations below devices="auto", accelerator="auto", strategy=RayDDPStrategy(), plugins=[RayLightningEnvironment()], callbacks=[ckpt_report_callback], ) # Validate your Lightning trainer configuration trainer = prepare_trainer(trainer) # [4] Build your datasets on each worker datamodule = MyLightningDataModule(batch_size=32) trainer.fit(model, datamodule=datamodule) # [5] Explicitly define and run the training function ray_trainer = TorchTrainer( train_func, scaling_config=ray.train.ScalingConfig(num_workers=4, use_gpu=True), run_config=ray.train.RunConfig( checkpoint_config=ray.train.CheckpointConfig( num_to_keep=3, checkpoint_score_attribute="val_accuracy", checkpoint_score_order="max", ), ) ) result = ray_trainer.fit() # [6] Load the trained model from a simplified checkpoint interface. checkpoint: ray.train.Checkpoint = result.checkpoint with checkpoint.as_directory() as checkpoint_dir: print("Checkpoint contents:", os.listdir(checkpoint_dir)) checkpoint_path = os.path.join(checkpoint_dir, "checkpoint.ckpt") model = MyLightningModule.load_from_checkpoint(checkpoint_path) --- .. _train-pytorch: Get Started with Distributed Training using PyTorch =================================================== This tutorial walks through the process of converting an existing PyTorch script to use Ray Train. Learn how to: 1. Configure a model to run distributed and on the correct CPU/GPU device. 2. Configure a dataloader to shard data across the :ref:`workers ` and place data on the correct CPU or GPU device. 3. Configure a :ref:`training function ` to report metrics and save checkpoints. 4. Configure :ref:`scaling ` and CPU or GPU resource requirements for a training job. 5. 
Launch a distributed training job with a :class:`~ray.train.torch.TorchTrainer` class. Quickstart ---------- For reference, the final code will look something like the following: .. testcode:: :skipif: True from ray.train.torch import TorchTrainer from ray.train import ScalingConfig def train_func(): # Your PyTorch training code here. ... scaling_config = ScalingConfig(num_workers=2, use_gpu=True) trainer = TorchTrainer(train_func, scaling_config=scaling_config) result = trainer.fit() 1. `train_func` is the Python code that executes on each distributed training worker. 2. :class:`~ray.train.ScalingConfig` defines the number of distributed training workers and whether to use GPUs. 3. :class:`~ray.train.torch.TorchTrainer` launches the distributed training job. Compare a PyTorch training script with and without Ray Train. .. tab-set:: .. tab-item:: PyTorch + Ray Train .. code-block:: python :emphasize-lines: 12, 14, 21, 32, 36-37, 55-58, 59, 63, 66-73 import os import tempfile import torch from torch.nn import CrossEntropyLoss from torch.optim import Adam from torch.utils.data import DataLoader from torchvision.models import resnet18 from torchvision.datasets import FashionMNIST from torchvision.transforms import ToTensor, Normalize, Compose import ray.train.torch def train_func(): # Model, Loss, Optimizer model = resnet18(num_classes=10) model.conv1 = torch.nn.Conv2d( 1, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False ) # [1] Prepare model. model = ray.train.torch.prepare_model(model) # model.to("cuda") # This is done by `prepare_model` criterion = CrossEntropyLoss() optimizer = Adam(model.parameters(), lr=0.001) # Data transform = Compose([ToTensor(), Normalize((0.28604,), (0.32025,))]) data_dir = os.path.join(tempfile.gettempdir(), "data") train_data = FashionMNIST(root=data_dir, train=True, download=True, transform=transform) train_loader = DataLoader(train_data, batch_size=128, shuffle=True) # [2] Prepare dataloader. train_loader = ray.train.torch.prepare_data_loader(train_loader) # Training for epoch in range(10): if ray.train.get_context().get_world_size() > 1: train_loader.sampler.set_epoch(epoch) for images, labels in train_loader: # This is done by `prepare_data_loader`! # images, labels = images.to("cuda"), labels.to("cuda") outputs = model(images) loss = criterion(outputs, labels) optimizer.zero_grad() loss.backward() optimizer.step() # [3] Report metrics and checkpoint. metrics = {"loss": loss.item(), "epoch": epoch} with tempfile.TemporaryDirectory() as temp_checkpoint_dir: torch.save( model.module.state_dict(), os.path.join(temp_checkpoint_dir, "model.pt") ) ray.train.report( metrics, checkpoint=ray.train.Checkpoint.from_directory(temp_checkpoint_dir), ) if ray.train.get_context().get_world_rank() == 0: print(metrics) # [4] Configure scaling and resource requirements. scaling_config = ray.train.ScalingConfig(num_workers=2, use_gpu=True) # [5] Launch distributed training job. trainer = ray.train.torch.TorchTrainer( train_func, scaling_config=scaling_config, # [5a] If running in a multi-node cluster, this is where you # should configure the run's persistent storage that is accessible # across all worker nodes. # run_config=ray.train.RunConfig(storage_path="s3://..."), ) result = trainer.fit() # [6] Load the trained model. 
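# `result.checkpoint` is the latest checkpoint reported with `ray.train.report` above.
# `as_directory()` makes its contents (here, `model.pt`) available as a local directory,
# downloading them first if the run used remote persistent storage.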
with result.checkpoint.as_directory() as checkpoint_dir: model_state_dict = torch.load(os.path.join(checkpoint_dir, "model.pt")) model = resnet18(num_classes=10) model.conv1 = torch.nn.Conv2d( 1, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False ) model.load_state_dict(model_state_dict) .. tab-item:: PyTorch .. This snippet isn't tested because it doesn't use any Ray code. .. testcode:: :skipif: True import os import tempfile import torch from torch.nn import CrossEntropyLoss from torch.optim import Adam from torch.utils.data import DataLoader from torchvision.models import resnet18 from torchvision.datasets import FashionMNIST from torchvision.transforms import ToTensor, Normalize, Compose # Model, Loss, Optimizer model = resnet18(num_classes=10) model.conv1 = torch.nn.Conv2d( 1, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False ) model.to("cuda") criterion = CrossEntropyLoss() optimizer = Adam(model.parameters(), lr=0.001) # Data transform = Compose([ToTensor(), Normalize((0.28604,), (0.32025,))]) train_data = FashionMNIST(root='./data', train=True, download=True, transform=transform) train_loader = DataLoader(train_data, batch_size=128, shuffle=True) # Training for epoch in range(10): for images, labels in train_loader: images, labels = images.to("cuda"), labels.to("cuda") outputs = model(images) loss = criterion(outputs, labels) optimizer.zero_grad() loss.backward() optimizer.step() metrics = {"loss": loss.item(), "epoch": epoch} checkpoint_dir = tempfile.mkdtemp() checkpoint_path = os.path.join(checkpoint_dir, "model.pt") torch.save(model.state_dict(), checkpoint_path) print(metrics) Set up a training function -------------------------- .. include:: ./common/torch-configure-train_func.rst Set up a model ^^^^^^^^^^^^^^ Use the :func:`ray.train.torch.prepare_model` utility function to: 1. Move your model to the correct device. 2. Wrap it in ``DistributedDataParallel``. .. code-block:: diff -from torch.nn.parallel import DistributedDataParallel +import ray.train.torch def train_func(): ... # Create model. model = ... # Set up distributed training and device placement. - device_id = ... # Your logic to get the right device. - model = model.to(device_id or "cpu") - model = DistributedDataParallel(model, device_ids=[device_id]) + model = ray.train.torch.prepare_model(model) ... Set up a dataset ^^^^^^^^^^^^^^^^ .. TODO: Update this to use Ray Data. Use the :func:`ray.train.torch.prepare_data_loader` utility function, which: 1. Adds a :class:`~torch.utils.data.distributed.DistributedSampler` to your :class:`~torch.utils.data.DataLoader`. 2. Moves the batches to the right device. Note that this step isn't necessary if you're passing in Ray Data to your Trainer. See :ref:`data-ingest-torch`. .. code-block:: diff from torch.utils.data import DataLoader +import ray.train.torch def train_func(): ... dataset = ... data_loader = DataLoader(dataset, batch_size=worker_batch_size, shuffle=True) + data_loader = ray.train.torch.prepare_data_loader(data_loader) for epoch in range(10): + if ray.train.get_context().get_world_size() > 1: + data_loader.sampler.set_epoch(epoch) for X, y in data_loader: - X = X.to_device(device) - y = y.to_device(device) ... .. tip:: Keep in mind that ``DataLoader`` takes in a ``batch_size`` which is the batch size for each worker. The global batch size can be calculated from the worker batch size (and vice-versa) with the following equation: .. 
testcode:: :skipif: True global_batch_size = worker_batch_size * ray.train.get_context().get_world_size() .. note:: If you already manually set up your ``DataLoader`` with a ``DistributedSampler``, :meth:`~ray.train.torch.prepare_data_loader` will not add another one, and will respect the configuration of the existing sampler. .. note:: :class:`~torch.utils.data.distributed.DistributedSampler` does not work with a ``DataLoader`` that wraps :class:`~torch.utils.data.IterableDataset`. If you want to work with an dataset iterator, consider using :ref:`Ray Data ` instead of PyTorch DataLoader since it provides performant streaming data ingestion for large scale datasets. See :ref:`data-ingest-torch` for more details. Report checkpoints and metrics ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ To monitor progress, you can report intermediate metrics and checkpoints using the :func:`ray.train.report` utility function. .. code-block:: diff +import os +import tempfile +import ray.train def train_func(): ... with tempfile.TemporaryDirectory() as temp_checkpoint_dir: torch.save( model.state_dict(), os.path.join(temp_checkpoint_dir, "model.pt") ) + metrics = {"loss": loss.item()} # Training/validation metrics. # Build a Ray Train checkpoint from a directory + checkpoint = ray.train.Checkpoint.from_directory(temp_checkpoint_dir) # Ray Train will automatically save the checkpoint to persistent storage, # so the local `temp_checkpoint_dir` can be safely cleaned up after. + ray.train.report(metrics=metrics, checkpoint=checkpoint) ... For more details, see :ref:`train-monitoring-and-logging` and :ref:`train-checkpointing`. .. include:: ./common/torch-configure-run.rst Next steps ---------- After you have converted your PyTorch training script to use Ray Train: * See :ref:`User Guides ` to learn more about how to perform specific tasks. * Browse the :doc:`Examples ` for end-to-end examples of how to use Ray Train. * Dive into the :ref:`API Reference ` for more details on the classes and methods used in this tutorial. --- .. _train-pytorch-transformers: Get Started with Distributed Training using Hugging Face Transformers ===================================================================== This tutorial shows you how to convert an existing Hugging Face Transformers script to use Ray Train for distributed training. In this guide, learn how to: 1. Configure a :ref:`training function ` that properly reports metrics and saves checkpoints. 2. Configure :ref:`scaling ` and resource requirements for CPUs or GPUs for your distributed training job. 3. Launch a distributed training job with :class:`~ray.train.torch.TorchTrainer`. Requirements ------------ Install the necessary packages before you begin: .. code-block:: bash pip install "ray[train]" torch "transformers[torch]" datasets evaluate numpy scikit-learn Quickstart ---------- Here's a quick overview of the final code structure: .. testcode:: :skipif: True from ray.train.torch import TorchTrainer from ray.train import ScalingConfig def train_func(): # Your Transformers training code here ... scaling_config = ScalingConfig(num_workers=2, use_gpu=True) trainer = TorchTrainer(train_func, scaling_config=scaling_config) result = trainer.fit() The key components are: 1. `train_func`: Python code that runs on each distributed training worker. 2. :class:`~ray.train.ScalingConfig`: Defines the number of distributed training workers and GPU usage. 3. :class:`~ray.train.torch.TorchTrainer`: Launches and manages the distributed training job. 
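If you need finer control over worker resources than the quickstart shows, :class:`~ray.train.ScalingConfig` also accepts per-worker resource requests. The following is a brief sketch; the worker counts and resource amounts are only illustrative:

.. code-block:: python

    from ray.train import ScalingConfig

    # CPU-only training with 4 CPUs reserved for each worker.
    scaling_config = ScalingConfig(num_workers=2, resources_per_worker={"CPU": 4})

    # GPU training: Ray Train requests and assigns one GPU per worker.
    scaling_config = ScalingConfig(num_workers=2, use_gpu=True)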
Code Comparison: Hugging Face Transformers vs. Ray Train Integration -------------------------------------------------------------------- Compare a standard Hugging Face Transformers script with its Ray Train equivalent: .. tab-set:: .. tab-item:: Hugging Face Transformers + Ray Train .. code-block:: python :emphasize-lines: 13-15, 21, 67-68, 72, 80-87 import os import numpy as np import evaluate from datasets import load_dataset from transformers import ( Trainer, TrainingArguments, AutoTokenizer, AutoModelForSequenceClassification, ) import ray.train.huggingface.transformers from ray.train import ScalingConfig from ray.train.torch import TorchTrainer # [1] Encapsulate data preprocessing, training, and evaluation # logic in a training function # ============================================================ def train_func(): # Datasets dataset = load_dataset("yelp_review_full") tokenizer = AutoTokenizer.from_pretrained("bert-base-cased") def tokenize_function(examples): return tokenizer(examples["text"], padding="max_length", truncation=True) small_train_dataset = ( dataset["train"].select(range(100)).map(tokenize_function, batched=True) ) small_eval_dataset = ( dataset["test"].select(range(100)).map(tokenize_function, batched=True) ) # Model model = AutoModelForSequenceClassification.from_pretrained( "bert-base-cased", num_labels=5 ) # Evaluation Metrics metric = evaluate.load("accuracy") def compute_metrics(eval_pred): logits, labels = eval_pred predictions = np.argmax(logits, axis=-1) return metric.compute(predictions=predictions, references=labels) # Hugging Face Trainer training_args = TrainingArguments( output_dir="test_trainer", evaluation_strategy="epoch", save_strategy="epoch", report_to="none", ) trainer = Trainer( model=model, args=training_args, train_dataset=small_train_dataset, eval_dataset=small_eval_dataset, compute_metrics=compute_metrics, ) # [2] Report Metrics and Checkpoints to Ray Train # =============================================== callback = ray.train.huggingface.transformers.RayTrainReportCallback() trainer.add_callback(callback) # [3] Prepare Transformers Trainer # ================================ trainer = ray.train.huggingface.transformers.prepare_trainer(trainer) # Start Training trainer.train() # [4] Define a Ray TorchTrainer to launch `train_func` on all workers # =================================================================== ray_trainer = TorchTrainer( train_func, scaling_config=ScalingConfig(num_workers=2, use_gpu=True), # [4a] For multi-node clusters, configure persistent storage that is # accessible across all worker nodes # run_config=ray.train.RunConfig(storage_path="s3://..."), ) result: ray.train.Result = ray_trainer.fit() # [5] Load the trained model with result.checkpoint.as_directory() as checkpoint_dir: checkpoint_path = os.path.join( checkpoint_dir, ray.train.huggingface.transformers.RayTrainReportCallback.CHECKPOINT_NAME, ) model = AutoModelForSequenceClassification.from_pretrained(checkpoint_path) .. tab-item:: Hugging Face Transformers .. This snippet isn't tested because it doesn't use any Ray code. .. 
testcode:: :skipif: True # Adapted from Hugging Face tutorial: https://huggingface.co/docs/transformers/training import numpy as np import evaluate from datasets import load_dataset from transformers import ( Trainer, TrainingArguments, AutoTokenizer, AutoModelForSequenceClassification, ) # Datasets dataset = load_dataset("yelp_review_full") tokenizer = AutoTokenizer.from_pretrained("bert-base-cased") def tokenize_function(examples): return tokenizer(examples["text"], padding="max_length", truncation=True) small_train_dataset = dataset["train"].select(range(100)).map(tokenize_function, batched=True) small_eval_dataset = dataset["test"].select(range(100)).map(tokenize_function, batched=True) # Model model = AutoModelForSequenceClassification.from_pretrained( "bert-base-cased", num_labels=5 ) # Metrics metric = evaluate.load("accuracy") def compute_metrics(eval_pred): logits, labels = eval_pred predictions = np.argmax(logits, axis=-1) return metric.compute(predictions=predictions, references=labels) # Hugging Face Trainer training_args = TrainingArguments( output_dir="test_trainer", evaluation_strategy="epoch", report_to="none" ) trainer = Trainer( model=model, args=training_args, train_dataset=small_train_dataset, eval_dataset=small_eval_dataset, compute_metrics=compute_metrics, ) # Start Training trainer.train() Set up a training function -------------------------- .. include:: ./common/torch-configure-train_func.rst Ray Train sets up the distributed process group on each worker before entering the training function. Put all your logic into this function, including: - Dataset construction and preprocessing - Model initialization - Transformers trainer definition .. note:: When using Hugging Face Datasets or Evaluate, always call ``datasets.load_dataset`` and ``evaluate.load`` inside the training function. Don't pass loaded datasets and metrics from outside the training function, as this can cause serialization errors when transferring objects to workers. Report checkpoints and metrics ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ To persist checkpoints and monitor training progress, add a :class:`ray.train.huggingface.transformers.RayTrainReportCallback` utility callback to your Trainer: .. code-block:: diff import transformers from ray.train.huggingface.transformers import RayTrainReportCallback def train_func(): ... trainer = transformers.Trainer(...) + trainer.add_callback(RayTrainReportCallback()) ... Reporting metrics and checkpoints to Ray Train enables integration with Ray Tune and :ref:`fault-tolerant training `. The :class:`ray.train.huggingface.transformers.RayTrainReportCallback` provides a basic implementation, and you can :ref:`customize it ` to fit your needs. Prepare a Transformers Trainer ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Pass your Transformers Trainer into :meth:`~ray.train.huggingface.transformers.prepare_trainer` to validate configurations and enable Ray Data integration: .. code-block:: diff import transformers import ray.train.huggingface.transformers def train_func(): ... trainer = transformers.Trainer(...) + trainer = ray.train.huggingface.transformers.prepare_trainer(trainer) trainer.train() ... .. include:: ./common/torch-configure-run.rst Next steps ---------- Now that you've converted your Hugging Face Transformers script to use Ray Train: * Explore :ref:`User Guides ` to learn about specific tasks * Browse the :doc:`Examples ` for end-to-end Ray Train applications * Consult the :ref:`API Reference ` for detailed information on the classes and methods .. 
_transformers-trainer-migration-guide: TransformersTrainer Migration Guide ----------------------------------- Ray 2.1 introduced `TransformersTrainer` with a `trainer_init_per_worker` interface to define `transformers.Trainer` and execute a pre-defined training function. Ray 2.7 introduced the unified :class:`~ray.train.torch.TorchTrainer` API, which offers better transparency, flexibility, and simplicity. This API aligns more closely with standard Hugging Face Transformers scripts, giving you better control over your training code. .. tab-set:: .. tab-item:: (Deprecating) TransformersTrainer .. This snippet isn't tested because it contains skeleton code. .. testcode:: :skipif: True import transformers from transformers import AutoConfig, AutoModelForCausalLM from datasets import load_dataset import ray from ray.train.huggingface import TransformersTrainer from ray.train import ScalingConfig from huggingface_hub import HfFileSystem # Load datasets using HfFileSystem path = "hf://datasets/Salesforce/wikitext/wikitext-2-raw-v1/" fs = HfFileSystem() # List the parquet files for each split all_files = [f["name"] for f in fs.ls(path)] train_files = [f for f in all_files if "train" in f and f.endswith(".parquet")] validation_files = [f for f in all_files if "validation" in f and f.endswith(".parquet")] ray_train_ds = ray.data.read_parquet(train_files, filesystem=fs) ray_eval_ds = ray.data.read_parquet(validation_files, filesystem=fs) # Define the Trainer generation function def trainer_init_per_worker(train_dataset, eval_dataset, **config): MODEL_NAME = "gpt2" model_config = AutoConfig.from_pretrained(MODEL_NAME) model = AutoModelForCausalLM.from_config(model_config) args = transformers.TrainingArguments( output_dir=f"{MODEL_NAME}-wikitext2", evaluation_strategy="epoch", save_strategy="epoch", logging_strategy="epoch", learning_rate=2e-5, weight_decay=0.01, max_steps=100, ) return transformers.Trainer( model=model, args=args, train_dataset=train_dataset, eval_dataset=eval_dataset, ) # Build a Ray TransformersTrainer scaling_config = ScalingConfig(num_workers=4, use_gpu=True) ray_trainer = TransformersTrainer( trainer_init_per_worker=trainer_init_per_worker, scaling_config=scaling_config, datasets={"train": ray_train_ds, "validation": ray_eval_ds}, ) result = ray_trainer.fit() .. tab-item:: (New API) TorchTrainer .. This snippet isn't tested because it contains skeleton code. .. 
testcode:: :skipif: True import transformers from transformers import AutoConfig, AutoModelForCausalLM from datasets import load_dataset import ray from ray.train.torch import TorchTrainer from ray.train.huggingface.transformers import ( RayTrainReportCallback, prepare_trainer, ) from ray.train import ScalingConfig from huggingface_hub import HfFileSystem # Load datasets using HfFileSystem path = "hf://datasets/Salesforce/wikitext/wikitext-2-raw-v1/" fs = HfFileSystem() # List the parquet files for each split all_files = [f["name"] for f in fs.ls(path)] train_files = [f for f in all_files if "train" in f and f.endswith(".parquet")] validation_files = [f for f in all_files if "validation" in f and f.endswith(".parquet")] ray_train_ds = ray.data.read_parquet(train_files, filesystem=fs) ray_eval_ds = ray.data.read_parquet(validation_files, filesystem=fs) # [1] Define the full training function # ===================================== def train_func(): MODEL_NAME = "gpt2" model_config = AutoConfig.from_pretrained(MODEL_NAME) model = AutoModelForCausalLM.from_config(model_config) # [2] Build Ray Data iterables # ============================ train_dataset = ray.train.get_dataset_shard("train") eval_dataset = ray.train.get_dataset_shard("validation") train_iterable_ds = train_dataset.iter_torch_batches(batch_size=8) eval_iterable_ds = eval_dataset.iter_torch_batches(batch_size=8) args = transformers.TrainingArguments( output_dir=f"{MODEL_NAME}-wikitext2", evaluation_strategy="epoch", save_strategy="epoch", logging_strategy="epoch", learning_rate=2e-5, weight_decay=0.01, max_steps=100, ) trainer = transformers.Trainer( model=model, args=args, train_dataset=train_iterable_ds, eval_dataset=eval_iterable_ds, ) # [3] Add Ray Train Report Callback # ================================= trainer.add_callback(RayTrainReportCallback()) # [4] Prepare your trainer # ======================== trainer = prepare_trainer(trainer) trainer.train() # Build a Ray TorchTrainer scaling_config = ScalingConfig(num_workers=4, use_gpu=True) ray_trainer = TorchTrainer( train_func, scaling_config=scaling_config, datasets={"train": ray_train_ds, "validation": ray_eval_ds}, ) result = ray_trainer.fit() --- .. _train-xgboost: Get Started with Distributed Training using XGBoost =================================================== This tutorial walks through the process of converting an existing XGBoost script to use Ray Train. Learn how to: 1. Configure a :ref:`training function ` to report metrics and save checkpoints. 2. Configure :ref:`scaling ` and CPU or GPU resource requirements for a training job. 3. Launch a distributed training job with a :class:`~ray.train.xgboost.XGBoostTrainer`. Quickstart ---------- For reference, the final code will look something like this: .. testcode:: :skipif: True import ray.train from ray.train.xgboost import XGBoostTrainer def train_func(): # Your XGBoost training code here. ... scaling_config = ray.train.ScalingConfig(num_workers=2, resources_per_worker={"CPU": 4}) trainer = XGBoostTrainer(train_func, scaling_config=scaling_config) result = trainer.fit() 1. `train_func` is the Python code that executes on each distributed training worker. 2. :class:`~ray.train.ScalingConfig` defines the number of distributed training workers and whether to use GPUs. 3. :class:`~ray.train.xgboost.XGBoostTrainer` launches the distributed training job. Compare a XGBoost training script with and without Ray Train. .. tab-set:: .. tab-item:: XGBoost + Ray Train .. 
literalinclude:: ./doc_code/xgboost_quickstart.py :emphasize-lines: 3-4, 7-8, 11, 15-16, 19-20, 48, 53, 56-64 :language: python :start-after: __xgboost_ray_start__ :end-before: __xgboost_ray_end__ .. tab-item:: XGBoost .. literalinclude:: ./doc_code/xgboost_quickstart.py :language: python :start-after: __xgboost_start__ :end-before: __xgboost_end__ Set up a training function -------------------------- First, update your training code to support distributed training. Begin by wrapping your `native `_ or `scikit-learn estimator `_ XGBoost training code in a :ref:`training function `: .. testcode:: :skipif: True def train_func(): # Your native XGBoost training code here. dmatrix = ... xgboost.train(...) Each distributed training worker executes this function. You can also specify the input argument for `train_func` as a dictionary via the Trainer's `train_loop_config`. For example: .. testcode:: python :skipif: True def train_func(config): label_column = config["label_column"] num_boost_round = config["num_boost_round"] ... config = {"label_column": "y", "num_boost_round": 10} trainer = ray.train.xgboost.XGBoostTrainer(train_func, train_loop_config=config, ...) .. warning:: Avoid passing large data objects through `train_loop_config` to reduce the serialization and deserialization overhead. Instead, initialize large objects (e.g. datasets, models) directly in `train_func`. .. code-block:: diff def load_dataset(): # Return a large in-memory dataset ... def load_model(): # Return a large in-memory model instance ... -config = {"data": load_dataset(), "model": load_model()} def train_func(config): - data = config["data"] - model = config["model"] + data = load_dataset() + model = load_model() ... trainer = ray.train.xgboost.XGBoostTrainer(train_func, train_loop_config=config, ...) Ray Train automatically performs the worker communication setup that is needed to do distributed xgboost training. Report metrics and save checkpoints ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ To persist your checkpoints and monitor training progress, add a :class:`ray.train.xgboost.RayTrainReportCallback` utility callback to your Trainer: .. testcode:: python :skipif: True import xgboost from ray.train.xgboost import RayTrainReportCallback def train_func(): ... bst = xgboost.train( ..., callbacks=[ RayTrainReportCallback( metrics=["eval-logloss"], frequency=1 ) ], ) ... Reporting metrics and checkpoints to Ray Train enables :ref:`fault-tolerant training ` and the integration with Ray Tune. Loading data ------------ When running distributed XGBoost training, each worker should use a different shard of the dataset. .. testcode:: python :skipif: True def get_train_dataset(world_rank: int) -> xgboost.DMatrix: # Define logic to get the DMatrix shard for this worker rank ... def get_eval_dataset(world_rank: int) -> xgboost.DMatrix: # Define logic to get the DMatrix for each worker ... def train_func(): rank = ray.train.get_world_rank() dtrain = get_train_dataset(rank) deval = get_eval_dataset(rank) ... A common way to do this is to pre-shard the dataset and then assign each worker a different set of files to read. Pre-sharding the dataset is not very flexible to changes in the number of workers, since some workers may be assigned more data than others. For more flexibility, Ray Data provides a solution for sharding the dataset at runtime. 
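For illustration, a manual pre-sharding scheme could assign Parquet files to workers round-robin by rank. This is only a sketch: the file list, the ``target`` column, and the pandas-based loading are assumptions rather than a prescribed recipe. The next section shows the more flexible Ray Data approach.

.. code-block:: python

    import pandas as pd
    import xgboost
    import ray.train

    # Hypothetical pre-split files; replace with your own shards.
    TRAIN_FILES = [f"s3://bucket/train/part-{i:04d}.parquet" for i in range(16)]

    def get_train_dataset(world_rank: int, world_size: int) -> xgboost.DMatrix:
        # Round-robin assignment: each worker reads a disjoint subset of files.
        files = TRAIN_FILES[world_rank::world_size]
        df = pd.concat(pd.read_parquet(f) for f in files)
        X, y = df.drop("target", axis=1), df["target"]
        return xgboost.DMatrix(X, label=y)

    def train_func():
        ctx = ray.train.get_context()
        dtrain = get_train_dataset(ctx.get_world_rank(), ctx.get_world_size())
        ...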
Use Ray Data to shard the dataset ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ :ref:`Ray Data ` is a distributed data processing library that allows you to easily shard and distribute your data across multiple workers. First, load your **entire** dataset as a Ray Data Dataset. Reference the :ref:`Ray Data Quickstart ` for more details on how to load and preprocess data from different sources. .. testcode:: python :skipif: True train_dataset = ray.data.read_parquet("s3://path/to/entire/train/dataset/dir") eval_dataset = ray.data.read_parquet("s3://path/to/entire/eval/dataset/dir") In the training function, you can access the dataset shards for this worker using :meth:`ray.train.get_dataset_shard`. Convert this into a native `xgboost.DMatrix `_. .. testcode:: python :skipif: True def get_dmatrix(dataset_name: str) -> xgboost.DMatrix: shard = ray.train.get_dataset_shard(dataset_name) df = shard.materialize().to_pandas() X, y = df.drop("target", axis=1), df["target"] return xgboost.DMatrix(X, label=y) def train_func(): dtrain = get_dmatrix("train") deval = get_dmatrix("eval") ... Finally, pass the dataset to the Trainer. This will automatically shard the dataset across the workers. These keys must match the keys used when calling ``get_dataset_shard`` in the training function. .. testcode:: python :skipif: True trainer = XGBoostTrainer(..., datasets={"train": train_dataset, "eval": eval_dataset}) trainer.fit() For more details, see :ref:`data-ingest-torch`. Configure scale and GPUs ------------------------ Outside of your training function, create a :class:`~ray.train.ScalingConfig` object to configure: 1. :class:`num_workers ` - The number of distributed training worker processes. 2. :class:`use_gpu ` - Whether each worker should use a GPU (or CPU). 3. :class:`resources_per_worker ` - The number of CPUs or GPUs per worker. .. testcode:: from ray.train import ScalingConfig # 4 nodes with 8 CPUs each. scaling_config = ScalingConfig(num_workers=4, resources_per_worker={"CPU": 8}) .. note:: When using Ray Data with Ray Train, be careful not to request all available CPUs in your cluster with the `resources_per_worker` parameter. Ray Data needs CPU resources to execute data preprocessing operations in parallel. If all CPUs are allocated to training workers, Ray Data operations may be bottlenecked, leading to reduced performance. A good practice is to leave some portion of CPU resources available for Ray Data operations. For example, if your cluster has 8 CPUs per node, you might allocate 6 CPUs to training workers and leave 2 CPUs for Ray Data: .. testcode:: # Allocate 6 CPUs per worker, leaving resources for Ray Data operations scaling_config = ScalingConfig(num_workers=4, resources_per_worker={"CPU": 6}) In order to use GPUs, you will need to set the `use_gpu` parameter to `True` in your :class:`~ray.train.ScalingConfig` object. This will request and assign a single GPU per worker. .. testcode:: # 1 node with 8 CPUs and 4 GPUs each. scaling_config = ScalingConfig(num_workers=4, use_gpu=True) # 4 nodes with 8 CPUs and 4 GPUs each. scaling_config = ScalingConfig(num_workers=16, use_gpu=True) When using GPUs, you will also need to update your training function to use the assigned GPU. This can be done by setting the `"device"` parameter as `"cuda"`. For more details on XGBoost's GPU support, see the `XGBoost GPU documentation `__. .. code-block:: diff def train_func(): ... params = { ..., + "device": "cuda", } bst = xgboost.train( params, ... 
) Configure persistent storage ---------------------------- Create a :class:`~ray.train.RunConfig` object to specify the path where results (including checkpoints and artifacts) will be saved. .. testcode:: from ray.train import RunConfig # Local path (/some/local/path/unique_run_name) run_config = RunConfig(storage_path="/some/local/path", name="unique_run_name") # Shared cloud storage URI (s3://bucket/unique_run_name) run_config = RunConfig(storage_path="s3://bucket", name="unique_run_name") # Shared NFS path (/mnt/nfs/unique_run_name) run_config = RunConfig(storage_path="/mnt/nfs", name="unique_run_name") .. warning:: Specifying a *shared storage location* (such as cloud storage or NFS) is *optional* for single-node clusters, but it is **required for multi-node clusters.** Using a local path will :ref:`raise an error ` during checkpointing for multi-node clusters. For more details, see :ref:`persistent-storage-guide`. Launch a training job --------------------- Tying this all together, you can now launch a distributed training job with a :class:`~ray.train.xgboost.XGBoostTrainer`. .. testcode:: :hide: from ray.train import ScalingConfig train_func = lambda: None scaling_config = ScalingConfig(num_workers=1) run_config = None .. testcode:: from ray.train.xgboost import XGBoostTrainer trainer = XGBoostTrainer( train_func, scaling_config=scaling_config, run_config=run_config ) result = trainer.fit() Access training results ----------------------- After training completes, a :class:`~ray.train.Result` object is returned which contains information about the training run, including the metrics and checkpoints reported during training. .. testcode:: result.metrics # The metrics reported during training. result.checkpoint # The latest checkpoint reported during training. result.path # The path where logs are stored. result.error # The exception that was raised, if training failed. For more usage examples, see :ref:`train-inspect-results`. Next steps ---------- After you have converted your XGBoost training script to use Ray Train: * See :ref:`User Guides ` to learn more about how to perform specific tasks. * Browse the :doc:`Examples ` for end-to-end examples of how to use Ray Train. * Consult the :ref:`API Reference ` for more details on the classes and methods from this tutorial. --- .. _train-horovod: Get Started with Distributed Training using Horovod =================================================== Ray Train configures the Horovod environment and Rendezvous server for you, allowing you to run your ``DistributedOptimizer`` training script. See the `Horovod documentation `_ for more information. Quickstart ----------- .. literalinclude:: ./doc_code/hvd_trainer.py :language: python Update your training function ----------------------------- First, update your :ref:`training function ` to support distributed training. If you have a training function that already runs with the `Horovod Ray Executor `_, you shouldn't need to make any additional changes. To onboard onto Horovod, visit the `Horovod guide `_. Create a HorovodTrainer ----------------------- ``Trainer``\s are the primary Ray Train classes to use to manage state and execute training. For Horovod, use a :class:`~ray.train.horovod.HorovodTrainer` that you can setup like this: .. testcode:: :hide: train_func = lambda: None .. testcode:: from ray.train import ScalingConfig from ray.train.horovod import HorovodTrainer # For GPU Training, set `use_gpu` to True. 
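# Setting it to True requests one GPU per worker; False keeps this example runnable on CPU-only machines.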
use_gpu = False trainer = HorovodTrainer( train_func, scaling_config=ScalingConfig(use_gpu=use_gpu, num_workers=2) ) When training with Horovod, always use a HorovodTrainer, irrespective of the training framework, for example, PyTorch or TensorFlow. To customize the backend setup, you can pass a :class:`~ray.train.horovod.HorovodConfig`: .. testcode:: :skipif: True from ray.train import ScalingConfig from ray.train.horovod import HorovodTrainer, HorovodConfig trainer = HorovodTrainer( train_func, tensorflow_backend=HorovodConfig(...), scaling_config=ScalingConfig(num_workers=2), ) For more configurability, see the :py:class:`~ray.train.data_parallel_trainer.DataParallelTrainer` API. Run a training function ----------------------- With a distributed training function and a Ray Train ``Trainer``, you are now ready to start training. .. testcode:: :skipif: True trainer.fit() Further reading --------------- Ray Train's :class:`~ray.train.horovod.HorovodTrainer` replaces the distributed communication backend of the native libraries with its own implementation. Thus, the remaining integration points remain the same. If you're using Horovod with :ref:`PyTorch ` or :ref:`Tensorflow `, refer to the respective guides for further configuration and information. If you are implementing your own Horovod-based training routine without using any of the training libraries, read through the :ref:`User Guides `, as you can apply much of the content to generic use cases and adapt them easily. --- .. _train-hf-accelerate: Get Started with Distributed Training using Hugging Face Accelerate =================================================================== The :class:`~ray.train.torch.TorchTrainer` can help you easily launch your `Accelerate `_ training across a distributed Ray cluster. You only need to run your existing training code with a TorchTrainer. You can expect the final code to look like this: .. testcode:: :skipif: True from accelerate import Accelerator def train_func(): # Instantiate the accelerator accelerator = Accelerator(...) model = ... optimizer = ... train_dataloader = ... eval_dataloader = ... lr_scheduler = ... # Prepare everything for distributed training ( model, optimizer, train_dataloader, eval_dataloader, lr_scheduler, ) = accelerator.prepare( model, optimizer, train_dataloader, eval_dataloader, lr_scheduler ) # Start training ... from ray.train.torch import TorchTrainer from ray.train import ScalingConfig trainer = TorchTrainer( train_func, scaling_config=ScalingConfig(...), # If running in a multi-node cluster, this is where you # should configure the run's persistent storage that is accessible # across all worker nodes. # run_config=ray.train.RunConfig(storage_path="s3://..."), ... ) trainer.fit() .. tip:: Model and data preparation for distributed training is completely handled by the `Accelerator `_ object and its `Accelerator.prepare() `_ method. Unlike with native PyTorch, **don't** call any additional Ray Train utilities like :meth:`~ray.train.torch.prepare_model` or :meth:`~ray.train.torch.prepare_data_loader` in your training function. Configure Accelerate -------------------- In Ray Train, you can set configurations through the `accelerate.Accelerator `_ object in your training function. Below are starter examples for configuring Accelerate. .. tab-set:: .. tab-item:: DeepSpeed For example, to run DeepSpeed with Accelerate, create a `DeepSpeedPlugin `_ from a dictionary: .. 
testcode:: :skipif: True from accelerate import Accelerator, DeepSpeedPlugin DEEPSPEED_CONFIG = { "fp16": { "enabled": True }, "zero_optimization": { "stage": 3, "offload_optimizer": { "device": "cpu", "pin_memory": False }, "overlap_comm": True, "contiguous_gradients": True, "reduce_bucket_size": "auto", "stage3_prefetch_bucket_size": "auto", "stage3_param_persistence_threshold": "auto", "gather_16bit_weights_on_model_save": True, "round_robin_gradients": True }, "gradient_accumulation_steps": "auto", "gradient_clipping": "auto", "steps_per_print": 10, "train_batch_size": "auto", "train_micro_batch_size_per_gpu": "auto", "wall_clock_breakdown": False } def train_func(): # Create a DeepSpeedPlugin from config dict ds_plugin = DeepSpeedPlugin(hf_ds_config=DEEPSPEED_CONFIG) # Initialize Accelerator accelerator = Accelerator( ..., deepspeed_plugin=ds_plugin, ) # Start training ... from ray.train.torch import TorchTrainer from ray.train import ScalingConfig trainer = TorchTrainer( train_func, scaling_config=ScalingConfig(...), run_config=ray.train.RunConfig(storage_path="s3://..."), ... ) trainer.fit() .. tab-item:: FSDP :sync: FSDP For PyTorch FSDP, create a `FullyShardedDataParallelPlugin `_ and pass it to the Accelerator. .. testcode:: :skipif: True from torch.distributed.fsdp.fully_sharded_data_parallel import FullOptimStateDictConfig, FullStateDictConfig from accelerate import Accelerator, FullyShardedDataParallelPlugin def train_func(): fsdp_plugin = FullyShardedDataParallelPlugin( state_dict_config=FullStateDictConfig( offload_to_cpu=False, rank0_only=False ), optim_state_dict_config=FullOptimStateDictConfig( offload_to_cpu=False, rank0_only=False ) ) # Initialize accelerator accelerator = Accelerator( ..., fsdp_plugin=fsdp_plugin, ) # Start training ... from ray.train.torch import TorchTrainer from ray.train import ScalingConfig trainer = TorchTrainer( train_func, scaling_config=ScalingConfig(...), run_config=ray.train.RunConfig(storage_path="s3://..."), ... ) trainer.fit() Note that Accelerate also provides a CLI tool, `"accelerate config"`, to generate a configuration and launch your training job with `"accelerate launch"`. However, it's not necessary here because Ray's `TorchTrainer` already sets up the Torch distributed environment and launches the training function on all workers. Next, see these end-to-end examples below for more details: .. tab-set:: .. tab-item:: Example with Ray Data .. dropdown:: Show Code .. literalinclude:: /../../python/ray/train/examples/accelerate/accelerate_torch_trainer.py :language: python :start-after: __accelerate_torch_basic_example_start__ :end-before: __accelerate_torch_basic_example_end__ .. tab-item:: Example with PyTorch DataLoader .. dropdown:: Show Code .. literalinclude:: /../../python/ray/train/examples/accelerate/accelerate_torch_trainer_no_raydata.py :language: python :start-after: __accelerate_torch_basic_example_no_raydata_start__ :end-before: __accelerate_torch_basic_example_no_raydata_end__ .. seealso:: If you're looking for more advanced use cases, check out this Llama-2 fine-tuning example: - `Fine-tuning Llama-2 series models with Deepspeed, Accelerate, and Ray Train. 
`_ You may also find these user guides helpful: - :ref:`Configuring Scale and GPUs ` - :ref:`Configuration and Persistent Storage ` - :ref:`Saving and Loading Checkpoints ` - :ref:`How to use Ray Data with Ray Train ` AccelerateTrainer Migration Guide --------------------------------- Before Ray 2.7, Ray Train's `AccelerateTrainer` API was the recommended way to run Accelerate code. As a subclass of :class:`TorchTrainer `, the AccelerateTrainer takes in a configuration file generated by ``accelerate config`` and applies it to all workers. Aside from that, the functionality of ``AccelerateTrainer`` is identical to ``TorchTrainer``. However, this caused confusion around whether this was the *only* way to run Accelerate code. Because you can express the full Accelerate functionality with the ``Accelerator`` and ``TorchTrainer`` combination, the plan is to deprecate the ``AccelerateTrainer`` in Ray 2.8, and the recommended approach is to run your Accelerate code directly with ``TorchTrainer``. --- .. _train-more-frameworks: More Frameworks =============== .. toctree:: :hidden: Hugging Face Accelerate Guide DeepSpeed Guide TensorFlow and Keras Guide LightGBM Guide Horovod Guide .. grid:: 1 2 3 4 :gutter: 1 :class-container: container pb-3 .. grid-item-card:: :img-top: /images/accelerate_logo.png :class-img-top: mt-2 w-75 d-block mx-auto fixed-height-img :link: huggingface-accelerate :link-type: doc Hugging Face Accelerate .. grid-item-card:: :img-top: /images/deepspeed_logo.svg :class-img-top: mt-2 w-75 d-block mx-auto fixed-height-img :link: deepspeed :link-type: doc DeepSpeed .. grid-item-card:: :img-top: /images/tf_logo.png :class-img-top: mt-2 w-75 d-block mx-auto fixed-height-img :link: distributed-tensorflow-keras :link-type: doc TensorFlow and Keras .. grid-item-card:: :img-top: /images/lightgbm_logo.png :class-img-top: mt-2 w-75 d-block mx-auto fixed-height-img :link: getting-started-lightgbm :link-type: doc LightGBM .. grid-item-card:: :img-top: /images/horovod.png :class-img-top: mt-2 w-75 d-block mx-auto fixed-height-img :link: horovod :link-type: doc Horovod --- .. _train-key-concepts: .. _train-overview: Ray Train Overview ================== To use Ray Train effectively, you need to understand four main concepts: #. :ref:`Training function `: A Python function that contains your model training logic. #. :ref:`Worker `: A process that runs the training function. #. :ref:`Scaling configuration `: A configuration of the number of workers and compute resources (for example, CPUs or GPUs). #. :ref:`Trainer `: A Python class that ties together the training function, workers, and scaling configuration to execute a distributed training job. .. figure:: images/overview.png :align: center .. _train-overview-training-function: Training function ----------------- The training function is a user-defined Python function that contains the end-to-end model training loop logic. When launching a distributed training job, each worker executes this training function. Ray Train documentation uses the following conventions: #. `train_func` is a user-defined function that contains the training code. #. `train_func` is passed into the Trainer's `train_loop_per_worker` parameter. .. testcode:: def train_func(): """User-defined training function that runs on each distributed worker process. This function typically contains logic for loading the model, loading the dataset, training the model, saving checkpoints, and logging metrics. """ ... ..
_train-overview-worker: Worker ------ Ray Train distributes model training compute to individual worker processes across the cluster. Each worker is a process that executes the `train_func`. The number of workers determines the parallelism of the training job and is configured in the :class:`~ray.train.ScalingConfig`. .. _train-overview-scaling-config: Scaling configuration --------------------- The :class:`~ray.train.ScalingConfig` is the mechanism for defining the scale of the training job. Specify two basic parameters for worker parallelism and compute resources: * :class:`num_workers `: The number of workers to launch for a distributed training job. * :class:`use_gpu `: Whether each worker should use a GPU or CPU. .. testcode:: from ray.train import ScalingConfig # Single worker with a CPU scaling_config = ScalingConfig(num_workers=1, use_gpu=False) # Single worker with a GPU scaling_config = ScalingConfig(num_workers=1, use_gpu=True) # Multiple workers, each with a GPU scaling_config = ScalingConfig(num_workers=4, use_gpu=True) .. _train-overview-trainers: Trainer ------- The Trainer ties the previous three concepts together to launch distributed training jobs. Ray Train provides :ref:`Trainer classes ` for different frameworks. Calling the :meth:`fit() ` method executes the training job by: #. Launching workers as defined by the :ref:`scaling_config `. #. Setting up the framework's distributed environment on all workers. #. Running the `train_func` on all workers. .. testcode:: :hide: def train_func(): pass scaling_config = ScalingConfig(num_workers=1, use_gpu=False) .. testcode:: from ray.train.torch import TorchTrainer trainer = TorchTrainer(train_func, scaling_config=scaling_config) trainer.fit() --- .. _train-docs: Ray Train: Scalable Model Training ================================== .. toctree:: :hidden: Overview PyTorch Guide PyTorch Lightning Guide Hugging Face Transformers Guide XGBoost Guide JAX Guide more-frameworks User Guides Examples Benchmarks api/api .. div:: sd-d-flex-row sd-align-major-center sd-align-minor-center .. div:: sd-w-50 .. raw:: html :file: images/logo.svg Ray Train is a scalable machine learning library for distributed training and fine-tuning. Ray Train allows you to scale model training code from a single machine to a cluster of machines in the cloud, and abstracts away the complexities of distributed computing. Whether you have large models or large datasets, Ray Train is the simplest solution for distributed training. Ray Train provides support for many frameworks: .. list-table:: :widths: 1 1 :header-rows: 1 * - PyTorch Ecosystem - More Frameworks * - PyTorch - TensorFlow * - PyTorch Lightning - Keras * - Hugging Face Transformers - Horovod * - Hugging Face Accelerate - XGBoost * - DeepSpeed - LightGBM Install Ray Train ----------------- To install Ray Train, run: .. code-block:: console $ pip install -U "ray[train]" To learn more about installing Ray and its libraries, see :ref:`Installing Ray `. Get started ----------- .. grid:: 1 2 2 2 :gutter: 1 :class-container: container pb-6 .. grid-item-card:: **Overview** ^^^ Understand the key concepts for distributed training with Ray Train. +++ .. button-ref:: train-overview :color: primary :outline: :expand: Learn the basics .. grid-item-card:: **PyTorch** ^^^ Get started on distributed model training with Ray Train and PyTorch. +++ .. button-ref:: train-pytorch :color: primary :outline: :expand: Try Ray Train with PyTorch .. 
grid-item-card:: **PyTorch Lightning** ^^^ Get started on distributed model training with Ray Train and Lightning. +++ .. button-ref:: train-pytorch-lightning :color: primary :outline: :expand: Try Ray Train with Lightning .. grid-item-card:: **Hugging Face Transformers** ^^^ Get started on distributed model training with Ray Train and Transformers. +++ .. button-ref:: train-pytorch-transformers :color: primary :outline: :expand: Try Ray Train with Transformers .. grid-item-card:: **JAX** ^^^ Get started on distributed model training with Ray Train and JAX. +++ .. button-ref:: train-jax :color: primary :outline: :expand: Try Ray Train with JAX Learn more ---------- .. grid:: 1 2 2 2 :gutter: 1 :class-container: container pb-6 .. grid-item-card:: **More Frameworks** ^^^ Don't see your framework? See these guides. +++ .. button-ref:: train-more-frameworks :color: primary :outline: :expand: Try Ray Train with other frameworks .. grid-item-card:: **User Guides** ^^^ Get how-to instructions for common training tasks with Ray Train. +++ .. button-ref:: train-user-guides :color: primary :outline: :expand: Read how-to guides .. grid-item-card:: **Examples** ^^^ Browse end-to-end code examples for different use cases. +++ .. button-ref:: examples :color: primary :outline: :expand: :ref-type: doc Learn through examples .. grid-item-card:: **API** ^^^ Consult the API Reference for full descriptions of the Ray Train API. +++ .. button-ref:: train-api :color: primary :outline: :expand: Read the API Reference --- :orphan: .. _train-collate-utils: Collate Utilities ================= .. literalinclude:: ../doc_code/collate_utils.py :language: python .. _random-text-generator: Random Text Generator ===================== The following helper functions generate random text samples with labels: .. literalinclude:: ../doc_code/random_text_generator.py :language: python --- .. _train-validating-checkpoints: Validating checkpoints asynchronously ===================================== During training, you may want to validate the model periodically to monitor training progress. The standard way to do this is to periodically switch between training and validation within the training loop. Instead, Ray Train allows you to asynchronously validate the model in a separate Ray task, which has following benefits: * Running validation in parallel without blocking the training loop * Running validation on different hardware than training * Leveraging :ref:`autoscaling ` to launch user-specified machines only for the duration of the validation * Letting training continue immediately after saving a checkpoint with partial metrics (for example, loss) and then receiving validation metrics (for example, accuracy) as soon as they are available. If the initial and validated metrics share the same key, the validated metrics overwrite the initial metrics. Tutorial -------- First, define a ``validate_fn`` that takes a :class:`ray.train.Checkpoint` to validate and an optional ``validate_config`` dictionary. This dictionary can contain arguments needed for validation, such as the validation dataset. Your function should return a dictionary of metrics from that validation. The following is a simple example for teaching purposes only. It is impractical because the validation task always runs on cpu; for a more realistic example, see :ref:`train-distributed-validate-fn`. .. literalinclude:: ../doc_code/asynchronous_validation.py :language: python :start-after: __validate_fn_simple_start__ :end-before: __validate_fn_simple_end__ .. 
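In addition to the included example above, the general shape of a ``validate_fn`` is sketched below. The checkpoint file name ``model.pt`` and the contents of ``validate_config`` are hypothetical; adapt them to your own checkpoint layout and validation logic.

.. code-block:: python

    import os
    import torch
    import ray.train

    def validate_fn(checkpoint: ray.train.Checkpoint, validate_config: dict) -> dict:
        # Access the checkpoint contents as a local directory.
        with checkpoint.as_directory() as checkpoint_dir:
            model_state = torch.load(os.path.join(checkpoint_dir, "model.pt"))

        # Run your validation logic here, for example over a dataset
        # referenced by validate_config["val_dataset_path"].
        accuracy = 0.0  # Placeholder metric computation.

        # Return a dictionary of metrics to associate with the checkpoint.
        return {"accuracy": accuracy}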
warning:: Don't pass large objects to the ``validate_fn`` because Ray Train runs it as a Ray task and serializes all captured variables. Instead, package large objects in the ``Checkpoint`` and access them from shared storage later as explained in :ref:`train-checkpointing`. Next, within your training loop, call :func:`ray.train.report` with ``validate_fn`` and ``validate_config`` as arguments from the rank 0 worker like the following: .. literalinclude:: ../doc_code/asynchronous_validation.py :language: python :start-after: __validate_fn_report_start__ :end-before: __validate_fn_report_end__ Finally, after training is done, you can access your checkpoints and their associated metrics with the :class:`ray.train.Result` object. See :ref:`train-inspect-results` for more details. .. _train-distributed-validate-fn: Write a distributed validation function --------------------------------------- The ``validate_fn`` above runs in a single Ray task, but you can improve its performance by spawning even more Ray tasks or actors. The Ray team recommends doing this with one of the following approaches: * Creating a :class:`ray.train.torch.TorchTrainer` that only does validation, not training. * Using :func:`ray.data.Dataset.map_batches` to calculate metrics on a validation set. Choose an approach ~~~~~~~~~~~~~~~~~~ You should use ``TorchTrainer`` if: * You want to keep your existing validation logic and avoid migrating to Ray Data. The training function API lets you fully customize the validation loop to match your current setup. * Your validation code depends on running within a Torch process group — for example, your metric aggregation logic uses collective communication calls, or your model parallelism setup requires cross-GPU communication during the forward pass. You should use ``map_batches`` if: * You care about validation performance. Preliminary benchmarks show that ``map_batches`` is faster. * You prefer Ray Data’s native metric aggregation APIs over PyTorch, where you must implement aggregation manually using low-level collective operations or rely on third-party libraries such as `torchmetrics `_. Example: validation with Ray Train TorchTrainer ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Here is a ``validate_fn`` that uses a ``TorchTrainer`` to calculate average cross entropy loss on a validation set. Note the following about this example: * It ``report``\s a dummy checkpoint so that the ``TorchTrainer`` keeps the metrics. * While you typically use the ``TorchTrainer`` for training, you can use it solely for validation like in this example. * Because training generally has a higher GPU memory requirement than inference, you can set different resource requirements for training and validation, for example, A100 for training and A10G for validation. .. literalinclude:: ../doc_code/asynchronous_validation.py :language: python :start-after: __validate_fn_torch_trainer_start__ :end-before: __validate_fn_torch_trainer_end__ Example: validation with Ray Data map_batches ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The following is a ``validate_fn`` that uses :func:`ray.data.Dataset.map_batches` to calculate average accuracy on a validation set. To learn more about how to use ``map_batches`` for batch inference, see :ref:`batch_inference_home`. .. 
literalinclude:: ../doc_code/asynchronous_validation.py :language: python :start-after: __validate_fn_map_batches_start__ :end-before: __validate_fn_map_batches_end__ Checkpoint metrics lifecycle ----------------------------- During the training loop the following happens to your checkpoints and metrics : 1. You report a checkpoint with some initial metrics, such as training loss, as well as a ``validate_fn`` and ``validate_config``. 2. Ray Train asynchronously runs your ``validate_fn`` with that checkpoint and ``validate_config`` in a new Ray task. 3. When that validation task completes, Ray Train associates the metrics returned by your ``validate_fn`` with that checkpoint. 4. After training is done, you can access your checkpoints and their associated metrics with the :class:`ray.train.Result` object. See :ref:`train-inspect-results` for more details. .. figure:: ../images/checkpoint_metrics_lifecycle.png How Ray Train populates checkpoint metrics during training and how you access them after training. --- .. _train-checkpointing: Saving and Loading Checkpoints ============================== Ray Train provides a way to snapshot training progress with :class:`Checkpoints `. This is useful for: 1. **Storing the best-performing model weights:** Save your model to persistent storage, and use it for downstream serving or inference. 2. **Fault tolerance:** Handle worker process and node failures in a long-running training job and leverage pre-emptible machines. 3. **Distributed checkpointing:** Ray Train checkpointing can be used to :ref:`upload model shards from multiple workers in parallel. ` .. _train-dl-saving-checkpoints: Saving checkpoints during training ---------------------------------- The :class:`Checkpoint ` is a lightweight interface provided by Ray Train that represents a *directory* that exists on local or remote storage. For example, a checkpoint could point to a directory in cloud storage: ``s3://my-bucket/my-checkpoint-dir``. A locally available checkpoint points to a location on the local filesystem: ``/tmp/my-checkpoint-dir``. Here's how you save a checkpoint in the training loop: 1. Write your model checkpoint to a local directory. - Since a :class:`Checkpoint ` just points to a directory, the contents are completely up to you. - This means that you can use any serialization format you want. - This makes it **easy to use familiar checkpoint utilities provided by training frameworks**, such as ``torch.save``, ``pl.Trainer.save_checkpoint``, Accelerate's ``accelerator.save_model``, Transformers' ``save_pretrained``, ``tf.keras.Model.save``, etc. 2. Create a :class:`Checkpoint ` from the directory using :meth:`Checkpoint.from_directory `. 3. Report the checkpoint to Ray Train using :func:`ray.train.report(metrics, checkpoint=...) `. - The metrics reported alongside the checkpoint are used to :ref:`keep track of the best-performing checkpoints `. - This will **upload the checkpoint to persistent storage** if configured. See :ref:`persistent-storage-guide`. .. figure:: ../images/checkpoint_lifecycle.png The lifecycle of a :class:`~ray.train.Checkpoint`, from being saved locally to disk to being uploaded to persistent storage via ``train.report``. As shown in the figure above, the best practice for saving checkpoints is to first dump the checkpoint to a local temporary directory. Then, the call to ``train.report`` uploads the checkpoint to its final persistent storage location. 
Then, the local temporary directory can be safely cleaned up to free up disk space (e.g., from exiting the ``tempfile.TemporaryDirectory`` context). .. tip:: In standard DDP training, where each worker has a copy of the full-model, you should only save and report a checkpoint from a single worker to prevent redundant uploads. This typically looks like: .. literalinclude:: ../doc_code/checkpoints.py :language: python :start-after: __checkpoint_from_single_worker_start__ :end-before: __checkpoint_from_single_worker_end__ If using parallel training strategies such as DeepSpeed Zero and FSDP, where each worker only has a shard of the full training state, you can save and report a checkpoint from each worker. See :ref:`train-distributed-checkpointing` for an example. Here are a few examples of saving checkpoints with different training frameworks: .. tab-set:: .. tab-item:: Native PyTorch .. literalinclude:: ../doc_code/checkpoints.py :language: python :start-after: __pytorch_save_start__ :end-before: __pytorch_save_end__ .. tip:: You most likely want to unwrap the DDP model before saving it to a checkpoint. ``model.module.state_dict()`` is the state dict without each key having a ``"module."`` prefix. .. tab-item:: PyTorch Lightning Ray Train leverages PyTorch Lightning's ``Callback`` interface to report metrics and checkpoints. We provide a simple callback implementation that reports ``on_train_epoch_end``. Specifically, on each train epoch end, it - collects all the logged metrics from ``trainer.callback_metrics`` - saves a checkpoint via ``trainer.save_checkpoint`` - reports to Ray Train via :func:`ray.train.report(metrics, checkpoint) ` .. literalinclude:: ../doc_code/checkpoints.py :language: python :start-after: __lightning_save_example_start__ :end-before: __lightning_save_example_end__ You can always get the saved checkpoint path from :attr:`result.checkpoint ` and :attr:`result.best_checkpoints `. For more advanced usage (e.g. reporting at different frequency, reporting customized checkpoint files), you can implement your own customized callback. Here is a simple example that reports a checkpoint every 3 epochs: .. literalinclude:: ../doc_code/checkpoints.py :language: python :start-after: __lightning_custom_save_example_start__ :end-before: __lightning_custom_save_example_end__ .. tab-item:: Hugging Face Transformers Ray Train leverages Hugging Face Transformers Trainer's ``Callback`` interface to report metrics and checkpoints. **Option 1: Use Ray Train's default report callback** We provide a simple callback implementation :class:`~ray.train.huggingface.transformers.RayTrainReportCallback` that reports on checkpoint save. You can change the checkpointing frequency by ``save_strategy`` and ``save_steps``. It collects the latest logged metrics and report them together with the latest saved checkpoint. .. literalinclude:: ../doc_code/checkpoints.py :language: python :start-after: __transformers_save_example_start__ :end-before: __transformers_save_example_end__ Note that :class:`~ray.train.huggingface.transformers.RayTrainReportCallback` binds the latest metrics and checkpoints together, so users can properly configure ``logging_strategy``, ``save_strategy`` and ``evaluation_strategy`` to ensure the monitoring metric is logged at the same step as checkpoint saving. For example, the evaluation metrics (``eval_loss`` in this case) are logged during evaluation. If users want to keep the best 3 checkpoints according to ``eval_loss``, they should align the saving and evaluation frequency. 
Below are two examples of valid configurations: .. testcode:: :skipif: True args = TrainingArguments( ..., evaluation_strategy="epoch", save_strategy="epoch", ) args = TrainingArguments( ..., evaluation_strategy="steps", save_strategy="steps", eval_steps=50, save_steps=100, ) # And more ... **Option 2: Implement your customized report callback** If you feel that Ray Train's default :class:`~ray.train.huggingface.transformers.RayTrainReportCallback` is not sufficient for your use case, you can also implement a callback yourself! Below is a example implementation that collects latest metrics and reports on checkpoint save. .. literalinclude:: ../doc_code/checkpoints.py :language: python :start-after: __transformers_custom_save_example_start__ :end-before: __transformers_custom_save_example_end__ You can customize when (``on_save``, ``on_epoch_end``, ``on_evaluate``) and what (customized metrics and checkpoint files) to report by implementing your own Transformers Trainer callback. .. _train-distributed-checkpointing: Saving checkpoints from multiple workers (distributed checkpointing) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ In model parallel training strategies where each worker only has a shard of the full-model, you can save and report checkpoint shards in parallel from each worker. .. figure:: ../images/persistent_storage_checkpoint.png Distributed checkpointing in Ray Train. Each worker uploads its own checkpoint shard to persistent storage independently. Distributed checkpointing is the best practice for saving checkpoints when doing model-parallel training (e.g., DeepSpeed, FSDP, Megatron-LM). There are two major benefits: 1. **It is faster, resulting in less idle time.** Faster checkpointing incentivizes more frequent checkpointing! Each worker can upload its checkpoint shard in parallel, maximizing the network bandwidth of the cluster. Instead of a single node uploading the full model of size ``M``, the cluster distributes the load across ``N`` nodes, each uploading a shard of size ``M / N``. 2. **Distributed checkpointing avoids needing to gather the full model onto a single worker's CPU memory.** This gather operation puts a large CPU memory requirement on the worker that performs checkpointing and is a common source of OOM errors. Here is an example of distributed checkpointing with PyTorch: .. literalinclude:: ../doc_code/checkpoints.py :language: python :start-after: __distributed_checkpointing_start__ :end-before: __distributed_checkpointing_end__ .. note:: Checkpoint files with the same name will collide between workers. You can get around this by adding a rank-specific suffix to checkpoint files. Note that having filename collisions does not error, but it will result in the last uploaded version being the one that is persisted. This is fine if the file contents are the same across all workers. Model shard saving utilities provided by frameworks such as DeepSpeed will create rank-specific filenames already, so you usually do not need to worry about this. .. _train-checkpoint-upload-modes: Checkpoint upload modes ----------------------- By default, when you call :func:`~ray.train.report`, Ray Train synchronously pushes your checkpoint from ``checkpoint.path`` on local disk to ``checkpoint_dir_name`` on your ``storage_path``. This is equivalent to calling :func:`~ray.train.report` with :class:`~ray.train.CheckpointUploadMode` set to ``ray.train.CheckpointUploadMode.SYNC``. .. 
literalinclude:: ../doc_code/checkpoints.py :language: python :start-after: __checkpoint_upload_mode_sync_start__ :end-before: __checkpoint_upload_mode_sync_end__ .. _train-checkpoint-upload-mode-async: Asynchronous checkpoint uploading ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ You may want to upload your checkpoint asynchronously instead so that the next training step can start in parallel. If so, you should use ``ray.train.CheckpointUploadMode.ASYNC``, which kicks off a new thread to upload the checkpoint. This is helpful for larger checkpoints that might take longer to upload, but might add unnecessary complexity (see below) if you want to immediately upload only a small checkpoint. Each ``report`` blocks until the previous ``report``\'s checkpoint upload completes before starting a new checkpoint upload thread. Ray Train does this to avoid accumulating too many upload threads and potentially running out of memory. Because ``report`` returns without waiting for the checkpoint upload to complete, you must ensure that the local checkpoint directory stays alive until the checkpoint upload completes. This means you can't use a temporary directory that Ray Train may delete before the upload finishes, for example from ``tempfile.TemporaryDirectory``. ``report`` also exposes the ``delete_local_checkpoint_after_upload`` parameter, which defaults to ``True`` if ``checkpoint_upload_mode`` is ``ray.train.CheckpointUploadMode.ASYNC``. .. literalinclude:: ../doc_code/checkpoints.py :language: python :start-after: __checkpoint_upload_mode_async_start__ :end-before: __checkpoint_upload_mode_async_end__ .. figure:: ../images/sync_vs_async_checkpointing.png This figure illustrates the difference between synchronous and asynchronous checkpoint uploading. Custom checkpoint uploading ~~~~~~~~~~~~~~~~~~~~~~~~~~~ :func:`~ray.train.report` defaults to uploading from disk to the remote ``storage_path`` with the PyArrow filesystem copying utilities before reporting the checkpoint to Ray Train. If you would rather upload the checkpoint manually or with a third-party library such as `Torch Distributed Checkpointing `_, you have the following options: .. tab-set:: .. tab-item:: Synchronous If you want to upload the checkpoint synchronously, you can first upload the checkpoint to the ``storage_path``and then report a reference to the uploaded checkpoint with ``ray.train.CheckpointUploadMode.NO_UPLOAD``. .. literalinclude:: ../doc_code/checkpoints.py :language: python :start-after: __checkpoint_upload_mode_no_upload_start__ :end-before: __checkpoint_upload_mode_no_upload_end__ .. tab-item:: Asynchronous If you want to upload the checkpoint asynchronously, you can set ``checkpoint_upload_mode`` to ``ray.train.CheckpointUploadMode.ASYNC`` and pass a ``checkpoint_upload_fn`` to ``ray.train.report``. This function takes the ``Checkpoint`` and ``checkpoint_dir_name`` passed to ``ray.train.report`` and returns the persisted ``Checkpoint``. .. literalinclude:: ../doc_code/checkpoints.py :language: python :start-after: __checkpoint_upload_function_start__ :end-before: __checkpoint_upload_function_end__ .. warning:: In your ``checkpoint_upload_fn``, you should not call ``ray.train.report``, which may lead to unexpected behavior. You should also avoid collective operations, such as :func:`~ray.train.report` or ``model.state_dict()``, which can cause deadlocks. .. 
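As a rough illustration of this contract, a custom upload function might look like the following sketch. The S3 bucket and the use of PyArrow's copy utilities are assumptions for illustration; the example included above shows the documented pattern.

.. code-block:: python

    import pyarrow.fs
    import ray.train

    def my_checkpoint_upload_fn(
        checkpoint: ray.train.Checkpoint, checkpoint_dir_name: str
    ) -> ray.train.Checkpoint:
        # Hypothetical destination bucket; replace with your own storage location.
        fs = pyarrow.fs.S3FileSystem()
        destination = f"my-bucket/checkpoints/{checkpoint_dir_name}"

        # Upload the local checkpoint directory with PyArrow's copy utilities.
        pyarrow.fs.copy_files(checkpoint.path, destination, destination_filesystem=fs)

        # Return a Checkpoint that points at the persisted location.
        return ray.train.Checkpoint(destination, filesystem=fs)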
note:: Do not pass a ``checkpoint_upload_fn`` with ``checkpoint_upload_mode=ray.train.CheckpointUploadMode.NO_UPLOAD`` because Ray Train will simply ignore ``checkpoint_upload_fn``. You can pass a ``checkpoint_upload_fn`` with ``checkpoint_upload_mode=ray.train.CheckpointUploadMode.SYNC``, but this is equivalent to uploading the checkpoint yourself and reporting the checkpoint with ``ray.train.CheckpointUploadMode.NO_UPLOAD``. .. _train-dl-configure-checkpoints: Configure checkpointing ----------------------- Ray Train provides some configuration options for checkpointing via :class:`~ray.train.CheckpointConfig`. The primary configuration is keeping only the top ``K`` checkpoints with respect to a metric. Lower-performing checkpoints are deleted to save storage space. By default, all checkpoints are kept. .. literalinclude:: ../doc_code/key_concepts.py :language: python :start-after: __checkpoint_config_start__ :end-before: __checkpoint_config_end__ .. note:: If you want to save the top ``num_to_keep`` checkpoints with respect to a metric via :py:class:`~ray.train.CheckpointConfig`, please ensure that the metric is always reported together with the checkpoints. Using checkpoints during training ---------------------------------- During training, you may want to access checkpoints you've reported and their associated metrics from training workers for a variety of reasons, such as reporting the best checkpoint so far to an experiment tracker. You can do this by calling :func:`~ray.train.get_all_reported_checkpoints` from within your training function. This function returns a list of :class:`~ray.train.ReportedCheckpoint` objects that represent all the :class:`~ray.train.Checkpoint`\s and their associated metrics that you've reported so far and have been kept based on the :ref:`checkpoint configuration `. This function supports two consistency modes: - ``CheckpointConsistencyMode.COMMITTED``: Block until the checkpoint from the latest ``ray.train.report`` has been uploaded to persistent storage and committed. - ``CheckpointConsistencyMode.VALIDATED``: Block until the checkpoint from the latest ``ray.train.report`` has been uploaded to persistent storage, committed, and validated (see :ref:`train-validating-checkpoints`). This is the default consistency mode and has the same behavior as ``CheckpointConsistencyMode.COMMITTED`` if your report did not kick off validation. .. literalinclude:: ../doc_code/checkpoints.py :language: python :start-after: __get_all_reported_checkpoints_example_start__ :end-before: __get_all_reported_checkpoints_example_end__ Using checkpoints after training -------------------------------- The latest saved checkpoint can be accessed with :attr:`Result.checkpoint `. The full list of persisted checkpoints can be accessed with :attr:`Result.best_checkpoints `. If :class:`CheckpointConfig(num_to_keep) ` is set, this list will contain the best ``num_to_keep`` checkpoints. See :ref:`train-inspect-results` for a full guide on inspecting training results. :meth:`Checkpoint.as_directory ` and :meth:`Checkpoint.to_directory ` are the two main APIs to interact with Train checkpoints: .. literalinclude:: ../doc_code/checkpoints.py :language: python :start-after: __inspect_checkpoint_example_start__ :end-before: __inspect_checkpoint_example_end__ For Lightning and Transformers, if you are using the default `RayTrainReportCallback` for checkpoint saving in your training function, you can retrieve the original checkpoint files as below: .. tab-set:: .. tab-item:: PyTorch Lightning .. 
literalinclude:: ../doc_code/checkpoints.py :language: python :start-after: __inspect_lightning_checkpoint_example_start__ :end-before: __inspect_lightning_checkpoint_example_end__ .. tab-item:: Transformers .. literalinclude:: ../doc_code/checkpoints.py :language: python :start-after: __inspect_transformers_checkpoint_example_start__ :end-before: __inspect_transformers_checkpoint_example_end__ .. _train-dl-loading-checkpoints: Restore training state from a checkpoint ---------------------------------------- In order to enable fault tolerance, you should modify your training loop to restore training state from a :class:`~ray.train.Checkpoint`. The :class:`Checkpoint ` to restore from can be accessed in the training function with :func:`ray.train.get_checkpoint `. The checkpoint returned by :func:`ray.train.get_checkpoint ` is populated as the latest reported checkpoint during :ref:`automatic failure recovery `. See :ref:`train-fault-tolerance` for more details on restoration and fault tolerance. .. tab-set:: .. tab-item:: Native PyTorch .. literalinclude:: ../doc_code/checkpoints.py :language: python :start-after: __pytorch_restore_start__ :end-before: __pytorch_restore_end__ .. tab-item:: PyTorch Lightning .. literalinclude:: ../doc_code/checkpoints.py :language: python :start-after: __lightning_restore_example_start__ :end-before: __lightning_restore_example_end__ .. note:: In these examples, :meth:`Checkpoint.as_directory ` is used to view the checkpoint contents as a local directory. *If the checkpoint points to a local directory*, this method just returns the local directory path without making a copy. *If the checkpoint points to a remote directory*, this method will download the checkpoint to a local temporary directory and return the path to the temporary directory. **If multiple processes on the same node call this method simultaneously,** only a single process will perform the download, while the others wait for the download to finish. Once the download finishes, all processes receive the same local (temporary) directory to read from. Once all processes have finished working with the checkpoint, the temporary directory is cleaned up. --- .. _data-ingest-torch: Data Loading and Preprocessing ============================== Ray Train integrates with :ref:`Ray Data ` to offer a performant and scalable streaming solution for loading and preprocessing large datasets. Key advantages include: - Streaming data loading and preprocessing, scalable to petabyte-scale data. - Scaling out heavy data preprocessing to CPU nodes, to avoid bottlenecking GPU training. - Automatic and fast failure recovery. - Automatic on-the-fly data splitting across distributed training workers. For more details about Ray Data, check out the :ref:`Ray Data documentation`. .. note:: In addition to Ray Data, you can continue to use framework-native data utilities with Ray Train, such as PyTorch Dataset, Hugging Face Dataset, and Lightning DataModule. In this guide, we will cover how to incorporate Ray Data into your Ray Train script, and different ways to customize your data ingestion pipeline. .. TODO: Replace this image with a better one. .. figure:: ../images/train_ingest.png :align: center :width: 300px Quickstart ---------- Install Ray Data and Ray Train: .. code-block:: bash pip install -U "ray[data,train]" Data ingestion can be set up with four basic steps: 1. Create a Ray Dataset from your input data. 2. Apply preprocessing operations to your Ray Dataset. 3. 
Input the preprocessed Dataset into the Ray Train Trainer, which internally splits the dataset equally in a streaming way across the distributed training workers. 4. Consume the Ray Dataset in your training function. .. tab-set:: .. tab-item:: PyTorch .. code-block:: python :emphasize-lines: 14,21,29,33-35,53 import torch import ray from ray import train from ray.train import Checkpoint, ScalingConfig from ray.train.torch import TorchTrainer # Set this to True to use GPU. # If False, do CPU training instead of GPU training. use_gpu = False # Step 1: Create a Ray Dataset from in-memory Python lists. # You can also create a Ray Dataset from many other sources and file # formats. train_dataset = ray.data.from_items([{"x": [x], "y": [2 * x]} for x in range(200)]) # Step 2: Preprocess your Ray Dataset. def increment(batch): batch["y"] = batch["y"] + 1 return batch train_dataset = train_dataset.map_batches(increment) def train_func(): batch_size = 16 # Step 4: Access the dataset shard for the training worker via # ``get_dataset_shard``. train_data_shard = train.get_dataset_shard("train") # `iter_torch_batches` returns an iterable object that # yield tensor batches. Ray Data automatically moves the Tensor batches # to GPU if you enable GPU training. train_dataloader = train_data_shard.iter_torch_batches( batch_size=batch_size, dtypes=torch.float32 ) for epoch_idx in range(1): for batch in train_dataloader: inputs, labels = batch["x"], batch["y"] assert type(inputs) == torch.Tensor assert type(labels) == torch.Tensor assert inputs.shape[0] == batch_size assert labels.shape[0] == batch_size # Only check one batch for demo purposes. # Replace the above with your actual model training code. break # Step 3: Create a TorchTrainer. Specify the number of training workers and # pass in your Ray Dataset. # The Ray Dataset is automatically split across all training workers. trainer = TorchTrainer( train_func, datasets={"train": train_dataset}, scaling_config=ScalingConfig(num_workers=2, use_gpu=use_gpu) ) result = trainer.fit() .. tab-item:: PyTorch Lightning .. code-block:: python :emphasize-lines: 4-5,10-11,14-15,26-27,33 from ray import train # Create the train and validation datasets. train_data = ray.data.read_csv("./train.csv") val_data = ray.data.read_csv("./validation.csv") def train_func_per_worker(): # Access Ray datsets in your train_func via ``get_dataset_shard``. # Ray Data shards all datasets across workers by default. train_ds = train.get_dataset_shard("train") val_ds = train.get_dataset_shard("validation") # Create Ray dataset iterables via ``iter_torch_batches``. train_dataloader = train_ds.iter_torch_batches(batch_size=16) val_dataloader = val_ds.iter_torch_batches(batch_size=16) ... trainer = pl.Trainer( # ... ) # Feed the Ray dataset iterables to ``pl.Trainer.fit``. trainer.fit( model, train_dataloaders=train_dataloader, val_dataloaders=val_dataloader ) trainer = TorchTrainer( train_func, # You can pass in multiple datasets to the Trainer. datasets={"train": train_data, "validation": val_data}, scaling_config=ScalingConfig(num_workers=4), ) trainer.fit() .. tab-item:: HuggingFace Transformers .. code-block:: python :emphasize-lines: 7-9,14-15,18-19,25,31-32,42 import ray import ray.train from huggingface_hub import HfFileSystem ... # Create the train and evaluation datasets using HfFileSystem. 
fs = HfFileSystem() train_data = ray.data.read_parquet("hf://datasets/your-dataset/train/", filesystem=fs) eval_data = ray.data.read_parquet("hf://datasets/your-dataset/validation/", filesystem=fs) def train_func(): # Access Ray datsets in your train_func via ``get_dataset_shard``. # Ray Data shards all datasets across workers by default. train_ds = ray.train.get_dataset_shard("train") eval_ds = ray.train.get_dataset_shard("evaluation") # Create Ray dataset iterables via ``iter_torch_batches``. train_iterable_ds = train_ds.iter_torch_batches(batch_size=16) eval_iterable_ds = eval_ds.iter_torch_batches(batch_size=16) ... args = transformers.TrainingArguments( ..., max_steps=max_steps # Required for iterable datasets ) trainer = transformers.Trainer( ..., model=model, train_dataset=train_iterable_ds, eval_dataset=eval_iterable_ds, ) # Prepare your Transformers Trainer trainer = ray.train.huggingface.transformers.prepare_trainer(trainer) trainer.train() trainer = TorchTrainer( train_func, # You can pass in multiple datasets to the Trainer. datasets={"train": train_data, "evaluation": val_data}, scaling_config=ScalingConfig(num_workers=4, use_gpu=True), ) trainer.fit() .. _train-datasets-load: Loading data ~~~~~~~~~~~~ Ray Datasets can be created from many different data sources and formats. For more details, see :ref:`Loading Data `. .. _train-datasets-preprocess: Preprocessing data ~~~~~~~~~~~~~~~~~~ Ray Data supports a wide range of preprocessing operations that you can use to transform data prior to training. - For general preprocessing, see :ref:`Transforming Data `. - For tabular data, see :ref:`Preprocessing Structured Data `. - For PyTorch tensors, see :ref:`Transformations with torch tensors `. - For optimizing expensive preprocessing operations, see :ref:`Caching the preprocessed dataset `. .. _train-datasets-input: Inputting and splitting data ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Your preprocessed datasets can be passed into a Ray Train Trainer (e.g. :class:`~ray.train.torch.TorchTrainer`) through the ``datasets`` argument. The datasets passed into the Trainer's ``datasets`` can be accessed inside of the ``train_loop_per_worker`` run on each distributed training worker by calling :meth:`ray.train.get_dataset_shard`. Ray Data splits all datasets across the training workers by default. :meth:`~ray.train.get_dataset_shard` returns ``1/n`` of the dataset, where ``n`` is the number of training workers. Ray Data does data splitting in a streaming fashion on the fly. .. note:: Be aware that because Ray Data splits the evaluation dataset, you have to aggregate the evaluation results across workers. You might consider using `TorchMetrics `_ (:doc:`example <../examples/deepspeed/deepspeed_example>`) or utilities available in other frameworks that you can explore. This behavior can be overwritten by passing in the ``dataset_config`` argument. For more information on configuring splitting logic, see :ref:`Splitting datasets `. .. _train-datasets-consume: Consuming data ~~~~~~~~~~~~~~ Inside the ``train_loop_per_worker``, each worker can access its shard of the dataset via :meth:`ray.train.get_dataset_shard`. This data can be consumed in a variety of ways: - To create a generic Iterable of batches, you can call :meth:`~ray.data.DataIterator.iter_batches`. - To create a replacement for a PyTorch DataLoader, you can call :meth:`~ray.data.DataIterator.iter_torch_batches`. For more details on how to iterate over your data, see :ref:`Iterating over data `. .. 
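Putting the loading, splitting, and consumption steps together, the body of a training function typically follows a pattern like the sketch below. The dataset name ``"train"``, the column names, and the batch size are placeholders.

.. code-block:: python

    import ray.train

    def train_loop_per_worker():
        # Get this worker's shard of the "train" dataset passed to the Trainer.
        train_shard = ray.train.get_dataset_shard("train")

        for epoch in range(2):
            # Iterate over the shard as PyTorch tensor batches.
            for batch in train_shard.iter_torch_batches(batch_size=32):
                features, labels = batch["x"], batch["y"]
                # Run your forward and backward pass here.
                ...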
_train-datasets-pytorch: Starting with PyTorch data -------------------------- Some frameworks provide their own dataset and data loading utilities. For example: - **PyTorch:** `Dataset & DataLoader `_ - **Hugging Face:** `Dataset `_ - **PyTorch Lightning:** `LightningDataModule `_ You can still use these framework data utilities directly with Ray Train. At a high level, you can compare these concepts as follows: .. list-table:: :header-rows: 1 * - PyTorch API - HuggingFace API - Ray Data API * - `torch.utils.data.Dataset `_ - `datasets.Dataset `_ - :class:`ray.data.Dataset` * - `torch.utils.data.DataLoader `_ - n/a - :meth:`ray.data.Dataset.iter_torch_batches` For more details, see the following sections for each framework: .. tab-set:: .. tab-item:: PyTorch DataLoader **Option 1 (with Ray Data):** 1. Convert your PyTorch Dataset to a Ray Dataset. 2. Pass the Ray Dataset into the TorchTrainer via ``datasets`` argument. 3. Inside your ``train_loop_per_worker``, you can access the dataset via :meth:`ray.train.get_dataset_shard`. 4. Create a dataset iterable via :meth:`ray.data.DataIterator.iter_torch_batches`. For more details, see the :ref:`Migrating from PyTorch Datasets and DataLoaders `. **Option 2 (without Ray Data):** 1. Instantiate the Torch Dataset and DataLoader directly in the ``train_loop_per_worker``. 2. Use the :meth:`ray.train.torch.prepare_data_loader` utility to set up the DataLoader for distributed training. .. tab-item:: LightningDataModule The ``LightningDataModule`` is created with PyTorch ``Dataset``\s and ``DataLoader``\s. You can apply the same logic here. .. tab-item:: Hugging Face Dataset **Option 1 (with Ray Data):** 1. Convert your Hugging Face Dataset to a Ray Dataset. For instructions, see :ref:`Ray Data for Hugging Face `. 2. Pass the Ray Dataset into the TorchTrainer via the ``datasets`` argument. 3. Inside your ``train_loop_per_worker``, access the sharded dataset via :meth:`ray.train.get_dataset_shard`. 4. Create a iterable dataset via :meth:`ray.data.DataIterator.iter_torch_batches`. 5. Pass the iterable dataset while initializing ``transformers.Trainer``. 6. Wrap your transformers trainer with the :meth:`ray.train.huggingface.transformers.prepare_trainer` utility. **Option 2 (without Ray Data):** 1. Instantiate the Hugging Face Dataset directly in the ``train_loop_per_worker``. 2. Pass the Hugging Face Dataset into ``transformers.Trainer`` during initialization. .. tip:: When using Torch or Hugging Face Datasets directly without Ray Data, make sure to instantiate your Dataset *inside* the ``train_loop_per_worker``. Instantiating the Dataset outside of the ``train_loop_per_worker`` and passing it in via global scope can cause errors due to the Dataset not being serializable. .. note:: When using PyTorch DataLoader with more than 1 worker, you should set the process start method to be `forkserver` or `spawn`. :ref:`Forking Ray Actors and Tasks is an anti-pattern ` that can lead to unexpected issues such as deadlocks. .. code-block:: python data_loader = DataLoader( dataset, num_workers=2, multiprocessing_context=multiprocessing.get_context("forkserver"), ... ) .. _train-datasets-split: Splitting datasets ------------------ By default, Ray Train splits all datasets across workers using :meth:`Dataset.streaming_split `. Each worker sees a disjoint subset of the data, instead of iterating over the entire dataset. If want to customize which datasets are split, pass in a :class:`DataConfig ` to the Trainer constructor. 
For example, to split only the training dataset, do the following: .. testcode:: import ray from ray import train from ray.train import ScalingConfig from ray.train.torch import TorchTrainer ds = ray.data.read_text( "s3://anonymous@ray-example-data/sms_spam_collection_subset.txt" ) train_ds, val_ds = ds.train_test_split(0.3) def train_loop_per_worker(): # Get the sharded training dataset train_ds = train.get_dataset_shard("train") for _ in range(2): for batch in train_ds.iter_batches(batch_size=128): print("Do some training on batch", batch) # Get the unsharded full validation dataset val_ds = train.get_dataset_shard("val") for _ in range(2): for batch in val_ds.iter_batches(batch_size=128): print("Do some evaluation on batch", batch) my_trainer = TorchTrainer( train_loop_per_worker, scaling_config=ScalingConfig(num_workers=2), datasets={"train": train_ds, "val": val_ds}, dataset_config=ray.train.DataConfig( datasets_to_split=["train"], ), ) my_trainer.fit() Full customization (advanced) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ For use cases not covered by the default config class, you can also fully customize exactly how your input datasets are split. Define a custom :class:`DataConfig ` class (DeveloperAPI). The :class:`DataConfig ` class is responsible for that shared setup and splitting of data across nodes. .. testcode:: # Note that this example class is doing the same thing as the basic DataConfig # implementation included with Ray Train. from typing import Optional, Dict, List import ray from ray import train from ray.train.torch import TorchTrainer from ray.train import DataConfig, ScalingConfig from ray.data import Dataset, DataIterator, NodeIdStr from ray.actor import ActorHandle ds = ray.data.read_text( "s3://anonymous@ray-example-data/sms_spam_collection_subset.txt" ) def train_loop_per_worker(): # Get an iterator to the dataset we passed in below. it = train.get_dataset_shard("train") for _ in range(2): for batch in it.iter_batches(batch_size=128): print("Do some training on batch", batch) class MyCustomDataConfig(DataConfig): def configure( self, datasets: Dict[str, Dataset], world_size: int, worker_handles: Optional[List[ActorHandle]], worker_node_ids: Optional[List[NodeIdStr]], **kwargs, ) -> List[Dict[str, DataIterator]]: assert len(datasets) == 1, "This example only handles the simple case" # Configure Ray Data for ingest. ctx = ray.data.DataContext.get_current() ctx.execution_options = DataConfig.default_ingest_options() # Split the stream into shards. iterator_shards = datasets["train"].streaming_split( world_size, equal=True, locality_hints=worker_node_ids ) # Return the assigned iterators for each worker. return [{"train": it} for it in iterator_shards] my_trainer = TorchTrainer( train_loop_per_worker, scaling_config=ScalingConfig(num_workers=2), datasets={"train": ds}, dataset_config=MyCustomDataConfig(), ) my_trainer.fit() The subclass must be serializable, since Ray Train copies it from the driver script to the driving actor of the Trainer. Ray Train calls its :meth:`configure ` method on the main actor of the Trainer group to create the data iterators for each worker. In general, you can use :class:`DataConfig ` for any shared setup that has to occur ahead of time before the workers start iterating over data. The setup runs at the start of each Trainer run. Random shuffling ---------------- Randomly shuffling data for each epoch can be important for model quality depending on what model you are training. 
Ray Data provides multiple options for random shuffling, see :ref:`Shuffling Data ` for more details. Enabling reproducibility ------------------------ When developing or hyperparameter tuning models, reproducibility is important during data ingest so that data ingest does not affect model quality. Follow these three steps to enable reproducibility: **Step 1:** Enable deterministic execution in Ray Datasets by setting the `preserve_order` flag in the :class:`DataContext `. .. testcode:: import ray # Preserve ordering in Ray Datasets for reproducibility. ctx = ray.data.DataContext.get_current() ctx.execution_options.preserve_order = True ds = ray.data.read_text( "s3://anonymous@ray-example-data/sms_spam_collection_subset.txt" ) **Step 2:** Set a seed for any shuffling operations: * `seed` argument to :meth:`random_shuffle ` * `seed` argument to :meth:`randomize_block_order ` * `local_shuffle_seed` argument to :meth:`iter_batches ` **Step 3:** Follow the best practices for enabling reproducibility for your training framework of choice. For example, see the `Pytorch reproducibility guide `_. .. _preprocessing_structured_data: Preprocessing structured data ----------------------------- .. note:: This section is for tabular/structured data. The recommended way for preprocessing unstructured data is to use Ray Data operations such as `map_batches`. See the :ref:`Ray Data Working with Pytorch guide ` for more details. For tabular data, use Ray Data :ref:`preprocessors `, which implement common data preprocessing operations. You can use this with Ray Train Trainers by applying them on the dataset before passing the dataset into a Trainer. For example: .. testcode:: import base64 import numpy as np from tempfile import TemporaryDirectory import ray from ray import train from ray.train import Checkpoint, ScalingConfig from ray.train.torch import TorchTrainer from ray.data.preprocessors import Concatenator, StandardScaler dataset = ray.data.read_csv("s3://anonymous@air-example-data/breast_cancer.csv") # Create preprocessors to scale some columns and concatenate the results. scaler = StandardScaler(columns=["mean radius", "mean texture"]) columns_to_concatenate = dataset.columns() columns_to_concatenate.remove("target") concatenator = Concatenator(columns=columns_to_concatenate, dtype=np.float32) # Compute dataset statistics and get transformed datasets. Note that the # fit call is executed immediately, but the transformation is lazy. dataset = scaler.fit_transform(dataset) dataset = concatenator.fit_transform(dataset) def train_loop_per_worker(): context = train.get_context() print(context.get_metadata()) # prints {"preprocessor_pkl": ...} # Get an iterator to the dataset we passed in below. it = train.get_dataset_shard("train") for _ in range(2): # Prefetch 10 batches at a time. for batch in it.iter_batches(batch_size=128, prefetch_batches=10): print("Do some training on batch", batch) # Save a checkpoint. with TemporaryDirectory() as temp_dir: train.report( {"score": 2.0}, checkpoint=Checkpoint.from_directory(temp_dir), ) # Serialize the preprocessor. Since serialize() returns bytes, # convert to base64 string for JSON compatibility. serialized_preprocessor = base64.b64encode(scaler.serialize()).decode("ascii") my_trainer = TorchTrainer( train_loop_per_worker, scaling_config=ScalingConfig(num_workers=2), datasets={"train": dataset}, metadata={"preprocessor_pkl": serialized_preprocessor}, ) # Get the fitted preprocessor back from the result metadata. 
metadata = my_trainer.fit().checkpoint.get_metadata() # Decode from base64 before deserializing serialized_data = base64.b64decode(metadata["preprocessor_pkl"]) print(StandardScaler.deserialize(serialized_data)) This example persists the fitted preprocessor using the ``Trainer(metadata={...})`` constructor argument. This argument specifies a dict that is available from ``TrainContext.get_metadata()`` and ``checkpoint.get_metadata()`` for checkpoints that the Trainer saves. This design enables the recreation of the fitted preprocessor for inference. Performance tips ---------------- Prefetching batches ~~~~~~~~~~~~~~~~~~~ While iterating over a dataset for training, you can increase ``prefetch_batches`` in :meth:`iter_batches ` or :meth:`iter_torch_batches ` to further increase performance. While training on the current batch, this approach launches background threads to fetch and process the next ``N`` batches. This approach can help if training is bottlenecked on cross-node data transfer or on last-mile preprocessing such as converting batches to tensors or executing ``collate_fn``. However, increasing ``prefetch_batches`` leads to more data that needs to be held in heap memory. By default, ``prefetch_batches`` is set to 1. For example, the following code prefetches 10 batches at a time for each training worker: .. testcode:: import ray from ray import train from ray.train import ScalingConfig from ray.train.torch import TorchTrainer ds = ray.data.read_text( "s3://anonymous@ray-example-data/sms_spam_collection_subset.txt" ) def train_loop_per_worker(): # Get an iterator to the dataset we passed in below. it = train.get_dataset_shard("train") for _ in range(2): # Prefetch 10 batches at a time. for batch in it.iter_batches(batch_size=128, prefetch_batches=10): print("Do some training on batch", batch) my_trainer = TorchTrainer( train_loop_per_worker, scaling_config=ScalingConfig(num_workers=2), datasets={"train": ds}, ) my_trainer.fit() Avoid heavy transformation in collate_fn ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The ``collate_fn`` parameter in :meth:`iter_batches ` or :meth:`iter_torch_batches ` allows you to transform data before feeding it to the model. This operation happens locally in the training workers. Avoid adding a heavy transformation in this function as it may become the bottleneck. Instead, :ref:`apply the transformation with map or map_batches ` before passing the dataset to the Trainer. When your expensive transformation requires batch_size as input, such as text tokenization, you can :ref:`scale it out to Ray Data ` for better performance. .. _dataset_cache_performance: Caching the preprocessed dataset ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ If your preprocessed Dataset is small enough to fit in Ray object store memory (by default this is 30% of total cluster RAM), *materialize* the preprocessed dataset in Ray's built-in object store, by calling :meth:`materialize() ` on the preprocessed dataset. This method tells Ray Data to compute the entire preprocessed dataset and pin it in the Ray object store memory. As a result, when iterating over the dataset repeatedly, the preprocessing operations do not need to be re-run. However, if the preprocessed data is too large to fit into Ray object store memory, this approach greatly decreases performance as data needs to be spilled to and read back from disk. Transformations that you want to run per-epoch, such as randomization, should go after the materialize call. ..
testcode:: from typing import Dict import numpy as np import ray # Load the data. train_ds = ray.data.read_parquet("s3://anonymous@ray-example-data/iris.parquet") # Define a preprocessing function. def normalize_length(batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]: new_col = batch["sepal.length"] / np.max(batch["sepal.length"]) batch["normalized.sepal.length"] = new_col del batch["sepal.length"] return batch # Preprocess the data. Transformations that are made before the materialize call # below are only run once. train_ds = train_ds.map_batches(normalize_length) # Materialize the dataset in object store memory. # Only do this if train_ds is small enough to fit in object store memory. train_ds = train_ds.materialize() # Dummy augmentation transform. def augment_data(batch): return batch # Add per-epoch preprocessing. Transformations that you want to run per-epoch, such # as data augmentation or randomization, should go after the materialize call. train_ds = train_ds.map_batches(augment_data) # Pass train_ds to the Trainer Adding CPU-only nodes to your cluster ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ If the GPU training is bottlenecked on expensive CPU preprocessing and the preprocessed Dataset is too large to fit in object store memory, then materializing the dataset doesn't work. In this case, Ray's native support for heterogeneous resources enables you to simply add more CPU-only nodes to your cluster, and Ray Data automatically scales out CPU-only preprocessing tasks to CPU-only nodes, making GPUs more saturated. In general, adding CPU-only nodes can help in two ways: * Adding more CPU cores helps further parallelize preprocessing. This approach is helpful when CPU compute time is the bottleneck. * Increasing object store memory, which 1) allows Ray Data to buffer more data in between preprocessing and training stages, and 2) provides more memory to make it possible to :ref:`cache the preprocessed dataset `. This approach is helpful when memory is the bottleneck. --- .. _train-experiment-tracking-native: =================== Experiment Tracking =================== Most experiment tracking libraries work out-of-the-box with Ray Train. This guide provides instructions on how to set up the code so that your favorite experiment tracking libraries can work for distributed training with Ray Train. The end of the guide has common errors to aid in debugging the setup. The following pseudo code demonstrates how to use the native experiment tracking library calls inside of Ray Train: .. testcode:: :skipif: True from ray.train.torch import TorchTrainer from ray.train import ScalingConfig def train_func(): # Training code and native experiment tracking library calls go here. scaling_config = ScalingConfig(num_workers=2, use_gpu=True) trainer = TorchTrainer(train_func, scaling_config=scaling_config) result = trainer.fit() Ray Train lets you use native experiment tracking libraries by customizing the tracking logic inside the :ref:`train_func` function. In this way, you can port your experiment tracking logic to Ray Train with minimal changes. Getting Started =============== Let's start by looking at some code snippets. The following examples uses Weights & Biases (W&B) and MLflow but it's adaptable to other frameworks. .. tab-set:: .. tab-item:: W&B .. testcode:: :skipif: True import ray from ray import train import wandb # Step 1 # This ensures that all ray worker processes have `WANDB_API_KEY` set. 
ray.init(runtime_env={"env_vars": {"WANDB_API_KEY": "your_api_key"}}) def train_func(): # Step 1 and 2 if train.get_context().get_world_rank() == 0: wandb.init( name=..., project=..., # ... ) # ... loss = optimize() metrics = {"loss": loss} # Step 3 if train.get_context().get_world_rank() == 0: # Only report the results from the rank 0 worker to W&B to avoid duplication. wandb.log(metrics) # ... # Step 4 # Make sure that all loggings are uploaded to the W&B backend. if train.get_context().get_world_rank() == 0: wandb.finish() .. tab-item:: MLflow .. testcode:: :skipif: True from ray import train import mlflow # Run the following on the head node: # $ databricks configure --token # mv ~/.databrickscfg YOUR_SHARED_STORAGE_PATH # This function assumes `databricks_config_file` is specified in the Trainer's `train_loop_config`. def train_func(config): # Step 1 and 2 os.environ["DATABRICKS_CONFIG_FILE"] = config["databricks_config_file"] mlflow.set_tracking_uri("databricks") mlflow.set_experiment_id(...) mlflow.start_run() # ... loss = optimize() metrics = {"loss": loss} # Step 3 if train.get_context().get_world_rank() == 0: # Only report the results from the rank 0 worker to MLflow to avoid duplication. mlflow.log_metrics(metrics) .. tip:: A major difference between distributed and non-distributed training is that in distributed training, multiple processes are running in parallel and under certain setups they have the same results. If all of them report results to the tracking backend, you may get duplicated results. To address that, Ray Train lets you apply logging logic to only the rank 0 worker with the following method: :meth:`ray.train.get_context().get_world_rank() `. .. testcode:: :skipif: True from ray import train def train_func(): ... if train.get_context().get_world_rank() == 0: # Add your logging logic only for rank0 worker. ... The interaction with the experiment tracking backend within the :ref:`train_func` has 4 logical steps: #. Set up the connection to a tracking backend #. Configure and launch a run #. Log metrics #. Finish the run More details about each step follows. Step 1: Connect to your tracking backend ---------------------------------------- First, decide which tracking backend to use: W&B, MLflow, TensorBoard, Comet, etc. If applicable, make sure that you properly set up credentials on each training worker. .. tab-set:: .. tab-item:: W&B W&B offers both *online* and *offline* modes. **Online** For *online* mode, because you log to W&B's tracking service, ensure that you set the credentials inside of :ref:`train_func`. See :ref:`Set up credentials` for more information. .. testcode:: :skipif: True # This is equivalent to `os.environ["WANDB_API_KEY"] = "your_api_key"` wandb.login(key="your_api_key") **Offline** For *offline* mode, because you log towards a local file system, point the offline directory to a shared storage path that all nodes can write to. See :ref:`Set up a shared file system` for more information. .. testcode:: :skipif: True os.environ["WANDB_MODE"] = "offline" wandb.init(dir="some_shared_storage_path/wandb") .. tab-item:: MLflow MLflow offers both *local* and *remote* (for example, to Databrick's MLflow service) modes. **Local** For *local* mode, because you log to a local file system, point offline directory to a shared storage path. that all nodes can write to. See :ref:`Set up a shared file system` for more information. .. 
testcode:: :skipif: True mlflow.set_tracking_uri(uri="file://some_shared_storage_path/mlruns") mlflow.start_run() **Remote, hosted by Databricks** Ensure that all nodes have access to the Databricks config file. See :ref:`Set up credentials` for more information. .. testcode:: :skipif: True # The MLflow client looks for a Databricks config file # at the location specified by `os.environ["DATABRICKS_CONFIG_FILE"]`. os.environ["DATABRICKS_CONFIG_FILE"] = config["databricks_config_file"] mlflow.set_tracking_uri("databricks") mlflow.start_run() .. _set-up-credentials: Set up credentials ~~~~~~~~~~~~~~~~~~ Refer to each tracking library's API documentation on setting up credentials. This step usually involves setting an environment variable or accessing a config file. The easiest way to pass an environment variable credential to training workers is through :ref:`runtime environments `, where you initialize with the following code: .. testcode:: :skipif: True import ray # This makes sure that training workers have the same env var set ray.init(runtime_env={"env_vars": {"SOME_API_KEY": "your_api_key"}}) For accessing the config file, ensure that the config file is accessible to all nodes. One way to do this is by setting up a shared storage. Another way is to save a copy in each node. .. _set-up-shared-file-system: Set up a shared file system ~~~~~~~~~~~~~~~~~~~~~~~~~~~ Set up a network filesystem accessible to all nodes in the cluster. For example, AWS EFS or Google Cloud Filestore. Step 2: Configure and start the run ----------------------------------- This step usually involves picking an identifier for the run and associating it with a project. Refer to the tracking libraries' documentation for semantics. .. To conveniently link back to Ray Train run, you may want to log the persistent storage path .. of the run as a config. .. .. testcode:: def train_func(): if ray.train.get_context().get_world_rank() == 0: wandb.init(..., config={"ray_train_persistent_storage_path": "TODO: fill in when API stabilizes"}) .. tip:: When performing **fault-tolerant training** with auto-restoration, use a consistent ID to configure all tracking runs that logically belong to the same training run. Step 3: Log metrics ------------------- You can customize how to log parameters, metrics, models, or media contents, within :ref:`train_func`, just as in a non-distributed training script. You can also use native integrations that a particular tracking framework has with specific training frameworks. For example, ``mlflow.pytorch.autolog()``, ``lightning.pytorch.loggers.MLFlowLogger``, etc. Step 4: Finish the run ---------------------- This step ensures that all logs are synced to the tracking service. Depending on the implementation of various tracking libraries, sometimes logs are first cached locally and only synced to the tracking service in an asynchronous fashion. Finishing the run makes sure that all logs are synced by the time training workers exit. .. tab-set:: .. tab-item:: W&B .. testcode:: :skipif: True # https://docs.wandb.ai/ref/python/finish wandb.finish() .. tab-item:: MLflow .. testcode:: :skipif: True # https://mlflow.org/docs/1.2.0/python_api/mlflow.html mlflow.end_run() .. tab-item:: Comet .. testcode:: :skipif: True # https://www.comet.com/docs/v2/api-and-sdk/python-sdk/reference/Experiment/#experimentend Experiment.end() Examples ======== The following are runnable examples for PyTorch and PyTorch Lightning. PyTorch ------- .. dropdown:: Log to W&B .. 
literalinclude:: ../../../../python/ray/train/examples/experiment_tracking/torch_exp_tracking_wandb.py :emphasize-lines: 16, 19-21, 59-60, 62-63 :language: python :start-after: __start__ .. dropdown:: Log to file-based MLflow .. literalinclude:: ../../../../python/ray/train/examples/experiment_tracking/torch_exp_tracking_mlflow.py :emphasize-lines: 22-25, 58-59, 61-62, 68 :language: python :start-after: __start__ :end-before: __end__ PyTorch Lightning ----------------- You can use the native Logger integration in PyTorch Lightning with W&B, CometML, MLflow, and TensorBoard, while using Ray Train's TorchTrainer. The following examples walk you through the process. The code here is runnable. .. dropdown:: W&B .. literalinclude:: ../../../../python/ray/train/examples/experiment_tracking/lightning_exp_tracking_model_dl.py :language: python :start-after: __model_dl_start__ .. literalinclude:: ../../../../python/ray/train/examples/experiment_tracking/lightning_exp_tracking_wandb.py :language: python :start-after: __lightning_experiment_tracking_wandb_start__ .. dropdown:: MLflow .. literalinclude:: ../../../../python/ray/train/examples/experiment_tracking/lightning_exp_tracking_model_dl.py :language: python :start-after: __model_dl_start__ .. literalinclude:: ../../../../python/ray/train/examples/experiment_tracking/lightning_exp_tracking_mlflow.py :language: python :start-after: __lightning_experiment_tracking_mlflow_start__ :end-before: __lightning_experiment_tracking_mlflow_end__ .. dropdown:: Comet .. literalinclude:: ../../../../python/ray/train/examples/experiment_tracking/lightning_exp_tracking_model_dl.py :language: python :start-after: __model_dl_start__ .. literalinclude:: ../../../../python/ray/train/examples/experiment_tracking/lightning_exp_tracking_comet.py :language: python :start-after: __lightning_experiment_tracking_comet_start__ .. dropdown:: TensorBoard .. literalinclude:: ../../../../python/ray/train/examples/experiment_tracking/lightning_exp_tracking_model_dl.py :language: python :start-after: __model_dl_start__ .. literalinclude:: ../../../../python/ray/train/examples/experiment_tracking/lightning_exp_tracking_tensorboard.py :language: python :start-after: __lightning_experiment_tracking_tensorboard_start__ :end-before: __lightning_experiment_tracking_tensorboard_end__ Common Errors ============= Missing Credentials ------------------- **I have already called the `wandb login` CLI, but am still getting** .. code-block:: none wandb: ERROR api_key not configured (no-tty). call wandb.login(key=[your_api_key]). This is probably because the W&B credentials aren't set up correctly on the worker nodes. Make sure that you run ``wandb.login`` or pass ``WANDB_API_KEY`` to each training function. See :ref:`Set up credentials ` for more details. Missing Configurations ---------------------- **I have already run `databricks configure`, but am still getting** .. code-block:: none databricks_cli.utils.InvalidConfigurationError: You haven't configured the CLI yet! This is usually caused by running ``databricks configure``, which generates ``~/.databrickscfg`` only on the head node. Move this file to a shared location or copy it to each node. See :ref:`Set up credentials ` for more details. --- .. _train-fault-tolerance: Handling Failures and Node Preemption ===================================== .. important:: This user guide shows how to configure fault tolerance for the revamped Ray Train V2 available starting from Ray 2.43 by enabling the environment variable ``RAY_TRAIN_V2_ENABLED=1``.
**This user guide assumes that the environment variable has been enabled.** Please see :ref:`here ` for information about the deprecation and migration. Ray Train provides fault tolerance at three levels: 1. **Worker process fault tolerance** handles errors that happen to one or more Train worker processes while they are executing the user defined training function. 2. **Worker node fault tolerance** handles node failures that may occur during training. 3. **Job driver fault tolerance** handles the case where Ray Train driver process crashes, and training needs to be kicked off again, possibly from a new cluster. This user guide covers how to configure and use these fault tolerance mechanisms. .. _train-worker-fault-tolerance: Worker Process and Node Fault Tolerance --------------------------------------- **Worker process failures** are errors that occur within the user defined training function of a training worker, such as GPU out-of-memory (OOM) errors, cloud storage access errors, or other runtime errors. **Node failures** are errors that bring down the entire node, including node preemption, OOM, network partitions, or other hardware failures. This section covers worker node failures. Recovery from head node failures is discussed in the :ref:`next section `. Ray Train can be configured to automatically recover from worker process and worker node failures. When a failure is detected, all the workers are shut down, new nodes are added if necessary, and a new set of workers is started. The restarted training worker processes can resume training by loading the latest checkpoint. In order to retain progress upon recovery, your training function should implement logic for both :ref:`saving ` *and* :ref:`loading checkpoints `. Otherwise, the training will just start from scratch. Each recovery from a worker process or node failure is considered a retry. The number of retries is configurable through the ``max_failures`` attribute of the :class:`~ray.train.FailureConfig` argument set in the :class:`~ray.train.RunConfig` passed to the ``Trainer``. By default, worker fault tolerance is disabled with ``max_failures=0``. .. literalinclude:: ../doc_code/fault_tolerance.py :language: python :start-after: __failure_config_start__ :end-before: __failure_config_end__ Altogether, this is what an example Torch training script with worker fault tolerance looks like: .. literalinclude:: ../doc_code/fault_tolerance.py :language: python :start-after: __worker_fault_tolerance_start__ :end-before: __worker_fault_tolerance_end__ Which checkpoint will be restored? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Ray Train will populate :func:`ray.train.get_checkpoint() ` with the latest available :ref:`checkpoint reported to Ray Train `. The :class:`~ray.train.Checkpoint` object returned by this method has the :meth:`~ray.train.Checkpoint.as_directory` and :meth:`~ray.train.Checkpoint.to_directory` methods to download the checkpoint from the :class:`RunConfig(storage_path) ` to local disk. .. note:: :meth:`~ray.train.Checkpoint.as_directory` and :meth:`~ray.train.Checkpoint.to_directory` will only download the checkpoint once per node even if there are multiple workers on the node. The workers share the same checkpoint directory on local disk. Illustrated Example ~~~~~~~~~~~~~~~~~~~ Consider the following example of a cluster containing a CPU head node and 2 GPU worker nodes. There are 4 GPU training workers running on the 2 worker nodes. 
The :ref:`storage path has been configured ` to use cloud storage, which is where checkpoints are saved. .. figure:: ../images/fault_tolerance/worker_failure_start.png :align: left Training has been running for some time, and the latest checkpoint has been saved to cloud storage. .. figure:: ../images/fault_tolerance/worker_node_failure.png :align: left One of the worker GPU nodes fails due to a hardware fault. Ray Train detects this failure and shuts down all the workers. Since the number of failures detected so far is less than the configured ``max_failures``, Ray Train will attempt to restart training, rather than exiting and raising an error. .. figure:: ../images/fault_tolerance/worker_node_replacement.png :align: left Ray Train has requested a new worker node to join the cluster and is waiting for it to come up. .. figure:: ../images/fault_tolerance/worker_group_recovery.png :align: left The new worker node has joined the cluster. Ray Train restarts all the worker processes and provides them with the latest checkpoint. The workers download the checkpoint from storage and use it to resume training. .. _train-restore-guide: .. _train-job-driver-fault-tolerance: Job Driver Fault Tolerance -------------------------- Job driver fault tolerance handles cases where the Ray Train driver process is interrupted. The Ray Train driver process is the process that calls ``trainer.fit()`` and is usually located on the head node of the cluster. The driver process may be interrupted for one of the following reasons: - The run is manually interrupted by a user (e.g., Ctrl+C). - The node where the driver process is running (head node) crashes (e.g., out of memory, out of disk). - The entire cluster goes down (e.g., network error affecting all nodes). In these cases, the Ray Train driver (which calls ``trainer.fit()``) needs to be launched again. The relaunched Ray Train driver needs to find a minimal amount of run state in order to pick up where the previous run left off. This state includes the latest reported checkpoints, which are located at the :ref:`storage path `. Ray Train fetches the latest checkpoint information from storage and passes it to the newly launched worker processes to resume training. To find this run state, Ray Train relies on passing in the **same** :class:`RunConfig(storage_path, name) ` pair as the previous run. If the ``storage_path`` or ``name`` do not match, Ray Train will not be able to find the previous run state and will start a new run from scratch. .. warning:: If ``name`` is reused unintentionally, Ray Train will fetch the previous run state, even if the user is trying to start a new run. Therefore, always pass a unique run name when launching a new run. In other words, ``name`` should be a unique identifier for a training job. .. note:: Job driver crashes and interrupts do not count toward the ``max_failures`` limit of :ref:`worker fault tolerance `. Here's an example training script that highlights best practices for job driver fault tolerance: .. literalinclude:: ../doc_code/fault_tolerance.py :language: python :start-after: __job_driver_fault_tolerance_start__ :end-before: __job_driver_fault_tolerance_end__ Then, the entrypoint script can be launched with the following command: .. code-block:: bash python entrypoint.py --storage_path s3://my_bucket/ --run_name unique_run_id=da823d5 If the job is interrupted, the same command can be used to resume training. This example uses a ``da823d5`` id, which is chosen by whoever launches the job.
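For illustration, a minimal ``entrypoint.py`` following this pattern might look like the sketch below. The argument parsing mirrors the command above, but the training function body, the ``ScalingConfig`` values, and ``max_failures=3`` are assumptions for this sketch rather than the exact contents of the referenced example.

.. testcode::
    :skipif: True

    # entrypoint.py
    import argparse

    import ray.train
    from ray.train.torch import TorchTrainer


    def train_fn_per_worker(config):
        # Load the latest checkpoint with `ray.train.get_checkpoint()` here
        # so that a relaunched run resumes instead of starting from scratch.
        ...


    if __name__ == "__main__":
        parser = argparse.ArgumentParser()
        parser.add_argument("--storage_path", required=True)
        parser.add_argument("--run_name", required=True)
        args = parser.parse_args()

        trainer = TorchTrainer(
            train_fn_per_worker,
            # Assumed worker setup for this sketch.
            scaling_config=ray.train.ScalingConfig(num_workers=4, use_gpu=True),
            run_config=ray.train.RunConfig(
                # Reusing the same storage_path and name lets a relaunched
                # driver find the run state and checkpoints of the previous attempt.
                storage_path=args.storage_path,
                name=args.run_name,
                failure_config=ray.train.FailureConfig(max_failures=3),
            ),
        )
        trainer.fit()

Because the command line carries the ``storage_path`` and run name, rerunning the same command after an interruption points the relaunched driver at the same run state.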
The id can often be used for other purposes such as setting the ``wandb`` or ``mlflow`` run id. Illustrated Example ~~~~~~~~~~~~~~~~~~~ Consider the following example of a cluster containing a CPU head node and 2 GPU worker nodes. There are 4 GPU training workers running on the 2 worker nodes. The storage path has been configured to use cloud storage, which is where checkpoints are saved. .. figure:: ../images/fault_tolerance/cluster_failure_start.png :align: left Training has been running for some time, and the latest checkpoints and run state has been saved to storage. .. figure:: ../images/fault_tolerance/head_node_failure.png :align: left The head node crashes for some reason (e.g., an out-of-memory error), and the Ray Train driver process is interrupted. .. figure:: ../images/fault_tolerance/cluster_failure.png :align: left The entire cluster goes down due to the head node failure. .. figure:: ../images/fault_tolerance/cluster_recovery.png :align: left A manual cluster restart or some job submission system brings up a new Ray cluster. The Ray Train driver process runs on a new head node. Ray Train fetches the run state information from storage at ``{storage_path}/{name}`` (e.g., ``s3://my_bucket/my_run_name``) and passes the latest checkpoint to the newly launched worker processes to resume training. .. _train-fault-tolerance-deprecation-info: Fault Tolerance API Deprecations -------------------------------- ``Trainer.restore`` API Deprecation ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The ``Trainer.restore`` and ``Trainer.can_restore`` APIs are deprecated as of Ray 2.43 and will be removed in a future release. Motivation ********** This API change provides several benefits: 1. **Avoid saving user code to pickled files**: The old API saved user code to pickled files, which could lead to issues with deserialization, leading to unrecoverable runs. 2. **Improved configuration experience**: While some configurations were loaded from the pickled files, certain arguments were required to be re-specified, and another subset of arguments could even be optionally re-specified. This confused users about the set of configurations that are actually being used in the restored run. Migration Steps *************** To migrate from the old ``Trainer.restore`` API to the new pattern: 1. Enable the environment variable ``RAY_TRAIN_V2_ENABLED=1``. 2. Replace ``Trainer.restore`` with the regular ``Trainer`` constructor, making sure to pass in the same ``storage_path`` and ``name`` as the previous run. ``Trainer(restore_from_checkpoint)`` API Deprecation ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The ``Trainer(restore_from_checkpoint)`` API is deprecated as of Ray 2.43 and will be removed in a future release. Motivation ********** This API was a common source of confusion that provided minimal value. It was only used to set the initial value of ``ray.train.get_checkpoint()`` but did not load any other run state. Migration Steps *************** Simply pass in the initial checkpoint through the ``train_loop_config`` argument. See the migration guide linked below for a code example. Additional Resources ~~~~~~~~~~~~~~~~~~~~ * `Train V2 Migration Guide `_: Full migration guide for Train V2 * `Train V2 REP `_: Technical details about the API change * :ref:`train-fault-tolerance-deprecated-api`: Documentation for the old API --- .. _train-tune: Hyperparameter Tuning with Ray Tune =================================== .. 
important:: This user guide shows how to integrate Ray Train and Ray Tune to tune over distributed hyperparameter runs for the revamped Ray Train V2 available starting from Ray 2.43 by enabling the environment variable ``RAY_TRAIN_V2_ENABLED=1``. **This user guide assumes that the environment variable has been enabled.** Please see :ref:`here ` for information about the deprecation and migration. Ray Train can be used together with Ray Tune to do hyperparameter sweeps of distributed training runs. This is often useful when you want to do a small sweep over critical hyperparameters, before launching a run with the best performing hyperparameters on all available cluster resources for a long duration. Quickstart ---------- In the example below: * :class:`~ray.tune.Tuner` launches the tuning job, which runs trials of ``train_driver_fn`` with different hyperparameter configurations. * ``train_driver_fn``, which (1) takes in a hyperparameter configuration, (2) instantiates a ``TorchTrainer`` (or some other framework trainer), and (3) launches the distributed training job. * :class:`~ray.train.ScalingConfig` defines the number of training workers and resources per worker for a single Ray Train run. * ``train_fn_per_worker`` is the Python code that executes on each distributed training worker for a trial. .. literalinclude:: ../doc_code/train_tune_interop.py :language: python :start-after: __quickstart_start__ :end-before: __quickstart_end__ What does Ray Tune provide? --------------------------- Ray Tune provides utilities for: * :ref:`Defining hyperparameter search spaces ` and :ref:`launching multiple trials concurrently ` on a Ray cluster * :ref:`Using search algorithms ` * :ref:`Early stopping runs based on metrics ` This user guide only focuses on the integration layer between Ray Train and Ray Tune. For more details on how to use Ray Tune, refer to the :ref:`Ray Tune documentation `. Configuring resources for multiple trials ----------------------------------------- Ray Tune launches multiple trials which :ref:`run a user-defined function in a remote Ray actor `, where each trial gets a different sampled hyperparameter configuration. When using Ray Tune by itself, trials do computation directly inside the Ray actor. For example, each trial could request 1 GPU and do some single-process model training within the remote actor itself. When using Ray Train inside Ray Tune functions, the Tune trial is actually not doing extensive computation inside this actor -- instead it just acts as a driver process to launch and monitor the Ray Train workers running elsewhere. Ray Train requests its own resources via the :class:`~ray.train.ScalingConfig`. See :ref:`train_scaling_config` for more details. .. figure:: ../images/hyperparameter_optimization/train_without_tune.png :align: center A single Ray Train run to showcase how using Ray Tune in the next figure just adds a layer of hierarchy to this tree of processes. .. figure:: ../images/hyperparameter_optimization/train_tune_interop.png :align: center Example of Ray Train runs being launched from within Ray Tune trials. Limit the number of concurrent Ray Train runs ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Ray Train runs can only start when resources for all workers can be acquired at once. This means that multiple Tune trials spawning Train runs will be competing for the logical resources available in the Ray cluster. 
If there is a limiting cluster resource such as GPUs, then it won't be possible to run training for all hyperparameter configurations concurrently. Since the cluster only has enough resources for a handful of trials to run concurrently, set :class:`tune.TuneConfig(max_concurrent_trials) ` on the Tuner to limit the number of "in-flight" Train runs so that no trial is starved of resources. .. literalinclude:: ../doc_code/train_tune_interop.py :language: python :start-after: __max_concurrent_trials_start__ :end-before: __max_concurrent_trials_end__ As a concrete example, consider a fixed-size cluster with 128 CPUs and 8 GPUs. * The ``Tuner(param_space)`` sweeps over 4 hyperparameter configurations with a grid search: ``param_space={"train_loop_config": {"batch_size": tune.grid_search([8, 16, 32, 64])}}`` * Each Ray Train run is configured to train with 4 GPU workers: ``ScalingConfig(num_workers=4, use_gpu=True)``. Since there are only 8 GPUs, only 2 Train runs can acquire their full set of resources at a time. * However, since there are many CPUs available in the cluster, the 4 total Ray Tune trials (which default to requesting 1 CPU) can be launched immediately. This results in 2 extra Ray Tune trial processes being launched, even though their inner Ray Train run just waits for resources until one of the other trials finishes. This produces noisy log messages while Train waits for resources. There may also be an excessive number of Ray Tune trial processes if the total number of hyperparameter configurations is large. * To fix this issue, set ``Tuner(tune_config=tune.TuneConfig(max_concurrent_trials=2))``. Now, only two Ray Tune trial processes will be running at a time. This number can be calculated from the limiting cluster resource and the amount of that resource each trial requires. Advanced: Set Train driver resources ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The default Train driver runs as a Ray Tune function with 1 CPU. Ray Tune will schedule these functions to run anywhere on the cluster that has free logical CPU resources. **Recommendation:** If you are launching longer-running training jobs or using spot instances, these Tune functions, which act as the Ray Train driver processes, should run on "safe nodes" that are at lower risk of going down. For example, they should not be scheduled to run on preemptible spot instances and should not be colocated with training workers. This could be the head node or a dedicated CPU node in your cluster. This is because the Ray Train driver process is responsible for handling fault tolerance of the worker processes, which are more likely to fail. Nodes that are running Train workers can crash due to spot preemption or other errors that arise from the user-defined model training code. * If a Train worker node dies, the Ray Train driver process that is still alive on a different node can gracefully handle the error. * On the other hand, if the driver process dies, then all Ray Train workers will ungracefully exit and some of the run state may not be committed fully. One way to achieve this behavior is to set custom resources on certain node types and configure the Tune functions to request those resources.
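For example, a node that should host Train drivers can advertise a custom logical resource when it starts. This is a sketch; the resource name ``train_driver_node`` and the address placeholder are arbitrary choices, not values defined by Ray:

.. code-block:: bash

    # On the node that should host the Ray Train driver processes:
    ray start --address=<head-node-address:port> --resources='{"train_driver_node": 1}'

The Tune functions can then request a share of that custom resource, as the following example shows:

..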
literalinclude:: ../doc_code/train_tune_interop.py :language: python :start-after: __trainable_resources_start__ :end-before: __trainable_resources_end__ Reporting metrics and checkpoints --------------------------------- Both Ray Train and Ray Tune provide utilities to help upload and track checkpoints via the :func:`ray.train.report ` and :func:`ray.tune.report ` APIs. See the :ref:`train-checkpointing` user guide for more details. If the Ray Train workers report checkpoints, saving another Ray Tune checkpoint at the Train driver level is not needed because it does not hold any extra training state. The Ray Train driver process will already periodically snapshot its status to the configured storage_path, which is further described in the next section on fault tolerance. In order to access the checkpoints from the Tuner output, you can append the checkpoint path as a metric. The provided :class:`~ray.tune.integration.ray_train.TuneReportCallback` does this by propagating reported Ray Train results over to Ray Tune, where the checkpoint path is attached as a separate metric. Advanced: Fault Tolerance ~~~~~~~~~~~~~~~~~~~~~~~~~ In the event that the Ray Tune trials running the Ray Train driver process crash, you can enable trial fault tolerance on the Ray Tune side via: :class:`ray.tune.Tuner(run_config=ray.tune.RunConfig(failure_config)) `. Fault tolerance on the Ray Train side is configured and handled separately. See the :ref:`train-fault-tolerance` user guide for more details. .. literalinclude:: ../doc_code/train_tune_interop.py :language: python :start-after: __fault_tolerance_start__ :end-before: __fault_tolerance_end__ .. _train-with-tune-callbacks: Advanced: Using Ray Tune callbacks ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Ray Tune callbacks should be passed into the :class:`ray.tune.RunConfig(callbacks) ` at the Tuner level. For Ray Train users that depend on behavior of built-in or custom Ray Tune callbacks, it's possible to use them by running Ray Train as a single trial Tune run and passing in the callbacks to the Tuner. If any callback functionality depends on reported metrics, make sure to pass the :class:`ray.tune.integration.ray_train.TuneReportCallback` to the trainer callbacks, which propagates results to the Tuner. .. testcode:: :skipif: True import ray.tune from ray.tune.integration.ray_train import TuneReportCallback from ray.tune.logger import TBXLoggerCallback def train_driver_fn(config: dict): trainer = TorchTrainer( ..., run_config=ray.train.RunConfig(..., callbacks=[TuneReportCallback()]) ) trainer.fit() tuner = ray.tune.Tuner( train_driver_fn, run_config=ray.tune.RunConfig(callbacks=[TBXLoggerCallback()]) ) .. _train-tune-deprecation: ``Tuner(trainer)`` API Deprecation ---------------------------------- The ``Tuner(trainer)`` API which directly takes in a Ray Train trainer instance is deprecated as of Ray 2.43 and will be removed in a future release. Motivation ~~~~~~~~~~ This API change provides several benefits: 1. **Better separation of concerns**: Decouples Ray Train and Ray Tune responsibilities 2. **Improved configuration experience**: Makes hyperparameter and run configuration more explicit and flexible Migration Steps ~~~~~~~~~~~~~~~ To migrate from the old ``Tuner(trainer)`` API to the new pattern: 1. Enable the environment variable ``RAY_TRAIN_V2_ENABLED=1``. 2. Replace ``Tuner(trainer)`` with a function-based approach where Ray Train is launched inside a Tune trial. 3. 
Move your training logic into a driver function that Tune will call with different hyperparameters. Additional Resources ~~~~~~~~~~~~~~~~~~~~ * `Train V2 REP `_: Technical details about the API change * `Train V2 Migration Guide `_: Full migration guide for Train V2 * :ref:`train-tune-deprecated-api`: Documentation for the old API --- .. _train-local-mode: Local Mode ========== .. important:: This user guide shows how to use local mode with Ray Train V2 only. For information about migrating from Ray Train V1 to V2, see the Train V2 migration guide: https://github.com/ray-project/ray/issues/49454 What is local mode? ------------------- Local mode in Ray Train runs your training function without launching Ray Train worker actors. Instead of distributing your training code across multiple Ray actors, local mode executes your training function directly in the current process. This provides a simplified debugging environment where you can iterate quickly on your training logic. Local mode supports two execution modes: * **Single-process mode**: Runs your training function in a single process, ideal for rapid iteration and debugging. * **Multi-process mode with torchrun**: Launches multiple processes for multi-GPU training, useful for debugging distributed training logic with familiar tools. How to enable local mode ------------------------- You can enable local mode by setting ``num_workers=0`` in your :class:`~ray.train.ScalingConfig`: .. testcode:: :skipif: True from ray.train import ScalingConfig from ray.train.torch import TorchTrainer def train_func(config): # Your training logic pass trainer = TorchTrainer( train_loop_per_worker=train_func, scaling_config=ScalingConfig(num_workers=0), ) result = trainer.fit() Local mode provides the same ``ray.train`` APIs you use in distributed training, so your training code runs without any other modifications. This makes it simple to verify your training logic locally before scaling to distributed training. When to use local mode ---------------------- Use single-process local mode to: * **Develop and iterate quickly**: Test changes to your training function locally. * **Write unit tests**: Verify your training logic works correctly in a simplified environment. * **Debug training logic**: Use standard Python debugging tools to step through your training code and identify issues. Use multi-process local mode with ``torchrun`` to: * **Test multi-GPU logic**: Verify your distributed training code works correctly across multiple GPUs using familiar ``torchrun`` commands. * **Migrate existing code**: Bring existing ``torchrun`` based training scripts into Ray Train while preserving your development workflow. * **Debug distributed behavior**: Isolate issues in your distributed training logic using ``torchrun``'s process management. .. note:: In local mode, Ray Train doesn't launch worker actors, but your training code can still use other Ray features such as Ray Data (in single-process mode) or launch Ray actors if needed. Single-process local mode -------------------------- The following example shows how to use single-process local mode with PyTorch: .. 
testcode:: :skipif: True import torch from torch import nn import ray from ray.train import ScalingConfig from ray.train.torch import TorchTrainer def train_func(config): model = nn.Linear(10, 1) optimizer = torch.optim.SGD(model.parameters(), lr=config["lr"]) for epoch in range(config["epochs"]): # Training loop loss = model(torch.randn(32, 10)).sum() loss.backward() optimizer.step() # Report metrics ray.train.report({"loss": loss.item()}) trainer = TorchTrainer( train_loop_per_worker=train_func, train_loop_config={"lr": 0.01, "epochs": 3}, scaling_config=ScalingConfig(num_workers=0), ) result = trainer.fit() print(f"Final loss: {result.metrics['loss']}") .. note:: Local mode works with all Ray Train framework integrations, including PyTorch Lightning, Hugging Face Transformers, LightGBM, XGBoost, TensorFlow, and others. Testing with local mode ~~~~~~~~~~~~~~~~~~~~~~~ The following example shows how to write a unit test with local mode: .. testcode:: :skipif: True import pytest import ray from ray.train import ScalingConfig from ray.train.torch import TorchTrainer def test_training_runs(): def train_func(config): # Report minimal training result ray.train.report({"loss": 0.5}) trainer = TorchTrainer( train_loop_per_worker=train_func, scaling_config=ScalingConfig(num_workers=0), ) result = trainer.fit() assert result.error is None assert result.metrics["loss"] == 0.5 Using local mode with Ray Data ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Single-process local mode works seamlessly with Ray Data for data loading and preprocessing. When you use Ray Data with local mode, Ray Data processes your data and provides it back to your training function in the local process. The following example shows how to use Ray Data with single-process local mode: .. testcode:: :skipif: True import ray from ray.train import ScalingConfig from ray.train.torch import TorchTrainer def train_func(config): # Get the dataset shard train_dataset = ray.train.get_dataset_shard("train") # Iterate over batches for batch in train_dataset.iter_batches(batch_size=32): # Training logic pass # Create a Ray Dataset dataset = ray.data.read_csv("s3://bucket/data.csv") trainer = TorchTrainer( train_loop_per_worker=train_func, scaling_config=ScalingConfig(num_workers=0), datasets={"train": dataset}, ) result = trainer.fit() .. warning:: Ray Data isn't supported when using ``torchrun`` for multi-process training in local mode. For multi-process training, use standard PyTorch data loading mechanisms such as DataLoader with DistributedSampler. Multi-process local mode with ``torchrun`` ------------------------------------------- Local mode supports multi-GPU training through ``torchrun``, allowing you to develop and debug using ``torchrun``'s process management. Single-node multi-GPU training ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The following example shows how to use ``torchrun`` with local mode for multi-GPU training on a single node. This approach is useful when migrating existing PyTorch training code or when you want to debug distributed training logic using ``torchrun``'s familiar process management. The example uses standard PyTorch ``DataLoader`` for data loading, making it easy to adapt your existing PyTorch training code. First, create your training script (``train_script.py``): .. 
testcode:: :skipif: True import os import tempfile import torch import torch.distributed as dist from torch import nn from torch.utils.data import DataLoader from torchvision.datasets import FashionMNIST from torchvision.transforms import ToTensor, Normalize, Compose from filelock import FileLock import ray from ray.train import Checkpoint, ScalingConfig, get_context from ray.train.torch import TorchTrainer def train_func(config): # Load dataset with file locking to avoid multiple downloads transform = Compose([ToTensor(), Normalize((0.5,), (0.5,))]) data_dir = "./data" # Only local rank 0 downloads the dataset local_rank = get_context().get_local_rank() if local_rank == 0: with FileLock(os.path.join(data_dir, "fashionmnist.lock")): train_dataset = FashionMNIST( root=data_dir, train=True, download=True, transform=transform ) # Wait for rank 0 to finish downloading dist.barrier() # Now all ranks can safely load the dataset train_dataset = FashionMNIST( root=data_dir, train=True, download=False, transform=transform ) train_loader = DataLoader( train_dataset, batch_size=config["batch_size"], shuffle=True ) # Prepare dataloader for distributed training train_loader = ray.train.torch.prepare_data_loader(train_loader) # Prepare model for distributed training model = nn.Sequential( nn.Flatten(), nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 10) ) model = ray.train.torch.prepare_model(model) criterion = nn.CrossEntropyLoss() optimizer = torch.optim.Adam(model.parameters(), lr=config["lr"]) # Training loop for epoch in range(config["epochs"]): # Set epoch for distributed sampler if ray.train.get_context().get_world_size() > 1: train_loader.sampler.set_epoch(epoch) epoch_loss = 0.0 for batch_idx, (images, labels) in enumerate(train_loader): outputs = model(images) loss = criterion(outputs, labels) optimizer.zero_grad() loss.backward() optimizer.step() epoch_loss += loss.item() avg_loss = epoch_loss / len(train_loader) # Report metrics and checkpoint with tempfile.TemporaryDirectory() as temp_dir: torch.save(model.state_dict(), os.path.join(temp_dir, "model.pt")) ray.train.report( {"loss": avg_loss, "epoch": epoch}, checkpoint=Checkpoint.from_directory(temp_dir) ) # Configure trainer for local mode trainer = TorchTrainer( train_loop_per_worker=train_func, train_loop_config={"lr": 0.001, "epochs": 10, "batch_size": 32}, scaling_config=ScalingConfig(num_workers=0, use_gpu=True), ) result = trainer.fit() Then, launch training with ``torchrun``: .. code-block:: bash # Train on 4 GPUs on a single node torchrun --nproc-per-node=4 train_script.py Ray Train automatically detects the ``torchrun`` environment variables and configures the distributed training accordingly. You can access distributed training information through :func:`ray.train.get_context()`: .. testcode:: :skipif: True from ray.train import get_context context = get_context() print(f"World size: {context.get_world_size()}") print(f"World rank: {context.get_world_rank()}") print(f"Local rank: {context.get_local_rank()}") .. warning:: Ray Data isn't supported when using ``torchrun`` for multi-process training in local mode. For multi-process training, use standard PyTorch data loading mechanisms such as DataLoader with DistributedSampler. Multi-node multi-GPU training ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ You can also use ``torchrun`` to launch multi-node training with local mode. The following example shows how to launch training across 2 nodes with 4 GPUs each: On the master node (``192.168.1.1``): .. 
code-block:: bash RAY_TRAIN_V2_ENABLED=1 torchrun \ --nnodes=2 \ --nproc-per-node=4 \ --node_rank=0 \ --rdzv_backend=c10d \ --rdzv_endpoint=192.168.1.1:29500 \ --rdzv_id=job_id \ train_script.py On the worker node: .. code-block:: bash RAY_TRAIN_V2_ENABLED=1 torchrun \ --nnodes=2 \ --nproc-per-node=4 \ --node_rank=1 \ --rdzv_backend=c10d \ --rdzv_endpoint=192.168.1.1:29500 \ --rdzv_id=job_id \ train_script.py Transitioning from local mode to distributed training ----------------------------------------------------- When you're ready to scale from local mode to distributed training, simply change ``num_workers`` to a value greater than 0: .. code-block:: diff trainer = TorchTrainer( train_loop_per_worker=train_func, train_loop_config=config, - scaling_config=ScalingConfig(num_workers=0), + scaling_config=ScalingConfig(num_workers=4, use_gpu=True), ) Your training function code remains the same, and Ray Train handles the distributed coordination automatically. Limitations and API differences -------------------------------- Local mode provides simplified implementations of Ray Train APIs to enable rapid debugging without distributed orchestration. However, this means some features behave differently or aren't available. Features not available in local mode ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The following Ray Train features aren't available in local mode: * **Worker-level fault tolerance**: Ray Train's automatic fault tolerance features, such as worker restart on failure, aren't available. If you configured :class:`~ray.train.FailureConfig`, the settings don't apply in local mode. * **Callbacks**: User-defined callbacks specified in :class:`~ray.train.RunConfig` aren't invoked in local mode. * **Ray Data with multi-process training**: Ray Data isn't supported when using ``torchrun`` with local mode for multi-process training. Use standard PyTorch data loading mechanisms instead. API behavior differences ~~~~~~~~~~~~~~~~~~~~~~~~ The following table summarizes how ``ray.train`` APIs behave differently in local mode: .. list-table:: :header-rows: 1 :widths: 30 70 * - API - Behavior in local mode * - :func:`ray.train.report` - Stores checkpoints in memory only (not persisted to storage). Ignores ``checkpoint_upload_mode``, ``checkpoint_upload_fn``, ``validate_fn``, and ``delete_local_checkpoint_after_upload`` parameters. Logs metrics locally instead of through the reporting pipeline. Doesn't invoke a synchronization barrier across workers. * - :func:`ray.train.get_checkpoint` - Returns the last checkpoint from memory. Doesn't load checkpoints from persistent storage. * - :func:`ray.train.get_all_reported_checkpoints` - Always returns an empty list. Doesn't track checkpoint history. * - :func:`ray.train.collective.barrier` - No-op. * - :func:`ray.train.collective.broadcast_from_rank_zero` - Returns data as-is. * - :meth:`ray.train.get_context().get_storage() ` - Raises ``NotImplementedError`` --- .. _train-metrics: Ray Train Metrics ----------------- Ray Train exports Prometheus metrics including the Ray Train controller state, worker group start times, checkpointing times and more. You can use these metrics to monitor Ray Train runs. The Ray dashboard displays these metrics in the Ray Train Grafana Dashboard. See :ref:`Ray Dashboard documentation` for more information. The Ray Train dashboard also displays a subset of Ray Core metrics that are useful for monitoring training but are not listed in the table below. 
For more information about these metrics, see the :ref:`System Metrics documentation`. The following table lists the Prometheus metrics emitted by Ray Train: .. list-table:: Train Metrics :header-rows: 1 * - Prometheus Metric - Labels - Description * - `ray_train_controller_state` - `ray_train_run_name`, `ray_train_run_id`, `ray_train_controller_state` - Current state of the Ray Train controller. * - `ray_train_worker_group_start_total_time_s` - `ray_train_run_name`, `ray_train_run_id` - Total time taken to start the worker group. * - `ray_train_worker_group_shutdown_total_time_s` - `ray_train_run_name`, `ray_train_run_id` - Total time taken to shut down the worker group. * - `ray_train_report_total_blocked_time_s` - `ray_train_run_name`, `ray_train_run_id`, `ray_train_worker_world_rank`, `ray_train_worker_actor_id` - Cumulative time in seconds to report a checkpoint to storage. --- .. _train-monitoring-and-logging: Monitoring and Logging Metrics ============================== Ray Train provides an API for attaching metrics to :ref:`checkpoints ` from the training function by calling :func:`ray.train.report(metrics, checkpoint) `. The results will be collected from the distributed workers and passed to the Ray Train driver process for book-keeping. The primary use cases for reporting are: * metrics (accuracy, loss, etc.) at the end of each training epoch. See :ref:`train-dl-saving-checkpoints` for usage examples. * validating checkpoints on a validation set with a user-defined validation function. See :ref:`train-validating-checkpoints` for usage examples. Only the result reported by the rank 0 worker is attached to the checkpoint. However, in order to ensure consistency, ``train.report()`` acts as a barrier and must be called on each worker. To aggregate results from multiple workers, see :ref:`train-aggregating-results`. .. _train-aggregating-results: How to obtain and aggregate results from different workers? ----------------------------------------------------------- In real applications, you may want to calculate optimization metrics besides accuracy and loss: recall, precision, Fbeta, etc. You may also want to collect metrics from multiple workers. While Ray Train currently only reports metrics from the rank 0 worker, you can use third-party libraries or distributed primitives of your machine learning framework to report metrics from multiple workers. .. tab-set:: .. tab-item:: Native PyTorch Ray Train natively supports `TorchMetrics `_, which provides a collection of machine learning metrics for distributed, scalable PyTorch models. Here is an example of reporting both the aggregated R2 score and mean train and validation loss from all workers. .. literalinclude:: ../doc_code/metric_logging.py :language: python :start-after: __torchmetrics_start__ :end-before: __torchmetrics_end__ .. _train-metric-only-reporting-deprecation: (Deprecated) Reporting free-floating metrics -------------------------------------------- Reporting metrics with ``ray.train.report(metrics, checkpoint=None)`` from every worker writes the metrics to a Ray Tune log file (``progress.csv``, ``result.json``) and is accessible via the ``Result.metrics_dataframe`` on the :class:`~ray.train.Result` returned by ``trainer.fit()``. As of Ray 2.43, this behavior is deprecated and will not be supported in Ray Train V2, which is an overhaul of Ray Train's implementation and select APIs. 
Ray Train V2 only keeps a slim set of experiment tracking features that are necessary for fault tolerance, so it does not support reporting free-floating metrics that are not attached to checkpoints. The recommendation for metric tracking is to report metrics directly from the workers to experiment tracking tools such as MLFlow and WandB. See :ref:`train-experiment-tracking-native` for examples. In Ray Train V2, reporting only metrics from all workers is a no-op. However, it is still possible to access the results reported by all workers to implement custom metric-handling logic. .. literalinclude:: ../doc_code/metric_logging.py :language: python :start-after: __report_callback_start__ :end-before: __report_callback_end__ To use Ray Tune :class:`Callbacks ` that depend on free-floating metrics reported by workers, :ref:`run Ray Train as a single Ray Tune trial. ` See the following resources for more information: * `Train V2 REP `_: Technical details about the API changes in Train V2 * `Train V2 Migration Guide `_: Full migration guide for Train V2 --- .. _persistent-storage-guide: .. _train-log-dir: Configuring Persistent Storage ============================== A Ray Train run produces :ref:`checkpoints ` that can be saved to a persistent storage location. .. figure:: ../images/persistent_storage_checkpoint.png :align: center :width: 600px An example of multiple workers spread across multiple nodes uploading checkpoints to persistent storage. **Ray Train expects all workers to be able to write files to the same persistent storage location.** Therefore, Ray Train requires some form of external persistent storage such as cloud storage (e.g., S3, GCS) or a shared filesystem (e.g., AWS EFS, Google Filestore, HDFS) for multi-node training. Here are some capabilities that persistent storage enables: - **Checkpointing and fault tolerance**: Saving checkpoints to a persistent storage location allows you to resume training from the last checkpoint in case of a node failure. See :ref:`train-checkpointing` for a detailed guide on how to set up checkpointing. - **Post-experiment analysis**: A consolidated location storing data such as the best checkpoints and hyperparameter configs after the Ray cluster has already been terminated. - **Bridge training/fine-tuning with downstream serving and batch inference tasks**: You can easily access the models and artifacts to share them with others or use them in downstream tasks. Cloud storage (AWS S3, Google Cloud Storage) -------------------------------------------- .. tip:: Cloud storage is the recommended persistent storage option. Use cloud storage by specifying a bucket URI as the :class:`RunConfig(storage_path) `: .. testcode:: :skipif: True from ray import train from ray.train.torch import TorchTrainer trainer = TorchTrainer( ..., run_config=train.RunConfig( storage_path="s3://bucket-name/sub-path/", name="experiment_name", ) ) Ensure that all nodes in the Ray cluster have access to cloud storage, so outputs from workers can be uploaded to a shared cloud bucket. In this example, all files are uploaded to shared storage at ``s3://bucket-name/sub-path/experiment_name`` for further processing. Shared filesystem (NFS, HDFS) ----------------------------- Use by specifying the shared storage path as the :class:`RunConfig(storage_path) `: .. 
testcode:: :skipif: True from ray import train from ray.train.torch import TorchTrainer trainer = TorchTrainer( ..., run_config=train.RunConfig( storage_path="/mnt/cluster_storage", # HDFS example: # storage_path=f"hdfs://{hostname}:{port}/subpath", name="experiment_name", ) ) Ensure that all nodes in the Ray cluster have access to the shared filesystem, e.g. AWS EFS, Google Cloud Filestore, or HDFS, so that outputs can be saved to there. In this example, all files are saved to ``/mnt/cluster_storage/experiment_name`` for further processing. Local storage ------------- Using local storage for a single-node cluster ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ If you're just running an experiment on a single node (e.g., on a laptop), Ray Train will use the local filesystem as the storage location for checkpoints and other artifacts. Results are saved to ``~/ray_results`` in a sub-directory with a unique auto-generated name by default, unless you customize this with ``storage_path`` and ``name`` in :class:`~ray.train.RunConfig`. .. testcode:: :skipif: True from ray import train from ray.train.torch import TorchTrainer trainer = TorchTrainer( ..., run_config=train.RunConfig( storage_path="/tmp/custom/storage/path", name="experiment_name", ) ) In this example, all experiment results can found locally at ``/tmp/custom/storage/path/experiment_name`` for further processing. .. _multinode-local-storage-warning: Using local storage for a multi-node cluster ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. warning:: When running on multiple nodes, using the local filesystem of the head node as the persistent storage location is no longer supported. If you save checkpoints with :meth:`ray.train.report(..., checkpoint=...) ` and run on a multi-node cluster, Ray Train will raise an error if NFS or cloud storage is not setup. This is because Ray Train expects all workers to be able to write the checkpoint to the same persistent storage location. If your training loop does not save checkpoints, the reported metrics will still be aggregated to the local storage path on the head node. See `this issue `_ for more information. .. _custom-storage-filesystem: Custom storage -------------- If the cases above don't suit your needs, Ray Train can support custom filesystems and perform custom logic. Ray Train standardizes on the ``pyarrow.fs.FileSystem`` interface to interact with storage (`see the API reference here `_). By default, passing ``storage_path=s3://bucket-name/sub-path/`` will use pyarrow's `default S3 filesystem implementation `_ to upload files. (`See the other default implementations. `_) Implement custom storage upload and download logic by providing an implementation of ``pyarrow.fs.FileSystem`` to :class:`RunConfig(storage_filesystem) `. .. warning:: When providing a custom filesystem, the associated ``storage_path`` is expected to be a qualified filesystem path *without the protocol prefix*. For example, if you provide a custom S3 filesystem for ``s3://bucket-name/sub-path/``, then the ``storage_path`` should be ``bucket-name/sub-path/`` with the ``s3://`` stripped. See the example below for example usage. .. testcode:: :skipif: True import pyarrow.fs from ray import train from ray.train.torch import TorchTrainer fs = pyarrow.fs.S3FileSystem( endpoint_override="http://localhost:9000", access_key=..., secret_key=... 
) trainer = TorchTrainer( ..., run_config=train.RunConfig( storage_filesystem=fs, storage_path="bucket-name/sub-path", name="unique-run-id", ) ) ``fsspec`` filesystems ~~~~~~~~~~~~~~~~~~~~~~~ `fsspec `_ offers many filesystem implementations, such as ``s3fs``, ``gcsfs``, etc. You can use any of these implementations by wrapping the ``fsspec`` filesystem with a ``pyarrow.fs`` utility: .. testcode:: :skipif: True # Make sure to install: `pip install -U s3fs` import s3fs import pyarrow.fs s3_fs = s3fs.S3FileSystem( key='miniokey...', secret='asecretkey...', endpoint_url='https://...' ) custom_fs = pyarrow.fs.PyFileSystem(pyarrow.fs.FSSpecHandler(s3_fs)) run_config = RunConfig(storage_path="minio_bucket", storage_filesystem=custom_fs) .. seealso:: See the API references to the ``pyarrow.fs`` wrapper utilities: * https://arrow.apache.org/docs/python/generated/pyarrow.fs.PyFileSystem.html * https://arrow.apache.org/docs/python/generated/pyarrow.fs.FSSpecHandler.html MinIO and other S3-compatible storage ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ You can follow the :ref:`examples shown above ` to configure a custom S3 filesystem to work with MinIO. Note that including these as query parameters in the ``storage_path`` URI directly is another option: .. testcode:: :skipif: True from ray import train from ray.train.torch import TorchTrainer trainer = TorchTrainer( ..., run_config=train.RunConfig( storage_path="s3://bucket-name/sub-path?endpoint_override=http://localhost:9000", name="unique-run-id", ) ) Overview of Ray Train outputs ----------------------------- So far, we covered how to configure the storage location for Ray Train outputs. Let's walk through a concrete example to see what exactly these outputs are, and how they're structured in storage. .. seealso:: This example includes checkpointing, which is covered in detail in :ref:`train-checkpointing`. .. testcode:: :skipif: True import os import tempfile import ray.train from ray.train import Checkpoint from ray.train.torch import TorchTrainer def train_fn(config): for i in range(10): # Training logic here metrics = {"loss": ...} with tempfile.TemporaryDirectory() as temp_checkpoint_dir: torch.save(..., os.path.join(temp_checkpoint_dir, "checkpoint.pt")) train.report( metrics, checkpoint=Checkpoint.from_directory(temp_checkpoint_dir) ) trainer = TorchTrainer( train_fn, scaling_config=ray.train.ScalingConfig(num_workers=2), run_config=ray.train.RunConfig( storage_path="s3://bucket-name/sub-path/", name="unique-run-id", ) ) result: train.Result = trainer.fit() last_checkpoint: Checkpoint = result.checkpoint Here's a rundown of all files that will be persisted to storage: .. code-block:: text {RunConfig.storage_path} (ex: "s3://bucket-name/sub-path/") └── {RunConfig.name} (ex: "unique-run-id") <- Train run output directory ├── *_snapshot.json <- Train run metadata files (DeveloperAPI) ├── checkpoint_epoch=0/ <- Checkpoints ├── checkpoint_epoch=1/ └── ... The :class:`~ray.train.Result` and :class:`~ray.train.Checkpoint` objects returned by ``trainer.fit`` are the easiest way to access the data in these files: .. testcode:: :skipif: True result.filesystem, result.path # S3FileSystem, "bucket-name/sub-path/unique-run-id" result.checkpoint.filesystem, result.checkpoint.path # S3FileSystem, "bucket-name/sub-path/unique-run-id/checkpoint_epoch=0" See :ref:`train-inspect-results` for a full guide on interacting with training :class:`Results `. .. _train-storage-advanced: Advanced configuration ---------------------- .. 
_train-working-directory: Keep the original current working directory ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Ray Train changes the current working directory of each worker to the same path. By default, this path is a sub-directory of the Ray session directory (e.g., ``/tmp/ray/session_latest``), which is also where other Ray logs and temporary files are dumped. The location of the Ray session directory :ref:`can be customized `. To disable the default behavior of Ray Train changing the current working directory, set the ``RAY_CHDIR_TO_TRIAL_DIR=0`` environment variable. This is useful if you want your training workers to access relative paths from the directory you launched the training script from. .. tip:: When running in a distributed cluster, you will need to make sure that all workers have a mirrored working directory to access the same relative paths. One way to achieve this is setting the :ref:`working directory in the Ray runtime environment `. .. testcode:: import os import ray import ray.train from ray.train.torch import TorchTrainer os.environ["RAY_CHDIR_TO_TRIAL_DIR"] = "0" # Write some file in the current working directory with open("./data.txt", "w") as f: f.write("some data") # Set the working directory in the Ray runtime environment ray.init(runtime_env={"working_dir": "."}) def train_fn_per_worker(config): # Check that each worker can access the working directory # NOTE: The working directory is copied to each worker and is read only. assert os.path.exists("./data.txt"), os.getcwd() trainer = TorchTrainer( train_fn_per_worker, scaling_config=ray.train.ScalingConfig(num_workers=2), run_config=ray.train.RunConfig( # storage_path=..., ), ) trainer.fit() Deprecated ---------- The following sections describe behavior that is deprecated as of Ray 2.43 and will not be supported in Ray Train V2, which is an overhaul of Ray Train's implementation and select APIs. See the following resources for more information: * `Train V2 REP `_: Technical details about the API change * `Train V2 Migration Guide `_: Full migration guide for Train V2 (Deprecated) Persisting training artifacts ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. note:: This feature of persisting training worker artifacts is deprecated as of Ray 2.43. The feature relied on Ray Tune's local working directory abstraction, where the local files of each worker would be copied to storage. Ray Train V2 decouples the two libraries, so this API, which already provided limited value, has been deprecated. In the example above, we saved some artifacts within the training loop to the worker's *current working directory*. If you were training a stable diffusion model, you could save some sample generated images every so often as a training artifact. By default, Ray Train changes the current working directory of each worker to be inside the run's :ref:`local staging directory `. This way, all distributed training workers share the same absolute path as the working directory. See :ref:`below ` for how to disable this default behavior, which is useful if you want your training workers to keep their original working directories. If :class:`RunConfig(SyncConfig(sync_artifacts=True)) `, then all artifacts saved in this directory will be persisted to storage. The frequency of artifact syncing can be configured via :class:`SyncConfig `. Note that this behavior is off by default. Here's an example of what the Train run output directory looks like, with the worker artifacts: .. 
code-block:: text s3://bucket-name/sub-path (RunConfig.storage_path) └── experiment_name (RunConfig.name) <- The "experiment directory" ├── experiment_state-*.json ├── basic-variant-state-*.json ├── trainer.pkl ├── tuner.pkl └── TorchTrainer_46367_00000_0_... <- The "trial directory" ├── events.out.tfevents... <- Tensorboard logs of reported metrics ├── result.json <- JSON log file of reported metrics ├── checkpoint_000000/ <- Checkpoints ├── checkpoint_000001/ ├── ... ├── artifact-rank=0-iter=0.txt <- Worker artifacts ├── artifact-rank=1-iter=0.txt └── ... .. warning:: Artifacts saved by *every worker* will be synced to storage. If you have multiple workers co-located on the same node, make sure that workers don't delete files within their shared working directory. A best practice is to only write artifacts from a single worker unless you really need artifacts from multiple. .. testcode:: :skipif: True from ray import train if train.get_context().get_world_rank() == 0: # Only the global rank 0 worker saves artifacts. ... if train.get_context().get_local_rank() == 0: # Every local rank 0 worker saves artifacts. ... .. _train-local-staging-dir: (Deprecated) Setting the local staging directory ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. note:: This section describes behavior depending on Ray Tune implementation details that no longer applies to Ray Train V2. .. warning:: Prior to 2.10, the ``RAY_AIR_LOCAL_CACHE_DIR`` environment variable and ``RunConfig(local_dir)`` were ways to configure the local staging directory to be outside of the home directory (``~/ray_results``). **These configurations are no longer used to configure the local staging directory. Please instead use** ``RunConfig(storage_path)`` **to configure where your run's outputs go.** Apart from files such as checkpoints written directly to the ``storage_path``, Ray Train also writes some logfiles and metadata files to an intermediate *local staging directory* before they get persisted (copied/uploaded) to the ``storage_path``. The current working directory of each worker is set within this local staging directory. By default, the local staging directory is a sub-directory of the Ray session directory (e.g., ``/tmp/ray/session_latest``), which is also where other temporary Ray files are dumped. Customize the location of the staging directory by :ref:`setting the location of the temporary Ray session directory `. Here's an example of what the local staging directory looks like: .. code-block:: text /tmp/ray/session_latest/artifacts// └── experiment_name ├── driver_artifacts <- These are all uploaded to storage periodically │ ├── Experiment state snapshot files needed for resuming training │ └── Metrics logfiles └── working_dirs <- These are uploaded to storage if `SyncConfig(sync_artifacts=True)` └── Current working directory of training workers, which contains worker artifacts .. warning:: You should not need to look into the local staging directory. The ``storage_path`` should be the only path that you need to interact with. The structure of the local staging directory is subject to change in future versions of Ray Train -- do not rely on these local staging files in your application. --- .. _train-reproducibility: Reproducibility --------------- .. tab-set:: .. tab-item:: PyTorch To limit sources of nondeterministic behavior, add :func:`ray.train.torch.enable_reproducibility` to the top of your training function. .. 
code-block:: diff def train_func(): + train.torch.enable_reproducibility() model = NeuralNetwork() model = train.torch.prepare_model(model) ... .. warning:: :func:`ray.train.torch.enable_reproducibility` can't guarantee completely reproducible results across executions. To learn more, read the `PyTorch notes on randomness `_. .. import ray from ray import tune def training_func(config): dataloader = ray.train.get_dataset()\ .get_shard(torch.rank())\ .iter_torch_batches(batch_size=config["batch_size"]) for i in config["epochs"]: ray.train.report(...) # use same intermediate reporting API # Declare the specification for training. trainer = Trainer(backend="torch", num_workers=12, use_gpu=True) dataset = ray.dataset.window() # Convert this to a trainable. trainable = trainer.to_tune_trainable(training_func, dataset=dataset) tuner = tune.Tuner(trainable, param_space={"lr": tune.uniform(), "batch_size": tune.randint(1, 2, 3)}, tune_config=tune.TuneConfig(num_samples=12)) results = tuner.fit() --- .. _train-inspect-results: Inspecting Training Results =========================== The return value of ``trainer.fit()`` is a :class:`~ray.train.Result` object. The :class:`~ray.train.Result` object contains, among other information: - The last reported checkpoint (to load the model) and its attached metrics - Error messages, if any errors occurred Viewing metrics --------------- You can retrieve reported metrics that were attached to a checkpoint from the :class:`~ray.train.Result` object. Common metrics include the training or validation loss, or prediction accuracies. The metrics retrieved from the :class:`~ray.train.Result` object correspond to those you passed to :func:`train.report ` as an argument :ref:`in your training function `. .. note:: Persisting free-floating metrics reported via ``ray.train.report(metrics, checkpoint=None)`` is deprecated. This also means that retrieving these metrics from the :class:`~ray.train.Result` object is deprecated. Only metrics attached to checkpoints are persisted. See :ref:`train-metric-only-reporting-deprecation` for more details. Last reported metrics ~~~~~~~~~~~~~~~~~~~~~ Use :attr:`Result.metrics ` to retrieve the metrics attached to the last reported checkpoint. .. literalinclude:: ../doc_code/key_concepts.py :language: python :start-after: __result_metrics_start__ :end-before: __result_metrics_end__ Dataframe of all reported metrics ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Use :attr:`Result.metrics_dataframe ` to retrieve a pandas DataFrame of all metrics reported alongside checkpoints. .. literalinclude:: ../doc_code/key_concepts.py :language: python :start-after: __result_dataframe_start__ :end-before: __result_dataframe_end__ Retrieving checkpoints ---------------------- You can retrieve checkpoints reported to Ray Train from the :class:`~ray.train.Result` object. :ref:`Checkpoints ` contain all the information that is needed to restore the training state. This usually includes the trained model. You can use checkpoints for common downstream tasks such as :doc:`offline batch inference with Ray Data ` or :doc:`online model serving with Ray Serve `. The checkpoints retrieved from the :class:`~ray.train.Result` object correspond to those you passed to :func:`train.report ` as an argument :ref:`in your training function `. Last saved checkpoint ~~~~~~~~~~~~~~~~~~~~~ Use :attr:`Result.checkpoint ` to retrieve the last checkpoint. .. 
literalinclude:: ../doc_code/key_concepts.py :language: python :start-after: __result_checkpoint_start__ :end-before: __result_checkpoint_end__ Other checkpoints ~~~~~~~~~~~~~~~~~ Sometimes you want to access an earlier checkpoint. For instance, if your loss increased after more training due to overfitting, you may want to retrieve the checkpoint with the lowest loss. You can retrieve a list of all available checkpoints and their metrics with :attr:`Result.best_checkpoints ` .. literalinclude:: ../doc_code/key_concepts.py :language: python :start-after: __result_best_checkpoint_start__ :end-before: __result_best_checkpoint_end__ .. seealso:: See :ref:`train-checkpointing` for more information on checkpointing. Accessing storage location --------------------------- If you need to retrieve the results later, you can get the storage location of the training run with :attr:`Result.path `. This path will correspond to the :ref:`storage_path ` you configured in the :class:`~ray.train.RunConfig`. It will be a (nested) subdirectory within that path, usually of the form `TrainerName_date-string/TrainerName_id_00000_0_...`. The result also contains a :class:`pyarrow.fs.FileSystem` that can be used to access the storage location, which is useful if the path is on cloud storage. .. literalinclude:: ../doc_code/key_concepts.py :language: python :start-after: __result_path_start__ :end-before: __result_path_end__ .. You can restore a result with :meth:`Result.from_path `: .. .. literalinclude:: ../doc_code/key_concepts.py .. :language: python .. :start-after: __result_restore_start__ .. :end-before: __result_restore_end__ Catching Errors --------------- If an error occurred during training, :attr:`Result.error ` will be set and contain the exception that was raised. .. literalinclude:: ../doc_code/key_concepts.py :language: python :start-after: __result_error_start__ :end-before: __result_error_end__ Finding results on persistent storage ------------------------------------- All training results, including reported metrics and checkpoints, are stored on the configured :ref:`persistent storage `. See :ref:`the persistent storage guide ` to configure this location for your training run. --- .. _train-scaling-collation-functions: Advanced: Scaling out expensive collate functions ================================================= By default, the collate function executes on the training worker when you call :meth:`ray.data.DataIterator.iter_torch_batches`. This approach has two main drawbacks: - **Low scalability**: The collate function runs sequentially on each training worker, limiting parallelism. - **Resource competition**: The collate function consumes CPU and memory resources from the training worker, potentially slowing down model training. Scaling out the collate function to Ray Data allows you to scale collation across multiple CPU nodes independently of training workers, improving overall pipeline throughput, especially with heavy collate functions. This optimization is particularly effective when the collate function is computationally expensive (such as tokenization, image augmentation, or complex feature engineering) and you have additional CPU resources available for data preprocessing. Moving the collate function to Ray Data --------------------------------------- The following example shows a typical collate function that runs on the training worker: .. code-block:: python train_dataset = read_parquet().map(...)
def train_func(): for batch in ray.train.get_dataset_shard("train").iter_torch_batches( collate_fn=collate_fn, batch_size=BATCH_SIZE ): # Training logic here pass trainer = TorchTrainer( train_func, datasets={"train": train_dataset}, scaling_config=ScalingConfig(num_workers=4, use_gpu=True) ) result = trainer.fit() If the collate function is time- or compute-intensive and you'd like to scale it out, you should: * Create a custom collate function that runs in Ray Data and use :meth:`ray.data.Dataset.map_batches` to scale it out. * Use :meth:`ray.data.Dataset.repartition` to ensure batch size alignment. Creating a custom collate function that runs in Ray Data -------------------------------------------------------- To scale out, you'll want to move the ``collate_fn`` into a Ray Data ``map_batches`` operation: .. code-block:: python def collate_fn(batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]: return batch train_dataset = train_dataset.map_batches(collate_fn, batch_size=BATCH_SIZE) def train_func(): for batch in ray.train.get_dataset_shard("train").iter_torch_batches( collate_fn=None, batch_size=BATCH_SIZE, ): # Training logic here pass trainer = TorchTrainer( train_func, datasets={"train": train_dataset}, scaling_config=ScalingConfig(num_workers=4, use_gpu=True) ) result = trainer.fit() A couple of things to note: - The ``collate_fn`` returns a dictionary of NumPy arrays, which is a standard Ray Data batch format. - The ``iter_torch_batches`` method uses ``collate_fn=None``, which reduces the amount of work done on the training worker process. Ensuring batch size alignment ----------------------------- Typically, collate functions are used to create complete batches of data with a target batch size. However, if you move the collate function to Ray Data using :meth:`ray.data.Dataset.map_batches`, Ray Data doesn't guarantee the batch size for each function call by default. There are two common problems that you may encounter. 1. The collate function requires a certain number of rows provided as an input to work properly. 2. You want to avoid any reformatting / rebatching of the data on the training worker process. To solve these problems, you can use :meth:`ray.data.Dataset.repartition` with ``target_num_rows_per_block`` to ensure batch size alignment. By calling ``repartition`` before ``map_batches``, you ensure that the input blocks contain the desired number of rows. .. code-block:: python # Note: If you only use map_batches(batch_size=BATCH_SIZE), you are not guaranteed to get the desired number of rows as an input. dataset = dataset.repartition(target_num_rows_per_block=BATCH_SIZE).map_batches(collate_fn, batch_size=BATCH_SIZE) By calling ``repartition`` after ``map_batches``, you ensure that the output blocks contain the desired number of rows. This avoids any reformatting / rebatching of the data on the training worker process. .. code-block:: python dataset = dataset.map_batches(collate_fn, batch_size=BATCH_SIZE).repartition(target_num_rows_per_block=BATCH_SIZE) def train_func(): for batch in ray.train.get_dataset_shard("train").iter_torch_batches( collate_fn=None, batch_size=BATCH_SIZE, ): # Training logic here pass trainer = TorchTrainer( train_func, datasets={"train": train_dataset}, scaling_config=ScalingConfig(num_workers=4, use_gpu=True) ) result = trainer.fit() Putting things together ----------------------- Throughout this guide, we use a mock text dataset to demonstrate the optimization.
You can find the implementation of the mock dataset in :ref:`random-text-generator`. .. tab-set:: .. tab-item:: Baseline implementation The following example shows a typical collate function that runs on the training worker: .. testcode:: :skipif: True from transformers import AutoTokenizer import torch import numpy as np from typing import Dict from ray.train.torch import TorchTrainer from ray.train import ScalingConfig from mock_dataset import create_mock_ray_text_dataset BATCH_SIZE = 10000 def vanilla_collate_fn(tokenizer: AutoTokenizer, batch: Dict[str, np.ndarray]) -> Dict[str, torch.Tensor]: outputs = tokenizer( list(batch["text"]), truncation=True, padding="longest", return_tensors="pt", ) outputs["labels"] = torch.LongTensor(batch["label"]) return outputs def train_func(): tokenizer = AutoTokenizer.from_pretrained("bert-base-cased") collate_fn = lambda x: vanilla_collate_fn(tokenizer, x) # Collate function runs on the training worker for batch in ray.train.get_dataset_shard("train").iter_torch_batches( collate_fn=collate_fn, batch_size=BATCH_SIZE ): # Training logic here pass train_dataset = create_mock_ray_text_dataset( dataset_size=1000000, min_len=1000, max_len=3000 ) trainer = TorchTrainer( train_func, datasets={"train": train_dataset}, scaling_config=ScalingConfig(num_workers=4, use_gpu=True) ) result = trainer.fit() .. tab-item:: Optimized implementation The following example moves the collate function to Ray Data preprocessing: .. testcode:: :skipif: True from transformers import AutoTokenizer import numpy as np from typing import Dict from ray.train.torch import TorchTrainer from ray.train import ScalingConfig from mock_dataset import create_mock_ray_text_dataset import pyarrow as pa BATCH_SIZE = 10000 class CollateFnRayData: def __init__(self): self.tokenizer = AutoTokenizer.from_pretrained("bert-base-cased") def __call__(self, batch: pa.Table) -> Dict[str, np.ndarray]: results = self.tokenizer( batch["text"].to_pylist(), truncation=True, padding="longest", return_tensors="np", ) results["labels"] = np.array(batch["label"]) return results def train_func(): # Collate function already ran in Ray Data for batch in ray.train.get_dataset_shard("train").iter_torch_batches( collate_fn=None, batch_size=BATCH_SIZE, ): # Training logic here pass # Apply preprocessing in Ray Data train_dataset = ( create_mock_ray_text_dataset( dataset_size=1000000, min_len=1000, max_len=3000 ) .map_batches( CollateFnRayData, batch_size=BATCH_SIZE, batch_format="pyarrow", ) .repartition(target_num_rows_per_block=BATCH_SIZE) # Ensure batch size alignment ) trainer = TorchTrainer( train_func, datasets={"train": train_dataset}, scaling_config=ScalingConfig(num_workers=4, use_gpu=True) ) result = trainer.fit() The optimized implementation makes these changes: - **Preprocessing in Ray Data**: The tokenization logic moves from ``train_func`` to ``CollateFnRayData``, which runs in ``map_batches``. - **NumPy output**: The collate function returns ``Dict[str, np.ndarray]`` instead of PyTorch tensors, which Ray Data natively supports. - **Batch alignment**: ``repartition(target_num_rows_per_block=BATCH_SIZE)`` after ``map_batches`` ensures the collate function receives exact batch sizes and output blocks align with the batch size. - **No collate_fn in iterator**: ``iter_torch_batches`` uses ``collate_fn=None`` because preprocessing already happened in Ray Data. Benchmark results ~~~~~~~~~~~~~~~~~ The following benchmarks demonstrate the performance improvement from scaling out the collate function. 
The test uses text tokenization with a batch size of 10,000 on a dataset of 1 million rows with text lengths between 1,000 and 3,000 characters. **Single node (g4dn.12xlarge: 48 vCPU, 4 NVIDIA T4 GPUs, 192 GiB memory)** .. list-table:: :header-rows: 1 * - Configuration - Throughput * - Collate in iterator (baseline) - 1,588 rows/s * - Collate in Ray Data - 3,437 rows/s **With 2 additional CPU nodes (m5.8xlarge: 32 vCPU, 128 GiB memory each)** .. list-table:: :header-rows: 1 * - Configuration - Throughput * - Collate in iterator (baseline) - 1,659 rows/s * - Collate in Ray Data - 10,717 rows/s The results show that scaling out the collate function to Ray Data provides a 2x speedup on a single node and a 6x speedup when adding CPU-only nodes for preprocessing. Advanced: Handling custom data types ------------------------------------ The optimized implementation above returns ``Dict[str, np.ndarray]``, which Ray Data natively supports. However, if your collate function needs to return PyTorch tensors or other custom data types that :meth:`ray.data.Dataset.map_batches` doesn't directly support, you need to serialize them. .. _train-tensor-serialization-utility: Tensor serialization utility ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The following utility serializes PyTorch tensors into PyArrow format. It flattens all tensors in a batch into a single binary buffer, stores metadata about tensor shapes and dtypes, and packs everything into a single-row PyArrow table. On the training side, it deserializes the table back into the original tensor structure. The serialization and deserialization operations are typically lightweight compared to the actual collate function work (such as tokenization or image processing), so the overhead is minimal relative to the performance gains from scaling the collate function. You can use :ref:`train-collate-utils` as a reference implementation and adapt it to your needs. Example with tensor serialization ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The following example demonstrates using tensor serialization when your collate function must return PyTorch tensors. This approach requires ``repartition`` before ``map_batches`` because the collate function changes the number of output rows (each batch becomes a single serialized row). .. 
testcode:: :skipif: True from transformers import AutoTokenizer import torch from typing import Dict from ray.data.collate_fn import ArrowBatchCollateFn import pyarrow as pa from collate_utils import serialize_tensors_to_table, deserialize_table_to_tensors from ray.train.torch import TorchTrainer from ray.train import ScalingConfig from mock_dataset import create_mock_ray_text_dataset BATCH_SIZE = 10000 class TextTokenizerCollateFn: """Collate function that runs in Ray Data preprocessing.""" def __init__(self): self.tokenizer = AutoTokenizer.from_pretrained("bert-base-cased") def __call__(self, batch: pa.Table) -> pa.Table: # Tokenize the batch outputs = self.tokenizer( batch["text"].to_pylist(), truncation=True, padding="longest", return_tensors="pt", ) outputs["labels"] = torch.LongTensor(batch["label"].to_numpy()) # Serialize to single-row table using the utility return serialize_tensors_to_table(outputs) class IteratorCollateFn(ArrowBatchCollateFn): """Collate function for iter_torch_batches that deserializes the batch.""" def __init__(self, pin_memory=False): self._pin_memory = pin_memory def __call__(self, batch: pa.Table) -> Dict[str, torch.Tensor]: # Deserialize from single-row table using the utility return deserialize_table_to_tensors(batch, pin_memory=self._pin_memory) def train_func(): collate_fn = IteratorCollateFn() # Collate function only deserializes on the training worker for batch in ray.train.get_dataset_shard("train").iter_torch_batches( collate_fn=collate_fn, batch_size=1 # Each "row" is actually a full batch ): # Training logic here pass # Apply preprocessing in Ray Data # Use repartition BEFORE map_batches because output row count changes train_dataset = ( create_mock_ray_text_dataset( dataset_size=1000000, min_len=1000, max_len=3000 ) .repartition(target_num_rows_per_block=BATCH_SIZE) .map_batches( TextTokenizerCollateFn, batch_size=BATCH_SIZE, batch_format="pyarrow", ) ) trainer = TorchTrainer( train_func, datasets={"train": train_dataset}, scaling_config=ScalingConfig(num_workers=4, use_gpu=True) ) result = trainer.fit() --- .. _train_scaling_config: Configuring Scale and GPUs ========================== Increasing the scale of a Ray Train training run is simple and can be done in a few lines of code. The main interface for this is the :class:`~ray.train.ScalingConfig`, which configures the number of workers and the resources they should use. In this guide, a *worker* refers to a Ray Train distributed training worker, which is a :ref:`Ray Actor ` that runs your training function. Increasing the number of workers -------------------------------- The main interface to control parallelism in your training code is to set the number of workers. This can be done by passing the ``num_workers`` attribute to the :class:`~ray.train.ScalingConfig`: .. testcode:: from ray.train import ScalingConfig scaling_config = ScalingConfig( num_workers=8 ) Using GPUs ---------- To use GPUs, pass ``use_gpu=True`` to the :class:`~ray.train.ScalingConfig`. This will request one GPU per training worker. In the example below, training will run on 8 GPUs (8 workers, each using one GPU). .. testcode:: from ray.train import ScalingConfig scaling_config = ScalingConfig( num_workers=8, use_gpu=True ) Using GPUs in the training function ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ When ``use_gpu=True`` is set, Ray Train will automatically set up environment variables in your training function so that the GPUs can be detected and used (e.g. ``CUDA_VISIBLE_DEVICES``). 
You can get the associated devices with :meth:`ray.train.torch.get_device`. .. testcode:: import torch from ray.train import ScalingConfig from ray.train.torch import TorchTrainer, get_device def train_func(): assert torch.cuda.is_available() device = get_device() assert device == torch.device("cuda:0") trainer = TorchTrainer( train_func, scaling_config=ScalingConfig( num_workers=1, use_gpu=True ) ) trainer.fit() Assigning multiple GPUs to a worker ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Sometimes you might want to allocate multiple GPUs for a worker. For example, you can specify `resources_per_worker={"GPU": 2}` in the `ScalingConfig` if you want to assign 2 GPUs for each worker. You can get a list of associated devices with :meth:`ray.train.torch.get_devices`. .. testcode:: import torch from ray.train import ScalingConfig from ray.train.torch import TorchTrainer, get_device, get_devices def train_func(): assert torch.cuda.is_available() device = get_device() devices = get_devices() assert device == torch.device("cuda:0") assert devices == [torch.device("cuda:0"), torch.device("cuda:1")] trainer = TorchTrainer( train_func, scaling_config=ScalingConfig( num_workers=1, use_gpu=True, resources_per_worker={"GPU": 2} ) ) trainer.fit() Setting the GPU type ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Ray Train allows you to specify the accelerator type for each worker. This is useful if you want to use a specific accelerator type for model training. In a heterogeneous Ray cluster, this means that your training workers will be forced to run on the specified GPU type, rather than on any arbitrary GPU node. You can get a list of supported `accelerator_type` from :ref:`the available accelerator types `. For example, you can specify `accelerator_type="A100"` in the :class:`~ray.train.ScalingConfig` if you want to assign each worker a NVIDIA A100 GPU. .. tip:: Ensure that your cluster has instances with the specified accelerator type or is able to autoscale to fulfill the request. .. testcode:: ScalingConfig( num_workers=1, use_gpu=True, accelerator_type="A100" ) (PyTorch) Setting the communication backend ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ PyTorch Distributed supports multiple `backends `__ for communicating tensors across workers. By default Ray Train will use NCCL when ``use_gpu=True`` and Gloo otherwise. If you explicitly want to override this setting, you can configure a :class:`~ray.train.torch.TorchConfig` and pass it into the :class:`~ray.train.torch.TorchTrainer`. .. testcode:: :hide: num_training_workers = 1 .. testcode:: from ray.train.torch import TorchConfig, TorchTrainer trainer = TorchTrainer( train_func, scaling_config=ScalingConfig( num_workers=num_training_workers, use_gpu=True, # Defaults to NCCL ), torch_config=TorchConfig(backend="gloo"), ) (NCCL) Setting the communication network interface ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ When using NCCL for distributed training, you can configure the network interface cards that are used for communicating between GPUs by setting the `NCCL_SOCKET_IFNAME `__ environment variable. To ensure that the environment variable is set for all training workers, you can pass it in a :ref:`Ray runtime environment `: .. testcode:: :skipif: True import ray runtime_env = {"env_vars": {"NCCL_SOCKET_IFNAME": "ens5"}} ray.init(runtime_env=runtime_env) trainer = TorchTrainer(...) 
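If you want to double-check which backend the workers actually initialized, you can inspect it from inside the training function. The following is a minimal sketch (it assumes the standard ``TorchTrainer`` setup, in which Ray Train initializes the default process group before invoking your training function):

.. testcode::
    :skipif: True

    import torch.distributed as dist

    def train_func():
        # Ray Train sets up the process group before calling this function,
        # so this prints the backend selected through TorchConfig
        # (NCCL when use_gpu=True, Gloo otherwise, unless overridden).
        print(f"Communication backend: {dist.get_backend()}")
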
Setting the resources per worker -------------------------------- If you want to allocate more than one CPU or GPU per training worker, or if you defined :ref:`custom cluster resources `, set the ``resources_per_worker`` attribute: .. testcode:: from ray.train import ScalingConfig scaling_config = ScalingConfig( num_workers=8, resources_per_worker={ "CPU": 4, "GPU": 2, }, use_gpu=True, ) .. note:: If you specify GPUs in ``resources_per_worker``, you also need to set ``use_gpu=True``. You can also instruct Ray Train to use fractional GPUs. In that case, multiple workers will be assigned the same CUDA device. .. testcode:: from ray.train import ScalingConfig scaling_config = ScalingConfig( num_workers=8, resources_per_worker={ "CPU": 4, "GPU": 0.5, }, use_gpu=True, ) (Deprecated) Trainer resources ------------------------------ .. important:: This API is deprecated. See `this migration guide `_ for more details. So far we've configured resources for each training worker. Technically, each training worker is a :ref:`Ray Actor `. Ray Train also schedules an actor for the trainer object when you call ``trainer.fit()``. This object often only manages lightweight communication between the training workers. By default, a trainer uses 1 CPU. If you have a cluster with 8 CPUs and want to start 4 training workers with 2 CPUs each, this will not work, as the total number of required CPUs will be 9 (4 * 2 + 1). In that case, you can specify the trainer resources to use 0 CPUs: .. testcode:: from ray.train import ScalingConfig scaling_config = ScalingConfig( num_workers=4, resources_per_worker={ "CPU": 2, }, trainer_resources={ "CPU": 0, } ) --- .. _train-user-guides: Ray Train User Guides ===================== .. toctree:: :maxdepth: 2 user-guides/data-loading-preprocessing user-guides/using-gpus user-guides/local_mode user-guides/persistent-storage user-guides/monitoring-logging user-guides/checkpoints user-guides/asynchronous-validation user-guides/experiment-tracking user-guides/results user-guides/fault-tolerance user-guides/monitor-your-application user-guides/reproducibility Hyperparameter Optimization user-guides/scaling-collation-functions --- .. _tune-api-ref: Ray Tune API ============ .. tip:: We'd love to hear your feedback on using Tune - `get in touch `_! This section contains a reference for the Tune API. If there is anything missing, please open an issue on `GitHub`_. .. _`GitHub`: https://github.com/ray-project/ray/issues .. toctree:: :maxdepth: 2 execution.rst result_grid.rst trainable.rst search_space.rst suggestion.rst schedulers.rst stoppers.rst reporters.rst syncing.rst logging.rst callbacks.rst env.rst integration.rst internals.rst cli.rst --- .. _tune-callbacks-docs: Tune Callbacks (tune.Callback) ============================== See :doc:`this user guide ` for more details. .. seealso:: :doc:`Tune's built-in loggers ` use the ``Callback`` interface. Callback Interface ------------------ Callback Initialization and Setup ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. currentmodule:: ray.tune .. autosummary:: :nosignatures: :toctree: doc/ Callback .. autosummary:: :nosignatures: :toctree: doc/ Callback.setup Callback Hooks ~~~~~~~~~~~~~~ ..
autosummary:: :nosignatures: :toctree: doc/ Callback.on_checkpoint Callback.on_experiment_end Callback.on_step_begin Callback.on_step_end Callback.on_trial_complete Callback.on_trial_error Callback.on_trial_restore Callback.on_trial_result Callback.on_trial_save Callback.on_trial_start Stateful Callbacks ~~~~~~~~~~~~~~~~~~ The following methods must be overridden for stateful callbacks to be saved/restored properly by Tune. .. autosummary:: :nosignatures: :toctree: doc/ Callback.get_state Callback.set_state --- Tune CLI (Experimental) ======================= ``tune`` has an easy-to-use command line interface (CLI) to manage and monitor your experiments on Ray. Here is an example command line call: ``tune list-trials``: List tabular information about trials within an experiment. Empty columns will be dropped by default. Add the ``--sort`` flag to sort the output by specific columns. Add the ``--filter`` flag to filter the output in the format ``" "``. Add the ``--output`` flag to write the trial information to a specific file (CSV or Pickle). Add the ``--columns`` and ``--result-columns`` flags to select specific columns to display. .. code-block:: bash $ tune list-trials [EXPERIMENT_DIR] --output note.csv +------------------+-----------------------+------------+ | trainable_name | experiment_tag | trial_id | |------------------+-----------------------+------------| | MyTrainableClass | 0_height=40,width=37 | 87b54a1d | | MyTrainableClass | 1_height=21,width=70 | 23b89036 | | MyTrainableClass | 2_height=99,width=90 | 518dbe95 | | MyTrainableClass | 3_height=54,width=21 | 7b99a28a | | MyTrainableClass | 4_height=90,width=69 | ae4e02fb | +------------------+-----------------------+------------+ Dropped columns: ['status', 'last_update_time'] Please increase your terminal size to view remaining columns. Output saved at: note.csv $ tune list-trials [EXPERIMENT_DIR] --filter "trial_id == 7b99a28a" +------------------+-----------------------+------------+ | trainable_name | experiment_tag | trial_id | |------------------+-----------------------+------------| | MyTrainableClass | 3_height=54,width=21 | 7b99a28a | +------------------+-----------------------+------------+ Dropped columns: ['status', 'last_update_time'] Please increase your terminal size to view remaining columns. --- .. _tune-env-vars: Environment variables used by Ray Tune -------------------------------------- Some of Ray Tune's behavior can be configured using environment variables. These are the environment variables Ray Tune currently considers: * **TUNE_DISABLE_AUTO_CALLBACK_LOGGERS**: Ray Tune automatically adds a CSV and JSON logger callback if they haven't been passed. Setting this variable to `1` disables this automatic creation. Please note that this will most likely affect analyzing your results after the tuning run. * **TUNE_DISABLE_AUTO_INIT**: Disable automatically calling ``ray.init()`` if not attached to a Ray session. * **TUNE_DISABLE_DATED_SUBDIR**: Ray Tune automatically adds a date string to experiment directories when the name is not specified explicitly or the trainable isn't passed as a string. Setting this environment variable to ``1`` disables adding these date strings. * **TUNE_DISABLE_STRICT_METRIC_CHECKING**: When you report metrics to Tune via ``tune.report()`` and passed a ``metric`` parameter to ``Tuner()``, a scheduler, or a search algorithm, Tune will error if the metric was not reported in the result. Setting this environment variable to ``1`` will disable this check. 
* **TUNE_DISABLE_SIGINT_HANDLER**: Ray Tune catches SIGINT signals (e.g. sent by Ctrl+C) to gracefully shut down and do a final checkpoint. Setting this variable to ``1`` will disable signal handling and stop execution right away. Defaults to ``0``. * **TUNE_FORCE_TRIAL_CLEANUP_S**: By default, Ray Tune will gracefully terminate trials, letting them finish the current training step and any user-defined cleanup. Setting this variable to a non-zero, positive integer will cause trials to be forcefully terminated after a grace period of that many seconds. Defaults to ``600`` (seconds). * **TUNE_FUNCTION_THREAD_TIMEOUT_S**: Time in seconds the function API waits for threads to finish after instructing them to complete. Defaults to ``2``. * **TUNE_GLOBAL_CHECKPOINT_S**: Time in seconds that limits how often experiment state is checkpointed. If not set, this will default to ``'auto'``. ``'auto'`` measures the time it takes to snapshot the experiment state and adjusts the period so that ~5% of the driver's time is spent on snapshotting. You can set this to a fixed value (for example, ``TUNE_GLOBAL_CHECKPOINT_S=60``) to snapshot your experiment state at that fixed interval instead. * **TUNE_MAX_LEN_IDENTIFIER**: Maximum length of trial subdirectory names (those with the parameter values in them). * **TUNE_MAX_PENDING_TRIALS_PG**: Maximum number of pending trials when placement groups are used. Defaults to ``auto``, which will be updated to ``max(200, cluster_cpus * 1.1)`` for random/grid search and ``1`` for any other search algorithms. * **TUNE_PLACEMENT_GROUP_PREFIX**: Prefix for placement groups created by Ray Tune. This prefix is used e.g. to identify placement groups that should be cleaned up on start/stop of the tuning run. This is initialized to a unique name at the start of the first run. * **TUNE_PLACEMENT_GROUP_RECON_INTERVAL**: How often to reconcile placement groups. Reconciliation is used to make sure that the number of requested placement groups and pending/running trials are in sync. In normal circumstances these shouldn't differ anyway, but reconciliation makes sure to capture cases when placement groups are manually destroyed. Reconciliation doesn't take much time, but it can add up when running a large number of short trials. Defaults to every ``5`` (seconds). * **TUNE_PRINT_ALL_TRIAL_ERRORS**: If ``1``, will print all trial errors as they come up. Otherwise, errors will only be saved as text files to the trial directory and not printed. Defaults to ``1``. * **TUNE_RESULT_BUFFER_LENGTH**: Ray Tune can buffer results from trainables before they are passed to the driver. Enabling this might delay scheduling decisions, as trainables are speculatively continued. Setting this to ``1`` disables result buffering. Cannot be used with ``checkpoint_at_end``. Defaults to disabled. * **TUNE_RESULT_DELIM**: Delimiter used for nested entries in :class:`ExperimentAnalysis ` dataframes. Defaults to ``.`` (but will be changed to ``/`` in future versions of Ray). * **TUNE_RESULT_BUFFER_MAX_TIME_S**: Similarly, Ray Tune buffers results up to ``number_of_trial/10`` seconds, but never longer than this value. Defaults to 100 (seconds). * **TUNE_RESULT_BUFFER_MIN_TIME_S**: Additionally, you can specify a minimum time to buffer results. Defaults to 0. * **TUNE_WARN_THRESHOLD_S**: Threshold for logging if a Tune event loop operation takes too long. Defaults to 0.5 (seconds). * **TUNE_WARN_INSUFFICIENT_RESOURCE_THRESHOLD_S**: Threshold for throwing a warning if no active trials are in ``RUNNING`` state for this amount of seconds.
If the Ray Tune job is stuck in this state (most likely due to insufficient resources), the warning message is printed repeatedly every this amount of seconds. Defaults to 60 (seconds). * **TUNE_WARN_INSUFFICIENT_RESOURCE_THRESHOLD_S_AUTOSCALER**: Threshold for throwing a warning when the autoscaler is enabled and if no active trials are in ``RUNNING`` state for this amount of seconds. If the Ray Tune job is stuck in this state (most likely due to insufficient resources), the warning message is printed repeatedly every this amount of seconds. Defaults to 60 (seconds). * **TUNE_WARN_SLOW_EXPERIMENT_CHECKPOINT_SYNC_THRESHOLD_S**: Threshold for logging a warning if the experiment state syncing takes longer than this time in seconds. The experiment state files should be very lightweight, so this should not take longer than ~5 seconds. Defaults to 5 (seconds). * **TUNE_STATE_REFRESH_PERIOD**: Frequency of updating the resource tracking from Ray. Defaults to 10 (seconds). * **TUNE_RESTORE_RETRY_NUM**: The number of retries that are done before a particular trial's restore is determined unsuccessful. After that, the trial is not restored to its previous checkpoint but rather from scratch. Default is ``0``. While this retry counter is taking effect, per trial failure number will not be incremented, which is compared against ``max_failures``. * **TUNE_ONLY_STORE_CHECKPOINT_SCORE_ATTRIBUTE**: If set to ``1``, only the metric defined by ``checkpoint_score_attribute`` will be stored with each ``Checkpoint``. As a result, ``Result.best_checkpoints`` will contain only this metric, omitting others that would normally be included. This can significantly reduce memory usage, especially when many checkpoints are stored or when metrics are large. Defaults to ``0`` (i.e., all metrics are stored). * **RAY_AIR_FULL_TRACEBACKS**: If set to 1, will print full tracebacks for training functions, including internal code paths. Otherwise, abbreviated tracebacks that only show user code are printed. Defaults to 0 (disabled). * **RAY_AIR_NEW_OUTPUT**: If set to 0, this disables the `experimental new console output `_. There are some environment variables that are mostly relevant for integrated libraries: * **WANDB_API_KEY**: Weights and Biases API key. You can also use ``wandb login`` instead. --- Tune Execution (tune.Tuner) =========================== .. _tune-run-ref: Tuner ----- .. currentmodule:: ray.tune .. autosummary:: :nosignatures: :toctree: doc/ Tuner .. autosummary:: :nosignatures: :toctree: doc/ Tuner.fit Tuner.get_results Tuner Configuration ~~~~~~~~~~~~~~~~~~~ .. autosummary:: :nosignatures: :toctree: doc/ TuneConfig RunConfig CheckpointConfig FailureConfig Restoring a Tuner ~~~~~~~~~~~~~~~~~ .. autosummary:: :nosignatures: :toctree: doc/ Tuner.restore Tuner.can_restore tune.run_experiments -------------------- .. autosummary:: :nosignatures: :toctree: doc/ run_experiments run Experiment TuneError --- .. _tune-integration: External library integrations for Ray Tune =========================================== .. currentmodule:: ray .. _tune-integration-pytorch-lightning: PyTorch Lightning (tune.integration.pytorch_lightning) ------------------------------------------------------ .. autosummary:: :nosignatures: :toctree: doc/ ~tune.integration.pytorch_lightning.TuneReportCheckpointCallback .. _tune-integration-xgboost: XGBoost (tune.integration.xgboost) ---------------------------------- .. 
autosummary:: :nosignatures: :template: autosummary/class_without_autosummary.rst :toctree: doc/ ~tune.integration.xgboost.TuneReportCheckpointCallback .. _tune-integration-lightgbm: LightGBM (tune.integration.lightgbm) ------------------------------------ .. autosummary:: :nosignatures: :template: autosummary/class_without_autosummary.rst :toctree: doc/ ~tune.integration.lightgbm.TuneReportCheckpointCallback --- Tune Internals ============== .. _raytrialexecutor-docstring: TunerInternal --------------- .. autoclass:: ray.tune.impl.tuner_internal.TunerInternal :members: .. _trial-docstring: Trial ----- .. autoclass:: ray.tune.experiment.trial.Trial :members: FunctionTrainable ----------------- .. autoclass:: ray.tune.trainable.function_trainable.FunctionTrainable .. autofunction:: ray.tune.trainable.function_trainable.wrap_function Registry -------- .. autofunction:: ray.tune.register_trainable .. autofunction:: ray.tune.register_env Output ------ .. autoclass:: ray.tune.experimental.output.ProgressReporter .. autoclass:: ray.tune.experimental.output.TrainReporter .. autoclass:: ray.tune.experimental.output.TuneReporterBase .. autoclass:: ray.tune.experimental.output.TuneTerminalReporter --- .. _loggers-docstring: Tune Loggers (tune.logger) ========================== Tune automatically uses loggers for TensorBoard, CSV, and JSON formats. By default, Tune only logs the returned result dictionaries from the training function. If you need to log something lower level like model weights or gradients, see :ref:`Trainable Logging `. .. note:: Tune's per-trial ``Logger`` classes have been deprecated. Use the ``LoggerCallback`` interface instead. .. currentmodule:: ray .. _logger-interface: LoggerCallback Interface (tune.logger.LoggerCallback) ----------------------------------------------------- .. autosummary:: :nosignatures: :toctree: doc/ ~tune.logger.LoggerCallback .. autosummary:: :nosignatures: :toctree: doc/ ~tune.logger.LoggerCallback.log_trial_start ~tune.logger.LoggerCallback.log_trial_restore ~tune.logger.LoggerCallback.log_trial_save ~tune.logger.LoggerCallback.log_trial_result ~tune.logger.LoggerCallback.log_trial_end Tune Built-in Loggers --------------------- .. autosummary:: :nosignatures: :toctree: doc/ tune.logger.JsonLoggerCallback tune.logger.CSVLoggerCallback tune.logger.TBXLoggerCallback MLFlow Integration ------------------ Tune also provides a logger for `MLflow `_. You can install MLflow via ``pip install mlflow``. See the :doc:`tutorial here `. .. autosummary:: :nosignatures: :toctree: doc/ ~air.integrations.mlflow.MLflowLoggerCallback ~air.integrations.mlflow.setup_mlflow Wandb Integration ----------------- Tune also provides a logger for `Weights & Biases `_. You can install Wandb via ``pip install wandb``. See the :doc:`tutorial here `. .. autosummary:: :nosignatures: :toctree: doc/ ~air.integrations.wandb.WandbLoggerCallback ~air.integrations.wandb.setup_wandb Comet Integration ------------------------------ Tune also provides a logger for `Comet `_. You can install Comet via ``pip install comet-ml``. See the :doc:`tutorial here `. .. autosummary:: :nosignatures: :toctree: doc/ ~air.integrations.comet.CometLoggerCallback Aim Integration --------------- Tune also provides a logger for the `Aim `_ experiment tracker. You can install Aim via ``pip install aim``. See the :doc:`tutorial here `. .. 
autosummary:: :nosignatures: :toctree: doc/ ~tune.logger.aim.AimLoggerCallback Other Integrations ------------------ Viskit ~~~~~~ Tune automatically integrates with `Viskit `_ via the ``CSVLoggerCallback`` outputs. To use VisKit (you may have to install some dependencies), run: .. code-block:: bash $ git clone https://github.com/rll/rllab.git $ python rllab/rllab/viskit/frontend.py ~/ray_results/my_experiment The non-relevant metrics (like timing stats) can be disabled on the left to show only the relevant ones (like accuracy, loss, etc.). .. image:: ../images/ray-tune-viskit.png --- .. _tune-reporter-doc: Tune Console Output (Reporters) =============================== By default, Tune reports experiment progress periodically to the command-line as follows. .. code-block:: bash == Status == Memory usage on this node: 11.4/16.0 GiB Using FIFO scheduling algorithm. Resources requested: 4/12 CPUs, 0/0 GPUs, 0.0/3.17 GiB heap, 0.0/1.07 GiB objects Result logdir: /Users/foo/ray_results/myexp Number of trials: 4 (4 RUNNING) +----------------------+----------+---------------------+-----------+--------+--------+--------+--------+------------------+-------+ | Trial name | status | loc | param1 | param2 | param3 | acc | loss | total time (s) | iter | |----------------------+----------+---------------------+-----------+--------+--------+--------+--------+------------------+-------| | MyTrainable_a826033a | RUNNING | 10.234.98.164:31115 | 0.303706 | 0.0761 | 0.4328 | 0.1289 | 1.8572 | 7.54952 | 15 | | MyTrainable_a8263fc6 | RUNNING | 10.234.98.164:31117 | 0.929276 | 0.158 | 0.3417 | 0.4865 | 1.6307 | 7.0501 | 14 | | MyTrainable_a8267914 | RUNNING | 10.234.98.164:31111 | 0.068426 | 0.0319 | 0.1147 | 0.9585 | 1.9603 | 7.0477 | 14 | | MyTrainable_a826b7bc | RUNNING | 10.234.98.164:31112 | 0.729127 | 0.0748 | 0.1784 | 0.1797 | 1.7161 | 7.05715 | 14 | +----------------------+----------+---------------------+-----------+--------+--------+--------+--------+------------------+-------+ Note that columns will be hidden if they are completely empty. The output can be configured in various ways by instantiating a ``CLIReporter`` instance (or ``JupyterNotebookReporter`` if you're using jupyter notebook). Here's an example: .. TODO: test these snippets .. code-block:: python import ray.tune from ray.tune import CLIReporter # Limit the number of rows. reporter = CLIReporter(max_progress_rows=10) # Add a custom metric column, in addition to the default metrics. # Note that this must be a metric that is returned in your training results. reporter.add_metric_column("custom_metric") tuner = tune.Tuner(my_trainable, run_config=ray.tune.RunConfig(progress_reporter=reporter)) results = tuner.fit() Extending ``CLIReporter`` lets you control reporting frequency. For example: .. 
code-block:: python from ray.tune.experiment.trial import Trial class ExperimentTerminationReporter(CLIReporter): def should_report(self, trials, done=False): """Reports only on experiment termination.""" return done tuner = tune.Tuner(my_trainable, run_config=ray.tune.RunConfig(progress_reporter=ExperimentTerminationReporter())) results = tuner.fit() class TrialTerminationReporter(CLIReporter): def __init__(self): super(TrialTerminationReporter, self).__init__() self.num_terminated = 0 def should_report(self, trials, done=False): """Reports only on trial termination events.""" old_num_terminated = self.num_terminated self.num_terminated = len([t for t in trials if t.status == Trial.TERMINATED]) return self.num_terminated > old_num_terminated tuner = tune.Tuner(my_trainable, run_config=ray.tune.RunConfig(progress_reporter=TrialTerminationReporter())) results = tuner.fit() The default reporting style can also be overridden more broadly by extending the ``ProgressReporter`` interface directly. Note that you can print to any output stream, file etc. .. code-block:: python from ray.tune import ProgressReporter class CustomReporter(ProgressReporter): def should_report(self, trials, done=False): return True def report(self, trials, *sys_info): print(*sys_info) print("\n".join([str(trial) for trial in trials])) tuner = tune.Tuner(my_trainable, run_config=ray.tune.RunConfig(progress_reporter=CustomReporter())) results = tuner.fit() .. currentmodule:: ray.tune Reporter Interface (tune.ProgressReporter) ------------------------------------------ .. autosummary:: :nosignatures: :toctree: doc/ ProgressReporter .. autosummary:: :nosignatures: :toctree: doc/ ProgressReporter.report ProgressReporter.should_report Tune Built-in Reporters ----------------------- .. autosummary:: :nosignatures: :toctree: doc/ CLIReporter JupyterNotebookReporter --- .. _air-results-ref: .. _tune-analysis-docs: .. _result-grid-docstring: Tune Experiment Results (tune.ResultGrid) ========================================= ResultGrid (tune.ResultGrid) ---------------------------- .. currentmodule:: ray .. autosummary:: :nosignatures: :toctree: doc/ ~tune.ResultGrid .. autosummary:: :nosignatures: :toctree: doc/ ~tune.ResultGrid.get_best_result ~tune.ResultGrid.get_dataframe .. _result-docstring: Result (tune.Result) --------------------- .. autosummary:: :nosignatures: :template: autosummary/class_without_autosummary.rst :toctree: doc/ ~tune.Result .. _exp-analysis-docstring: ExperimentAnalysis (tune.ExperimentAnalysis) -------------------------------------------- .. note:: An `ExperimentAnalysis` is the output of the ``tune.run`` API. It's now recommended to use :meth:`Tuner.fit `, which outputs a `ResultGrid` object. .. autosummary:: :nosignatures: :toctree: doc/ ~tune.ExperimentAnalysis --- .. _tune-schedulers: Tune Trial Schedulers (tune.schedulers) ======================================= In Tune, some hyperparameter optimization algorithms are written as "scheduling algorithms". These Trial Schedulers can early terminate bad trials, pause trials, clone trials, and alter hyperparameters of a running trial. All Trial Schedulers take in a ``metric``, which is a value returned in the result dict of your Trainable and is maximized or minimized according to ``mode``. .. 
code-block:: python from ray import tune from ray.tune.schedulers import ASHAScheduler def train_fn(config): # This objective function is just for demonstration purposes tune.report({"loss": config["param"]}) tuner = tune.Tuner( train_fn, tune_config=tune.TuneConfig( scheduler=ASHAScheduler(), metric="loss", mode="min", num_samples=10, ), param_space={"param": tune.uniform(0, 1)}, ) results = tuner.fit() .. currentmodule:: ray.tune.schedulers .. _tune-scheduler-hyperband: ASHA (tune.schedulers.ASHAScheduler) ------------------------------------ The `ASHA `__ scheduler can be used by setting the ``scheduler`` parameter of ``tune.TuneConfig``, which is taken in by ``Tuner``, e.g. .. code-block:: python from ray import tune from ray.tune.schedulers import ASHAScheduler asha_scheduler = ASHAScheduler( time_attr='training_iteration', metric='loss', mode='min', max_t=100, grace_period=10, reduction_factor=3, brackets=1, ) tuner = tune.Tuner( train_fn, tune_config=tune.TuneConfig(scheduler=asha_scheduler), ) results = tuner.fit() Compared to the original version of HyperBand, this implementation provides better parallelism and avoids straggler issues during eliminations. **We recommend using this over the standard HyperBand scheduler.** An example of this can be found here: :doc:`/tune/examples/includes/async_hyperband_example`. Even though the original paper mentions a bracket count of 3, discussions with the authors concluded that the value should be left to 1 bracket. This is the default used if no value is provided for the ``brackets`` argument. .. autosummary:: :nosignatures: :toctree: doc/ :template: autosummary/class_without_autosummary.rst AsyncHyperBandScheduler ASHAScheduler .. _tune-original-hyperband: HyperBand (tune.schedulers.HyperBandScheduler) ---------------------------------------------- Tune implements the `standard version of HyperBand `__. **We recommend using the ASHA Scheduler over the standard HyperBand scheduler.** .. autosummary:: :nosignatures: :toctree: doc/ HyperBandScheduler HyperBand Implementation Details ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Implementation details may deviate slightly from theory but are focused on increasing usability. Note: ``R``, ``s_max``, and ``eta`` are parameters of HyperBand given by the paper. See `this post `_ for context. 1. Both ``s_max`` (representing the ``number of brackets - 1``) and ``eta``, representing the downsampling rate, are fixed. In many practical settings, ``R``, which represents some resource unit and often the number of training iterations, can be set reasonably large, like ``R >= 200``. For simplicity, assume ``eta = 3``. Varying ``R`` between ``R = 200`` and ``R = 1000`` creates a huge range of the number of trials needed to fill up all brackets. .. image:: /images/hyperband_bracket.png On the other hand, holding ``R`` constant at ``R = 300`` and varying ``eta`` also leads to HyperBand configurations that are not very intuitive: .. image:: /images/hyperband_eta.png The implementation takes the same configuration as the example given in the paper and exposes ``max_t``, which is not a parameter in the paper. 2. The example in the `post `_ to calculate ``n_0`` is actually a little different than the algorithm given in the paper. In this implementation, we implement ``n_0`` according to the paper (which is `n` in the below example): .. image:: /images/hyperband_allocation.png 3. There are also implementation specific details like how trials are placed into brackets which are not covered in the paper. 
This implementation places trials within brackets according to smaller bracket first - meaning that with low number of trials, there will be less early stopping. .. _tune-scheduler-msr: Median Stopping Rule (tune.schedulers.MedianStoppingRule) --------------------------------------------------------- The Median Stopping Rule implements the simple strategy of stopping a trial if its performance falls below the median of other trials at similar points in time. .. autosummary:: :nosignatures: :toctree: doc/ MedianStoppingRule .. _tune-scheduler-pbt: Population Based Training (tune.schedulers.PopulationBasedTraining) ------------------------------------------------------------------- Tune includes a distributed implementation of `Population Based Training (PBT) `__. This can be enabled by setting the ``scheduler`` parameter of ``tune.TuneConfig``, which is taken in by ``Tuner``, e.g. .. code-block:: python from ray import tune from ray.tune.schedulers import PopulationBasedTraining pbt_scheduler = PopulationBasedTraining( time_attr='training_iteration', metric='loss', mode='min', perturbation_interval=1, hyperparam_mutations={ "lr": [1e-3, 5e-4, 1e-4, 5e-5, 1e-5], "alpha": tune.uniform(0.0, 1.0), } ) tuner = tune.Tuner( train_fn, tune_config=tune.TuneConfig( num_samples=4, scheduler=pbt_scheduler, ), ) tuner.fit() When the PBT scheduler is enabled, each trial variant is treated as a member of the population. Periodically, **top-performing trials are checkpointed** (this requires your Trainable to support :ref:`save and restore `). **Low-performing trials clone the hyperparameter configurations of top performers and perturb them** slightly in the hopes of discovering even better hyperparameter settings. **Low-performing trials also resume from the checkpoints of the top performers**, allowing the trials to explore the new hyperparameter configuration starting from a partially trained model (e.g. by copying model weights from one of the top-performing trials). Take a look at :doc:`/tune/examples/pbt_visualization/pbt_visualization` to get an idea of how PBT operates. :doc:`/tune/examples/pbt_guide` gives more examples of PBT usage. .. autosummary:: :nosignatures: :toctree: doc/ PopulationBasedTraining .. _tune-scheduler-pbt-replay: Population Based Training Replay (tune.schedulers.PopulationBasedTrainingReplay) -------------------------------------------------------------------------------- Tune includes a utility to replay hyperparameter schedules of Population Based Training runs. You just specify an existing experiment directory and the ID of the trial you would like to replay. The scheduler accepts only one trial, and it will update its config according to the obtained schedule. .. code-block:: python from ray import tune from ray.tune.schedulers import PopulationBasedTrainingReplay replay = PopulationBasedTrainingReplay( experiment_dir="~/ray_results/pbt_experiment/", trial_id="XXXXX_00001" ) tuner = tune.Tuner( train_fn, tune_config=tune.TuneConfig(scheduler=replay) ) results = tuner.fit() See :ref:`here for an example ` on how to use the replay utility in practice. .. autosummary:: :nosignatures: :toctree: doc/ PopulationBasedTrainingReplay .. _tune-scheduler-pb2: Population Based Bandits (PB2) (tune.schedulers.pb2.PB2) -------------------------------------------------------- Tune includes a distributed implementation of `Population Based Bandits (PB2) `__. 
This algorithm builds upon PBT, with the main difference being that instead of using random perturbations, PB2 selects new hyperparameter configurations using a Gaussian Process model. The Tune implementation of PB2 requires scikit-learn to be installed: .. code-block:: bash pip install scikit-learn PB2 can be enabled by setting the ``scheduler`` parameter of ``tune.TuneConfig`` which is taken in by ``Tuner``, e.g.: .. code-block:: python from ray.tune.schedulers.pb2 import PB2 pb2_scheduler = PB2( time_attr='time_total_s', metric='mean_accuracy', mode='max', perturbation_interval=600.0, hyperparam_bounds={ "lr": [1e-3, 1e-5], "alpha": [0.0, 1.0], ... } ) tuner = tune.Tuner( ... , tune_config=tune.TuneConfig(scheduler=pb2_scheduler)) results = tuner.fit() When the PB2 scheduler is enabled, each trial variant is treated as a member of the population. Periodically, top-performing trials are checkpointed (this requires your Trainable to support :ref:`save and restore `). Low-performing trials clone the checkpoints of top performers and perturb the configurations in the hope of discovering an even better variation. The primary motivation for PB2 is the ability to find promising hyperparameters with only a small population size. With that in mind, you can run this :doc:`PB2 PPO example ` to compare PB2 vs. PBT, with a population size of ``4`` (as in the paper). The example uses the ``BipedalWalker`` environment so does not require any additional licenses. .. autosummary:: :nosignatures: :toctree: doc/ pb2.PB2 .. _tune-scheduler-bohb: BOHB (tune.schedulers.HyperBandForBOHB) --------------------------------------- This class is a variant of HyperBand that enables the `BOHB Algorithm `_. This implementation is true to the original HyperBand implementation and does not implement pipelining nor straggler mitigation. This is to be used in conjunction with the Tune BOHB search algorithm. See :ref:`TuneBOHB ` for package requirements, examples, and details. An example of this in use can be found here: :doc:`/tune/examples/includes/bohb_example`. .. autosummary:: :nosignatures: :toctree: doc/ HyperBandForBOHB .. _tune-resource-changing-scheduler: ResourceChangingScheduler ------------------------- This class is a utility scheduler, allowing for trial resource requirements to be changed during tuning. It wraps around another scheduler and uses its decisions. * If you are using the Trainable (class) API for tuning, your Trainable must implement ``Trainable.update_resources``, which will let your model know about the new resources assigned. You can also obtain the current trial resources by calling ``Trainable.trial_resources``. * If you are using the functional API for tuning, get the current trial resources obtained by calling `tune.get_trial_resources()` inside the training function. The function should be able to :ref:`load and save checkpoints ` (the latter preferably every iteration). An example of this in use can be found here: :doc:`/tune/examples/includes/xgboost_dynamic_resources_example`. .. autosummary:: :nosignatures: :toctree: doc/ ResourceChangingScheduler resource_changing_scheduler.DistributeResources resource_changing_scheduler.DistributeResourcesToTopJob FIFOScheduler (Default Scheduler) --------------------------------- .. autosummary:: :nosignatures: :toctree: doc/ FIFOScheduler TrialScheduler Interface ------------------------ .. autosummary:: :nosignatures: :toctree: doc/ TrialScheduler .. 
autosummary:: :nosignatures: :toctree: doc/ TrialScheduler.choose_trial_to_run TrialScheduler.on_trial_result TrialScheduler.on_trial_complete Shim Instantiation (tune.create_scheduler) ------------------------------------------ There is also a shim function that constructs the scheduler based on the provided string. This can be useful if the scheduler you want to use changes often (e.g., specifying the scheduler via a CLI option or config file). .. autosummary:: :nosignatures: :toctree: doc/ create_scheduler --- .. _tune-search-space: Tune Search Space API ===================== This section covers the functions you can use to define your search spaces. .. caution:: Not all Search Algorithms support all distributions. In particular, ``tune.sample_from`` and ``tune.grid_search`` are often unsupported. The default :ref:`tune-basicvariant` supports all distributions. .. tip:: Avoid passing large objects as values in the search space, as that will incur a performance overhead. Use :func:`tune.with_parameters ` to pass large objects in or load them inside your trainable from disk (making sure that all nodes have access to the files) or cloud storage. See :ref:`tune-bottlenecks` for more information. For a high-level overview, see this example: .. TODO: test this .. code-block :: python config = { # Sample a float uniformly between -5.0 and -1.0 "uniform": tune.uniform(-5, -1), # Sample a float uniformly between 3.2 and 5.4, # rounding to multiples of 0.2 "quniform": tune.quniform(3.2, 5.4, 0.2), # Sample a float uniformly between 0.0001 and 0.01, while # sampling in log space "loguniform": tune.loguniform(1e-4, 1e-2), # Sample a float uniformly between 0.0001 and 0.1, while # sampling in log space and rounding to multiples of 0.00005 "qloguniform": tune.qloguniform(1e-4, 1e-1, 5e-5), # Sample a random float from a normal distribution with # mean=10 and sd=2 "randn": tune.randn(10, 2), # Sample a random float from a normal distribution with # mean=10 and sd=2, rounding to multiples of 0.2 "qrandn": tune.qrandn(10, 2, 0.2), # Sample a integer uniformly between -9 (inclusive) and 15 (exclusive) "randint": tune.randint(-9, 15), # Sample a random uniformly between -21 (inclusive) and 12 (inclusive (!)) # rounding to multiples of 3 (includes 12) # if q is 1, then randint is called instead with the upper bound exclusive "qrandint": tune.qrandint(-21, 12, 3), # Sample a integer uniformly between 1 (inclusive) and 10 (exclusive), # while sampling in log space "lograndint": tune.lograndint(1, 10), # Sample a integer uniformly between 1 (inclusive) and 10 (inclusive (!)), # while sampling in log space and rounding to multiples of 2 # if q is 1, then lograndint is called instead with the upper bound exclusive "qlograndint": tune.qlograndint(1, 10, 2), # Sample an option uniformly from the specified choices "choice": tune.choice(["a", "b", "c"]), # Sample from a random function, in this case one that # depends on another value from the search space "func": tune.sample_from(lambda spec: spec.config.uniform * 0.01), # Do a grid search over these values. Every value will be sampled # ``num_samples`` times (``num_samples`` is the parameter you pass to ``tune.TuneConfig``, # which is taken in by ``Tuner``) "grid": tune.grid_search([32, 64, 128]) } .. currentmodule:: ray Random Distributions API ------------------------ .. 
autosummary:: :nosignatures: :toctree: doc/ tune.uniform tune.quniform tune.loguniform tune.qloguniform tune.randn tune.qrandn tune.randint tune.qrandint tune.lograndint tune.qlograndint tune.choice Grid Search and Custom Function APIs ------------------------------------ .. autosummary:: :nosignatures: :toctree: doc/ tune.grid_search tune.sample_from References ---------- See also :ref:`tune-basicvariant`. --- .. _tune-stoppers: Tune Stopping Mechanisms (tune.stopper) ======================================= In addition to Trial Schedulers like :ref:`ASHA `, where a number of trials are stopped if they perform subpar, Ray Tune also supports custom stopping mechanisms to stop trials early. They can also stop the entire experiment after a condition is met. For instance, stopping mechanisms can specify to stop trials when they reached a plateau and the metric doesn't change anymore. Ray Tune comes with several stopping mechanisms out of the box. For custom stopping behavior, you can inherit from the :class:`Stopper ` class. Other stopping behaviors are described :ref:`in the user guide `. .. _tune-stop-ref: Stopper Interface (tune.Stopper) -------------------------------- .. currentmodule:: ray.tune.stopper .. autosummary:: :nosignatures: :toctree: doc/ Stopper .. autosummary:: :nosignatures: :toctree: doc/ Stopper.__call__ Stopper.stop_all Tune Built-in Stoppers ---------------------- .. autosummary:: :nosignatures: :toctree: doc/ MaximumIterationStopper ExperimentPlateauStopper TrialPlateauStopper TimeoutStopper CombinedStopper ~function_stopper.FunctionStopper ~noop.NoopStopper --- .. _tune-search-alg: Tune Search Algorithms (tune.search) ==================================== Tune's Search Algorithms are wrappers around open-source optimization libraries for efficient hyperparameter selection. Each library has a specific way of defining the search space - please refer to their documentation for more details. Tune will automatically convert search spaces passed to ``Tuner`` to the library format in most cases. You can utilize these search algorithms as follows: .. code-block:: python from ray import tune from ray.tune.search.optuna import OptunaSearch def train_fn(config): # This objective function is just for demonstration purposes tune.report({"loss": config["param"]}) tuner = tune.Tuner( train_fn, tune_config=tune.TuneConfig( search_alg=OptunaSearch(), num_samples=100, metric="loss", mode="min", ), param_space={"param": tune.uniform(0, 1)}, ) results = tuner.fit() Saving and Restoring Tune Search Algorithms ------------------------------------------- .. TODO: what to do about this section? It doesn't really belong here and is not worth its own guide. .. TODO: at least check that this pseudo-code runs. Certain search algorithms have ``save/restore`` implemented, allowing reuse of searchers that are fitted on the results of multiple tuning runs. .. code-block:: python search_alg = HyperOptSearch() tuner_1 = tune.Tuner( train_fn, tune_config=tune.TuneConfig(search_alg=search_alg) ) results_1 = tuner_1.fit() search_alg.save("./my-checkpoint.pkl") # Restore the saved state onto another search algorithm, # in a new tuning script search_alg2 = HyperOptSearch() search_alg2.restore("./my-checkpoint.pkl") tuner_2 = tune.Tuner( train_fn, tune_config=tune.TuneConfig(search_alg=search_alg2) ) results_2 = tuner_2.fit() Tune automatically saves searcher state inside the current experiment folder during tuning. See ``Result logdir: ...`` in the output logs for this location. 
Note that if you have two Tune runs with the same experiment folder, the previous state checkpoint will be overwritten. You can avoid this by making sure ``RunConfig(name=...)`` is set to a unique identifier: .. code-block:: python search_alg = HyperOptSearch() tuner_1 = tune.Tuner( train_fn, tune_config=tune.TuneConfig( num_samples=5, search_alg=search_alg, ), run_config=tune.RunConfig( name="my-experiment-1", storage_path="~/my_results", ) ) results = tuner_1.fit() search_alg2 = HyperOptSearch() search_alg2.restore_from_dir( os.path.join("~/my_results", "my-experiment-1") ) .. _tune-basicvariant: Random search and grid search (tune.search.basic_variant.BasicVariantGenerator) ------------------------------------------------------------------------------- The default and most basic way to do hyperparameter search is via random and grid search. Ray Tune does this through the :class:`BasicVariantGenerator ` class that generates trial variants given a search space definition. The :class:`BasicVariantGenerator ` is used per default if no search algorithm is passed to :func:`Tuner `. .. currentmodule:: ray.tune.search .. autosummary:: :nosignatures: :toctree: doc/ basic_variant.BasicVariantGenerator .. _tune-ax: Ax (tune.search.ax.AxSearch) ---------------------------- .. autosummary:: :nosignatures: :toctree: doc/ ax.AxSearch .. _bayesopt: Bayesian Optimization (tune.search.bayesopt.BayesOptSearch) ----------------------------------------------------------- .. autosummary:: :nosignatures: :toctree: doc/ bayesopt.BayesOptSearch .. _suggest-TuneBOHB: BOHB (tune.search.bohb.TuneBOHB) -------------------------------- BOHB (Bayesian Optimization HyperBand) is an algorithm that both terminates bad trials and also uses Bayesian Optimization to improve the hyperparameter search. It is available from the `HpBandSter library `_. Importantly, BOHB is intended to be paired with a specific scheduler class: :ref:`HyperBandForBOHB `. In order to use this search algorithm, you will need to install ``HpBandSter`` and ``ConfigSpace``: .. code-block:: bash $ pip install hpbandster ConfigSpace See the `BOHB paper `_ for more details. .. autosummary:: :nosignatures: :toctree: doc/ bohb.TuneBOHB .. _tune-hebo: HEBO (tune.search.hebo.HEBOSearch) ---------------------------------- .. autosummary:: :nosignatures: :toctree: doc/ hebo.HEBOSearch .. _tune-hyperopt: HyperOpt (tune.search.hyperopt.HyperOptSearch) ---------------------------------------------- .. autosummary:: :nosignatures: :toctree: doc/ hyperopt.HyperOptSearch .. _nevergrad: Nevergrad (tune.search.nevergrad.NevergradSearch) ------------------------------------------------- .. autosummary:: :nosignatures: :toctree: doc/ nevergrad.NevergradSearch .. _tune-optuna: Optuna (tune.search.optuna.OptunaSearch) ---------------------------------------- .. autosummary:: :nosignatures: :toctree: doc/ optuna.OptunaSearch .. _zoopt: ZOOpt (tune.search.zoopt.ZOOptSearch) ------------------------------------- .. autosummary:: :nosignatures: :toctree: doc/ zoopt.ZOOptSearch .. _repeater: Repeated Evaluations (tune.search.Repeater) ------------------------------------------- Use ``ray.tune.search.Repeater`` to average over multiple evaluations of the same hyperparameter configurations. This is useful in cases where the evaluated training procedure has high variance (i.e., in reinforcement learning). By default, ``Repeater`` will take in a ``repeat`` parameter and a ``search_alg``. 
The ``search_alg`` will suggest new configurations to try, and the ``Repeater`` will run ``repeat`` trials of the configuration. It will then average the ``search_alg.metric`` from the final results of each repeated trial. .. warning:: It is recommended to not use ``Repeater`` with a TrialScheduler. Early termination can negatively affect the average reported metric. .. autosummary:: :nosignatures: :toctree: doc/ Repeater .. _limiter: ConcurrencyLimiter (tune.search.ConcurrencyLimiter) --------------------------------------------------- Use ``ray.tune.search.ConcurrencyLimiter`` to limit the amount of concurrency when using a search algorithm. This is useful when a given optimization algorithm does not parallelize very well (like a naive Bayesian Optimization). .. autosummary:: :nosignatures: :toctree: doc/ ConcurrencyLimiter .. _byo-algo: Custom Search Algorithms (tune.search.Searcher) ----------------------------------------------- If you are interested in implementing or contributing a new Search Algorithm, provide the following interface: .. autosummary:: :nosignatures: :toctree: doc/ Searcher .. autosummary:: :nosignatures: :toctree: doc/ Searcher.suggest Searcher.save Searcher.restore Searcher.on_trial_result Searcher.on_trial_complete If contributing, make sure to add test cases and an entry in the function described below. .. _shim: Shim Instantiation (tune.create_searcher) ----------------------------------------- There is also a shim function that constructs the search algorithm based on the provided string. This can be useful if the search algorithm you want to use changes often (e.g., specifying the search algorithm via a CLI option or config file). .. autosummary:: :nosignatures: :toctree: doc/ create_searcher --- Syncing in Tune =============== .. seealso:: See :doc:`this user guide ` for more details and examples. .. _tune-sync-config: Tune Syncing Configuration -------------------------- .. autosummary:: :nosignatures: :toctree: doc/ ~ray.tune.SyncConfig --- .. _trainable-docs: .. TODO: these "basic" sections before the actual API docs start don't really belong here. Then again, the function API does not really have a signature to just describe. .. TODO: Reusing actors and advanced resources allocation seem ill-placed. Training in Tune (tune.Trainable, tune.report) ================================================= Training can be done with either a **Function API** (:func:`tune.report() `) or **Class API** (:ref:`tune.Trainable `). For the sake of example, let's maximize this objective function: .. literalinclude:: /tune/doc_code/trainable.py :language: python :start-after: __example_objective_start__ :end-before: __example_objective_end__ .. _tune-function-api: Function Trainable API ---------------------- Use the Function API to define a custom training function that Tune runs in Ray actor processes. Each trial is placed into a Ray actor process and runs in parallel. The ``config`` argument in the function is a dictionary populated automatically by Ray Tune and corresponding to the hyperparameters selected for the trial from the :ref:`search space `. With the Function API, you can report intermediate metrics by simply calling :func:`tune.report() ` within the function. .. literalinclude:: /tune/doc_code/trainable.py :language: python :start-after: __function_api_report_intermediate_metrics_start__ :end-before: __function_api_report_intermediate_metrics_end__ .. tip:: Do not use :func:`tune.report() ` within a ``Trainable`` class. 
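As a rough sketch (with a toy objective and hyperparameter names invented purely for illustration), a function trainable that reports a metric on every step can look like this:

.. code-block:: python

    from ray import tune


    def trainable(config):
        # "width" and "height" are hypothetical hyperparameters for this sketch.
        width, height = config["width"], config["height"]
        for step in range(100):
            # Toy objective standing in for one step of real training.
            score = (0.1 + width * step / 100) ** (-1) + height * 0.1
            # Report the intermediate metric back to Tune on every step.
            tune.report({"score": score})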
In the previous example, we reported on every step, but this metric reporting frequency is configurable. For example, we could also report only a single time at the end with the final score: .. literalinclude:: /tune/doc_code/trainable.py :language: python :start-after: __function_api_report_final_metrics_start__ :end-before: __function_api_report_final_metrics_end__ It's also possible to return a final set of metrics to Tune by returning them from your function: .. literalinclude:: /tune/doc_code/trainable.py :language: python :start-after: __function_api_return_final_metrics_start__ :end-before: __function_api_return_final_metrics_end__ Note that Ray Tune outputs extra values in addition to the user-reported metrics, such as ``iterations_since_restore``. See :ref:`tune-autofilled-metrics` for an explanation of these values. See how to configure checkpointing for a function trainable :ref:`here `. .. _tune-class-api: Class Trainable API -------------------------- .. caution:: Do not use :func:`tune.report() ` within a ``Trainable`` class. The Trainable **class API** requires you to subclass ``ray.tune.Trainable``. Here's a naive example of this API: .. literalinclude:: /tune/doc_code/trainable.py :language: python :start-after: __class_api_example_start__ :end-before: __class_api_example_end__ When you subclass ``tune.Trainable``, Tune creates a ``Trainable`` object on a separate process (using the :ref:`Ray Actor API `). 1. The ``setup`` method is invoked once when training starts. 2. ``step`` is invoked **multiple times**. Each time, the Trainable object executes one logical iteration of training in the tuning process, which may include one or more iterations of actual training. 3. ``cleanup`` is invoked when training is finished. The ``config`` argument in the ``setup`` method is a dictionary populated automatically by Tune and corresponding to the hyperparameters selected for the trial from the :ref:`search space `. .. tip:: As a rule of thumb, the execution time of ``step`` should be large enough to avoid overheads (i.e. more than a few seconds), but short enough to report progress periodically (i.e. at most a few minutes). You'll notice that Ray Tune will output extra values in addition to the user-reported metrics, such as ``iterations_since_restore``. See :ref:`tune-autofilled-metrics` for an explanation/glossary of these values. See how to configure checkpointing for a class trainable :ref:`here `. Advanced: Reusing Actors in Tune ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. note:: This feature is only for the Trainable Class API. Your Trainable can often take a long time to start. To avoid this, you can set ``tune.TuneConfig(reuse_actors=True)`` (which is taken in by ``Tuner``) to reuse the same Trainable Python process and object for multiple hyperparameter configurations. This requires you to implement ``Trainable.reset_config``, which provides a new set of hyperparameters. It is up to you to correctly update the hyperparameters of your trainable. ..
code-block:: python from time import sleep import ray from ray import tune from ray.tune.tuner import Tuner def expensive_setup(): print("EXPENSIVE SETUP") sleep(1) class QuadraticTrainable(tune.Trainable): def setup(self, config): self.config = config expensive_setup() # use reuse_actors=True to only run this once self.max_steps = 5 self.step_count = 0 def step(self): # Extract hyperparameters from the config h1 = self.config["hparam1"] h2 = self.config["hparam2"] # Compute a simple quadratic objective where the optimum is at hparam1=3 and hparam2=5 loss = (h1 - 3) ** 2 + (h2 - 5) ** 2 metrics = {"loss": loss} self.step_count += 1 if self.step_count > self.max_steps: metrics["done"] = True # Return the computed loss as the metric return metrics def reset_config(self, new_config): # Update the configuration for a new trial while reusing the actor self.config = new_config return True ray.init() tuner_with_reuse = Tuner( QuadraticTrainable, param_space={ "hparam1": tune.uniform(-10, 10), "hparam2": tune.uniform(-10, 10), }, tune_config=tune.TuneConfig( num_samples=10, max_concurrent_trials=1, reuse_actors=True, # Enable actor reuse and avoid expensive setup ), run_config=ray.tune.RunConfig( verbose=0, checkpoint_config=ray.tune.CheckpointConfig(checkpoint_at_end=False), ), ) tuner_with_reuse.fit() Comparing Tune's Function API and Class API ------------------------------------------- Here are a few key concepts and what they look like for the Function and Class API's. ======================= =============================================== ============================================== Concept Function API Class API ======================= =============================================== ============================================== Training Iteration Increments on each `tune.report` call Increments on each `Trainable.step` call Report metrics `tune.report(metrics)` Return metrics from `Trainable.step` Saving a checkpoint `tune.report(..., checkpoint=checkpoint)` `Trainable.save_checkpoint` Loading a checkpoint `tune.get_checkpoint()` `Trainable.load_checkpoint` Accessing config Passed as an argument `def train_func(config):` Passed through `Trainable.setup` ======================= =============================================== ============================================== Advanced Resource Allocation ---------------------------- Trainables can themselves be distributed. If your trainable function / class creates further Ray actors or tasks that also consume CPU / GPU resources, you will want to add more bundles to the :class:`PlacementGroupFactory` to reserve extra resource slots. For example, if a trainable class requires 1 GPU itself, but also launches 4 actors, each using another GPU, then you should use :func:`tune.with_resources ` like this: .. code-block:: python :emphasize-lines: 4-10 tuner = tune.Tuner( tune.with_resources(my_trainable, tune.PlacementGroupFactory([ {"CPU": 1, "GPU": 1}, {"GPU": 1}, {"GPU": 1}, {"GPU": 1}, {"GPU": 1} ])), run_config=RunConfig(name="my_trainable") ) The ``Trainable`` also provides the ``default_resource_requests`` interface to automatically declare the resources per trial based on the given configuration. It is also possible to specify memory (``"memory"``, in bytes) and custom resource requirements. .. currentmodule:: ray Function API ------------ For reporting results and checkpoints with the function API, see the :ref:`Ray Train utilities ` documentation. **Classes** .. 
autosummary:: :nosignatures: :toctree: doc/ ~tune.Checkpoint ~tune.TuneContext **Functions** .. autosummary:: :nosignatures: :toctree: doc/ ~tune.get_checkpoint ~tune.get_context ~tune.report .. _tune-trainable-docstring: Trainable (Class API) --------------------- Constructor ~~~~~~~~~~~ .. autosummary:: :nosignatures: :toctree: doc/ ~tune.Trainable Trainable Methods to Implement ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. autosummary:: :nosignatures: :toctree: doc/ ~tune.Trainable.setup ~tune.Trainable.save_checkpoint ~tune.Trainable.load_checkpoint ~tune.Trainable.step ~tune.Trainable.reset_config ~tune.Trainable.cleanup ~tune.Trainable.default_resource_request .. _tune-util-ref: Tune Trainable Utilities ------------------------- Tune Data Ingestion Utilities ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. autosummary:: :nosignatures: :toctree: doc/ tune.with_parameters Tune Resource Assignment Utilities ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. autosummary:: :nosignatures: :toctree: doc/ tune.with_resources ~tune.execution.placement_groups.PlacementGroupFactory tune.utils.wait_for_gpu Tune Trainable Debugging Utilities ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. autosummary:: :nosignatures: :toctree: doc/ tune.utils.diagnose_serialization tune.utils.validate_save_restore tune.utils.util.validate_warmstart --- :orphan: Asynchronous HyperBand Example ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ This example demonstrates how to use Ray Tune's Asynchronous Successive Halving Algorithm (ASHA) scheduler to efficiently optimize hyperparameters for a machine learning model. ASHA is particularly useful for large-scale hyperparameter optimization as it can adaptively allocate resources and end poorly performing trials early. Requirements: `pip install "ray[tune]"` .. literalinclude:: /../../python/ray/tune/examples/async_hyperband_example.py See Also -------- - `ASHA Paper `_ --- :orphan: AX Example ~~~~~~~~~~ .. literalinclude:: /../../python/ray/tune/examples/ax_example.py --- :orphan: BayesOpt Example ~~~~~~~~~~~~~~~~ .. literalinclude:: /../../python/ray/tune/examples/bayesopt_example.py --- :orphan: BOHB Example ~~~~~~~~~~~~ .. literalinclude:: /../../python/ray/tune/examples/bohb_example.py --- :orphan: Custom Checkpointing Example ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. literalinclude:: /../../python/ray/tune/examples/custom_func_checkpointing.py --- :orphan: HyperBand Example ================= .. literalinclude:: /../../python/ray/tune/examples/hyperband_example.py --- :orphan: HyperBand Function Example ~~~~~~~~~~~~~~~~~~~~~~~~~~ .. literalinclude:: /../../python/ray/tune/examples/hyperband_function_example.py --- :orphan: Hyperopt Conditional Search Space Example ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. literalinclude:: /../../python/ray/tune/examples/hyperopt_conditional_search_space_example.py --- :orphan: Logging Example ~~~~~~~~~~~~~~~ .. literalinclude:: /../../python/ray/tune/examples/logging_example.py --- :orphan: MLflow PyTorch Lightning Example ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. literalinclude:: /../../python/ray/tune/examples/mlflow_ptl.py --- :orphan: MNIST PyTorch Lightning Example ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. literalinclude:: /../../python/ray/tune/examples/mnist_ptl_mini.py --- :orphan: MNIST PyTorch Example ~~~~~~~~~~~~~~~~~~~~~ .. literalinclude:: /../../python/ray/tune/examples/mnist_pytorch.py If you consider switching to PyTorch Lightning to get rid of some of your boilerplate training code, please know that we also have a walkthrough on :doc:`how to use Tune with PyTorch Lightning models `. 
--- :orphan: MNIST PyTorch Trainable Example ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. literalinclude:: /../../python/ray/tune/examples/mnist_pytorch_trainable.py --- :orphan: Nevergrad Example ~~~~~~~~~~~~~~~~~ .. literalinclude:: /../../python/ray/tune/examples/nevergrad_example.py --- :orphan: PB2 Example ~~~~~~~~~~~ .. literalinclude:: /../../python/ray/tune/examples/pb2_example.py --- :orphan: PB2 PPO Example ~~~~~~~~~~~~~~~ .. literalinclude:: /../../python/ray/tune/examples/pb2_ppo_example.py --- :orphan: PBT ConvNet Example ~~~~~~~~~~~~~~~~~~~ .. literalinclude:: /../../python/ray/tune/examples/pbt_convnet_function_example.py --- :orphan: PBT Example ~~~~~~~~~~~ .. literalinclude:: /../../python/ray/tune/examples/pbt_example.py --- :orphan: PBT Function Example ~~~~~~~~~~~~~~~~~~~~ The following script produces the following results. For a population of 8 trials, the PBT learning rate schedule roughly matches the optimal learning rate schedule. .. image:: images/pbt_function_results.png .. literalinclude:: /../../python/ray/tune/examples/pbt_function.py --- :orphan: Memory NN Example ~~~~~~~~~~~~~~~~~ .. literalinclude:: /../../python/ray/tune/examples/pbt_memnn_example.py --- :orphan: Keras Cifar10 Example ~~~~~~~~~~~~~~~~~~~~~ .. literalinclude:: /../../python/ray/tune/examples/pbt_tune_cifar10_with_keras.py --- :orphan: TensorFlow MNIST Example ~~~~~~~~~~~~~~~~~~~~~~~~ .. literalinclude:: /../../python/ray/tune/examples/tf_mnist_example.py --- :orphan: tune_basic_example ~~~~~~~~~~~~~~~~~~ .. literalinclude:: /../../python/ray/tune/examples/tune_basic_example.py --- :orphan: XGBoost Dynamic Resources Example ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. literalinclude:: /../../python/ray/tune/examples/xgboost_dynamic_resources_example.py --- .. _tune-examples-ref: .. _tune-recipes: ================= Ray Tune Examples ================= .. tip:: See :ref:`tune-main` to learn more about Tune features. Below are examples for using Ray Tune for a variety of use cases and sorted by categories: * `ML frameworks`_ * `Experiment tracking tools`_ * `Hyperparameter optimization frameworks`_ * `Others`_ * `Exercises`_ .. _ml-frameworks: ML frameworks ------------- .. toctree:: :hidden: PyTorch Example PyTorch Lightning Example XGBoost Example LightGBM Example Hugging Face Transformers Example Ray RLlib Example Keras Example Ray Tune integrates with many popular machine learning frameworks. Here you find a few practical examples showing you how to tune your models. At the end of these guides you will often find links to even more examples. .. list-table:: * - :doc:`How to use Tune with Keras and TensorFlow models ` * - :doc:`How to use Tune with PyTorch models ` * - :doc:`How to tune PyTorch Lightning models ` * - :doc:`Tuning RL experiments with Ray Tune and Ray Serve ` * - :doc:`Tuning XGBoost parameters with Tune ` * - :doc:`Tuning LightGBM parameters with Tune ` * - :doc:`Tuning Hugging Face Transformers with Tune ` .. _experiment-tracking-tools: Experiment tracking tools ------------------------- .. toctree:: :hidden: Weights & Biases Example MLflow Example Aim Example Comet Example Ray Tune integrates with some popular Experiment tracking and management tools, such as CometML, or Weights & Biases. For how to use Ray Tune with Tensorboard, see :ref:`Guide to logging and outputs `. .. 
list-table:: * - :doc:`Using Aim with Ray Tune for experiment management ` * - :doc:`Using Comet with Ray Tune for experiment management ` * - :doc:`Tracking your experiment process with Weights & Biases ` * - :doc:`Using MLflow tracking and auto logging with Tune ` .. _hyperparameter-optimization-frameworks: Hyperparameter optimization frameworks -------------------------------------- .. toctree:: :hidden: Ax Example HyperOpt Example Bayesopt Example BOHB Example Nevergrad Example Optuna Example Tune integrates with a wide variety of hyperparameter optimization frameworks and their respective search algorithms. See the following detailed examples for each integration: .. list-table:: * - :doc:`ax_example` * - :doc:`hyperopt_example` * - :doc:`bayesopt_example` * - :doc:`bohb_example` * - :doc:`nevergrad_example` * - :doc:`optuna_example` .. _tune-examples-others: Others ------ .. list-table:: * - :doc:`Simple example for doing a basic random and grid search ` * - :doc:`Example of using a simple tuning function with AsyncHyperBandScheduler ` * - :doc:`Example of using a trainable function with HyperBandScheduler and the AsyncHyperBandScheduler ` * - :doc:`Configuring and running (synchronous) PBT and understanding the underlying algorithm behavior with a simple example ` * - :doc:`includes/pbt_function` * - :doc:`includes/pb2_example` * - :doc:`includes/logging_example` .. _tune-examples-exercises: Exercises --------- Learn how to use Tune in your browser with the following Colab-based exercises. .. list-table:: :widths: 50 30 20 :header-rows: 1 * - Description - Library - Colab link * - Basics of using Tune - PyTorch - .. image:: https://colab.research.google.com/assets/colab-badge.svg :target: https://colab.research.google.com/github/ray-project/tutorial/blob/master/tune_exercises/exercise_1_basics.ipynb :alt: Open in Colab * - Using search algorithms and trial schedulers to optimize your model - PyTorch - .. image:: https://colab.research.google.com/assets/colab-badge.svg :target: https://colab.research.google.com/github/ray-project/tutorial/blob/master/tune_exercises/exercise_2_optimize.ipynb :alt: Open in Colab * - Using Population-Based Training (PBT) - PyTorch - .. image:: https://colab.research.google.com/assets/colab-badge.svg :target: https://colab.research.google.com/github/ray-project/tutorial/blob/master/tune_exercises/exercise_3_pbt.ipynb :alt: Open in Colab * - Fine-tuning Hugging Face Transformers with PBT - Hugging Face Transformers and PyTorch - .. image:: https://colab.research.google.com/assets/colab-badge.svg :target: https://colab.research.google.com/drive/1tQgAKgcKQzheoh503OzhS4N9NtfFgmjF?usp=sharing :alt: Open in Colab * - Logging Tune runs to Comet ML - Comet - .. image:: https://colab.research.google.com/assets/colab-badge.svg :target: https://colab.research.google.com/drive/1dp3VwVoAH1acn_kG7RuT62mICnOqxU1z?usp=sharing :alt: Open in Colab Tutorial source files are on `GitHub `_. --- :orphan: PBT Visualization Helper File ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Used in :doc:`/tune/examples/pbt_visualization/pbt_visualization`. .. literalinclude:: ./pbt_visualization_utils.py --- .. _tune-faq: Ray Tune FAQ ------------ Here we try to answer questions that come up often. If you still have questions after reading this FAQ, let us know! .. contents:: :local: :depth: 1 What are Hyperparameters? ~~~~~~~~~~~~~~~~~~~~~~~~~ What are *hyperparameters?* And how are they different from *model parameters*?
In supervised learning, we train a model with labeled data so the model can properly identify new data values. Everything about the model is defined by a set of parameters, such as the weights in a linear regression. These are *model parameters*; they are learned during training. .. image:: /images/hyper-model-parameters.png In contrast, the *hyperparameters* define structural details about the kind of model itself, such as whether we are using linear regression or classification, what architecture is best for a neural network, how many layers, what kind of filters, etc. They are defined before training, not learned. .. image:: /images/hyper-network-params.png Other quantities considered *hyperparameters* include learning rates, discount rates, etc. If we want our training process and resulting model to work well, we first need to determine the optimal or near-optimal set of *hyperparameters*. How do we determine the optimal *hyperparameters*? The most direct approach is to perform a loop where we pick a candidate set of values from some reasonably inclusive list of possible values, train a model, compare the results achieved with previous loop iterations, and pick the set that performed best. This process is called *Hyperparameter Tuning* or *Optimization* (HPO). In Tune, the *hyperparameters* are specified over a confined search space, defined for each *hyperparameter* in a ``config`` dictionary. .. TODO: We *really* need to improve this section. Which search algorithm/scheduler should I choose? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Ray Tune offers :ref:`many different search algorithms ` and :ref:`schedulers `. Deciding on which to use mostly depends on your problem: * Is it a small or large problem (how long does it take to train? How costly are the resources, like GPUs)? Can you run many trials in parallel? * How many hyperparameters would you like to tune? * What values are valid for hyperparameters? **If your model returns incremental results** (e.g. results per epoch in deep learning, results per each added tree in GBDTs, etc.), using early stopping usually allows for sampling more configurations, as unpromising trials are pruned before they run their full course. Please note that not all search algorithms can use information from pruned trials. Early stopping cannot be used without incremental results - in the case of the functional API, that means that ``tune.report()`` has to be called more than once, usually in a loop. **If your model is small**, you can usually try to run many different configurations. A **random search** can be used to generate configurations. You can also grid search over some values. You should probably still use :ref:`ASHA for early termination of bad trials ` (if your problem supports early stopping). **If your model is large**, you can try **Bayesian Optimization-based search algorithms** like :ref:`BayesOpt ` to get good parameter configurations after a few trials. :ref:`Ax ` is similar but more robust to noisy data. Please note that these algorithms only work well with **a small number of hyperparameters**. Alternatively, you can use :ref:`Population Based Training `, which works well with few trials, e.g. 8 or even 4. However, this will output a hyperparameter *schedule* rather than one fixed set of hyperparameters. **If you have a small number of hyperparameters**, Bayesian Optimization methods work well.
Take a look at :ref:`BOHB ` or :ref:`Optuna ` with the :ref:`ASHA ` scheduler to combine the benefits of Bayesian Optimization with early stopping. **If you only have continuous values for hyperparameters**, most Bayesian Optimization methods will work well. Discrete or categorical variables still work, but less well as the number of categories increases. **If you have many categorical values for hyperparameters**, consider using random search, or a TPE-based Bayesian Optimization algorithm such as :ref:`Optuna ` or :ref:`HyperOpt `. **Our go-to solution** is usually to use **random search** with :ref:`ASHA for early stopping ` for smaller problems. Use :ref:`BOHB ` for **larger problems** with a **small number of hyperparameters** and :ref:`Population Based Training ` for **larger problems** with a **large number of hyperparameters** if a learning schedule is acceptable. How do I choose hyperparameter ranges? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ A good start is to look at the papers that introduced the algorithms, and also to see what other people are using. Most algorithms also have sensible defaults for some of their parameters. For instance, `XGBoost's parameter overview `_ lists ``max_depth=6`` as the default for the maximum decision tree depth. Here, anything between 2 and 10 might make sense (though that naturally depends on your problem). For **learning rates**, we suggest using a **loguniform distribution** between **1e-5** and **1e-1**: ``tune.loguniform(1e-5, 1e-1)``. For **batch sizes**, we suggest trying **powers of 2**, for instance, 2, 4, 8, 16, 32, 64, 128, 256, etc. The magnitude depends on your problem. For easy problems with lots of data, use higher batch sizes; for harder problems with not much data, use lower batch sizes. For **layer sizes** we also suggest trying **powers of 2**. For small problems (e.g. Cartpole), use smaller layer sizes. For larger problems, try larger ones. For **discount factors** in reinforcement learning we suggest sampling uniformly between 0.9 and 1.0. Depending on the problem, a much stricter range above 0.97 or even above 0.99 can make sense (e.g. for Atari). How can I use nested/conditional search spaces? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Sometimes you might need to define parameters whose values depend on the values of other parameters. Ray Tune offers some methods to define these. Nested spaces ''''''''''''' You can nest hyperparameter definitions in sub-dictionaries: .. literalinclude:: doc_code/faq.py :language: python :start-after: __basic_config_start__ :end-before: __basic_config_end__ The trial config will be nested exactly like the input config. Conditional spaces '''''''''''''''''' :ref:`Custom and conditional search spaces are explained in detail here `. In short, you can pass custom functions to ``tune.sample_from()`` that can return values that depend on other values: .. literalinclude:: doc_code/faq.py :language: python :start-after: __conditional_spaces_start__ :end-before: __conditional_spaces_end__ Conditional grid search ''''''''''''''''''''''' If you would like to grid search over two parameters that depend on each other, this might not work out of the box. For instance, say that *a* should be a value between 5 and 10 and *b* should be a value between 0 and *a*. In this case, we cannot use ``tune.sample_from`` because it doesn't support grid searching. The solution here is to create a list of valid *tuples* with the help of a helper function, like this: ..
literalinclude:: doc_code/faq.py :language: python :start-after: __iter_start__ :end-before: __iter_end__ Your trainable can then do something like ``a, b = config["ab"]`` to split the *a* and *b* variables and use them afterwards. How does early termination (e.g. Hyperband/ASHA) work? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Early termination algorithms look at the intermediate values that trials report, e.g. what is reported to them via ``tune.report()`` after each training epoch. After a certain number of steps, they then remove the worst-performing trials and keep only the best-performing trials. The goodness of a trial is determined by ordering trials by the objective metric, for instance accuracy or loss. In ASHA, you can decide how many trials are terminated early. ``reduction_factor=4`` means that only 25% of all trials are kept each time they are reduced. With ``grace_period=n`` you can force ASHA to train each trial for at least ``n`` epochs. Why are all my trials returning "1" iteration? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ **This most likely applies to the Tune function API.** Ray Tune counts iterations internally every time ``tune.report()`` is called. If you only call ``tune.report()`` once at the end of the training, the counter has only been incremented once. If you're using the class API, the counter is increased after calling ``step()``. Note that it often makes sense to report metrics more than once. For instance, if you train your algorithm for 1000 timesteps, consider reporting intermediate performance values every 100 steps. That way, schedulers like Hyperband/ASHA can terminate poorly performing trials early. What are all these extra outputs? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ You'll notice that Ray Tune not only reports hyperparameters (from the ``config``) or metrics (passed to ``tune.report()``), but also some other outputs. .. code-block:: bash Result for easy_objective_c64c9112: date: 2020-10-07_13-29-18 done: false experiment_id: 6edc31257b564bf8985afeec1df618ee experiment_tag: 7_activation=tanh,height=-53.116,steps=100,width=13.885 hostname: ubuntu iterations: 0 iterations_since_restore: 1 mean_loss: 4.688385317424468 neg_mean_loss: -4.688385317424468 node_ip: 192.168.1.115 pid: 5973 time_since_restore: 7.605552673339844e-05 time_this_iter_s: 7.605552673339844e-05 time_total_s: 7.605552673339844e-05 timestamp: 1602102558 timesteps_since_restore: 0 training_iteration: 1 trial_id: c64c9112 See the :ref:`tune-autofilled-metrics` section for a glossary. How do I set resources? ~~~~~~~~~~~~~~~~~~~~~~~ If you want to allocate specific resources to a trial, you can use ``tune.with_resources`` and wrap your trainable with it, together with a dict or a :class:`PlacementGroupFactory ` object: .. literalinclude:: doc_code/faq.py :dedent: :language: python :start-after: __resources_start__ :end-before: __resources_end__ The example above showcases three things: 1. The `cpu` and `gpu` options set how many CPUs and GPUs are available for each trial, respectively. **Trials cannot request more resources** than these (exception: see 3). 2. It is possible to request **fractional GPUs**. A value of 0.5 means that half of the memory of the GPU is made available to the trial. You will have to make sure yourself that your model still fits on the fractional memory. 3. You can request custom resources you supplied to Ray when starting the cluster, as shown in the sketch after this list. Trials will only be scheduled on single nodes that can provide all resources you requested.
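For illustration, here is a minimal sketch of the dict form. The trainable ``my_trainable`` and the custom resource name ``"special_hardware"`` are placeholders for this example; a custom resource only works if you supplied it to Ray when starting the cluster.

.. code-block:: python

    from ray import tune


    def my_trainable(config):
        # Placeholder training function for this sketch.
        tune.report({"score": 1.0})


    # Each trial requests 2 CPUs, half a GPU, and one unit of a custom
    # resource. "special_hardware" is a placeholder name and must match a
    # custom resource registered when starting the cluster.
    trainable_with_resources = tune.with_resources(
        my_trainable, {"cpu": 2, "gpu": 0.5, "special_hardware": 1}
    )

    tuner = tune.Tuner(trainable_with_resources)
    results = tuner.fit()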
One important thing to keep in mind is that each Ray worker (and thus each Ray Tune Trial) will only be scheduled on **one machine**. That means if you for instance request 2 GPUs for your trial, but your cluster consists of 4 machines with 1 GPU each, the trial will never be scheduled. In other words, you will have to make sure that your Ray cluster has machines that can actually fulfill your resource requests. In some cases your trainable might want to start other remote actors, for instance if you're leveraging distributed training via Ray Train. In these cases, you can use :ref:`placement groups ` to request additional resources: .. literalinclude:: doc_code/faq.py :dedent: :language: python :start-after: __resources_pgf_start__ :end-before: __resources_pgf_end__ Here, you're requesting 2 additional CPUs for remote tasks. These two additional actors do not necessarily have to live on the same node as your main trainable. In fact, you can control this via the ``strategy`` parameter. In this example, ``PACK`` will try to schedule the actors on the same node, but allows them to be scheduled on other nodes as well. Please refer to the :ref:`placement groups documentation ` to learn more about these placement strategies. You can also allocate specific resources to a trial based on a custom rule via lambda functions. For instance, if you want to allocate GPU resources to trials based on a setting in your param space: .. literalinclude:: doc_code/faq.py :dedent: :language: python :start-after: __resources_lambda_start__ :end-before: __resources_lambda_end__ Why is my training stuck and Ray reporting that pending actor or tasks cannot be scheduled? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ This is usually caused by Ray actors or tasks being started by the trainable without the trainable resources accounting for them, leading to a deadlock. This can also be "stealthily" caused by using other libraries in the trainable that are based on Ray, such as Modin. In order to fix the issue, request additional resources for the trial using :ref:`placement groups `, as outlined in the section above. For example, if your trainable is using Modin dataframes, operations on those will spawn Ray tasks. By allocating an additional CPU bundle to the trial, those tasks will be able to run without being starved of resources. .. literalinclude:: doc_code/faq.py :dedent: :language: python :start-after: __modin_start__ :end-before: __modin_end__ How can I pass further parameter values to my trainable? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Ray Tune expects your trainable functions to accept only up to two parameters, ``config`` and ``checkpoint_dir``. But sometimes there are cases where you want to pass constant arguments, like the number of epochs to run, or a dataset to train on. Ray Tune offers a wrapper function to achieve just that, called :func:`tune.with_parameters() `: .. literalinclude:: doc_code/faq.py :language: python :start-after: __huge_data_start__ :end-before: __huge_data_end__ This function works similarly to ``functools.partial``, but it stores the parameters directly in the Ray object store. This means that you can pass even huge objects like datasets, and Ray makes sure that these are efficiently stored and retrieved on your cluster machines. :func:`tune.with_parameters() ` also works with class trainables. Please see :func:`tune.with_parameters() ` for more details and examples. How can I reproduce experiments? 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Reproducing experiments and experiment results means that you get the exact same results when running an experiment again and again. To achieve this, the conditions have to be exactly the same each time you run the experiment. In terms of ML training and tuning, this mostly concerns the random number generators that are used for sampling in various places of the training and tuning lifecycle. Random number generators are used to create randomness, for instance to sample a hyperparameter value for a parameter you defined. There is no true randomness in computing, rather there are sophisticated algorithms that generate numbers that *seem* to be random and fulfill all properties of a random distribution. These algorithms can be *seeded* with an initial state, after which the generated random numbers are always the same. .. literalinclude:: doc_code/faq.py :language: python :start-after: __seeded_1_start__ :end-before: __seeded_1_end__ The most commonly used random number generators from Python libraries are those in the native ``random`` submodule and the ``numpy.random`` module. .. literalinclude:: doc_code/faq.py :language: python :start-after: __seeded_2_start__ :end-before: __seeded_2_end__ In your tuning and training run, there are several places where randomness occurs, and at all these places we will have to introduce seeds to make sure we get the same behavior. * **Search algorithm**: Search algorithms have to be seeded to generate the same hyperparameter configurations in each run. Some search algorithms can be explicitly instantiated with a random seed (look for a ``seed`` parameter in the constructor). For others, try to use the above code block. * **Schedulers**: Schedulers like Population Based Training rely on resampling some of the parameters, requiring randomness. Use the code block above to set the initial seeds. * **Training function**: In addition to initializing the configurations, the training functions themselves have to use seeds. This could concern e.g. the data splitting. You should make sure to set the seed at the start of your training function. PyTorch and TensorFlow use their own RNGs, which have to be initialized, too: .. literalinclude:: doc_code/faq.py :language: python :start-after: __torch_tf_seeds_start__ :end-before: __torch_tf_seeds_end__ You should thus seed both Ray Tune's schedulers and search algorithms, and the training code. The schedulers and search algorithms should always be seeded with the same seed. This is also true for the training code, but often it is beneficial that the seeds differ *between different training runs*. Here's a blueprint on how to do all this in your training code: .. literalinclude:: doc_code/faq.py :language: python :start-after: __torch_seed_example_start__ :end-before: __torch_seed_example_end__ **Please note** that it is not always possible to control all sources of non-determinism. For instance, if you use schedulers like ASHA or PBT, some trials might finish earlier than other trials, affecting the behavior of the schedulers. Which trials finish first can however depend on the current system load, network communication, or other factors in the environment that we cannot control with random seeds. This is also true for search algorithms such as Bayesian Optimization, which take previous results into account when sampling new configurations. 
This can be tackled by using the **synchronous modes** of PBT and Hyperband, where the schedulers wait for all trials to finish an epoch before deciding which trials to promote. We strongly advise trying reproduction on smaller toy problems first before relying on it for larger experiments. .. _tune-bottlenecks: How can I avoid bottlenecks? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Sometimes you might run into a message like this: .. code-block:: The `experiment_checkpoint` operation took 2.43 seconds to complete, which may be a performance bottleneck Most commonly, the ``experiment_checkpoint`` operation is throwing this warning, but it might be something else, like ``process_trial_result``. These operations should usually take less than 500ms to complete. When they consistently take longer, this might indicate a problem or an inefficiency. To get rid of this message, it is important to understand where it comes from. These are the main reasons this problem comes up: **The Trial config is very large** This is the case if you e.g. try to pass a dataset or other large object via the ``config`` parameter. If this is the case, the dataset is serialized and written to disk repeatedly during experiment checkpointing, which takes a long time. **Solution**: Use :func:`tune.with_parameters ` to pass large objects to function trainables via the object store. For class trainables, you can do this manually via ``ray.put()`` and ``ray.get()``. If you need to pass a class definition, consider passing an indicator (e.g. a string) and let the trainable select the class instead. Generally, your config dictionary should only contain primitive types, like numbers or strings. **The Trial result is very large** This is the case if you return objects, data, or other large values via the return value of ``step()`` in your class trainable or via ``tune.report()`` in your function trainable. The effect is the same as above: The results are repeatedly serialized and written to disk, and this can take a long time. **Solution**: Use checkpoints instead, by writing data to the trainable's current working directory. There are various ways to do that depending on whether you are using the class or functional Trainable API. **You are training a large number of trials on a cluster, or you are saving huge checkpoints** **Solution**: You can use :ref:`cloud checkpointing ` to save logs and checkpoints to a specified `storage_path`. This is the preferred way to deal with this. All syncing will be taken care of automatically, as all nodes are able to access the cloud storage. Additionally, your results will be safe, so even when you're working on pre-emptible instances, you won't lose any of your data. **You are reporting results too often** Each result is processed by the search algorithm, trial scheduler, and callbacks (including loggers and the trial syncer). If you're reporting a large number of results per trial (e.g. multiple results per second), this can take a long time. **Solution**: Just don't report results that often. In class trainables, ``step()`` could process a larger chunk of data. In function trainables, you can report only every n-th iteration of the training loop. Only report as many results as you really need to make scheduling or searching decisions. If you need more fine-grained metrics for logging or tracking, consider using a separate logging mechanism for this instead of the Ray Tune-provided progress logging of results. How can I develop and test Tune locally?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ First, follow the instructions in :ref:`python-develop` to develop Tune without compiling Ray. After Ray is set up, run ``pip install -r ray/python/ray/tune/requirements-dev.txt`` to install all packages required for Tune development. Now, to run all Tune tests simply run: .. code-block:: shell pytest ray/python/ray/tune/tests/ If you plan to submit a pull request, we recommend you to run unit tests locally beforehand to speed up the review process. Even though we have hooks to run unit tests automatically for each pull request, it's usually quicker to run them on your machine first to avoid any obvious mistakes. How can I get started contributing to Tune? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ We use GitHub to track issues, feature requests, and bugs. Take a look at the ones labeled `"good first issue" `__ and `"help wanted" `__ for a place to start. Look for issues with "[tune]" in the title. .. note:: If raising a new issue or PR related to Tune, be sure to include "[tune]" in the title and add a ``tune`` label. .. _tune-reproducible: How can I make my Tune experiments reproducible? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Exact reproducibility of machine learning runs is hard to achieve. This is even more true in a distributed setting, as more non-determinism is introduced. For instance, if two trials finish at the same time, the convergence of the search algorithm might be influenced by which trial result is processed first. This depends on the searcher - for random search, this shouldn't make a difference, but for most other searchers it will. If you try to achieve some amount of reproducibility, there are two places where you'll have to set random seeds: 1. On the driver program, e.g. for the search algorithm. This will ensure that at least the initial configurations suggested by the search algorithms are the same. 2. In the trainable (if required). Neural networks are usually initialized with random numbers, and many classical ML algorithms, like GBDTs, make use of randomness. Thus you'll want to make sure to set a seed here so that the initialization is always the same. Here is an example that will always produce the same result (except for trial runtimes). .. literalinclude:: doc_code/faq.py :language: python :start-after: __reproducible_start__ :end-before: __reproducible_end__ Some searchers use their own random states to sample new configurations. These searchers usually accept a ``seed`` parameter that can be passed on initialization. Other searchers use Numpy's ``np.random`` interface - these seeds can be then set with ``np.random.seed()``. We don't offer an interface to do this in the searcher classes as setting a random seed globally could have side effects. For instance, it could influence the way your dataset is split. Thus, we leave it up to the user to make these global configuration changes. How can I use large datasets in Tune? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ You often will want to compute a large object (e.g., training data, model weights) on the driver and use that object within each trial. Tune provides a wrapper function ``tune.with_parameters()`` that allows you to broadcast large objects to your trainable. Objects passed with this wrapper will be stored on the :ref:`Ray object store ` and will be automatically fetched and passed to your trainable as a parameter. .. 
tip:: If the objects are small in size or already exist in the :ref:`Ray Object Store `, there's no need to use ``tune.with_parameters()``. You can use `partials `__ or pass them directly into ``config`` instead. .. literalinclude:: doc_code/faq.py :language: python :start-after: __large_data_start__ :end-before: __large_data_end__ .. _tune-cloud-syncing: How can I upload my Tune results to cloud storage? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ See :ref:`tune-cloud-checkpointing`. Make sure that worker nodes have write access to the cloud storage. Failing to do so would cause error messages like ``Error message (1): fatal error: Unable to locate credentials``. For an AWS setup, this involves adding an IamInstanceProfile configuration for worker nodes. Please :ref:`see here for more tips `. .. _tune-kubernetes: How can I use Tune with Kubernetes? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ You should configure shared storage. See this user guide: :ref:`tune-storage-options`. .. _tune-docker: How can I use Tune with Docker? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ You should configure shared storage. See this user guide: :ref:`tune-storage-options`. .. _tune-default-search-space: How do I configure search spaces? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ You can specify a grid search or sampling distribution via the dict passed into ``Tuner(param_space=...)``. .. literalinclude:: doc_code/faq.py :dedent: :language: python :start-after: __grid_search_start__ :end-before: __grid_search_end__ By default, each random variable and grid search point is sampled once. To take multiple random samples, add ``num_samples: N`` to the experiment config. If `grid_search` is provided as an argument, the grid will be repeated ``num_samples`` times. .. literalinclude:: doc_code/faq.py :language: python :start-after: __grid_search_2_start__ :end-before: __grid_search_2_end__ Note that search spaces may not be interoperable across different search algorithms. For example, for many search algorithms, you will not be able to use ``grid_search`` or ``sample_from`` parameters. Read about this in the :ref:`Search Space API ` page. .. _tune-working-dir: How do I access relative filepaths in my Tune training function? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Let's say you launch a Tune experiment with ``my_script.py`` from inside ``~/code``. By default, Tune changes the working directory of each worker to its corresponding trial directory (e.g. ``~/ray_results/exp_name/trial_0000x``). This guarantees separate working directories for each worker process, avoiding conflicts when saving trial-specific outputs. You can configure this by setting the `RAY_CHDIR_TO_TRIAL_DIR=0` environment variable. This explicitly tells Tune not to change the working directory to the trial directory, giving access to paths relative to the original working directory. One caveat is that the working directory is now shared between workers, so the :meth:`tune.get_context().get_trial_dir() ` API should be used to get the path for saving trial-specific outputs. .. literalinclude:: doc_code/faq.py :dedent: :emphasize-lines: 3, 10, 11, 12, 16 :language: python :start-after: __no_chdir_start__ :end-before: __no_chdir_end__ .. warning:: The `TUNE_ORIG_WORKING_DIR` environment variable was the original workaround for accessing paths relative to the original working directory. This environment variable is deprecated, and the `RAY_CHDIR_TO_TRIAL_DIR` environment variable above should be used instead. ..
_tune-multi-tenancy: How can I run multiple Ray Tune jobs on the same cluster at the same time (multi-tenancy)? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Running multiple Ray Tune runs on the same cluster at the same time is not officially supported. We do not test this workflow and we recommend using a separate cluster for each tuning job. The reasons for this are: 1. When multiple Ray Tune jobs run at the same time, they compete for resources. One job could run all its trials at the same time, while the other job waits for a long time until it gets resources to run the first trial. 2. If it is easy to start a new Ray cluster on your infrastructure, there is often no cost benefit to running one large cluster instead of multiple smaller clusters. For instance, running one cluster of 32 instances incurs almost the same cost as running 4 clusters with 8 instances each. 3. Concurrent jobs are harder to debug. If a trial of job A fills the disk, trials from job B on the same node are impacted. In practice, it's hard to reason about these conditions from the logs if something goes wrong. Previously, some internal implementations in Ray Tune assumed that you only have one job running at a time. A symptom was when trials from job A used parameters specified in job B, leading to unexpected results. Please refer to `this GitHub issue `__ for more context and a workaround if you run into this issue. .. _tune-iterative-experimentation: How can I continue training a completed Tune experiment for longer and with new configurations (iterative experimentation)? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Let's say that I have a Tune experiment that has completed with the following configurations: .. literalinclude:: /tune/doc_code/faq.py :language: python :start-after: __iter_experimentation_initial_start__ :end-before: __iter_experimentation_initial_end__ Now, I want to continue training from a checkpoint (e.g., the best one) generated by the previous experiment, and search over a new hyperparameter search space, for another ``10`` epochs. :ref:`tune-fault-tolerance-ref` explains that the usage of :meth:`Tuner.restore ` is meant for resuming an *unfinished* experiment that was interrupted in the middle, according to the *exact configuration* that was supplied in the initial training run. Therefore, ``Tuner.restore`` is not suitable for our desired behavior. This style of "iterative experimentation" should be done with *new* Tune experiments rather than restoring a single experiment over and over and modifying the experiment spec. See the following for an example of how to create a new experiment that builds off of the old one: .. literalinclude:: /tune/doc_code/faq.py :language: python :start-after: __iter_experimentation_resume_start__ :end-before: __iter_experimentation_resume_end__ --- .. _tune-tutorial: .. TODO: make this an executable notebook later on. Getting Started with Ray Tune ============================= This tutorial will walk you through the process of setting up a Tune experiment. To get started, we take a PyTorch model and show you how to leverage Ray Tune to optimize the hyperparameters of this model. Specifically, we'll leverage early stopping and Bayesian Optimization via HyperOpt to do so. .. tip:: If you have suggestions on how to improve this tutorial, please `let us know `_! To run this example, you will need to install the following: .. 
code-block:: bash $ pip install "ray[tune]" torch torchvision Setting Up a PyTorch Model to Tune ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ To start off, let's first import some dependencies. We import some PyTorch and TorchVision modules to help us create a model and train it. Also, we'll import Ray Tune to help us optimize the model. As you can see we use a so-called scheduler, in this case the ``ASHAScheduler`` that we will use for tuning the model later in this tutorial. .. literalinclude:: /../../python/ray/tune/tests/tutorial.py :language: python :start-after: __tutorial_imports_begin__ :end-before: __tutorial_imports_end__ Then, let's define a simple PyTorch model that we'll be training. If you're not familiar with PyTorch, the simplest way to define a model is to implement a ``nn.Module``. This requires you to set up your model with ``__init__`` and then implement a ``forward`` pass. In this example we're using a small convolutional neural network consisting of one 2D convolutional layer, a fully connected layer, and a softmax function. .. literalinclude:: /../../python/ray/tune/tests/tutorial.py :language: python :start-after: __model_def_begin__ :end-before: __model_def_end__ Below, we have implemented functions for training and evaluating your PyTorch model. We define a ``train`` and a ``test`` function for that purpose. If you know how to do this, skip ahead to the next section. .. dropdown:: Training and evaluating the model .. literalinclude:: /../../python/ray/tune/tests/tutorial.py :language: python :start-after: __train_def_begin__ :end-before: __train_def_end__ .. _tutorial-tune-setup: Setting up a ``Tuner`` for a Training Run with Tune ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Below, we define a function that trains the PyTorch model for multiple epochs. This function will be executed on a separate :ref:`Ray Actor (process) ` underneath the hood, so we need to communicate the performance of the model back to Tune (which is on the main Python process). To do this, we call :func:`tune.report() ` in our training function, which sends the performance value back to Tune. Since the function is executed on the separate process, make sure that the function is :ref:`serializable by Ray `. .. literalinclude:: /../../python/ray/tune/tests/tutorial.py :language: python :start-after: __train_func_begin__ :end-before: __train_func_end__ Let's run one trial by calling :ref:`Tuner.fit ` and :ref:`randomly sample ` from a uniform distribution for learning rate and momentum. .. literalinclude:: /../../python/ray/tune/tests/tutorial.py :language: python :start-after: __eval_func_begin__ :end-before: __eval_func_end__ ``Tuner.fit`` returns an :ref:`ResultGrid object `. You can use this to plot the performance of this trial. .. literalinclude:: /../../python/ray/tune/tests/tutorial.py :language: python :start-after: __plot_begin__ :end-before: __plot_end__ .. note:: Tune will automatically run parallel trials across all available cores/GPUs on your machine or cluster. To limit the number of concurrent trials, use the :ref:`ConcurrencyLimiter `. Early Stopping with Adaptive Successive Halving (ASHAScheduler) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Let's integrate early stopping into our optimization process. Let's use :ref:`ASHA `, a scalable algorithm for `principled early stopping`_. .. 
_`principled early stopping`: https://blog.ml.cmu.edu/2018/12/12/massively-parallel-hyperparameter-optimization/ On a high level, ASHA terminates trials that are less promising and allocates more time and resources to more promising trials. As our optimization process becomes more efficient, we can afford to **increase the search space by 5x**, by adjusting the parameter ``num_samples``. ASHA is implemented in Tune as a "Trial Scheduler". These Trial Schedulers can early terminate bad trials, pause trials, clone trials, and alter hyperparameters of a running trial. See :ref:`the TrialScheduler documentation ` for more details of available schedulers and library integrations. .. literalinclude:: /../../python/ray/tune/tests/tutorial.py :language: python :start-after: __run_scheduler_begin__ :end-before: __run_scheduler_end__ You can run the below in a Jupyter notebook to visualize trial progress. .. literalinclude:: /../../python/ray/tune/tests/tutorial.py :language: python :start-after: __plot_scheduler_begin__ :end-before: __plot_scheduler_end__ .. image:: /images/tune-df-plot.png :scale: 50% :align: center You can also use :ref:`TensorBoard ` for visualizing results. .. code:: bash $ tensorboard --logdir {logdir} Using Search Algorithms in Tune ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ In addition to :ref:`TrialSchedulers `, you can further optimize your hyperparameters by using an intelligent search technique like Bayesian Optimization. To do this, you can use a Tune :ref:`Search Algorithm `. Search Algorithms leverage optimization algorithms to intelligently navigate the given hyperparameter space. Note that each library has a specific way of defining the search space. .. literalinclude:: /../../python/ray/tune/tests/tutorial.py :language: python :start-after: __run_searchalg_begin__ :end-before: __run_searchalg_end__ .. note:: Tune allows you to use some search algorithms in combination with different trial schedulers. See :ref:`this page for more details `. Evaluating Your Model after Tuning ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ You can evaluate the best trained model using the :ref:`ExperimentAnalysis object ` to retrieve the best model: .. literalinclude:: /../../python/ray/tune/tests/tutorial.py :language: python :start-after: __run_analysis_begin__ :end-before: __run_analysis_end__ Next Steps ---------- * Check out the :ref:`Tune tutorials ` for guides on using Tune with your preferred machine learning library. * Browse our :ref:`gallery of examples ` to see how to use Tune with PyTorch, XGBoost, Tensorflow, etc. * `Let us know `__ if you ran into issues or have any questions by opening an issue on our GitHub. * To check how your application is doing, you can use the :ref:`Ray dashboard `. --- .. _tune-main: Ray Tune: Hyperparameter Tuning =============================== .. toctree:: :hidden: Getting Started Key Concepts tutorials/overview examples/index faq api/api .. image:: images/tune_overview.png :scale: 50% :align: center Tune is a Python library for experiment execution and hyperparameter tuning at any scale. You can tune your favorite machine learning framework (:ref:`PyTorch `, :ref:`XGBoost `, :doc:`TensorFlow and Keras `, and :doc:`more `) by running state of the art algorithms such as :ref:`Population Based Training (PBT) ` and :ref:`HyperBand/ASHA `. Tune further integrates with a wide range of additional hyperparameter optimization tools, including :doc:`Ax `, :doc:`BayesOpt `, :doc:`BOHB `, :doc:`Nevergrad `, and :doc:`Optuna `. 
**Click on the following tabs to see code examples for various machine learning frameworks**: .. tab-set:: .. tab-item:: Quickstart To run this example, install the following: ``pip install "ray[tune]"``. In this quick-start example you `minimize` a simple function of the form ``f(x) = a**2 + b``, our `objective` function. The closer ``a`` is to zero and the smaller ``b`` is, the smaller the total value of ``f(x)``. We will define a so-called `search space` for ``a`` and ``b`` and let Ray Tune explore the space for good values. .. callout:: .. literalinclude:: ../../../python/ray/tune/tests/example.py :language: python :start-after: __quick_start_begin__ :end-before: __quick_start_end__ .. annotations:: <1> Define an objective function. <2> Define a search space. <3> Start a Tune run and print the best result. .. tab-item:: Keras+Hyperopt To tune your Keras models with Hyperopt, you wrap your model in an objective function whose ``config`` you can access for selecting hyperparameters. In the example below we only tune the ``activation`` parameter of the first layer of the model, but you can tune any parameter of the model you want. After defining the search space, you can simply initialize the ``HyperOptSearch`` object and pass it to ``run``. It's important to tell Ray Tune which metric you want to optimize and whether you want to maximize or minimize it. .. callout:: .. literalinclude:: doc_code/keras_hyperopt.py :language: python :start-after: __keras_hyperopt_start__ :end-before: __keras_hyperopt_end__ .. annotations:: <1> Wrap a Keras model in an objective function. <2> Define a search space and initialize the search algorithm. <3> Start a Tune run that maximizes accuracy. .. tab-item:: PyTorch+Optuna To tune your PyTorch models with Optuna, you wrap your model in an objective function whose ``config`` you can access for selecting hyperparameters. In the example below we only tune the ``momentum`` and learning rate (``lr``) parameters of the model's optimizer, but you can tune any other model parameter you want. After defining the search space, you can simply initialize the ``OptunaSearch`` object and pass it to ``run``. It's important to tell Ray Tune which metric you want to optimize and whether you want to maximize or minimize it. We stop tuning this training run after ``5`` iterations, but you can easily define other stopping rules as well. .. callout:: .. literalinclude:: doc_code/pytorch_optuna.py :language: python :start-after: __pytorch_optuna_start__ :end-before: __pytorch_optuna_end__ .. annotations:: <1> Wrap a PyTorch model in an objective function. <2> Define a search space and initialize the search algorithm. <3> Start a Tune run that maximizes mean accuracy and stops after 5 iterations. With Tune you can also launch a multi-node :ref:`distributed hyperparameter sweep ` in less than 10 lines of code. And you can move your models from training to serving on the same infrastructure with `Ray Serve`_. .. _`Ray Serve`: ../serve/index.html .. grid:: 1 2 3 4 :gutter: 1 :class-container: container pb-3 .. grid-item-card:: **Getting Started** ^^^ In our getting started tutorial you will learn how to tune a PyTorch model effectively with Tune. +++ .. button-ref:: tune-tutorial :color: primary :outline: :expand: Get Started with Tune .. grid-item-card:: **Key Concepts** ^^^ Understand the key concepts behind Ray Tune. Learn about tune runs, search algorithms, schedulers and other features. +++ .. button-ref:: tune-60-seconds :color: primary :outline: :expand: Tune's Key Concepts .. 
grid-item-card:: **User Guides** ^^^ Our guides teach you about key features of Tune, such as distributed training or early stopping. +++ .. button-ref:: tune-guides :color: primary :outline: :expand: Learn How To Use Tune .. grid-item-card:: **Examples** ^^^ In our examples you can find practical tutorials for using frameworks such as scikit-learn, Keras, TensorFlow, PyTorch, and mlflow, and state of the art search algorithm integrations. +++ .. button-ref:: tune-examples-ref :color: primary :outline: :expand: Ray Tune Examples .. grid-item-card:: **Ray Tune FAQ** ^^^ Find answers to commonly asked questions in our detailed FAQ. +++ .. button-ref:: tune-faq :color: primary :outline: :expand: Ray Tune FAQ .. grid-item-card:: **Ray Tune API** ^^^ Get more in-depth information about the Ray Tune API, including all about search spaces, algorithms and training configurations. +++ .. button-ref:: tune-api-ref :color: primary :outline: :expand: Read the API Reference Why choose Tune? ---------------- There are many other hyperparameter optimization libraries out there. If you're new to Tune, you're probably wondering, "what makes Tune different?" .. dropdown:: Cutting-Edge Optimization Algorithms :animate: fade-in-slide-down As a user, you're probably looking into hyperparameter optimization because you want to quickly increase your model performance. Tune enables you to leverage a variety of these cutting edge optimization algorithms, reducing the cost of tuning by `terminating bad runs early `_, :ref:`choosing better parameters to evaluate `, or even :ref:`changing the hyperparameters during training ` to optimize schedules. .. dropdown:: First-class Developer Productivity :animate: fade-in-slide-down A key problem with many hyperparameter optimization frameworks is the need to restructure your code to fit the framework. With Tune, you can optimize your model just by :ref:`adding a few code snippets `. Also, Tune removes boilerplate from your code training workflow, supporting :ref:`multiple storage options for experiment results (NFS, cloud storage) ` and :ref:`logs results to tools ` such as MLflow and TensorBoard, while also being highly customizable. .. dropdown:: Multi-GPU & Distributed Training Out Of The Box :animate: fade-in-slide-down Hyperparameter tuning is known to be highly time-consuming, so it is often necessary to parallelize this process. Most other tuning frameworks require you to implement your own multi-process framework or build your own distributed system to speed up hyperparameter tuning. However, Tune allows you to transparently :ref:`parallelize across multiple GPUs and multiple nodes `. Tune even has seamless :ref:`fault tolerance and cloud support `, allowing you to scale up your hyperparameter search by 100x while reducing costs by up to 10x by using cheap preemptible instances. .. dropdown:: Coming From Another Hyperparameter Optimization Tool? :animate: fade-in-slide-down You might be already using an existing hyperparameter tuning tool such as HyperOpt or Bayesian Optimization. In this situation, Tune actually allows you to power up your existing workflow. Tune's :ref:`Search Algorithms ` integrate with a variety of popular hyperparameter tuning libraries (see :ref:`examples `) and allow you to seamlessly scale up your optimization process - without sacrificing performance. Projects using Tune ------------------- Here are some of the popular open source repositories and research projects that leverage Tune. 
Feel free to submit a pull-request adding (or requesting a removal!) of a listed project. - `Softlearning `_: Softlearning is a reinforcement learning framework for training maximum entropy policies in continuous domains. Includes the official implementation of the Soft Actor-Critic algorithm. - `Flambe `_: An ML framework to accelerate research and its path to production. See `flambe.ai `_. - `Population Based Augmentation `_: Population Based Augmentation (PBA) is an algorithm that quickly and efficiently learns data augmentation functions for neural network training. PBA matches state-of-the-art results on CIFAR with one thousand times less compute. - `Fast AutoAugment by Kakao `_: Fast AutoAugment (Accepted at NeurIPS 2019) learns augmentation policies using a more efficient search strategy based on density matching. - `Allentune `_: Hyperparameter Search for AllenNLP from AllenAI. - `machinable `_: A modular configuration system for machine learning research. See `machinable.org `_. - `NeuroCard `_: NeuroCard (Accepted at VLDB 2021) is a neural cardinality estimator for multi-table join queries. It uses state of the art deep density models to learn correlations across relational database tables. Learn More About Ray Tune ------------------------- Below you can find blog posts and talks about Ray Tune: - [blog] `Tune: a Python library for fast hyperparameter tuning at any scale `_ - [blog] `Cutting edge hyperparameter tuning with Ray Tune `_ - [slides] `Talk given at RISECamp 2019 `_ - [video] `Talk given at RISECamp 2018 `_ - [video] `A Guide to Modern Hyperparameter Optimization (PyData LA 2019) `_ (`slides `_) Citing Tune ----------- If Tune helps you in your academic research, you are encouraged to cite `our paper `__. Here is an example bibtex: .. code-block:: tex @article{liaw2018tune, title={Tune: A Research Platform for Distributed Model Selection and Training}, author={Liaw, Richard and Liang, Eric and Nishihara, Robert and Moritz, Philipp and Gonzalez, Joseph E and Stoica, Ion}, journal={arXiv preprint arXiv:1807.05118}, year={2018} } --- .. _tune-60-seconds: ======================== Key Concepts of Ray Tune ======================== .. TODO: should we introduce checkpoints as well? .. TODO: should we at least mention "Stopper" classes here? Let's quickly walk through the key concepts you need to know to use Tune. If you want to see practical tutorials right away, go visit our :ref:`user guides `. In essence, Tune has six crucial components that you need to understand. First, you define the hyperparameters you want to tune in a `search space` and pass them into a `trainable` that specifies the objective you want to tune. Then you select a `search algorithm` to effectively optimize your parameters and optionally use a `scheduler` to stop searches early and speed up your experiments. Together with other configuration, your `trainable`, search algorithm, and scheduler are passed into ``Tuner``, which runs your experiments and creates `trials`. The `Tuner` returns a `ResultGrid` to inspect your experiment results. The following figure shows an overview of these components, which we cover in detail in the next sections. .. image:: images/tune_flow.png .. _tune_60_seconds_trainables: Ray Tune Trainables ------------------- In short, a :ref:`Trainable ` is an object that you can pass into a Tune run. Ray Tune has two ways of defining a `trainable`, namely the :ref:`Function API ` and the :ref:`Class API `. 
Both are valid ways of defining a `trainable`, but the Function API is generally recommended and is used throughout the rest of this guide.

Consider an example of optimizing a simple objective function like ``a * (x ** 2) + b`` in which ``a`` and ``b`` are the hyperparameters we want to tune to `minimize` the objective. Since the objective also has a variable ``x``, we need to test for different values of ``x``. Given concrete choices for ``a``, ``b``, and ``x``, we can evaluate the objective function and get a `score` to minimize.

.. tab-set::

    .. tab-item:: Function API

        With the :ref:`function-based API ` you create a function (here called ``trainable``) that takes in a dictionary of hyperparameters. This function computes a ``score`` in a "training loop" and `reports` this score back to Tune:

        .. literalinclude:: doc_code/key_concepts.py
            :language: python
            :start-after: __function_api_start__
            :end-before: __function_api_end__

        Note that we use ``tune.report(...)`` to report the intermediate ``score`` in the training loop, which can be useful in many machine learning tasks. If you just want to report the final ``score`` outside of this loop, you can simply return the score at the end of the ``trainable`` function with ``return {"score": score}``. You can also use ``yield {"score": score}`` instead of ``tune.report()``.

    .. tab-item:: Class API

        Here's an example of specifying the objective function using the :ref:`class-based API `:

        .. literalinclude:: doc_code/key_concepts.py
            :language: python
            :start-after: __class_api_start__
            :end-before: __class_api_end__

        .. tip:: ``tune.report`` can't be used within a ``Trainable`` class.

Learn more about the details of :ref:`Trainables here ` and :ref:`have a look at our examples `.

Next, let's take a closer look at the ``config`` dictionary that you pass into your trainables.

.. _tune-key-concepts-search-spaces:

Tune Search Spaces
------------------

To optimize your *hyperparameters*, you have to define a *search space*. A search space defines valid values for your hyperparameters and can specify how these values are sampled (e.g. from a uniform distribution or a normal distribution).

Tune offers various functions to define search spaces and sampling methods. :ref:`You can find the documentation of these search space definitions here `.

Here's an example covering all search space functions. Again, :ref:`here is the full explanation of all these functions `.

.. literalinclude:: doc_code/key_concepts.py
    :language: python
    :start-after: __config_start__
    :end-before: __config_end__

.. _tune_60_seconds_trials:

Tune Trials
-----------

You use :ref:`Tuner.fit ` to execute and manage hyperparameter tuning and generate your `trials`. At a minimum, your ``Tuner`` call takes in a trainable as its first argument, and a ``param_space`` dictionary to define the search space.

The ``Tuner.fit()`` function also provides many features such as :ref:`logging `, :ref:`checkpointing `, and :ref:`early stopping `.

In the example, minimizing ``a * (x ** 2) + b``, a simple Tune run with a simplistic search space for ``a`` and ``b`` looks like this:

.. literalinclude:: doc_code/key_concepts.py
    :language: python
    :start-after: __run_tunable_start__
    :end-before: __run_tunable_end__

``Tuner.fit`` will generate a couple of hyperparameter configurations from its arguments, wrapping them into :ref:`Trial objects `. Trials contain a lot of information.
For instance, you can get the hyperparameter configuration using (``trial.config``), the trial ID (``trial.trial_id``), the trial's resource specification (``resources_per_trial`` or ``trial.placement_group_factory``) and many other values. By default ``Tuner.fit`` will execute until all trials stop or error. Here's an example output of a trial run: .. TODO: how to make sure this doesn't get outdated? .. code-block:: bash == Status == Memory usage on this node: 11.4/16.0 GiB Using FIFO scheduling algorithm. Resources requested: 1/12 CPUs, 0/0 GPUs, 0.0/3.17 GiB heap, 0.0/1.07 GiB objects Result logdir: /Users/foo/ray_results/myexp Number of trials: 1 (1 RUNNING) +----------------------+----------+---------------------+-----------+--------+--------+----------------+-------+ | Trial name | status | loc | a | b | score | total time (s) | iter | |----------------------+----------+---------------------+-----------+--------+--------+----------------+-------| | Trainable_a826033a | RUNNING | 10.234.98.164:31115 | 0.303706 | 0.0761 | 0.1289 | 7.54952 | 15 | +----------------------+----------+---------------------+-----------+--------+--------+----------------+-------+ You can also easily run just 10 trials by specifying the number of samples (``num_samples``). Tune automatically :ref:`determines how many trials will run in parallel `. Note that instead of the number of samples, you can also specify a time budget in seconds through ``time_budget_s``, if you set ``num_samples=-1``. .. literalinclude:: doc_code/key_concepts.py :language: python :start-after: __run_tunable_samples_start__ :end-before: __run_tunable_samples_end__ Finally, you can use more interesting search spaces to optimize your hyperparameters via Tune's :ref:`search space API `, like using random samples or grid search. Here's an example of uniformly sampling between ``[0, 1]`` for ``a`` and ``b``: .. literalinclude:: doc_code/key_concepts.py :language: python :start-after: __search_space_start__ :end-before: __search_space_end__ To learn more about the various ways of configuring your Tune runs, check out the :ref:`Tuner API reference `. .. _search-alg-ref: Tune Search Algorithms ---------------------- To optimize the hyperparameters of your training process, you use a :ref:`Search Algorithm ` which suggests hyperparameter configurations. If you don't specify a search algorithm, Tune will use random search by default, which can provide you with a good starting point for your hyperparameter optimization. For instance, to use Tune with simple Bayesian optimization through the ``bayesian-optimization`` package (make sure to first run ``pip install bayesian-optimization``), we can define an ``algo`` using ``BayesOptSearch``. Simply pass in a ``search_alg`` argument to ``tune.TuneConfig``, which is taken in by ``Tuner``: .. literalinclude:: doc_code/key_concepts.py :language: python :start-after: __bayes_start__ :end-before: __bayes_end__ Tune has Search Algorithms that integrate with many popular **optimization** libraries, such as :ref:`HyperOpt ` or :ref:`Optuna `. Tune automatically converts the provided search space into the search spaces the search algorithms and underlying libraries expect. See the :ref:`Search Algorithm API documentation ` for more details. Here's an overview of all available search algorithms in Tune: .. 
list-table:: :widths: 5 5 2 10 :header-rows: 1 * - SearchAlgorithm - Summary - Website - Code Example * - :ref:`Random search/grid search ` - Random search/grid search - - :doc:`/tune/examples/includes/tune_basic_example` * - :ref:`AxSearch ` - Bayesian/Bandit Optimization - [`Ax `__] - :doc:`/tune/examples/includes/ax_example` * - :ref:`HyperOptSearch ` - Tree-Parzen Estimators - [`HyperOpt `__] - :doc:`/tune/examples/hyperopt_example` * - :ref:`BayesOptSearch ` - Bayesian Optimization - [`BayesianOptimization `__] - :doc:`/tune/examples/includes/bayesopt_example` * - :ref:`TuneBOHB ` - Bayesian Opt/HyperBand - [`BOHB `__] - :doc:`/tune/examples/includes/bohb_example` * - :ref:`NevergradSearch ` - Gradient-free Optimization - [`Nevergrad `__] - :doc:`/tune/examples/includes/nevergrad_example` * - :ref:`OptunaSearch ` - Optuna search algorithms - [`Optuna `__] - :doc:`/tune/examples/optuna_example` .. note:: Unlike :ref:`Tune's Trial Schedulers `, Tune Search Algorithms cannot affect or stop training processes. However, you can use them together to early stop the evaluation of bad trials. In case you want to implement your own search algorithm, the interface is easy to implement, you can :ref:`read the instructions here `. Tune also provides helpful utilities to use with Search Algorithms: * :ref:`repeater`: Support for running each *sampled hyperparameter* with multiple random seeds. * :ref:`limiter`: Limits the amount of concurrent trials when running optimization. * :ref:`shim`: Allows creation of the search algorithm object given a string. Note that in the example above we tell Tune to ``stop`` after ``20`` training iterations. This way of stopping trials with explicit rules is useful, but in many cases we can do even better with `schedulers`. .. _schedulers-ref: Tune Schedulers --------------- To make your training process more efficient, you can use a :ref:`Trial Scheduler `. For instance, in our ``trainable`` example minimizing a function in a training loop, we used ``tune.report()``. This reported `incremental` results, given a hyperparameter configuration selected by a search algorithm. Based on these reported results, a Tune scheduler can decide whether to stop the trial early or not. If you don't specify a scheduler, Tune will use a first-in-first-out (FIFO) scheduler by default, which simply passes through the trials selected by your search algorithm in the order they were picked and does not perform any early stopping. In short, schedulers can stop, pause, or tweak the hyperparameters of running trials, potentially making your hyperparameter tuning process much faster. Unlike search algorithms, :ref:`Trial Schedulers ` do not select which hyperparameter configurations to evaluate. Here's a quick example of using the so-called ``HyperBand`` scheduler to tune an experiment. All schedulers take in a ``metric``, which is the value reported by your trainable. The ``metric`` is then maximized or minimized according to the ``mode`` you provide. To use a scheduler, just pass in a ``scheduler`` argument to ``tune.TuneConfig``, which is taken in by ``Tuner``: .. literalinclude:: doc_code/key_concepts.py :language: python :start-after: __hyperband_start__ :end-before: __hyperband_end__ Tune includes distributed implementations of early stopping algorithms such as `Median Stopping Rule `__, `HyperBand `__, and `ASHA `__. Tune also includes a distributed implementation of `Population Based Training (PBT) `__ and `Population Based Bandits (PB2) `__. .. 
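As with the ``HyperBand`` example above, ASHA is just another ``scheduler`` argument to ``tune.TuneConfig``. The following is a minimal sketch, not taken from the example files: it assumes the function-based ``trainable`` and the ``"score"`` metric defined earlier in this guide.

.. code-block:: python

    from ray import tune
    from ray.tune.schedulers import ASHAScheduler

    # Assumes `trainable` reports a "score" metric, as in the earlier examples.
    tuner = tune.Tuner(
        trainable,
        tune_config=tune.TuneConfig(
            metric="score",
            mode="min",
            scheduler=ASHAScheduler(),  # stop low-performing trials early
            num_samples=20,
        ),
        param_space={"a": tune.uniform(0, 1), "b": tune.uniform(0, 1)},
    )
    results = tuner.fit()

..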
tip:: The easiest scheduler to start with is the ``ASHAScheduler`` which will aggressively terminate low-performing trials. When using schedulers, you may face compatibility issues, as shown in the below compatibility matrix. Certain schedulers cannot be used with search algorithms, and certain schedulers require that you implement :ref:`checkpointing `. Schedulers can dynamically change trial resource requirements during tuning. This is implemented in :ref:`ResourceChangingScheduler`, which can wrap around any other scheduler. .. list-table:: Scheduler Compatibility Matrix :header-rows: 1 * - Scheduler - Need Checkpointing? - SearchAlg Compatible? - Example * - :ref:`ASHA ` - No - Yes - :doc:`Link ` * - :ref:`Median Stopping Rule ` - No - Yes - :ref:`Link ` * - :ref:`HyperBand ` - Yes - Yes - :doc:`Link ` * - :ref:`BOHB ` - Yes - Only TuneBOHB - :doc:`Link ` * - :ref:`Population Based Training ` - Yes - Not Compatible - :doc:`Link ` * - :ref:`Population Based Bandits ` - Yes - Not Compatible - :doc:`Basic Example `, :doc:`PPO example ` Learn more about trial schedulers in :ref:`the scheduler API documentation `. .. _tune-concepts-analysis: Tune ResultGrid --------------- ``Tuner.fit()`` returns an :ref:`ResultGrid ` object which has methods you can use for analyzing your training. The following example shows you how to access various metrics from an ``ResultGrid`` object, like the best available trial, or the best hyperparameter configuration for that trial: .. literalinclude:: doc_code/key_concepts.py :language: python :start-after: __analysis_start__ :end-before: __analysis_end__ This object can also retrieve all training runs as dataframes, allowing you to do ad-hoc data analysis over your results. .. literalinclude:: doc_code/key_concepts.py :language: python :start-after: __results_start__ :end-before: __results_end__ See the :ref:`result analysis user guide ` for more usage examples. What's Next? ------------- Now that you have a working understanding of Tune, check out: * :ref:`tune-guides`: Tutorials for using Tune with your preferred machine learning library. * :doc:`/tune/examples/index`: End-to-end examples and templates for using Tune with your preferred machine learning library. * :doc:`/tune/getting-started`: A simple tutorial that walks you through the process of setting up a Tune experiment. Further Questions or Issues? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. include:: /_includes/_help.rst --- .. _tune-guides: =========== User Guides =========== .. toctree:: :maxdepth: 2 Running Basic Experiments tune-output Setting Trial Resources Using Search Spaces tune-stopping tune-trial-checkpoints tune-storage tune-fault-tolerance Using Callbacks and Metrics tune_get_data_in_and_out ../examples/tune_analyze_results ../examples/pbt_guide Deploying Tune in the Cloud Tune Architecture Scalability Benchmarks --- .. _tune-distributed-ref: Running Distributed Experiments with Ray Tune ============================================== Tune is commonly used for large-scale distributed hyperparameter optimization. This page will overview how to setup and launch a distributed experiment along with :ref:`commonly used commands ` for Tune when running distributed experiments. .. contents:: :local: :backlinks: none Summary ------- To run a distributed experiment with Tune, you need to: 1. First, :ref:`start a Ray cluster ` if you have not already. 2. Run the script on the head node, or use :ref:`ray submit `, or use :ref:`Ray Job Submission `. .. 
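For example, with Ray Job Submission you can submit the Tune script to an existing cluster from your local machine. The snippet below is a sketch rather than a complete recipe: the entrypoint ``tune_script.py``, the working directory, and the dashboard address ``http://127.0.0.1:8265`` are placeholders for your own setup.

.. code-block:: python

    from ray.job_submission import JobSubmissionClient

    # Connect to the Ray dashboard address of the head node.
    client = JobSubmissionClient("http://127.0.0.1:8265")

    # Upload the local working directory (which contains tune_script.py)
    # and run the script as a job on the cluster.
    job_id = client.submit_job(
        entrypoint="python tune_script.py",
        runtime_env={"working_dir": "./"},
    )
    print(f"Submitted Tune job: {job_id}")

..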
tune-distributed-cloud: Example: Distributed Tune on AWS VMs ------------------------------------ Follow the instructions below to launch nodes on AWS (using the Deep Learning AMI). See the :ref:`cluster setup documentation `. Save the below cluster configuration (``tune-default.yaml``): .. literalinclude:: /../../python/ray/tune/examples/tune-default.yaml :language: yaml :name: tune-default.yaml ``ray up`` starts Ray on the cluster of nodes. .. code-block:: bash ray up tune-default.yaml ``ray submit --start`` starts a cluster as specified by the given cluster configuration YAML file, uploads ``tune_script.py`` to the cluster, and runs ``python tune_script.py [args]``. .. code-block:: bash ray submit tune-default.yaml tune_script.py --start -- --ray-address=localhost:6379 .. image:: /images/tune-upload.png :scale: 50% :align: center Analyze your results on TensorBoard by starting TensorBoard on the remote head machine. .. code-block:: bash # Go to http://localhost:6006 to access TensorBoard. ray exec tune-default.yaml 'tensorboard --logdir=~/ray_results/ --port 6006' --port-forward 6006 Note that you can customize the directory of results by specifying: ``RunConfig(storage_path=..)``, taken in by ``Tuner``. You can then point TensorBoard to that directory to visualize results. You can also use `awless `_ for easy cluster management on AWS. Running a Distributed Tune Experiment ------------------------------------- Running a distributed (multi-node) experiment requires Ray to be started already. You can do this on local machines or on the cloud. Across your machines, Tune will automatically detect the number of GPUs and CPUs without you needing to manage ``CUDA_VISIBLE_DEVICES``. To execute a distributed experiment, call ``ray.init(address=XXX)`` before ``Tuner.fit()``, where ``XXX`` is the Ray address, which defaults to ``localhost:6379``. The Tune python script should be executed only on the head node of the Ray cluster. One common approach to modifying an existing Tune experiment to go distributed is to set an ``argparse`` variable so that toggling between distributed and single-node is seamless. .. code-block:: python import ray import argparse parser = argparse.ArgumentParser() parser.add_argument("--address") args = parser.parse_args() ray.init(address=args.address) tuner = tune.Tuner(...) tuner.fit() .. code-block:: bash # On the head node, connect to an existing ray cluster $ python tune_script.py --ray-address=localhost:XXXX If you used a cluster configuration (starting a cluster with ``ray up`` or ``ray submit --start``), use: .. code-block:: bash ray submit tune-default.yaml tune_script.py -- --ray-address=localhost:6379 .. tip:: 1. In the examples, the Ray address commonly used is ``localhost:6379``. 2. If the Ray cluster is already started, you should not need to run anything on the worker nodes. Storage Options in a Distributed Tune Run ----------------------------------------- In a distributed experiment, you should try to use :ref:`cloud checkpointing ` to reduce synchronization overhead. For this, you just have to specify a remote ``storage_path`` in the :class:`RunConfig `. `my_trainable` is a user-defined :ref:`Tune Trainable ` in the following example: .. 
code-block:: python from ray import tune from my_module import my_trainable tuner = tune.Tuner( my_trainable, run_config=tune.RunConfig( name="experiment_name", storage_path="s3://bucket-name/sub-path/", ) ) tuner.fit() For more details or customization, see our :ref:`guide on configuring storage in a distributed Tune experiment `. .. _tune-distributed-spot: Tune Runs on preemptible instances ----------------------------------- Running on spot instances (or preemptible instances) can reduce the cost of your experiment. You can enable spot instances in AWS via the following configuration modification: .. code-block:: yaml # Provider-specific config for worker nodes, e.g. instance type. worker_nodes: InstanceType: m5.large ImageId: ami-0b294f219d14e6a82 # Deep Learning AMI (Ubuntu) Version 21.0 # Run workers on spot by default. Comment this out to use on-demand. InstanceMarketOptions: MarketType: spot SpotOptions: MaxPrice: 1.0 # Max Hourly Price In GCP, you can use the following configuration modification: .. code-block:: yaml worker_nodes: machineType: n1-standard-2 disks: - boot: true autoDelete: true type: PERSISTENT initializeParams: diskSizeGb: 50 # See https://cloud.google.com/compute/docs/images for more images sourceImage: projects/deeplearning-platform-release/global/images/family/tf-1-13-cpu # Run workers on preemtible instances. scheduling: - preemptible: true Spot instances may be pre-empted suddenly while trials are still running. Tune allows you to mitigate the effects of this by preserving the progress of your model training through :ref:`checkpointing `. .. literalinclude:: /../../python/ray/tune/tests/tutorial.py :language: python :start-after: __trainable_run_begin__ :end-before: __trainable_run_end__ Example for Using Tune with Spot instances (AWS) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Here is an example for running Tune on spot instances. This assumes your AWS credentials have already been setup (``aws configure``): 1. Download a full example Tune experiment script here. This includes a Trainable with checkpointing: :download:`mnist_pytorch_trainable.py `. To run this example, you will need to install the following: .. code-block:: bash $ pip install ray torch torchvision filelock 2. Download an example cluster yaml here: :download:`tune-default.yaml ` 3. Run ``ray submit`` as below to run Tune across them. Append ``[--start]`` if the cluster is not up yet. Append ``[--stop]`` to automatically shutdown your nodes after running. .. code-block:: bash ray submit tune-default.yaml mnist_pytorch_trainable.py --start -- --ray-address=localhost:6379 4. Optionally for testing on AWS or GCP, you can use the following to kill a random worker node after all the worker nodes are up .. code-block:: bash $ ray kill-random-node tune-default.yaml --hard To summarize, here are the commands to run: .. code-block:: bash wget https://raw.githubusercontent.com/ray-project/ray/master/python/ray/tune/examples/mnist_pytorch_trainable.py wget https://raw.githubusercontent.com/ray-project/ray/master/python/ray/tune/tune-default.yaml ray submit tune-default.yaml mnist_pytorch_trainable.py --start -- --ray-address=localhost:6379 # wait a while until after all nodes have started ray kill-random-node tune-default.yaml --hard You should see Tune eventually continue the trials on a different worker node. See the :ref:`Fault Tolerance ` section for more details. 
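To make trials robust to such preemptions in your own scripts, you typically combine trial checkpointing with a failure configuration. Below is a minimal sketch, assuming a checkpointing trainable such as the ``my_trainable`` used elsewhere in this guide; the experiment name is a placeholder.

.. code-block:: python

    from ray import tune

    tuner = tune.Tuner(
        my_trainable,
        run_config=tune.RunConfig(
            name="spot_experiment",
            # Retry a preempted or failed trial up to 3 times,
            # resuming from its latest checkpoint.
            failure_config=tune.FailureConfig(max_failures=3),
            # Also set a shared `storage_path` (see the next paragraph) so that
            # checkpoints remain accessible when a node goes away.
        ),
    )
    tuner.fit()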
You can also specify ``storage_path=...``, as part of ``RunConfig``, which is taken in by ``Tuner``, to upload results to cloud storage like S3, allowing you to persist results in case you want to start and stop your cluster automatically.

.. _tune-fault-tol:

Fault Tolerance of Tune Runs
----------------------------

Tune automatically restarts trials in the case of trial failures (if ``max_failures != 0``), both in the single-node and distributed settings.

For example, let's say a node is pre-empted or crashes while a trial is still executing on that node. Assuming that a checkpoint for this trial exists (and in the distributed setting, :ref:`some form of persistent storage is configured to access the trial's checkpoint `), Tune waits until resources are available to begin executing the trial again from where it left off. If no checkpoint is found, the trial will restart from scratch. See :ref:`here for information on checkpointing `.

If the trial or actor is then placed on a different node, Tune automatically pushes the previous checkpoint file to that node and restores the remote trial actor state, allowing the trial to resume from the latest checkpoint even after failure.

Recovering From Failures
~~~~~~~~~~~~~~~~~~~~~~~~

Tune automatically persists the progress of your entire experiment (a ``Tuner.fit()`` session), so if an experiment crashes or is otherwise cancelled, it can be resumed through :meth:`~ray.tune.Tuner.restore`.

.. _tune-distributed-common:

Common Tune Commands
--------------------

Below are some commonly used commands for submitting experiments. Please see the :ref:`Clusters page ` to find more comprehensive documentation of commands.

.. code-block:: bash

    # Upload `tune_experiment.py` from your local machine onto the cluster. Then,
    # run `python tune_experiment.py --address=localhost:6379` on the remote machine.
    $ ray submit CLUSTER.YAML tune_experiment.py -- --address=localhost:6379

    # Start a cluster and run an experiment in a detached tmux session,
    # and shut down the cluster as soon as the experiment completes.
    # In `tune_experiment.py`, set `RunConfig(storage_path="s3://...")`
    # to persist results
    $ ray submit CLUSTER.YAML --tmux --start --stop tune_experiment.py -- --address=localhost:6379

    # To start or update your cluster:
    $ ray up CLUSTER.YAML [-y]

    # Shut down all instances of your cluster:
    $ ray down CLUSTER.YAML [-y]

    # Run TensorBoard and forward the port to your own machine.
    $ ray exec CLUSTER.YAML 'tensorboard --logdir ~/ray_results/ --port 6006' --port-forward 6006

    # Run Jupyter Lab and forward the port to your own machine.
    $ ray exec CLUSTER.YAML 'jupyter lab --port 6006' --port-forward 6006

    # Get a summary of all the experiments and trials that have executed so far.
    $ ray exec CLUSTER.YAML 'tune ls ~/ray_results'

    # Upload and sync file_mounts up to the cluster with this command.
    $ ray rsync-up CLUSTER.YAML

    # Download the results directory from your cluster head node to your local machine on ``~/cluster_results``.
    $ ray rsync-down CLUSTER.YAML '~/ray_results' ~/cluster_results

    # Launch multiple clusters using the same configuration.
    $ ray up CLUSTER.YAML -n="cluster1"
    $ ray up CLUSTER.YAML -n="cluster2"
    $ ray up CLUSTER.YAML -n="cluster3"

Troubleshooting
---------------

Sometimes, your program may freeze. Run this to restart the Ray cluster without running any of the installation commands.

.. code-block:: bash

    $ ray up CLUSTER.YAML --restart-only

---

..
_tune-fault-tolerance-ref: How to Enable Fault Tolerance in Ray Tune ========================================= Fault tolerance is an important feature for distributed machine learning experiments that can help mitigate the impact of node failures due to out of memory and out of disk issues. With fault tolerance, users can: - **Save time and resources by preserving training progress** even if a node fails. - **Access the cost savings of preemptible spot instance nodes** in the distributed setting. .. seealso:: In a *distributed* Tune experiment, a prerequisite to enabling fault tolerance is configuring some form of persistent storage where all trial results and checkpoints can be consolidated. See :ref:`tune-storage-options`. In this guide, we will cover how to enable different types of fault tolerance offered by Ray Tune. .. _tune-experiment-level-fault-tolerance: Experiment-level Fault Tolerance in Tune ---------------------------------------- At the experiment level, :meth:`Tuner.restore ` resumes a previously interrupted experiment from where it left off. You should use :meth:`Tuner.restore ` in the following cases: 1. The driver script that calls :meth:`Tuner.fit() ` errors out (e.g., due to the head node running out of memory or out of disk). 2. The experiment is manually interrupted with ``Ctrl+C``. 3. The entire cluster, and the experiment along with it, crashes due to an ephemeral error such as the network going down or Ray object store memory filling up. .. note:: :meth:`Tuner.restore ` is *not* meant for resuming a terminated experiment and modifying hyperparameter search spaces or stopping criteria. Rather, experiment restoration is meant to resume and complete the *exact job* that was previously submitted via :meth:`Tuner.fit `. For example, consider a Tune experiment configured to run for ``10`` training iterations, where all trials have already completed. :meth:`Tuner.restore ` cannot be used to restore the experiment, change the number of training iterations to ``20``, then continue training. Instead, this should be achieved by starting a *new* experiment and initializing your model weights with a checkpoint from the previous experiment. See :ref:`this FAQ post ` for an example. .. note:: Bugs in your user-defined training loop cannot be fixed with restoration. Instead, the issue that caused the experiment to crash in the first place should be *ephemeral*, meaning that the retry attempt after restoring can succeed the next time. .. _tune-experiment-restore-example: Restore a Tune Experiment ~~~~~~~~~~~~~~~~~~~~~~~~~ Let's say your initial Tune experiment is configured as follows. The actual training loop is just for demonstration purposes: the important detail is that :ref:`saving and loading checkpoints has been implemented in the trainable `. .. literalinclude:: /tune/doc_code/fault_tolerance.py :language: python :start-after: __ft_initial_run_start__ :end-before: __ft_initial_run_end__ The results and checkpoints of the experiment are saved to ``~/ray_results/tune_fault_tolerance_guide``, as configured by :class:`~ray.tune.RunConfig`. If the experiment has been interrupted due to one of the reasons listed above, use this path to resume: .. literalinclude:: /tune/doc_code/fault_tolerance.py :language: python :start-after: __ft_restored_run_start__ :end-before: __ft_restored_run_end__ .. tip:: You can also restore the experiment from a cloud bucket path: .. 
code-block:: python tuner = tune.Tuner.restore( path="s3://cloud-bucket/tune_fault_tolerance_guide", trainable=trainable ) See :ref:`tune-storage-options`. Restore Configurations ~~~~~~~~~~~~~~~~~~~~~~ Tune allows configuring which trials should be resumed, based on their status when the experiment was interrupted: - Unfinished trials left in the ``RUNNING`` state will be resumed by default. - Trials that have ``ERRORED`` can be resumed or retried from scratch. - ``TERMINATED`` trials *cannot* be resumed. .. literalinclude:: /tune/doc_code/fault_tolerance.py :language: python :start-after: __ft_restore_options_start__ :end-before: __ft_restore_options_end__ .. _tune-experiment-autoresume-example: Auto-resume ~~~~~~~~~~~ When running in a production setting, one may want a *single script* that (1) launches the initial training run in the beginning and (2) restores the experiment if (1) already happened. Use the :meth:`Tuner.can_restore ` utility to accomplish this: .. literalinclude:: /tune/doc_code/fault_tolerance.py :language: python :start-after: __ft_restore_multiplexing_start__ :end-before: __ft_restore_multiplexing_end__ Running this script the first time will launch the initial training run. Running this script the second time will attempt to resume from the outputs of the first run. Tune Experiment Restoration with Ray Object References (Advanced) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Experiment restoration often happens in a different Ray session than the original run, in which case Ray object references are automatically garbage collected. If object references are saved along with experiment state (e.g., within each trial's config), then attempting to retrieve these objects will not work properly after restoration: the objects these references point to no longer exist. To work around this, you must re-create these objects, put them in the Ray object store, and then pass the new object references to Tune. Example ******* Let's say we have some large pre-trained model that we want to use in some way in our training loop. For example, this could be a image classification model used to calculate an Inception Score to evaluate the quality of a generative model. We may have multiple models that we want to tune over, where each trial samples one of the models to use. .. literalinclude:: /tune/doc_code/fault_tolerance.py :language: python :start-after: __ft_restore_objrefs_initial_start__ :end-before: __ft_restore_objrefs_initial_end__ To restore, we just need to re-specify the ``param_space`` via :meth:`Tuner.restore `: .. literalinclude:: /tune/doc_code/fault_tolerance.py :language: python :start-after: __ft_restore_objrefs_restored_start__ :end-before: __ft_restore_objrefs_restored_end__ .. note:: If you're tuning over :ref:`Ray Data `, you'll also need to re-specify them in the ``param_space``. Ray Data can contain object references, so the same problems described above apply. See below for an example: .. code-block:: python ds_1 = ray.data.from_items([{"x": i, "y": 2 * i} for i in range(128)]) ds_2 = ray.data.from_items([{"x": i, "y": 3 * i} for i in range(128)]) param_space = { "datasets": {"train": tune.grid_search([ds_1, ds_2])}, } tuner = tune.Tuner.restore(..., param_space=param_space) .. _tune-trial-level-fault-tolerance: Trial-level Fault Tolerance in Tune ----------------------------------- Trial-level fault tolerance deals with individual trial failures in the cluster, which can be caused by: - Running with preemptible spot instances. 
- Ephemeral network connection issues.
- Nodes running out of memory or out of disk space.

Ray Tune provides a way to configure failure handling of individual trials with the :class:`~ray.tune.FailureConfig`. Assuming that we're using the ``trainable`` from the previous example that implements trial checkpoint saving and loading, here is how to configure :class:`~ray.tune.FailureConfig`:

.. literalinclude:: /tune/doc_code/fault_tolerance.py
    :language: python
    :start-after: __ft_trial_failure_start__
    :end-before: __ft_trial_failure_end__

When a trial encounters a runtime error, the above configuration will re-schedule that trial up to ``max_failures=3`` times. Similarly, if a node failure occurs for node ``X`` (e.g., pre-empted or lost connection), this configuration will reschedule all trials that lived on node ``X`` up to ``3`` times.

Summary
-------

In this user guide, we covered how to enable experiment-level and trial-level fault tolerance in Ray Tune. See the following resources for more information:

- :ref:`tune-storage-options`
- :ref:`tune-distributed-ref`
- :ref:`tune-trial-checkpoint`

---

How does Tune work?
===================

This page provides an overview of Tune's inner workings. We describe in detail what happens when you call ``Tuner.fit()``, what the lifecycle of a Tune trial looks like, and what the architectural components of Tune are.

.. tip:: Before you continue, be sure to have read :ref:`the Tune Key Concepts page `.

What happens in ``Tuner.fit``?
------------------------------

When calling the following:

.. code-block:: python

    space = {"x": tune.uniform(0, 1)}
    tuner = tune.Tuner(
        my_trainable,
        param_space=space,
        tune_config=tune.TuneConfig(num_samples=10),
    )
    results = tuner.fit()

The provided ``my_trainable`` is evaluated multiple times in parallel with different hyperparameters (sampled from ``uniform(0, 1)``).

Every Tune run consists of a "driver process" and many "worker processes". The driver process is the Python process that calls ``Tuner.fit()`` (which calls ``ray.init()`` underneath the hood). The Tune driver process runs on the node where you run your script (which calls ``Tuner.fit()``), while Ray Tune trainable "actors" run on any node (either on the same node or, for distributed Ray, on worker nodes).

.. note:: :ref:`Ray Actors ` allow you to parallelize an instance of a class in Python. When you instantiate a class that is a Ray actor, Ray will start an instance of that class in a separate process, either on the same machine or on another machine if running a Ray cluster. This actor can then asynchronously execute method calls and maintain its own internal state.

The driver spawns parallel worker processes (:ref:`Ray actors `) that are responsible for evaluating each trial using its hyperparameter configuration and the provided trainable.

While the Trainable is executing (:ref:`trainable-execution`), the Tune Driver communicates with each actor via actor methods to receive intermediate training results and pause/stop actors (see :ref:`trial-lifecycle`).

When the Trainable terminates (or is stopped), the actor is also terminated.

.. _trainable-execution:

The execution of a trainable in Tune
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Tune uses :ref:`Ray actors ` to parallelize the evaluation of multiple hyperparameter configurations. Each actor is a Python process that executes an instance of the user-provided Trainable.

The definition of the user-provided Trainable will be :ref:`serialized via cloudpickle ` and sent to each actor process.
Each Ray actor will start an instance of the Trainable to be executed.

If the Trainable is a class, it will be executed iteratively by calling ``train/step``. After each invocation, the driver is notified that a "result dict" is ready. The driver will then pull the result via ``ray.get``.

If the trainable is a callable or a function, it will be executed on the Ray actor process on a separate execution thread. Whenever ``tune.report`` is called, the execution thread is paused and waits for the driver to pull a result (see `function_trainable.py `__). After pulling, the actor’s execution thread will automatically resume.

Resource Management in Tune
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Before running a trial, the Ray Tune driver will check whether there are available resources on the cluster (see :ref:`resource-requirements`). It will compare the available resources with the resources required by the trial.

If there is space on the cluster, then the Tune Driver will start a Ray actor (worker). This actor will be scheduled and executed on some node where the resources are available. See :doc:`tune-resources` for more information.

.. _trial-lifecycle:

Lifecycle of a Tune Trial
-------------------------

A trial's life cycle consists of 6 stages:

* **Initialization** (generation): A trial is first generated as a hyperparameter sample, and its parameters are configured according to what was provided in ``Tuner``. Trials are then placed into a queue to be executed (with status PENDING).
* **PENDING**: A pending trial is a trial to be executed on the machine. Every trial is configured with resource values. Whenever the trial’s resource values are available, Tune will run the trial (by starting a Ray actor holding the config and the training function).
* **RUNNING**: A running trial is assigned a Ray Actor. There can be multiple running trials in parallel. See the :ref:`trainable execution ` section for more details.
* **ERRORED**: If a running trial throws an exception, Tune will catch that exception and mark the trial as errored. Note that exceptions can be propagated from an actor to the main Tune driver process. If ``max_retries`` is set, Tune will set the trial back into "PENDING" and later start it from the last checkpoint.
* **TERMINATED**: A trial is terminated if it is stopped by a Stopper/Scheduler. If using the Function API, the trial is also terminated when the function stops.
* **PAUSED**: A trial can be paused by a Trial scheduler. This means that the trial’s actor will be stopped. A paused trial can later be resumed from the most recent checkpoint.

Tune's Architecture
-------------------

.. image:: ../../images/tune-arch.png

The blue boxes refer to internal components, while green boxes are public-facing.

Tune's main components consist of the :class:`~ray.tune.execution.tune_controller.TuneController`, :class:`~ray.tune.experiment.trial.Trial` objects, a :class:`~ray.tune.search.search_algorithm.SearchAlgorithm`, a :class:`~ray.tune.schedulers.trial_scheduler.TrialScheduler`, and a :class:`~ray.tune.trainable.trainable.Trainable`.

.. _trial-runner-flow:

This is an illustration of the high-level training flow and how some of the components interact:

*Note: This figure is horizontally scrollable*

.. figure:: ../../images/tune-trial-runner-flow-horizontal.png
    :class: horizontal-scroll

TuneController
~~~~~~~~~~~~~~

[`source code `__] This is the main driver of the training loop.
This component uses the TrialScheduler to prioritize and execute trials, queries the SearchAlgorithm for new configurations to evaluate, and handles the fault tolerance logic. **Fault Tolerance**: The TuneController executes checkpointing if ``checkpoint_freq`` is set, along with automatic trial restarting in case of trial failures (if ``max_failures`` is set). For example, if a node is lost while a trial (specifically, the corresponding Trainable of the trial) is still executing on that node and checkpointing is enabled, the trial will then be reverted to a ``"PENDING"`` state and resumed from the last available checkpoint when it is run. The TuneController is also in charge of checkpointing the entire experiment execution state upon each loop iteration. This allows users to restart their experiment in case of machine failure. See the docstring at :class:`~ray.tune.execution.tune_controller.TuneController`. Trial objects ~~~~~~~~~~~~~ [`source code `__] This is an internal data structure that contains metadata about each training run. Each Trial object is mapped one-to-one with a Trainable object but are not themselves distributed/remote. Trial objects transition among the following states: ``"PENDING"``, ``"RUNNING"``, ``"PAUSED"``, ``"ERRORED"``, and ``"TERMINATED"``. See the docstring at :ref:`trial-docstring`. SearchAlg ~~~~~~~~~ [`source code `__] The SearchAlgorithm is a user-provided object that is used for querying new hyperparameter configurations to evaluate. SearchAlgorithms will be notified every time a trial finishes executing one training step (of ``train()``), every time a trial errors, and every time a trial completes. TrialScheduler ~~~~~~~~~~~~~~ [`source code `__] TrialSchedulers operate over a set of possible trials to run, prioritizing trial execution given available cluster resources. TrialSchedulers are given the ability to kill or pause trials, and also are given the ability to reorder/prioritize incoming trials. Trainables ~~~~~~~~~~ [`source code `__] These are user-provided objects that are used for the training process. If a class is provided, it is expected to conform to the Trainable interface. If a function is provided. it is wrapped into a Trainable class, and the function itself is executed on a separate thread. Trainables will execute one step of ``train()`` before notifying the TrialRunner. --- A Guide To Callbacks & Metrics in Tune ====================================== .. _tune-callbacks: How to work with Callbacks in Ray Tune? --------------------------------------- Ray Tune supports callbacks that are called during various times of the training process. Callbacks can be passed as a parameter to ``RunConfig``, taken in by ``Tuner``, and the sub-method you provide will be invoked automatically. This simple callback just prints a metric each time a result is received: .. code-block:: python from ray import tune from ray.tune import Callback class MyCallback(Callback): def on_trial_result(self, iteration, trials, trial, result, **info): print(f"Got result: {result['metric']}") def train_fn(config): for i in range(10): tune.report({"metric": i}) tuner = tune.Tuner( train_fn, run_config=tune.RunConfig(callbacks=[MyCallback()])) tuner.fit() For more details and available hooks, please :ref:`see the API docs for Ray Tune callbacks `. .. _tune-autofilled-metrics: How to use log metrics in Tune? ------------------------------- You can log arbitrary values and metrics in both Function and Class training APIs: .. 
.. code-block:: python

    def trainable(config):
        for i in range(num_epochs):
            ...
            tune.report({"acc": accuracy, "metric_foo": random_metric_1, "bar": metric_2})


    class Trainable(tune.Trainable):
        def step(self):
            ...  # don't call report here!
            return dict(acc=accuracy, metric_foo=random_metric_1, bar=metric_2)

.. tip:: Note that ``tune.report()`` is not meant to transfer large amounts of data, like models or datasets. Doing so can incur large overheads and slow down your Tune run significantly.

Which Tune metrics get automatically filled in?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Tune has the concept of auto-filled metrics. During training, Tune will automatically log the below metrics in addition to any user-provided values. All of these can be used as stopping conditions or passed as a parameter to Trial Schedulers/Search Algorithms.

* ``config``: The hyperparameter configuration
* ``date``: String-formatted date and time when the result was processed
* ``done``: True if the trial has been finished, False otherwise
* ``episodes_total``: Total number of episodes (for RLlib trainables)
* ``experiment_id``: Unique experiment ID
* ``experiment_tag``: Unique experiment tag (includes parameter values)
* ``hostname``: Hostname of the worker
* ``iterations_since_restore``: The number of times ``tune.report`` has been called after restoring the worker from a checkpoint
* ``node_ip``: Host IP of the worker
* ``pid``: Process ID (PID) of the worker process
* ``time_since_restore``: Time in seconds since restoring from a checkpoint.
* ``time_this_iter_s``: Runtime of the current training iteration in seconds (i.e. one call to the trainable function or to ``_train()`` in the class API).
* ``time_total_s``: Total runtime in seconds.
* ``timestamp``: Timestamp when the result was processed
* ``timesteps_since_restore``: Number of timesteps since restoring from a checkpoint
* ``timesteps_total``: Total number of timesteps
* ``training_iteration``: The number of times ``tune.report()`` has been called
* ``trial_id``: Unique trial ID

All of these metrics can be seen in the ``Trial.last_result`` dictionary.

---

Logging and Outputs in Tune
===========================

By default, Tune logs results for TensorBoard, CSV, and JSON formats. If you need to log something lower level like model weights or gradients, see :ref:`Trainable Logging `. You can learn more about logging and customizations here: :ref:`loggers-docstring`.

.. _tune-logging:

How to configure logging in Tune?
---------------------------------

Tune will log the results of each trial to a sub-folder under a specified local dir, which defaults to ``~/ray_results``.

.. code-block:: python

    # This logs to two different trial folders:
    # ~/ray_results/trainable_name/trial_name_1 and ~/ray_results/trainable_name/trial_name_2
    # trainable_name and trial_name are autogenerated.
    tuner = tune.Tuner(trainable, tune_config=tune.TuneConfig(num_samples=2))
    results = tuner.fit()

You can specify the ``storage_path`` and an experiment ``name``:

.. code-block:: python

    # This logs to 2 different trial folders:
    # ./results/test_experiment/trial_name_1 and ./results/test_experiment/trial_name_2
    # Only trial_name is autogenerated.
    tuner = tune.Tuner(trainable,
        tune_config=tune.TuneConfig(num_samples=2),
        run_config=tune.RunConfig(storage_path="./results", name="test_experiment"))
    results = tuner.fit()

To learn more about Trials, see the detailed API documentation: :ref:`trial-docstring`.

.. _tensorboard:

How to log your Tune runs to TensorBoard?
----------------------------------------- Tune automatically outputs TensorBoard files during ``Tuner.fit()``. To visualize learning in tensorboard, install tensorboardX: .. code-block:: bash $ pip install tensorboardX Then, after you run an experiment, you can visualize your experiment with TensorBoard by specifying the output directory of your results. .. code-block:: bash $ tensorboard --logdir=~/ray_results/my_experiment If you are running Ray on a remote multi-user cluster where you do not have sudo access, you can run the following commands to make sure tensorboard is able to write to the tmp directory: .. code-block:: bash $ export TMPDIR=/tmp/$USER; mkdir -p $TMPDIR; tensorboard --logdir=~/ray_results .. image:: ../images/ray-tune-tensorboard.png If using TensorFlow ``2.x``, Tune also automatically generates TensorBoard HParams output, as shown below: .. code-block:: python tuner = tune.Tuner( ..., param_space={ "lr": tune.grid_search([1e-5, 1e-4]), "momentum": tune.grid_search([0, 0.9]) } ) results = tuner.fit() .. image:: ../../images/tune-hparams.png .. _tune-console-output: How to control console output with Tune? ---------------------------------------- User-provided fields will be outputted automatically on a best-effort basis. You can use a :ref:`Reporter ` object to customize the console output. .. code-block:: bash == Status == Memory usage on this node: 11.4/16.0 GiB Using FIFO scheduling algorithm. Resources requested: 4/12 CPUs, 0/0 GPUs, 0.0/3.17 GiB heap, 0.0/1.07 GiB objects Result logdir: /Users/foo/ray_results/myexp Number of trials: 4 (4 RUNNING) +----------------------+----------+---------------------+-----------+--------+--------+----------------+-------+ | Trial name | status | loc | param1 | param2 | acc | total time (s) | iter | |----------------------+----------+---------------------+-----------+--------+--------+----------------+-------| | MyTrainable_a826033a | RUNNING | 10.234.98.164:31115 | 0.303706 | 0.0761 | 0.1289 | 7.54952 | 15 | | MyTrainable_a8263fc6 | RUNNING | 10.234.98.164:31117 | 0.929276 | 0.158 | 0.4865 | 7.0501 | 14 | | MyTrainable_a8267914 | RUNNING | 10.234.98.164:31111 | 0.068426 | 0.0319 | 0.9585 | 7.0477 | 14 | | MyTrainable_a826b7bc | RUNNING | 10.234.98.164:31112 | 0.729127 | 0.0748 | 0.1797 | 7.05715 | 14 | +----------------------+----------+---------------------+-----------+--------+--------+----------------+-------+ .. _tune-log_to_file: How to redirect Trainable logs to files in a Tune run? --------------------------------------------------------- In Tune, Trainables are run as remote actors. By default, Ray collects actors' stdout and stderr and prints them to the head process (see :ref:`ray worker logs ` for more information). Logging that happens within Tune Trainables follows this handling by default. However, if you wish to collect Trainable logs in files for analysis, Tune offers the option ``log_to_file`` for this. This applies to print statements, ``warnings.warn`` and ``logger.info`` etc. By passing ``log_to_file=True`` to ``RunConfig``, which is taken in by ``Tuner``, stdout and stderr will be logged to ``trial_logdir/stdout`` and ``trial_logdir/stderr``, respectively: .. code-block:: python tuner = tune.Tuner( trainable, run_config=RunConfig(log_to_file=True) ) results = tuner.fit() If you would like to specify the output files, you can either pass one filename, where the combined output will be stored, or two filenames, for stdout and stderr, respectively: .. 
code-block:: python

    tuner = tune.Tuner(
        trainable,
        run_config=RunConfig(log_to_file="std_combined.log")
    )
    tuner.fit()

    tuner = tune.Tuner(
        trainable,
        run_config=RunConfig(log_to_file=("my_stdout.log", "my_stderr.log")))
    results = tuner.fit()

The file names are relative to the trial's logdir. You can pass absolute paths, too.

Caveats
^^^^^^^

Logging that happens in distributed training workers (if you happen to use Ray Tune together with Ray Train) is not part of this ``log_to_file`` configuration.

Where to find ``log_to_file`` files?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If your Tune workload is configured with syncing to the head node, then the corresponding ``log_to_file`` outputs can be located under each trial folder. If your Tune workload is instead configured with syncing to cloud, then the corresponding ``log_to_file`` outputs are *NOT* synced to cloud and can only be found on the worker nodes where the corresponding trials ran.

.. note:: This can cause problems when the trainable is moved across different nodes throughout its lifetime. This can happen with some schedulers or with node failures. We may prioritize syncing these files to cloud storage if there are enough user requests. If this impacts your workflow, consider commenting on `this ticket <https://github.com/ray-project/ray/issues/32142>`__.

Leave us feedback on this feature
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

We know that logging and observability can be a huge boost for your workflow. Let us know your preferred way to interact with logging that happens in trainables. Leave your comments in `this ticket <https://github.com/ray-project/ray/issues/32142>`__.

.. _trainable-logging:

How do you log arbitrary files from a Tune Trainable?
-----------------------------------------------------

By default, Tune only logs the *training result dictionaries* and *checkpoints* from your Trainable. However, you may want to save a file that visualizes the model weights or model graph, or use a custom logging library that requires multi-process logging. For example, you may want to do this if you're trying to log images to TensorBoard. We refer to these saved files as **trial artifacts**.

.. note:: If :class:`SyncConfig(sync_artifacts=True) ` is set, trial artifacts are uploaded periodically from each trial (or from each remote training worker for Ray Train) to the :class:`RunConfig(storage_path) `. See the :class:`~ray.tune.SyncConfig` API reference for artifact syncing configuration options.

You can save trial artifacts directly in the trainable, as shown below:

.. tip:: Make sure that any logging calls or objects stay within the scope of the Trainable. You may see pickling or other serialization errors or inconsistent logs otherwise.

.. tab-set::

    .. tab-item:: Function API

        .. code-block:: python

            import os

            import logging_library  # ex: mlflow, wandb
            from ray import tune

            def trainable(config):
                # trial_id and results are placeholders for this example:
                # the ID of the current trial and the metrics you want to log.
                logging_library.init(
                    name=trial_id,
                    id=trial_id,
                    resume=trial_id,
                    reinit=True,
                    allow_val_change=True)
                logging_library.set_log_path(os.getcwd())

                for step in range(100):
                    logging_library.log_model(...)
                    logging_library.log(results, step=step)

                    # You can also just write to a file directly.
                    # The working directory is set to the trial directory, so
                    # you don't need to worry about multiple workers saving
                    # to the same location.
                    with open(f"./artifact_{step}.txt", "w") as f:
                        f.write("Artifact Data")

                    tune.report(results)

    .. tab-item:: Class API
        .. code-block:: python

            import os

            import logging_library  # ex: mlflow, wandb
            from ray import tune

            class CustomLogging(tune.Trainable):
                def setup(self, config):
                    trial_id = self.trial_id
                    logging_library.init(
                        name=trial_id,
                        id=trial_id,
                        resume=trial_id,
                        reinit=True,
                        allow_val_change=True
                    )
                    logging_library.set_log_path(os.getcwd())

                def step(self):
                    logging_library.log_model(...)

                    # You can also write to a file directly.
                    # The working directory is set to the trial directory, so
                    # you don't need to worry about multiple workers saving
                    # to the same location.
                    with open(f"./artifact_{self.iteration}.txt", "w") as f:
                        f.write("Artifact Data")

                def log_result(self, result):
                    res_dict = {
                        str(k): v
                        for k, v in result.items()
                        if (v and "config" not in k and not isinstance(v, str))
                    }
                    step = result["training_iteration"]
                    logging_library.log(res_dict, step=step)

In the code snippet above, ``logging_library`` refers to whatever third-party logging library you are using. Note that ``logging_library.set_log_path(os.getcwd())`` is an imaginary API that we are using for demonstration purposes, and it highlights that the third-party library should be configured to log to the Trainable's *working directory*. By default, the current working directory of both functional and class trainables is set to the corresponding trial directory once it's been launched as a remote Ray actor.

How to Build Custom Tune Loggers?
---------------------------------

You can create a custom logger by inheriting the LoggerCallback interface (:ref:`logger-interface`):

.. code-block:: python

    from typing import Dict, List

    import json
    import os

    from ray.tune.logger import LoggerCallback


    class CustomLoggerCallback(LoggerCallback):
        """Custom logger interface"""

        def __init__(self, filename: str = "log.txt"):
            self._trial_files = {}
            self._filename = filename

        def log_trial_start(self, trial: "Trial"):
            trial_logfile = os.path.join(trial.logdir, self._filename)
            self._trial_files[trial] = open(trial_logfile, "at")

        def log_trial_result(self, iteration: int, trial: "Trial", result: Dict):
            if trial in self._trial_files:
                self._trial_files[trial].write(json.dumps(result))

        def on_trial_complete(self, iteration: int, trials: List["Trial"],
                              trial: "Trial", **info):
            if trial in self._trial_files:
                self._trial_files[trial].close()
                del self._trial_files[trial]

You can then pass in your own logger as follows:

.. code-block:: python

    from ray import tune

    tuner = tune.Tuner(
        MyTrainableClass,
        run_config=tune.RunConfig(
            name="experiment_name",
            callbacks=[CustomLoggerCallback("log_test.txt")]
        )
    )
    results = tuner.fit()

By default, Ray Tune creates JSON, CSV and TensorBoardX logger callbacks if you don't pass them yourself. You can disable this behavior by setting the ``TUNE_DISABLE_AUTO_CALLBACK_LOGGERS`` environment variable to ``"1"``.

An example of creating a custom logger can be found in :doc:`/tune/examples/includes/logging_example`.

---

.. _tune-parallelism:

A Guide To Parallelism and Resources for Ray Tune
-------------------------------------------------

Parallelism is determined by per-trial resources (defaulting to 1 CPU, 0 GPU per trial) and the resources available to Tune (``ray.cluster_resources()``).

By default, Tune automatically runs `N` concurrent trials, where `N` is the number of CPUs (cores) on your machine.

.. code-block:: python

    # If you have 4 CPUs on your machine, this will run 4 concurrent trials at a time.
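    # Tune requests 1 CPU and 0 GPUs per trial by default,
    # so the default parallelism equals the number of available CPUs.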
    tuner = tune.Tuner(
        trainable,
        tune_config=tune.TuneConfig(num_samples=10)
    )
    results = tuner.fit()

You can override these per-trial resources with :func:`tune.with_resources `. Here you can specify your resource requests using either a dictionary or a :class:`PlacementGroupFactory ` object. In either case, Ray Tune will try to start a placement group for each trial.

.. code-block:: python

    # If you have 4 CPUs on your machine, this will run 2 concurrent trials at a time.
    trainable_with_resources = tune.with_resources(trainable, {"cpu": 2})
    tuner = tune.Tuner(
        trainable_with_resources,
        tune_config=tune.TuneConfig(num_samples=10)
    )
    results = tuner.fit()

    # If you have 4 CPUs on your machine, this will run 1 trial at a time.
    trainable_with_resources = tune.with_resources(trainable, {"cpu": 4})
    tuner = tune.Tuner(
        trainable_with_resources,
        tune_config=tune.TuneConfig(num_samples=10)
    )
    results = tuner.fit()

    # Fractional values are also supported (e.g., {"cpu": 0.5}).
    # If you have 4 CPUs on your machine, this will run 8 concurrent trials at a time.
    trainable_with_resources = tune.with_resources(trainable, {"cpu": 0.5})
    tuner = tune.Tuner(
        trainable_with_resources,
        tune_config=tune.TuneConfig(num_samples=10)
    )
    results = tuner.fit()

    # Custom resource allocation via lambda functions is also supported,
    # e.g., if you want to allocate GPU resources to trials based on a setting in your config:
    trainable_with_resources = tune.with_resources(trainable,
        resources=lambda spec: {"gpu": 1} if spec.config.use_gpu else {"gpu": 0})
    tuner = tune.Tuner(
        trainable_with_resources,
        tune_config=tune.TuneConfig(num_samples=10)
    )
    results = tuner.fit()

Tune allocates the GPU and CPU resources specified by ``tune.with_resources`` to each individual trial. Even if the trial cannot be scheduled right now, Ray Tune will still try to start the respective placement group. If not enough resources are available, this will trigger :ref:`autoscaling behavior ` if you're using the Ray cluster launcher.

It is also possible to specify memory (``"memory"``, in bytes) and custom resource requirements.

If your trainable function starts more remote workers, you will need to pass so-called placement group factory objects to request these resources. See the :class:`PlacementGroupFactory documentation ` for further information. This also applies if you are using other libraries making use of Ray, such as Modin. Failure to set resources correctly may result in a deadlock, "hanging" the cluster.

.. note:: The resources specified this way will only be allocated for scheduling Tune trials. These resources will not be enforced on your objective function (Tune trainable) automatically. You will have to make sure your trainable has enough resources to run (e.g. by setting ``n_jobs`` for a scikit-learn model accordingly).

How to leverage GPUs in Tune?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To leverage GPUs, you must set ``gpu`` in ``tune.with_resources(trainable, resources_per_trial)``. This will automatically set ``CUDA_VISIBLE_DEVICES`` for each trial.

.. code-block:: python

    # If you have 8 GPUs, this will run 8 trials at once.
    trainable_with_gpu = tune.with_resources(trainable, {"gpu": 1})
    tuner = tune.Tuner(
        trainable_with_gpu,
        tune_config=tune.TuneConfig(num_samples=10)
    )
    results = tuner.fit()

    # If you have 4 CPUs and 1 GPU on your machine, this will run 1 trial at a time.
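    # The single GPU is the limiting resource here, so only one trial's
    # placement group can be scheduled at a time.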
    trainable_with_cpu_gpu = tune.with_resources(trainable, {"cpu": 2, "gpu": 1})
    tuner = tune.Tuner(
        trainable_with_cpu_gpu,
        tune_config=tune.TuneConfig(num_samples=10)
    )
    results = tuner.fit()

You can find an example of this in the :doc:`Keras MNIST example `.

.. warning:: If ``gpu`` is not set, the ``CUDA_VISIBLE_DEVICES`` environment variable will be set to empty, disallowing GPU access.

**Troubleshooting**: Occasionally, you may run into GPU memory issues when running a new trial. This may be due to the previous trial not cleaning up its GPU state fast enough. To avoid this, you can use :func:`tune.utils.wait_for_gpu `.

.. _tune-dist-training:

How to run distributed training with Tune?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To tune distributed training jobs, you can use Ray Tune with Ray Train. Ray Tune will run multiple trials in parallel, with each trial running distributed training with Ray Train.

For more details, see :ref:`Ray Train Hyperparameter Optimization `.

How to limit concurrency in Tune?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To specify the maximum number of trials to run concurrently, set `max_concurrent_trials` in :class:`TuneConfig `.

Note that the actual parallelism can be less than `max_concurrent_trials` and is determined by how many trials can fit in the cluster at once (e.g., if you have a trial that requires 16 GPUs, your cluster has 32 GPUs, and `max_concurrent_trials=10`, the `Tuner` can only run 2 trials concurrently).

.. code-block:: python

    from ray.tune import TuneConfig

    config = TuneConfig(
        # ...
        num_samples=100,
        max_concurrent_trials=10,
    )

---

.. _tune-parallel-experiments-guide:

Running Basic Tune Experiments
==============================

The most common way to use Tune is also the simplest: as a parallel experiment runner. If you can define experiment trials in a Python function, you can use Tune to run hundreds to thousands of independent trial instances in a cluster. Tune manages trial execution, status reporting, and fault tolerance.

Running Independent Tune Trials in Parallel
-------------------------------------------

As a general example, let's consider executing ``N`` independent model training trials using Tune as a simple grid sweep. Each trial can execute different code depending on a passed-in config dictionary.

**Step 1:** First, we define the model training function that we want to run variations of. The function takes in a config dictionary as argument, and returns a simple dict output. Learn more about logging Tune results at :ref:`tune-logging`.

.. literalinclude:: ../doc_code/tune.py
    :language: python
    :start-after: __step1_begin__
    :end-before: __step1_end__

**Step 2:** Next, define the space of trials to run. Here, we define a simple grid sweep from ``0..NUM_MODELS``, which will generate the config dicts to be passed to each model function. Learn more about what features Tune offers for defining spaces at :ref:`tune-search-space-tutorial`.

.. literalinclude:: ../doc_code/tune.py
    :language: python
    :start-after: __step2_begin__
    :end-before: __step2_end__

**Step 3:** Optionally, configure the resources allocated per trial. Tune uses this resource allocation to control parallelism. For example, if each trial is configured to use 4 CPUs and the cluster has only 32 CPUs, then Tune limits the number of concurrent trials to 8 to avoid overloading the cluster. For more information, see :ref:`tune-parallelism`.

..
literalinclude:: ../doc_code/tune.py :language: python :start-after: __step3_begin__ :end-before: __step3_end__ **Step 4:** Run the trial with Tune. Tune will report on experiment status, and after the experiment finishes, you can inspect the results. Tune can retry failed trials automatically, as well as entire experiments; see :ref:`tune-stopping-guide`. .. literalinclude:: ../doc_code/tune.py :language: python :start-after: __step4_begin__ :end-before: __step4_end__ **Step 5:** Inspect results. They will look something like this. Tune periodically prints a status summary to stdout showing the ongoing experiment status, until it finishes: .. code:: == Status == Current time: 2022-09-21 10:19:34 (running for 00:00:04.54) Memory usage on this node: 6.9/31.1 GiB Using FIFO scheduling algorithm. Resources requested: 0/8 CPUs, 0/0 GPUs, 0.0/16.13 GiB heap, 0.0/8.06 GiB objects Result logdir: /home/ubuntu/ray_results/train_model_2022-09-21_10-19-26 Number of trials: 100/100 (100 TERMINATED) +-------------------------+------------+----------------------+------------+--------+------------------+ | Trial name | status | loc | model_id | iter | total time (s) | |-------------------------+------------+----------------------+------------+--------+------------------| | train_model_8d627_00000 | TERMINATED | 192.168.1.67:2381731 | model_0 | 1 | 8.46386e-05 | | train_model_8d627_00001 | TERMINATED | 192.168.1.67:2381761 | model_1 | 1 | 0.000126362 | | train_model_8d627_00002 | TERMINATED | 192.168.1.67:2381763 | model_2 | 1 | 0.000112772 | ... | train_model_8d627_00097 | TERMINATED | 192.168.1.67:2381731 | model_97 | 1 | 5.57899e-05 | | train_model_8d627_00098 | TERMINATED | 192.168.1.67:2381767 | model_98 | 1 | 6.05583e-05 | | train_model_8d627_00099 | TERMINATED | 192.168.1.67:2381763 | model_99 | 1 | 6.69956e-05 | +-------------------------+------------+----------------------+------------+--------+------------------+ 2022-09-21 10:19:35,159 INFO tune.py:762 -- Total run time: 5.06 seconds (4.46 seconds for the tuning loop). The final result objects contain finished trial metadata: .. code:: Result(metrics={'score': 'model_0', 'other_data': Ellipsis, 'done': True, 'trial_id': '8d627_00000', 'experiment_tag': '0_model_id=model_0'}, error=None, log_dir=PosixPath('/home/ubuntu/ray_results/train_model_2022-09-21_10-19-26/train_model_8d627_00000_0_model_id=model_0_2022-09-21_10-19-30')) Result(metrics={'score': 'model_1', 'other_data': Ellipsis, 'done': True, 'trial_id': '8d627_00001', 'experiment_tag': '1_model_id=model_1'}, error=None, log_dir=PosixPath('/home/ubuntu/ray_results/train_model_2022-09-21_10-19-26/train_model_8d627_00001_1_model_id=model_1_2022-09-21_10-19-31')) Result(metrics={'score': 'model_2', 'other_data': Ellipsis, 'done': True, 'trial_id': '8d627_00002', 'experiment_tag': '2_model_id=model_2'}, error=None, log_dir=PosixPath('/home/ubuntu/ray_results/train_model_2022-09-21_10-19-26/train_model_8d627_00002_2_model_id=model_2_2022-09-21_10-19-31')) How does Tune compare to using Ray Core (``ray.remote``)? ---------------------------------------------------------- You might be wondering how Tune differs from simply using :ref:`ray-remote-functions` for parallel trial execution. Indeed, the above example could be re-written similarly as: .. 
literalinclude:: ../doc_code/tune.py :language: python :start-after: __tasks_begin__ :end-before: __tasks_end__ Compared to using Ray tasks, Tune offers the following additional functionality: * Status reporting and tracking, including integrations and callbacks to common monitoring tools. * Checkpointing of trials for fine-grained fault-tolerance. * Gang scheduling of multi-worker trials. In short, consider using Tune if you need status tracking or support for more advanced ML workloads. --- :orphan: Scalability and Overhead Benchmarks for Ray Tune ================================================ We conducted a series of micro-benchmarks where we evaluated the scalability of Ray Tune and analyzed the performance overhead we observed. The results from these benchmarks are reflected in the documentation, e.g. when we make suggestions on :ref:`how to remove performance bottlenecks `. This page gives an overview over the experiments we did. For each of these experiments, the goal was to examine the total runtime of the experiment and address issues when the observed overhead compared to the minimal theoretical time was too high (e.g. more than 20% overhead). In some of the experiments we tweaked the default settings for maximum throughput, e.g. by disabling trial synchronization or result logging. If this is the case, this is stated in the respective benchmark description. .. list-table:: Ray Tune scalability benchmarks overview :header-rows: 1 * - Variable - # of trials - Results/second /trial - # of nodes - # CPUs/node - Trial length (s) - Observed runtime * - `Trial bookkeeping /scheduling overhead `_ - 10,000 - 1 - 1 - 16 - 1 - | 715.27 | (625 minimum) * - `Result throughput (many trials) `_ - 1,000 - 0.1 - 16 - 64 - 100 - 168.18 * - `Result throughput (many results) `_ - 96 - 10 - 1 - 96 - 100 - 168.94 * - `Network communication overhead `_ - 200 - 1 - 200 - 2 - 300 - 2280.82 * - `Long running, 3.75 GB checkpoints `_ - 16 - | Results: 1/60 | Checkpoint: 1/900 - 1 - 16 - 86,400 - 88687.41 * - `Durable trainable `_ - 16 - | 10/60 | with 10MB CP - 16 - 2 - 300 - 392.42 Below we discuss some insights on results where we observed much overhead. Result throughput ----------------- Result throughput describes the number of results Ray Tune can process in a given timeframe (e.g. "results per second"). The higher the throughput, the more concurrent results can be processed without major delays. Result throughput is limited by the time it takes to process results. When a trial reports results, it only continues training once the trial executor re-triggered the remote training function. If many trials report results at the same time, each subsequent remote training call is only triggered after handling that trial's results. To speed the process up, Ray Tune adaptively buffers results, so that trial training is continued earlier if many trials are running in parallel and report many results at the same time. Still, processing hundreds of results per trial for dozens or hundreds of trials can become a bottleneck. **Main insight**: Ray Tune will throw a warning when trial processing becomes a bottleneck. If you notice that this becomes a problem, please follow our guidelines outlined :ref:`in the FAQ `. Generally, it is advised to not report too many results at the same time. Consider increasing the report intervals by a factor of 5-10x. Below we present more detailed results on the result throughput performance. 
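As a minimal sketch of how to increase report intervals (the ``train_fn`` name and the ``REPORT_EVERY`` value below are illustrative, not part of the benchmark setup), you can aggregate metrics over several training steps and call ``tune.report`` once per interval instead of once per step:

.. code-block:: python

    from ray import tune

    REPORT_EVERY = 10  # report once every 10 steps instead of every step


    def train_fn(config):
        running_loss = 0.0
        for step in range(1, 1001):
            loss = 0.1  # placeholder for the real per-step training loss
            running_loss += loss
            if step % REPORT_EVERY == 0:
                # Report one aggregated result per interval, which reduces the
                # number of results the Tune driver has to process.
                tune.report({"loss": running_loss / REPORT_EVERY})
                running_loss = 0.0

Reporting aggregated results this way cuts the per-second result volume by the same factor, which is usually enough to stay below the throughput limits observed in the benchmarks below.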
Benchmarking many concurrent Tune trials """""""""""""""""""""""""""""""""""""""" In this setup, loggers (CSV, JSON, and TensorBoardX) and trial synchronization are disabled, except when explicitly noted. In this experiment, we're running many concurrent trials (up to 1,000) on a cluster. We then adjust the reporting frequency (number of results per second) of the trials to measure the throughput limits. It seems that around 500 total results/second seem to be the threshold for acceptable performance when logging and synchronization are disabled. With logging enabled, around 50-100 results per second can still be managed without too much overhead, but after that measures to decrease incoming results should be considered. +-------------+--------------------------+---------+---------------+------------------+---------+ | # of trials | Results / second / trial | # Nodes | # CPUs / Node | Length of trial. | Current | +=============+==========================+=========+===============+==================+=========+ | 1,000 | 10 | 16 | 64 | 100s | 248.39 | +-------------+--------------------------+---------+---------------+------------------+---------+ | 1,000 | 1 | 16 | 64 | 100s | 175.00 | +-------------+--------------------------+---------+---------------+------------------+---------+ | 1,000 | 0.1 with logging | 16 | 64 | 100s | 168.18 | +-------------+--------------------------+---------+---------------+------------------+---------+ | 384 | 10 | 16 | 64 | 100s | 125.17 | +-------------+--------------------------+---------+---------------+------------------+---------+ | 256 | 50 | 16 | 64 | 100s | 307.02 | +-------------+--------------------------+---------+---------------+------------------+---------+ | 256 | 20 | 16 | 64 | 100s | 146.20 | +-------------+--------------------------+---------+---------------+------------------+---------+ | 256 | 10 | 16 | 64 | 100s | 113.40 | +-------------+--------------------------+---------+---------------+------------------+---------+ | 256 | 10 with logging | 16 | 64 | 100s | 436.12 | +-------------+--------------------------+---------+---------------+------------------+---------+ | 256 | 0.1 with logging | 16 | 64 | 100s | 106.75 | +-------------+--------------------------+---------+---------------+------------------+---------+ Benchmarking many Tune results on a single node """"""""""""""""""""""""""""""""""""""""""""""" In this setup, loggers (CSV, JSON, and TensorBoardX) are disabled, except when explicitly noted. In this experiment, we're running 96 concurrent trials on a single node. We then adjust the reporting frequency (number of results per second) of the trials to find the throughput limits. Compared to the cluster experiment setup, we report much more often, as we're running less total trials in parallel. On a single node, throughput seems to be a bit higher. With logging, handling 1000 results per second seems acceptable in terms of overhead, though you should probably still target for a lower number. +-------------+--------------------------+---------+---------------+------------------+---------+ | # of trials | Results / second / trial | # Nodes | # CPUs / Node | Length of trial. 
| Current | +=============+==========================+=========+===============+==================+=========+ | 96 | 500 | 1 | 96 | 100s | 959.32 | +-------------+--------------------------+---------+---------------+------------------+---------+ | 96 | 100 | 1 | 96 | 100s | 219.48 | +-------------+--------------------------+---------+---------------+------------------+---------+ | 96 | 80 | 1 | 96 | 100s | 197.15 | +-------------+--------------------------+---------+---------------+------------------+---------+ | 96 | 50 | 1 | 96 | 100s | 110.55 | +-------------+--------------------------+---------+---------------+------------------+---------+ | 96 | 50 with logging | 1 | 96 | 100s | 702.64 | +-------------+--------------------------+---------+---------------+------------------+---------+ | 96 | 10 | 1 | 96 | 100s | 103.51 | +-------------+--------------------------+---------+---------------+------------------+---------+ | 96 | 10 with logging | 1 | 96 | 100s | 168.94 | +-------------+--------------------------+---------+---------------+------------------+---------+ Network overhead in Ray Tune ---------------------------- Running Ray Tune on a distributed setup leads to network communication overhead. This is mostly due to trial synchronization, where results and checkpoints are periodically synchronized and sent via the network. Per default this happens via SSH, where connection initialization can take between 1 and 2 seconds each time. Since this is a blocking operation that happens on a per-trial basis, running many concurrent trials quickly becomes bottlenecked by this synchronization. In this experiment, we ran a number of trials on a cluster. Each trial was run on a separate node. We varied the number of concurrent trials (and nodes) to see how much network communication affects total runtime. **Main insight**: When running many concurrent trials in a distributed setup, consider using :ref:`cloud checkpointing ` for checkpoint synchronization instead. Another option would be to use a shared storage and disable syncing to driver. The best practices are described :ref:`here for Kubernetes setups ` but is applicable for any kind of setup. In the table below we present more detailed results on the network communication overhead. +-------------+--------------------------+---------+---------------+------------------+---------+ | # of trials | Results / second / trial | # Nodes | # CPUs / Node | Length of trial | Current | +=============+==========================+=========+===============+==================+=========+ | 200 | 1 | 200 | 2 | 300s | 2280.82 | +-------------+--------------------------+---------+---------------+------------------+---------+ | 100 | 1 | 100 | 2 | 300s | 1470 | +-------------+--------------------------+---------+---------------+------------------+---------+ | 100 | 0.01 | 100 | 2 | 300s | 473.41 | +-------------+--------------------------+---------+---------------+------------------+---------+ | 50 | 1 | 50 | 2 | 300s | 474.30 | +-------------+--------------------------+---------+---------------+------------------+---------+ | 50 | 0.1 | 50 | 2 | 300s | 441.54 | +-------------+--------------------------+---------+---------------+------------------+---------+ | 10 | 1 | 10 | 2 | 300s | 334.37 | +-------------+--------------------------+---------+---------------+------------------+---------+ --- .. _tune-search-space-tutorial: Working with Tune Search Spaces =============================== Tune has a native interface for specifying search spaces. 
You can specify the search space via ``Tuner(param_space=...)``. You can either use the ``tune.grid_search`` primitive for grid search:

.. code-block:: python

    tuner = tune.Tuner(
        trainable,
        param_space={"bar": tune.grid_search([True, False])})
    results = tuner.fit()

Or you can use one of the random sampling primitives to specify distributions (:doc:`/tune/api/search_space`):

.. code-block:: python

    tuner = tune.Tuner(
        trainable,
        param_space={
            "param1": tune.choice([True, False]),
            "bar": tune.uniform(0, 10),
            "alpha": tune.sample_from(lambda _: np.random.uniform(100) ** 2),
            "const": "hello"  # It is also ok to specify constant values.
        })
    results = tuner.fit()

.. caution:: If you use a SearchAlgorithm, you may not be able to specify lambdas or grid search with this interface, as some search algorithms may not be compatible.

To run multiple trials (i.e., sample the search space multiple times), specify ``tune.TuneConfig(num_samples=N)``. If ``grid_search`` is provided as an argument, the *same* grid will be repeated ``N`` times.

.. code-block:: python

    # 13 different configs.
    tuner = tune.Tuner(trainable, tune_config=tune.TuneConfig(num_samples=13), param_space={
        "x": tune.choice([0, 1, 2]),
        }
    )
    tuner.fit()

    # 13 different configs.
    tuner = tune.Tuner(trainable, tune_config=tune.TuneConfig(num_samples=13), param_space={
        "x": tune.choice([0, 1, 2]),
        "y": tune.randn(0, 1),
        }
    )
    tuner.fit()

    # 4 different configs.
    tuner = tune.Tuner(trainable, tune_config=tune.TuneConfig(num_samples=1), param_space={"x": tune.grid_search([1, 2, 3, 4])})
    tuner.fit()

    # 3 different configs.
    tuner = tune.Tuner(trainable, tune_config=tune.TuneConfig(num_samples=1), param_space={"x": tune.grid_search([1, 2, 3])})
    tuner.fit()

    # 6 different configs.
    tuner = tune.Tuner(trainable, tune_config=tune.TuneConfig(num_samples=2), param_space={"x": tune.grid_search([1, 2, 3])})
    tuner.fit()

    # 9 different configs.
    tuner = tune.Tuner(trainable, tune_config=tune.TuneConfig(num_samples=1), param_space={
        "x": tune.grid_search([1, 2, 3]),
        "y": tune.grid_search(["a", "b", "c"])}
    )
    tuner.fit()

    # 18 different configs.
    tuner = tune.Tuner(trainable, tune_config=tune.TuneConfig(num_samples=2), param_space={
        "x": tune.grid_search([1, 2, 3]),
        "y": tune.grid_search(["a", "b", "c"])}
    )
    tuner.fit()

    # 45 different configs.
    tuner = tune.Tuner(trainable, tune_config=tune.TuneConfig(num_samples=5), param_space={
        "x": tune.grid_search([1, 2, 3]),
        "y": tune.grid_search(["a", "b", "c"])}
    )
    tuner.fit()

Note that grid search and random search primitives are interoperable. Each can be used independently or in combination with each other.

.. code-block:: python

    # 6 different configs.
    tuner = tune.Tuner(trainable, tune_config=tune.TuneConfig(num_samples=2), param_space={
        "x": tune.sample_from(...),
        "y": tune.grid_search(["a", "b", "c"])
        }
    )
    tuner.fit()

In the example below, ``num_samples=10`` repeats the 3x3 grid search 10 times, for a total of 90 trials, each with randomly sampled values of ``alpha`` and ``beta``.

.. code-block:: python
    :emphasize-lines: 12

    tuner = tune.Tuner(
        my_trainable,
        run_config=tune.RunConfig(name="my_trainable"),
        # num_samples will repeat the entire config 10 times.
        tune_config=tune.TuneConfig(num_samples=10),
        param_space={
            # ``sample_from`` creates a generator to call the lambda once per trial.
            "alpha": tune.sample_from(lambda spec: np.random.uniform(100)),
            # ``sample_from`` also supports "conditional search spaces"
            "beta": tune.sample_from(lambda spec: spec.config.alpha * np.random.normal()),
            "nn_layers": [
                # tune.grid_search will make it so that all values are evaluated.
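                # Two grid searches of 3 values each -> 3 x 3 = 9 grid combinations,
                # and num_samples=10 repeats the grid for 9 x 10 = 90 trials in total.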
                tune.grid_search([16, 64, 256]),
                tune.grid_search([16, 64, 256]),
            ],
        },
    )
    tuner.fit()

.. tip:: Avoid passing large objects as values in the search space, as that will incur a performance overhead. Use :func:`tune.with_parameters ` to pass large objects in or load them inside your trainable from disk (making sure that all nodes have access to the files) or cloud storage. See :ref:`tune-bottlenecks` for more information.

.. _tune_custom-search:

How to use Custom and Conditional Search Spaces in Tune?
--------------------------------------------------------

You'll often run into awkward search spaces (i.e., when one hyperparameter depends on another). Use ``tune.sample_from(func)`` to provide a **custom** callable function for generating a search space.

The parameter ``func`` should take in a ``spec`` object, which has a ``config`` namespace from which you can access other hyperparameters. This is useful for conditional distributions:

.. code-block:: python

    tuner = tune.Tuner(
        ...,
        param_space={
            # A random function
            "alpha": tune.sample_from(lambda _: np.random.uniform(100)),
            # Use the `spec.config` namespace to access other hyperparameters
            "beta": tune.sample_from(lambda spec: spec.config.alpha * np.random.normal())
        }
    )
    tuner.fit()

Here's an example showing a grid search over two nested parameters combined with random sampling from two lambda functions, generating 9 different trials. Note that the value of ``beta`` depends on the value of ``alpha``, which is represented by referencing ``spec.config.alpha`` in the lambda function. This lets you specify conditional parameter distributions.

.. code-block:: python
    :emphasize-lines: 4-11

    tuner = tune.Tuner(
        my_trainable,
        run_config=RunConfig(name="my_trainable"),
        param_space={
            "alpha": tune.sample_from(lambda spec: np.random.uniform(100)),
            "beta": tune.sample_from(lambda spec: spec.config.alpha * np.random.normal()),
            "nn_layers": [
                tune.grid_search([16, 64, 256]),
                tune.grid_search([16, 64, 256]),
            ],
        }
    )

.. note:: This format is not supported by every SearchAlgorithm; only some SearchAlgorithms, like :ref:`HyperOpt ` and :ref:`Optuna `, handle conditional search spaces at all. In order to use conditional search spaces with :ref:`HyperOpt `, a `Hyperopt search space `_ is necessary. :ref:`Optuna ` supports conditional search spaces through its define-by-run interface (:doc:`/tune/examples/optuna_example`).

---

.. _tune-stopping-guide:
.. _tune-stopping-ref:

How to Define Stopping Criteria for a Ray Tune Experiment
=========================================================

When running a Tune experiment, it can be challenging to determine the ideal duration of training beforehand. Stopping criteria in Tune can be useful for terminating training based on specific conditions.

For instance, one may want to set up the experiment to stop under the following circumstances:

1. Set up an experiment to end after ``N`` epochs or when the reported evaluation score surpasses a particular threshold, whichever occurs first.
2. Stop the experiment after ``T`` seconds.
3. Terminate when trials encounter runtime errors.
4. Stop underperforming trials early by utilizing Tune's early-stopping schedulers.

This user guide will illustrate how to achieve these types of stopping criteria in a Tune experiment.

For all the code examples, we use the following training function for demonstration:

..
literalinclude:: /tune/doc_code/stopping.py :language: python :start-after: __stopping_example_trainable_start__ :end-before: __stopping_example_trainable_end__ Stop a Tune experiment manually ------------------------------- If you send a ``SIGINT`` signal to the process running :meth:`Tuner.fit() ` (which is usually what happens when you press ``Ctrl+C`` in the terminal), Ray Tune shuts down training gracefully and saves the final experiment state. .. note:: Forcefully terminating a Tune experiment, for example, through multiple ``Ctrl+C`` commands, will not give Tune the opportunity to snapshot the experiment state one last time. If you resume the experiment in the future, this could result in resuming with stale state. Ray Tune also accepts the ``SIGUSR1`` signal to interrupt training gracefully. This should be used when running Ray Tune in a remote Ray task as Ray will filter out ``SIGINT`` and ``SIGTERM`` signals per default. Stop using metric-based criteria -------------------------------- In addition to manual stopping, Tune provides several ways to stop experiments programmatically. The simplest way is to use metric-based criteria. These are a fixed set of thresholds that determine when the experiment should stop. You can implement the stopping criteria using either a dictionary, a function, or a custom :class:`Stopper `. .. tab-set:: .. tab-item:: Dictionary If a dictionary is passed in, the keys may be any field in the return result of ``tune.report`` in the Function API or ``step()`` in the Class API. .. note:: This includes :ref:`auto-filled metrics ` such as ``training_iteration``. In the example below, each trial will be stopped either when it completes ``10`` iterations or when it reaches a mean accuracy of ``0.8`` or more. These metrics are assumed to be **increasing**, so the trial will stop once the reported metric has exceeded the threshold specified in the dictionary. .. literalinclude:: /tune/doc_code/stopping.py :language: python :start-after: __stopping_dict_start__ :end-before: __stopping_dict_end__ .. tab-item:: User-defined Function For more flexibility, you can pass in a function instead. If a function is passed in, it must take ``(trial_id: str, result: dict)`` as arguments and return a boolean (``True`` if trial should be stopped and ``False`` otherwise). In the example below, each trial will be stopped either when it completes ``10`` iterations or when it reaches a mean accuracy of ``0.8`` or more. .. literalinclude:: /tune/doc_code/stopping.py :language: python :start-after: __stopping_fn_start__ :end-before: __stopping_fn_end__ .. tab-item:: Custom Stopper Class Finally, you can implement the :class:`~ray.tune.stopper.Stopper` interface for stopping individual trials or even entire experiments based on custom stopping criteria. For example, the following example stops all trials after the criteria is achieved by any individual trial and prevents new ones from starting: .. literalinclude:: /tune/doc_code/stopping.py :language: python :start-after: __stopping_cls_start__ :end-before: __stopping_cls_end__ In the example, once any trial reaches a ``mean_accuracy`` of 0.8 or more, all trials will stop. .. note:: When returning ``True`` from ``stop_all``, currently running trials will not stop immediately. They will stop after finishing their ongoing training iteration (after ``tune.report`` or ``step``). Ray Tune comes with a set of out-of-the-box stopper classes. See the :ref:`Stopper ` documentation. 
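The stopping snippets above are included from an external example file; here is a minimal, self-contained sketch of the dictionary form, assuming the stop dictionary is passed through ``RunConfig(stop=...)`` (``train_fn`` is an illustrative stand-in for your own trainable):

.. code-block:: python

    from ray import tune


    def train_fn(config):
        for i in range(20):
            # Report an increasing accuracy so the stopping criteria can trigger.
            tune.report({"mean_accuracy": i / 20})


    # Stop each trial after 10 iterations, or once it reports a
    # mean accuracy of 0.8 or more, whichever happens first.
    tuner = tune.Tuner(
        train_fn,
        run_config=tune.RunConfig(stop={"training_iteration": 10, "mean_accuracy": 0.8}),
    )
    results = tuner.fit()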
Stop trials after a certain amount of time ------------------------------------------ There are two choices to stop a Tune experiment based on time: stopping trials individually after a specified timeout, or stopping the full experiment after a certain amount of time. Stop trials individually with a timeout ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ You can use a dictionary stopping criteria as described above, using the ``time_total_s`` metric that is auto-filled by Tune. .. literalinclude:: /tune/doc_code/stopping.py :language: python :start-after: __stopping_trials_by_time_start__ :end-before: __stopping_trials_by_time_end__ .. note:: You need to include some intermediate reporting via :meth:`tune.report ` if using the :ref:`Function Trainable API `. Each report will automatically record the trial's ``time_total_s``, which allows Tune to stop based on time as a metric. If the training loop hangs somewhere, Tune will not be able to intercept the training and stop the trial for you. In this case, you can explicitly implement timeout logic in the training loop. Stop the experiment with a timeout ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Use the ``TuneConfig(time_budget_s)`` configuration to tell Tune to stop the experiment after ``time_budget_s`` seconds. .. literalinclude:: /tune/doc_code/stopping.py :language: python :start-after: __stopping_experiment_by_time_start__ :end-before: __stopping_experiment_by_time_end__ .. note:: You need to include some intermediate reporting via :meth:`tune.report ` if using the :ref:`Function Trainable API `, for the same reason as above. Stop on trial failures ---------------------- In addition to stopping trials based on their performance, you can also stop the entire experiment if any trial encounters a runtime error. To do this, you can use the :class:`ray.tune.FailureConfig` class. With this configuration, if any trial encounters an error, the entire experiment will stop immediately. .. literalinclude:: /tune/doc_code/stopping.py :language: python :start-after: __stopping_on_trial_error_start__ :end-before: __stopping_on_trial_error_end__ This is useful when you are debugging a Tune experiment with many trials. Early stopping with Tune schedulers ----------------------------------- Another way to stop Tune experiments is to use early stopping schedulers. These schedulers monitor the performance of trials and stop them early if they are not making sufficient progress. :class:`~ray.tune.schedulers.AsyncHyperBandScheduler` and :class:`~ray.tune.schedulers.HyperBandForBOHB` are examples of early stopping schedulers built into Tune. See :ref:`the Tune scheduler API reference ` for a full list, as well as more realistic examples. In the following example, we use both a dictionary stopping criteria along with an early-stopping criteria: .. literalinclude:: /tune/doc_code/stopping.py :language: python :start-after: __early_stopping_start__ :end-before: __early_stopping_end__ Summary ------- In this user guide, we learned how to stop Tune experiments using metrics, trial errors, and early stopping schedulers. See the following resources for more information: - :ref:`Tune Stopper API reference ` - For an experiment that was manually interrupted or the cluster dies unexpectedly while trials are still running, it's possible to resume the experiment. See :ref:`tune-fault-tolerance-ref`. --- .. _tune-storage-options: How to Configure Persistent Storage in Ray Tune =============================================== .. 
seealso:: Before diving into storage options, one can take a look at :ref:`the different types of data stored by Tune `. Tune allows you to configure persistent storage options to enable following use cases in a distributed Ray cluster: - **Trial-level fault tolerance**: When trials are restored (e.g. after a node failure or when the experiment was paused), they may be scheduled on different nodes, but still would need access to their latest checkpoint. - **Experiment-level fault tolerance**: For an entire experiment to be restored (e.g. if the cluster crashes unexpectedly), Tune needs to be able to access the latest experiment state, along with all trial checkpoints to start from where the experiment left off. - **Post-experiment analysis**: A consolidated location storing data from all trials is useful for post-experiment analysis such as accessing the best checkpoints and hyperparameter configs after the cluster has already been terminated. - **Bridge with downstream serving/batch inference tasks**: With a configured storage, you can easily access the models and artifacts generated by trials, share them with others or use them in downstream tasks. Storage Options in Tune ----------------------- Tune provides support for three scenarios: 1. When using cloud storage (e.g. AWS S3 or Google Cloud Storage) accessible by all machines in the cluster. 2. When using a network filesystem (NFS) mounted to all machines in the cluster. 3. When running Tune on a single node and using the local filesystem as the persistent storage location. .. note:: A network filesystem or cloud storage can be configured for single-node experiments. This can be useful to persist your experiment results in external storage if, for example, the instance you run your experiment on clears its local storage after termination. .. seealso:: See :class:`~ray.tune.SyncConfig` for the full set of configuration options as well as more details. .. _tune-cloud-checkpointing: Configuring Tune with cloud storage (AWS S3, Google Cloud Storage) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ If all nodes in a Ray cluster have access to cloud storage, e.g. AWS S3 or Google Cloud Storage (GCS), then all experiment outputs can be saved in a shared cloud bucket. We can configure cloud storage by telling Ray Tune to **upload to a remote** ``storage_path``: .. code-block:: python from ray import tune tuner = tune.Tuner( trainable, run_config=tune.RunConfig( name="experiment_name", storage_path="s3://bucket-name/sub-path/", ) ) tuner.fit() In this example, all experiment results can be found in the shared storage at ``s3://bucket-name/sub-path/experiment_name`` for further processing. .. note:: The head node will not have access to all experiment results locally. If you want to process e.g. the best checkpoint further, you will first have to fetch it from the cloud storage. Experiment restoration should also be done using the experiment directory at the cloud storage URI, rather than the local experiment directory on the head node. See :ref:`here for an example `. Configuring Tune with a network filesystem (NFS) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ If all Ray nodes have access to a network filesystem, e.g. AWS EFS or Google Cloud Filestore, they can all write experiment outputs to this directory. All we need to do is **set the shared network filesystem as the path to save results**. .. 
code-block:: python from ray import tune tuner = tune.Tuner( trainable, run_config=tune.RunConfig( name="experiment_name", storage_path="/mnt/path/to/shared/storage/", ) ) tuner.fit() In this example, all experiment results can be found in the shared storage at ``/path/to/shared/storage/experiment_name`` for further processing. .. _tune-default-syncing: Configure Tune without external persistent storage ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ On a single-node cluster ************************ If you're just running an experiment on a single node (e.g., on a laptop), Tune will use the local filesystem as the default storage location for checkpoints and other artifacts. Results are saved to ``~/ray_results`` in a sub-directory with a unique auto-generated name by default, unless you customize this with ``storage_path`` and ``name`` in :class:`~ray.tune.RunConfig`. .. code-block:: python from ray import tune tuner = tune.Tuner( trainable, run_config=tune.RunConfig( storage_path="/tmp/custom/storage/path", name="experiment_name", ) ) tuner.fit() In this example, all experiment results can be found locally at ``/tmp/custom/storage/path/experiment_name`` for further processing. On a multi-node cluster (Deprecated) ************************************ .. warning:: When running on multiple nodes, using the local filesystem of the head node as the persistent storage location is *deprecated*. If you save trial checkpoints and run on a multi-node cluster, Tune will raise an error by default, if NFS or cloud storage is not setup. See `this issue `_ for more information. Examples -------- Let's show some examples of configuring storage location and synchronization options. We'll also show how to resume the experiment for each of the examples, in the case that your experiment gets interrupted. See :ref:`tune-fault-tolerance-ref` for more information on resuming experiments. In each example, we'll give a practical explanation of how *trial checkpoints* are saved across the cluster and the external storage location (if one is provided). See :ref:`tune-persisted-experiment-data` for an overview of other experiment data that Tune needs to persist. Example: Running Tune with cloud storage ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Let's assume that you're running this example script from your Ray cluster's head node. In the example below, ``my_trainable`` is a Tune :ref:`trainable ` that implements saving and loading checkpoints. .. code-block:: python import os import ray from ray import tune from your_module import my_trainable tuner = tune.Tuner( my_trainable, run_config=tune.RunConfig( # Name of your experiment name="my-tune-exp", # Configure how experiment data and checkpoints are persisted. # We recommend cloud storage checkpointing as it survives the cluster when # instances are terminated and has better performance. storage_path="s3://my-checkpoints-bucket/path/", checkpoint_config=tune.CheckpointConfig( # We'll keep the best five checkpoints at all times # (with the highest AUC scores, a metric reported by the trainable) checkpoint_score_attribute="max-auc", checkpoint_score_order="max", num_to_keep=5, ), ), ) # This starts the run! results = tuner.fit() In this example, trial checkpoints will be saved to: ``s3://my-checkpoints-bucket/path/my-tune-exp//checkpoint_`` .. _tune-syncing-restore-from-uri: If this run stopped for any reason (ex: user CTRL+C, terminated due to out of memory issues), you can resume it any time starting from the experiment state saved in the cloud: .. 
code-block:: python from ray import tune tuner = tune.Tuner.restore( "s3://my-checkpoints-bucket/path/my-tune-exp", trainable=my_trainable, resume_errored=True, ) tuner.fit() There are a few options for restoring an experiment: ``resume_unfinished``, ``resume_errored`` and ``restart_errored``. Please see the documentation of :meth:`~ray.tune.Tuner.restore` for more details. Advanced configuration ---------------------- See :ref:`Ray Train's section on advanced storage configuration `. All of the configurations also apply to Ray Tune. --- .. _tune-trial-checkpoint: How to Save and Load Trial Checkpoints ====================================== Trial checkpoints are one of :ref:`the three types of data stored by Tune `. These are user-defined and are meant to snapshot your training progress! Trial-level checkpoints are saved via the :ref:`Tune Trainable ` API: this is how you define your custom training logic, and it's also where you'll define which trial state to checkpoint. In this guide, we will show how to save and load checkpoints for Tune's Function Trainable and Class Trainable APIs, as well as walk you through configuration options. .. _tune-function-trainable-checkpointing: Function API Checkpointing -------------------------- If using Ray Tune's Function API, one can save and load checkpoints in the following manner. To create a checkpoint, use the :meth:`~ray.tune.Checkpoint.from_directory` APIs. .. literalinclude:: /tune/doc_code/trial_checkpoint.py :language: python :start-after: __function_api_checkpointing_from_dir_start__ :end-before: __function_api_checkpointing_from_dir_end__ In the above code snippet: - We implement *checkpoint saving* with :meth:`tune.report(..., checkpoint=checkpoint) `. Note that every checkpoint must be reported alongside a set of metrics -- this way, checkpoints can be ordered with respect to a specified metric. - The saved checkpoint during training iteration `epoch` is saved to the path ``///checkpoint_`` on the node on which training happens and can be further synced to a consolidated storage location depending on the :ref:`storage configuration `. - We implement *checkpoint loading* with :meth:`tune.get_checkpoint() `. This will be populated with a trial's latest checkpoint whenever Tune restores a trial. This happens when (1) a trial is configured to retry after encountering a failure, (2) the experiment is being restored, and (3) the trial is being resumed after a pause (ex: :doc:`PBT `). .. TODO: for (1), link to tune fault tolerance guide. For (2), link to tune restore guide. .. note:: ``checkpoint_frequency`` and ``checkpoint_at_end`` will not work with Function API checkpointing. These are configured manually with Function Trainable. For example, if you want to checkpoint every three epochs, you can do so through: .. literalinclude:: /tune/doc_code/trial_checkpoint.py :language: python :start-after: __function_api_checkpointing_periodic_start__ :end-before: __function_api_checkpointing_periodic_end__ See :class:`here for more information on creating checkpoints `. .. _tune-class-trainable-checkpointing: Class API Checkpointing ----------------------- You can also implement checkpoint/restore using the Trainable Class API: .. literalinclude:: /tune/doc_code/trial_checkpoint.py :language: python :start-after: __class_api_checkpointing_start__ :end-before: __class_api_checkpointing_end__ You can checkpoint with three different mechanisms: manually, periodically, and at termination. .. 
_tune-class-trainable-checkpointing_manual-checkpointing: Manual Checkpointing by Trainable ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ A custom Trainable can manually trigger checkpointing by returning ``should_checkpoint: True`` (or ``tune.result.SHOULD_CHECKPOINT: True``) in the result dictionary of `step`. This can be especially helpful in spot instances: .. literalinclude:: /tune/doc_code/trial_checkpoint.py :language: python :start-after: __class_api_manual_checkpointing_start__ :end-before: __class_api_manual_checkpointing_end__ In the above example, if ``detect_instance_preemption`` returns True, manual checkpointing can be triggered. .. _tune-callback-checkpointing: Manual Checkpointing by Tuner Callback ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Similar to :ref:`tune-class-trainable-checkpointing_manual-checkpointing`, you can also trigger checkpointing through :class:`Tuner ` :class:`Callback ` methods by setting the ``result["should_checkpoint"] = True`` (or ``result[tune.result.SHOULD_CHECKPOINT] = True``) flag within the :meth:`on_trial_result() ` method of your custom callback. In contrast to checkpointing within the Trainable Class API, this approach decouples checkpointing logic from the training logic, and provides access to all :class:`Trial ` instances allowing for more complex checkpointing strategies. .. literalinclude:: /tune/doc_code/trial_checkpoint.py :language: python :start-after: __callback_api_checkpointing_start__ :end-before: __callback_api_checkpointing_end__ Periodic Checkpointing ~~~~~~~~~~~~~~~~~~~~~~ This can be enabled by setting ``checkpoint_frequency=N`` to checkpoint trials every *N* iterations, e.g.: .. literalinclude:: /tune/doc_code/trial_checkpoint.py :language: python :start-after: __class_api_periodic_checkpointing_start__ :end-before: __class_api_periodic_checkpointing_end__ Checkpointing at Termination ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The checkpoint_frequency may not coincide with the exact end of an experiment. If you want a checkpoint to be created at the end of a trial, you can additionally set the ``checkpoint_at_end=True``: .. literalinclude:: /tune/doc_code/trial_checkpoint.py :language: python :start-after: __class_api_end_checkpointing_start__ :end-before: __class_api_end_checkpointing_end__ Configurations -------------- Checkpointing can be configured through :class:`CheckpointConfig `. Some of the configurations do not apply to Function Trainable API, since checkpointing frequency is determined manually within the user-defined training loop. See the compatibility matrix below. .. list-table:: :header-rows: 1 * - - Class API - Function API * - ``num_to_keep`` - ✅ - ✅ * - ``checkpoint_score_attribute`` - ✅ - ✅ * - ``checkpoint_score_order`` - ✅ - ✅ * - ``checkpoint_frequency`` - ✅ - ❌ * - ``checkpoint_at_end`` - ✅ - ❌ Summary ------- In this user guide, we covered how to save and load trial checkpoints in Tune. Once checkpointing is enabled, move onto one of the following guides to find out how to: - :ref:`Extract checkpoints from Tune experiment results ` - :ref:`Configure persistent storage options ` for a :ref:`distributed Tune experiment ` .. _tune-persisted-experiment-data: Appendix: Types of data stored by Tune -------------------------------------- Experiment Checkpoints ~~~~~~~~~~~~~~~~~~~~~~ Experiment-level checkpoints save the experiment state. 
This includes the state of the searcher, the list of trials and their statuses (e.g., PENDING, RUNNING, TERMINATED, ERROR), and metadata pertaining to each trial (e.g., hyperparameter configuration, some derived trial results (min, max, last), etc.). The experiment-level checkpoint is periodically saved by the driver on the head node. By default, the frequency at which it is saved is automatically adjusted so that at most 5% of the time is spent saving experiment checkpoints, and the remaining time is used for handling training results and scheduling. This frequency can also be adjusted with the :ref:`TUNE_GLOBAL_CHECKPOINT_S environment variable `. Trial Checkpoints ~~~~~~~~~~~~~~~~~ Trial-level checkpoints capture the per-trial state. This often includes the model and optimizer states. Following are a few uses of trial checkpoints: - If the trial is interrupted for some reason (e.g., on spot instances), it can be resumed from the last state. No training time is lost. - Some searchers or schedulers pause trials to free up resources for other trials to train in the meantime. This only makes sense if the trials can then continue training from the latest state. - The checkpoint can be later used for other downstream tasks like batch inference. Learn how to save and load trial checkpoints :ref:`here `. Trial Results ~~~~~~~~~~~~~ Metrics reported by trials are saved and logged to their respective trial directories. This is the data stored in CSV, JSON, or TensorBoard (events.out.tfevents.*) formats, which can be inspected with TensorBoard and used for post-experiment analysis. --- (observability-configure-manage-dashboard)= # Configuring and Managing Ray Dashboard {ref}`Ray Dashboard` is one of the most important tools to monitor and debug Ray applications and Clusters. This page describes how to configure Ray Dashboard on your Clusters. Dashboard configurations may differ depending on how you launch Ray Clusters (e.g., local Ray Cluster vs. KubeRay). Integrations with Prometheus and Grafana are optional for an enhanced Dashboard experience. :::{note} Ray Dashboard is useful for interactive development and debugging because when clusters terminate, the dashboard UI and the underlying data are no longer accessible. For production monitoring and debugging, you should rely on [persisted logs](../cluster/kubernetes/user-guides/persist-kuberay-custom-resource-logs.md), [persisted metrics](./metrics.md), [persisted Ray states](../ray-observability/user-guides/cli-sdk.rst), and other observability tools. ::: ## Changing the Ray Dashboard port Ray Dashboard runs on port `8265` of the head node. Follow the instructions below to customize the port if needed. ::::{tab-set} :::{tab-item} Single-node local cluster **Start the cluster explicitly with CLI**
Pass the ``--dashboard-port`` argument with ``ray start`` in the command line. **Start the cluster implicitly with `ray.init`**
Pass the keyword argument ``dashboard_port`` in your call to ``ray.init()``. ::: :::{tab-item} VM Cluster Launcher Include the ``--dashboard-port`` argument in the `head_start_ray_commands` section of the [Cluster Launcher's YAML file](https://github.com/ray-project/ray/blob/0574620d454952556fa1befc7694353d68c72049/python/ray/autoscaler/aws/example-full.yaml#L172). ```yaml head_start_ray_commands: - ray stop # Replace ${YOUR_PORT} with the port number you need. - ulimit -n 65536; ray start --head --dashboard-port=${YOUR_PORT} --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml ``` ::: :::{tab-item} KubeRay View the [specifying non-default ports](https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/config.html#specifying-non-default-ports) page for details. ::: :::: (dashboard-in-browser)= ## Viewing Ray Dashboard in browsers When you start a single-node Ray cluster on your laptop, you can access the dashboard through a URL printed when Ray is initialized (the default URL is `http://localhost:8265`). When you start a remote Ray cluster with the {ref}`VM cluster launcher `, {ref}`KubeRay operator `, or manual configuration, the Ray Dashboard launches on the head node but the dashboard port may not be publicly exposed. You need an additional setup to access the Ray Dashboard from outside the head node. :::{danger} For security purposes, do not expose Ray Dashboard publicly without proper authentication in place. ::: ::::{tab-set} :::{tab-item} VM Cluster Launcher **Port forwarding**
You can securely port-forward local traffic to the dashboard with the ``ray dashboard`` command. ```shell $ ray dashboard [-p ] ``` The dashboard is now visible at ``http://localhost:8265``. ::: :::{tab-item} KubeRay The KubeRay operator makes Dashboard available via a Service targeting the Ray head pod, named ``-head-svc``. Access Dashboard from within the Kubernetes cluster at ``http://-head-svc:8265``. There are two ways to expose Dashboard outside the Cluster: **1. Setting up ingress**
Follow the [instructions](kuberay-ingress) to set up ingress to access Ray Dashboard. **The Ingress must allow access only from trusted sources.** **2. Port forwarding**
You can also view the dashboard from outside the Kubernetes cluster by using port-forwarding: ```shell $ kubectl port-forward service/${RAYCLUSTER_NAME}-head-svc 8265:8265 # Visit ${YOUR_IP}:8265 for the Dashboard (e.g. 127.0.0.1:8265 or ${YOUR_VM_IP}:8265) ``` ```{admonition} Note :class: note Do not use port forwarding for production environment. Follow the instructions above to expose the Dashboard with Ingress. ``` For more information about configuring network access to a Ray cluster on Kubernetes, see the {ref}`networking notes `. ::: :::: ## Running behind a reverse proxy Ray Dashboard should work out-of-the-box when accessed via a reverse proxy. API requests don't need to be proxied individually. Always access the dashboard with a trailing ``/`` at the end of the URL. For example, if your proxy is set up to handle requests to ``/ray/dashboard``, view the dashboard at ``www.my-website.com/ray/dashboard/``. The dashboard sends HTTP requests with relative URL paths. Browsers handle these requests as expected when the ``window.location.href`` ends in a trailing ``/``. This is a peculiarity of how many browsers handle requests with relative URLs, despite what [MDN](https://developer.mozilla.org/en-US/docs/Learn/Common_questions/What_is_a_URL#examples_of_relative_urls) defines as the expected behavior. Make your dashboard visible without a trailing ``/`` by including a rule in your reverse proxy that redirects the user's browser to ``/``, i.e. ``/ray/dashboard`` --> ``/ray/dashboard/``. Below is an example with a [traefik](https://doc.traefik.io/traefik/getting-started/quick-start/) TOML file that accomplishes this: ```yaml [http] [http.routers] [http.routers.to-dashboard] rule = "PathPrefix(`/ray/dashboard`)" middlewares = ["test-redirectregex", "strip"] service = "dashboard" [http.middlewares] [http.middlewares.test-redirectregex.redirectRegex] regex = "^(.*)/ray/dashboard$" replacement = "${1}/ray/dashboard/" [http.middlewares.strip.stripPrefix] prefixes = ["/ray/dashboard"] [http.services] [http.services.dashboard.loadBalancer] [[http.services.dashboard.loadBalancer.servers]] url = "http://localhost:8265" ``` ```{admonition} Warning :class: warning The Ray Dashboard provides read **and write** access to the Ray Cluster. The reverse proxy must provide authentication or network ingress controls to prevent unauthorized access to the Cluster. ``` ## Disabling the Dashboard Dashboard is included if you use `ray[default]` or {ref}`other installation commands ` and automatically started. To disable the Dashboard, use the `--include-dashboard` argument. ::::{tab-set} :::{tab-item} Single-node local cluster **Start the cluster explicitly with CLI**
```bash ray start --include-dashboard=False ``` **Start the cluster implicitly with `ray.init`**
```{testcode} :hide: import ray ray.shutdown() ``` ```{testcode} import ray ray.init(include_dashboard=False) ``` ::: :::{tab-item} VM Cluster Launcher Include the `ray start --head --include-dashboard=False` argument in the `head_start_ray_commands` section of the [Cluster Launcher's YAML file](https://github.com/ray-project/ray/blob/0574620d454952556fa1befc7694353d68c72049/python/ray/autoscaler/aws/example-full.yaml#L172). ::: :::{tab-item} KubeRay ```{admonition} Warning :class: warning It's not recommended to disable Dashboard because several KubeRay features like `RayJob` and `RayService` depend on it. ``` Set `spec.headGroupSpec.rayStartParams.include-dashboard` to `False`. Check out this [example YAML file](https://gist.github.com/kevin85421/0e6a8dd02c056704327d949b9ec96ef9). ::: :::: (observability-visualization-setup)= ## Embed Grafana visualizations into Ray Dashboard For the enhanced Ray Dashboard experience, like {ref}`viewing time-series metrics` together with logs, Job info, etc., set up Prometheus and Grafana and integrate them with Ray Dashboard. ### Setting up Prometheus To render Grafana visualizations, you need Prometheus to scrape metrics from Ray Clusters. Follow {ref}`the instructions ` to set up your Prometheus server and start to scrape system and application metrics from Ray Clusters. ### Setting up Grafana Grafana is a tool that supports advanced visualizations of Prometheus metrics and allows you to create custom dashboards with your favorite metrics. Follow {ref}`the instructions ` to set up Grafana. (embed-grafana-in-dashboard)= ### Embedding Grafana visualizations into Ray Dashboard To view embedded time-series visualizations in Ray Dashboard, the following must be set up: 1. The head node of the cluster is able to access Prometheus and Grafana. 2. The browser of the dashboard user is able to access Grafana. Configure these settings using the `RAY_GRAFANA_HOST`, `RAY_PROMETHEUS_HOST`, `RAY_PROMETHEUS_NAME`, and `RAY_GRAFANA_IFRAME_HOST` environment variables when you start the Ray Clusters. * Set `RAY_GRAFANA_HOST` to an address that the head node can use to access Grafana. Head node does health checks on Grafana on the backend. * Set `RAY_GRAFANA_ORG_ID` to the organization ID you use in Grafana. Default is "1". * Set `RAY_PROMETHEUS_HOST` to an address the head node can use to access Prometheus. * Set `RAY_PROMETHEUS_NAME` to select a different data source to use for the Grafana dashboard panels to use. Default is "Prometheus". * Set `RAY_GRAFANA_IFRAME_HOST` to an address that the user's browsers can use to access Grafana and embed visualizations. If `RAY_GRAFANA_IFRAME_HOST` is not set, Ray Dashboard uses the value of `RAY_GRAFANA_HOST`. For example, if the IP of the head node is 55.66.77.88 and Grafana is hosted on port 3000. Set the value to `RAY_GRAFANA_HOST=http://55.66.77.88:3000`. * If you start a single-node Ray Cluster manually, make sure these environment variables are set and accessible before you start the cluster or as a prefix to the `ray start ...` command, e.g., `RAY_GRAFANA_HOST=http://55.66.77.88:3000 ray start ...` * If you start a Ray Cluster with {ref}`VM Cluster Launcher `, the environment variables should be set under `head_start_ray_commands` as a prefix to the `ray start ...` command. * If you start a Ray Cluster with {ref}`KubeRay `, refer to this {ref}`tutorial `. If all the environment variables are set properly, you should see time-series metrics in {ref}`Ray Dashboard `. 
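As a concrete illustration, here is a minimal sketch of setting these variables from Python before starting a single-node cluster with `ray.init()`. The addresses are placeholders for your own Grafana and Prometheus endpoints; for `ray start` or KubeRay deployments, set the same variables in the start command or Pod spec as described above.

```python
import os

# Placeholder endpoints -- replace with your own Grafana and Prometheus addresses.
# These variables must be set before the Ray cluster (and its dashboard) starts.
os.environ["RAY_GRAFANA_HOST"] = "http://55.66.77.88:3000"         # head node -> Grafana
os.environ["RAY_PROMETHEUS_HOST"] = "http://55.66.77.88:9090"      # head node -> Prometheus
os.environ["RAY_GRAFANA_IFRAME_HOST"] = "http://55.66.77.88:3000"  # browser -> Grafana

import ray

ray.init(include_dashboard=True)
```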
:::{note} If you use a different Prometheus server for each Ray Cluster and use the same Grafana server for all Clusters, set the `RAY_PROMETHEUS_NAME` environment variable to different values for each Ray Cluster and add these datasources in Grafana. Follow {ref}`these instructions ` to set up Grafana. ::: #### Alternate Prometheus host location By default, Ray Dashboard assumes Prometheus is hosted at `localhost:9090`. You can choose to run Prometheus on a non-default port or on a different machine. In this case, make sure that Prometheus can scrape the metrics from your Ray nodes following instructions {ref}`here `. Then, configure `RAY_PROMETHEUS_HOST` environment variable properly as stated above. For example, if Prometheus is hosted at port 9000 on a node with ip 55.66.77.88, set `RAY_PROMETHEUS_HOST=http://55.66.77.88:9000`. #### Customize headers for requests from the Ray dashboard to Prometheus If Prometheus requires additional headers for authentication, set `RAY_PROMETHEUS_HEADERS` in one of the following JSON formats for Ray dashboard to send them to Prometheus: 1. `{"Header1": "Value1", "Header2": "Value2"}` 2. `[["Header1", "Value1"], ["Header2", "Value2"], ["Header2", "Value3"]]` #### Alternate Grafana host location By default, Ray Dashboard assumes Grafana is hosted at `localhost:3000`. You can choose to run Grafana on a non-default port or on a different machine as long as the head node and the dashboard browsers of can access it. If Grafana is exposed with NGINX ingress on a Kubernetes cluster, the following line should be present in the Grafana ingress annotation: ```yaml nginx.ingress.kubernetes.io/configuration-snippet: | add_header X-Frame-Options SAMEORIGIN always; ``` When both Grafana and the Ray Cluster are on the same Kubernetes cluster, set `RAY_GRAFANA_HOST` to the external URL of the Grafana ingress. #### User authentication for Grafana When the Grafana instance requires user authentication, the following settings have to be in its [configuration file](https://grafana.com/docs/grafana/latest/setup-grafana/configure-grafana/) to correctly embed in Ray Dashboard: ```ini [security] allow_embedding = true cookie_secure = true cookie_samesite = none ``` #### Troubleshooting ##### Dashboard message: either Prometheus or Grafana server is not detected If you have followed the instructions above to set up everything, run the connection checks below in your browser: * check Head Node connection to Prometheus server: add `api/prometheus_health` to the end of Ray Dashboard URL (for example: http://127.0.0.1:8265/api/prometheus_health)and visit it. * check Head Node connection to Grafana server: add `api/grafana_health` to the end of Ray Dashboard URL (for example: http://127.0.0.1:8265/api/grafana_health) and visit it. * check browser connection to Grafana server: visit the URL used in `RAY_GRAFANA_IFRAME_HOST`. ##### Getting an error that says `RAY_GRAFANA_HOST` is not setup If you have set up Grafana, check that: * You've included the protocol in the URL (e.g., `http://your-grafana-url.com` instead of `your-grafana-url.com`). * The URL doesn't have a trailing slash (e.g., `http://your-grafana-url.com` instead of `http://your-grafana-url.com/`). ##### Certificate Authority (CA error) You may see a CA error if your Grafana instance is hosted behind HTTPS. Contact the Grafana service owner to properly enable HTTPS traffic. 
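If you prefer to script the connection checks from the troubleshooting steps above rather than visiting the URLs in a browser, a small sketch along these lines (assuming the dashboard is reachable at `127.0.0.1:8265` and the `requests` package is installed) exercises the same health endpoints:

```python
import requests

DASHBOARD = "http://127.0.0.1:8265"  # adjust to your Ray Dashboard address

# A 200 response means the head node can reach the corresponding server.
for path in ("api/prometheus_health", "api/grafana_health"):
    url = f"{DASHBOARD}/{path}"
    resp = requests.get(url, timeout=5)
    print(url, resp.status_code, resp.text[:200])
```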
## Viewing built-in Dashboard API metrics Dashboard is powered by a server that serves both the UI code and the data about the cluster via API endpoints. Ray emits basic Prometheus metrics for each API endpoint: `ray_dashboard_api_requests_count_requests_total`: Collects the total count of requests. This is tagged by endpoint, method, and http_status. `ray_dashboard_api_requests_duration_seconds_bucket`: Collects the duration of requests. This is tagged by endpoint and method. For example, you can view the p95 duration of all requests with this query: ```text histogram_quantile(0.95, sum(rate(ray_dashboard_api_requests_duration_seconds_bucket[5m])) by (le)) ``` You can query these metrics from the Prometheus or Grafana UI. Find instructions above for how to set these tools up. --- (kuberay-mem-scalability)= # KubeRay memory and scalability benchmark ## Architecture ![benchmark architecture](../images/benchmark_architecture.png) This architecture is not a good practice, but it can fulfill the current requirements. ## Preparation Clone the [KubeRay repository](https://github.com/ray-project/kuberay) and checkout the `master` branch. This tutorial requires several files in the repository. ## Step 1: Create a new Kubernetes cluster Create a GKE cluster with autoscaling enabled. The following command creates a Kubernetes cluster named `kuberay-benchmark-cluster` on Google GKE. The cluster can scale up to 16 nodes, and each node of type `e2-highcpu-16` has 16 CPUs and 16 GB of memory. The following experiments may create up to ~150 Pods in the Kubernetes cluster, and each Ray Pod requires 1 CPU and 1 GB of memory. ```sh gcloud container clusters create kuberay-benchmark-cluster \ --num-nodes=1 --min-nodes 0 --max-nodes 16 --enable-autoscaling \ --zone=us-west1-b --machine-type e2-highcpu-16 ``` ## Step 2: Install Prometheus and Grafana ```sh # Path: kuberay/ ./install/prometheus/install.sh ``` Follow "Step 2: Install Kubernetes Prometheus Stack via Helm chart" in [prometheus-grafana.md](kuberay-prometheus-grafana) to install the [kube-prometheus-stack v48.2.1](https://github.com/prometheus-community/helm-charts/tree/kube-prometheus-stack-48.2.1/charts/kube-prometheus-stack) chart and related custom resources. ## Step 3: Install a KubeRay operator Follow [this document](kuberay-operator-deploy) to install the latest stable KubeRay operator via Helm repository. ## Step 4: Run experiments * Step 4.1: Make sure the `kubectl` CLI can connect to your GKE cluster. If not, run `gcloud auth login`. * Step 4.2: Run an experiment. ```sh # You can modify `memory_benchmark_utils` to run the experiment you want to run. # (path: benchmark/memory_benchmark/scripts) python3 memory_benchmark_utils.py | tee benchmark_log ``` * Step 4.3: Follow [prometheus-grafana.md](kuberay-prometheus-grafana) to access Grafana's dashboard. * Sign into the Grafana dashboard. * Click on "Dashboards". * Select "Kubernetes / Compute Resources / Pod". * Locate the "Memory Usage" panel for the KubeRay operator Pod. * Select the time range, then click on "Inspect" followed by "Data" to download the memory usage data of the KubeRay operator Pod. * Step 4.4: Delete all RayCluster custom resources. ```sh kubectl delete --all rayclusters.ray.io --namespace=default ``` * Step 4.5: Repeat Step 4.2 to Step 4.4 for other experiments. # Experiments This benchmark is based on three benchmark experiments: * Experiment 1: Launch a RayCluster with 1 head and no workers. 
A new cluster is initiated every 20 seconds until there are a total of 150 RayCluster custom resources. * Experiment 2: Create a Kubernetes cluster with only 1 RayCluster. Add 5 new worker Pods to this RayCluster every 60 seconds until the total reaches 150 Pods. * Experiment 3: Create a 5-node (1 head + 4 workers) RayCluster every 60 seconds up to 30 RayCluster custom resources. Based on [the survey](https://forms.gle/KtMLzjXcKoeSTj359) for KubeRay users, the benchmark target is set at 150 Ray Pods to cover most use cases. ## Experiment results (KubeRay v0.6.0) ![benchmark result](../images/benchmark_result.png) * You can generate the above figure by running: ```sh # (path: benchmark/memory_benchmark/scripts) python3 experiment_figures.py # The output image `benchmark_result.png` will be stored in `scripts/`. ``` * As shown in the figure, the memory usage of the KubeRay operator Pod is highly and positively correlated with the number of Pods in the Kubernetes cluster. In addition, the number of custom resources in the Kubernetes cluster does not have a significant impact on the memory usage. * Note that the x-axis "Number of Pods" is the number of Pods that are created rather than running. If the Kubernetes cluster does not have enough computing resources, the GKE Autopilot adds a new Kubernetes node into the cluster. This process may take a few minutes, so some Pods may be pending in the process. This lag may explain why the memory usage is somewhat throttled. --- (kuberay-benchmarks)= # KubeRay Benchmarks ```{toctree} :hidden: benchmarks/memory-scalability-benchmark ``` - {ref}`kuberay-mem-scalability` --- (deploying-on-argocd-example)= # Deploying Ray Clusters via ArgoCD This guide provides a step-by-step approach for deploying Ray clusters on Kubernetes using ArgoCD. ArgoCD is a declarative GitOps tool that enables you to manage Ray cluster configurations in Git repositories with automated synchronization, version control, and rollback capabilities. This approach is particularly valuable when managing multiple Ray clusters across different environments, implementing audit trails and approval workflows, or maintaining infrastructure-as-code practices. For simpler use cases like single-cluster development or quick experimentation, direct kubectl or Helm deployments may be sufficient. You can read more about the benefits of ArgoCD [here](https://argo-cd.readthedocs.io/en/stable/#why-argo-cd). This example demonstrates how to deploy the KubeRay operator and a RayCluster with three different worker groups, leveraging ArgoCD's GitOps capabilities for automated cluster management. ## Prerequisites Before proceeding with this guide, ensure you have the following: * A Kubernetes cluster with appropriate resources for running Ray workloads. * `kubectl` configured to access your Kubernetes cluster. * (Optional) [ArgoCD installed](https://argo-cd.readthedocs.io/en/stable/getting_started/) on your Kubernetes cluster. * (Optional) [ArgoCD CLI](https://argo-cd.readthedocs.io/en/stable/cli_installation/) installed on your local machine (recommended for easier application management; it might need [port-forwarding and login](https://argo-cd.readthedocs.io/en/stable/getting_started/#port-forwarding) depending on your environment). * (Optional) Access to the ArgoCD UI or API server. ## Step 1: Deploy KubeRay Operator CRDs First, deploy the Custom Resource Definitions (CRDs) required by the KubeRay operator.
Create a file named `ray-operator-crds.yaml` with the following content: ```yaml apiVersion: argoproj.io/v1alpha1 kind: Application metadata: name: ray-operator-crds namespace: argocd spec: project: default destination: server: https://kubernetes.default.svc namespace: ray-cluster source: repoURL: https://github.com/ray-project/kuberay targetRevision: v1.5.1 # update this as necessary path: helm-chart/kuberay-operator/crds syncPolicy: automated: prune: true selfHeal: true syncOptions: - CreateNamespace=true - Replace=true ``` Apply the ArgoCD Application: ```sh kubectl apply -f ray-operator-crds.yaml ``` Wait for the CRDs Application to sync and become healthy. You can check the status using: ```sh kubectl get application ray-operator-crds -n argocd ``` Which should eventually give something like: ``` NAME SYNC STATUS HEALTH STATUS ray-operator-crds Synced Healthy ``` Alternatively, if you have the ArgoCD CLI installed, you can wait for the application: ```sh argocd app wait ray-operator-crds ``` ## Step 2: Deploy the KubeRay Operator After the CRDs are installed, deploy the KubeRay operator itself. Create a file named `ray-operator.yaml` with the following content: ```yaml apiVersion: argoproj.io/v1alpha1 kind: Application metadata: name: ray-operator namespace: argocd spec: project: default source: repoURL: https://github.com/ray-project/kuberay targetRevision: v1.5.1 # update this as necessary path: helm-chart/kuberay-operator helm: skipCrds: true # CRDs are already installed in Step 1 destination: server: https://kubernetes.default.svc namespace: ray-cluster syncPolicy: automated: prune: true selfHeal: true syncOptions: - CreateNamespace=true ``` Note the `skipCrds: true` setting in the Helm configuration. This is required because the CRDs were installed separately in Step 1. Apply the ArgoCD Application: ```sh kubectl apply -f ray-operator.yaml ``` Wait for the operator Application to sync and become healthy. You can check the status using: ```sh kubectl get application ray-operator -n argocd ``` Which should give the following output eventually: ``` NAME SYNC STATUS HEALTH STATUS ray-operator Synced Healthy ``` Alternatively, if you have the ArgoCD CLI installed: ```sh argocd app wait ray-operator ``` Verify that the KubeRay operator pod is running: ```sh kubectl get pods -n ray-cluster -l app.kubernetes.io/name=kuberay-operator ``` ## Step 3: Deploy a RayCluster Now deploy a RayCluster with autoscaling enabled and three different worker groups. 
Create a file named `raycluster.yaml` with the following content: ```yaml apiVersion: argoproj.io/v1alpha1 kind: Application metadata: name: raycluster namespace: argocd spec: project: default destination: server: https://kubernetes.default.svc namespace: ray-cluster ignoreDifferences: - group: ray.io kind: RayCluster name: raycluster-kuberay namespace: ray-cluster jqPathExpressions: - .spec.workerGroupSpecs[].replicas source: repoURL: https://ray-project.github.io/kuberay-helm/ chart: ray-cluster targetRevision: "1.5.1" helm: releaseName: raycluster valuesObject: image: repository: docker.io/rayproject/ray tag: latest pullPolicy: IfNotPresent head: rayStartParams: num-cpus: "0" enableInTreeAutoscaling: true autoscalerOptions: version: v2 upscalingMode: Default idleTimeoutSeconds: 600 # 10 minutes env: - name: AUTOSCALER_MAX_CONCURRENT_LAUNCHES value: "100" worker: groupName: standard-worker replicas: 1 minReplicas: 1 maxReplicas: 200 rayStartParams: resources: '"{\"standard-worker\": 1}"' resources: requests: cpu: "1" memory: "1G" additionalWorkerGroups: additional-worker-group1: image: repository: docker.io/rayproject/ray tag: latest pullPolicy: IfNotPresent disabled: false replicas: 1 minReplicas: 1 maxReplicas: 30 rayStartParams: resources: '"{\"additional-worker-group1\": 1}"' resources: requests: cpu: "1" memory: "1G" additional-worker-group2: image: repository: docker.io/rayproject/ray tag: latest pullPolicy: IfNotPresent disabled: false replicas: 1 minReplicas: 1 maxReplicas: 200 rayStartParams: resources: '"{\"additional-worker-group2\": 1}"' resources: requests: cpu: "1" memory: "1G" syncPolicy: automated: prune: true selfHeal: true syncOptions: - CreateNamespace=true ``` Apply the ArgoCD Application: ```sh kubectl apply -f raycluster.yaml ``` Wait for the RayCluster Application to sync and become healthy. You can check the status using: ```sh kubectl get application raycluster -n argocd ``` Alternatively, if you have the ArgoCD CLI installed: ```sh argocd app wait raycluster ``` Verify that the RayCluster is running: ```sh kubectl get raycluster -n ray-cluster ``` Which will give something like: ``` NAME DESIRED WORKERS AVAILABLE WORKERS CPUS MEMORY GPUS STATUS AGE raycluster-kuberay 3 3 ... ... ... ready ... ``` You should see the head pod and worker pods: ```sh kubectl get pods -n ray-cluster ``` Gives something like: ``` NAME READY STATUS RESTARTS AGE kuberay-operator-6c485bc876-28dnl 1/1 Running 0 11d raycluster-kuberay-additional-worker-group1-n45rc 1/1 Running 0 5d21h raycluster-kuberay-additional-worker-group2-b2455 1/1 Running 0 2d18h raycluster-kuberay-head 2/2 Running 0 5d21h raycluster-kuberay-standard-worker-worker-bs8t8 1/1 Running 0 5d21h ``` ## Understanding Ray Autoscaling with ArgoCD ### Determining Fields to Ignore The `ignoreDifferences` section in the RayCluster Application configuration is critical for proper autoscaling. To determine which fields need to be ignored, you can inspect the RayCluster resource to identify fields that change dynamically during runtime. First, describe the RayCluster resource to see its full specification: ```sh kubectl describe raycluster raycluster-kuberay -n ray-cluster ``` Or, get the resource in YAML format to see the exact field paths: ```sh kubectl get raycluster raycluster-kuberay -n ray-cluster -o yaml ``` Look for fields that are modified by controllers or autoscalers. In the case of Ray, the autoscaler modifies the `replicas` field under each worker group spec. 
You'll see output similar to: ```yaml spec: workerGroupSpecs: - replicas: 5 # This value changes dynamically minReplicas: 1 maxReplicas: 200 groupName: standard-worker # ... ``` When ArgoCD detects differences between the desired state (in Git) and the actual state (in the cluster), it will show these in the UI or via CLI: ```sh argocd app diff raycluster ``` If you see repeated differences in fields that should be managed by controllers (like autoscalers), those are candidates for `ignoreDifferences`. ### Configuring ignoreDifferences The `ignoreDifferences` section in the RayCluster Application configuration tells ArgoCD which fields to ignore. Without this setting, ArgoCD and the Ray Autoscaler may conflict, resulting in unexpected behavior when requesting workers dynamically (for example, using `ray.autoscaler.sdk.request_resources`). Specifically, when requesting N workers, the Autoscaler might not spin up the expected number of workers because ArgoCD could revert the replica count back to the original value defined in the Application manifest. The recommended approach is to use `jqPathExpressions`, which automatically handles any number of worker groups: ```yaml ignoreDifferences: - group: ray.io kind: RayCluster name: raycluster-kuberay namespace: ray-cluster jqPathExpressions: - .spec.workerGroupSpecs[].replicas ``` This configuration tells ArgoCD to ignore differences in the `replicas` field for all worker groups. The `jqPathExpressions` field uses JQ syntax with array wildcards (`[]`), which means you don't need to update the configuration when adding or removing worker groups. **Note**: The `name` and `namespace` must match your RayCluster resource name and namespace. Verify these values separately if you've customized them. **Alternative: Using jsonPointers** If you prefer explicit configuration, you can use `jsonPointers` instead: ```yaml ignoreDifferences: - group: ray.io kind: RayCluster name: raycluster-kuberay namespace: ray-cluster jsonPointers: - /spec/workerGroupSpecs/0/replicas - /spec/workerGroupSpecs/1/replicas - /spec/workerGroupSpecs/2/replicas ``` With `jsonPointers`, you must explicitly list each worker group by index: - `/spec/workerGroupSpecs/0/replicas` - First worker group (the default `worker` group) - `/spec/workerGroupSpecs/1/replicas` - Second worker group (`additional-worker-group1`) - `/spec/workerGroupSpecs/2/replicas` - Third worker group (`additional-worker-group2`) If you add or remove worker groups, you **must** update this list accordingly. The index corresponds to the order of worker groups as they appear in the RayCluster spec, with the default `worker` group at index 0 and `additionalWorkerGroups` following in the order they are defined. See the [ArgoCD diff customization documentation](https://argo-cd.readthedocs.io/en/stable/user-guide/diffing/) for more details on both approaches. By ignoring these differences, ArgoCD allows the Ray Autoscaler to dynamically manage worker replicas without interference. ## Step 4: Access the Ray Dashboard To access the Ray Dashboard, port-forward the head service: ```sh kubectl port-forward -n ray-cluster svc/raycluster-kuberay-head-svc 8265:8265 ``` Navigate to `http://localhost:8265` in your browser to view the Ray Dashboard. ## Customizing the Configuration You can customize the RayCluster configuration by modifying the `valuesObject` section in the `raycluster.yaml` file: * **Image**: Change the `repository` and `tag` to use different Ray versions. 
* **Worker Groups**: Add or remove worker groups by modifying the `additionalWorkerGroups` section. * **Autoscaling**: Adjust `minReplicas`, `maxReplicas`, and `idleTimeoutSeconds` to control autoscaling behavior. * **Resources**: Modify `rayStartParams` to allocate custom resources to worker groups. After making changes, commit them to your Git repository. ArgoCD will automatically sync the changes to your cluster if automated sync is enabled. ## Alternative: Deploy Everything in One File If you prefer to deploy all components at once, you can combine all three ArgoCD Applications into a single file. Create a file named `ray-argocd-all.yaml` with the following content: ```yaml apiVersion: argoproj.io/v1alpha1 kind: Application metadata: name: ray-operator-crds namespace: argocd spec: project: default destination: server: https://kubernetes.default.svc namespace: ray-cluster source: repoURL: https://github.com/ray-project/kuberay targetRevision: v1.5.1 # update this as necessary path: helm-chart/kuberay-operator/crds syncPolicy: automated: prune: true selfHeal: true syncOptions: - CreateNamespace=true - Replace=true --- apiVersion: argoproj.io/v1alpha1 kind: Application metadata: name: ray-operator namespace: argocd spec: project: default source: repoURL: https://github.com/ray-project/kuberay targetRevision: v1.5.1 # update this as necessary path: helm-chart/kuberay-operator helm: skipCrds: true # CRDs are installed in the first Application destination: server: https://kubernetes.default.svc namespace: ray-cluster syncPolicy: automated: prune: true selfHeal: true syncOptions: - CreateNamespace=true --- apiVersion: argoproj.io/v1alpha1 kind: Application metadata: name: raycluster namespace: argocd spec: project: default destination: server: https://kubernetes.default.svc namespace: ray-cluster ignoreDifferences: - group: ray.io kind: RayCluster name: raycluster-kuberay # ensure this is aligned with the release name namespace: ray-cluster # ensure this is aligned with the namespace jqPathExpressions: - .spec.workerGroupSpecs[].replicas source: repoURL: https://ray-project.github.io/kuberay-helm/ chart: ray-cluster targetRevision: "1.4.1" helm: releaseName: raycluster # this affects the ignoreDifferences field valuesObject: image: repository: docker.io/rayproject/ray tag: latest pullPolicy: IfNotPresent head: rayStartParams: num-cpus: "0" enableInTreeAutoscaling: true autoscalerOptions: version: v2 upscalingMode: Default idleTimeoutSeconds: 600 # 10 minutes env: - name: AUTOSCALER_MAX_CONCURRENT_LAUNCHES value: "100" worker: groupName: standard-worker replicas: 1 minReplicas: 1 maxReplicas: 200 rayStartParams: resources: '"{\"standard-worker\": 1}"' resources: requests: cpu: "1" memory: "1G" additionalWorkerGroups: additional-worker-group1: image: repository: docker.io/rayproject/ray tag: latest pullPolicy: IfNotPresent disabled: false replicas: 1 minReplicas: 1 maxReplicas: 30 rayStartParams: resources: '"{\"additional-worker-group1\": 1}"' resources: requests: cpu: "1" memory: "1G" additional-worker-group2: image: repository: docker.io/rayproject/ray tag: latest pullPolicy: IfNotPresent disabled: false replicas: 1 minReplicas: 1 maxReplicas: 200 rayStartParams: resources: '"{\"additional-worker-group2\": 1}"' resources: requests: cpu: "1" memory: "1G" syncPolicy: automated: prune: true selfHeal: true syncOptions: - CreateNamespace=true ``` Note in this example, the `jqPathExpressions` approach is used. 
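Whichever variant you deploy, the reason for ignoring the `replicas` fields is easiest to see from the application side. The hedged sketch below (assuming a driver or Ray Client session connected to the running cluster, and the `standard-worker` custom resource defined in the values above) requests more capacity through `ray.autoscaler.sdk.request_resources`, as mentioned in the ignoreDifferences discussion earlier; the autoscaler satisfies the request by raising `replicas` on the matching worker group, which is exactly the change ArgoCD must not revert.

```python
import ray
from ray.autoscaler.sdk import request_resources

# Connect to the running RayCluster; "auto" assumes this script runs inside the
# cluster (for example, as a Ray Job). Adjust the address for Ray Client connections.
ray.init(address="auto")

# Request capacity equivalent to ten Pods of the "standard-worker" group.
# The autoscaler increases that group's `replicas` to satisfy the request,
# which is why ArgoCD is configured to ignore the field.
request_resources(bundles=[{"standard-worker": 1}] * 10)
```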
Apply all three Applications at once: ```sh kubectl apply -f ray-argocd-all.yaml ``` This single-file approach is convenient for quick deployments, but the step-by-step approach in the earlier sections provides better visibility into the deployment process and makes it easier to troubleshoot issues. --- (kuberay-distributed-checkpointing-gcsfuse)= # Distributed checkpointing with KubeRay and GCSFuse This example orchestrates distributed checkpointing with KubeRay, using the GCSFuse CSI driver and Google Cloud Storage as the remote storage system. To illustrate the concepts, this guide uses the [Finetuning a Pytorch Image Classifier with Ray Train](https://docs.ray.io/en/latest/train/examples/pytorch/pytorch_resnet_finetune.html) example. ## Why distributed checkpointing with GCSFuse? In large-scale, high-performance machine learning, distributed checkpointing is crucial for fault tolerance, ensuring that if a node fails during training, Ray can resume the process from the latest saved checkpoint instead of starting from scratch. While it's possible to directly reference remote storage paths (e.g., `gs://my-checkpoint-bucket`), using Google Cloud Storage FUSE (GCSFuse) has distinct advantages for distributed applications. GCSFuse allows you to mount Cloud Storage buckets like local file systems, making checkpoint management more intuitive for distributed applications that rely on these semantics. Furthermore, GCSFuse is designed for high-performance workloads, delivering the performance and scalability you need for distributed checkpointing of large models. [Distributed checkpointing](https://docs.ray.io/en/latest/train/user-guides/checkpoints.html), in combination with [GCSFuse](https://cloud.google.com/storage/docs/gcs-fuse), allows for larger-scale model training with increased availability and efficiency. ## Create a Kubernetes cluster on GKE Create a GKE cluster with the [GCSFuse CSI driver](https://cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/cloud-storage-fuse-csi-driver) and [Workload Identity](https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity) enabled, as well as a GPU node pool with 4 L4 GPUs: ``` export PROJECT_ID= gcloud container clusters create kuberay-with-gcsfuse \ --addons GcsFuseCsiDriver \ --cluster-version=1.29.4 \ --location=us-east4-c \ --machine-type=g2-standard-8 \ --release-channel=rapid \ --num-nodes=4 \ --accelerator type=nvidia-l4,count=1,gpu-driver-version=latest \ --workload-pool=${PROJECT_ID}.svc.id.goog ``` Verify the successful creation of your cluster with 4 GPUs: ``` $ kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu" NAME GPU gke-kuberay-with-gcsfuse-default-pool-xxxx-0000 1 gke-kuberay-with-gcsfuse-default-pool-xxxx-1111 1 gke-kuberay-with-gcsfuse-default-pool-xxxx-2222 1 gke-kuberay-with-gcsfuse-default-pool-xxxx-3333 1 ``` ## Install the KubeRay operator Follow [Deploy a KubeRay operator](kuberay-operator-deploy) to install the latest stable KubeRay operator from the Helm repository. The KubeRay operator Pod must be on the CPU node if you set up the taint for the GPU node pool correctly. ## Configuring the GCS Bucket Create a GCS bucket that Ray uses as the remote filesystem. 
``` BUCKET= gcloud storage buckets create gs://$BUCKET --uniform-bucket-level-access ``` Create a Kubernetes ServiceAccount that grants the RayCluster access to mount the GCS bucket: ``` kubectl create serviceaccount pytorch-distributed-training ``` Bind the `roles/storage.objectUser` role to the Kubernetes service account and bucket IAM policy. See [Identifying projects](https://cloud.google.com/resource-manager/docs/creating-managing-projects#identifying_projects) to find your project ID and project number: ``` PROJECT_ID= PROJECT_NUMBER= gcloud storage buckets add-iam-policy-binding gs://${BUCKET} --member "principal://iam.googleapis.com/projects/${PROJECT_NUMBER}/locations/global/workloadIdentityPools/${PROJECT_ID}.svc.id.goog/subject/ns/default/sa/pytorch-distributed-training" --role "roles/storage.objectUser" ``` See [Access Cloud Storage buckets with the Cloud Storage FUSE CSI driver](https://cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/cloud-storage-fuse-csi-driver) for more details. ## Deploy the RayJob Download the RayJob that executes all the steps documented in [Finetuning a Pytorch Image Classifier with Ray Train](https://docs.ray.io/en/latest/train/examples/pytorch/pytorch_resnet_finetune.html). The [source code](https://github.com/ray-project/kuberay/tree/master/ray-operator/config/samples/pytorch-resnet-image-classifier) is also in the KubeRay repository. ``` curl -LO https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/pytorch-resnet-image-classifier/ray-job.pytorch-image-classifier.yaml ``` Modify the RayJob by replacing all instances of the `GCS_BUCKET` placeholder with the Google Cloud Storage bucket you created earlier. Alternatively you can use `sed`: ``` sed -i "s/GCS_BUCKET/$BUCKET/g" ray-job.pytorch-image-classifier.yaml ``` Deploy the RayJob: ``` kubectl create -f ray-job.pytorch-image-classifier.yaml ``` The deployed RayJob includes the following configuration to enable distributed checkpointing to a shared filesystem: * 4 Ray workers, each with a single GPU. * All Ray nodes use the `pytorch-distributed-training` ServiceAccount, which we created earlier. * Includes volumes that are managed by the `gcsfuse.csi.storage.gke.io` CSI driver. * Mounts a shared storage path `/mnt/cluster_storage`, backed by the GCS bucket you created earlier. You can configure the Pod with annotations, which allows for finer grain control of the GCSFuse sidecar container. See [Specify Pod annotations](https://cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/cloud-storage-fuse-csi-driver#pod-annotations) for more details. ``` annotations: gke-gcsfuse/volumes: "true" gke-gcsfuse/cpu-limit: "0" gke-gcsfuse/memory-limit: 5Gi gke-gcsfuse/ephemeral-storage-limit: 10Gi ``` You can also specify mount options when defining the GCSFuse container volume: ``` csi: driver: gcsfuse.csi.storage.gke.io volumeAttributes: bucketName: GCS_BUCKET mountOptions: "implicit-dirs,uid=1000,gid=100" ``` See [Mount options](https://cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/cloud-storage-fuse-csi-driver#mount-options) to learn more about mount options. Logs from the Ray job should indicate the use of the shared remote filesystem in `/mnt/cluster_storage` and the checkpointing directory. For example: ``` Training finished iteration 10 at 2024-04-29 10:22:08. 
Total running time: 1min 30s ╭─────────────────────────────────────────╮ │ Training result │ ├─────────────────────────────────────────┤ │ checkpoint_dir_name checkpoint_000009 │ │ time_this_iter_s 6.47154 │ │ time_total_s 74.5547 │ │ training_iteration 10 │ │ acc 0.24183 │ │ loss 0.06882 │ ╰─────────────────────────────────────────╯ Training saved a checkpoint for iteration 10 at: (local)/mnt/cluster_storage/finetune-resnet/TorchTrainer_cbb82_00000_0_2024-04-29_10-20-37/checkpoint_000009 ``` ## Inspect checkpointing data Once the RayJob completes, you can inspect the contents of your bucket using a tool like [gsutil](https://cloud.google.com/storage/docs/gsutil). ``` gsutil ls gs://my-ray-bucket/** gs://my-ray-bucket/finetune-resnet/ gs://my-ray-bucket/finetune-resnet/.validate_storage_marker gs://my-ray-bucket/finetune-resnet/TorchTrainer_96923_00000_0_2024-04-29_17-21-29/ gs://my-ray-bucket/finetune-resnet/TorchTrainer_96923_00000_0_2024-04-29_17-21-29/checkpoint_000007/ gs://my-ray-bucket/finetune-resnet/TorchTrainer_96923_00000_0_2024-04-29_17-21-29/checkpoint_000007/checkpoint.pt gs://my-ray-bucket/finetune-resnet/TorchTrainer_96923_00000_0_2024-04-29_17-21-29/checkpoint_000008/ gs://my-ray-bucket/finetune-resnet/TorchTrainer_96923_00000_0_2024-04-29_17-21-29/checkpoint_000008/checkpoint.pt gs://my-ray-bucket/finetune-resnet/TorchTrainer_96923_00000_0_2024-04-29_17-21-29/checkpoint_000009/ gs://my-ray-bucket/finetune-resnet/TorchTrainer_96923_00000_0_2024-04-29_17-21-29/checkpoint_000009/checkpoint.pt gs://my-ray-bucket/finetune-resnet/TorchTrainer_96923_00000_0_2024-04-29_17-21-29/error.pkl gs://my-ray-bucket/finetune-resnet/TorchTrainer_96923_00000_0_2024-04-29_17-21-29/error.txt gs://my-ray-bucket/finetune-resnet/TorchTrainer_96923_00000_0_2024-04-29_17-21-29/events.out.tfevents.1714436502.orch-image-classifier-nc2sq-raycluster-tdrfx-head-xzcl8 gs://my-ray-bucket/finetune-resnet/TorchTrainer_96923_00000_0_2024-04-29_17-21-29/events.out.tfevents.1714436809.orch-image-classifier-zz4sj-raycluster-vn7kz-head-lwx8k gs://my-ray-bucket/finetune-resnet/TorchTrainer_96923_00000_0_2024-04-29_17-21-29/params.json gs://my-ray-bucket/finetune-resnet/TorchTrainer_96923_00000_0_2024-04-29_17-21-29/params.pkl gs://my-ray-bucket/finetune-resnet/TorchTrainer_96923_00000_0_2024-04-29_17-21-29/progress.csv gs://my-ray-bucket/finetune-resnet/TorchTrainer_96923_00000_0_2024-04-29_17-21-29/result.json gs://my-ray-bucket/finetune-resnet/basic-variant-state-2024-04-29_17-21-29.json gs://my-ray-bucket/finetune-resnet/basic-variant-state-2024-04-29_17-26-35.json gs://my-ray-bucket/finetune-resnet/experiment_state-2024-04-29_17-21-29.json gs://my-ray-bucket/finetune-resnet/experiment_state-2024-04-29_17-26-35.json gs://my-ray-bucket/finetune-resnet/trainer.pkl gs://my-ray-bucket/finetune-resnet/tuner.pkl ``` ## Resuming from checkpoint In the event of a failed job, you can use the latest checkpoint to resume training of the model. 
This example configures `TorchTrainer` to automatically resume from the latest checkpoint: ```python experiment_path = os.path.expanduser("/mnt/cluster_storage/finetune-resnet") if TorchTrainer.can_restore(experiment_path): trainer = TorchTrainer.restore(experiment_path, train_loop_per_worker=train_loop_per_worker, train_loop_config=train_loop_config, scaling_config=scaling_config, run_config=run_config, ) else: trainer = TorchTrainer( train_loop_per_worker=train_loop_per_worker, train_loop_config=train_loop_config, scaling_config=scaling_config, run_config=run_config, ) ``` You can verify automatic checkpoint recovery by redeploying the same RayJob: ``` kubectl create -f ray-job.pytorch-image-classifier.yaml ``` If the previous job succeeded, the training job should restore the checkpoint state from the `checkpoint_000009` directory and then immediately complete training with 0 iterations: ``` 2024-04-29 15:51:32,528 INFO experiment_state.py:366 -- Trying to find and download experiment checkpoint at /mnt/cluster_storage/finetune-resnet 2024-04-29 15:51:32,651 INFO experiment_state.py:396 -- A remote experiment checkpoint was found and will be used to restore the previous experiment state. 2024-04-29 15:51:32,652 INFO tune_controller.py:404 -- Using the newest experiment state file found within the experiment directory: experiment_state-2024-04-29_15-43-40.json View detailed results here: /mnt/cluster_storage/finetune-resnet To visualize your results with TensorBoard, run: `tensorboard --logdir /home/ray/ray_results/finetune-resnet` Result( metrics={'loss': 0.070047477101968, 'acc': 0.23529411764705882}, path='/mnt/cluster_storage/finetune-resnet/TorchTrainer_ecc04_00000_0_2024-04-29_15-43-40', filesystem='local', checkpoint=Checkpoint(filesystem=local, path=/mnt/cluster_storage/finetune-resnet/TorchTrainer_ecc04_00000_0_2024-04-29_15-43-40/checkpoint_000009) ) ``` If the previous job failed at an earlier checkpoint, the job should resume from the last saved checkpoint and run until `max_epochs=10`. For example, if the last run failed at epoch 7, the training automatically resumes using `checkpoint_000006` and run 3 more iterations until epoch 10: ``` (TorchTrainer pid=611, ip=10.108.2.65) Restored on 10.108.2.65 from checkpoint: Checkpoint(filesystem=local, path=/mnt/cluster_storage/finetune-resnet/TorchTrainer_96923_00000_0_2024-04-29_17-21-29/checkpoint_000006) (RayTrainWorker pid=671, ip=10.108.2.65) Setting up process group for: env:// [rank=0, world_size=4] (TorchTrainer pid=611, ip=10.108.2.65) Started distributed worker processes: (TorchTrainer pid=611, ip=10.108.2.65) - (ip=10.108.2.65, pid=671) world_rank=0, local_rank=0, node_rank=0 (TorchTrainer pid=611, ip=10.108.2.65) - (ip=10.108.1.83, pid=589) world_rank=1, local_rank=0, node_rank=1 (TorchTrainer pid=611, ip=10.108.2.65) - (ip=10.108.0.72, pid=590) world_rank=2, local_rank=0, node_rank=2 (TorchTrainer pid=611, ip=10.108.2.65) - (ip=10.108.3.76, pid=590) world_rank=3, local_rank=0, node_rank=3 (RayTrainWorker pid=589, ip=10.108.1.83) Downloading: "https://download.pytorch.org/models/resnet50-0676ba61.pth" to /home/ray/.cache/torch/hub/checkpoints/resnet50-0676ba61.pth (RayTrainWorker pid=671, ip=10.108.2.65) 0%| | 0.00/97.8M [00:00 **Note:** The Python files for the Ray Serve application and its client are in the repository [ray-project/serve_config_examples](https://github.com/ray-project/serve_config_examples). 
## Step 1: Create a Kubernetes cluster with Kind ```sh kind create cluster --image=kindest/node:v1.26.0 ``` ## Step 2: Install KubeRay operator Follow [this document](kuberay-operator-deploy) to install the latest stable KubeRay operator from the Helm repository. Note that the YAML file in this example uses `serveConfigV2`. You need KubeRay version v0.6.0 or later to use this feature. ## Step 3: Install a RayService ```sh # Create a RayService kubectl apply -f https://raw.githubusercontent.com/ray-project/kuberay/v1.5.1/ray-operator/config/samples/ray-service.mobilenet.yaml ``` * The [mobilenet.py](https://github.com/ray-project/serve_config_examples/blob/master/mobilenet/mobilenet.py) file needs `tensorflow` as a dependency. Hence, the YAML file uses `rayproject/ray-ml` image instead of `rayproject/ray` image. * The request parsing function `starlette.requests.form()` needs `python-multipart`, so the YAML file includes `python-multipart` in the runtime environment. ## Step 4: Forward the port for Ray Serve ```sh # Wait for the RayService to be ready to serve requests kubectl describe rayservice/rayservice-mobilenet # Conditions: # Last Transition Time: 2025-02-13T02:29:26Z # Message: Number of serve endpoints is greater than 0 # Observed Generation: 1 # Reason: NonZeroServeEndpoints # Status: True # Type: Ready # Forward the port for Ray Serve service kubectl port-forward svc/rayservice-mobilenet-serve-svc 8000 ``` ## Step 5: Send a request to the ImageClassifier * Step 5.1: Prepare an image file. * Step 5.2: Update `image_path` in [mobilenet_req.py](https://github.com/ray-project/serve_config_examples/blob/master/mobilenet/mobilenet_req.py) * Step 5.3: Send a request to the `ImageClassifier`. ```sh python mobilenet_req.py # sample output: {"prediction":["n02099601","golden_retriever",0.17944198846817017]} ``` --- (kuberay-modin-example)= # Use Modin with Ray on Kubernetes This example runs a modified version of the [Using Modin with the NYC Taxi Dataset](https://github.com/modin-project/modin/blob/4e7afa7ea59c7a160ed504f39652ff23b4d49be3/examples/jupyter/Modin_Taxi.ipynb) example from the Modin official repository using RayJob on Kubernetes. ## Step 1: Install KubeRay operator Follow [KubeRay Operator Installation](kuberay-operator-deploy) to install KubeRay operator. ## Step 2: Run the Modin example with RayJob Create a RayJob that runs the Modin example using the following command: ```sh kubectl apply -f https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/ray-job.modin.yaml ``` ## Step 3: Check the output Run the following command to check the output: ```sh kubectl logs -l=job-name=rayjob-sample # [Example output] # 2024-07-05 10:01:00,945 INFO worker.py:1446 -- Using address 10.244.0.4:6379 set in the environment variable RAY_ADDRESS # 2024-07-05 10:01:00,945 INFO worker.py:1586 -- Connecting to existing Ray cluster at address: 10.244.0.4:6379... # 2024-07-05 10:01:00,948 INFO worker.py:1762 -- Connected to Ray cluster. View the dashboard at 10.244.0.4:8265 # Modin Engine: Ray # FutureWarning: DataFrame.applymap has been deprecated. Use DataFrame.map instead. 
# Time to compute isnull: 0.065887747972738 # Time to compute rounded_trip_distance: 0.34410698304418474 # 2024-07-05 10:01:23,069 SUCC cli.py:60 -- ----------------------------------- # 2024-07-05 10:01:23,069 SUCC cli.py:61 -- Job 'rayjob-sample-zt8wj' succeeded # 2024-07-05 10:01:23,069 SUCC cli.py:62 -- ----------------------------------- ``` --- (kuberay-batch-inference-example)= # RayJob Batch Inference Example This example demonstrates how to use the RayJob custom resource to run a batch inference job for an image classification workload on a Ray cluster. See [Image Classification Batch Inference with HuggingFace Vision Transformer](https://docs.ray.io/en/latest/data/examples/huggingface_vit_batch_prediction.html) for a full explanation of the code. ## Prerequisites You must have a Kubernetes cluster running,`kubectl` configured to use it, and GPUs available. This example provides a brief tutorial for setting up the necessary GPUs on Google Kubernetes Engine (GKE), but you can use any Kubernetes cluster with GPUs. ## Step 0: Create a Kubernetes cluster on GKE (Optional) If you already have a Kubernetes cluster with GPUs, you can skip this step. Otherwise, follow [this tutorial](kuberay-gke-gpu-cluster-setup), but substitute the following GPU node pool creation command to create a Kubernetes cluster on GKE with four NVIDIA T4 GPUs: ```sh gcloud container node-pools create gpu-node-pool \ --accelerator type=nvidia-tesla-t4,count=4,gpu-driver-version=default \ --zone us-west1-b \ --cluster kuberay-gpu-cluster \ --num-nodes 1 \ --min-nodes 0 \ --max-nodes 1 \ --enable-autoscaling \ --machine-type n1-standard-64 ``` This example uses four [NVIDIA T4](https://cloud.google.com/compute/docs/gpus#nvidia_t4_gpus) GPUs. The machine type is `n1-standard-64`, which has [64 vCPUs and 240 GB RAM](https://cloud.google.com/compute/docs/general-purpose-machines#n1_machine_types). ## Step 1: Install the KubeRay Operator Follow [this document](kuberay-operator-deploy) to install the latest stable KubeRay operator from the Helm repository. The KubeRay operator Pod must be on the CPU node if you have set up the taint for the GPU node pool correctly. ## Step 2: Submit the RayJob Create the RayJob custom resource with [ray-job.batch-inference.yaml](https://github.com/ray-project/kuberay/blob/v1.5.1/ray-operator/config/samples/ray-job.batch-inference.yaml). Download the file with `curl`: ```bash curl -LO https://raw.githubusercontent.com/ray-project/kuberay/v1.5.1/ray-operator/config/samples/ray-job.batch-inference.yaml ``` Note that the `RayJob` spec contains a spec for the `RayCluster`. This tutorial uses a single-node cluster with 4 GPUs. For production use cases, use a multi-node cluster where the head node doesn't have GPUs, so that Ray can automatically schedule GPU workloads on worker nodes which won't interfere with critical Ray processes on the head node. Note the following fields in the `RayJob` spec, which specify the Ray image and the GPU resources for the Ray node: ```yaml spec: containers: - name: ray-head image: rayproject/ray-ml:2.6.3-gpu resources: limits: nvidia.com/gpu: "4" cpu: "54" memory: "54Gi" requests: nvidia.com/gpu: "4" cpu: "54" memory: "54Gi" volumeMounts: - mountPath: /home/ray/samples name: code-sample nodeSelector: cloud.google.com/gke-accelerator: nvidia-tesla-t4 # This is the GPU type we used in the GPU node pool. 
```

To submit the job, run the following command:

```bash
kubectl apply -f ray-job.batch-inference.yaml
```

Check the status with `kubectl describe rayjob rayjob-sample`. Sample output:

```
[...]
Status:
  Dashboard URL:          rayjob-sample-raycluster-j6t8n-head-svc.default.svc.cluster.local:8265
  End Time:               ...
  Job Deployment Status:  Complete
  Job Id:                 rayjob-sample-ft8lh
  Job Status:             SUCCEEDED
  Message:                Job finished successfully.
  Observed Generation:    2
...
```

To view the logs, first find the name of the pod running the job with `kubectl get pods`. Sample output:

```bash
NAME                                        READY   STATUS      RESTARTS   AGE
kuberay-operator-8b86754c-r4rc2             1/1     Running     0          25h
rayjob-sample-raycluster-j6t8n-head-kx2gz   1/1     Running     0          35m
rayjob-sample-w98c7                         0/1     Completed   0          30m
```

The Ray cluster is still running because `shutdownAfterJobFinishes` isn't set in the `RayJob` spec. If you set `shutdownAfterJobFinishes` to `true`, the cluster is shut down after the job finishes.

Next, run:

```text
kubectl logs rayjob-sample-w98c7
```

to get the standard output of the `entrypoint` command for the `RayJob`. Sample output:

```text
[...]
Running: 62.0/64.0 CPU, 4.0/4.0 GPU, 955.57 MiB/12.83 GiB object_store_memory:   0%| | 0/200 [00:05
Label: tench, Tinca tinca
Label: tench, Tinca tinca
Label: tench, Tinca tinca
Label: tench, Tinca tinca
Label: tench, Tinca tinca
2023-08-22 15:48:36,522 SUCC cli.py:33 -- -----------------------------------
2023-08-22 15:48:36,522 SUCC cli.py:34 -- Job 'rayjob-sample-ft8lh' succeeded
2023-08-22 15:48:36,522 SUCC cli.py:35 -- -----------------------------------
```

---

(kuberay-kueue-gang-scheduling-example)=

# Gang Scheduling with RayJob and Kueue

This guide demonstrates how to use Kueue for gang scheduling RayJob resources, taking advantage of dynamic resource provisioning and queueing on Kubernetes. To illustrate the concepts, this guide uses the [Fine-tune a PyTorch Lightning Text Classifier with Ray Data](https://docs.ray.io/en/master/train/examples/lightning/lightning_cola_advanced.html) example.

## Gang scheduling

Gang scheduling in Kubernetes ensures that a group of related Pods, such as those in a Ray cluster, only starts when all required resources are available. This all-or-nothing requirement is crucial when working with expensive, limited resources like GPUs.

## Kueue

[Kueue](https://kueue.sigs.k8s.io/) is a Kubernetes-native system that manages quotas and how jobs consume them. Kueue decides when:

* To make a job wait.
* To admit a job to start, which triggers Kubernetes to create Pods.
* To preempt a job, which triggers Kubernetes to delete active Pods.

Kueue has native support for some KubeRay APIs. Specifically, you can use Kueue to manage resources that RayJob, RayCluster, and RayService consume. See the [Kueue documentation](https://kueue.sigs.k8s.io/docs/overview/) to learn more.

## Why use gang scheduling

Gang scheduling is essential when working with expensive, limited hardware accelerators like GPUs. It prevents RayJobs from partially provisioning Ray clusters and claiming but not using the GPUs. Kueue suspends a RayJob until the Kubernetes cluster and the underlying cloud provider can guarantee the capacity that the RayJob needs to execute. This approach greatly improves GPU utilization and reduces cost, especially when GPU availability is limited.
## Create a Kubernetes cluster on GKE Create a GKE cluster with the `enable-autoscaling` option: ```bash gcloud container clusters create kuberay-gpu-cluster \ --num-nodes=1 --min-nodes 0 --max-nodes 1 --enable-autoscaling \ --zone=us-east4-c --machine-type e2-standard-4 ``` Create a GPU node pool with the `enable-queued-provisioning` option enabled: ```bash gcloud container node-pools create gpu-node-pool \ --accelerator type=nvidia-l4,count=1,gpu-driver-version=latest \ --enable-queued-provisioning \ --reservation-affinity=none \ --zone us-east4-c \ --cluster kuberay-gpu-cluster \ --num-nodes 0 \ --min-nodes 0 \ --max-nodes 10 \ --enable-autoscaling \ --machine-type g2-standard-4 ``` This command creates a node pool, which initially has zero nodes. The `--enable-queued-provisioning` flag enables "queued provisioning" in the Kubernetes node autoscaler using the ProvisioningRequest API. More details are below. You need to use the `--reservation-affinity=none` flag because GKE doesn't support Node Reservations with ProvisioningRequest. ## Install the KubeRay operator Follow [Deploy a KubeRay operator](kuberay-operator-deploy) to install the latest stable KubeRay operator from the Helm repository. The KubeRay operator Pod must be on the CPU node if you set up the taint for the GPU node pool correctly. ## Install Kueue Install the latest released version of Kueue. ``` kubectl apply --server-side -f https://github.com/kubernetes-sigs/kueue/releases/download/v0.13.4/manifests.yaml ``` See [Kueue Installation](https://kueue.sigs.k8s.io/docs/installation/#install-a-released-version) for more details on installing Kueue. ## Configure Kueue for gang scheduling Next, configure Kueue for gang scheduling. Kueue leverages the ProvisioningRequest API for two key tasks: 1. Dynamically adding new nodes to the cluster when a job needs more resources. 2. Blocking the admission of new jobs that are waiting for sufficient resources to become available. See [How ProvisioningRequest works](https://cloud.google.com/kubernetes-engine/docs/how-to/provisioningrequest#how-provisioningrequest-works) for more details. 
### Create Kueue resources This manifest creates the following resources: * [ClusterQueue](https://kueue.sigs.k8s.io/docs/concepts/cluster_queue/): Defines quotas and fair sharing rules * [LocalQueue](https://kueue.sigs.k8s.io/docs/concepts/local_queue/): A namespaced queue, belonging to a tenant, that references a ClusterQueue * [ResourceFlavor](https://kueue.sigs.k8s.io/docs/concepts/resource_flavor/): Defines what resources are available in the cluster, typically from Nodes * [AdmissionCheck](https://kueue.sigs.k8s.io/docs/concepts/admission_check/): A mechanism allowing components to influence the timing of a workload admission ```yaml # kueue-resources.yaml apiVersion: kueue.x-k8s.io/v1beta1 kind: ResourceFlavor metadata: name: "default-flavor" --- apiVersion: kueue.x-k8s.io/v1beta1 kind: AdmissionCheck metadata: name: rayjob-gpu spec: controllerName: kueue.x-k8s.io/provisioning-request parameters: apiGroup: kueue.x-k8s.io kind: ProvisioningRequestConfig name: rayjob-gpu-config --- apiVersion: kueue.x-k8s.io/v1beta1 kind: ProvisioningRequestConfig metadata: name: rayjob-gpu-config spec: provisioningClassName: queued-provisioning.gke.io managedResources: - nvidia.com/gpu --- apiVersion: kueue.x-k8s.io/v1beta1 kind: ClusterQueue metadata: name: "cluster-queue" spec: namespaceSelector: {} # match all resourceGroups: - coveredResources: ["cpu", "memory", "nvidia.com/gpu"] flavors: - name: "default-flavor" resources: - name: "cpu" nominalQuota: 10000 # infinite quotas - name: "memory" nominalQuota: 10000Gi # infinite quotas - name: "nvidia.com/gpu" nominalQuota: 10000 # infinite quotas admissionChecks: - rayjob-gpu --- apiVersion: kueue.x-k8s.io/v1beta1 kind: LocalQueue metadata: namespace: "default" name: "user-queue" spec: clusterQueue: "cluster-queue" ``` Create the Kueue resources: ```bash kubectl apply -f kueue-resources.yaml ``` :::{note} This example configures Kueue to orchestrate the gang scheduling of GPUs. However, you can use other resources such as CPU and memory. ::: ## Deploy a RayJob Download the RayJob that executes all the steps documented in [Fine-tune a PyTorch Lightning Text Classifier](https://docs.ray.io/en/master/train/examples/lightning/lightning_cola_advanced.html). The [source code](https://github.com/ray-project/kuberay/tree/master/ray-operator/config/samples/pytorch-text-classifier) is also in the KubeRay repository. ```bash curl -LO https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/pytorch-text-classifier/ray-job.pytorch-distributed-training.yaml ``` Before creating the RayJob, modify the RayJob metadata with a label to assign the RayJob to the LocalQueue that you created earlier: ```yaml metadata: generateName: pytorch-text-classifier- labels: kueue.x-k8s.io/queue-name: user-queue ``` Deploy the RayJob: ```bash $ kubectl create -f ray-job.pytorch-distributed-training.yaml rayjob.ray.io/dev-pytorch-text-classifier-r6d4p created ``` ## Gang scheduling with RayJob Following is the expected behavior when you deploy a GPU-requiring RayJob to a cluster that initially lacks GPUs: * Kueue suspends the RayJob due to insufficient GPU resources in the cluster. * Kueue creates a ProvisioningRequest, specifying the GPU requirements for the RayJob. * The Kubernetes node autoscaler monitors ProvisioningRequests and adds nodes with GPUs as needed. * Once the required GPU nodes are available, the ProvisioningRequest is satisfied. 
* Kueue admits the RayJob, allowing Kubernetes to schedule the Ray nodes on the newly provisioned nodes, and the RayJob execution begins. If GPUs are unavailable, Kueue keeps suspending the RayJob. In addition, the node autoscaler avoids provisioning new nodes until it can fully satisfy the RayJob's GPU requirements. Upon creating a RayJob, notice that the RayJob status is immediately `suspended` despite the ClusterQueue having GPU quotas available. ```bash $ kubectl get rayjob pytorch-text-classifier-rj4sg -o yaml apiVersion: ray.io/v1 kind: RayJob metadata: name: pytorch-text-classifier-rj4sg labels: kueue.x-k8s.io/queue-name: user-queue ... ... ... status: jobDeploymentStatus: Suspended # RayJob suspended jobId: pytorch-text-classifier-rj4sg-pj9hx jobStatus: PENDING ``` Kueue keeps suspending this RayJob until its corresponding ProvisioningRequest is satisfied. List ProvisioningRequest resources and their status with this command: ```bash $ kubectl get provisioningrequest NAME ACCEPTED PROVISIONED FAILED AGE rayjob-pytorch-text-classifier-nv77q-e95ec-rayjob-gpu-1 True False False 22s ``` Note the two columns in the output: `ACCEPTED` and `PROVISIONED`. `ACCEPTED=True` means that Kueue and the Kubernetes node autoscaler have acknowledged the request. `PROVISIONED=True` means that the Kubernetes node autoscaler has completed provisioning nodes. Once both of these conditions are true, the ProvisioningRequest is satisfied. ```bash $ kubectl get provisioningrequest NAME ACCEPTED PROVISIONED FAILED AGE rayjob-pytorch-text-classifier-nv77q-e95ec-rayjob-gpu-1 True True False 57s ``` Because the example RayJob requires 1 GPU for fine-tuning, the ProvisioningRequest is satisfied by the addition of a single GPU node in the `gpu-node-pool` Node Pool. ```bash $ kubectl get nodes NAME STATUS ROLES AGE VERSION gke-kuberay-gpu-cluster-default-pool-8d883840-fd6d Ready 14m v1.29.0-gke.1381000 gke-kuberay-gpu-cluster-gpu-node-pool-b176212e-g3db Ready 46s v1.29.0-gke.1381000 # new node with GPUs ``` Once the ProvisioningRequest is satisfied, Kueue admits the RayJob. The Kubernetes scheduler then immediately places the head and worker nodes onto the newly provisioned resources. The ProvisioningRequest ensures a seamless Ray cluster start up, with no scheduling delays for any Pods. ```bash $ kubectl get pods NAME READY STATUS RESTARTS AGE pytorch-text-classifier-nv77q-g6z57 1/1 Running 0 13s torch-text-classifier-nv77q-raycluster-gstrk-head-phnfl 1/1 Running 0 6m43s ``` --- (kuberay-kueue-priority-scheduling-example)= # Priority Scheduling with RayJob and Kueue This guide shows how to run [Fine-tune a PyTorch Lightning Text Classifier with Ray Data](https://docs.ray.io/en/master/train/examples/lightning/lightning_cola_advanced.html) example as a RayJob and leverage Kueue to orchestrate priority scheduling and quota management. ## What's Kueue? [Kueue](https://kueue.sigs.k8s.io/) is a Kubernetes-native job queueing system that manages quotas and how jobs consume them. Kueue decides when: * To make a job wait * To admit a job to start, meaning that Kubernetes creates pods. * To preempt a job, meaning that Kubernetes deletes active pods. Kueue has native support for some KubeRay APIs. Specifically, you can use Kueue to manage resources consumed by RayJob and RayCluster. See the [Kueue documentation](https://kueue.sigs.k8s.io/docs/overview/) to learn more. ## Step 0: Create a Kubernetes cluster on GKE (Optional) If you already have a Kubernetes cluster with GPUs, you can skip this step. 
Otherwise, follow [Start Google Cloud GKE Cluster with GPUs for KubeRay](kuberay-gke-gpu-cluster-setup) to set up a Kubernetes cluster on GKE. ## Step 1: Install the KubeRay operator Follow [Deploy a KubeRay operator](kuberay-operator-deploy) to install the latest stable KubeRay operator from the Helm repository. The KubeRay operator Pod must be on the CPU node if you set up the taint for the GPU node pool correctly. ## Step 2: Install Kueue ```bash VERSION=v0.13.4 kubectl apply --server-side -f https://github.com/kubernetes-sigs/kueue/releases/download/$VERSION/manifests.yaml ``` See [Kueue Installation](https://kueue.sigs.k8s.io/docs/installation/#install-a-released-version) for more details on installing Kueue. ## Step 3: Configure Kueue with priority scheduling To understand this tutorial, it's important to understand the following Kueue concepts: * [ResourceFlavor](https://kueue.sigs.k8s.io/docs/concepts/resource_flavor/) * [ClusterQueue](https://kueue.sigs.k8s.io/docs/concepts/cluster_queue/) * [LocalQueue](https://kueue.sigs.k8s.io/docs/concepts/local_queue/) * [WorkloadPriorityClass](https://kueue.sigs.k8s.io/docs/concepts/workload_priority_class/) ```yaml # kueue-resources.yaml apiVersion: kueue.x-k8s.io/v1beta1 kind: ResourceFlavor metadata: name: "default-flavor" --- apiVersion: kueue.x-k8s.io/v1beta1 kind: ClusterQueue metadata: name: "cluster-queue" spec: preemption: withinClusterQueue: LowerPriority namespaceSelector: {} # Match all namespaces. resourceGroups: - coveredResources: ["cpu", "memory", "nvidia.com/gpu"] flavors: - name: "default-flavor" resources: - name: "cpu" nominalQuota: 2 - name: "memory" nominalQuota: 8G - name: "nvidia.com/gpu" # ClusterQueue only has quota for a single GPU. nominalQuota: 1 --- apiVersion: kueue.x-k8s.io/v1beta1 kind: LocalQueue metadata: namespace: "default" name: "user-queue" spec: clusterQueue: "cluster-queue" --- apiVersion: kueue.x-k8s.io/v1beta1 kind: WorkloadPriorityClass metadata: name: prod-priority value: 1000 description: "Priority class for prod jobs" --- apiVersion: kueue.x-k8s.io/v1beta1 kind: WorkloadPriorityClass metadata: name: dev-priority value: 100 description: "Priority class for development jobs" ``` The YAML manifest configures: * **ResourceFlavor** * The ResourceFlavor `default-flavor` is an empty ResourceFlavor because the compute resources in the Kubernetes cluster are homogeneous. In other words, users can request 1 GPU without considering whether it's an NVIDIA A100 or a T4 GPU. * **ClusterQueue** * The ClusterQueue `cluster-queue` only has 1 ResourceFlavor `default-flavor` with quotas for 2 CPUs, 8G memory, and 1 GPU. It exactly matches the resources requested by 1 RayJob custom resource. ***Hence, only 1 RayJob can run at a time.*** * The ClusterQueue `cluster-queue` has a preemption policy `withinClusterQueue: LowerPriority`. This policy allows the pending RayJob that doesn’t fit within the nominal quota for its ClusterQueue to preempt active RayJob custom resources in the ClusterQueue that have lower priority. * **LocalQueue** * The LocalQueue `user-queue` is a namespaced object in the `default` namespace which belongs to a ClusterQueue. A typical practice is to assign a namespace to a tenant, team or user, of an organization. Users submit jobs to a LocalQueue, instead of to a ClusterQueue directly. * **WorkloadPriorityClass** * The WorkloadPriorityClass `prod-priority` has a higher value than the WorkloadPriorityClass `dev-priority`. 
This means that RayJob custom resources with the `prod-priority` priority class take precedence over RayJob custom resources with the `dev-priority` priority class. Create the Kueue resources: ```bash kubectl apply -f kueue-resources.yaml ``` ## Step 4: Deploy a RayJob Download the RayJob that executes all the steps documented in [Fine-tune a PyTorch Lightning Text Classifier](https://docs.ray.io/en/master/train/examples/lightning/lightning_cola_advanced.html). The [source code](https://github.com/ray-project/kuberay/tree/master/ray-operator/config/samples/pytorch-text-classifier) is also in the KubeRay repository. ```bash curl -LO https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/pytorch-text-classifier/ray-job.pytorch-distributed-training.yaml ``` Before creating the RayJob, modify the RayJob metadata with: ```yaml metadata: generateName: dev-pytorch-text-classifier- labels: kueue.x-k8s.io/queue-name: user-queue kueue.x-k8s.io/priority-class: dev-priority ``` * `kueue.x-k8s.io/queue-name: user-queue`: As the previous step mentioned, users submit jobs to a LocalQueue instead of directly to a ClusterQueue. * `kueue.x-k8s.io/priority-class: dev-priority`: Assign the RayJob with the `dev-priority` WorkloadPriorityClass. * A modified name to indicate that this job is for development. Also note the resources required for this RayJob by looking at the resources that the Ray head Pod requests: ```yaml resources: limits: memory: "8G" nvidia.com/gpu: "1" requests: cpu: "2" memory: "8G" nvidia.com/gpu: "1" ``` Now deploy the RayJob: ```bash $ kubectl create -f ray-job.pytorch-distributed-training.yaml rayjob.ray.io/dev-pytorch-text-classifier-r6d4p created ``` Verify that the RayCluster and the submitter Kubernetes Job are running: ```bash $ kubectl get pod NAME READY STATUS RESTARTS AGE dev-pytorch-text-classifier-r6d4p-4nczg 1/1 Running 0 4s # Submitter Kubernetes Job torch-text-classifier-r6d4p-raycluster-br45j-head-8bbwt 1/1 Running 0 34s # Ray head Pod ``` Delete the RayJob after verifying that the job has completed successfully. ```bash $ kubectl get rayjobs.ray.io dev-pytorch-text-classifier-r6d4p -o jsonpath='{.status.jobStatus}' SUCCEEDED $ kubectl get rayjobs.ray.io dev-pytorch-text-classifier-r6d4p -o jsonpath='{.status.jobDeploymentStatus}' Complete $ kubectl delete rayjob dev-pytorch-text-classifier-r6d4p rayjob.ray.io "dev-pytorch-text-classifier-r6d4p" deleted ``` ## Step 5: Queuing multiple RayJob resources Create 3 RayJob custom resources to see how Kueue interacts with KubeRay to implement job queueing. ```bash $ kubectl create -f ray-job.pytorch-distributed-training.yaml rayjob.ray.io/dev-pytorch-text-classifier-8vg2c created $ kubectl create -f ray-job.pytorch-distributed-training.yaml rayjob.ray.io/dev-pytorch-text-classifier-n5k89 created $ kubectl create -f ray-job.pytorch-distributed-training.yaml rayjob.ray.io/dev-pytorch-text-classifier-ftcs9 created ``` Because each RayJob requests 1 GPU and the ClusterQueue has quotas for only 1 GPU, Kueue automatically suspends new RayJob resources until GPU quotas become available. You can also inspect the `ClusterQueue` to see available and used quotas: ```bash $ kubectl get clusterqueue NAME COHORT PENDING WORKLOADS cluster-queue 2 $ kubectl get clusterqueue cluster-queue -o yaml apiVersion: kueue.x-k8s.io/v1beta1 kind: ClusterQueue ... ... ... status: admittedWorkloads: 1 # Workloads admitted by queue. 
  flavorsReservation:
  - name: default-flavor
    resources:
    - borrowed: "0"
      name: cpu
      total: "8"
    - borrowed: "0"
      name: memory
      total: 19531250Ki
    - borrowed: "0"
      name: nvidia.com/gpu
      total: "2"
  flavorsUsage:
  - name: default-flavor
    resources:
    - borrowed: "0"
      name: cpu
      total: "8"
    - borrowed: "0"
      name: memory
      total: 19531250Ki
    - borrowed: "0"
      name: nvidia.com/gpu
      total: "2"
  pendingWorkloads: 2 # Queued workloads waiting for quotas.
  reservingWorkloads: 1 # Running workloads that are using quotas.
```

## Step 6: Deploy a RayJob with higher priority

At this point, there are multiple RayJob custom resources queued up but only enough quota to run a single RayJob. Now you can create a new RayJob with higher priority to preempt the lower-priority RayJob that's already running.

Modify the RayJob with:

```yaml
metadata:
  generateName: prod-pytorch-text-classifier-
  labels:
    kueue.x-k8s.io/queue-name: user-queue
    kueue.x-k8s.io/priority-class: prod-priority
```

* `kueue.x-k8s.io/queue-name: user-queue`: As the previous step mentioned, users submit jobs to a LocalQueue instead of directly to a ClusterQueue.
* `kueue.x-k8s.io/priority-class: prod-priority`: Assigns the `prod-priority` WorkloadPriorityClass to the RayJob.
* A modified name to indicate that this job is for production.

Create the new RayJob:

```sh
$ kubectl create -f ray-job.pytorch-distributed-training.yaml
rayjob.ray.io/prod-pytorch-text-classifier-gkp9b created
```

Note that higher priority jobs preempt lower priority jobs when there aren't enough quotas for both:

```bash
$ kubectl get pods
NAME                                                      READY   STATUS    RESTARTS   AGE
prod-pytorch-text-classifier-gkp9b-r9k5r                  1/1     Running   0          5s
torch-text-classifier-gkp9b-raycluster-s2f65-head-hfvht   1/1     Running   0          35s
```

---

(kuberay-rayservice-deepseek-example)=

# Serve DeepSeek R1 using Ray Serve LLM

This guide provides step-by-step instructions for deploying a Large Language Model (LLM) using Ray Serve LLM on Kubernetes. Leveraging KubeRay, Ray Serve, and vLLM, this guide deploys the `deepseek-ai/DeepSeek-R1` model from Hugging Face, enabling scalable, efficient, and OpenAI-compatible LLM serving within a Kubernetes environment.

See [Serving LLMs](serving-llms) for information on Ray Serve LLM.

## Prerequisites

The DeepSeek-R1 model requires 2 nodes, each equipped with 8 H100 80 GB GPUs, and is deployable on any Kubernetes cluster that meets this requirement. This guide provides instructions for setting up a GKE cluster using [A3 High](https://cloud.google.com/compute/docs/gpus#a3-high) or [A3 Mega](https://cloud.google.com/compute/docs/gpus#a3-mega) machine types. Before creating the cluster, ensure that your project has sufficient [quota](https://console.cloud.google.com/iam-admin/quotas) for the required accelerators.

## Step 1: Create a Kubernetes cluster on GKE

Run this command and all following commands on your local machine or in [Google Cloud Shell](https://cloud.google.com/shell). If running from your local machine, you need to install the [Google Cloud SDK](https://cloud.google.com/sdk/docs/install).

The following command creates a Kubernetes cluster named `kuberay-gpu-cluster` with 1 default CPU node in the `us-east5-a` zone. This example uses the `e2-standard-16` machine type, which has 16 vCPUs and 64 GB of memory.

```sh
gcloud container clusters create kuberay-gpu-cluster \
    --location=us-east5-a \
    --machine-type=e2-standard-16 \
    --num-nodes=1 \
    --enable-image-streaming
```

Run the following command to create an on-demand GPU node pool for Ray GPU workers.
```sh gcloud beta container node-pools create gpu-node-pool \ --cluster kuberay-gpu-cluster \ --machine-type a3-highgpu-8g \ --num-nodes 2 \ --accelerator "type=nvidia-h100-80gb,count=8" \ --zone us-east5-a \ --node-locations us-east5-a \ --host-maintenance-interval=PERIODIC ``` The `--accelerator` flag specifies the type and number of GPUs for each node in the node pool. This example uses the [A3 High](https://cloud.google.com/compute/docs/gpus#a3-high) GPU. The machine type `a3-highgpu-8g` has 8 GPU, 640 GB GPU Memory, 208 vCPUs, and 1872 GB RAM. ```{admonition} Note :class: note To create a node pool that uses reservations, you can specify the following parameters: * `--reservation-affinity=specific` * `--reservation=RESERVATION_NAME` * `--placement-policy=PLACEMENT_POLICY_NAME` (Optional) ``` Run the following `gcloud` command to configure `kubectl` to communicate with your cluster: ```sh gcloud container clusters get-credentials kuberay-gpu-cluster --zone us-east5-a ``` ## Step 2: Install the KubeRay operator Install the most recent stable KubeRay operator from the Helm repository by following [Deploy a KubeRay operator](kuberay-operator-deploy). The Kubernetes `NoSchedule` taint in the example config prevents the KubeRay operator Pod from running on a GPU node. ## Step 3: Deploy a RayService Deploy DeepSeek-R1 as a RayService custom resource by running the following command: ```sh kubectl apply -f https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/ray-service.deepseek.yaml ``` This step sets up a custom Ray Serve application to serve the `deepseek-ai/DeepSeek-R1` model on two worker nodes. You can inspect and modify the `serveConfigV2` section in the YAML file to learn more about the Serve application: ```yaml serveConfigV2: | applications: - args: llm_configs: - model_loading_config: model_id: "deepseek" model_source: "deepseek-ai/DeepSeek-R1" accelerator_type: "H100" deployment_config: autoscaling_config: min_replicas: 1 max_replicas: 1 runtime_env: env_vars: VLLM_USE_V1: "1" engine_kwargs: tensor_parallel_size: 8 pipeline_parallel_size: 2 gpu_memory_utilization: 0.92 dtype: "auto" max_num_seqs: 40 max_model_len: 16384 enable_chunked_prefill: true enable_prefix_caching: true import_path: ray.serve.llm:build_openai_app name: llm_app route_prefix: "/" ``` In particular, this configuration loads the model from `deepseek-ai/DeepSeek-R1` and sets its `model_id` to `deepseek`. The `LLMDeployment` initializes the underlying LLM engine using the `engine_kwargs` field, which includes key performance tuning parameters: - `tensor_parallel_size: 8` This setting enables tensor parallelism, splitting individual large layers of the model across 8 GPUs. Adjust this variable according to the number of GPUs used by cluster nodes. - `pipeline_parallel_size: 2` This setting enables pipeline parallelism, dividing the model's entire set of layers into 2 sequential stages. Adjust this variable according to cluster worker node numbers. The `deployment_config` section sets the desired number of engine replicas. See [Serving LLMs](serving-llms) and the [Ray Serve config documentation](serve-in-production-config-file) for more information. Wait for the RayService resource to become healthy. 
You can confirm its status by running the following command: ```sh kubectl get rayservice deepseek-r1 -o yaml ``` After a few minutes, the result should be similar to the following: ``` status: activeServiceStatus: applicationStatuses: llm_app: serveDeploymentStatuses: LLMDeployment:deepseek: status: HEALTHY LLMRouter: status: HEALTHY status: RUNNING ``` ```{admonition} Note :class: note The model download and deployment will typically take 20-30 minutes. While this is in progress, use the Ray Dashboard (Step 4) Cluster tab to monitor the download progress as disk fills up. ``` ## Step 4: View the Ray dashboard ```sh # Forward the service port kubectl port-forward svc/deepseek-r1-head-svc 8265:8265 ``` Once forwarded, navigate to the Serve tab on the dashboard to review application status, deployments, routers, logs, and other relevant features. ![LLM Serve Application](../images/ray_dashboard_deepseek.png) ## Step 5: Send a request To send requests to the Ray Serve deployment, port-forward port 8000 from the Serve app service: ```sh kubectl port-forward svc/deepseek-r1-serve-svc 8000 ``` Note that this Kubernetes service comes up only after Ray Serve apps are running and ready. Test the service with the following command: ```sh $ curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{ "model": "deepseek", "messages": [ { "role": "user", "content": "I have four boxes. I put the red box on the bottom and put the blue box on top. Then I put the yellow box on top the blue. Then I take the blue box out and put it on top. And finally I put the green box on the top. Give me the final order of the boxes from bottom to top. Show your reasoning but be brief"} ], "temperature": 0.7 }' ``` The output should be in the following format: ``` { "id": "deepseek-653881a7-18f3-493b-a43f-adc8501f01f8", "object": "chat.completion", "created": 1753345252, "model": "deepseek", "choices": [ { "index": 0, "message": { "role": "assistant", "reasoning_content": null, "content": "Okay, let's break this down step by step. The user has four boxes: red, blue, yellow, and green. The starting point is putting the red box on the bottom. Then blue is placed on top of red. Next, yellow goes on top of blue. At this point, the order is red (bottom), blue, yellow. \n\nThen the instruction says to take the blue box out and put it on top. Wait, when they take the blue box out from where? The current stack is red, blue, yellow. If we remove blue from between red and yellow, that leaves red and yellow. Then placing blue on top would make the stack red, yellow, blue. But the problem is, when you remove a box from the middle, the boxes above it should fall down, right? So after removing blue, yellow would be on top of red. Then putting blue on top of that stack would make it red, yellow, blue.\n\nThen the final step is putting the green box on top. So the final order would be red (bottom), yellow, blue, green. Let me verify again to make sure I didn't miss anything. Start with red at bottom. Blue on top of red: red, blue. Yellow on top of blue: red, blue, yellow. Remove blue from the middle, so yellow moves down to be on red, then put blue on top: red, yellow, blue. Finally, add green on top: red, yellow, blue, green. Yes, that seems right.\n\n\nThe final order from bottom to top is: red, yellow, blue, green.\n\n1. Start with red at the bottom. \n2. Add blue on top: red → blue. \n3. Add yellow on top: red → blue → yellow. \n4. 
**Remove blue** from between red and yellow; yellow drops to second position. Now: red → yellow. \n5. Place blue back on top: red → yellow → blue. \n6. Add green on top: red → yellow → blue → green.",
        "tool_calls": []
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 81,
    "total_tokens": 505,
    "completion_tokens": 424,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null
}
```

---

(kuberay-rayservice-llm-example)=

# Serve a Large Language Model using Ray Serve LLM on Kubernetes

This guide provides step-by-step instructions for deploying a Large Language Model (LLM) using Ray Serve LLM on Kubernetes. Leveraging KubeRay, Ray Serve, and vLLM, this guide deploys the `Qwen/Qwen2.5-7B-Instruct` model from Hugging Face, enabling scalable, efficient, and OpenAI-compatible LLM serving within a Kubernetes environment.

See [Serving LLMs](serving-llms) for information on Ray Serve LLM.

## Prerequisites

This example downloads model weights from the [`Qwen/Qwen2.5-7B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) Hugging Face repository. To complete this guide, you need the following:

* A [Hugging Face account](https://huggingface.co/) and a Hugging Face [access token](https://huggingface.co/settings/tokens) with read access to gated repositories.
* In your RayService custom resource, set the `HUGGING_FACE_HUB_TOKEN` environment variable to the Hugging Face token to enable model downloads.
* A Kubernetes cluster with GPUs.

## Step 1: Create a Kubernetes cluster with GPUs

Refer to the Kubernetes cluster setup [instructions](../user-guides/k8s-cluster-setup.md) for guidance on creating a Kubernetes cluster.

## Step 2: Install the KubeRay operator

Install the most recent stable KubeRay operator from the Helm repository by following [Deploy a KubeRay operator](../getting-started/kuberay-operator-installation.md). The Kubernetes `NoSchedule` taint in the example config prevents the KubeRay operator pod from running on a GPU node.

## Step 3: Create a Kubernetes Secret containing your Hugging Face access token

For additional security, instead of passing the HF access token directly as an environment variable, create a Kubernetes secret containing your Hugging Face access token.

Download the Ray Serve LLM service config .yaml file using the following command:

```sh
curl -o ray-service.llm-serve.yaml https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/ray-service.llm-serve.yaml
```

After downloading, update the `hf_token` value in the `Secret` to your private access token.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: hf-token
type: Opaque
stringData:
  hf_token:
```

## Step 4: Deploy a RayService

After adding the Hugging Face access token, create a RayService custom resource using the config file:

```sh
kubectl apply -f ray-service.llm-serve.yaml
```

This step sets up a custom Ray Serve app to serve the `Qwen/Qwen2.5-7B-Instruct` model, creating an OpenAI-compatible server.
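The downloaded manifest exposes the Secret from Step 3 to the Ray containers through the `HUGGING_FACE_HUB_TOKEN` environment variable mentioned in the prerequisites. The exact Pod specs are in the YAML file; the relevant container snippet looks roughly like the following sketch:

```yaml
# Sketch: expose the Hugging Face token to a Ray container through the Secret from Step 3.
env:
  - name: HUGGING_FACE_HUB_TOKEN
    valueFrom:
      secretKeyRef:
        name: hf-token   # Secret name from Step 3.
        key: hf_token    # Key inside the Secret.
```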
You can inspect and modify the `serveConfigV2` section in the YAML file to learn more about the Serve app:

```yaml
serveConfigV2: |
  applications:
  - name: llms
    import_path: ray.serve.llm:build_openai_app
    route_prefix: "/"
    args:
      llm_configs:
      - model_loading_config:
          model_id: qwen2.5-7b-instruct
          model_source: Qwen/Qwen2.5-7B-Instruct
        engine_kwargs:
          dtype: bfloat16
          max_model_len: 1024
          device: auto
          gpu_memory_utilization: 0.75
        deployment_config:
          autoscaling_config:
            min_replicas: 1
            max_replicas: 4
            target_ongoing_requests: 64
          max_ongoing_requests: 128
```

In particular, this configuration loads the model from `Qwen/Qwen2.5-7B-Instruct` and sets its `model_id` to `qwen2.5-7b-instruct`. The `LLMDeployment` initializes the underlying LLM engine using the `engine_kwargs` field. The `deployment_config` section sets the desired number of engine replicas. By default, each replica requires one GPU.

See [Serving LLMs](serving-llms) and the [Ray Serve config documentation](serve-in-production-config-file) for more information.

Wait for the RayService resource to become healthy. You can confirm its status by running the following command:

```sh
kubectl get rayservice ray-serve-llm -o yaml
```

After a few minutes, the result should be similar to the following:

```
status:
  activeServiceStatus:
    applicationStatuses:
      llms:
        serveDeploymentStatuses:
          LLMDeployment:qwen2_5-7b-instruct:
            status: HEALTHY
          LLMRouter:
            status: HEALTHY
        status: RUNNING
```

## Step 5: Send a request

To send requests to the Ray Serve deployment, port-forward port 8000 from the Serve app service:

```sh
kubectl port-forward svc/ray-serve-llm-serve-svc 8000
```

Note that this Kubernetes service comes up only after Ray Serve apps are running and ready.

Test the service with the following command:

```sh
curl --location 'http://localhost:8000/v1/chat/completions' --header 'Content-Type: application/json' --data '{
    "model": "qwen2.5-7b-instruct",
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": "Provide steps to serve an LLM using Ray Serve."
        }
    ]
}'
```

The output should be in the following format:

```
{
  "id": "qwen2.5-7b-instruct-550d3fd491890a7e7bca74e544d3479e",
  "object": "chat.completion",
  "created": 1746595284,
  "model": "qwen2.5-7b-instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "reasoning_content": null,
        "content": "Sure! Ray Serve is a library built on top of Ray...",
        "tool_calls": []
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 30,
    "total_tokens": 818,
    "completion_tokens": 788,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null
}
```

## Step 6: View the Ray dashboard

```sh
kubectl port-forward svc/ray-serve-llm-head-svc 8265
```

Once forwarded, navigate to the Serve tab on the dashboard to review application status, deployments, routers, logs, and other relevant features.

![LLM Serve Application](../images/ray_dashboard_llm_application.png)

---

(kuberay-stable-diffusion-rayservice-example)=

# Serve a Stable Diffusion text-to-image model on Kubernetes

> **Note:** The Python files for the Ray Serve application and its client are in the [ray-project/serve_config_examples](https://github.com/ray-project/serve_config_examples) repository and [the Ray documentation](https://docs.ray.io/en/latest/serve/tutorials/stable-diffusion.html).
## Step 1: Create a Kubernetes cluster with GPUs

See [aws-eks-gpu-cluster.md](kuberay-eks-gpu-cluster-setup), [gcp-gke-gpu-cluster.md](kuberay-gke-gpu-cluster-setup), or [ack-gpu-cluster.md](kuberay-ack-gpu-cluster-setup) to create a Kubernetes cluster with 1 CPU node and 1 GPU node.

## Step 2: Install KubeRay operator

Follow [this document](kuberay-operator-deploy) to install the latest stable KubeRay operator using the Helm repository. Note that the YAML file in this example uses `serveConfigV2`. This feature requires KubeRay v0.6.0 or later.

## Step 3: Install a RayService

```sh
kubectl apply -f https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/ray-service.stable-diffusion.yaml
```

This RayService configuration contains some important settings:

* In the RayService, the head Pod doesn't have any `tolerations`. Meanwhile, the worker Pods use the following `tolerations`, so the scheduler won't assign the head Pod to the tainted GPU node.

```yaml
# Please add the following taints to the GPU node.
tolerations:
  - key: "ray.io/node-type"
    operator: "Equal"
    value: "worker"
    effect: "NoSchedule"
```

* It includes `diffusers` in `runtime_env` since this package isn't included by default in the `ray-ml` image.

## Step 4: Forward the port of Serve

First, get the service name with the following command.

```sh
kubectl get services
```

Then, port-forward to the Serve service.

```sh
# Wait until the RayService `Ready` condition is `True`. This means the RayService is ready to serve.
kubectl describe rayservices.ray.io stable-diffusion

# [Example output]
# Conditions:
#   Last Transition Time:  2025-02-13T07:10:34Z
#   Message:               Number of serve endpoints is greater than 0
#   Observed Generation:   1
#   Reason:                NonZeroServeEndpoints
#   Status:                True
#   Type:                  Ready

# Forward the port of Serve
kubectl port-forward svc/stable-diffusion-serve-svc 8000
```

## Step 5: Send a request to the text-to-image model

```sh
# Step 5.1: Download `stable_diffusion_req.py`
curl -LO https://raw.githubusercontent.com/ray-project/serve_config_examples/master/stable_diffusion/stable_diffusion_req.py

# Step 5.2: Set your `prompt` in `stable_diffusion_req.py`.

# Step 5.3: Send a request to the Stable Diffusion model.
python stable_diffusion_req.py
# Check output.png
```

* You can refer to the document ["Serving a Stable Diffusion Model"](https://docs.ray.io/en/latest/serve/tutorials/stable-diffusion.html) for an example output image.

---

(kuberay-text-summarizer-rayservice-example)=

# Serve a text summarizer on Kubernetes

> **Note:** The Python files for the Ray Serve application and its client are in the [ray-project/serve_config_examples](https://github.com/ray-project/serve_config_examples) repository.

## Step 1: Create a Kubernetes cluster with GPUs

See [aws-eks-gpu-cluster.md](kuberay-eks-gpu-cluster-setup), [gcp-gke-gpu-cluster.md](kuberay-gke-gpu-cluster-setup), or [ack-gpu-cluster.md](kuberay-ack-gpu-cluster-setup) to create a Kubernetes cluster with 1 CPU node and 1 GPU node.

## Step 2: Install KubeRay operator

Follow [this document](kuberay-operator-deploy) to install the latest stable KubeRay operator using the Helm repository.

## Step 3: Install a RayService

```sh
# Create a RayService
kubectl apply -f https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/ray-service.text-summarizer.yaml
```

* In the RayService, the head Pod doesn't have any `tolerations`. Meanwhile, the worker Pods use the following `tolerations`, so the scheduler won't assign the head Pod to the tainted GPU node.
```yaml # Please add the following taints to the GPU node. tolerations: - key: "ray.io/node-type" operator: "Equal" value: "worker" effect: "NoSchedule" ``` ## Step 4: Forward the port of Serve ```sh # Step 4.1: Wait until the RayService is ready to serve requests. kubectl describe rayservices text-summarizer # Step 4.2: Get the service name. kubectl get services # [Example output] # text-summarizer-head-svc ClusterIP None 10001/TCP,8265/TCP,6379/TCP,8080/TCP,8000/TCP 31s # text-summarizer-raycluster-tb9zf-head-svc ClusterIP None 10001/TCP,8265/TCP,6379/TCP,8080/TCP,8000/TCP 108s # text-summarizer-serve-svc ClusterIP 34.118.226.139 8000/TCP 31s # Step 4.3: Forward the port of Serve. kubectl port-forward svc/text-summarizer-serve-svc 8000 ``` ## Step 5: Send a request to the text summarizer model ```sh # Step 5.1: Download `text_summarizer_req.py` curl -LO https://raw.githubusercontent.com/ray-project/serve_config_examples/master/text_summarizer/text_summarizer_req.py # Step 5.2: Send a request to the Summarizer model. python text_summarizer_req.py # Check printed to console ``` ## Step 6: Delete your service ```sh kubectl delete -f https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/ray-service.text-summarizer.yaml ``` ## Step 7: Uninstall your KubeRay operator Follow [this document](https://github.com/ray-project/kuberay/tree/master/helm-chart/kuberay-operator) to uninstall the latest stable KubeRay operator using the Helm repository. --- (kuberay-tpu-stable-diffusion-example)= # Serve a Stable Diffusion model on GKE with TPUs > **Note:** The Python files for the Ray Serve app and its client are in the [ray-project/serve_config_examples](https://github.com/ray-project/serve_config_examples). This guide adapts the [tensorflow/tpu](https://github.com/tensorflow/tpu/tree/master/tools/ray_tpu/src/serve) example. ## Step 1: Create a Kubernetes cluster with TPUs Follow [Creating a GKE Cluster with TPUs for KubeRay](kuberay-gke-tpu-cluster-setup) to create a GKE cluster with 1 CPU node and 1 TPU node. ## Step 2: Install the KubeRay operator Skip this step if the [Ray Operator Addon](https://cloud.google.com/kubernetes-engine/docs/add-on/ray-on-gke/concepts/overview) is enabled in your GKE cluster. Follow [Deploy a KubeRay operator](kuberay-operator-deploy) instructions to install the latest stable KubeRay operator from the Helm repository. Multi-host TPU support is available in KubeRay v1.1.0+. Note that the YAML file in this example uses `serveConfigV2`, which KubeRay supports starting from v0.6.0. ## Step 3: Install the RayService CR ```sh # Creates a RayCluster with a single-host v4 TPU worker group of 2x2x1 topology. kubectl apply -f https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/ray-service.tpu-single-host.yaml ``` KubeRay operator v1.1.0 adds a new `NumOfHosts` field to the RayCluster CR, supporting multi-host worker groups. This field specifies the number of workers to create per replica, with each replica representing a multi-host Pod slice. The value for `NumOfHosts` should match the number of TPU VM hosts that the given `cloud.google.com/gke-tpu-topology` node selector expects. For this example, the Stable Diffusion model is small enough to run on a single TPU host, so `numOfHosts` is set to 1 in the RayService manifest. 
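To see how `numOfHosts` scales beyond a single host, the following is an illustrative sketch of a multi-host TPU worker group only, not an excerpt from the manifest above; the accelerator and topology values are examples, so replace them with the values your node pool actually uses:

```yaml
workerGroupSpecs:
  - groupName: tpu-group
    replicas: 1        # Each replica represents one multi-host Pod slice.
    numOfHosts: 2      # One worker Pod per TPU VM host in the slice.
    rayStartParams: {}
    template:
      spec:
        nodeSelector:
          cloud.google.com/gke-tpu-accelerator: tpu-v4-podslice  # Example accelerator label.
          cloud.google.com/gke-tpu-topology: 2x2x2               # Example topology that maps to 2 TPU VM hosts.
```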
## Step 4: View the Serve deployment in the Ray Dashboard Verify that you deployed the RayService CR and it's running: ```sh kubectl get rayservice # NAME SERVICE STATUS NUM SERVE ENDPOINTS # stable-diffusion-tpu-serve-svc Running 2 ``` Port-forward the Ray Dashboard from the Ray head service. To view the dashboard, open http://localhost:8265/ on your local machine. ```sh kubectl port-forward svc/stable-diffusion-tpu-head-svc 8265:8265 & ``` Monitor the status of the RayService CR in the Ray Dashboard from the 'Serve' tab. The installed RayService CR should create a running app with the name 'stable_diffusion'. The app should have two deployments, the API ingress, which receives input prompts, and the Stable Diffusion model server. ![serve_dashboard](../images/serve_dashboard.png) ## Step 5: Send text-to-image prompts to the model server Port forward the Ray Serve service: ```sh kubectl port-forward svc/stable-diffusion-tpu-serve-svc 8000 ``` In a separate terminal, download the Python prompt script: ```sh curl -LO https://raw.githubusercontent.com/ray-project/serve_config_examples/master/stable_diffusion/stable_diffusion_tpu_req.py ``` Install the required dependencies to run the Python script locally: ```sh # Create a Python virtual environment. python3 -m venv myenv source myenv/bin/activate pip install numpy pillow requests tqdm ``` Submit a text-to-image prompt to the Stable Diffusion model server: ```sh python stable_diffusion_tpu_req.py --save_pictures ``` * The Python prompt script saves the results of the Stable Diffusion inference to a file named diffusion_results.png. ![diffusion_results](../images/diffusion_results.png) --- (kuberay-verl)= # Reinforcement Learning with Human Feedback (RLHF) for LLMs with verl on KubeRay [verl](https://github.com/volcengine/verl) is an open-source framework that provides a flexible, efficient, and production-ready RL training library for large language models (LLMs). This guide demonstrates Proximal Policy Optimization (PPO) training on the GSM8K dataset with verl for `Qwen2.5-0.5B-Instruct` on KubeRay. * To make it easier to follow, this guide launches a single-node RayCluster with 4 GPUs. You can easily use KubeRay to launch a multi-node RayCluster to train larger models. * You can also use the [RayJob CRD](kuberay-rayjob-quickstart) for production use cases. # Step 1: Create a Kubernetes cluster with GPUs Follow the instructions in [Managed Kubernetes services](kuberay-k8s-setup) to create a Kubernetes cluster with GPUs. This guide uses a Kubernetes cluster with 4 L4 GPUs. For GKE, you can follow the instructions in [this tutorial](kuberay-gke-gpu-cluster-setup) and use the following command to create a GPU node pool with 4 L4 GPUs per Kubernetes node: ```bash gcloud container node-pools create gpu-node-pool \ --accelerator type=nvidia-l4-vws,count=4 \ --zone us-west1-b \ --cluster kuberay-gpu-cluster \ --num-nodes 1 \ --min-nodes 0 \ --max-nodes 1 \ --enable-autoscaling \ --machine-type g2-standard-48 ``` # Step 2: Install KubeRay operator Follow the instructions in [KubeRay operator](kuberay-operator-deploy) to install the KubeRay operator. # Step 3: Create a RayCluster ```sh kubectl apply -f https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/ray-cluster.verl.yaml ``` # Step 4: Install verl in the head Pod Log in to the head Pod and install verl. The verl community doesn't provide images with verl installed ([verl#2222](https://github.com/volcengine/verl/issues/2222)) at the moment. 
```sh # Log in to the head Pod. export HEAD_POD=$(kubectl get pods --selector=ray.io/node-type=head -o custom-columns=POD:metadata.name --no-headers) kubectl exec -it $HEAD_POD -- bash # Follow the instructions in https://verl.readthedocs.io/en/latest/start/install.html#install-from-docker-image to install verl. git clone https://github.com/volcengine/verl && cd verl pip3 install -e .[vllm] ``` # Step 5: Prepare the dataset and download `Qwen2.5-0.5B-Instruct` model Run the following commands in the head Pod's verl root directory to prepare the dataset and download the `Qwen2.5-0.5B-Instruct` model. ```sh # Prepare the dataset. python3 examples/data_preprocess/gsm8k.py --local_dir ~/data/gsm8k # Download the `Qwen2.5-0.5B-Instruct` model. python3 -c "import transformers; transformers.pipeline('text-generation', model='Qwen/Qwen2.5-0.5B-Instruct')" ``` # Step 6: Run a PPO training job Run the following command to start a PPO training job. This differs slightly from the instructions in [verl's documentation](https://verl.readthedocs.io/en/latest/start/quickstart.html#step-3-perform-ppo-training-with-the-instruct-model). The main differences are the following: * Set `n_gpus_per_node` to `4` because the head Pod has 4 GPUs. * Set `save_freq` to `-1` to avoid disk pressure caused by checkpointing. ```sh PYTHONUNBUFFERED=1 python3 -m verl.trainer.main_ppo \ data.train_files=$HOME/data/gsm8k/train.parquet \ data.val_files=$HOME/data/gsm8k/test.parquet \ data.train_batch_size=256 \ data.max_prompt_length=512 \ data.max_response_length=256 \ actor_rollout_ref.model.path=Qwen/Qwen2.5-0.5B-Instruct \ actor_rollout_ref.actor.optim.lr=1e-6 \ actor_rollout_ref.actor.ppo_mini_batch_size=64 \ actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \ actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \ actor_rollout_ref.rollout.tensor_model_parallel_size=1 \ actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \ actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \ critic.optim.lr=1e-5 \ critic.model.path=Qwen/Qwen2.5-0.5B-Instruct \ critic.ppo_micro_batch_size_per_gpu=4 \ algorithm.kl_ctrl.kl_coef=0.001 \ trainer.logger=['console'] \ trainer.val_before_train=False \ trainer.default_hdfs_dir=null \ trainer.n_gpus_per_node=4 \ trainer.nnodes=1 \ trainer.save_freq=-1 \ trainer.test_freq=10 \ trainer.total_epochs=15 2>&1 | tee verl_demo.log ``` This job takes 5 hours to complete. While it's running, you can check the Ray dashboard to see more details about the PPO job and the Ray cluster. Additionally, you can follow the next step to check the PPO job logs to see how the model improves. ```sh # Port forward the Ray dashboard to your local machine's port 8265. kubectl port-forward $HEAD_POD 8265:8265 ``` Open `127.0.0.1:8265` in your browser to view the Ray dashboard and check whether all GPUs are in use. ![Ray dashboard](../images/verl-ray-dashboard.png) # Step 7: Check the PPO job logs Check `verl_demo.log` in the head Pod to see the PPO job's logs. For every 10 steps, verl validates the model with a simple math problem. * Math problem: ``` Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market? 
Let's think step by step and output the final answer after
```

* Answer: `(16 - 3 - 4) * 2 = 18`

You should see the model gradually get better at this question after several steps. In this example run, the model first got the correct answer after 130 steps; the following excerpt shows the log. Throughout the entire process, the validation ran 44 times and got the correct answer 20 times. Results may vary depending on the random seed.

```
(TaskRunner pid=21297) [response] First, we calculate the number of eggs Janet's ducks lay in a day. Since there are 16 eggs per day and Janet lays these eggs every day, the number of eggs laid in a day is 16.
(TaskRunner pid=21297)
(TaskRunner pid=21297) Next, we calculate the number of eggs Janet eats in a day. She eats 3 eggs for breakfast and bakes 4 muffins, so the total number of eggs she eats in a day is 3 + 4 = 7.
(TaskRunner pid=21297)
(TaskRunner pid=21297) The number of eggs she sells in a day is the total number of eggs laid minus the number of eggs she eats, which is 16 - 7 = 9 eggs.
(TaskRunner pid=21297)
(TaskRunner pid=21297) She sells each egg for $2, so the total amount she makes every day is 9 * 2 = 18 dollars.
(TaskRunner pid=21297)
(TaskRunner pid=21297) #### 18
(TaskRunner pid=21297) #### 18 dollars
```

It's not necessary to wait for all steps to complete. You can stop the job once you observe the model improving.

# Step 8: Clean up

```sh
kubectl delete -f https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/ray-cluster.verl.yaml
```

---

(kuberay-examples)=

# Examples

```{toctree}
:hidden:

examples/mnist-training-example
examples/stable-diffusion-rayservice
examples/tpu-serve-stable-diffusion
examples/mobilenet-rayservice
examples/text-summarizer-rayservice
examples/rayjob-batch-inference-example
examples/rayjob-kueue-priority-scheduling
examples/rayjob-kueue-gang-scheduling
examples/distributed-checkpointing-with-gcsfuse
examples/modin-example
examples/rayserve-llm-example
examples/rayserve-deepseek-example
examples/verl-post-training
examples/argocd
```

This section presents example Ray workloads to try out on your Kubernetes cluster.

- {ref}`kuberay-mnist-training-example` (CPU-only)
- {ref}`kuberay-mobilenet-rayservice-example` (CPU-only)
- {ref}`kuberay-stable-diffusion-rayservice-example`
- {ref}`kuberay-tpu-stable-diffusion-example`
- {ref}`kuberay-text-summarizer-rayservice-example`
- {ref}`kuberay-batch-inference-example`
- {ref}`kuberay-kueue-priority-scheduling-example`
- {ref}`kuberay-kueue-gang-scheduling-example`
- {ref}`kuberay-distributed-checkpointing-gcsfuse`
- {ref}`kuberay-modin-example`
- {ref}`kuberay-rayservice-llm-example`
- {ref}`kuberay-rayservice-deepseek-example`
- {ref}`kuberay-verl`

---

(kuberay-operator-deploy)=

# KubeRay Operator Installation

## Step 1: Create a Kubernetes cluster

This step creates a local Kubernetes cluster using [Kind](https://kind.sigs.k8s.io/). If you already have a Kubernetes cluster, you can skip this step.

```sh
kind create cluster --image=kindest/node:v1.26.0
```

## Step 2: Install KubeRay operator

### Method 1: Helm (Recommended)

```sh
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update

# Install both CRDs and KubeRay operator v1.5.1.
helm install kuberay-operator kuberay/kuberay-operator --version 1.5.1
```

### Method 2: Kustomize

```sh
# Install CRD and KubeRay operator.
kubectl create -k "github.com/ray-project/kuberay/ray-operator/config/default?ref=v1.5.1"
```

## Step 3: Validate Installation

Confirm that the operator is running in the namespace `default`.

```sh
kubectl get pods
```

```text
NAME                                READY   STATUS    RESTARTS   AGE
kuberay-operator-6bc45dd644-gwtqv   1/1     Running   0          24s
```

---

(kuberay-raycluster-quickstart)=

# RayCluster Quickstart

This guide shows you how to manage and interact with Ray clusters on Kubernetes.

## Preparation

* If needed, install [kubectl](https://kubernetes.io/docs/tasks/tools/#kubectl) (>= 1.23), [Helm](https://helm.sh/docs/intro/install/) (>= v3.4), [Kind](https://kind.sigs.k8s.io/docs/user/quick-start/#installation), and [Docker](https://docs.docker.com/engine/install/).
* Make sure your Kubernetes cluster has at least 4 CPUs and 4 GB of RAM.

## Step 1: Create a Kubernetes cluster

This step creates a local Kubernetes cluster using [Kind](https://kind.sigs.k8s.io/). If you already have a Kubernetes cluster, you can skip this step.

```sh
kind create cluster --image=kindest/node:v1.26.0
```

## Step 2: Deploy a KubeRay operator

Follow [this document](kuberay-operator-deploy) to install the latest stable KubeRay operator from the Helm repository.

(raycluster-deploy)=

## Step 3: Deploy a RayCluster custom resource

Once the KubeRay operator is running, you're ready to deploy a RayCluster. Create a RayCluster Custom Resource (CR) in the `default` namespace.

```sh
# Deploy a sample RayCluster CR from the KubeRay Helm chart repo:
helm install raycluster kuberay/ray-cluster --version 1.5.1
```

Once the RayCluster CR has been created, you can view it by running:

```sh
kubectl get rayclusters
```

```sh
NAME                 DESIRED WORKERS   AVAILABLE WORKERS   CPUS   MEMORY   GPUS   STATUS   AGE
raycluster-kuberay   1                 1                   2      3G       0      ready    55s
```

The KubeRay operator detects the RayCluster object and starts your Ray cluster by creating head and worker pods. To view the Ray cluster's pods, run the following command:

```sh
# View the pods in the RayCluster named "raycluster-kuberay"
kubectl get pods --selector=ray.io/cluster=raycluster-kuberay
```

```sh
NAME                                          READY   STATUS    RESTARTS   AGE
raycluster-kuberay-head                       1/1     Running   0          XXs
raycluster-kuberay-worker-workergroup-xvfkr   1/1     Running   0          XXs
```

Wait for the pods to reach the `Running` state. This may take a few minutes; downloading the Ray images takes most of this time. If your pods are stuck in the `Pending` state, check for errors with `kubectl describe pod raycluster-kuberay-xxxx-xxxxx` and ensure that your Docker resource limits meet the requirements.

## Step 4: Run an application on a RayCluster

Now, interact with the deployed RayCluster.

### Method 1: Execute a Ray job in the head Pod

The most straightforward way to experiment with your RayCluster is to exec directly into the head pod. First, identify your RayCluster's head pod:

```sh
export HEAD_POD=$(kubectl get pods --selector=ray.io/node-type=head -o custom-columns=POD:metadata.name --no-headers)
echo $HEAD_POD
```

```sh
raycluster-kuberay-head
```

```sh
# Print the cluster resources.
kubectl exec -it $HEAD_POD -- python -c "import ray; ray.init(); print(ray.cluster_resources())"
```

```sh
2023-04-07 10:57:46,472 INFO worker.py:1243 -- Using address 127.0.0.1:6379 set in the environment variable RAY_ADDRESS
2023-04-07 10:57:46,472 INFO worker.py:1364 -- Connecting to existing Ray cluster at address: 10.244.0.6:6379...
2023-04-07 10:57:46,482 INFO worker.py:1550 -- Connected to Ray cluster.
View the dashboard at http://10.244.0.6:8265 {'CPU': 2.0, 'memory': 3000000000.0, 'node:10.244.0.6': 1.0, 'node:10.244.0.7': 1.0, 'node:__internal_head__': 1.0, 'object_store_memory': 749467238.0} ``` ### Method 2: Submit a Ray job to the RayCluster using [ray job submission SDK](jobs-quickstart) Unlike Method 1, this method doesn't require you to execute commands in the Ray head pod. Instead, you can use the [Ray job submission SDK](jobs-quickstart) to submit Ray jobs to the RayCluster through the Ray Dashboard port where Ray listens for Job requests. The KubeRay operator configures a [Kubernetes service](https://kubernetes.io/docs/concepts/services-networking/service/) targeting the Ray head Pod. ```sh kubectl get service raycluster-kuberay-head-svc ``` ```sh NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE raycluster-kuberay-head-svc ClusterIP None 10001/TCP,8265/TCP,6379/TCP,8080/TCP,8000/TCP 57s ``` Now that the service name is available, use port-forwarding to access the Ray Dashboard port which is 8265 by default. ```sh # Execute this in a separate shell. kubectl port-forward service/raycluster-kuberay-head-svc 8265:8265 > /dev/null & ``` Now that the Dashboard port is accessible, submit jobs to the RayCluster: ```sh # The following job's logs will show the Ray cluster's total resource capacity, including 2 CPUs. ray job submit --address http://localhost:8265 -- python -c "import ray; ray.init(); print(ray.cluster_resources())" ``` ```sh Job submission server address: http://localhost:8265 ------------------------------------------------------- Job 'raysubmit_8vJ7dKqYrWKbd17i' submitted successfully ------------------------------------------------------- Next steps Query the logs of the job: ray job logs raysubmit_8vJ7dKqYrWKbd17i Query the status of the job: ray job status raysubmit_8vJ7dKqYrWKbd17i Request the job to be stopped: ray job stop raysubmit_8vJ7dKqYrWKbd17i Tailing logs until the job exits (disable with --no-wait): 2025-03-18 01:27:51,014 INFO job_manager.py:530 -- Runtime env is setting up. 2025-03-18 01:27:51,744 INFO worker.py:1514 -- Using address 10.244.0.6:6379 set in the environment variable RAY_ADDRESS 2025-03-18 01:27:51,744 INFO worker.py:1654 -- Connecting to existing Ray cluster at address: 10.244.0.6:6379... 2025-03-18 01:27:51,750 INFO worker.py:1832 -- Connected to Ray cluster. View the dashboard at 10.244.0.6:8265 {'CPU': 2.0, 'memory': 3000000000.0, 'node:10.244.0.6': 1.0, 'node:10.244.0.7': 1.0, 'node:__internal_head__': 1.0, 'object_store_memory': 749467238.0} ------------------------------------------ Job 'raysubmit_8vJ7dKqYrWKbd17i' succeeded ------------------------------------------ ``` ## Step 5: Access the Ray Dashboard Visit `${YOUR_IP}:8265` in your browser for the Dashboard. For example, `127.0.0.1:8265`. See the job you submitted in Step 4 in the **Recent jobs** pane as shown below. ![Ray Dashboard](../images/ray-dashboard.png) ## Step 6: Cleanup ```sh # Kill the `kubectl port-forward` background job in the earlier step killall kubectl kind delete cluster ``` --- (kuberay-rayjob-quickstart)= # RayJob Quickstart ## Prerequisites * KubeRay v0.6.0 or higher * KubeRay v0.6.0 or v1.0.0: Ray 1.10 or higher. * KubeRay v1.1.1 or newer is highly recommended: Ray 2.8.0 or higher. ## What's a RayJob? A RayJob manages two aspects: * **RayCluster**: A RayCluster custom resource manages all Pods in a Ray cluster, including a head Pod and multiple worker Pods. * **Job**: A Kubernetes Job runs `ray job submit` to submit a Ray job to the RayCluster. 
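These two pieces map directly onto the RayJob manifest. The following is a minimal illustrative sketch rather than a complete configuration; the entrypoint, image, and resource values are placeholders (see the [ray-job.sample.yaml](https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-job.sample.yaml) sample for a full example):

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-sample
spec:
  # The Ray job: the submitter Kubernetes Job runs `ray job submit -- <entrypoint>`.
  entrypoint: python /home/ray/samples/sample_code.py   # Placeholder entrypoint.
  # The RayCluster: KubeRay creates this cluster and submits the Ray job once it's ready.
  rayClusterSpec:
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.9.0   # Placeholder image tag.
              resources:
                requests:
                  cpu: "1"
                  memory: 2Gi
```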
## What does the RayJob provide? With RayJob, KubeRay automatically creates a RayCluster and submits a job when the cluster is ready. You can also configure RayJob to automatically delete the RayCluster once the Ray job finishes. To understand the following content better, you should understand the difference between: * RayJob: A Kubernetes custom resource definition provided by KubeRay. * Ray job: A Ray job is a packaged Ray application that can run on a remote Ray cluster. See [this document](jobs-overview) for more details. * Submitter: The submitter is a Kubernetes Job that runs `ray job submit` to submit a Ray job to the RayCluster. ## RayJob Configuration * RayCluster configuration * `rayClusterSpec` - Defines the **RayCluster** custom resource to run the Ray job on. * `clusterSelector` - Use existing **RayCluster** custom resources to run the Ray job instead of creating a new one. See [ray-job.use-existing-raycluster.yaml](https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-job.use-existing-raycluster.yaml) for example configurations. * Ray job configuration * `entrypoint` - The submitter runs `ray job submit --address ... --submission-id ... -- $entrypoint` to submit a Ray job to the RayCluster. * `runtimeEnvYAML` (Optional): A runtime environment that describes the dependencies the Ray job needs to run, including files, packages, environment variables, and more. Provide the configuration as a multi-line YAML string. Example: ```yaml spec: runtimeEnvYAML: | pip: - requests==2.26.0 - pendulum==2.1.2 env_vars: KEY: "VALUE" ``` See {ref}`Runtime Environments ` for more details. _(New in KubeRay version 1.0.0)_ * `jobId` (Optional): Defines the submission ID for the Ray job. If not provided, KubeRay generates one automatically. See {ref}`Ray Jobs CLI API Reference ` for more details about the submission ID. * `metadata` (Optional): See {ref}`Ray Jobs CLI API Reference ` for more details about the `--metadata-json` option. * `entrypointNumCpus` / `entrypointNumGpus` / `entrypointResources` (Optional): See {ref}`Ray Jobs CLI API Reference ` for more details. * `backoffLimit` (Optional, added in version 1.2.0): Specifies the number of retries before marking this RayJob failed. Each retry creates a new RayCluster. The default value is 0. * Submission configuration * `submissionMode` (Optional): Specifies how RayJob submits the Ray job to the RayCluster. There are three possible values, with the default being `K8sJobMode`. * `K8sJobMode`: The KubeRay operator creates a submitter Kubernetes Job to submit the Ray job. * `HTTPMode`: The KubeRay operator sends a request to the RayCluster to create a Ray job. * `InteractiveMode`: The KubeRay operator waits for the user to submit a job to the RayCluster. This mode is currently in alpha and the [KubeRay kubectl plugin](kubectl-plugin) relies on it. * `SidecarMode`: The KubeRay operator injects a container into the Ray head Pod to submit the Ray job. This mode does not support `clusterSelector`, `submitterPodTemplate`, and `submitterConfig`, and requires the head Pod's restart policy to be `Never`. * `submitterPodTemplate` (Optional): Defines the Pod template for the submitter Kubernetes Job. This field is only effective when `submissionMode` is "K8sJobMode". * `RAY_DASHBOARD_ADDRESS` - The KubeRay operator injects this environment variable to the submitter Pod. The value is `$HEAD_SERVICE:$DASHBOARD_PORT`. * `RAY_JOB_SUBMISSION_ID` - The KubeRay operator injects this environment variable to the submitter Pod. 
The value is the `RayJob.Status.JobId` of the RayJob. * Example: `ray job submit --address=http://$RAY_DASHBOARD_ADDRESS --submission-id=$RAY_JOB_SUBMISSION_ID ...` * See [ray-job.sample.yaml](https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-job.sample.yaml) for more details. * `submitterConfig` (Optional): Additional configurations for the submitter Kubernetes Job. * `backoffLimit` (Optional, added in version 1.2.0): The number of retries before marking the submitter Job as failed. The default value is 2. * Automatic resource cleanup * `shutdownAfterJobFinishes` (Optional): Determines whether to recycle the RayCluster after the Ray job finishes. The default value is false. * `ttlSecondsAfterFinished` (Optional): Only works if `shutdownAfterJobFinishes` is true. The KubeRay operator deletes the RayCluster and the submitter `ttlSecondsAfterFinished` seconds after the Ray job finishes. The default value is 0. * `activeDeadlineSeconds` (Optional): If the RayJob doesn't transition the `JobDeploymentStatus` to `Complete` or `Failed` within `activeDeadlineSeconds`, the KubeRay operator transitions the `JobDeploymentStatus` to `Failed`, citing `DeadlineExceeded` as the reason. * `DELETE_RAYJOB_CR_AFTER_JOB_FINISHES` (Optional, added in version 1.2.0): Set this environment variable for the KubeRay operator, not the RayJob resource. If you set this environment variable to true, the RayJob custom resource itself is deleted if you also set `shutdownAfterJobFinishes` to true. Note that KubeRay deletes all resources created by the RayJob, including the Kubernetes Job. * Others * `suspend` (Optional): If `suspend` is true, KubeRay deletes both the RayCluster and the submitter. Note that Kueue also implements scheduling strategies by mutating this field. Avoid manually updating this field if you use Kueue to schedule RayJob. * `deletionStrategy` (Optional, alpha in v1.5.1): Configures automated cleanup after the RayJob reaches a terminal state. This field requires the `RayJobDeletionPolicy` feature gate to be enabled. Two mutually exclusive styles are supported: * **Rules-based** (Recommended): Define `deletionRules` as a list of deletion actions triggered by specific conditions. Each rule specifies: * `policy`: The deletion action to perform — `DeleteCluster` (delete the entire RayCluster and its Pods), `DeleteWorkers` (delete only worker Pods), `DeleteSelf` (delete the RayJob and all associated resources), or `DeleteNone` (no deletion). * `condition`: When to trigger the deletion, based on `jobStatus` (`SUCCEEDED` or `FAILED`) and an optional `ttlSeconds` delay. * This approach enables flexible, multi-stage cleanup strategies (e.g., delete workers immediately on success, then delete the cluster after 300 seconds). * Rules-based mode is incompatible with `shutdownAfterJobFinishes` and the global `ttlSecondsAfterFinished`. Use per-rule `condition.ttlSeconds` instead. * See [ray-job.deletion-rules.yaml](https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-job.deletion-rules.yaml) for example configurations. * **Legacy** (Deprecated): Define both `onSuccess` and `onFailure` policies. This approach is deprecated and will be removed in v1.6.0. Migration to `deletionRules` is strongly encouraged. * Legacy mode can be combined with `shutdownAfterJobFinishes` and the global `ttlSecondsAfterFinished`. * For detailed API specifications, see the [KubeRay API Reference](https://ray-project.github.io/kuberay/reference/api/). 
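To see how these fields fit together, here's a trimmed sketch of a RayJob manifest. It isn't the full `ray-job.sample.yaml`; the entrypoint, image tag, and resource values are illustrative:

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-sketch
spec:
  # The submitter runs: ray job submit --address ... -- $entrypoint
  entrypoint: "python -c 'import ray; ray.init(); print(ray.cluster_resources())'"
  runtimeEnvYAML: |
    pip:
      - requests==2.26.0
    env_vars:
      KEY: "VALUE"
  shutdownAfterJobFinishes: true   # delete the RayCluster after the Ray job finishes
  ttlSecondsAfterFinished: 10
  rayClusterSpec:                  # the RayCluster to run the Ray job on
    rayVersion: "2.46.0"
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.46.0
              resources:
                requests:
                  cpu: "1"
                  memory: "2Gi"
    workerGroupSpecs:
      - groupName: small-group
        replicas: 1
        minReplicas: 1
        maxReplicas: 1
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray:2.46.0
                resources:
                  requests:
                    cpu: "1"
                    memory: "1Gi"
```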
## Example: Run a simple Ray job with RayJob ## Step 1: Create a Kubernetes cluster with Kind ```sh kind create cluster --image=kindest/node:v1.26.0 ``` ## Step 2: Install the KubeRay operator Follow the [KubeRay Operator Installation](kuberay-operator-deploy) to install the latest stable KubeRay operator by Helm repository. ## Step 3: Install a RayJob ```sh kubectl apply -f https://raw.githubusercontent.com/ray-project/kuberay/v1.5.1/ray-operator/config/samples/ray-job.sample.yaml ``` ## Step 4: Verify the Kubernetes cluster status ```shell # Step 4.1: List all RayJob custom resources in the `default` namespace. kubectl get rayjob # [Example output] # NAME JOB STATUS DEPLOYMENT STATUS RAY CLUSTER NAME START TIME END TIME AGE # rayjob-sample SUCCEEDED Complete rayjob-sample-qnftt 2025-06-25T16:21:21Z 2025-06-25T16:22:35Z 6m53s # Step 4.2: List all RayCluster custom resources in the `default` namespace. kubectl get raycluster # [Example output] # NAME DESIRED WORKERS AVAILABLE WORKERS CPUS MEMORY GPUS STATUS AGE # rayjob-sample-qnftt 1 1 400m 0 0 ready 7m48s # Step 4.3: List all Pods in the `default` namespace. # The Pod created by the Kubernetes Job will be terminated after the Kubernetes Job finishes. kubectl get pods # [Example output] # kuberay-operator-755f666c4b-wbcm4 1/1 Running 0 8m32s # rayjob-sample-n2vj5 0/1 Completed 0 7m18ss => Pod created by a Kubernetes Job # rayjob-sample-qnftt-head 1/1 Running 0 8m14s # rayjob-sample-qnftt-small-group-worker-4f5wz 1/1 Running 0 8m14s # Step 4.4: Check the status of the RayJob. # The field `jobStatus` in the RayJob custom resource will be updated to `SUCCEEDED` and `jobDeploymentStatus` # should be `Complete` once the job finishes. kubectl get rayjobs.ray.io rayjob-sample -o jsonpath='{.status.jobStatus}' # [Expected output]: "SUCCEEDED" kubectl get rayjobs.ray.io rayjob-sample -o jsonpath='{.status.jobDeploymentStatus}' # [Expected output]: "Complete" ``` The KubeRay operator creates a RayCluster custom resource based on the `rayClusterSpec` and a submitter Kubernetes Job to submit a Ray job to the RayCluster. In this example, the `entrypoint` is `python /home/ray/samples/sample_code.py`, and `sample_code.py` is a Python script stored in a Kubernetes ConfigMap mounted to the head Pod of the RayCluster. Because the default value of `shutdownAfterJobFinishes` is false, the KubeRay operator doesn't delete the RayCluster or the submitter when the Ray job finishes. ## Step 5: Check the output of the Ray job ```sh kubectl logs -l=job-name=rayjob-sample # [Example output] # 2025-06-25 09:22:27,963 INFO worker.py:1654 -- Connecting to existing Ray cluster at address: 10.244.0.6:6379... # 2025-06-25 09:22:27,977 INFO worker.py:1832 -- Connected to Ray cluster. View the dashboard at 10.244.0.6:8265 # test_counter got 1 # test_counter got 2 # test_counter got 3 # test_counter got 4 # test_counter got 5 # 2025-06-25 09:22:31,719 SUCC cli.py:63 -- ----------------------------------- # 2025-06-25 09:22:31,719 SUCC cli.py:64 -- Job 'rayjob-sample-zdxm6' succeeded # 2025-06-25 09:22:31,719 SUCC cli.py:65 -- ----------------------------------- ``` The Python script `sample_code.py` used by `entrypoint` is a simple Ray script that executes a counter's increment function 5 times. 
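For reference, the sketch below approximates what a counter script of that kind looks like. The real `sample_code.py` lives in the ConfigMap referenced by `ray-job.sample.yaml`, so treat this as an illustration rather than the exact file:

```python
import ray

ray.init()

@ray.remote
class Counter:
    """A tiny actor holding a single integer counter."""

    def __init__(self):
        self.count = 0

    def inc(self):
        self.count += 1
        return self.count

counter = Counter.remote()
for _ in range(5):
    # Each call increments the counter on the actor and returns the new value.
    print(f"test_counter got {ray.get(counter.inc.remote())}")
```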
## Step 6: Delete the RayJob

```sh
kubectl delete -f https://raw.githubusercontent.com/ray-project/kuberay/v1.5.1/ray-operator/config/samples/ray-job.sample.yaml
```

## Step 7: Create a RayJob with `shutdownAfterJobFinishes` set to true

```sh
kubectl apply -f https://raw.githubusercontent.com/ray-project/kuberay/v1.5.1/ray-operator/config/samples/ray-job.shutdown.yaml
```

The `ray-job.shutdown.yaml` defines a RayJob custom resource with `shutdownAfterJobFinishes: true` and `ttlSecondsAfterFinished: 10`. Hence, the KubeRay operator deletes the RayCluster 10 seconds after the Ray job finishes. The operator doesn't delete the submitter Job, because it contains the Ray job logs and doesn't consume any cluster resources once it completes. The submitter Job is cleaned up later when you delete the RayJob, because it has an owner reference back to the RayJob.

## Step 8: Check the RayJob status

```sh
# Wait until `jobStatus` is `SUCCEEDED` and `jobDeploymentStatus` is `Complete`.
kubectl get rayjobs.ray.io rayjob-sample-shutdown -o jsonpath='{.status.jobDeploymentStatus}'
kubectl get rayjobs.ray.io rayjob-sample-shutdown -o jsonpath='{.status.jobStatus}'
```

## Step 9: Check if the KubeRay operator deletes the RayCluster

```sh
# List the RayCluster custom resources in the `default` namespace. The RayCluster
# associated with the RayJob `rayjob-sample-shutdown` should be deleted.
kubectl get raycluster
```

## Step 10: Clean up

```sh
# Step 10.1: Delete the RayJob
kubectl delete -f https://raw.githubusercontent.com/ray-project/kuberay/v1.5.1/ray-operator/config/samples/ray-job.shutdown.yaml

# Step 10.2: Delete the KubeRay operator
helm uninstall kuberay-operator

# Step 10.3: Delete the Kubernetes cluster
kind delete cluster
```

## Next steps

* [RayJob Batch Inference Example](kuberay-batch-inference-example)
* [Priority Scheduling with RayJob and Kueue](kuberay-kueue-priority-scheduling-example)
* [Gang Scheduling with RayJob and Kueue](kuberay-kueue-gang-scheduling-example)

---

(kuberay-rayservice-quickstart)=

# RayService Quickstart

## Prerequisites

This guide mainly focuses on the behavior of KubeRay v1.5.1 and Ray 2.46.0.

## What's a RayService?

A RayService manages these components:

* **RayCluster**: Manages resources in a Kubernetes cluster.
* **Ray Serve Applications**: Manages users' applications.

## What does the RayService provide?

* **Kubernetes-native support for Ray clusters and Ray Serve applications:** After using a Kubernetes configuration to define a Ray cluster and its Ray Serve applications, you can use `kubectl` to create the cluster and its applications.
* **In-place updates for Ray Serve applications:** See [RayService](kuberay-rayservice) for more details.
* **Zero downtime upgrades for Ray clusters:** See [RayService](kuberay-rayservice) for more details.
* **Highly available services:** See [RayService high availability](kuberay-rayservice-ha) for more details.

## Example: Serve two simple Ray Serve applications using RayService

## Step 1: Create a Kubernetes cluster with Kind

```sh
kind create cluster --image=kindest/node:v1.26.0
```

## Step 2: Install the KubeRay operator

Follow [this document](kuberay-operator-deploy) to install the latest stable KubeRay operator from the Helm repository. Note that the YAML file in this example uses `serveConfigV2` to specify a multi-application Serve configuration, available starting from KubeRay v0.6.0.
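To make the `serveConfigV2` field concrete before you install the sample, the snippet below sketches the general shape of a multi-application Serve config embedded in a RayService spec. The application names, route prefixes, import paths, and archive URL are placeholders, not the exact contents of `ray-service.sample.yaml`:

```yaml
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: rayservice-sketch
spec:
  # serveConfigV2 is a multi-line YAML string describing one or more Serve applications.
  serveConfigV2: |
    applications:
      - name: app1                     # placeholder application name
        route_prefix: /app1
        import_path: my_module.app1    # placeholder import path
        runtime_env:
          working_dir: "https://example.com/my_serve_app.zip"  # placeholder archive
      - name: app2                     # placeholder application name
        route_prefix: /app2
        import_path: my_module.app2    # placeholder import path
  rayClusterConfig:
    rayVersion: "2.46.0"
    headGroupSpec:
      rayStartParams:
        dashboard-host: "0.0.0.0"
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.46.0
    workerGroupSpecs:
      - groupName: small-group
        replicas: 1
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray:2.46.0
```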
## Step 3: Install a RayService ```sh kubectl apply -f https://raw.githubusercontent.com/ray-project/kuberay/v1.5.1/ray-operator/config/samples/ray-service.sample.yaml ``` ## Step 4: Verify the Kubernetes cluster status ```sh # Step 4.1: List all RayService custom resources in the `default` namespace. kubectl get rayservice # [Example output] # NAME SERVICE STATUS NUM SERVE ENDPOINTS # rayservice-sample Running 2 # Step 4.2: List all RayCluster custom resources in the `default` namespace. kubectl get raycluster # [Example output] # NAME DESIRED WORKERS AVAILABLE WORKERS CPUS MEMORY GPUS STATUS AGE # rayservice-sample-cxm7t 1 1 2500m 4Gi 0 ready 79s # Step 4.3: List all Ray Pods in the `default` namespace. kubectl get pods -l=ray.io/is-ray-node=yes # [Example output] # NAME READY STATUS RESTARTS AGE # rayservice-sample-cxm7t-head 1/1 Running 0 3m5s # rayservice-sample-cxm7t-small-group-worker-8hrgg 1/1 Running 0 3m5s # Step 4.4: Check the `Ready` condition of the RayService. # The RayService is ready to serve requests when the condition is `True`. kubectl describe rayservices.ray.io rayservice-sample # [Example output] # Conditions: # Last Transition Time: 2025-06-26T13:23:06Z # Message: Number of serve endpoints is greater than 0 # Observed Generation: 1 # Reason: NonZeroServeEndpoints # Status: True # Type: Ready # Step 4.5: List services in the `default` namespace. kubectl get services # NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE # ... # rayservice-sample-cxm7t-head-svc ClusterIP None 10001/TCP,8265/TCP,6379/TCP,8080/TCP,8000/TCP 71m # rayservice-sample-head-svc ClusterIP None 10001/TCP,8265/TCP,6379/TCP,8080/TCP,8000/TCP 70m # rayservice-sample-serve-svc ClusterIP 10.96.125.107 8000/TCP 70m ``` When the Ray Serve applications are healthy and ready, KubeRay creates a head service and a Ray Serve service for the RayService custom resource. For example, `rayservice-sample-head-svc` and `rayservice-sample-serve-svc` in Step 4.5. > **What do these services do?** - **`rayservice-sample-head-svc`** This service points to the **head pod** of the active RayCluster and is typically used to view the **Ray Dashboard** (port `8265`). - **`rayservice-sample-serve-svc`** This service exposes the **HTTP interface** of Ray Serve, typically on port `8000`. Use this service to send HTTP requests to your deployed Serve applications (e.g., REST API, ML inference, etc.). ## Step 5: Verify the status of the Serve applications ```sh # (1) Forward the dashboard port to localhost. # (2) Check the Serve page in the Ray dashboard at http://localhost:8265/#/serve. kubectl port-forward svc/rayservice-sample-head-svc 8265:8265 ``` * Refer to [rayservice-troubleshooting.md](kuberay-raysvc-troubleshoot) for more details on RayService observability. Below is a screenshot example of the Serve page in the Ray dashboard. ![Ray Serve Dashboard](../images/dashboard_serve.png) ## Step 6: Send requests to the Serve applications by the Kubernetes serve service ```sh # Step 6.1: Run a curl Pod. # If you already have a curl Pod, you can use `kubectl exec -it -- sh` to access the Pod. kubectl run curl --image=radial/busyboxplus:curl -i --tty # Step 6.2: Send a request to the fruit stand app. curl -X POST -H 'Content-Type: application/json' rayservice-sample-serve-svc:8000/fruit/ -d '["MANGO", 2]' # [Expected output]: 6 # Step 6.3: Send a request to the calculator app. curl -X POST -H 'Content-Type: application/json' rayservice-sample-serve-svc:8000/calc/ -d '["MUL", 3]' # [Expected output]: "15 pizzas please!" 
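# Step 6.4: Exit the curl Pod's shell before running the cleanup commands below from your local terminal.
exit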
``` ## Step 7: Clean up the Kubernetes cluster ```sh # Delete the RayService. kubectl delete -f https://raw.githubusercontent.com/ray-project/kuberay/v1.5.1/ray-operator/config/samples/ray-service.sample.yaml # Uninstall the KubeRay operator. helm uninstall kuberay-operator # Delete the curl Pod. kubectl delete pod curl ``` ## Next steps * See [RayService](kuberay-rayservice) document for the full list of RayService features, including in-place update, zero downtime upgrade, and high-availability. * See [RayService troubleshooting guide](kuberay-raysvc-troubleshoot) if you encounter any issues. * See [Examples](kuberay-examples) for more RayService examples. The [MobileNet example](kuberay-mobilenet-rayservice-example) is a good example to start with because it doesn't require GPUs and is easy to run on a local machine. --- (kuberay-quickstart)= # Getting Started with KubeRay ```{toctree} :hidden: getting-started/kuberay-operator-installation getting-started/raycluster-quick-start getting-started/rayjob-quick-start getting-started/rayservice-quick-start ``` ## Custom Resource Definitions (CRDs) [KubeRay](https://github.com/ray-project/kuberay) is a powerful, open-source Kubernetes operator that simplifies the deployment and management of Ray applications on Kubernetes. It offers 3 custom resource definitions (CRDs): * **RayCluster**: KubeRay fully manages the lifecycle of RayCluster, including cluster creation/deletion, autoscaling, and ensuring fault tolerance. * **RayJob**: With RayJob, KubeRay automatically creates a RayCluster and submits a job when the cluster is ready. You can also configure RayJob to automatically delete the RayCluster once the job finishes. * **RayService**: RayService is made up of two parts: a RayCluster and Ray Serve deployment graphs. RayService offers zero-downtime upgrades for RayCluster and high availability. ## Which CRD should you choose? Using [RayService](kuberay-rayservice-quickstart) to serve models and using [RayCluster](kuberay-raycluster-quickstart) to develop Ray applications are no-brainer recommendations from us. However, if the use case is not model serving or prototyping, how do you choose between [RayCluster](kuberay-raycluster-quickstart) and [RayJob](kuberay-rayjob-quickstart)? ### Q: Is downtime acceptable during a cluster upgrade (e.g. Upgrade Ray version)? If not, use RayJob. RayJob can be configured to automatically delete the RayCluster once the job is completed. You can switch between Ray versions and configurations for each job submission using RayJob. If yes, use RayCluster. Ray doesn't natively support rolling upgrades; thus, you'll need to manually shut down and create a new RayCluster. ### Q: Are you deploying on public cloud providers (e.g. AWS, GCP, Azure)? If yes, use RayJob. It allows automatic deletion of the RayCluster upon job completion, helping you reduce costs. ### Q: Do you care about the latency introduced by spinning up a RayCluster? If yes, use RayCluster. Unlike RayJob, which creates a new RayCluster every time a job is submitted, RayCluster creates the cluster just once and can be used multiple times. ## Run your first Ray application on Kubernetes! 
* [RayCluster Quick Start](kuberay-raycluster-quickstart) * [RayJob Quick Start](kuberay-rayjob-quickstart) * [RayService Quick Start](kuberay-rayservice-quickstart) --- # Ray on Kubernetes ```{toctree} :hidden: getting-started user-guides examples k8s-ecosystem benchmarks troubleshooting references ``` (kuberay-index)= ## Overview In this section we cover how to execute your distributed Ray programs on a Kubernetes cluster. Using the [KubeRay operator](https://github.com/ray-project/kuberay) is the recommended way to do so. The operator provides a Kubernetes-native way to manage Ray clusters. Each Ray cluster consists of a head node pod and a collection of worker node pods. Optional autoscaling support allows the KubeRay operator to size your Ray clusters according to the requirements of your Ray workload, adding and removing Ray pods as needed. KubeRay supports heterogenous compute nodes (including GPUs) as well as running multiple Ray clusters with different Ray versions in the same Kubernetes cluster. ```{eval-rst} .. image:: images/ray_on_kubernetes.png :align: center .. Find source document here: https://docs.google.com/drawings/d/1E3FQgWWLuj8y2zPdKXjoWKrfwgYXw6RV_FWRwK8dVlg/edit ``` KubeRay introduces three distinct Kubernetes Custom Resource Definitions (CRDs): **RayCluster**, **RayJob**, and **RayService**. These CRDs assist users in efficiently managing Ray clusters tailored to various use cases. See [Getting Started](kuberay-quickstart) to learn the basics of KubeRay and follow the quickstart guides to run your first Ray application on Kubernetes with KubeRay. * [RayCluster Quick Start](kuberay-raycluster-quickstart) * [RayJob Quick Start](kuberay-rayjob-quickstart) * [RayService Quick Start](kuberay-rayservice-quickstart) Additionally, [Anyscale](https://console.anyscale.com/register/ha?render_flow=ray&utm_source=ray_docs&utm_medium=docs&utm_campaign=ray-doc-upsell&utm_content=deploy-ray-on-k8s) is the managed Ray platform developed by the creators of Ray. It offers an easy path to deploy Ray clusters on your existing Kubernetes infrastructure, including EKS, GKE, AKS, or self-hosted Kubernetes. ## Learn More The Ray docs present all the information you need to start running Ray workloads on Kubernetes. ```{eval-rst} .. grid:: 1 2 2 2 :gutter: 1 :class-container: container pb-3 .. grid-item-card:: **Getting Started** ^^^ Learn how to start a Ray cluster and deploy Ray applications on Kubernetes. +++ .. button-ref:: kuberay-quickstart :color: primary :outline: :expand: Get Started with Ray on Kubernetes .. grid-item-card:: **User Guides** ^^^ Learn best practices for configuring Ray clusters on Kubernetes. +++ .. button-ref:: kuberay-guides :color: primary :outline: :expand: Read the User Guides .. grid-item-card:: **Examples** ^^^ Try example Ray workloads on Kubernetes. +++ .. button-ref:: kuberay-examples :color: primary :outline: :expand: Try example workloads .. grid-item-card:: **Ecosystem** ^^^ Integrate KubeRay with third party Kubernetes ecosystem tools. +++ .. button-ref:: kuberay-ecosystem-integration :color: primary :outline: :expand: Ecosystem Guides .. grid-item-card:: **Benchmarks** ^^^ Check the KubeRay benchmark results. +++ .. button-ref:: kuberay-benchmarks :color: primary :outline: :expand: Benchmark results .. grid-item-card:: **Troubleshooting** ^^^ Consult the KubeRay troubleshooting guides. +++ .. 
button-ref:: kuberay-troubleshooting :color: primary :outline: :expand: Troubleshooting guides ``` ## About KubeRay Ray's Kubernetes support is developed at the [KubeRay GitHub repository](https://github.com/ray-project/kuberay), under the broader [Ray project](https://github.com/ray-project/). KubeRay is used by several companies to run production Ray deployments. - Visit the [KubeRay GitHub repo](https://github.com/ray-project/kuberay) to track progress, report bugs, propose new features, or contribute to the project. --- (kuberay-ingress)= # Ingress Four examples show how to use ingress to access your Ray cluster: * [AWS Application Load Balancer (ALB) Ingress support on AWS EKS](kuberay-aws-alb) * [GKE Ingress support](kuberay-gke-ingress) * [Manually setting up NGINX Ingress on Kind](kuberay-nginx) * [Azure Application Gateway for Containers Gateway API support on AKS](kuberay-aks-agc) ```{admonition} Warning :class: warning **Only expose Ingresses to authorized users.** The Ray Dashboard provides read and write access to the Ray Cluster. Anyone with access to this Ingress can execute arbitrary code on the Ray Cluster. ``` (kuberay-aws-alb)= ## AWS Application Load Balancer (ALB) Ingress support on AWS EKS ### Prerequisites * Create an EKS cluster. See [Getting started with Amazon EKS – AWS Management Console and AWS CLI](https://docs.aws.amazon.com/eks/latest/userguide/getting-started-console.html#eks-configure-kubectl). * Set up the [AWS Load Balancer controller](https://github.com/kubernetes-sigs/aws-load-balancer-controller), see [installation instructions](https://kubernetes-sigs.github.io/aws-load-balancer-controller/latest/deploy/installation/). Note that the repository maintains a webpage for each release. Confirm that you are using the latest installation instructions. * (Optional) Try the [echo server example](https://github.com/kubernetes-sigs/aws-load-balancer-controller/blob/main/docs/examples/echo_server.md) in the [aws-load-balancer-controller](https://github.com/kubernetes-sigs/aws-load-balancer-controller) repository. * (Optional) Read [how-it-works.md](https://github.com/kubernetes-sigs/aws-load-balancer-controller/blob/main/docs/how-it-works.md) to understand the [aws-load-balancer-controller](https://github.com/kubernetes-sigs/aws-load-balancer-controller) mechanism. ### Instructions ```sh # Step 1: Install KubeRay operator and CRD helm repo add kuberay https://ray-project.github.io/kuberay-helm/ helm repo update helm install kuberay-operator kuberay/kuberay-operator --version 1.5.1 # Step 2: Install a RayCluster helm install raycluster kuberay/ray-cluster --version 1.5.1 # Step 3: Edit the `ray-operator/config/samples/ray-cluster-alb-ingress.yaml` # # (1) Annotation `alb.ingress.kubernetes.io/subnets` # 1. Please include at least two subnets. # 2. One Availability Zone (ex: us-west-2a) can only have at most 1 subnet. # 3. In this example, you need to select public subnets (subnets that "Auto-assign public IPv4 address" is Yes on AWS dashboard) # # (2) Set the name of head pod service to `spec...backend.service.name` eksctl get cluster ${YOUR_EKS_CLUSTER} # Check subnets on the EKS cluster # Step 4: Check ingress created by Step 4. 
kubectl describe ingress ray-cluster-ingress # [Example] # Name: ray-cluster-ingress # Labels: # Namespace: default # Address: k8s-default-rayclust-....${REGION_CODE}.elb.amazonaws.com # Default backend: default-http-backend:80 () # Rules: # Host Path Backends # ---- ---- -------- # * # / ray-cluster-kuberay-head-svc:8265 (192.168.185.157:8265) # Annotations: alb.ingress.kubernetes.io/scheme: internal # alb.ingress.kubernetes.io/subnets: ${SUBNET_1},${SUBNET_2} # alb.ingress.kubernetes.io/tags: Environment=dev,Team=test # alb.ingress.kubernetes.io/target-type: ip # Events: # Type Reason Age From Message # ---- ------ ---- ---- ------- # Normal SuccessfullyReconciled 39m ingress Successfully reconciled # Step 6: Check ALB on AWS (EC2 -> Load Balancing -> Load Balancers) # The name of the ALB should be like "k8s-default-rayclust-......". # Step 7: Check Ray Dashboard by ALB DNS Name. The name of the DNS Name should be like # "k8s-default-rayclust-.....us-west-2.elb.amazonaws.com" # Step 8: Delete the ingress, and AWS Load Balancer controller will remove ALB. # Check ALB on AWS to make sure it is removed. kubectl delete ingress ray-cluster-ingress ``` (kuberay-gke-ingress)= ## GKE Ingress support ### Prerequisites * Create a GKE cluster and ensure that you have the kubectl tool installed and authenticated to communicate with your GKE cluster. See [this tutorial](kuberay-gke-gpu-cluster-setup) for an example of how to create a GKE cluster with GPUs. (GPUs are not necessary for this section.) * If you are using a `gce-internal` ingress, create a [Proxy-Only subnet](https://cloud.google.com/load-balancing/docs/proxy-only-subnets#proxy_only_subnet_create) in the same region as your GKE cluster. * It may be helpful to understand the concepts at . ### Instructions Save the following file as `ray-cluster-gclb-ingress.yaml`: ```yaml apiVersion: networking.k8s.io/v1 kind: Ingress metadata: name: ray-cluster-ingress annotations: kubernetes.io/ingress.class: "gce-internal" spec: rules: - http: paths: - path: / pathType: Prefix backend: service: name: raycluster-kuberay-head-svc # Update this line with your head service in Step 3 below. port: number: 8265 ``` Now run the following commands: ```bash # Step 1: Install KubeRay operator and CRD helm repo add kuberay https://ray-project.github.io/kuberay-helm/ helm repo update helm install kuberay-operator kuberay/kuberay-operator --version 1.5.1 # Step 2: Install a RayCluster helm install raycluster kuberay/ray-cluster --version 1.5.1 # Step 3: Edit ray-cluster-gclb-ingress.yaml to replace the service name with the name of the head service from the RayCluster. (Output of `kubectl get svc`) # Step 4: Apply the Ingress configuration kubectl apply -f ray-cluster-gclb-ingress.yaml # Step 5: Check ingress created by Step 4. kubectl describe ingress ray-cluster-ingress # Step 6: After a few minutes, GKE allocates an external IP for the ingress. Check it using: kubectl get ingress ray-cluster-ingress # Example output: # NAME CLASS HOSTS ADDRESS PORTS AGE # ray-cluster-ingress * 34.160.82.156 80 54m # Step 7: Check Ray Dashboard by visiting the allocated external IP in your browser. (In this example, it is 34.160.82.156) # Step 8: Delete the ingress. 
kubectl delete ingress ray-cluster-ingress ``` (kuberay-nginx)= ## Manually setting up NGINX Ingress on Kind ```sh # Step 1: Create a Kind cluster with `extraPortMappings` and `node-labels` # Reference for the setting up of Kind cluster: https://kind.sigs.k8s.io/docs/user/ingress/ cat </raycluster-ingress/` on your browser. You will see the Ray Dashboard. # [Note] The forward slash at the end of the address is necessary. `/raycluster-ingress` # will report "404 Not Found". ``` (kuberay-aks-agc)= ## Azure Application Gateway for Containers Gateway API support on AKS ### Prerequisites * Create an AKS cluster. See [Quickstart: Deploy an Azure Kubernetes Service (AKS) cluster using Azure CLI](https://learn.microsoft.com/azure/aks/learn/quick-kubernetes-deploy-cli). * Deploy Application Gateway for Containers ALB Controller [Quickstart: Deploy Application Gateway for Containers ALB Controller](https://learn.microsoft.com/azure/application-gateway/for-containers/quickstart-deploy-application-gateway-for-containers-alb-controller?tabs=install-helm-windows). * Deploy Application Gateway for Containers [Quickstart: Create Application Gateway for Containers managed by ALB Controller](https://learn.microsoft.com/azure/application-gateway/for-containers/quickstart-create-application-gateway-for-containers-managed-by-alb-controller?tabs=new-subnet-aks-vnet) * (Optional) Read [What is Application Gateway for Containers](https://learn.microsoft.com/azure/application-gateway/for-containers/overview). * (Optional) Read [Secure your web applications with Azure Web Application Firewall on Application Gateway for Containers](https://learn.microsoft.com/azure/application-gateway/for-containers/web-application-firewall) ### Instructions ```sh # Step 1: Install KubeRay operator and CRD helm repo add kuberay https://ray-project.github.io/kuberay-helm/ helm repo update helm install kuberay-operator kuberay/kuberay-operator --version 1.5.1 # Step 2: Install a RayCluster helm install raycluster kuberay/ray-cluster --version 1.5.1 # Step 3: Edit the `ray-operator/config/samples/ray-cluster-agc-gatewayapi.yaml` # # (1) Annotation `alb.networking.azure.io/alb-namespace` # 1. Please update this to the namespace of your alb custom resource. # # (2) Annotation `alb.networking.azure.io/alb-name` # 1. Please update this to the name of your alb custom resource. # Step 4: Check gateway and http route created by Step 3. kubectl describe gateway ray-cluster-gateway # [Example] # Name: ray-cluster-gateway # Namespace: default # Labels: # Annotations: # alb.networking.azure.io/alb-namespace: alb-test-infra # alb.networking.azure.io/alb-name: alb-test # API Version: gateway.networking.k8s.io/v1 # Kind: Gateway # Metadata: # Creation Timestamp: 2025-09-12T04:44:18Z # Generation: 1 # Resource Version: 247986 # UID: 88c40c06-83fe-4ef3-84e1-7bc36c9b5b43 # Spec: # Gateway Class Name: azure-alb-external # Listeners: # Allowed Routes: # Namespaces: # From: Same # Name: http # Port: 80 # Protocol: HTTP # Status: # Addresses: # Type: Hostname # Value: xxxx.yyyy.alb.azure.com # Conditions: # Last Transition Time: 2025-09-12T04:49:30Z # Message: Valid Gateway # Observed Generation: 1 # Reason: Accepted # Status: True # Type: Accepted # Last Transition Time: 2025-09-12T04:49:30Z # Message: Application Gateway for Containers resource has been successfully updated. 
# Observed Generation: 1 # Reason: Programmed # Status: True # Type: Programmed # Listeners: # Attached Routes: 1 # Conditions: # Last Transition Time: 2025-09-12T04:49:30Z # Message: # Observed Generation: 1 # Reason: ResolvedRefs # Status: True # Type: ResolvedRefs # Last Transition Time: 2025-09-12T04:49:30Z # Message: Listener is Accepted # Observed Generation: 1 # Reason: Accepted # Status: True # Type: Accepted # Last Transition Time: 2025-09-12T04:49:30Z # Message: Application Gateway for Containers resource has been successfully updated. # Observed Generation: 1 # Reason: Programmed # Status: True # Type: Programmed # Name: http # Supported Kinds: # Group: gateway.networking.k8s.io # Kind: HTTPRoute # Group: gateway.networking.k8s.io # Kind: GRPCRoute # Events: kubectl describe httproutes ray-cluster-http-route # [Example] # Name: ray-cluster-http-route # Namespace: default # Labels: # Annotations: # API Version: gateway.networking.k8s.io/v1 # Kind: HTTPRoute # Metadata: # Creation Timestamp: 2025-09-12T04:44:43Z # Generation: 2 # Resource Version: 247982 # UID: 54bbd1e6-bd28-4cae-a469-e15105f077b8 # Spec: # Parent Refs: # Group: gateway.networking.k8s.io # Kind: Gateway # Name: ray-cluster-gateway # Rules: # Backend Refs: # Group: # Kind: Service # Name: raycluster-kuberay-head-svc # Port: 8265 # Weight: 1 # Matches: # Path: # Type: PathPrefix # Value: / # Status: # Parents: # Conditions: # Last Transition Time: 2025-09-12T04:49:30Z # Message: # Observed Generation: 2 # Reason: ResolvedRefs # Status: True # Type: ResolvedRefs # Last Transition Time: 2025-09-12T04:49:30Z # Message: Route is Accepted # Observed Generation: 2 # Reason: Accepted # Status: True # Type: Accepted # Last Transition Time: 2025-09-12T04:49:30Z # Message: Application Gateway for Containers resource has been successfully updated. # Observed Generation: 2 # Reason: Programmed # Status: True # Type: Programmed # Controller Name: alb.networking.azure.io/alb-controller # Parent Ref: # Group: gateway.networking.k8s.io # Kind: Gateway # Name: ray-cluster-gateway # Events: # Step 5: Check Ray Dashboard by visiting the FQDN assigned to your gateway object in your browser # FQDN can be obtained by the command: # kubectl get gateway ray-cluster-gateway -o jsonpath='{.status.addresses[0].value}' # Step 6: Delete the gateway and http route kubectl delete gateway ray-cluster-gateway kubectl delete httproutes ray-cluster-http-route # Step 7: Delete Application Gateway for containers kubectl delete applicationloadbalancer alb-test -n alb-test-infra kubectl delete ns alb-test-infra ``` --- (kuberay-istio)= # mTLS and L7 observability with Istio This integration guide for KubeRay and Istio enables mTLS and L7 traffic observability in a RayCluster on a local Kind cluster. ## Istio [Istio](https://istio.io/) is an open-source service mesh that provides a uniform and more efficient way to secure, connect, and monitor services. Some features of its powerful control plane include: * Secure network traffic in a Kubernetes cluster with TLS encryption. * Automatic metrics, logs, and traces for all traffic within a cluster. See the [Istio documentation](https://istio.io/latest/docs/) to learn more. ## Step 0: Create a Kind cluster Create a Kind cluster with the following command: ```bash kind create cluster ``` ## Step 1: Install Istio ```bash # Download Istioctl and its manifests. export ISTIO_VERSION=1.21.1 curl -L https://istio.io/downloadIstio | sh - cd istio-1.21.1 export PATH=$PWD/bin:$PATH # Install Istio with: # 1. 
100% trace sampling for demo purposes. # 2. "sanitize_te" disabled for proper gRPC interception. This is required by Istio 1.21.0 (https://github.com/istio/istio/issues/49685). # 3. TLS 1.3 enabled. istioctl install -y -f - < Go to the Jaeger dashboard with the `service=raycluster-istio.default` query: http://localhost:16686/jaeger/search?limit=1000&lookback=1h&maxDuration&minDuration&service=raycluster-istio.default ![Istio Jaeger Overview](../images/istio-jaeger-1.png) You can click on any trace of the internal gRPC calls and view their details, such as `grpc.path` and `status code`. ![Istio Jaeger Trace](../images/istio-jaeger-2.png) ## Step 7: Clean up Run the following command to delete your cluster. ```bash kind delete cluster ``` --- (kuberay-kai-scheduler)= # Gang scheduling, queue priority, and GPU sharing for RayClusters using KAI Scheduler This guide demonstrates how to use KAI Scheduler for setting up hierarchical queues with quotas, gang scheduling, and GPU sharing using RayClusters. ## KAI Scheduler [KAI Scheduler](https://github.com/NVIDIA/KAI-Scheduler) is a high-performance, scalable Kubernetes scheduler built for AI/ML workloads. Designed to orchestrate GPU clusters at massive scale, KAI optimizes GPU allocation and supports the full AI lifecycle - from interactive development to large distributed training and inference. Some of the key features are: - **Bin packing and spread scheduling**: Optimize node usage either by minimizing fragmentation using bin packing or increasing resiliency and load balancing using spread scheduling. - **GPU sharing**: Allow KAI to consolidate multiple Ray workloads from across teams on the same GPU, letting your organization fit more work onto your existing hardware and reducing idle GPU time. - **Workload autoscaling**: Scale Ray replicas or workers within min/max while respecting gang constraints - **Cluster autoscaling**: Compatible with dynamic cloud infrastructures (including auto-scalers like Karpenter) - **Workload priorities**: Prioritize Ray workloads effectively within queues - **Hierarchical queues and fairness**: Two-level queues with quotas, over-quota weights, limits, and equitable resource distribution between queues using DRF and many more. For more details and key features, see [the documentation](https://github.com/NVIDIA/KAI-Scheduler?tab=readme-ov-file#key-features). ### Core components 1. **PodGroups**: PodGroups are atomic units for scheduling and represent one or more interdependent pods that the scheduler execute as a single unit, also known as gang scheduling. They're vital for distributed workloads. KAI Scheduler includes a **PodGrouper** that handles gang scheduling automatically. **How PodGrouper works:** ``` RayCluster "distributed-training": ├── Head Pod: 1 GPU └── Worker Group: 4 × 0.5 GPU = 2 GPUs Total Group Requirement: 3 GPUs PodGrouper schedules all 5 pods (1 head + 4 workers) together or none at all. ``` 2. **Queues**: Queues enforce fairness in resource distribution using: - Quota: The baseline amount of resources guaranteed to the queue. The scheduler allocates quotas first to ensure fairness. - Queue priority: Determines the order in which queues receive resources beyond their quota. The scheduler serves the higher-priority queues first. - Over-quota weight: Controls how the scheduler divides surplus resources among queues within the same priority level. Queues with higher weights receive a larger share of the extra resources. - Limit: Defines the maximum resources that the queue can consume. 
You can arrange queues hierarchically for organizations with multiple teams, for example, departments with multiple teams. ## [Prerequisites](https://github.com/NVIDIA/KAI-Scheduler?tab=readme-ov-file#prerequisites) * Kubernetes cluster with GPU nodes * NVIDIA GPU Operator * kubectl configured to access your cluster * Install KAI Scheduler with GPU-sharing enabled. Choose the desired release version from [KAI Scheduler releases](https://github.com/NVIDIA/KAI-Scheduler/releases) and replace the `` in the following command. It's recommended to choose v0.10.0 or higher version. ```bash # Install KAI Scheduler helm upgrade -i kai-scheduler oci://ghcr.io/nvidia/kai-scheduler/kai-scheduler -n kai-scheduler --create-namespace --version --set "global.gpuSharing=true" ``` ## Step 1: Install the KubeRay operator with KAI Scheduler as the batch scheduler Follow the official KubeRay operator [installation documentation](https://docs.ray.io/en/master/cluster/kubernetes/getting-started/kuberay-operator-installation.html#kuberay-operator-installation) and add the following configuration to enable KAI Scheduler integration: ```bash --set batchScheduler.name=kai-scheduler ``` ## Step 2: Create KAI Scheduler Queues Create a basic queue structure for department-1 and its child team-a. For demo reasons, this example doesn't enforce any quota, overQuotaWeight, or limit. You can configure these parameters depending on your needs: ```yaml apiVersion: scheduling.run.ai/v2 kind: Queue metadata: name: department-1 spec: #priority: 100 (optional) resources: cpu: quota: -1 limit: -1 overQuotaWeight: 1 gpu: quota: -1 limit: -1 overQuotaWeight: 1 memory: quota: -1 limit: -1 overQuotaWeight: 1 --- apiVersion: scheduling.run.ai/v2 kind: Queue metadata: name: team-a spec: #priority: 200 (optional) parentQueue: department-1 resources: cpu: quota: -1 limit: -1 overQuotaWeight: 1 gpu: quota: -1 limit: -1 overQuotaWeight: 1 memory: quota: -1 limit: -1 overQuotaWeight: 1 ``` Note: To make this demo easier to follow, it combined these queue definitions with the RayCluster example in the next step. You can use the single combined YAML file and apply both queues and workloads at once. ## Step 3: Gang scheduling with KAI Scheduler The key pattern is to add the queue label to your RayCluster. [Here's a basic example](https://github.com/ray-project/kuberay/tree/master/ray-operator/config/samples/ray-cluster.kai-scheduler.yaml) from the KubeRay repository: ```yaml metadata: name: raycluster-sample labels: kai.scheduler/queue: team-a # This is the essential configuration. 
```

Apply this RayCluster together with the queues:

```bash
curl -LO https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/ray-cluster.kai-scheduler.yaml
kubectl apply -f ray-cluster.kai-scheduler.yaml

# Verify that the queues are created
kubectl get queues
# NAME           PRIORITY   PARENT         CHILDREN     DISPLAYNAME
# department-1                             ["team-a"]
# team-a                    department-1

# Watch the pods get scheduled
kubectl get pods -w
# NAME                                    READY   STATUS              RESTARTS   AGE
# kuberay-operator-7d86f4f46b-dq22x       1/1     Running             0          50s
# raycluster-sample-head-rvrkz            0/1     ContainerCreating   0          13s
# raycluster-sample-worker-worker-mlvtz   0/1     Init:0/1            0          13s
# raycluster-sample-worker-worker-rcb54   0/1     Init:0/1            0          13s
# raycluster-sample-worker-worker-mlvtz   0/1     Init:0/1            0          40s
# raycluster-sample-worker-worker-rcb54   0/1     Init:0/1            0          41s
# raycluster-sample-head-rvrkz            0/1     Running             0          42s
# raycluster-sample-head-rvrkz            1/1     Running             0          54s
# raycluster-sample-worker-worker-rcb54   0/1     PodInitializing     0          59s
# raycluster-sample-worker-worker-mlvtz   0/1     PodInitializing     0          59s
# raycluster-sample-worker-worker-rcb54   0/1     Running             0          60s
# raycluster-sample-worker-worker-mlvtz   0/1     Running             0          60s
# raycluster-sample-worker-worker-rcb54   1/1     Running             0          71s
# raycluster-sample-worker-worker-mlvtz   1/1     Running             0          71s
```

## Set priorities for workloads

In Kubernetes, assigning different priorities to workloads ensures efficient resource management, minimizes service disruption, and supports better scaling. By prioritizing workloads, KAI Scheduler schedules jobs according to their assigned priority. When sufficient resources aren't available for a workload, the scheduler can preempt lower-priority workloads to free up resources for higher-priority ones. This approach ensures that the scheduler always prioritizes mission-critical services in resource allocation.

The KAI Scheduler deployment comes with several predefined priority classes:

- train (50) - use for preemptible training workloads
- build-preemptible (75) - use for preemptible build/interactive workloads
- build (100) - use for build/interactive workloads (non-preemptible)
- inference (125) - use for inference workloads (non-preemptible)

You can resubmit the preceding workload with a specific priority. For example, modify the preceding example into a `build`-class workload:

```yaml
labels:
  kai.scheduler/queue: team-a   # This is the essential configuration.
  priorityClassName: build      # Optionally specify the priority class in metadata.labels
```

See the [documentation](https://github.com/NVIDIA/KAI-Scheduler/tree/main/docs/priority) for more information.

## Step 4: Submit Ray workers with GPU sharing

This example creates a RayCluster with two workers that share a single GPU with time-slicing, 0.5 GPU each. See the [YAML file](https://github.com/ray-project/kuberay/tree/master/ray-operator/config/samples/ray-cluster.kai-gpu-sharing.yaml):

```bash
curl -LO https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/ray-cluster.kai-gpu-sharing.yaml
kubectl apply -f ray-cluster.kai-gpu-sharing.yaml

# Watch the pods get scheduled
kubectl get pods -w
# NAME                                          READY   STATUS    RESTARTS   AGE
# kuberay-operator-7d86f4f46b-dq22x             1/1     Running   0          4m9s
# raycluster-half-gpu-head-9rtxf                0/1     Running   0          4s
# raycluster-half-gpu-shared-gpu-worker-5l7cn   0/1     Pending   0          4s
# raycluster-half-gpu-shared-gpu-worker-98tzh   0/1     Pending   0          4s
# ...
# (skip for brevity)
# raycluster-half-gpu-shared-gpu-worker-5l7cn   0/1     Init:0/1          0          6s
# raycluster-half-gpu-shared-gpu-worker-5l7cn   0/1     Init:0/1          0          7s
# raycluster-half-gpu-shared-gpu-worker-98tzh   0/1     Init:0/1          0          8s
# raycluster-half-gpu-head-9rtxf                1/1     Running           0          19s
# raycluster-half-gpu-shared-gpu-worker-5l7cn   0/1     PodInitializing   0          19s
# raycluster-half-gpu-shared-gpu-worker-98tzh   0/1     PodInitializing   0          19s
# raycluster-half-gpu-shared-gpu-worker-5l7cn   0/1     Running           0          20s
# raycluster-half-gpu-shared-gpu-worker-98tzh   0/1     Running           0          20s
# raycluster-half-gpu-shared-gpu-worker-5l7cn   1/1     Running           0          31s
# raycluster-half-gpu-shared-gpu-worker-98tzh   1/1     Running           0          31s
```

Note: GPU sharing with time slicing in this example occurs only at the Kubernetes layer, allowing multiple pods to share a single GPU device. The scheduler doesn't enforce memory isolation, so applications must manage their own usage to prevent interference. For other GPU sharing approaches, for example, MPS, see [the KAI documentation](https://github.com/NVIDIA/KAI-Scheduler/tree/main/docs/gpu-sharing).

### Verify GPU sharing is working

To confirm that GPU sharing is working correctly, use these commands:

```bash
# 1. Check GPU fraction annotations and shared GPU groups
kubectl get pods -l ray.io/cluster=raycluster-half-gpu -o custom-columns="NAME:.metadata.name,NODE:.spec.nodeName,GPU-FRACTION:.metadata.annotations.gpu-fraction,GPU-GROUP:.metadata.labels.runai-gpu-group"
```

You should see both worker pods on the same node with `GPU-FRACTION: 0.5` and the same `GPU-GROUP` ID:

```bash
NAME                                          NODE               GPU-FRACTION   GPU-GROUP
raycluster-half-gpu-head                      ip-xxx-xx-xx-xxx
raycluster-half-gpu-shared-gpu-worker-67tvw   ip-xxx-xx-xx-xxx   0.5            3e456911-a6ea-4b1a-8f55-e90fba89ad76
raycluster-half-gpu-shared-gpu-worker-v5tpp   ip-xxx-xx-xx-xxx   0.5            3e456911-a6ea-4b1a-8f55-e90fba89ad76
```

This output shows that both workers run on the same node and belong to the same `GPU-GROUP`, meaning they share the same physical GPU, each with `GPU-FRACTION: 0.5`.

## Troubleshooting

### Check for missing queue labels

If pods remain in the `Pending` state, the most common issue is a missing queue label. Check the operator logs for KAI Scheduler errors and look for error messages like:

```bash
"Queue label missing from RayCluster; pods will remain pending"
```

**Solution**: Ensure that your RayCluster has a queue label referencing a Queue that exists in the cluster:

```yaml
metadata:
  labels:
    kai.scheduler/queue: default # Add this label
```

---

(kuberay-kueue)=

# Gang scheduling, priority scheduling, and autoscaling for KubeRay CRDs with Kueue

This guide demonstrates how to integrate KubeRay with [Kueue](https://kueue.sigs.k8s.io/) to enable advanced scheduling capabilities, including gang scheduling and priority scheduling, for Ray applications on Kubernetes.

For real-world use cases with RayJob, see [Priority Scheduling with RayJob and Kueue](kuberay-kueue-priority-scheduling-example) and [Gang Scheduling with RayJob and Kueue](kuberay-kueue-gang-scheduling-example).

## What's Kueue?

[Kueue](https://kueue.sigs.k8s.io/) is a Kubernetes-native job queueing system that manages resource quotas and job lifecycle. Kueue decides when:

* To make a job wait.
* To admit a job to start, which triggers Kubernetes to create pods.
* To preempt a job, which triggers Kubernetes to delete active pods.
## Supported KubeRay CRDs Kueue has native support for the following KubeRay APIs: - **RayJob**: Ideal for batch processing and model training workloads (covered in this guide) - **RayCluster**: Perfect for managing long-running Ray clusters - **RayService**: Designed for serving models and applications *Note: This guide focuses on a detailed RayJob example on a kind cluster. For RayCluster and RayService examples, see the ["Working with RayCluster and RayService"](#working-with-raycluster-and-rayservice) section.* ## Prerequisites Before you begin, ensure you have a Kubernetes cluster. This guide uses a local Kind cluster. ## Step 0: Create a Kind cluster ```bash kind create cluster ``` ## Step 1: Install the KubeRay operator Follow [Deploy a KubeRay operator](kuberay-operator-deploy) to install the latest stable KubeRay operator from the Helm repository. ## Step 2: Install Kueue ```bash VERSION=v0.13.4 kubectl apply --server-side -f https://github.com/kubernetes-sigs/kueue/releases/download/$VERSION/manifests.yaml ``` See [Kueue Installation](https://kueue.sigs.k8s.io/docs/installation/#install-a-released-version) for more details on installing Kueue. **Note**: Some limitations exist between Kueue and RayJob. See the [limitations of Kueue](https://kueue.sigs.k8s.io/docs/tasks/run_rayjobs/#c-limitations) for more details. ## Step 3: Create Kueue Resources This manifest creates the necessary Kueue resources to manage scheduling and resource allocation. ```yaml # kueue-resources.yaml apiVersion: kueue.x-k8s.io/v1beta1 kind: ResourceFlavor metadata: name: "default-flavor" --- apiVersion: kueue.x-k8s.io/v1beta1 kind: ClusterQueue metadata: name: "cluster-queue" spec: preemption: withinClusterQueue: LowerPriority namespaceSelector: {} # Match all namespaces. resourceGroups: - coveredResources: ["cpu", "memory"] flavors: - name: "default-flavor" resources: - name: "cpu" nominalQuota: 3 - name: "memory" nominalQuota: 6G --- apiVersion: kueue.x-k8s.io/v1beta1 kind: LocalQueue metadata: namespace: "default" name: "user-queue" spec: clusterQueue: "cluster-queue" --- apiVersion: kueue.x-k8s.io/v1beta1 kind: WorkloadPriorityClass metadata: name: prod-priority value: 1000 description: "Priority class for prod jobs" --- apiVersion: kueue.x-k8s.io/v1beta1 kind: WorkloadPriorityClass metadata: name: dev-priority value: 100 description: "Priority class for development jobs" ``` The YAML manifest configures: * **ResourceFlavor** * The ResourceFlavor `default-flavor` is an empty ResourceFlavor because the compute resources in the Kubernetes cluster are homogeneous. In other words, users can request 1 CPU without considering whether it's an ARM chip or an x86 chip. * **ClusterQueue** * The ClusterQueue `cluster-queue` only has 1 ResourceFlavor `default-flavor` with quotas for 3 CPUs and 6G memory. * The ClusterQueue `cluster-queue` has a preemption policy `withinClusterQueue: LowerPriority`. This policy allows the pending RayJob that doesn’t fit within the nominal quota for its ClusterQueue to preempt active RayJob custom resources in the ClusterQueue that have lower priority. * **LocalQueue** * The LocalQueue `user-queue` is a namespace-scoped object in the `default` namespace which belongs to a ClusterQueue. A typical practice is to assign a namespace to a tenant, team, or user of an organization. Users submit jobs to a LocalQueue, instead of to a ClusterQueue directly. * **WorkloadPriorityClass** * The WorkloadPriorityClass `prod-priority` has a higher value than the WorkloadPriorityClass `dev-priority`. 
RayJob custom resources with the `prod-priority` priority class take precedence over RayJob custom resources with the `dev-priority` priority class. Create the Kueue resources: ```bash kubectl apply -f kueue-resources.yaml ``` ## Step 4: Gang scheduling with Kueue Kueue always admits workloads in “gang” mode. Kueue admits workloads on an “all or nothing” basis, ensuring that Kubernetes never partially provisions a RayJob or RayCluster. Use gang scheduling strategy to avoid wasting compute resources caused by inefficient scheduling of workloads. Download the RayJob YAML manifest from the KubeRay repository. ```bash curl -LO https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/ray-job.kueue-toy-sample.yaml ``` Before creating the RayJob, modify the RayJob metadata with the following: ```yaml metadata: generateName: rayjob-sample- labels: kueue.x-k8s.io/queue-name: user-queue kueue.x-k8s.io/priority-class: dev-priority ``` Create two RayJob custom resources with the same priority `dev-priority`. Note these important points for RayJob custom resources: * The RayJob custom resource includes 1 head Pod and 1 worker Pod, with each Pod requesting 1 CPU and 2G of memory. * The RayJob runs a simple Python script that demonstrates a loop running 600 iterations, printing the iteration number and sleeping for 1 second per iteration. Hence, the RayJob runs for about 600 seconds after the submitted Kubernetes Job starts. * Set `shutdownAfterJobFinishes` to true for RayJob to enable automatic cleanup. This setting triggers KubeRay to delete the RayCluster after the RayJob finishes. * Kueue doesn't handle the RayJob custom resource with the `shutdownAfterJobFinishes` set to false. See the [limitations of Kueue](https://kueue.sigs.k8s.io/docs/tasks/run_rayjobs/#c-limitations) for more details. ```yaml kubectl create -f ray-job.kueue-toy-sample.yaml ``` Each RayJob custom resource requests 2 CPUs and 4G of memory in total. However, the ClusterQueue only has 3 CPUs and 6G of memory in total. Therefore, the second RayJob custom resource remains pending, and KubeRay doesn't create Pods from the pending RayJob, even though the remaining resources are sufficient for a Pod. You can also inspect the `ClusterQueue` to see available and used quotas: ```bash $ kubectl get clusterqueues.kueue.x-k8s.io NAME COHORT PENDING WORKLOADS cluster-queue 1 $ kubectl get clusterqueues.kueue.x-k8s.io cluster-queue -o yaml Status: Admitted Workloads: 1 # Workloads admitted by queue. Conditions: Last Transition Time: 2024-02-28T22:41:28Z Message: Can admit new workloads Reason: Ready Status: True Type: Active Flavors Reservation: Name: default-flavor Resources: Borrowed: 0 Name: cpu Total: 2 Borrowed: 0 Name: memory Total: 4Gi Flavors Usage: Name: default-flavor Resources: Borrowed: 0 Name: cpu Total: 2 Borrowed: 0 Name: memory Total: 4Gi Pending Workloads: 1 Reserving Workloads: 1 ``` Kueue admits the pending RayJob custom resource when the first RayJob custom resource finishes. 
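To see this admission behavior directly, you can also list the Kueue Workload objects that Kueue creates for the RayJobs. This is an optional check, assuming both RayJobs live in the `default` namespace:

```sh
# One Workload should show as admitted while the other stays pending until quota frees up.
kubectl get workloads.kueue.x-k8s.io -n default
```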
Check the status of the RayJob custom resources and delete them after they finish: ```bash $ kubectl get rayjobs.ray.io NAME JOB STATUS DEPLOYMENT STATUS START TIME END TIME AGE rayjob-sample-ckvq4 SUCCEEDED Complete xxxxx xxxxx xxx rayjob-sample-p5msp SUCCEEDED Complete xxxxx xxxxx xxx $ kubectl delete rayjob rayjob-sample-ckvq4 $ kubectl delete rayjob rayjob-sample-p5msp ``` ## Step 5: Priority scheduling with Kueue This step creates a RayJob with a lower priority class `dev-priority` first and a RayJob with a higher priority class `prod-priority` later. The RayJob with higher priority class `prod-priority` takes precedence over the RayJob with lower priority class `dev-priority`. Kueue preempts the RayJob with a lower priority to admit the RayJob with a higher priority. If you followed the previous step, the RayJob YAML manifest `ray-job.kueue-toy-sample.yaml` should already be set to the `dev-priority` priority class. Create a RayJob with the lower priority class `dev-priority`: ```bash kubectl create -f ray-job.kueue-toy-sample.yaml ``` Before creating the RayJob with the higher priority class `prod-priority`, modify the RayJob metadata with the following: ```yaml metadata: generateName: rayjob-sample- labels: kueue.x-k8s.io/queue-name: user-queue kueue.x-k8s.io/priority-class: prod-priority ``` Create a RayJob with the higher priority class `prod-priority`: ```bash kubectl create -f ray-job.kueue-toy-sample.yaml ``` You can see that KubeRay operator deletes the Pods belonging to the RayJob with the lower priority class `dev-priority` and creates the Pods belonging to the RayJob with the higher priority class `prod-priority`. ## Working with RayCluster and RayService ### RayCluster with Kueue For gang scheduling with RayCluster resources, Kueue ensures that all cluster components (head and worker nodes) are provisioned together. This prevents partial cluster creation and resource waste. **For detailed RayCluster integration**: See the [Kueue documentation for RayCluster](https://kueue.sigs.k8s.io/docs/tasks/run/rayclusters/). ### RayService with Kueue RayService integration with Kueue enables gang scheduling for model serving workloads, ensuring consistent resource allocation for serving infrastructure. **For detailed RayService integration**: See the [Kueue documentation for RayService](https://kueue.sigs.k8s.io/docs/tasks/run/rayservices/). ## Ray Autoscaler with Kueue Kueue can treat a **RayCluster** or the underlying cluster of a **RayService** as an **elastic workload**. Kueue manages queueing and quota for the entire cluster, while the in‑tree Ray autoscaler scales worker Pods up and down based on the resource demand. This section shows how to enable autoscaling for Ray workloads managed by Kueue using a step‑by‑step approach similar to the existing Kueue integration guides. > **Supported resources** – At the time of writing, the Kueue > autoscaler integration supports `RayCluster` and `RayService`. Support > for `RayJob` autoscaling is under development; see the Kueue issue > tracker for updates: [issue](https://github.com/kubernetes-sigs/kueue/issues/7605). ### Prerequisites Make sure you have already: - Installed the [KubeRay operator](kuberay-operator-deploy). 
- Installed **Kueue** (See [Kueue Installation](https://kueue.sigs.k8s.io/docs/installation/#install-a-released-version) for more details, note that it's recommended to install Kueue version >= v0.13) --- ### Step 1: Create Kueue resources Define a **ResourceFlavor**, **ClusterQueue**, and **LocalQueue** so that Kueue knows how many CPUs and how much memory it can allocate. The manifest below creates an 8‑CPU/16‑GiB pool called `default-flavor`, registers it in a `ClusterQueue` named `ray-cq`, and defines a `LocalQueue` named `ray-lq`: ```yaml apiVersion: kueue.x-k8s.io/v1beta1 kind: ResourceFlavor metadata: name: default-flavor --- apiVersion: kueue.x-k8s.io/v1beta1 kind: ClusterQueue metadata: name: ray-cq spec: cohort: ray-example namespaceSelector: {} resourceGroups: - coveredResources: ["cpu", "memory"] flavors: - name: default-flavor resources: - name: cpu nominalQuota: 8 - name: memory nominalQuota: 16Gi --- apiVersion: kueue.x-k8s.io/v1beta1 kind: LocalQueue metadata: name: ray-lq namespace: default spec: clusterQueue: ray-cq ``` Apply the resources: ```bash kubectl apply -f kueue-resources.yaml ``` ### Step 2: Enable elastic workloads in Kueue Autoscaling only works when Kueue’s `ElasticJobsViaWorkloadSlices` feature gate is enabled. Run the following command to add the feature gate flag to the `kueue-controller-manager` Deployment: ```bash kubectl -n kueue-system patch deploy kueue-controller-manager \ --type='json' \ -p='[ { "op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--feature-gates=ElasticJobsViaWorkloadSlices=true" } ]' ``` ### Autoscaling with RayCluster #### Step 1: Configure an elastic RayCluster An elastic RayCluster is one that can change its worker count at runtime. Kueue requires three changes to recognize a RayCluster as elastic: 1. **Queue label** – set `metadata.labels.kueue.x-k8s.io/queue-name: ` so that Kueue queues this cluster. 2. **Elastic-job annotation** – add `metadata.annotations.kueue.x-k8s.io/elastic-job: "true"` to mark this cluster as elastic. Kueue creates **WorkloadSlices** for scaling up and down. 3. **Enable the Ray autoscaler** – set `spec.enableInTreeAutoscaling: true` in the `RayCluster` spec and optionally configure `autoscalerOptions` such as `idleTimeoutSeconds`. Here is a minimal manifest for an elastic RayCluster: ```yaml apiVersion: ray.io/v1 kind: RayCluster metadata: name: raycluster-kueue-autoscaler namespace: default labels: kueue.x-k8s.io/queue-name: ray-lq annotations: kueue.x-k8s.io/elastic-job: "true" # Mark as elastic spec: rayVersion: "2.46.0" enableInTreeAutoscaling: true # Turn on the Ray autoscaler autoscalerOptions: idleTimeoutSeconds: 60 # Delete idle workers after 60 s headGroupSpec: serviceType: ClusterIP rayStartParams: dashboard-host: "0.0.0.0" template: spec: containers: - name: ray-head image: rayproject/ray:2.46.0 resources: requests: cpu: "1" memory: "2Gi" limits: cpu: "2" memory: "5Gi" workerGroupSpecs: - groupName: workers replicas: 0 # start with no workers; autoscaler will add them minReplicas: 0 # lower bound maxReplicas: 4 # upper bound rayStartParams: {} template: spec: containers: - name: ray-worker image: rayproject/ray:2.46.0 resources: requests: cpu: "1" memory: "1Gi" limits: cpu: "2" memory: "5Gi" ``` Apply this manifest and verify that Kueue admits the associated Workload: ```bash kubectl apply -f raycluster-kueue-autoscaler.yaml kubectl get workloads.kueue.x-k8s.io -A ``` The `ADMITTED` column should show `True` once the `RayCluster` has been scheduled by Kueue. 
```bash
NAMESPACE   NAME                                           QUEUE    RESERVED IN   ADMITTED   FINISHED   AGE
default     raycluster-raycluster-kueue-autoscaler-21c46   ray-lq   ray-cq        True                  26s
```

(step-2-verify-autoscaling-for-a-raycluster)=
#### Step 2: Verify autoscaling for a RayCluster

To observe autoscaling, create load on the cluster and watch worker Pods appear. The following procedure runs a CPU-bound workload from inside the head Pod and monitors scaling:

1. **Enter the head Pod:**

   ```bash
   HEAD_POD=$(kubectl get pod -l ray.io/node-type=head,ray.io/cluster=raycluster-kueue-autoscaler \
     -o jsonpath='{.items[0].metadata.name}')
   kubectl exec -it "$HEAD_POD" -- bash
   ```

2. **Run a workload:** execute the following Python script inside the head container. It submits 20 tasks that each consume a full CPU for about one minute.

   ```bash
   python << 'EOF'
   import ray, time

   ray.init(address="auto")

   @ray.remote(num_cpus=1)
   def busy():
       end = time.time() + 60
       while time.time() < end:
           x = 0
           for i in range(100_000):
               x += i * i
       return 1

   tasks = [busy.remote() for _ in range(20)]
   print(sum(ray.get(tasks)))
   EOF
   ```

   Because the head Pod has a single CPU, the tasks queue up and the autoscaler raises the worker replicas toward `maxReplicas`.

3. **Monitor worker Pods:** in another terminal, watch the worker Pods scale up and down:

   ```bash
   kubectl get pods -w \
     -l ray.io/cluster=raycluster-kueue-autoscaler,ray.io/node-type=worker
   ```

   New worker Pods should appear as the tasks run and vanish once the workload finishes and the idle timeout elapses.

### Autoscaling with RayService

#### Step 1: Configure an elastic RayService

A `RayService` deploys a Ray Serve application by materializing the `spec.rayClusterConfig` into a managed `RayCluster`. Kueue doesn't interact with the RayService object directly. Instead, the KubeRay operator propagates relevant metadata from the RayService to the managed RayCluster, and **Kueue queues and admits that RayCluster**.

To make a RayService work with Kueue and the Ray autoscaler:

1. **Queue label** – set `metadata.labels.kueue.x-k8s.io/queue-name` on the `RayService`. KubeRay passes service labels to the underlying `RayCluster`, allowing Kueue to queue it.
2. **Elastic-job annotation** – add `metadata.annotations.kueue.x-k8s.io/elastic-job: "true"`. This annotation propagates to the `RayCluster` and instructs Kueue to treat it as an elastic workload.
3. **Enable the Ray autoscaler** – in `spec.rayClusterConfig`, set `enableInTreeAutoscaling: true` and specify worker `minReplicas`/`maxReplicas`.

The following manifest deploys a simple Ray Serve app with autoscaling, using the sample `fruit_app` and `math_app` applications from the Ray `test_dag` example. Adjust the deployments and resources for your own application.
```yaml apiVersion: ray.io/v1 kind: RayService metadata: name: rayservice-kueue-autoscaler namespace: default labels: kueue.x-k8s.io/queue-name: ray-lq # copy to RayCluster annotations: kueue.x-k8s.io/elastic-job: "true" # mark as elastic spec: # A simple Serve config with two deployments serveConfigV2: | applications: - name: fruit_app import_path: fruit.deployment_graph route_prefix: /fruit runtime_env: working_dir: "https://github.com/ray-project/test_dag/archive/78b4a5da38796123d9f9ffff59bab2792a043e95.zip" deployments: - name: MangoStand num_replicas: 2 max_replicas_per_node: 1 user_config: price: 3 ray_actor_options: num_cpus: 0.1 - name: OrangeStand num_replicas: 1 user_config: price: 2 ray_actor_options: num_cpus: 0.1 - name: PearStand num_replicas: 1 user_config: price: 1 ray_actor_options: num_cpus: 0.1 - name: FruitMarket num_replicas: 1 ray_actor_options: num_cpus: 0.1 - name: math_app import_path: conditional_dag.serve_dag route_prefix: /calc runtime_env: working_dir: "https://github.com/ray-project/test_dag/archive/78b4a5da38796123d9f9ffff59bab2792a043e95.zip" deployments: - name: Adder num_replicas: 1 user_config: increment: 3 ray_actor_options: num_cpus: 0.1 - name: Multiplier num_replicas: 1 user_config: factor: 5 ray_actor_options: num_cpus: 0.1 - name: Router num_replicas: 1 rayClusterConfig: rayVersion: "2.46.0" enableInTreeAutoscaling: true autoscalerOptions: idleTimeoutSeconds: 60 headGroupSpec: serviceType: ClusterIP rayStartParams: dashboard-host: "0.0.0.0" template: spec: containers: - name: ray-head image: rayproject/ray:2.46.0 resources: requests: cpu: "1" memory: "2Gi" limits: cpu: "2" memory: "5Gi" workerGroupSpecs: - groupName: workers replicas: 1 # initial workers minReplicas: 1 # lower bound maxReplicas: 5 # upper bound rayStartParams: {} template: spec: containers: - name: ray-worker image: rayproject/ray:2.46.0 resources: requests: cpu: "1" memory: "1Gi" limits: cpu: "2" memory: "5Gi" ``` Apply the manifest and verify that the service's RayCluster is admitted by Kueue: ```bash kubectl apply -f rayservice-kueue-autoscaler.yaml kubectl get workloads.kueue.x-k8s.io -A ``` The `ADMITTED` column should show `True` once the `RayService` has been scheduled by Kueue. ```bash NAMESPACE NAME QUEUE RESERVED IN ADMITTED FINISHED AGE default raycluster-rayservice-kueue-autoscaler-9xvcr-d7add ray-lq ray-cq True 21s ``` #### Step 2: Verify autoscaling for a RayService Autoscaling for a `RayService` is ultimately driven by load on the managed RayCluster. The verification procedure is the same as for a plain `RayCluster`. To verify autoscaling: 1. Follow the steps in [Step 2: Verify autoscaling for a RayCluster](#step-2-verify-autoscaling-for-a-raycluster), but use the RayService name in the label selector. Concretely: - when selecting the head Pod, use (remember to replace your cluster name): ```bash HEAD_POD=$(kubectl get pod \ -l ray.io/node-type=head,ray.io/cluster=rayservice-kueue-autoscaler-9xvcr \ -o jsonpath='{.items[0].metadata.name}') kubectl exec -it "$HEAD_POD" -- bash ``` - inside the head container, run the same CPU-bound Python script used in the RayCluster example. 
  ```bash
  python << 'EOF'
  import ray, time

  ray.init(address="auto")

  @ray.remote(num_cpus=1)
  def busy():
      end = time.time() + 60
      while time.time() < end:
          x = 0
          for i in range(100_000):
              x += i * i
      return 1

  tasks = [busy.remote() for _ in range(20)]
  print(sum(ray.get(tasks)))
  EOF
  ```

- in another terminal, watch the worker Pods with:

  ```bash
  # The ray.io/cluster label carries the managed RayCluster's generated name;
  # replace the suffix with the one from your own cluster.
  kubectl get pods -w \
    -l ray.io/cluster=rayservice-kueue-autoscaler-9xvcr,ray.io/node-type=worker
  ```

As in the RayCluster case, the worker Pods scale up toward `maxReplicas` while the CPU-bound tasks are running and scale back down toward `minReplicas` after the tasks finish and the idle timeout elapses. The only difference is that the `ray.io/cluster` label now matches the name of the RayCluster that the RayService manages (for example, `rayservice-kueue-autoscaler-9xvcr`) instead of the stand-alone `RayCluster` name (`raycluster-kueue-autoscaler`).

### Limitations

* **Feature status** – The `ElasticJobsViaWorkloadSlices` feature gate is currently **alpha**. Elastic autoscaling only applies to RayClusters that are annotated with `kueue.x-k8s.io/elastic-job: "true"` and configured with `enableInTreeAutoscaling: true` when the Ray image version is earlier than 2.47.0.
* **RayJob support** – Autoscaling for `RayJob` isn't yet supported. The Kueue maintainers are actively tracking this work and will update their documentation when it becomes available.
* **Kueue versions prior to v0.13** – If you are using a Kueue version earlier than v0.13, restart the Kueue controller once after installation to ensure RayCluster management works correctly.

---

(kuberay-metrics-references)=
# KubeRay metrics references

## `controller-runtime` metrics

KubeRay exposes metrics provided by [kubernetes-sigs/controller-runtime](https://github.com/kubernetes-sigs/controller-runtime), including information about reconciliation, work queues, and more, to help users operate the KubeRay operator in production environments. For more details about the default metrics provided by [kubernetes-sigs/controller-runtime](https://github.com/kubernetes-sigs/controller-runtime), see [Default Exported Metrics References](https://book.kubebuilder.io/reference/metrics-reference).

## KubeRay custom metrics

Starting with KubeRay 1.4.0, KubeRay provides metrics for its custom resources to help users better understand Ray clusters and Ray applications. You can view these metrics by following the instructions below:

```sh
# Forward a local port to the KubeRay operator service.
kubectl port-forward service/kuberay-operator 8080

# View the metrics.
curl localhost:8080/metrics

# You should see metrics like the following if a RayCluster already exists:
# kuberay_cluster_info{name="raycluster-kuberay",namespace="default",owner_kind="None"} 1
```

### RayCluster metrics

| Metric name | Type | Description | Labels |
|---|---|---|---|
| `kuberay_cluster_info` | Gauge | Metadata information about RayCluster custom resources. | `namespace`: <RayCluster-namespace><br>`name`: <RayCluster-name><br>`owner_kind`: <RayJob\|RayService\|None><br>`uid`: <RayCluster-uid> |
| `kuberay_cluster_condition_provisioned` | Gauge | Indicates whether the RayCluster is provisioned. See [RayClusterProvisioned](https://github.com/ray-project/kuberay/blob/7c6aedff5b4106281f50e87a7e9e177bf1237ec7/ray-operator/apis/ray/v1/raycluster_types.go#L214) for more information. | `namespace`: <RayCluster-namespace><br>`name`: <RayCluster-name><br>`condition`: <true\|false><br>`uid`: <RayCluster-uid> |
| `kuberay_cluster_provisioned_duration_seconds` | Gauge | The time, in seconds, when a RayCluster's `RayClusterProvisioned` status transitions from false (or unset) to true. | `namespace`: <RayCluster-namespace><br>`name`: <RayCluster-name><br>`uid`: <RayCluster-uid> |

### RayService metrics

| Metric name | Type | Description | Labels |
|---|---|---|---|
| `kuberay_service_info` | Gauge | Metadata information about RayService custom resources. | `namespace`: <RayService-namespace><br>`name`: <RayService-name><br>`uid`: <RayService-uid> |
| `kuberay_service_condition_ready` | Gauge | Describes whether the RayService is ready. Ready means users can send requests to the underlying cluster and the number of serve endpoints is greater than 0. See [RayServiceReady](https://github.com/ray-project/kuberay/blob/33ee6724ca2a429c77cb7ff5821ba9a3d63f7c34/ray-operator/apis/ray/v1/rayservice_types.go#L135) for more information. | `namespace`: <RayService-namespace><br>`name`: <RayService-name><br>`uid`: <RayService-uid> |
| `kuberay_service_condition_upgrade_in_progress` | Gauge | Describes whether the RayService is performing a zero-downtime upgrade. See [UpgradeInProgress](https://github.com/ray-project/kuberay/blob/33ee6724ca2a429c77cb7ff5821ba9a3d63f7c34/ray-operator/apis/ray/v1/rayservice_types.go#L137) for more information. | `namespace`: <RayService-namespace><br>`name`: <RayService-name><br>`uid`: <RayService-uid> |

### RayJob metrics

| Metric name | Type | Description | Labels |
|---|---|---|---|
| `kuberay_job_info` | Gauge | Metadata information about RayJob custom resources. | `namespace`: <RayJob-namespace><br>`name`: <RayJob-name><br>`uid`: <RayJob-uid> |
| `kuberay_job_deployment_status` | Gauge | The RayJob's current deployment status. | `namespace`: <RayJob-namespace><br>`name`: <RayJob-name><br>`deployment_status`: <New\|Initializing\|Running\|Complete\|Failed\|Suspending\|Suspended\|Retrying\|Waiting><br>`uid`: <RayJob-uid> |
| `kuberay_job_execution_duration_seconds` | Gauge | Duration of the RayJob CR's JobDeploymentStatus transition from `Initializing` to either the `Retrying` state or a terminal state, such as `Complete` or `Failed`. The `Retrying` state indicates that the CR previously failed and that `spec.backoffLimit` is enabled. | `namespace`: <RayJob-namespace><br>`name`: <RayJob-name><br>`job_deployment_status`: <Complete\|Failed><br>`retry_count`: <count><br>`uid`: <RayJob-uid> |

---

(kuberay-prometheus-grafana)=
# Using Prometheus and Grafana

This section describes how to monitor Ray Clusters in Kubernetes using Prometheus and Grafana.

If you do not have any experience with Prometheus and Grafana on Kubernetes, watch this [YouTube playlist](https://youtube.com/playlist?list=PLy7NrYWoggjxCF3av5JKwyG7FFF9eLeL4).

## Preparation

Clone the [KubeRay repository](https://github.com/ray-project/kuberay) and check out the `master` branch. This tutorial requires several files in the repository.

## Step 1: Create a Kubernetes cluster with Kind

```sh
kind create cluster
```

## Step 2: Install Kubernetes Prometheus Stack via Helm chart

```sh
# Path: kuberay/
./install/prometheus/install.sh --auto-load-dashboard true

# Check the installation
kubectl get all -n prometheus-system

# (part of the output)
# NAME                                                  READY   UP-TO-DATE   AVAILABLE   AGE
# deployment.apps/prometheus-grafana                    1/1     1            1           46s
# deployment.apps/prometheus-kube-prometheus-operator   1/1     1            1           46s
# deployment.apps/prometheus-kube-state-metrics         1/1     1            1           46s
```

* KubeRay provides an [install.sh script](https://github.com/ray-project/kuberay/blob/master/install/prometheus/install.sh) to:
  * Install the [kube-prometheus-stack v48.2.1](https://github.com/prometheus-community/helm-charts/tree/kube-prometheus-stack-48.2.1/charts/kube-prometheus-stack) chart and related custom resources, including **PodMonitor** for Ray Pods and **PrometheusRule**, in the namespace `prometheus-system` automatically.
  * Import Ray Dashboard's [Grafana JSON files](https://github.com/ray-project/kuberay/tree/master/config/grafana) into Grafana using the `--auto-load-dashboard true` flag. If the flag isn't set, the following step also provides instructions for manual import. See [Step 12: Import Grafana dashboards manually (optional)](#step-12-import-grafana-dashboards-manually-optional) for more details.
* We made some modifications to the original `values.yaml` in the kube-prometheus-stack chart to allow embedding Grafana panels in Ray Dashboard. See [overrides.yaml](https://github.com/ray-project/kuberay/tree/master/install/prometheus/overrides.yaml) for more details.

  ```yaml
  grafana:
    grafana.ini:
      security:
        allow_embedding: true
      auth.anonymous:
        enabled: true
        org_role: Viewer
  ```

## Step 3: Install a KubeRay operator

* Follow [this document](kuberay-operator-deploy) to install the latest stable KubeRay operator via Helm repository.
* Set `metrics.serviceMonitor.enabled=true` when installing the KubeRay operator with Helm to create a ServiceMonitor that scrapes metrics exposed by the KubeRay operator's service.

```sh
# Enable the ServiceMonitor and set the label `release: prometheus` to the ServiceMonitor so that Prometheus can discover it
helm install kuberay-operator kuberay/kuberay-operator --version 1.5.1 \
  --set metrics.serviceMonitor.enabled=true \
  --set metrics.serviceMonitor.selector.release=prometheus
```

You can verify the ServiceMonitor creation with:

```sh
kubectl get servicemonitor
# NAME               AGE
# kuberay-operator   11s
```

## Step 4: Install a RayCluster

```sh
# path: ray-operator/config/samples/
kubectl apply -f ray-cluster.embed-grafana.yaml

# Check there's a Service that specifies port 8080 for the metrics endpoint.
# There may be a slight delay between RayCluster and Service creation.
kubectl get service -l ray.io/cluster=raycluster-embed-grafana # NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE # raycluster-embed-grafana-head-svc ClusterIP None 44217/TCP,10001/TCP,44227/TCP,8265/TCP,6379/TCP,8080/TCP 13m # Wait until all Ray Pods are ready. kubectl wait pods -l ray.io/cluster=raycluster-embed-grafana --timeout 2m --for condition=Ready # pod/raycluster-embed-grafana-head-2jk7c condition met # pod/raycluster-embed-grafana-small-group-worker-8g2vv condition met # Forward the port of the Prometheus metrics endpoint. kubectl port-forward service/raycluster-embed-grafana-head-svc metrics # Check metrics in a new terminal. curl localhost:8080 # Example output (Prometheus metrics format): # # HELP ray_spill_manager_request_total Number of {spill, restore} requests. # # TYPE ray_spill_manager_request_total gauge # ray_spill_manager_request_total{Component="raylet", NodeAddress="10.244.0.13", SessionName="session_2025-01-02_07-58-21_419367_11", Type="FailedDeletion", Version="2.9.0", container="ray-head", endpoint="metrics", instance="10.244.0.13:8080", job="prometheus-system/ray-head-monitor", namespace="default", pod="raycluster-embed-grafana-head-98fqt", ray_io_cluster="raycluster-embed-grafana"} 0 ``` * KubeRay exposes a Prometheus metrics endpoint in port **8080** via a built-in exporter by default. Hence, we do not need to install any external exporter. * If you want to configure the metrics endpoint to a different port, see [kuberay/#954](https://github.com/ray-project/kuberay/pull/954) for more details. * Prometheus metrics format: * `# HELP`: Describe the meaning of this metric. * `# TYPE`: See [this document](https://prometheus.io/docs/concepts/metric_types/) for more details. * Three required environment variables are defined in [ray-cluster.embed-grafana.yaml](https://github.com/ray-project/kuberay/blob/v1.5.1/ray-operator/config/samples/ray-cluster.embed-grafana.yaml). See [Configuring and Managing Ray Dashboard](https://docs.ray.io/en/latest/cluster/configure-manage-dashboard.html) for more details about these environment variables. ```yaml env: - name: RAY_GRAFANA_IFRAME_HOST value: http://127.0.0.1:3000 - name: RAY_GRAFANA_HOST value: http://prometheus-grafana.prometheus-system.svc:80 - name: RAY_PROMETHEUS_HOST value: http://prometheus-kube-prometheus-prometheus.prometheus-system.svc:9090 ``` * Note that we do not deploy Grafana in the head Pod, so we need to set both `RAY_GRAFANA_IFRAME_HOST` and `RAY_GRAFANA_HOST`. `RAY_GRAFANA_HOST` is used by the head Pod to send health-check requests to Grafana in the backend. `RAY_GRAFANA_IFRAME_HOST` is used by your browser to fetch the Grafana panels from the Grafana server rather than from the head Pod. Because we forward the port of Grafana to `127.0.0.1:3000` in this example, we set `RAY_GRAFANA_IFRAME_HOST` to `http://127.0.0.1:3000`. * `http://` is required. ## Step 5: Collect Head Node metrics with a PodMonitor RayService creates two Kubernetes services for the head Pod; one managed by the RayService and the other by the underlying RayCluster. Therefore, it's recommended to use a PodMonitor to monitor the metrics for head Pods to avoid misconfigurations that could result in double counting the same metrics when using a ServiceMonitor. ```yaml apiVersion: monitoring.coreos.com/v1 kind: PodMonitor metadata: labels: # `release: $HELM_RELEASE`: Prometheus can only detect PodMonitor with this label. 
release: prometheus name: ray-head-monitor namespace: prometheus-system spec: jobLabel: ray-head # Only select Kubernetes Pods in the "default" namespace. namespaceSelector: matchNames: - default # Only select Kubernetes Pods with "matchLabels". selector: matchLabels: ray.io/node-type: head # A list of endpoints allowed as part of this PodMonitor. podMetricsEndpoints: - port: metrics relabelings: - action: replace sourceLabels: - __meta_kubernetes_pod_label_ray_io_cluster targetLabel: ray_io_cluster - port: as-metrics # autoscaler metrics relabelings: - action: replace sourceLabels: - __meta_kubernetes_pod_label_ray_io_cluster targetLabel: ray_io_cluster - port: dash-metrics # dashboard metrics relabelings: - action: replace sourceLabels: - __meta_kubernetes_pod_label_ray_io_cluster targetLabel: ray_io_cluster ``` * The **install.sh** script creates the above YAML example, [podMonitor.yaml](https://github.com/ray-project/kuberay/blob/master/config/prometheus/podMonitor.yaml#L26-L63) so you don't need to create anything. * See the official [PodMonitor doc](https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/api-reference/api.md#monitoring.coreos.com/v1.PodMonitor) for more details about configurations. * `release: $HELM_RELEASE`: Prometheus can only detect PodMonitor with this label. See [here](#prometheus-can-only-detect-this-label) for more details. (prometheus-can-only-detect-this-label)= ```sh helm ls -n prometheus-system # ($HELM_RELEASE is "prometheus".) # NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION # prometheus prometheus-system 1 2023-02-06 06:27:05.530950815 +0000 UTC deployed kube-prometheus-stack-44.3.1 v0.62.0 kubectl get prometheuses.monitoring.coreos.com -n prometheus-system -oyaml # podMonitorSelector: # matchLabels: # release: prometheus # ruleSelector: # matchLabels: # release: prometheus ``` * Prometheus uses `namespaceSelector` and `selector` to select Kubernetes Pods. ```sh kubectl get pod -n default -l ray.io/node-type=head # NAME READY STATUS RESTARTS AGE # raycluster-embed-grafana-head-khfs4 1/1 Running 0 4m38s ``` * `relabelings`: This configuration renames the label `__meta_kubernetes_pod_label_ray_io_cluster` to `ray_io_cluster` in the scraped metrics. It ensures that each metric includes the name of the RayCluster to which the Pod belongs. This configuration is especially useful for distinguishing metrics when deploying multiple RayClusters. For example, a metric with the `ray_io_cluster` label might look like this: ``` ray_node_cpu_count{SessionName="session_2025-01-02_07-58-21_419367_11", container="ray-head", endpoint="metrics", instance="10.244.0.13:8080", ip="10.244.0.13", job="raycluster-embed-grafana-head-svc", namespace="default", pod="raycluster-embed-grafana-head-98fqt", ray_io_cluster="raycluster-embed-grafana", service="raycluster-embed-grafana-head-svc"} ``` In this example, `raycluster-embed-grafana` is the name of the RayCluster. ## Step 6: Collect Worker Node metrics with PodMonitors Similar to the head Pod, this tutorial also uses a PodMonitor to collect metrics from worker Pods. The reason for using separate PodMonitors for head Pods and worker Pods is that the head Pod exposes multiple metric endpoints, whereas a worker Pod exposes only one. 
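Before configuring the worker PodMonitor, you can optionally confirm that a worker Pod serves its single metrics endpoint on port 8080. This is a sketch; it assumes the `raycluster-embed-grafana` sample cluster from Step 4 is running in the `default` namespace.

```sh
# Pick one worker Pod from the sample cluster and query its metrics port directly.
WORKER_POD=$(kubectl get pod -n default -l ray.io/node-type=worker,ray.io/cluster=raycluster-embed-grafana \
  -o jsonpath='{.items[0].metadata.name}')
kubectl port-forward -n default "$WORKER_POD" 8080:8080 &
curl -s localhost:8080 | head
```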
**Note**: You could create a Kubernetes Service that selects a common subset of labels on the worker Pods. However, this configuration isn't ideal because the workers are independent of each other; they aren't a collection of replicas spawned by a ReplicaSet controller. For this reason, avoid using a Kubernetes Service to group them together.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: ray-workers-monitor
  namespace: prometheus-system
  labels:
    # `release: $HELM_RELEASE`: Prometheus can only detect PodMonitor with this label.
    release: prometheus
spec:
  jobLabel: ray-workers
  # Only select Kubernetes Pods in the "default" namespace.
  namespaceSelector:
    matchNames:
      - default
  # Only select Kubernetes Pods with "matchLabels".
  selector:
    matchLabels:
      ray.io/node-type: worker
  # A list of endpoints allowed as part of this PodMonitor.
  podMetricsEndpoints:
    - port: metrics
      relabelings:
        - sourceLabels: [__meta_kubernetes_pod_label_ray_io_cluster]
          targetLabel: ray_io_cluster
```

* Prometheus uses the PodMonitor's `namespaceSelector` and `selector` to select Kubernetes Pods.

```sh
kubectl get pod -n default -l ray.io/node-type=worker
# NAME                                          READY   STATUS    RESTARTS   AGE
# raycluster-kuberay-worker-workergroup-5stpm   1/1     Running   0          3h16m
```

## Step 7: Scrape KubeRay metrics with ServiceMonitor

* See the official [ServiceMonitor doc](https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/api-reference/api.md#servicemonitor) for more details about configurations.
* The KubeRay operator provides metrics for RayCluster, RayService, and RayJob. See {ref}`kuberay-metrics-references` for more details.
* Prometheus uses `namespaceSelector` and `selector` to select the Kubernetes Service.

```sh
kubectl get service -n default -l app.kubernetes.io/name=kuberay-operator
NAME               TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
kuberay-operator   ClusterIP   10.96.205.229   <none>        8080/TCP   53m
```

## Step 8: Collect custom metrics with recording rules

[Recording Rules](https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/) allow KubeRay to precompute frequently needed or computationally expensive [PromQL](https://prometheus.io/docs/prometheus/latest/querying/basics/) expressions and save their result as custom metrics. Note that this behavior is different from [Custom application-level metrics](application-level-metrics), which are for the visibility of Ray applications.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ray-cluster-gcs-rules
  namespace: prometheus-system
  labels:
    # `release: $HELM_RELEASE`: Prometheus can only detect Recording Rules with this label.
    release: prometheus
spec:
  groups:
    - # Rules within a group are run periodically with the same evaluation interval (30s in this example).
      name: ray-cluster-main-staging-gcs.rules
      # How often rules in the group are evaluated.
      interval: 30s
      rules:
        - # The name of the custom metric.
          # Also see best practices for naming metrics created by recording rules:
          # https://prometheus.io/docs/practices/rules/#recording-rules
          record: ray_gcs_availability_30d
          # PromQL expression.
          expr: |
            (
              100 * (
                sum(rate(ray_gcs_update_resource_usage_time_bucket{container="ray-head", le="20.0"}[30d]))
                /
                sum(rate(ray_gcs_update_resource_usage_time_count{container="ray-head"}[30d]))
              )
            )
```

* The PromQL expression above computes:

  $$\frac{\text{number of update resource usage RPCs with RTT smaller than 20 ms in the last 30 days}}{\text{total number of update resource usage RPCs in the last 30 days}} \times 100$$

* The recording rule above is one of the rules defined in [prometheusRules.yaml](https://github.com/ray-project/kuberay/blob/master/config/prometheus/rules/prometheusRules.yaml), and **install.sh** creates it. Hence, you don't need to create anything here.
* See the official [PrometheusRule document](https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/api-reference/api.md#monitoring.coreos.com/v1.PrometheusRule) for more details about configurations.
* `release: $HELM_RELEASE`: Prometheus can only detect PrometheusRule with this label. See [here](#prometheus-can-only-detect-this-label) for more details.
* PrometheusRule can be reloaded at runtime. Use `kubectl apply {modified prometheusRules.yaml}` to reconfigure the rules if needed.

## Step 9: Define Alert Conditions with alerting rules (optional)

[Alerting rules](https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/) allow us to define alert conditions based on [PromQL](https://prometheus.io/docs/prometheus/latest/querying/basics/) expressions and to send notifications about firing alerts to [Alertmanager](https://prometheus.io/docs/alerting/latest/alertmanager), which adds summarization, notification rate limiting, silencing, and alert dependencies on top of the simple alert definitions.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ray-cluster-gcs-rules
  namespace: prometheus-system
  labels:
    # `release: $HELM_RELEASE`: Prometheus can only detect Alerting Rules with this label.
    release: prometheus
spec:
  groups:
    - name: ray-cluster-main-staging-gcs.rules
      # How often rules in the group are evaluated.
      interval: 30s
      rules:
        - alert: MissingMetricRayGlobalControlStore
          # A set of informational labels. Annotations can be used to store longer additional information compared to rules.0.labels.
          annotations:
            description: Ray GCS is not emitting any metrics for Resource Update requests
            summary: Ray GCS is not emitting metrics anymore
          # PromQL expression.
          expr: |
            (
              absent(ray_gcs_update_resource_usage_time_bucket) == 1
            )
          # Time that Prometheus waits, checking whether the alert continues to be active during each evaluation, before firing the alert.
          # If the value is too small, firing alerts may be due to false positives or noise.
          # On the other hand, if the value is too big, the alerts may not be handled in time.
          for: 5m
          # A set of additional labels to attach to the alert.
          # It's possible to overwrite the labels in metadata.labels, so make sure one of the labels matches the label in ruleSelector.matchLabels.
          labels:
            severity: critical
```

* The PromQL expression above checks whether no time series exists for the `ray_gcs_update_resource_usage_time_bucket` metric. See [absent()](https://prometheus.io/docs/prometheus/latest/querying/functions/#absent) for more detail.
* The alerting rule above is one of the rules defined in [prometheusRules.yaml](https://github.com/ray-project/kuberay/blob/master/config/prometheus/rules/prometheusRules.yaml), and **install.sh** creates it. Hence, you don't need to create anything here.
* Alerting rules are configured in the same way as recording rules. ## Step 10: Access Prometheus Web UI ```sh # Forward the port of Prometheus Web UI in the Prometheus server Pod. kubectl port-forward -n prometheus-system service/prometheus-kube-prometheus-prometheus http-web ``` - Go to `${YOUR_IP}:9090/targets` (e.g. `127.0.0.1:9090/targets`). You should be able to see: - `podMonitor/prometheus-system/ray-workers-monitor/0 (1/1 up)` - `serviceMonitor/prometheus-system/ray-head-monitor/0 (1/1 up)` ![Prometheus Web UI](../images/prometheus_web_ui.png) - Go to `${YOUR_IP}:9090/graph`. You should be able to query: - [System Metrics](https://docs.ray.io/en/latest/ray-observability/ray-metrics.html#system-metrics) - [Application Level Metrics](https://docs.ray.io/en/latest/ray-observability/ray-metrics.html#application-level-metrics) - Custom Metrics defined in Recording Rules (e.g. `ray_gcs_availability_30d`) - Go to `${YOUR_IP}:9090/alerts`. You should be able to see: - Alerting Rules (e.g. `MissingMetricRayGlobalControlStore`). ## Step 11: Access Grafana ```sh # Forward the Grafana port kubectl port-forward -n prometheus-system service/prometheus-grafana 3000:http-web # Note: You need to update `RAY_GRAFANA_IFRAME_HOST` if you expose Grafana to a different port. # Check ${YOUR_IP}:3000/login for the Grafana login page (e.g. 127.0.0.1:3000/login). # The default username is "admin" and the password is "prom-operator". ``` > Note: `kubectl port-forward` is not recommended for production use. Refer to [this Grafana document](https://grafana.com/tutorials/run-grafana-behind-a-proxy/) for exposing Grafana behind a reverse proxy. * The default password is defined by `grafana.adminPassword` in the [values.yaml](https://github.com/prometheus-community/helm-charts/blob/main/charts/kube-prometheus-stack/values.yaml) of the kube-prometheus-stack chart. ## Step 12: Import Grafana dashboards manually (optional) If `--auto-load-dashboard true` is set when running `install.sh`, you can skip this step. * Import Grafana dashboards manually * Click "Dashboards" icon in the left panel. * Click "New". * Click "Import". * Click "Upload JSON file". * Choose a JSON file. * Case 1: If you are using Ray 2.41.0, you can use [the sample config files in GitHub repository](https://github.com/ray-project/kuberay/tree/master/config/grafana). The file names have a pattern of `xxx_grafana_dashboard.json`. * Case 2: Otherwise, import the JSON files from the head Pod's `/tmp/ray/session_latest/metrics/grafana/dashboards/` directory. You can use `kubectl cp` to copy the files from the head Pod to your local machine. `kubectl cp $(kubectl get pods --selector ray.io/node-type=head,ray.io/cluster=raycluster-embed-grafana -o jsonpath={..metadata.name}):/tmp/ray/session_latest/metrics/grafana/dashboards/ /tmp/` * Click "Import". ## Step 13: View metrics from different RayCluster CRs Once the Ray Dashboard is imported into Grafana, you can filter metrics by using the `Cluster` variable. Ray Dashboard automatically applies this variable by default when you use the provided `PodMonitor` configuration. You don't need any additional setup for this labeling. If you have multiple RayCluster custom resources, the `Cluster` variable allows you to filter metrics specific to a particular cluster. This feature ensures that you can easily monitor or debug individual RayCluster instances without being overwhelmed by the data from all clusters. 
For example, in the following figures, one selects the metrics from the RayCluster `raycluster-embed-grafana`, and the other selects metrics from the RayCluster `raycluster-embed-grafana-2`.

![Grafana Ray Dashboard](../images/grafana_ray_dashboard.png)
![Grafana Ray Dashboard2](../images/grafana_ray_dashboard2.png)

## Step 14: View the KubeRay operator dashboard

After importing the KubeRay operator dashboard into Grafana, you can monitor metrics from the KubeRay operator. The dashboard includes a dropdown menu that lets you filter and view controller runtime metrics for specific Ray custom resources (CRs): `RayCluster`, `RayJob`, and `RayService`. The KubeRay operator dashboard should look like this:

![Grafana KubeRay operator Controller Runtime dashboard](../images/kuberay-dashboard-controller-runtime.png)

## Step 15: Embed Grafana panels in the Ray dashboard (optional)

```sh
kubectl port-forward service/raycluster-embed-grafana-head-svc dashboard
# Visit http://127.0.0.1:8265/#/metrics in your browser.
```

![Ray Dashboard with Grafana panels](../images/ray_dashboard_embed_grafana.png)

---

(kuberay-pyspy-integration)=
# Profiling with py-spy

## Stack trace and CPU profiling

[py-spy](https://github.com/benfred/py-spy/tree/master) is a sampling profiler for Python programs. It lets you visualize what your Python program is spending time on without restarting the program or modifying the code in any way. This section describes how to configure the RayCluster YAML file to enable py-spy and view the Stack Trace and CPU Flame Graph on the Ray Dashboard.

## Prerequisite

py-spy requires the `SYS_PTRACE` capability to read process memory. However, Kubernetes omits this capability by default. To enable profiling, add the following to `template.spec.containers` for both the head and worker Pods.

```yaml
securityContext:
  capabilities:
    add:
      - SYS_PTRACE
```

**Notes:**

- The `baseline` and `restricted` Pod Security Standards forbid adding `SYS_PTRACE`. See [Pod Security Standards](https://kubernetes.io/docs/concepts/security/pod-security-standards/) for more details.

## Check CPU flame graph and stack trace on Ray Dashboard

### Step 1: Create a Kind cluster

```bash
kind create cluster
```

### Step 2: Install the KubeRay operator

Follow [this document](kuberay-operator-deploy) to install the latest stable KubeRay operator using Helm repository.

### Step 3: Create a RayCluster with the `SYS_PTRACE` capability

```bash
kubectl apply -f https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/ray-cluster.py-spy.yaml
```

### Step 4: Forward the dashboard port

```bash
kubectl port-forward svc/raycluster-py-spy-head-svc 8265:8265
```

### Step 5: Run a sample job within the head Pod

```bash
# Log in to the head Pod
kubectl exec -it ${YOUR_HEAD_POD} -- bash

# (Head Pod) Run a sample job in the Pod
# `long_running_task` includes a `while True` loop to ensure the task remains actively running indefinitely.
# This allows you ample time to view the Stack Trace and CPU Flame Graph via Ray Dashboard.
python3 samples/long_running_task.py
```

**Notes:**

- If you're running your own examples and encounter the error `Failed to write flamegraph: I/O error: No stack counts found` when viewing the CPU Flame Graph, it might be due to the process being idle. Notably, using the `sleep` function can lead to this state. In such situations, py-spy filters out the idle stack traces. Refer to this [issue](https://github.com/benfred/py-spy/issues/321#issuecomment-731848950) for more information.
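If you prefer the command line over the Dashboard, you can also invoke py-spy directly inside the head Pod. The following is a sketch; it assumes the `py-spy` binary is available in the Ray image and that `${YOUR_HEAD_POD}` is the name of your head Pod.

```bash
# Find the PID of the sample task inside the head Pod and dump its current stack.
kubectl exec -it ${YOUR_HEAD_POD} -- bash -c \
  'PID=$(pgrep -f long_running_task | head -n 1) && py-spy dump --pid "$PID"'
```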
### Step 6: Profile using Ray Dashboard - Visit http://localhost:8265/#/cluster. - Click `Stack Trace` for `ray::long_running_task`. ![StackTrace](../images/stack_trace.png) - Click `CPU Flame Graph` for `ray::long_running_task`. ![FlameGraph](../images/cpu_flame_graph.png) - For additional details on using the profiler, See [Python CPU profiling in the Dashboard](https://docs.ray.io/en/latest/ray-observability/user-guides/debug-apps/optimize-performance.html#python-cpu-profiling-in-the-dashboard). ### Step 7: Clean up ```bash kubectl delete -f https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/ray-cluster.py-spy.yaml helm uninstall kuberay-operator ``` --- (kuberay-scheduler-plugins)= # KubeRay integration with scheduler plugins The [kubernetes-sigs/scheduler-plugins](https://github.com/kubernetes-sigs/scheduler-plugins) repository provides out-of-tree scheduler plugins based on the scheduler framework. Starting with KubeRay v1.4.0, KubeRay integrates with the [PodGroup API](https://github.com/kubernetes-sigs/scheduler-plugins/blob/93126eabdf526010bf697d5963d849eab7e8e898/site/content/en/docs/plugins/coscheduling.md) provided by scheduler plugins to support gang scheduling for RayCluster custom resources. ## Step 1: Create a Kubernetes cluster with Kind ```sh kind create cluster --image=kindest/node:v1.26.0 ``` ## Step 2: Install scheduler plugins Follow the [installation guide](https://scheduler-plugins.sigs.k8s.io/docs/user-guide/installation/) in the scheduler-plugins repository to install the scheduler plugins. :::{note} There are two modes for installing the scheduler plugins: *single scheduler mode* and *second scheduler mode*. KubeRay v1.4.0 only supports the *single scheduler mode*. You need to have the access to configure Kubernetes control plane to replace the default scheduler with the scheduler plugins. ::: ## Step 3: Install KubeRay operator with scheduler plugins enabled KubeRay v1.4.0 and later versions support scheduler plugins. ```sh helm install kuberay-operator kuberay/kuberay-operator --version 1.5.1 --set batchScheduler.name=scheduler-plugins ``` ## Step 4: Deploy a RayCluster with gang scheduling ```sh # Configure the RayCluster with label `ray.io/gang-scheduling-enabled: "true"` # to enable gang scheduling. kubectl apply -f https://raw.githubusercontent.com/ray-project/kuberay/release-1.4/ray-operator/config/samples/ray-cluster.scheduler-plugins.yaml ``` ## Step 5: Verify Ray Pods and PodGroup Note that if you use "second scheduler mode," which KubeRay currently doesn't support, the following commands still show similar results. However, the Ray Pods don't get scheduled in a gang scheduling manner. Make sure to use "single scheduler mode" to enable gang scheduling. ```sh kubectl get podgroups.scheduling.x-k8s.io # NAME PHASE MINMEMBER RUNNING SUCCEEDED FAILED AGE # test-podgroup-0 Running 3 3 2m25s # All Ray Pods (1 head and 2 workers) belong to the same PodGroup. kubectl get pods -L scheduling.x-k8s.io/pod-group # NAME READY STATUS RESTARTS AGE POD-GROUP # test-podgroup-0-head 1/1 Running 0 3m30s test-podgroup-0 # test-podgroup-0-worker-worker-4vc6j 1/1 Running 0 3m30s test-podgroup-0 # test-podgroup-0-worker-worker-ntm9f 1/1 Running 0 3m30s test-podgroup-0 ``` --- (kuberay-volcano)= # KubeRay integration with Volcano [Volcano](https://github.com/volcano-sh/volcano) is a batch scheduling system built on Kubernetes, providing gang scheduling, job queues, fair scheduling policies, and network topology-aware scheduling. 
KubeRay integrates natively with Volcano for RayCluster, RayJob, and RayService, enabling more efficient scheduling of Ray pods in multi-tenant Kubernetes environments. This guide covers [setup instructions](#setup), [configuration options](#step-4-install-a-raycluster-with-the-volcano-scheduler), and [examples](#example) demonstrating gang scheduling for both RayCluster and RayJob. ## Setup ### Step 1: Create a Kubernetes cluster with KinD Run the following command in a terminal: ```shell kind create cluster ``` ### Step 2: Install Volcano You need to successfully install Volcano on your Kubernetes cluster before enabling Volcano integration with KubeRay. See [Quick Start Guide](https://github.com/volcano-sh/volcano#quick-start-guide) for Volcano installation instructions. ### Step 3: Install the KubeRay Operator with batch scheduling Deploy the KubeRay Operator with the `--batch-scheduler=volcano` flag to enable Volcano batch scheduling support. When installing KubeRay Operator using Helm, you should use one of these two options: * Set `batchScheduler.name` to `volcano` in your [`values.yaml`](https://github.com/ray-project/kuberay/blob/753dc05dbed5f6fe61db3a43b34a1b350f26324c/helm-chart/kuberay-operator/values.yaml#L48) file: ```shell # values.yaml file batchScheduler: name: volcano ``` * Pass the `--set batchScheduler.name=volcano` flag when running on the command line: ```shell # Install the Helm chart with the --batch-scheduler=volcano flag helm install kuberay-operator kuberay/kuberay-operator --version 1.5.1 --set batchScheduler.name=volcano ``` ### Step 4: Install a RayCluster with the Volcano scheduler ```shell # Path: kuberay/ray-operator/config/samples curl -LO https://raw.githubusercontent.com/ray-project/kuberay/v1.5.1/ray-operator/config/samples/ray-cluster.volcano-scheduler.yaml kubectl apply -f ray-cluster.volcano-scheduler.yaml # Check the RayCluster kubectl get pod -l ray.io/cluster=test-cluster-0 # NAME READY STATUS RESTARTS AGE # test-cluster-0-head-jj9bg 1/1 Running 0 36s ``` You can also provide the following labels in the RayCluster, RayJob and RayService metadata: - `ray.io/priority-class-name`: The cluster priority class as defined by [Kubernetes](https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/#priorityclass) - This label only works after you create a `PriorityClass` resource - ```shell labels: ray.io/priority-class-name: ``` - `volcano.sh/queue-name`: The Volcano [queue](https://volcano.sh/en/docs/queue/) name the cluster submits to. - This label only works after you create a `Queue` resource - ```shell labels: volcano.sh/queue-name: ``` - `volcano.sh/network-topology-mode`: Enables [network topology-aware scheduling](https://volcano.sh/en/docs/network_topology_aware_scheduling/) to optimize pod placement based on network proximity, reducing inter-node communication latency for distributed workloads. Valid values are `soft` (best-effort) or `hard` (strict enforcement). - This label only works after you create a `HyperNode` resource - ```shell labels: volcano.sh/network-topology-mode: "soft" # or "hard" ``` - `volcano.sh/network-topology-highest-tier-allowed`: Specifies the highest network topology tier for pod placement, restricting pods to be scheduled within the specified tier boundary. The value must match a tier defined in your `HyperNode` resource. Must be used together with `volcano.sh/network-topology-mode`. 
- This label only works after you create a `HyperNode` resource - ```shell labels: volcano.sh/network-topology-highest-tier-allowed: ``` **Note**: - Starting from KubeRay v1.3.0, you **no** longer need to add the `ray.io/scheduler-name: volcano` label to your RayCluster/RayJob. The batch scheduler is now configured at the operator level using the `--batch-scheduler=volcano` flag. - When autoscaling is enabled, KubeRay uses `minReplicas` to calculate the minimum resources required for gang scheduling. Otherwise, it uses the `desired` replicas value. ### Step 5: Use Volcano for batch scheduling For guidance, see [examples](https://github.com/volcano-sh/volcano/tree/master/example). ## Example Before going through the example, remove any running Ray Clusters to ensure a successful run through of the example below. ```shell kubectl delete raycluster --all ``` ### Gang scheduling This example walks through how gang scheduling works with Volcano and KubeRay. First, create a queue with a capacity of 4 CPUs and 6Gi of RAM: ```shell kubectl create -f - < API Version: ray.io/v1 Kind: RayCluster Metadata: Creation Timestamp: 2024-09-29T09:52:30Z Generation: 1 Resource Version: 951 UID: cae1dbc9-5a67-4b43-b0d9-be595f21ab85 # Other fields are skipped for brevity ```` Note the labels on the RayCluster: `ray.io/gang-scheduling-enabled=true`, `yunikorn.apache.org/app-id=test-yunikorn-0`, and `yunikorn.apache.org/queue=root.test`. :::{note} You only need the `ray.io/gang-scheduling-enabled` label when you require gang scheduling. If you don't set this label, YuniKorn schedules the Ray cluster without enforcing gang scheduling. ::: Because the queue has a capacity of 4 CPU and 6GiB of RAM, this resource should schedule successfully without any issues. ```shell $ kubectl get pods NAME READY STATUS RESTARTS AGE test-yunikorn-0-head-98fmp 1/1 Running 0 67s test-yunikorn-0-worker-worker-42tgg 1/1 Running 0 67s test-yunikorn-0-worker-worker-467mn 1/1 Running 0 67s ``` Verify the scheduling by checking the [Apache YuniKorn dashboard](https://yunikorn.apache.org/docs/#access-the-web-ui). ```shell kubectl port-forward svc/yunikorn-service 9889:9889 -n yunikorn ``` Go to `http://localhost:9889/#/applications` to see the running apps. ![Apache YuniKorn dashboard](../images/yunikorn-dashboard-apps-running.png) Next, add an additional RayCluster with the same configuration of head and worker nodes, but with a different name: ```shell # Replace the name with `test-yunikorn-1`. sed 's/test-yunikorn-0/test-yunikorn-1/' ray-cluster.yunikorn-scheduler.yaml | kubectl apply -f- ``` Now all the Pods for `test-yunikorn-1` are in the `Pending` state: ```shell $ kubectl get pods NAME READY STATUS RESTARTS AGE test-yunikorn-0-head-98fmp 1/1 Running 0 4m22s test-yunikorn-0-worker-worker-42tgg 1/1 Running 0 4m22s test-yunikorn-0-worker-worker-467mn 1/1 Running 0 4m22s test-yunikorn-1-head-xl2r5 0/1 Pending 0 71s test-yunikorn-1-worker-worker-l6ttz 0/1 Pending 0 71s test-yunikorn-1-worker-worker-vjsts 0/1 Pending 0 71s tg-test-yunikorn-1-headgroup-vgzvoot0dh 0/1 Pending 0 69s tg-test-yunikorn-1-worker-eyti2bn2jv 1/1 Running 0 69s tg-test-yunikorn-1-worker-k8it0x6s73 0/1 Pending 0 69s ``` Apache YuniKorn creates the Pods with the `tg-` prefix for gang scheduling purpose. 
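You can list just these placeholder Pods by filtering on the name prefix. A minimal sketch:

```shell
# List the gang-scheduling placeholder Pods that Apache YuniKorn creates.
kubectl get pods --no-headers | grep '^tg-'
```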
Go to `http://localhost:9889/#/applications` to see `test-yunikorn-1` in the `Accepted` state but not running yet:

![Apache YuniKorn dashboard](../images/yunikorn-dashboard-apps-pending.png)

Because the new cluster requires more CPU and RAM than the queue allows, even though one of the Pods would fit in the remaining 1 CPU and 2GiB of RAM, Apache YuniKorn doesn't place the cluster's Pods until there's enough room for all of the Pods. Without using Apache YuniKorn for gang scheduling in this way, KubeRay would place one of the Pods and only partially allocate the cluster.

Delete the first RayCluster to free up resources in the queue:

```shell
kubectl delete raycluster test-yunikorn-0
```

Now all the Pods for the second cluster change to the `Running` state, because enough resources are available to schedule the entire set of Pods. Check the Pods again to see that the second cluster is up and running:

```shell
$ kubectl get pods

NAME                                  READY   STATUS    RESTARTS   AGE
test-yunikorn-1-head-xl2r5            1/1     Running   0          3m34s
test-yunikorn-1-worker-worker-l6ttz   1/1     Running   0          3m34s
test-yunikorn-1-worker-worker-vjsts   1/1     Running   0          3m34s
```

Clean up the resources:

```shell
kubectl delete raycluster test-yunikorn-1
```

---

(kuberay-ecosystem-integration)=
# KubeRay Ecosystem

```{toctree}
:hidden:

k8s-ecosystem/ingress
k8s-ecosystem/metrics-references
k8s-ecosystem/prometheus-grafana
k8s-ecosystem/pyspy
k8s-ecosystem/kai-scheduler
k8s-ecosystem/volcano
k8s-ecosystem/yunikorn
k8s-ecosystem/kueue
k8s-ecosystem/istio
k8s-ecosystem/scheduler-plugins
```

* {ref}`kuberay-ingress`
* {ref}`kuberay-metrics-references`
* {ref}`kuberay-prometheus-grafana`
* {ref}`kuberay-pyspy-integration`
* {ref}`kuberay-kai-scheduler`
* {ref}`kuberay-volcano`
* {ref}`kuberay-yunikorn`
* {ref}`kuberay-kueue`
* {ref}`kuberay-istio`
* {ref}`kuberay-scheduler-plugins`

---

(kuberay-api-reference)=
# API Reference

To learn about RayCluster configuration, we recommend taking a look at the {ref}`configuration guide `. For comprehensive coverage of all supported RayCluster fields, refer to the [API reference][APIReference].

## KubeRay API compatibility and guarantees

v1 APIs in the KubeRay project are stable and suitable for production environments. Fields in the v1 APIs will never be removed to maintain compatibility. Future major versions of the API (i.e. v2) may have breaking changes and fields removed from v1.

However, KubeRay maintainers reserve the right to mark fields as deprecated and remove functionality associated with deprecated fields after a minimum of two minor releases. In addition, some definitions of the API may see small changes in behavior. For example, the definition of a "ready" or "unhealthy" RayCluster could change to better handle new failure scenarios.

[APIReference]: https://ray-project.github.io/kuberay/reference/api/

---

(kuberay-raysvc-troubleshoot)=
# RayService troubleshooting

RayService is a Custom Resource Definition (CRD) designed for Ray Serve. In KubeRay, creating a RayService first creates a RayCluster and then creates Ray Serve applications once the RayCluster is ready. If the issue pertains to the data plane, specifically your Ray Serve scripts or Ray Serve configurations (`serveConfigV2`), troubleshooting may be challenging. This section provides some tips to help you debug these issues.
## Observability ### Method 1: Check KubeRay operator's logs for errors ```bash kubectl logs $KUBERAY_OPERATOR_POD -n $YOUR_NAMESPACE | tee operator-log ``` The above command will redirect the operator's logs to a file called `operator-log`. You can then search for errors in the file. ### Method 2: Check RayService CR status ```bash kubectl describe rayservice $RAYSERVICE_NAME -n $YOUR_NAMESPACE ``` You can check the status and events of the RayService CR to see if there are any errors. ### Method 3: Check logs of Ray Pods You can also check the Ray Serve logs directly by accessing the log files on the pods. These log files contain system level logs from the Serve controller and HTTP proxy as well as access logs and user-level logs. See [Ray Serve Logging](serve-logging) and [Ray Logging](configure-logging) for more details. ```bash kubectl exec -it $RAY_POD -n $YOUR_NAMESPACE -- bash # Check the logs under /tmp/ray/session_latest/logs/serve/ ``` ### Method 4: Check Dashboard ```bash kubectl port-forward $RAY_POD -n $YOUR_NAMESPACE 8265:8265 # Check $YOUR_IP:8265 in your browser ``` For more details about Ray Serve observability on the dashboard, you can refer to [the documentation](dash-serve-view) and [the YouTube video](https://youtu.be/eqXfwM641a4). ### Method 5: Ray State CLI You can use the [Ray State CLI](state-api-cli-ref) on the head Pod to check the status of Ray Serve applications. ```bash # Log into the head Pod export HEAD_POD=$(kubectl get pods --selector=ray.io/node-type=head -o custom-columns=POD:metadata.name --no-headers) kubectl exec -it $HEAD_POD -- ray summary actors # [Example output]: # ======== Actors Summary: 2023-07-11 17:58:24.625032 ======== # Stats: # ------------------------------------ # total_actors: 14 # Table (group by class): # ------------------------------------ # CLASS_NAME STATE_COUNTS # 0 ServeController ALIVE: 1 # 1 ServeReplica:fruit_app_OrangeStand ALIVE: 1 # 2 ProxyActor ALIVE: 3 # 4 ServeReplica:math_app_Multiplier ALIVE: 1 # 5 ServeReplica:math_app_create_order ALIVE: 1 # 7 ServeReplica:fruit_app_FruitMarket ALIVE: 1 # 8 ServeReplica:math_app_Adder ALIVE: 1 # 9 ServeReplica:math_app_Router ALIVE: 1 # 10 ServeReplica:fruit_app_MangoStand ALIVE: 1 # 11 ServeReplica:fruit_app_PearStand ALIVE: 1 ``` ## Common issues * {ref}`kuberay-raysvc-issue1` * {ref}`kuberay-raysvc-issue2` * {ref}`kuberay-raysvc-issue3` * {ref}`kuberay-raysvc-issue4` * {ref}`kuberay-raysvc-issue5` * {ref}`kuberay-raysvc-issue6` * {ref}`kuberay-raysvc-issue7` * {ref}`kuberay-raysvc-issue8` * {ref}`kuberay-raysvc-issue9` * {ref}`kuberay-raysvc-issue10` * {ref}`kuberay-raysvc-issue11` (kuberay-raysvc-issue1)= ### Issue 1: Ray Serve script is incorrect It's better to test Ray Serve script locally or in a RayCluster before deploying it to a RayService. See [Development Workflow](serve-dev-workflow) for more details. (kuberay-raysvc-issue2)= ### Issue 2: `serveConfigV2` is incorrect The RayService CR sets `serveConfigV2` as a YAML multi-line string for flexibility. This implies that there is no strict type checking for the Ray Serve configurations in `serveConfigV2` field. Some tips to help you debug the `serveConfigV2` field: * Check [the documentation](serve-api) for the schema about the Ray Serve Multi-application API `PUT "/api/serve/applications/"`. * Unlike `serveConfig`, `serveConfigV2` adheres to the snake case naming convention. For example, `numReplicas` is used in `serveConfig`, while `num_replicas` is used in `serveConfigV2`. 
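Another way to debug `serveConfigV2` is to compare what you wrote with the configuration that the cluster actually accepted. The following is a sketch that queries the Serve REST API through the dashboard agent port (52365); it assumes `$HEAD_POD` holds your head Pod's name and that KubeRay already submitted the applications.

```bash
# Forward the dashboard agent port and fetch the currently applied Serve applications.
kubectl port-forward $HEAD_POD 52365:52365 &
curl -s http://localhost:52365/api/serve/applications/ | python3 -m json.tool | head -n 50
```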
(kuberay-raysvc-issue3)= ### Issue 3: The Ray image doesn't include the required dependencies You have two options to resolve this issue: * Build your own Ray image with the required dependencies. * Specify the required dependencies using `runtime_env` in the `serveConfigV2` field. * For example, the MobileNet example requires `python-multipart`, which isn't included in the Ray image `rayproject/ray:x.y.z`. Therefore, the YAML file includes `python-multipart` in the runtime environment. For more details, refer to [the MobileNet example](kuberay-mobilenet-rayservice-example). (kuberay-raysvc-issue4)= ### Issue 4: Incorrect `import_path`. You can refer to [the documentation](https://docs.ray.io/en/latest/serve/api/doc/ray.serve.schema.ServeApplicationSchema.html#ray.serve.schema.ServeApplicationSchema.import_path) for more details about the format of `import_path`. Taking [the MobileNet YAML file](https://github.com/ray-project/kuberay/blob/v1.0.0/ray-operator/config/samples/ray-service.mobilenet.yaml) as an example, the `import_path` is `mobilenet.mobilenet:app`. The first `mobilenet` is the name of the directory in the `working_dir`, the second `mobilenet` is the name of the Python file in the directory `mobilenet/`, and `app` is the name of the variable representing Ray Serve application within the Python file. ```yaml serveConfigV2: | applications: - name: mobilenet import_path: mobilenet.mobilenet:app runtime_env: working_dir: "https://github.com/ray-project/serve_config_examples/archive/b393e77bbd6aba0881e3d94c05f968f05a387b96.zip" pip: ["python-multipart==0.0.6"] ``` (kuberay-raysvc-issue5)= ### Issue 5: Fail to create / update Serve applications. You may encounter the following error messages when KubeRay tries to create / update Serve applications: #### Error message 1: `connect: connection refused` ``` Put "http://${HEAD_SVC_FQDN}:52365/api/serve/applications/": dial tcp $HEAD_IP:52365: connect: connection refused ``` For RayService, the KubeRay operator submits a request to the RayCluster for creating Serve applications once the head Pod is ready. It's important to note that the Dashboard, Dashboard Agent and GCS may take a few seconds to start up after the head Pod is ready. As a result, the request may fail a few times initially before the necessary components are fully operational. If you continue to encounter this issue after waiting for 1 minute, it's possible that the dashboard or dashboard agent may have failed to start. For more information, you can check the `dashboard.log` and `dashboard_agent.log` files located at `/tmp/ray/session_latest/logs/` on the head Pod. #### Error message 2: `i/o timeout` ``` Put "http://${HEAD_SVC_FQDN}:52365/api/serve/applications/": dial tcp $HEAD_IP:52365: i/o timeout" ``` One possible cause of this issue could be a Kubernetes NetworkPolicy blocking the traffic between the Ray Pods and the dashboard agent's port (i.e., 52365). (kuberay-raysvc-issue6)= ### Issue 6: `runtime_env` In `serveConfigV2`, you can specify the runtime environment for the Ray Serve applications using `runtime_env`. Some common issues related to `runtime_env`: * The `working_dir` points to a private AWS S3 bucket, but the Ray Pods do not have the necessary permissions to access the bucket. * The NetworkPolicy blocks the traffic between the Ray Pods and the external URLs specified in `runtime_env`. (kuberay-raysvc-issue7)= ### Issue 7: Failed to get Serve application statuses. 
You may encounter the following error message when KubeRay tries to get Serve application statuses: ``` Get "http://${HEAD_SVC_FQDN}:52365/api/serve/applications/": dial tcp $HEAD_IP:52365: connect: connection refused" ``` As mentioned in [Issue 5](#kuberay-raysvc-issue5), the KubeRay operator submits a `Put` request to the RayCluster for creating Serve applications once the head Pod is ready. After the successful submission of the `Put` request to the dashboard agent, a `Get` request is sent to the dashboard agent port (i.e., 52365). The successful submission indicates that all the necessary components, including the dashboard agent, are fully operational. Therefore, unlike Issue 5, the failure of the `Get` request is not expected. If you consistently encounter this issue, there are several possible causes: * The dashboard agent process on the head Pod is not running. You can check the `dashboard_agent.log` file located at `/tmp/ray/session_latest/logs/` on the head Pod for more information. In addition, you can also perform an experiment to reproduce this cause by manually killing the dashboard agent process on the head Pod. ```bash # Step 1: Log in to the head Pod kubectl exec -it $HEAD_POD -n $YOUR_NAMESPACE -- bash # Step 2: Check the PID of the dashboard agent process ps aux # [Example output] # ray 156 ... 0:03 /.../python -u /.../ray/dashboard/agent.py -- # Step 3: Kill the dashboard agent process kill 156 # Step 4: Check the logs cat /tmp/ray/session_latest/logs/dashboard_agent.log # [Example output] # 2023-07-10 11:24:31,962 INFO web_log.py:206 -- 10.244.0.5 [10/Jul/2023:18:24:31 +0000] "GET /api/serve/applications/ HTTP/1.1" 200 13940 "-" "Go-http-client/1.1" # 2023-07-10 11:24:34,001 INFO web_log.py:206 -- 10.244.0.5 [10/Jul/2023:18:24:33 +0000] "GET /api/serve/applications/ HTTP/1.1" 200 13940 "-" "Go-http-client/1.1" # 2023-07-10 11:24:36,043 INFO web_log.py:206 -- 10.244.0.5 [10/Jul/2023:18:24:36 +0000] "GET /api/serve/applications/ HTTP/1.1" 200 13940 "-" "Go-http-client/1.1" # 2023-07-10 11:24:38,082 INFO web_log.py:206 -- 10.244.0.5 [10/Jul/2023:18:24:38 +0000] "GET /api/serve/applications/ HTTP/1.1" 200 13940 "-" "Go-http-client/1.1" # 2023-07-10 11:24:38,590 WARNING agent.py:531 -- Exiting with SIGTERM immediately... # Step 5: Open a new terminal and check the logs of the KubeRay operator kubectl logs $KUBERAY_OPERATOR_POD -n $YOUR_NAMESPACE | tee operator-log # [Example output] # Get \"http://rayservice-sample-raycluster-rqlsl-head-svc.default.svc.cluster.local:52365/api/serve/applications/\": dial tcp 10.96.7.154:52365: connect: connection refused ``` (kuberay-raysvc-issue8)= ### Issue 8: A loop of restarting the RayCluster occurs when the Kubernetes cluster runs out of resources. (KubeRay v0.6.1 or earlier) > Note: Currently, the KubeRay operator does not have a clear plan to handle situations where the Kubernetes cluster runs out of resources. Therefore, we recommend ensuring that the Kubernetes cluster has sufficient resources to accommodate the serve application. If the status of a serve application remains non-`RUNNING` for more than `serviceUnhealthySecondThreshold` seconds, the KubeRay operator will consider the RayCluster as unhealthy and initiate the preparation of a new RayCluster. A common cause of this issue is that the Kubernetes cluster does not have enough resources to accommodate the serve application. In such cases, the KubeRay operator may continue to restart the RayCluster, leading to a loop of restarts. 
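In the older RayService CRDs this issue describes (KubeRay v0.6.1 or earlier), the threshold is a field of the RayService spec. The following hedged sketch shows where it sits; the resource name and values are illustrative, and later KubeRay releases changed this restart behavior:

```yaml
apiVersion: ray.io/v1alpha1
kind: RayService
metadata:
  name: rayservice-sample            # illustrative name
spec:
  # Seconds a Serve application may stay non-RUNNING before the operator
  # prepares a new RayCluster (KubeRay v0.6.1 or earlier).
  serviceUnhealthySecondThreshold: 300
  deploymentUnhealthySecondThreshold: 300
  serveConfigV2: |
    ...
  rayClusterConfig:
    ...
```

Raising these thresholds only delays the restart loop; the underlying fix is to give the Kubernetes cluster enough capacity for every Serve replica.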
We can also perform an experiment to reproduce this situation: * A Kubernetes cluster with an 8-CPUs node * [ray-service.insufficient-resources.yaml](https://gist.github.com/kevin85421/6a7779308aa45b197db8015aca0c1faf) * RayCluster: * The cluster has 1 head Pod with 4 physical CPUs, but `num-cpus` is set to 0 in `rayStartParams` to prevent any serve replicas from being scheduled on the head Pod. * The cluster also has 1 worker Pod with 1 CPU by default. * `serveConfigV2` specifies 5 serve deployments, each with 1 replica and a requirement of 1 CPU. ```bash # Step 1: Get the number of CPUs available on the node kubectl get nodes -o custom-columns=NODE:.metadata.name,ALLOCATABLE_CPU:.status.allocatable.cpu # [Example output] # NODE ALLOCATABLE_CPU # kind-control-plane 8 # Step 2: Install a KubeRay operator. # Step 3: Create a RayService with autoscaling enabled. kubectl apply -f ray-service.insufficient-resources.yaml # Step 4: The Kubernetes cluster will not have enough resources to accommodate the serve application. kubectl describe rayservices.ray.io rayservice-sample -n $YOUR_NAMESPACE # [Example output] # fruit_app_FruitMarket: # Health Last Update Time: 2023-07-11T02:10:02Z # Last Update Time: 2023-07-11T02:10:35Z # Message: Deployment "fruit_app_FruitMarket" has 1 replicas that have taken more than 30s to be scheduled. This may be caused by waiting for the cluster to auto-scale, or waiting for a runtime environment to install. Resources required for each replica: {"CPU": 1.0}, resources available: {}. # Status: UPDATING # Step 5: A new RayCluster will be created after `serviceUnhealthySecondThreshold` (300s here) seconds. # Check the logs of the KubeRay operator to find the reason for restarting the RayCluster. kubectl logs $KUBERAY_OPERATOR_POD -n $YOUR_NAMESPACE | tee operator-log # [Example output] # 2023-07-11T02:14:58.109Z INFO controllers.RayService Restart RayCluster {"appName": "fruit_app", "restart reason": "The status of the serve application fruit_app has not been RUNNING for more than 300.000000 seconds. Hence, KubeRay operator labels the RayCluster unhealthy and will prepare a new RayCluster."} # 2023-07-11T02:14:58.109Z INFO controllers.RayService Restart RayCluster {"deploymentName": "fruit_app_FruitMarket", "appName": "fruit_app", "restart reason": "The status of the serve deployment fruit_app_FruitMarket or the serve application fruit_app has not been HEALTHY/RUNNING for more than 300.000000 seconds. Hence, KubeRay operator labels the RayCluster unhealthy and will prepare a new RayCluster. The message of the serve deployment is: Deployment \"fruit_app_FruitMarket\" has 1 replicas that have taken more than 30s to be scheduled. This may be caused by waiting for the cluster to auto-scale, or waiting for a runtime environment to install. Resources required for each replica: {\"CPU\": 1.0}, resources available: {}."} # . # . # . # 2023-07-11T02:14:58.122Z INFO controllers.RayService Restart RayCluster {"ServiceName": "default/rayservice-sample", "AvailableWorkerReplicas": 1, "DesiredWorkerReplicas": 5, "restart reason": "The serve application is unhealthy, restarting the cluster. If the AvailableWorkerReplicas is not equal to DesiredWorkerReplicas, this may imply that the Autoscaler does not have enough resources to scale up the cluster. Hence, the serve application does not have enough resources to run. 
Please check https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayservice-troubleshooting.md for more details.", "RayCluster": {"apiVersion": "ray.io/v1alpha1", "kind": "RayCluster", "namespace": "default", "name": "rayservice-sample-raycluster-hvd9f"}}
```

(kuberay-raysvc-issue9)=
### Issue 9: Upgrade from Ray Serve's single-application API to its multi-application API without downtime

KubeRay v0.6.0 has begun supporting Ray Serve API V2 (multi-application) by exposing `serveConfigV2` in the RayService CRD. However, Ray Serve does not support deploying both API V1 and API V2 in the cluster simultaneously. Hence, if users want to perform in-place upgrades by replacing `serveConfig` with `serveConfigV2`, they may encounter the following error message:

```
ray.serve.exceptions.RayServeException: You are trying to deploy a multi-application config, however a single-application config has been deployed to the current Serve instance already. Mixing single-app and multi-app is not allowed. Please either redeploy using the single-application config format `ServeApplicationSchema`, or shutdown and restart Serve to submit a multi-app config of format `ServeDeploySchema`. If you are using the REST API, you can submit a multi-app config to the the multi-app API endpoint `/api/serve/applications/`.
```

To resolve this issue, replace `serveConfig` with `serveConfigV2` and change `rayVersion` to 2.100.0. The `rayVersion` field has no effect when the Ray version is 2.0.0 or later, but changing it triggers the preparation of a new RayCluster instead of an in-place update.

If, after following the steps above, you still see the error message and GCS fault tolerance is enabled, it may be due to the `ray.io/external-storage-namespace` annotation being the same for both the old and new RayClusters. You can remove the annotation, and KubeRay will automatically generate a unique key for each RayCluster custom resource. See [kuberay#1297](https://github.com/ray-project/kuberay/issues/1297) for more details.

(kuberay-raysvc-issue10)=
### Issue 10: Upgrade RayService with GCS fault tolerance enabled without downtime

KubeRay uses the value of the annotation [ray.io/external-storage-namespace](kuberay-external-storage-namespace) to assign the environment variable `RAY_external_storage_namespace` to all Ray Pods managed by the RayCluster. This value represents the storage namespace in Redis where the Ray cluster metadata resides. During head Pod recovery, the head Pod attempts to reconnect to the Redis server using the `RAY_external_storage_namespace` value to recover the cluster data.

However, specifying the `RAY_external_storage_namespace` value in RayService can potentially lead to downtime during zero-downtime upgrades. Specifically, the new RayCluster accesses the same Redis storage namespace as the old one for cluster metadata. This configuration can lead the KubeRay operator to assume that the Ray Serve applications are operational, as indicated by the existing metadata in Redis. Consequently, the operator might deem it safe to retire the old RayCluster and redirect traffic to the new one, even though the latter may still require time to initialize the Ray Serve applications.

The recommended solution is to remove the `ray.io/external-storage-namespace` annotation from the RayService CRD. If the annotation isn't set, KubeRay automatically uses each RayCluster custom resource's UID as the `RAY_external_storage_namespace` value.
Hence, both the old and new RayClusters have different `RAY_external_storage_namespace` values, and the new RayCluster is unable to access the old cluster metadata. Another solution is to set the `RAY_external_storage_namespace` value manually to a unique value for each RayCluster custom resource. See [kuberay#1296](https://github.com/ray-project/kuberay/issues/1296) for more details. (kuberay-raysvc-issue11)= ### Issue 11: RayService stuck in Initializing — use the initializing timeout to fail fast If one or more underlying Pods are scheduled but fail to start (for example, ImagePullBackOff, CrashLoopBackOff, or other container startup errors), a `RayService` can remain in the Initializing state indefinitely. This state consumes cluster resources and makes the root cause harder to diagnose. #### What to do KubeRay exposes a configurable initializing timeout via the annotation `ray.io/initializing-timeout`. When the timeout expires, the operator marks the `RayService` as failed and starts cleanup of associated `RayCluster` resources. Enabling the timeout requires only adding the annotation to the `RayService` metadata — no other CRD changes are necessary. #### Operator behavior after timeout - The `RayServiceReady` condition is set to `False` with reason `InitializingTimeout`. - The `RayService` is placed into a **terminal (failed)** state; updating the spec will not trigger a retry. Recovery requires deleting and recreating the `RayService`. - Cluster names on the `RayService` CR are cleared, which triggers cleanup of the underlying `RayCluster` resources. Deletions still respect `RayClusterDeletionDelaySeconds`. - A `Warning` event is emitted that documents the timeout and the failure reason. #### Enable the timeout Add the annotation to your `RayService` metadata. The annotation accepts either Go duration strings (for example, `"30m"` or `"1h"`) or integer seconds (for example, `"1800"`): ```yaml metadata: annotations: ray.io/initializing-timeout: "30m" ``` #### Guidance - Pick a timeout that balances expected startup work with failing fast to conserve cluster resources. - See the upstream discussion [kuberay#4138](https://github.com/ray-project/kuberay/issues/4138) for more implementation details. --- (kuberay-troubleshooting-guides)= # Troubleshooting guide This document addresses common inquiries. If you don't find an answer to your question here, please don't hesitate to connect with us via our [community channels](https://github.com/ray-project/kuberay#getting-involved). # Contents - [Use the right version of Ray](#use-the-right-version-of-ray) - [Use ARM-based docker images for Apple M1 or M2 MacBooks](#docker-image-for-apple-macbooks) - [Upgrade KubeRay](#upgrade-kuberay) - [Worker init container](#worker-init-container) - [Cluster domain](#cluster-domain) - [RayService](#rayservice) - [Autoscaler](#autoscaler) - [Multi-node GPU clusters](#multi-node-gpu) - [Other questions](#other-questions) (use-the-right-version-of-ray)= ## Use the right version of Ray See the [upgrade guide](#kuberay-upgrade-guide) for the compatibility matrix between KubeRay versions and Ray versions. ```{admonition} Don't use Ray versions between 2.11.0 and 2.37.0. The [commit](https://github.com/ray-project/ray/pull/44658) introduces a bug in Ray 2.11.0. When a Ray job is created, the Ray dashboard agent process on the head node gets stuck, causing the readiness and liveness probes, which send health check requests for the Raylet to the dashboard agent, to fail. 
```

(docker-image-for-apple-macbooks)=
## Use ARM-based docker images for Apple M1 or M2 MacBooks

Ray builds different images for different platforms. Until Ray moves to building multi-architecture images, [tracked by this GitHub issue](https://github.com/ray-project/ray/issues/39364), use platform-specific docker images in the head and worker group specs of the [RayCluster config](https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/config.html#image). Use an image with the tag `aarch64` (for example, `image: rayproject/ray:2.41.0-aarch64`) if you are running KubeRay on a MacBook M1 or M2. [Link to issue details and discussion](https://ray-distributed.slack.com/archives/C02GFQ82JPM/p1712267296145549).

(upgrade-kuberay)=
## Upgrade KubeRay

If you have issues upgrading KubeRay, see the [upgrade guide](#kuberay-upgrade-guide). Most issues are about the CRD version.

(worker-init-container)=
## Worker init container

The KubeRay operator injects a default [init container](https://kubernetes.io/docs/concepts/workloads/pods/init-containers/) into every worker Pod. This init container is responsible for waiting until the Global Control Service (GCS) on the head Pod is ready before establishing a connection to the head. The init container will use `ray health-check` to check the GCS server status continuously.

The default worker init container may not work for all use cases, or users may want to customize the init container.

### 1. Init container troubleshooting

Some common causes for the worker init container to get stuck in `Init:0/1` status are:

* The GCS server process has failed in the head Pod. Please inspect the log directory `/tmp/ray/session_latest/logs/` in the head Pod for errors related to the GCS server.
* The `ray` executable is not included in the `$PATH` for the image, so the init container will fail to run `ray health-check`.
* The `CLUSTER_DOMAIN` environment variable is not set correctly. See the section [cluster domain](#cluster-domain) for more details.
* The worker init container shares the same ***ImagePullPolicy***, ***SecurityContext***, ***Env***, ***VolumeMounts***, and ***Resources*** as the worker Pod template. Sharing these settings can cause a deadlock. See [#1130](https://github.com/ray-project/kuberay/issues/1130) for more details.

If the init container remains stuck in `Init:0/1` status for 2 minutes, Ray stops redirecting the output messages to `/dev/null` and instead prints them to the worker Pod logs. To troubleshoot further, you can inspect the logs using `kubectl logs`.

### 2. Disable the init container injection

If you want to customize the worker init container, you can disable the init container injection and add your own. To disable the injection, set the `ENABLE_INIT_CONTAINER_INJECTION` environment variable in the KubeRay operator to `false` (applicable from KubeRay v0.5.2). Please refer to [#1069](https://github.com/ray-project/kuberay/pull/1069) and the [KubeRay Helm chart](https://github.com/ray-project/kuberay/blob/ddb5e528c29c2e1fb80994f05b1bd162ecbaf9f2/helm-chart/kuberay-operator/values.yaml#L83-L87) for instructions on how to set the environment variable. Once disabled, you can add your custom init container to the worker Pod template.

(cluster-domain)=
## Cluster domain

In KubeRay, we use Fully Qualified Domain Names (FQDNs) to establish connections between workers and the head. The FQDN of the head service is `${HEAD_SVC}.${NAMESPACE}.svc.${CLUSTER_DOMAIN}`.
The default [cluster domain](https://kubernetes.io/docs/tasks/administer-cluster/dns-custom-nameservers/#introduction) is `cluster.local`, which works for most Kubernetes clusters. However, it's important to note that some clusters may have a different cluster domain. You can check the cluster domain of your Kubernetes cluster by checking `/etc/resolv.conf` in a Pod. To set a custom cluster domain, adjust the `CLUSTER_DOMAIN` environment variable in the KubeRay operator. Helm chart users can make this modification [here](https://github.com/ray-project/kuberay/blob/ddb5e528c29c2e1fb80994f05b1bd162ecbaf9f2/helm-chart/kuberay-operator/values.yaml#L88-L91). For more information, refer to [#951](https://github.com/ray-project/kuberay/pull/951) and [#938](https://github.com/ray-project/kuberay/pull/938).

(rayservice)=
## RayService

RayService is a Custom Resource Definition (CRD) designed for Ray Serve. In KubeRay, creating a RayService will first create a RayCluster and then create Ray Serve applications once the RayCluster is ready. If the issue pertains to the data plane, specifically your Ray Serve scripts or Ray Serve configurations (`serveConfigV2`), troubleshooting may be challenging. See [rayservice-troubleshooting](kuberay-raysvc-troubleshoot) for more details.

(autoscaler)=
## Ray Autoscaler

### Ray Autoscaler doesn't scale up, causing new Ray tasks or actors to remain pending

One common cause is that the Ray tasks or actors require an amount of resources that exceeds what any single Ray node can provide. Note that Ray tasks and actors represent the smallest scheduling units in Ray, and a task or actor must run on a single Ray node. Take [kuberay#846](https://github.com/ray-project/kuberay/issues/846) as an example. The user attempts to schedule a Ray task that requires 2 CPUs, but the Ray Pods available for these tasks have only 1 CPU each. Consequently, the Ray Autoscaler decides not to scale up the RayCluster.

(multi-node-gpu)=
## Multi-node GPU Deployments

For comprehensive troubleshooting of multi-node GPU serving issues, refer to {ref}`Troubleshooting multi-node GPU serving on KubeRay `.

(other-questions)=
## Other questions

### Why are changes to the RayCluster or RayJob CR not taking effect?

Currently, only modifications to the `replicas` field in `RayCluster/RayJob` CR are supported. Changes to other fields may not take effect or could lead to unexpected results.

### How to configure reconcile concurrency when there are a large number of CRs?

In this example, [kuberay#3909](https://github.com/ray-project/kuberay/issues/3909), the user encountered high latency when processing RayCluster CRs and found that the `ReconcileConcurrency` value was set to 1. The KubeRay operator supports configuring the `ReconcileConcurrency` setting, which controls the number of concurrent workers processing Ray custom resources (CRs).
To configure the `ReconcileConcurrency` number, you can edit the deployment's container args:

```bash
kubectl edit deployment kuberay-operator
```

Specify the `ReconcileConcurrency` number in the container args:

```yaml
spec:
  containers:
  - args:
    - --reconcile-concurrency
    - "10"
```

For KubeRay versions 1.5.1 or later, you can also set the value through Helm:

```bash
helm install kuberay-operator kuberay/kuberay-operator --version 1.5.1 --set reconcileConcurrency=10
```

---

(kuberay-troubleshooting)=
# KubeRay Troubleshooting

```{toctree}
:hidden:

troubleshooting/troubleshooting
troubleshooting/rayservice-troubleshooting
```

- {ref}`kuberay-troubleshooting-guides`
- {ref}`kuberay-raysvc-troubleshoot`

---

(kuberay-ack-gpu-cluster-setup)=
# Start an Aliyun ACK cluster with GPUs for KubeRay

This guide provides step-by-step instructions for creating an ACK cluster with GPU nodes specifically configured for KubeRay. The configuration outlined here can be applied to most KubeRay examples found in the documentation.

## Step 1: Create a Kubernetes cluster on Aliyun ACK

See [Create a cluster](https://www.alibabacloud.com/help/en/ack/ack-managed-and-ack-dedicated/user-guide/create-an-ack-managed-cluster-2) to create an Aliyun ACK cluster and see [Connect to clusters](https://www.alibabacloud.com/help/en/ack/ack-managed-and-ack-dedicated/user-guide/access-clusters) to configure your computer to communicate with the cluster.

## Step 2: Create node pools for the Aliyun ACK cluster

See [Create a node pool](https://www.alibabacloud.com/help/en/ack/ack-managed-and-ack-dedicated/user-guide/create-a-node-pool) to create node pools.

### Manage node labels and taints

If you need to set taints for nodes, see [Create and manage node labels](https://www.alibabacloud.com/help/en/ack/ack-managed-and-ack-dedicated/user-guide/manage-taints-and-tolerations) and [Create and manage node taints](https://www.alibabacloud.com/help/en/ack/ack-managed-and-ack-dedicated/user-guide/manage-taints-and-tolerations). For example, you can add a taint to GPU node pools so that Ray won't schedule head pods on these nodes.

### Upgrade drivers on the nodes

If you need to upgrade the drivers on the nodes, see [Step 2: Create a node pool and specify an NVIDIA driver version](https://www.alibabacloud.com/help/en/ack/ack-managed-and-ack-dedicated/user-guide/customize-the-gpu-driver-version-of-the-node-by-specifying-the-version-number) to upgrade drivers.

## Step 3: Install KubeRay addon in the cluster

See [Step 2: Install KubeRay-Operator](https://www.alibabacloud.com/help/en/ack/cloud-native-ai-suite/use-cases/efficient-deployment-and-optimization-practice-of-ray-in-ack-cluster?) to deploy KubeRay in ACK.

---

(kuberay-eks-gpu-cluster-setup)=
# Start Amazon EKS Cluster with GPUs for KubeRay

This guide walks you through the steps to create an Amazon EKS cluster with GPU nodes specifically for KubeRay. The configuration outlined here can be applied to most KubeRay examples found in the documentation.

## Step 1: Create a Kubernetes cluster on Amazon EKS

Follow the first two steps in [this AWS documentation](https://docs.aws.amazon.com/eks/latest/userguide/getting-started-console.html#) to: (1) create your Amazon EKS cluster and (2) configure your computer to communicate with your cluster.

## Step 2: Create node groups for the Amazon EKS cluster

Follow "Step 3: Create nodes" in [this AWS documentation](https://docs.aws.amazon.com/eks/latest/userguide/getting-started-console.html#) to create node groups.
The following section provides more detailed information. ### Create a CPU node group Typically, avoid running GPU workloads on the Ray head. Create a CPU node group for all Pods except Ray GPU workers, such as the KubeRay operator, Ray head, and CoreDNS Pods. Here's a common configuration that works for most KubeRay examples in the docs: * Instance type: [**m5.xlarge**](https://aws.amazon.com/ec2/instance-types/m5/) (4 vCPU; 16 GB RAM) * Disk size: 256 GB * Desired size: 1, Min size: 0, Max size: 1 ### Create a GPU node group Create a GPU node group for Ray GPU workers. 1. Here's a common configuration that works for most KubeRay examples in the docs: * AMI type: Bottlerocket NVIDIA (BOTTLEROCKET_x86_64_NVIDIA) * Instance type: [**g5.xlarge**](https://aws.amazon.com/ec2/instance-types/g5/) (1 GPU; 24 GB GPU Memory; 4 vCPUs; 16 GB RAM) * Disk size: 1024 GB * Desired size: 1, Min size: 0, Max size: 1 2. Please install the NVIDIA device plugin. (Note: You can skip this step if you used the `BOTTLEROCKET_x86_64_NVIDIA` AMI in the step above.) * Install the DaemonSet for NVIDIA device plugin to run GPU enabled containers in your Amazon EKS cluster. You can refer to the [Amazon EKS optimized accelerated Amazon Linux AMIs](https://docs.aws.amazon.com/eks/latest/userguide/eks-optimized-ami.html#gpu-ami) or [NVIDIA/k8s-device-plugin](https://github.com/NVIDIA/k8s-device-plugin) repository for more details. * If the GPU nodes have taints, add `tolerations` to `nvidia-device-plugin.yml` to enable the DaemonSet to schedule Pods on the GPU nodes. > **Note:** If you encounter permission issues with `kubectl`, follow "Step 2: Configure your computer to communicate with your cluster" in the [AWS documentation](https://docs.aws.amazon.com/eks/latest/userguide/getting-started-console.html#). ```sh # Install the DaemonSet kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.9.0/nvidia-device-plugin.yml # Verify that your nodes have allocatable GPUs. If the GPU node fails to detect GPUs, # please verify whether the DaemonSet schedules the Pod on the GPU node. kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu" # Example output: # NAME GPU # ip-....us-west-2.compute.internal 4 # ip-....us-west-2.compute.internal ``` 3. Add a Kubernetes taint to prevent scheduling CPU Pods on this GPU node group. For KubeRay examples, add the following taint to the GPU nodes: `Key: ray.io/node-type, Value: worker, Effect: NoSchedule`, and include the corresponding `tolerations` for GPU Ray worker Pods. > Warning: GPU nodes are extremely expensive. Please remember to delete the cluster if you no longer need it. ## Step 3: Verify the node groups > **Note:** If you encounter permission issues with `eksctl`, navigate to your AWS account's webpage and copy the credential environment variables, including `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_SESSION_TOKEN`, from the "Command line or programmatic access" page. ```sh eksctl get nodegroup --cluster ${YOUR_EKS_NAME} # CLUSTER NODEGROUP STATUS CREATED MIN SIZE MAX SIZE DESIRED CAPACITY INSTANCE TYPE IMAGE ID ASG NAME TYPE # ${YOUR_EKS_NAME} cpu-node-group ACTIVE 2023-06-05T21:31:49Z 0 1 1 m5.xlarge AL2_x86_64 eks-cpu-node-group-... managed # ${YOUR_EKS_NAME} gpu-node-group ACTIVE 2023-06-05T22:01:44Z 0 1 1 g5.12xlarge BOTTLEROCKET_x86_64_NVIDIA eks-gpu-node-group-... 
managed ``` --- (kuberay-aks-gpu-cluster-setup)= # Start Azure AKS Cluster with GPUs for KubeRay This guide walks you through the steps to create an Azure AKS cluster with GPU nodes specifically for KubeRay. The configuration outlined here can be applied to most KubeRay examples found in the documentation. You can find the landing page for AKS [here](https://azure.microsoft.com/en-us/services/kubernetes-service/). If you have an account set up, you can immediately start experimenting with Kubernetes clusters in the provider's console. Alternatively, check out the [documentation](https://docs.microsoft.com/en-us/azure/aks/) and [quickstart guides](https://docs.microsoft.com/en-us/azure/aks/learn/quick-kubernetes-deploy-portal?tabs=azure-cli). To successfully deploy Ray on Kubernetes, you will need to use node pools following the guidance [here](https://docs.microsoft.com/en-us/azure/aks/use-multiple-node-pools). ## Step 1: Create a Resource Group To create a resource group in a particular region: ``` az group create -l eastus -n kuberay-rg ``` ## Step 2: Create AKS Cluster To create an AKS cluster with system nodepool: ``` az aks create \ -g kuberay-rg \ -n kuberay-gpu-cluster \ --nodepool-name system \ --node-vm-size Standard_D8s_v3 \ --node-count 3 ``` ## Step 3: Add a GPU node group To add a GPU nodepool with autoscaling: ``` az aks nodepool add \ -g kuberay-rg \ --cluster-name kuberay-gpu-cluster \ --nodepool-name gpupool \ --node-vm-size Standard_NC6s_v3 \ --node-taints nvidia.com/gpu=present:NoSchedule \ --min-count 0 \ --max-count 3 \ --enable-cluster-autoscaler ``` To use NVIDIA GPU operator alternatively, follow instructions [here](https://learn.microsoft.com/en-us/azure/aks/gpu-cluster?tabs=add-ubuntu-gpu-node-pool#skip-gpu-driver-installation-preview) ## Step 4: Get kubeconfig To get kubeconfig: ``` az aks get-credentials --resource-group kuberay-rg \ --name kuberay-gpu-cluster \ --overwrite-existing ``` --- (kuberay-config)= # RayCluster Configuration This guide covers the key aspects of Ray cluster configuration on Kubernetes. ## Introduction Deployments of Ray on Kubernetes follow the [operator pattern](https://kubernetes.io/docs/concepts/extend-kubernetes/operator/). The key players are - A [custom resource](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/) called a `RayCluster` describing the desired state of a Ray cluster. - A [custom controller](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/#custom-controllers), the KubeRay operator, which manages Ray pods in order to match the `RayCluster`'s spec. To deploy a Ray cluster, one creates a `RayCluster` custom resource (CR): ```shell kubectl apply -f raycluster.yaml ``` This guide covers the salient features of `RayCluster` CR configuration. For reference, here is a condensed example of a `RayCluster` CR in yaml format. ```yaml apiVersion: ray.io/v1alpha1 kind: RayCluster metadata: name: raycluster-complete spec: rayVersion: "2.3.0" enableInTreeAutoscaling: true autoscalerOptions: ... headGroupSpec: serviceType: ClusterIP # Options are ClusterIP, NodePort, and LoadBalancer rayStartParams: dashboard-host: "0.0.0.0" ... 
template: # Pod template metadata: # Pod metadata spec: # Pod spec containers: - name: ray-head image: rayproject/ray-ml:2.3.0 resources: limits: cpu: 14 memory: 54Gi requests: cpu: 14 memory: 54Gi ports: # Optional service port overrides - containerPort: 6379 name: gcs - containerPort: 8265 name: dashboard - containerPort: 10001 name: client - containerPort: 8000 name: serve ... workerGroupSpecs: - groupName: small-group replicas: 1 minReplicas: 1 maxReplicas: 5 rayStartParams: ... template: # Pod template spec: ... # Another workerGroup - groupName: medium-group ... # Yet another workerGroup, with access to special hardware perhaps. - groupName: gpu-group ... ``` The rest of this guide will discuss the `RayCluster` CR's config fields. See also the [guide](kuberay-autoscaling-config) on configuring Ray autoscaling with KubeRay. (kuberay-config-ray-version)= ## The Ray Version The field `rayVersion` specifies the version of Ray used in the Ray cluster. The `rayVersion` is used to fill default values for certain config fields. The Ray container images specified in the RayCluster CR should carry the same Ray version as the CR's `rayVersion`. If you are using a nightly or development Ray image, it is fine to set `rayVersion` to the latest release version of Ray. ## Pod configuration: headGroupSpec and workerGroupSpecs At a high level, a RayCluster is a collection of Kubernetes pods, similar to a Kubernetes Deployment or StatefulSet. Just as with the Kubernetes built-ins, the key pieces of configuration are * Pod specification * Scale information (how many pods are desired) The key difference between a Deployment and a `RayCluster` is that a `RayCluster` is specialized for running Ray applications. A Ray cluster consists of * One **head pod** which hosts global control processes for the Ray cluster. The head pod can also run Ray tasks and actors. * Any number of **worker pods**, which run Ray tasks and actors. Workers come in **worker groups** of identically configured pods. For each worker group, we must specify **replicas**, the number of pods we want of that group. The head pod’s configuration is specified under `headGroupSpec`, while configuration for worker pods is specified under `workerGroupSpecs`. There may be multiple worker groups, each group with its own configuration. The `replicas` field of a `workerGroupSpec` specifies the number of worker pods of that group to keep in the cluster. Each `workerGroupSpec` also has optional `minReplicas` and `maxReplicas` fields; these fields are important if you wish to enable {ref}`autoscaling `. ### Pod templates The bulk of the configuration for a `headGroupSpec` or `workerGroupSpec` goes in the `template` field. The `template` is a Kubernetes Pod template which determines the configuration for the pods in the group. Here are some of the subfields of the pod `template` to pay attention to: #### containers A Ray pod template specifies at minimum one container, namely the container that runs the Ray processes. A Ray pod template may also specify additional sidecar containers, for purposes such as {ref}`log processing `. However, the KubeRay operator assumes that the first container in the containers list is the main Ray container. Therefore, make sure to specify any sidecar containers **after** the main Ray container. In other words, the Ray container should be the **first** in the `containers` list. #### resources It's important to specify container CPU and memory resources for each group spec. 
Since CPU is a [compressible resource], you may want to set only CPU requests and not limits to guarantee your workloads a minimum amount of CPU but [allow them to take advantage of unused CPU and not get throttled][1] if they use more than their requested CPU. For GPU workloads, you may also wish to specify GPU limits. For example, set `nvidia.com/gpu: 2` if using an NVIDIA GPU device plugin and you wish to specify a pod with access to 2 GPUs. See {ref}`this guide ` for more details on GPU support. KubeRay automatically configures Ray to use the CPU, memory, and GPU **limits** in the Ray container config. These values are the logical resource capacities of Ray pods in the head or worker group. As of KubeRay 1.3.0, KubeRay uses the CPU request if the limit is absent. KubeRay rounds up CPU quantities to the nearest integer. You can override these resource capacities with {ref}`rayStartParams`. KubeRay ignores memory and GPU **requests**. So **set memory and GPU resource requests equal to their limits** when possible It's ideal to size each Ray pod to take up the entire Kubernetes node. In other words, it's best to run one large Ray pod per Kubernetes node. In general, it's more efficient to use a few large Ray pods than many small ones. The pattern of fewer large Ray pods has the following advantages: - more efficient use of each Ray pod's shared memory object store - reduced communication overhead between Ray pods - reduced redundancy of per-pod Ray control structures such as Raylets #### nodeSelector and tolerations You can control the scheduling of worker groups' Ray pods by setting the `nodeSelector` and `tolerations` fields of the pod spec. Specifically, these fields determine on which Kubernetes nodes the pods may be scheduled. See the [Kubernetes docs](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/) for more about Pod-to-Node assignment. #### image The Ray container images specified in the `RayCluster` CR should carry the same Ray version as the CR's `spec.rayVersion`. If you are using a nightly or development Ray image, you can specify Ray's latest release version under `spec.rayVersion`. For Apple M1 or M2 MacBooks, see [Use ARM-based docker images for Apple M1 or M2 MacBooks](docker-image-for-apple-macbooks) to specify the correct image. You must install code dependencies for a given Ray task or actor on each Ray node that might run the task or actor. The simplest way to achieve this configuration is to use the same Ray image for the Ray head and all worker groups. In any case, do make sure that all Ray images in your CR carry the same Ray version and Python version. To distribute custom code dependencies across your cluster, you can build a custom container image, using one of the [official Ray images](https://hub.docker.com/r/rayproject/ray) as the base. See {ref}`this guide ` to learn more about the official Ray images. For dynamic dependency management geared towards iteration and development, you can also use {ref}`Runtime Environments `. For `kuberay-operator` versions 1.1.0 and later, the Ray container image must have `wget` installed in it. #### metadata.name and metadata.generateName The KubeRay operator will ignore the values of `metadata.name` and `metadata.generateName` set by users. The KubeRay operator will generate a `generateName` automatically to avoid name conflicts. See [KubeRay issue #587](https://github.com/ray-project/kuberay/pull/587) for more details. 
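Pulling the pod template guidance above together, here's a hedged sketch of a single worker group. The group name, image tag, node label, and sizes are illustrative; substitute values that match your nodes and your CR's `rayVersion`:

```yaml
workerGroupSpecs:
  - groupName: gpu-group               # illustrative group name
    replicas: 1
    minReplicas: 0
    maxReplicas: 4
    rayStartParams: {}
    template:
      spec:
        # Steer this group onto GPU nodes and tolerate their taint.
        nodeSelector:
          ray.io/node-type: worker     # illustrative node label
        tolerations:
          - key: "ray.io/node-type"
            operator: "Equal"
            value: "worker"
            effect: "NoSchedule"
        containers:
          - name: ray-worker           # the Ray container must come first
            image: rayproject/ray-ml:2.3.0-gpu   # illustrative tag; match spec.rayVersion
            resources:
              requests:
                cpu: "3"               # CPU request without a limit avoids throttling
                memory: 50Gi           # memory request equals the limit
                nvidia.com/gpu: 1      # GPU request equals the limit
              limits:
                memory: 50Gi
                nvidia.com/gpu: 1
```

Because KubeRay reads the container limits (and, since KubeRay 1.3.0, the CPU request when the limit is absent), this group advertises 3 CPUs, 50Gi of memory, and 1 GPU per pod to Ray.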
(rayStartParams)=
## Ray Start Parameters

The ``rayStartParams`` field of each group spec is a string-string map of arguments to the Ray container’s `ray start` entrypoint. For the full list of arguments, refer to the documentation for {ref}`ray start `. The RayCluster Custom Resource Definition (CRD) in KubeRay versions before 1.4.0 required this field to exist, but the value could be an empty map. As of KubeRay 1.4.0, ``rayStartParams`` is optional. Note the following arguments:

### dashboard-host

For most use-cases, this field should be set to "0.0.0.0" for the Ray head pod. This is required to expose the Ray dashboard outside the Ray cluster. (Future versions might set this parameter by default.)

(kuberay-num-cpus)=
### num-cpus

This optional field tells the Ray scheduler and autoscaler how many CPUs are available to the Ray pod. The CPU count can be autodetected from the Kubernetes resource limits specified in the group spec’s pod `template`. However, it is sometimes useful to override this autodetected value. For example, setting `num-cpus:"0"` for the Ray head pod will prevent Ray workloads with non-zero CPU requirements from being scheduled on the head. Note that the values of all Ray start parameters, including `num-cpus`, must be supplied as **strings**.

### num-gpus

This field specifies the number of GPUs available to the Ray container. In future KubeRay versions, the number of GPUs will be auto-detected from Ray container resource limits. Note that the values of all Ray start parameters, including `num-gpus`, must be supplied as **strings**.

### memory

The memory available to Ray is detected automatically from the Kubernetes resource limits. If you wish, you may override this autodetected value by setting the desired memory value, in bytes, under `rayStartParams.memory`. Note that the values of all Ray start parameters, including `memory`, must be supplied as **strings**.

### resources

This field can be used to specify custom resource capacities for the Ray pod. These resource capacities will be advertised to the Ray scheduler and Ray autoscaler. For example, the following configuration marks a Ray pod as having 1 unit of `Custom1` capacity and 5 units of `Custom2` capacity.

```yaml
rayStartParams:
    resources: '"{\"Custom1\": 1, \"Custom2\": 5}"'
```

You can then request these custom resources in your tasks and actors, for example `@ray.remote(resources={"Custom2": 1})`. The Ray scheduler and autoscaler will take appropriate action to schedule such tasks.

Note the format used to express the resources string. In particular, note that the backslashes are present as actual characters in the string. If you are specifying a `RayCluster` programmatically, you may have to [escape the backslashes](https://github.com/ray-project/ray/blob/cd9cabcadf1607bcda1512d647d382728055e688/python/ray/tests/kuberay/test_autoscaling_e2e.py#L92) to make sure they are processed as part of the string.

The field `rayStartParams.resources` should only be used for custom resources. The keys `CPU`, `GPU`, and `memory` are forbidden. If you need to specify overrides for those resource fields, use the Ray start parameters `num-cpus`, `num-gpus`, or `memory`.

(kuberay-networking)=
## Services and Networking

### The Ray head service.
The KubeRay operator automatically configures a Kubernetes Service exposing the default ports for several services of the Ray head pod, including - Ray Client (default port 10001) - Ray Dashboard (default port 8265) - Ray GCS server (default port 6379) - Ray Serve (default port 8000) - Ray Prometheus metrics (default port 8080) The name of the configured Kubernetes Service is the name, `metadata.name`, of the RayCluster followed by the suffix `head-svc`. For the example CR given on this page, the name of the head service will be `raycluster-example-head-svc`. Kubernetes networking (`kube-dns`) then allows us to address the Ray head's services using the name `raycluster-example-head-svc`. For example, the Ray Client server can be accessed from a pod in the same Kubernetes namespace using ```python ray.init("ray://raycluster-example-head-svc:10001") ``` The Ray Client server can be accessed from a pod in another namespace using ```python ray.init("ray://raycluster-example-head-svc.default.svc.cluster.local:10001") ``` (This assumes the Ray cluster was deployed into the default Kubernetes namespace. If the Ray cluster is deployed in a non-default namespace, use that namespace in place of `default`.) ### Specifying non-default ports. If you wish to override the ports exposed by the Ray head service, you may do so by specifying the Ray head container's `ports` list, under `headGroupSpec`. Here is an example of a list of non-default ports for the Ray head service. ```yaml ports: - containerPort: 6380 name: gcs - containerPort: 8266 name: dashboard - containerPort: 10002 name: client ``` If the head container's `ports` list is specified, the Ray head service will expose precisely the ports in the list. In the above example, the head service will expose just three ports; in particular there will be no port exposed for Ray Serve. For the Ray head to actually use the non-default ports specified in the ports list, you must also specify the relevant `rayStartParams`. For the above example, ```yaml rayStartParams: port: "6380" dashboard-port: "8266" ray-client-server-port: "10002" ... ``` [compressible resource]: https://kubernetes.io/blog/2021/11/26/qos-memory-resources/#:~:text=CPU%20is%20considered%20a%20%22compressible%22%20resource.%20If%20your%20app%20starts%20hitting%20your%20CPU%20limits%2C%20Kubernetes%20starts%20throttling%20your%20container%2C%20giving%20your%20app%20potentially%20worse%20performance.%20However%2C%20it%20won%E2%80%99t%20be%20terminated.%20That%20is%20what%20%22compressible%22%20means [1]: https://home.robusta.dev/blog/stop-using-cpu-limits --- (kuberay-autoscaling)= # KubeRay Autoscaling This guide explains how to configure the Ray Autoscaler on Kubernetes. The Ray Autoscaler is a Ray cluster process that automatically scales a cluster up and down based on resource demand. The Autoscaler does this by adjusting the number of nodes (Ray Pods) in the cluster based on the resources required by tasks, actors, or placement groups. The Autoscaler utilizes logical resource requests, indicated in `@ray.remote` and shown in `ray status`, not the physical machine utilization, to scale. If you launch an actor, task, or placement group and resources are insufficient, the Autoscaler queues the request. It adjusts the number of nodes to meet queue demands and removes idle nodes that have no tasks, actors, or objects over time. ```{admonition} When to use Autoscaling? Autoscaling can reduce workload costs, but adds node launch overheads and can be tricky to configure. 
We recommend starting with non-autoscaling clusters if you're new to Ray. ``` ```{admonition} Ray Autoscaling V2 alpha with KubeRay (@ray 2.10.0) With Ray 2.10, Ray Autoscaler V2 alpha is available with KubeRay. It has improvements on observability and stability. Please see the [section](kuberay-autoscaler-v2) for more details. ``` ## Overview The following diagram illustrates the integration of the Ray Autoscaler with the KubeRay operator. Although depicted as a separate entity for clarity, the Ray Autoscaler is actually a sidecar container within the Ray head Pod in the actual implementation. ```{eval-rst} .. image:: ../images/AutoscalerOperator.svg :align: center .. Find the source document here (https://docs.google.com/drawings/d/1LdOg9JQuN5AOII-vDpSaFBsTeg0JGWcsbyNNLP1yovg/edit) ``` ```{admonition} 3 levels of autoscaling in KubeRay * **Ray actor/task**: Some Ray libraries, like Ray Serve, can automatically adjust the number of Serve replicas (i.e., Ray actors) based on the incoming request volume. * **Ray node**: Ray Autoscaler automatically adjusts the number of Ray nodes (i.e., Ray Pods) based on the resource demand of Ray actors/tasks. * **Kubernetes node**: If the Kubernetes cluster lacks sufficient resources for the new Ray Pods that the Ray Autoscaler creates, the Kubernetes Autoscaler can provision a new Kubernetes node. ***You must configure the Kubernetes Autoscaler yourself.*** ``` * The Autoscaler scales up the cluster through the following sequence of events: 1. A user submits a Ray workload. 2. The Ray head container aggregates the workload resource requirements and communicates them to the Ray Autoscaler sidecar. 3. The Autoscaler decides to add a Ray worker Pod to satisfy the workload's resource requirement. 4. The Autoscaler requests an additional worker Pod by incrementing the RayCluster CR's `replicas` field. 5. The KubeRay operator creates a Ray worker Pod to match the new `replicas` specification. 6. The Ray scheduler places the user's workload on the new worker Pod. * The Autoscaler also scales down the cluster by removing idle worker Pods. If it finds an idle worker Pod, it reduces the count in the RayCluster CR's `replicas` field and adds the identified Pods to the CR's `workersToDelete` field. Then, the KubeRay operator deletes the Pods in the `workersToDelete` field. ## Quickstart ### Step 1: Create a Kubernetes cluster with Kind ```bash kind create cluster --image=kindest/node:v1.26.0 ``` ### Step 2: Install the KubeRay operator Follow [this document](kuberay-operator-deploy) to install the latest stable KubeRay operator via Helm repository. ### Step 3: Create a RayCluster custom resource with autoscaling enabled ```bash kubectl apply -f https://raw.githubusercontent.com/ray-project/kuberay/v1.5.1/ray-operator/config/samples/ray-cluster.autoscaler.yaml ``` ### Step 4: Verify the Kubernetes cluster status ```bash # Step 4.1: List all Ray Pods in the `default` namespace. kubectl get pods -l=ray.io/is-ray-node=yes # [Example output] # NAME READY STATUS RESTARTS AGE # raycluster-autoscaler-head 2/2 Running 0 107s # Step 4.2: Check the ConfigMap in the `default` namespace. kubectl get configmaps # [Example output] # NAME DATA AGE # ray-example 2 21s # ... ``` The RayCluster has one head Pod and zero worker Pods. The head Pod has two containers: a Ray head container and a Ray Autoscaler sidecar container. 
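To confirm that the Autoscaler sidecar is present, you can list the containers in the head Pod. The container names in the comment (`ray-head` and `autoscaler`) match the sample configuration; if you renamed the containers in your CR, the output differs accordingly:

```bash
# Look up the head Pod and print its container names.
export HEAD_POD=$(kubectl get pods --selector=ray.io/node-type=head -o custom-columns=POD:metadata.name --no-headers)
kubectl get pod $HEAD_POD -o jsonpath='{.spec.containers[*].name}'

# [Expected output for the sample config]:
# ray-head autoscaler
```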
Additionally, the [ray-cluster.autoscaler.yaml](https://github.com/ray-project/kuberay/blob/v1.5.1/ray-operator/config/samples/ray-cluster.autoscaler.yaml) includes a ConfigMap named `ray-example` that contains two Python scripts: `detached_actor.py` and `terminate_detached_actor.py`. * `detached_actor.py` is a Python script that creates a detached actor which requires 1 CPU. ```py import ray import sys @ray.remote(num_cpus=1) class Actor: pass ray.init(namespace="default_namespace") Actor.options(name=sys.argv[1], lifetime="detached").remote() ``` * `terminate_detached_actor.py` is a Python script that terminates a detached actor. ```py import ray import sys ray.init(namespace="default_namespace") detached_actor = ray.get_actor(sys.argv[1]) ray.kill(detached_actor) ``` ### Step 5: Trigger RayCluster scale-up by creating detached actors ```bash # Step 5.1: Create a detached actor "actor1" which requires 1 CPU. export HEAD_POD=$(kubectl get pods --selector=ray.io/node-type=head -o custom-columns=POD:metadata.name --no-headers) kubectl exec -it $HEAD_POD -- python3 /home/ray/samples/detached_actor.py actor1 # Step 5.2: The Ray Autoscaler creates a new worker Pod. kubectl get pods -l=ray.io/is-ray-node=yes # [Example output] # NAME READY STATUS RESTARTS AGE # raycluster-autoscaler-head 2/2 Running 0 xxm # raycluster-autoscaler-small-group-worker-yyyyy 1/1 Running 0 xxm # Step 5.3: Create a detached actor which requires 1 CPU. kubectl exec -it $HEAD_POD -- python3 /home/ray/samples/detached_actor.py actor2 kubectl get pods -l=ray.io/is-ray-node=yes # [Example output] # NAME READY STATUS RESTARTS AGE # raycluster-autoscaler-head 2/2 Running 0 xxm # raycluster-autoscaler-small-group-worker-yyyyy 1/1 Running 0 xxm # raycluster-autoscaler-small-group-worker-zzzzz 1/1 Running 0 xxm # Step 5.4: List all actors in the Ray cluster. kubectl exec -it $HEAD_POD -- ray list actors # ======= List: 2023-09-06 13:26:49.228594 ======== # Stats: # ------------------------------ # Total: 2 # Table: # ------------------------------ # ACTOR_ID CLASS_NAME STATE JOB_ID NAME ... # 0 xxxxxxxx Actor ALIVE 02000000 actor1 ... # 1 xxxxxxxx Actor ALIVE 03000000 actor2 ... ``` The Ray Autoscaler generates a new worker Pod for each new detached actor. This is because the `rayStartParams` field in the Ray head specifies `num-cpus: "0"`, preventing the Ray scheduler from scheduling any Ray actors or tasks on the Ray head Pod. In addition, each Ray worker Pod has a capacity of 1 CPU, so the Autoscaler creates a new worker Pod to satisfy the resource requirement of the detached actor which requires 1 CPU. * Using detached actors isn't necessary to trigger cluster scale-up. Normal actors and tasks can also initiate it. [Detached actors](actor-lifetimes) remain persistent even after the job's driver process exits, which is why the Autoscaler doesn't scale down the cluster automatically when the `detached_actor.py` process exits, making it more convenient for this tutorial. * In this RayCluster custom resource, each Ray worker Pod possesses only 1 logical CPU from the perspective of the Ray Autoscaler. Therefore, if you create a detached actor with `@ray.remote(num_cpus=2)`, the Autoscaler doesn't initiate the creation of a new worker Pod because the capacity of the existing Pod is limited to 1 CPU. * (Advanced) The Ray Autoscaler also offers a [Python SDK](ref-autoscaler-sdk), enabling advanced users, like Ray maintainers, to request resources directly from the Autoscaler. Generally, most users don't need to use the SDK. 
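As a minimal sketch of that SDK, the call below asks the Autoscaler to hold capacity for two CPUs' worth of work, independent of the tasks and actors currently queued. Run it from a driver connected to the cluster, for example inside the head Pod:

```python
import ray
from ray.autoscaler.sdk import request_resources

# Connect to the running Ray cluster (for example, from inside the head Pod).
ray.init(address="auto")

# Request that the cluster keep capacity for 2 CPUs available.
request_resources(num_cpus=2)

# Later calls override earlier ones, so requesting 0 CPUs clears the request.
request_resources(num_cpus=0)
```

The Autoscaler treats this purely as a scaling hint; it doesn't schedule any work by itself.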
### Step 6: Trigger RayCluster scale-down by terminating detached actors ```bash # Step 6.1: Terminate the detached actor "actor1". kubectl exec -it $HEAD_POD -- python3 /home/ray/samples/terminate_detached_actor.py actor1 # Step 6.2: A worker Pod will be deleted after `idleTimeoutSeconds` (default 60s) seconds. kubectl get pods -l=ray.io/is-ray-node=yes # [Example output] # NAME READY STATUS RESTARTS AGE # raycluster-autoscaler-head 2/2 Running 0 xxm # raycluster-autoscaler-small-group-worker-zzzzz 1/1 Running 0 xxm # Step 6.3: Terminate the detached actor "actor2". kubectl exec -it $HEAD_POD -- python3 /home/ray/samples/terminate_detached_actor.py actor2 # Step 6.4: A worker Pod will be deleted after `idleTimeoutSeconds` (default 60s) seconds. kubectl get pods -l=ray.io/is-ray-node=yes # [Example output] # NAME READY STATUS RESTARTS AGE # raycluster-autoscaler-head 2/2 Running 0 xxm ``` ### Step 7: Ray Autoscaler observability ```bash # Method 1: "ray status" kubectl exec $HEAD_POD -it -c ray-head -- ray status # [Example output]: # ======== Autoscaler status: 2023-09-06 13:42:46.372683 ======== # Node status # --------------------------------------------------------------- # Healthy: # 1 head-group # Pending: # (no pending nodes) # Recent failures: # (no failures) # Resources # --------------------------------------------------------------- # Usage: # 0B/1.86GiB memory # 0B/514.69MiB object_store_memory # Demands: # (no resource demands) # Method 2: "kubectl logs" kubectl logs $HEAD_POD -c autoscaler | tail -n 20 # [Example output]: # 2023-09-06 13:43:22,029 INFO autoscaler.py:421 -- # ======== Autoscaler status: 2023-09-06 13:43:22.028870 ======== # Node status # --------------------------------------------------------------- # Healthy: # 1 head-group # Pending: # (no pending nodes) # Recent failures: # (no failures) # Resources # --------------------------------------------------------------- # Usage: # 0B/1.86GiB memory # 0B/514.69MiB object_store_memory # Demands: # (no resource demands) # 2023-09-06 13:43:22,029 INFO autoscaler.py:464 -- The autoscaler took 0.036 seconds to complete the update iteration. ``` ### Step 8: Clean up the Kubernetes cluster ```bash # Delete RayCluster and ConfigMap kubectl delete -f https://raw.githubusercontent.com/ray-project/kuberay/v1.5.1/ray-operator/config/samples/ray-cluster.autoscaler.yaml # Uninstall the KubeRay operator helm uninstall kuberay-operator # Delete the kind cluster kind delete cluster ``` (kuberay-autoscaling-config)= ## KubeRay Autoscaling Configurations The [ray-cluster.autoscaler.yaml](https://github.com/ray-project/kuberay/blob/v1.5.1/ray-operator/config/samples/ray-cluster.autoscaler.yaml) used in the quickstart example contains detailed comments about the configuration options. ***It's recommended to read this section in conjunction with the YAML file.*** ### 1. Enabling autoscaling * **`enableInTreeAutoscaling`**: By setting `enableInTreeAutoscaling: true`, the KubeRay operator automatically configures an autoscaling sidecar container for the Ray head Pod. * **`minReplicas` / `maxReplicas` / `replicas`**: Set the `minReplicas` and `maxReplicas` fields to define the range for `replicas` in an autoscaling `workerGroup`. Typically, you would initialize both `replicas` and `minReplicas` with the same value during the deployment of an autoscaling cluster. Subsequently, the Ray Autoscaler adjusts the `replicas` field as it adds or removes Pods from the cluster. ### 2. 
Scale-up and scale-down speed If necessary, you can regulate the pace of adding or removing nodes from the cluster. For applications with numerous short-lived tasks, considering a more conservative approach to adjusting the upscaling and downscaling speeds might be beneficial. Utilize the `RayCluster` CR's `autoscalerOptions` field to accomplish this. This field encompasses the following sub-fields: * **`upscalingMode`**: This controls the rate of scale-up process. The valid values are: - `Conservative`: Upscaling is rate-limited; the number of pending worker Pods is at most the number of worker pods connected to the Ray cluster. - `Default`: Upscaling isn't rate-limited. - `Aggressive`: An alias for Default; upscaling isn't rate-limited. * **`idleTimeoutSeconds`** (default 60s): This denotes the waiting time in seconds before scaling down an idle worker pod. A worker node is idle when it has no active tasks, actors, or referenced objects, either stored in-memory or spilled to disk. ### 3. Autoscaler sidecar container The `autoscalerOptions` field also provides options for configuring the Autoscaler container. Usually, it's not necessary to specify these options. * **`resources`**: The `resources` sub-field of `autoscalerOptions` sets optional resource overrides for the Autoscaler sidecar container. These overrides should be specified in the standard [container resource spec format](https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/Pod-v1/#resources). The default values are indicated below: ```yaml resources: limits: cpu: "500m" memory: "512Mi" requests: cpu: "500m" memory: "512Mi" ``` * **`image`**: This field overrides the Autoscaler container image. The container uses the same **image** as the Ray container by default. * **`imagePullPolicy`**: This field overrides the Autoscaler container's image pull policy. The default is `IfNotPresent`. * **`env`** and **`envFrom`**: These fields specify Autoscaler container environment variables. These fields should be formatted following the [Kubernetes API](https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/Pod-v1/#environment-variables) for container environment variables. ### 4. Set the `rayStartParams` and the resource limits for the Ray container ```{admonition} Resource limits are optional starting from Ray 2.41.0 Starting from Ray 2.41.0, the Ray Autoscaler can read resource specifications from `rayStartParams`, resource limits, or resource requests of the Ray container. You must specify at least one of these fields. Earlier versions only support `rayStartParams` or resource limits, and don't recognize resource requests. ``` ```{admonition} rayStartParams is optional if you're using an autoscaler image with Ray 2.45.0 or later. `rayStartParams` is optional with RayCluster CRD from KubeRay 1.4.0 or later but required in earlier versions. If you omit `rayStartParams` and want to use autoscaling, the autoscaling image must have Ray 2.45.0 or later. ``` The Ray Autoscaler reads the `rayStartParams` field or the Ray container's resource limits in the RayCluster custom resource specification to determine the Ray Pod's resource requirements. The information regarding the number of CPUs is essential for the Ray Autoscaler to scale the cluster. Therefore, without this information, the Ray Autoscaler reports an error and fails to start. 
Take [ray-cluster.autoscaler.yaml](https://github.com/ray-project/kuberay/blob/v1.5.1/ray-operator/config/samples/ray-cluster.autoscaler.yaml) as an example:

* If users set `num-cpus` in `rayStartParams`, the Ray Autoscaler works regardless of the resource limits on the container.
* If users don't set `rayStartParams`, the Ray container must have a specified CPU resource limit.

```yaml
headGroupSpec:
  rayStartParams:
    num-cpus: "0"
  template:
    spec:
      containers:
      - name: ray-head
        resources:
          # The Ray Autoscaler still functions if you comment out the `limits` field for the
          # head container, as users have already specified `num-cpus` in `rayStartParams`.
          limits:
            cpu: "1"
            memory: "2G"
          requests:
            cpu: "1"
            memory: "2G"
  ...
workerGroupSpecs:
- groupName: small-group
  template:
    spec:
      containers:
      - name: ray-worker
        resources:
          limits:
            # The Ray Autoscaler versions older than 2.41.0 will fail to start if the CPU resource limit for the worker
            # container is commented out because `rayStartParams` is empty.
            # The Ray Autoscaler starting from 2.41.0 will not fail but use the resource requests if the resource
            # limits are commented out and `rayStartParams` is empty.
            cpu: "1"
            memory: "1G"
          requests:
            cpu: "1"
            memory: "1G"
```

### 5. Autoscaler environment configuration

You can configure the Ray autoscaler using environment variables specified in the `env` or `envFrom` fields under the `autoscalerOptions` section of your RayCluster custom resource. These variables provide fine-grained control over how the autoscaler behaves internally. For example, `AUTOSCALER_UPDATE_INTERVAL_S` determines how frequently the autoscaler checks the cluster status and decides whether to scale up or down.

For complete examples, see [ray-cluster.autoscaler.yaml](https://github.com/ray-project/kuberay/blob/099bf616c012975031ea9e5bbf7843af03e5f05b/ray-operator/config/samples/ray-cluster.autoscaler.yaml#L28-L33) and [ray-cluster.autoscaler-v2.yaml](https://github.com/ray-project/kuberay/blob/099bf616c012975031ea9e5bbf7843af03e5f05b/ray-operator/config/samples/ray-cluster.autoscaler-v2.yaml#L16-L21).

```yaml
autoscalerOptions:
  env:
  - name: AUTOSCALER_UPDATE_INTERVAL_S
    value: "5"
```

## Next steps

See [(Advanced) Understanding the Ray Autoscaler in the Context of Kubernetes](ray-k8s-autoscaler-comparison) for more details about the relationship between the Ray Autoscaler and Kubernetes autoscalers.

(kuberay-autoscaler-v2)=
### Autoscaler V2 with KubeRay

#### Prerequisites

* KubeRay v1.4.0 and the latest Ray version are the preferred setup for Autoscaler V2.

Ray 2.10.0 introduced the alpha version of Ray Autoscaler V2 integrated with KubeRay, bringing enhancements in terms of observability and stability:

1. **Observability**: The Autoscaler V2 provides instance-level tracing for each Ray worker's lifecycle, making it easier to debug and understand the Autoscaler behavior.
It also reports the idle information about each node, including details on why nodes are idle or active: ```bash > ray status -v ======== Autoscaler status: 2024-03-08 21:06:21.023751 ======== GCS request time: 0.003238s Node status --------------------------------------------------------------- Active: 1 node_40f427230584b2d9c9f113d8db51d10eaf914aa9bf61f81dc7fabc64 Idle: 1 node_2d5fd3d4337ba5b5a8c3106c572492abb9a8de2dee9da7f6c24c1346 Pending: (no pending nodes) Recent failures: (no failures) Resources --------------------------------------------------------------- Total Usage: 1.0/64.0 CPU 0B/72.63GiB memory 0B/33.53GiB object_store_memory Pending Demands: (no resource demands) Node: 40f427230584b2d9c9f113d8db51d10eaf914aa9bf61f81dc7fabc64 Usage: 1.0/32.0 CPU 0B/33.58GiB memory 0B/16.79GiB object_store_memory # New in autoscaler V2: activity information Activity: Busy workers on node. Resource: CPU currently in use. Node: 2d5fd3d4337ba5b5a8c3106c572492abb9a8de2dee9da7f6c24c1346 # New in autoscaler V2: idle information Idle: 107356 ms Usage: 0.0/32.0 CPU 0B/39.05GiB memory 0B/16.74GiB object_store_memory Activity: (no activity) ``` 2. **Stability** Autoscaler V2 makes significant improvements to idle node handling. The V1 autoscaler could stop nodes that became active during termination processing, potentially failing tasks or actors. V2 uses Ray's graceful draining mechanism, which safely stops idle nodes without disrupting ongoing work. [ray-cluster.autoscaler-v2.yaml](https://github.com/ray-project/kuberay/blob/v1.5.1/ray-operator/config/samples/ray-cluster.autoscaler-v2.yaml) is an example YAML file of a RayCluster with Autoscaler V2 enabled that works with the latest KubeRay version. If you're using KubeRay >= 1.4.0, enable V2 by setting `RayCluster.spec.autoscalerOptions.version: v2`. ```yaml spec: enableInTreeAutoscaling: true # Set .spec.autoscalerOptions.version: v2 autoscalerOptions: version: v2 ``` If you're using KubeRay < 1.4.0, enable V2 by setting the `RAY_enable_autoscaler_v2` environment variable in the head and using `restartPolicy: Never` on head and all worker groups. ```yaml spec: enableInTreeAutoscaling: true headGroupSpec: template: spec: containers: - name: ray-head image: rayproject/ray:2.X.Y env: # Set this environment variable - name: RAY_enable_autoscaler_v2 value: "1" restartPolicy: Never # Prevent container restart to maintain Ray health. # Prevent Kubernetes from restarting Ray worker pod containers, enabling correct instance management by Ray. workerGroupSpecs: - replicas: 1 template: spec: restartPolicy: Never ... ``` --- (kuberay-gke-gpu-cluster-setup)= # Start Google Cloud GKE Cluster with GPUs for KubeRay See for full details, or continue reading for a quick start. ## Step 1: Create a Kubernetes cluster on GKE Run this command and all following commands on your local machine or on the [Google Cloud Shell](https://cloud.google.com/shell). If running from your local machine, you need to install the [Google Cloud SDK](https://cloud.google.com/sdk/docs/install). The following command creates a Kubernetes cluster named `kuberay-gpu-cluster` with 1 CPU node in the `us-west1-b` zone. This example uses the `e2-standard-4` machine type, which has 4 vCPUs and 16 GB RAM. 
```sh
gcloud container clusters create kuberay-gpu-cluster \
    --num-nodes=1 --min-nodes 0 --max-nodes 1 --enable-autoscaling \
    --zone=us-west1-b --machine-type e2-standard-4
```

```{admonition} Note
You can also create a cluster from the [Google Cloud Console](https://console.cloud.google.com/kubernetes/list).
```

## Step 2: Create a GPU node pool

Run the following command to create a GPU node pool for Ray GPU workers. You can also create it from the Google Cloud Console:

```sh
gcloud container node-pools create gpu-node-pool \
  --accelerator type=nvidia-l4-vws,count=1 \
  --zone us-west1-b \
  --cluster kuberay-gpu-cluster \
  --num-nodes 1 \
  --min-nodes 0 \
  --max-nodes 1 \
  --enable-autoscaling \
  --machine-type g2-standard-4
```

The `--accelerator` flag specifies the type and number of GPUs for each node in the node pool. This example uses the [NVIDIA L4](https://cloud.google.com/compute/docs/gpus#l4-gpus) GPU. The machine type `g2-standard-4` has 1 GPU, 24 GB GPU memory, 4 vCPUs and 16 GB RAM.

```{admonition} Note
GKE automatically configures taints and tolerations so that only GPU pods are scheduled on GPU nodes. For more details, see the [GKE documentation](https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#create).
```

## Step 3: Configure `kubectl` to connect to the cluster

Run the following command to download Google Cloud credentials and configure the Kubernetes CLI to use them.

```sh
gcloud container clusters get-credentials kuberay-gpu-cluster --zone us-west1-b
```

For more details, see the [GKE documentation](https://cloud.google.com/kubernetes-engine/docs/how-to/cluster-access-for-kubectl).

## Step 4: Install GPU drivers (optional)

If you encounter any issues with the GPU drivers installed by GKE, you can manually install the GPU drivers by following the instructions below.

```sh
# Install NVIDIA GPU device driver
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded-latest.yaml

# Verify that your nodes have allocatable GPUs
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"

# Example output:
# NAME    GPU
# ......
# ...... 1
```

---

(kuberay-gke-tpu-cluster-setup)=
# Start Google Cloud GKE Cluster with TPUs for KubeRay

See the GKE documentation for full details, or continue reading for a quick start.

## Step 1: Create a Kubernetes cluster on GKE

First, set the following environment variables to be used for GKE cluster creation:

```sh
export CLUSTER_NAME=CLUSTER_NAME
export ZONE=ZONE
export CLUSTER_VERSION=CLUSTER_VERSION
```

Replace the following:

- CLUSTER_NAME: The name of the GKE cluster to be created.
- ZONE: The zone with available TPU quota. For a list of TPU availability by zone, see the [GKE documentation](https://cloud.google.com/tpu/docs/regions-zones).
- CLUSTER_VERSION: The GKE version to use. TPU v6e is supported in GKE versions 1.31.2-gke.1115000 or later. See the [GKE documentation](https://cloud.google.com/tpu/docs/tpus-in-gke#tpu-machine-types) for TPU generations and their minimum supported version.

Run the following commands on your local machine or on the [Google Cloud Shell](https://cloud.google.com/shell). If running from your local machine, install the [Google Cloud SDK](https://cloud.google.com/sdk/docs/install).
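For example, with hypothetical values for a single-host v4 TPU setup (substitute your own cluster name, zone, and GKE version):

```sh
# Hypothetical example values only; replace them with your own.
export CLUSTER_NAME=kuberay-tpu-cluster
export ZONE=us-central2-b             # A zone with v4 TPU quota.
export CLUSTER_VERSION=1.31.2-gke.1115000
```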
Create a Standard GKE cluster and enable the Ray Operator: ```sh gcloud container clusters create $CLUSTER_NAME \ --addons=RayOperator \ --machine-type=n1-standard-16 \ --cluster-version=$CLUSTER_VERSION \ --location=$ZONE ``` Run the following command to add a TPU node pool to the cluster. You can also create it from the [Google Cloud Console](https://cloud.google.com/kubernetes-engine/docs/how-to/tpus#console): Create a node pool with a single-host v4 TPU topology as follows: ```sh gcloud container node-pools create v4-4 \ --zone $ZONE \ --cluster $CLUSTER_NAME \ --num-nodes 1 \ --min-nodes 0 \ --max-nodes 10 \ --enable-autoscaling \ --machine-type ct4p-hightpu-4t \ --tpu-topology 2x2x1 ``` - For v4 TPUs, ZONE must be `us-central2-b`. Alternatively, create a multi-host node pool as follows: ```sh gcloud container node-pools create v4-8 \ --zone $ZONE \ --cluster $CLUSTER_NAME \ --num-nodes 2 \ --min-nodes 0 \ --max-nodes 10 \ --enable-autoscaling \ --machine-type ct4p-hightpu-4t \ --tpu-topology 2x2x2 ``` - For v4 TPUs, ZONE must be `us-central2-b`. The `--tpu-topology` flag specifies the physical topology of the TPU Pod slice. This example uses a v4 TPU slice with either a 2x2x1 or 2x2x2 topology. v4 TPUs have 4 chips per VM host, so a 2x2x2 v4 slice has 8 chips total and 2 TPU hosts, each scheduled on their own node. GKE treats multi-host TPU slices as atomic units, and scales them using node pools rather than singular nodes. Therefore, the number of TPU hosts should always equal the number of nodes in the TPU node pool. For more information about selecting a TPU topology and accelerator, see the [GKE documentation](https://cloud.google.com/kubernetes-engine/docs/concepts/tpus). GKE uses Kubernetes node selectors to ensure TPU workloads run on the desired machine type and topology. For more details, see the [GKE documentation](https://cloud.google.com/kubernetes-engine/docs/how-to/tpus#workload_preparation). ## Step 2: Connect to the GKE cluster Run the following command to download Google Cloud credentials and configure the Kubernetes CLI to use them. ```sh gcloud container clusters get-credentials $CLUSTER_NAME --zone $ZONE ``` The remote GKE cluster is now reachable through `kubectl`. For more details, see the [GKE documentation](https://cloud.google.com/kubernetes-engine/docs/how-to/cluster-access-for-kubectl). ### [Optional] Manually install KubeRay and the TPU webhook in a GKE cluster without the Ray Operator Addon: In a cluster without the Ray Operator Addon enabled, KubeRay can be manually installed using [helm](https://ray-project.github.io/kuberay/deploy/helm/) with the following commands: ```sh helm repo add kuberay https://ray-project.github.io/kuberay-helm/ # Install both CRDs and KubeRay operator v1.5.1. helm install kuberay-operator kuberay/kuberay-operator --version 1.5.1 ``` GKE provides a [validating and mutating webhook](https://github.com/ai-on-gke/kuberay-tpu-webhook) to handle TPU Pod scheduling and bootstrap certain environment variables used for [JAX](https://github.com/google/jax) initialization. The Ray TPU webhook requires a KubeRay operator version of at least v1.1.0. GKE automatically installs the Ray TPU webhook through the [Ray Operator Addon](https://cloud.google.com/kubernetes-engine/docs/add-on/ray-on-gke/how-to/enable-ray-on-gke) with GKE versions 1.30.0-gke.1747000 or later. When manually installing the webhook, [cert-manager](https://github.com/cert-manager/cert-manager) is required to handle TLS certificate injection. 
You can install cert-manager in both GKE Standard and Autopilot clusters using the following Helm commands.

Install cert-manager:

```sh
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install --create-namespace --namespace cert-manager --set installCRDs=true --set global.leaderElection.namespace=cert-manager cert-manager jetstack/cert-manager
```

Next, deploy the Ray TPU initialization webhook:

1. `git clone https://github.com/GoogleCloudPlatform/ai-on-gke`
2. `cd ray-on-gke/tpu/kuberay-tpu-webhook`
3. `make deploy deploy-cert`

---

(kuberay-gke-bucket)=
# Configuring KubeRay to use Google Cloud Storage Buckets in GKE

If you are already familiar with Workload Identity in GKE, you can skip this document. The gist is that you need to specify a service account in each of the Ray pods after linking your Kubernetes service account to your Google Cloud service account. Otherwise, read on.

This example is an abridged version of [Authenticate to Google Cloud APIs from GKE workloads](https://docs.cloud.google.com/kubernetes-engine/docs/how-to/workload-identity). The full documentation is worth reading if you are interested in the details.

## Create a Kubernetes cluster on GKE

This example creates a minimal KubeRay cluster using GKE. Run this and all following commands on your local machine or on the [Google Cloud Shell](https://cloud.google.com/shell). If running from your local machine, install the [Google Cloud SDK](https://cloud.google.com/sdk/docs/install).

```bash
PROJECT_ID=my-project-id # Replace my-project-id with your GCP project ID
CLUSTER_NAME=cloud-bucket-cluster
ZONE=us-west1-b
gcloud container clusters create $CLUSTER_NAME \
    --addons=RayOperator \
    --num-nodes=1 --min-nodes 0 --max-nodes 1 --enable-autoscaling \
    --zone=$ZONE --machine-type e2-standard-8 \
    --workload-pool=${PROJECT_ID}.svc.id.goog
```

This command creates a Kubernetes cluster named `cloud-bucket-cluster` with one node in the `us-west1-b` zone. This example uses the `e2-standard-8` machine type, which has 8 vCPUs and 32 GB RAM. For more information on how to find your project ID, see [Identifying projects](https://cloud.google.com/resource-manager/docs/creating-managing-projects#identifying_projects).

Now get credentials for the cluster to use with `kubectl`:

```bash
gcloud container clusters get-credentials $CLUSTER_NAME --zone $ZONE --project $PROJECT_ID
```

## Create a Kubernetes Service Account

```bash
NAMESPACE=default
KSA=my-ksa
kubectl create serviceaccount $KSA -n $NAMESPACE
```

## Configure the GCS Bucket

Create a GCS bucket that Ray uses as the remote filesystem.

```bash
BUCKET=my-bucket
gcloud storage buckets create gs://$BUCKET --uniform-bucket-level-access
```

Bind the `roles/storage.objectUser` role to the Kubernetes service account in the bucket's IAM policy. See [Identifying projects](https://cloud.google.com/resource-manager/docs/creating-managing-projects#identifying_projects) to find your project ID and project number:

```bash
PROJECT_ID=
PROJECT_NUMBER=
gcloud storage buckets add-iam-policy-binding gs://${BUCKET} --member "principal://iam.googleapis.com/projects/${PROJECT_NUMBER}/locations/global/workloadIdentityPools/${PROJECT_ID}.svc.id.goog/subject/ns/${NAMESPACE}/sa/${KSA}" --role "roles/storage.objectUser" ```

See [Authenticate to Google Cloud APIs from GKE workloads](https://docs.cloud.google.com/kubernetes-engine/docs/how-to/workload-identity) for more details.
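As an optional sanity check (a quick sketch that assumes the `gcloud storage` CLI is available in your shell), you can print the bucket's IAM policy and confirm that the Workload Identity principal you just added appears under `roles/storage.objectUser`:

```bash
# Optional: inspect the bucket's IAM policy. The output should list the
# principal://... member bound to roles/storage.objectUser.
gcloud storage buckets get-iam-policy gs://${BUCKET}
```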
## Create a minimal RayCluster YAML manifest You can download the RayCluster YAML manifest for this tutorial with `curl` as follows: ```bash curl -LO https://raw.githubusercontent.com/ray-project/kuberay/v1.5.1/ray-operator/config/samples/ray-cluster.gke-bucket.yaml ``` The key parts are the following lines: ```yaml spec: serviceAccountName: my-ksa nodeSelector: iam.gke.io/gke-metadata-server-enabled: "true" ``` Include these lines in every pod spec of your Ray cluster. This example uses a single-node cluster (1 head node and 0 worker nodes) for simplicity. ## Create the RayCluster ```bash kubectl apply -f ray-cluster.gke-bucket.yaml ``` ## Test GCS bucket access from the RayCluster Use `kubectl get pod` to get the name of the Ray head pod. Then run the following command to get a shell in the Ray head pod: ```bash kubectl exec -it raycluster-mini-head-xxxx -- /bin/bash ``` In the shell, run `pip install google-cloud-storage` to install the Google Cloud Storage Python client library. (For production use cases, you will need to make sure `google-cloud-storage` is installed on every node of your cluster, or use `ray.init(runtime_env={"pip": ["google-cloud-storage"]})` to have the package installed as needed at runtime -- see for more details.) Then run the following Python code to test access to the bucket: ```python import ray import os from google.cloud import storage GCP_GCS_BUCKET = "my-bucket" GCP_GCS_FILE = "test_file.txt" ray.init(address="auto") @ray.remote def check_gcs_read_write(): client = storage.Client() bucket = client.bucket(GCP_GCS_BUCKET) blob = bucket.blob(GCP_GCS_FILE) # Write to the bucket blob.upload_from_string("Hello, Ray on GKE!") # Read from the bucket content = blob.download_as_text() return content result = ray.get(check_gcs_read_write.remote()) print(result) ``` You should see the following output: ```text Hello, Ray on GKE! ``` --- (kuberay-helm-chart-rbac)= # Helm Chart RBAC KubeRay utilizes [Kubernetes Role-Based Access Control (RBAC) resources](https://kubernetes.io/docs/reference/access-authn-authz/rbac/) to grant permissions for monitoring and managing resources. This document describes how to configure the KubeRay Helm chart to create RBAC resources for 3 different use cases. * [Case 1: Watch all namespaces in the Kubernetes cluster](case1-watch-all-namespaces) * [Case 2: Watch the namespace where the operator is deployed](case2-watch-1-namespace) * [Case 3: Watch multiple namespaces in the Kubernetes cluster](case3-watch-multiple-namespaces) ## Parameters You can configure the KubeRay Helm chart to create RBAC resources for different use cases by modifying the following parameters in the [values.yaml](https://github.com/ray-project/kuberay/blob/master/helm-chart/kuberay-operator/values.yaml). Then, you can install the KubeRay Helm chart with the modified **values.yaml**. ```shell # Step 1: Clone the KubeRay repository # Step 2: Modify the helm-chart/kuberay-operator/values.yaml # Step 3: Install the KubeRay Helm chart (path: helm-chart/kuberay-operator) helm install kuberay-operator . ``` * **`rbacEnable`** * If true, the Helm chart creates RBAC resources. If false, it doesn't create any RBAC resources. Default: true. * **`singleNamespaceInstall`** * If true, the Helm chart creates namespace-scoped RBAC resources, that is, Role and RoleBinding. If false, it creates cluster-scoped RBAC resources, that is, ClusterRole and ClusterRoleBinding instead. Default: false. 
* **`watchNamespace`** * A list of namespaces in which the KubeRay operator's informer watches the custom resources. * **`crNamespacedRbacEnable`** * Set to `true` in most cases. Set to `false` in the uncommon case of using a Kubernetes cluster managed by GitOps tools such as ArgoCD. For additional details, refer to [ray-project/kuberay#1162](https://github.com/ray-project/kuberay/pull/1162). Default: true. The [values.yaml](https://github.com/ray-project/kuberay/blob/master/helm-chart/kuberay-operator/values.yaml) file contains detailed descriptions of the parameters. See these pull requests for more context on parameters: * [ray-project/kuberay#1106](https://github.com/ray-project/kuberay/pull/1106) * [ray-project/kuberay#1162](https://github.com/ray-project/kuberay/pull/1162) * [ray-project/kuberay#1190](https://github.com/ray-project/kuberay/pull/1190) (case1-watch-all-namespaces)= ## Case 1: Watch all namespaces in the Kubernetes cluster ![Watch all namespaces in the Kubernetes cluster](../images/rbac-clusterrole.svg) By default, the informer of the KubeRay operator watches all namespaces in the Kubernetes cluster. The operator has cluster-scoped access to create and manage resources, using ClusterRole and ClusterRoleBinding. ```shell # Create a Kubernetes cluster using Kind. kind create cluster --image=kindest/node:v1.26.0 # Create namespaces. kubectl create ns n1 kubectl create ns n2 # Install a KubeRay operator. Use the default `values.yaml` file. # (path: helm-chart/kuberay-operator) helm install kuberay-operator . # Check ClusterRole. kubectl get clusterrole | grep kuberay # kuberay-operator 2023-10-15T04:54:28Z # Check Role. kubectl get role #NAME CREATED AT #kuberay-operator-leader-election 2023-10-15T04:54:28Z # Install RayCluster in the `default`, `n1`, and `n2` namespaces. helm install raycluster kuberay/ray-cluster --version 1.5.1 helm install raycluster kuberay/ray-cluster --version 1.5.1 -n n1 helm install raycluster kuberay/ray-cluster --version 1.5.1 -n n2 # You should create a RayCluster in these 3 namespaces. kubectl get raycluster -A # NAMESPACE NAME DESIRED WORKERS AVAILABLE WORKERS STATUS AGE # default raycluster-kuberay 1 1 ready 73s # n1 raycluster-kuberay 1 1 ready 56s # n2 raycluster-kuberay 1 1 ready 52s ``` (case2-watch-1-namespace)= ## Case 2: Watch the namespace where you deployed the operator ![Watch the namespace where you deployed the operator](../images/rbac-role-one-namespace.svg) The informer of the KubeRay operator watches the namespace where you deployed the operator. The operator has Role and RoleBinding in the same namespace. * Modify the `singleNamespaceInstall` parameter in the `values.yaml` file to `true`. ```shell singleNamespaceInstall: true ``` ```shell # Create a Kubernetes cluster using Kind. kind create cluster --image=kindest/node:v1.26.0 # Create namespaces. kubectl create ns n1 kubectl create ns n2 # Install a KubeRay operator. # Set `singleNamespaceInstall` to true in the `values.yaml` file. # (path: helm-chart/kuberay-operator) helm install kuberay-operator . # Check ClusterRole. kubectl get clusterrole | grep kuberay # (nothing found) # Check Role. kubectl get role --all-namespaces | grep kuberay #default kuberay-operator 2023-10-15T05:18:03Z #default kuberay-operator-leader-election 2023-10-15T05:18:03Z # Install RayCluster in the `default`, `n1`, and `n2` namespaces. 
helm install raycluster kuberay/ray-cluster --version 1.5.1 helm install raycluster kuberay/ray-cluster --version 1.5.1 -n n1 helm install raycluster kuberay/ray-cluster --version 1.5.1 -n n2 # KubeRay only creates a RayCluster in the `default` namespace. kubectl get raycluster -A # NAMESPACE NAME DESIRED WORKERS AVAILABLE WORKERS STATUS AGE # default raycluster-kuberay 1 1 ready 54s # n1 raycluster-kuberay 50s # n2 raycluster-kuberay 44s ``` (case3-watch-multiple-namespaces)= ## Case 3: Watch multiple namespaces in the Kubernetes cluster ![Watch multiple namespaces in the Kubernetes cluster](../images/rbac-role-multi-namespaces.svg) In Case 2, users with only namespaced access deploy a separate KubeRay operator for each namespace. This approach can increase maintenance overhead, especially when upgrading versions for each deployed instance. Case 3 creates Role and RoleBinding for multiple namespaces, allowing a single KubeRay operator to monitor several namespaces. * Modify the `singleNamespaceInstall` and `watchNamespace` parameters in the `values.yaml` file. ```shell # Set in the `value.yaml` file. singleNamespaceInstall: true # Set the namespaces list. watchNamespace: - n1 - n2 ``` ```shell # Create a Kubernetes cluster using Kind. kind create cluster --image=kindest/node:v1.26.0 # Create namespaces. kubectl create ns n1 kubectl create ns n2 # Install a KubeRay operator. # Set `singleNamespaceInstall` and `watchNamespace` in the `values.yaml` file. # (path: helm-chart/kuberay-operator) helm install kuberay-operator . # Check ClusterRole kubectl get clusterrole | grep kuberay # (nothing found) # Check Role. kubectl get role --all-namespaces | grep kuberay #default kuberay-operator-leader-election 2023-10-15T05:34:27Z #n1 kuberay-operator 2023-10-15T05:34:27Z #n2 kuberay-operator 2023-10-15T05:34:27Z # Install RayCluster in the `default`, `n1`, and `n2` namespaces. helm install raycluster kuberay/ray-cluster --version 1.5.1 helm install raycluster kuberay/ray-cluster --version 1.5.1 -n n1 helm install raycluster kuberay/ray-cluster --version 1.5.1 -n n2 # KubeRay creates a RayCluster only in the `n1` and `n2` namespaces. kubectl get raycluster -A # NAMESPACE NAME DESIRED WORKERS AVAILABLE WORKERS STATUS AGE # default raycluster-kuberay 74s # n1 raycluster-kuberay 1 1 ready 70s # n2 raycluster-kuberay 1 1 ready 67s ``` --- (ray-k8s-autoscaler-comparison)= # (Advanced) Understanding the Ray Autoscaler in the Context of Kubernetes We describe the relationship between the Ray autoscaler and other autoscalers in the Kubernetes ecosystem. ## Ray Autoscaler vs. Horizontal Pod Autoscaler The Ray autoscaler adjusts the number of Ray nodes in a Ray cluster. On Kubernetes, each Ray node is run as a Kubernetes Pod. Thus in the context of Kubernetes, the Ray autoscaler scales Ray **Pod quantities**. In this sense, the Ray autoscaler plays a role similar to that of the Kubernetes [Horizontal Pod Autoscaler](https://kubernetes.io/docs/tasks/run-application/horizontal-Pod-autoscale/) (HPA). However, the following features distinguish the Ray Autoscaler from the HPA. ### Load metrics are based on application semantics The Horizontal Pod Autoscaler determines scale based on physical usage metrics like CPU and memory. By contrast, the Ray autoscaler uses the logical resources expressed in task and actor annotations. 
For instance, if each Ray container spec in your RayCluster CR indicates a limit of 10 CPUs, and you submit twenty tasks annotated with `@ray.remote(num_cpus=5)`, 10 Ray Pods are created to satisfy the 100-CPU resource demand. In this respect, the Ray autoscaler is similar to the [Kubernetes Cluster Autoscaler](https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler), which makes scaling decisions based on the logical resources expressed in container resource requests.

### Fine-grained control of scale-down

To accommodate the statefulness of Ray applications, the Ray autoscaler has more fine-grained control over scale-down than the Horizontal Pod Autoscaler. In addition to determining desired scale, the Ray Autoscaler is able to select precisely which Pods to scale down. The KubeRay operator then deletes those Pods. By contrast, the Horizontal Pod Autoscaler can only decrease a replica count, without much control over which Pods are deleted. For a Ray application, downscaling a random Pod could be dangerous.

### Architecture: One Ray Autoscaler per Ray Cluster

Horizontal Pod Autoscaling is centrally controlled by a manager in the Kubernetes control plane; the manager controls the scale of many Kubernetes objects. By contrast, each Ray cluster is managed by its own Ray autoscaler process, running as a sidecar container in the Ray head Pod. This design choice is motivated by the following considerations:

- **Scalability.** Autoscaling each Ray cluster requires processing a significant volume of resource data from that Ray cluster.
- **Simplified versioning and compatibility.** The autoscaler and Ray are both developed as part of the Ray repository. The interface between the autoscaler and the Ray core is complex. To support multiple Ray clusters running at different Ray versions, it is thus best to match Ray and Autoscaler code versions. Running one autoscaler per Ray cluster and matching the code versions ensures compatibility.

(kuberay-autoscaler-with-ray-autoscaler)=
## Ray Autoscaler with Kubernetes Cluster Autoscaler

The Ray Autoscaler and the [Kubernetes Cluster Autoscaler](https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler) complement each other. After the Ray autoscaler decides to create a Ray Pod, the Kubernetes Cluster Autoscaler can provision a Kubernetes node so that the Pod can be placed. Similarly, after the Ray autoscaler decides to delete an idle Pod, the Kubernetes Cluster Autoscaler can clean up the idle Kubernetes node that remains.

It is recommended to configure your RayCluster so that only one Ray Pod fits per Kubernetes node. If you follow this pattern, Ray Autoscaler Pod scaling events correspond roughly one-to-one with cluster autoscaler node scaling events. (We say "roughly" because it is possible for a Ray Pod to be deleted and replaced with a new Ray Pod before the underlying Kubernetes node is scaled down.)

## Vertical Pod Autoscaler

There is no relationship between the Ray Autoscaler and the Kubernetes [Vertical Pod Autoscaler](https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler) (VPA), which is meant to size individual Pods appropriately based on current and past usage. If you find that the load on your individual Ray Pods is too high, there are a number of manual techniques to decrease the load. One method is to schedule fewer tasks/actors per node by increasing the resource requirements specified in the `ray.remote` annotation.
For example, changing `@ray.remote(num_cpus=2)` to `@ray.remote(num_cpus=4)` will halve the quantity of that task or actor that can fit in a given Ray Pod. --- (kuberay-k8s-setup)= # Managed Kubernetes services ```{toctree} :hidden: aws-eks-gpu-cluster gcp-gke-gpu-cluster gcp-gke-tpu-cluster azure-aks-gpu-cluster ack-gpu-cluster ``` Most KubeRay documentation examples only require a local Kubernetes cluster such as [Kind](https://kind.sigs.k8s.io/). Some KubeRay examples require GPU nodes, which can be provided by a managed Kubernetes service. We collect a few helpful links for users who are getting started with a managed Kubernetes service to launch a Kubernetes cluster equipped with GPUs. (gke-setup)= # Set up a GKE cluster (Google Cloud) - {ref}`kuberay-gke-gpu-cluster-setup` - {ref}`kuberay-gke-tpu-cluster-setup` (eks-setup)= # Set up an EKS cluster (AWS) - {ref}`kuberay-eks-gpu-cluster-setup` (aks-setup)= # Set up an AKS cluster (Microsoft Azure) - {ref}`kuberay-aks-gpu-cluster-setup` # Set up an ACK cluster (Alibaba Cloud) - {ref}`kuberay-ack-gpu-cluster-setup` --- (kubectl-plugin)= # Use kubectl plugin (beta) Starting from KubeRay v1.3.0, you can use the `kubectl ray` plugin to simplify common workflows when deploying Ray on Kubernetes. If you aren't familiar with Kubernetes, this plugin simplifies running Ray on Kubernetes. ## Installation See [KubeRay kubectl Plugin](https://github.com/ray-project/kuberay/tree/master/kubectl-plugin) to install the plugin. Install the KubeRay kubectl plugin using one of the following methods: - Install using Krew kubectl plugin manager (recommended) - Download from GitHub releases ```{admonition} Plugin since 1.4.0 may be incompatible with KubeRay before 1.4.0 :class: warning Plugin versions since 1.4.0 may be incompatible with KubeRay versions before 1.4.0. Try to use the same plugin and KubeRay versions. ``` ### Install using the Krew kubectl plugin manager (recommended) 1. Install [Krew](https://krew.sigs.k8s.io/docs/user-guide/setup/install/). 2. Download the plugin list by running `kubectl krew update`. 3. Install the plugin by running `kubectl krew install ray`. ### Download from GitHub releases Go to the [releases page](https://github.com/ray-project/kuberay/releases) and download the binary for your platform. For example, to install kubectl plugin version 1.5.1 on Linux amd64: ```bash curl -LO https://github.com/ray-project/kuberay/releases/download/v1.5.1/kubectl-ray_v1.5.1_linux_amd64.tar.gz tar -xvf kubectl-ray_v1.5.1_linux_amd64.tar.gz cp kubectl-ray ~/.local/bin ``` Replace `~/.local/bin` with the directory in your `PATH`. ## Shell Completion Follow the instructions for installing and enabling [kubectl plugin-completion] ## Usage After installing the plugin, you can use `kubectl ray --help` to see the available commands and options. ## Examples Assume that you have installed the KubeRay operator. If not, follow the [KubeRay Operator Installation](kuberay-operator-deploy) to install the latest stable KubeRay operator by Helm repository. ### Example 1: RayCluster Management The `kubectl ray create cluster` command allows you to create a valid RayCluster without an existing YAML file. 
The default values are follows (empty values mean unset): | Parameter | Default | |-----------------------------------------------------|--------------------------------| | K8s labels | | | K8s annotations | | | ray version | 2.46.0 | | ray image | rayproject/ray:\ | | head CPU | 2 | | head memory | 4Gi | | head GPU | 0 | | head ephemeral storage | | | head `ray start` parameters | | | head node selectors | | | worker replicas | 1 | | worker CPU | 2 | | worker memory | 4Gi | | worker GPU | 0 | | worker TPU | 0 | | worker ephemeral storage | | | worker `ray start` parameters | | | worker node selectors | | | Number of hosts in default worker group per replica | 1 | | Autoscaler version (v1 or v2) | | ```text $ kubectl ray create cluster raycluster-sample Created Ray Cluster: raycluster-sample ``` You can override the default values by specifying the flags. For example, to create a RayCluster with 2 workers: ```text $ kubectl ray create cluster raycluster-sample-2 --worker-replicas 2 Created Ray Cluster: raycluster-sample-2 ``` You can also override the default values with a config file. For example, the following config file sets the worker CPU to 3. ```text $ curl -LO https://raw.githubusercontent.com/ray-project/kuberay/refs/tags/v1.5.1/kubectl-plugin/config/samples/create-cluster.sample.yaml $ kubectl ray create cluster raycluster-sample-3 --file create-cluster.sample.yaml Created Ray Cluster: raycluster-sample-3 ``` See https://github.com/ray-project/kuberay/blob/v1.5.1/kubectl-plugin/config/samples/create-cluster.complete.yaml for the complete list of parameters that you can set in the config file. By default it only creates one worker group. You can use `kubectl ray create workergroup` to add additional worker groups to existing RayClusters. ```text $ kubectl ray create workergroup example-group --ray-cluster raycluster-sample --worker-memory 5Gi ``` You can use `kubectl ray get cluster`, `kubectl ray get workergroup`, and `kubectl ray get node` to get the status of RayClusters, worker groups, and Ray nodes, respectively. ```text $ kubectl ray get cluster NAME NAMESPACE DESIRED WORKERS AVAILABLE WORKERS CPUS GPUS TPUS MEMORY AGE raycluster-sample default 2 2 6 0 0 13Gi 3m56s raycluster-sample-2 default 2 2 6 0 0 12Gi 3m51s $ kubectl ray get workergroup NAME REPLICAS CPUS GPUS TPUS MEMORY CLUSTER default-group 1/1 2 0 0 4Gi raycluster-sample example-group 1/1 2 0 0 5Gi raycluster-sample default-group 2/2 4 0 0 8Gi raycluster-sample-2 $ kubectl ray get nodes NAME CPUS GPUS TPUS MEMORY CLUSTER TYPE WORKER GROUP AGE raycluster-sample-default-group-4lb5w 2 0 0 4Gi raycluster-sample worker default-group 3m56s raycluster-sample-example-group-vnkkc 2 0 0 5Gi raycluster-sample worker example-group 3m56s raycluster-sample-head-vplcq 2 0 0 4Gi raycluster-sample head headgroup 3m56s raycluster-sample-2-default-group-74nd4 2 0 0 4Gi raycluster-sample-2 worker default-group 3m51s raycluster-sample-2-default-group-vnkkc 2 0 0 4Gi raycluster-sample-2 worker default-group 3m51s raycluster-sample-2-head-pwsrm 2 0 0 4Gi raycluster-sample-2 head headgroup 3m51s ``` You can scale a cluster's worker group like so. 
```shell $ kubectl ray scale cluster raycluster-sample \ --worker-group default-group \ --replicas 2 Scaled worker group default-group in Ray cluster raycluster-sample in namespace default from 1 to 2 replicas # verify the worker group scaled up $ kubectl ray get workergroup default-group --ray-cluster raycluster-sample NAME REPLICAS CPUS GPUS TPUS MEMORY CLUSTER default-group 2/2 4 0 0 8Gi raycluster-sample ``` The `kubectl ray session` command can forward local ports to Ray resources, allowing users to avoid remembering which ports Ray resources exposes. ```text $ kubectl ray session raycluster-sample Forwarding ports to service raycluster-sample-head-svc Ray Dashboard: http://localhost:8265 Ray Interactive Client: http://localhost:10001 ``` And then you can open [http://localhost:8265](http://localhost:8265) in your browser to access the dashboard. The `kubectl ray log` command can download logs from RayClusters to local directories. ```text $ kubectl ray log raycluster-sample No output directory specified, creating dir under current directory using resource name. Command set to retrieve both head and worker node logs. Downloading log for Ray Node raycluster-sample-default-group-worker-b2k7h Downloading log for Ray Node raycluster-sample-example-group-worker-sfdp7 Downloading log for Ray Node raycluster-sample-head-k5pj8 ``` It creates a folder named `raycluster-sample` in the current directory containing the logs of the RayCluster. Use `kubectl ray delete` command to clean up the resources. ```text $ kubectl ray delete raycluster-sample $ kubectl ray delete raycluster-sample-2 ``` ### Example 2: RayJob Submission `kubectl ray job submit` is a wrapper around the `ray job submit` command. It can automatically forward the ports to the Ray cluster and submit the job. This command can also provision an ephemeral cluster if the user doesn't provide a RayJob. Assume that under the current directory, you have a file named `sample_code.py`. ```python import ray ray.init(address="auto") @ray.remote def f(x): return x * x futures = [f.remote(i) for i in range(4)] print(ray.get(futures)) # [0, 1, 4, 9] ``` #### Submit a Ray job without a YAML file You can submit a RayJob without specifying a YAML file. The command generates a RayJob based on the following: | Parameter | Default | |-----------------------------------------------|--------------------------------| | ray version | 2.46.0 | | ray image | rayproject/ray:\ | | head CPU | 2 | | head memory | 4Gi | | head GPU | 0 | | worker replicas | 1 | | worker CPU | 2 | | worker memory | 4Gi | | worker GPU | 0 | | TTL to clean up RayClsuter after job finished | 0 | ```text $ kubectl ray job submit --name rayjob-sample --working-dir . -- python sample_code.py Submitted RayJob rayjob-sample. Waiting for RayCluster ... 2025-01-06 11:53:34,806 INFO worker.py:1634 -- Connecting to existing Ray cluster at address: 10.12.0.9:6379... 2025-01-06 11:53:34,814 INFO worker.py:1810 -- Connected to Ray cluster. View the dashboard at 10.12.0.9:8265 [0, 1, 4, 9] 2025-01-06 11:53:38,368 SUCC cli.py:63 -- ------------------------------------------ 2025-01-06 11:53:38,368 SUCC cli.py:64 -- Job 'raysubmit_9NfCvwcmcyMNFCvX' succeeded 2025-01-06 11:53:38,368 SUCC cli.py:65 -- ------------------------------------------ ``` You can also designate a specific RayJob YAML to submit a Ray job. 
```text $ wget https://raw.githubusercontent.com/ray-project/kuberay/refs/heads/master/ray-operator/config/samples/ray-job.interactive-mode.yaml ``` Note that in the RayJob spec, `submissionMode` is `InteractiveMode`. ```text $ kubectl ray job submit -f ray-job.interactive-mode.yaml --working-dir . -- python sample_code.py Submitted RayJob rayjob-interactive-mode. Waiting for RayCluster ... 2025-01-06 12:44:43,542 INFO worker.py:1634 -- Connecting to existing Ray cluster at address: 10.12.0.10:6379... 2025-01-06 12:44:43,551 INFO worker.py:1810 -- Connected to Ray cluster. View the dashboard at 10.12.0.10:8265 [0, 1, 4, 9] 2025-01-06 12:44:47,830 SUCC cli.py:63 -- ------------------------------------------ 2025-01-06 12:44:47,830 SUCC cli.py:64 -- Job 'raysubmit_fuBdjGnecFggejhR' succeeded 2025-01-06 12:44:47,830 SUCC cli.py:65 -- ------------------------------------------ ``` Use `kubectl ray delete` command to clean up the resources. ```text $ kubectl ray delete rayjob/rayjob-sample $ kubectl ray delete rayjob/rayjob-interactive-mode ``` [kubectl plugin-completion]: https://github.com/marckhouzam/kubectl-plugin_completion?tab=readme-ov-file#tldr --- (kuberay-auth)= # Configure Ray clusters to use token authentication This guide demonstrates how to enable Ray token authentication with KubeRay. ## Prerequisites * A Kubernetes cluster. This guide uses GKE, but the concepts apply to other Kubernetes distributions. * `kubectl` installed and configured to interact with your cluster. * `gcloud` CLI installed and configured, if using GKE. * [Helm](https://helm.sh/) installed. * Ray 2.52.0 or newer. ## Create or use an existing GKE Cluster If you don't have a Kubernetes cluster, create one using the following command, or adapt it for your cloud provider: ```bash gcloud container clusters create kuberay-cluster \ --num-nodes=2 --zone=us-west1-b --machine-type e2-standard-4 ``` ## Install the KubeRay Operator Follow [Deploy a KubeRay operator](kuberay-operator-deploy) to install the latest stable KubeRay operator from the Helm repository. ## Deploy a Ray cluster with token authentication If you are using KubeRay v1.5.1 or newer, you can use the `authOptions` API in RayCluster to enable token authentication: ```bash kubectl apply -f https://raw.githubusercontent.com/ray-project/kuberay/refs/heads/master/ray-operator/config/samples/ray-cluster.auth.yaml ``` When enabled, the KubeRay operator will: * Create a Kubernetes Secret containing a randomly generated token. * Automatically set the `RAY_AUTH_TOKEN` and `RAY_AUTH_MODE` environment variables on all Ray containers. If you are using a KubeRay version older than v1.5.1, you can enable token authentication by creating a Kubernetes Secret containing your token and configuring the `RAY_AUTH_MODE` and `RAY_AUTH_TOKEN` environment variables. ```bash kubectl create secret generic ray-cluster-with-auth --from-literal=auth_token=$(openssl rand -base64 32) kubectl apply -f https://raw.githubusercontent.com/ray-project/kuberay/refs/heads/master/ray-operator/config/samples/ray-cluster.auth-manual.yaml ``` ## Verify initial unauthenticated access Attempt to submit a Ray job to the cluster to verify that authentication is required. 
You should receive a `401 Unauthorized` error:

```bash
kubectl port-forward svc/ray-cluster-with-auth-head-svc 8265:8265 &
ray job submit --address http://localhost:8265 -- python -c "import ray; ray.init(); print(ray.cluster_resources())"
```

You should see an error similar to this:

```bash
RuntimeError: Authentication required: Unauthorized: Missing authentication token

The Ray cluster requires authentication, but no token was provided.

Please provide an authentication token using one of these methods:
1. Set the `RAY_AUTH_TOKEN` environment variable.
2. Set the `RAY_AUTH_TOKEN_PATH` environment variable (pointing to a file containing the token).
3. Create a token file at the default location: `~/.ray/auth_token`.
```

This error confirms that the Ray cluster requires authentication.

## Accessing your Ray cluster with the Ray CLI

To access your Ray cluster using the Ray CLI, you need to configure the following environment variables:

* `RAY_AUTH_MODE`: configures the Ray CLI to set the necessary authorization headers for token authentication.
* `RAY_AUTH_TOKEN`: contains the token to use for authentication.
* `RAY_AUTH_TOKEN_PATH`: if `RAY_AUTH_TOKEN` isn't set, the Ray CLI reads the token from this path instead (defaults to `~/.ray/auth_token`).

Submit a job with an authenticated Ray CLI:

```bash
export RAY_AUTH_MODE=token
export RAY_AUTH_TOKEN=$(kubectl get secrets ray-cluster-with-auth --template={{.data.auth_token}} | base64 -d)
ray job submit --address http://localhost:8265 -- python -c "import ray; ray.init(); print(ray.cluster_resources())"
```

The job should now succeed and you should see output similar to this:

```bash
Job submission server address: http://localhost:8265

-------------------------------------------------------
Job 'raysubmit_...' submitted successfully
-------------------------------------------------------

Next steps
  Query the logs of the job:
    ray job logs raysubmit_n2fq2Ui7cbh3p2Js
  Query the status of the job:
    ray job status raysubmit_n2fq2Ui7cbh3p2Js
  Request the job to be stopped:
    ray job stop raysubmit_n2fq2Ui7cbh3p2Js

Tailing logs until the job exits (disable with --no-wait):
...
{'node:10.112.0.52': 1.0, 'memory': ..., 'node:__internal_head__': 1.0, 'object_store_memory': ..., 'CPU': 4.0, 'node:10.112.1.49': 1.0, 'node:10.112.2.36': 1.0}

------------------------------------------
Job 'raysubmit_...' succeeded
------------------------------------------
```

## Viewing the Ray dashboard (optional)

To view the Ray dashboard from your browser, first port-forward from your local machine to the cluster:

```bash
kubectl port-forward svc/ray-cluster-with-auth-head-svc 8265:8265 &
```

Then open `localhost:8265` in your browser. You will be prompted to provide the auth token for the cluster, which you can retrieve with:

```bash
kubectl get secrets ray-cluster-with-auth --template={{.data.auth_token}} | base64 -d
```

---

(kuberay-dashboard)=
# Use KubeRay dashboard (experimental)

Starting from KubeRay v1.4.0, you can use the open source dashboard UI for KubeRay. This component is still experimental and not considered ready for production, but feedback is welcome.

The KubeRay dashboard is a web-based UI that allows you to view and manage KubeRay resources running on your Kubernetes cluster. It's different from the Ray dashboard, which is a part of the Ray cluster itself. The KubeRay dashboard provides a centralized view of all KubeRay resources.
## Installation The KubeRay dashboard depends on the optional `kuberay-apiserver` that you need to install. For simplicity, this guide disables the security proxy and allows all origins for Cross-Origin Resource Sharing. ```bash helm install kuberay-apiserver kuberay/kuberay-apiserver --version v1.5.1 --set security= --set cors.allowOrigin='*' ``` And you need to port-forward the `kuberay-apiserver` service because the dashboard currently sends requests to `http://localhost:31888`: ```bash kubectl port-forward svc/kuberay-apiserver-service 31888:8888 ``` Install the KubeRay dashboard: ```bash kubectl run kuberay-dashboard --image=quay.io/kuberay/dashboard:v1.5.1 ``` Port-forward the KubeRay dashboard: ```bash kubectl port-forward kuberay-dashboard 3000:3000 ``` Go to `http://localhost:3000/ray/jobs` to see the list of Ray jobs. It's empty for now. You can create a RayJob custom resource to see how it works. ```bash kubectl apply -f https://raw.githubusercontent.com/ray-project/kuberay/v1.5.1/ray-operator/config/samples/ray-job.sample.yaml ``` The KubeRay dashboard only shows RayJob custom resources that the KubeRay API server creates. This guide simulates the API server by labeling the RayJob. ```bash kubectl label rayjob rayjob-sample app.kubernetes.io/managed-by=kuberay-apiserver ``` Go to `http://localhost:3000/ray/jobs` again. You can see `rayjob-sample` in the list of RayJob custom resources. ![KubeRay Dashboard List of RayJobs](./images/kuberay-dashboard-rayjobs.png) --- (kuberay-gcs-ft)= # GCS fault tolerance in KubeRay Global Control Service (GCS) manages cluster-level metadata. By default, the GCS lacks fault tolerance as it stores all data in-memory, and a failure can cause the entire Ray cluster to fail. To make the GCS fault tolerant, you must have a high-availability Redis. This way, in the event of a GCS restart, it retrieves all the data from the Redis instance and resumes its regular functions. ```{admonition} Fate-sharing without GCS fault tolerance Without GCS fault tolerance, the Ray cluster, the GCS process, and the Ray head Pod are fate-sharing. If the GCS process dies, the Ray head Pod dies as well after `RAY_gcs_rpc_server_reconnect_timeout_s` seconds. If the Ray head Pod is restarted according to the Pod's `restartPolicy`, worker Pods attempt to reconnect to the new head Pod. The worker Pods are terminated by the new head Pod; without GCS fault tolerance enabled, the cluster state is lost, and the worker Pods are perceived as "unknown workers" by the new head Pod. This is adequate for most Ray applications; however, it is not ideal for Ray Serve, especially if high availability is crucial for your use cases. Hence, we recommend enabling GCS fault tolerance on the RayService custom resource to ensure high availability. See {ref}`Ray Serve end-to-end fault tolerance documentation ` for more information. ``` ```{seealso} If you need fault tolerance for Redis as well, see {ref}`Tuning Redis for a Persistent Fault Tolerant GCS `. ``` ## Use cases * **Ray Serve**: The recommended configuration is enabling GCS fault tolerance on the RayService custom resource to ensure high availability. See {ref}`Ray Serve end-to-end fault tolerance documentation ` for more information. * **Other workloads**: GCS fault tolerance isn't recommended and the compatibility isn't guaranteed. 
## Prerequisites * Ray 2.0.0+ * KubeRay 1.3.0+ * Redis: single shard Redis Cluster or Redis Sentinel, one or multiple replicas ## Quickstart ### Step 1: Create a Kubernetes cluster with Kind ```sh kind create cluster --image=kindest/node:v1.26.0 ``` ### Step 2: Install the KubeRay operator Follow [this document](kuberay-operator-deploy) to install the latest stable KubeRay operator via Helm repository. ### Step 3: Install a RayCluster with GCS FT enabled ```sh curl -LO https://raw.githubusercontent.com/ray-project/kuberay/v1.5.1/ray-operator/config/samples/ray-cluster.external-redis.yaml kubectl apply -f ray-cluster.external-redis.yaml ``` ### Step 4: Verify the Kubernetes cluster status ```sh # Step 4.1: List all Pods in the `default` namespace. # The expected output should be 4 Pods: 1 head, 1 worker, 1 KubeRay, and 1 Redis. kubectl get pods # [Example output] # NAME READY STATUS RESTARTS AGE # kuberay-operator-6bc45dd644-ktbnh 1/1 Running 0 3m4s # raycluster-external-redis-head 1/1 Running 0 2m41s # raycluster-external-redis-small-group-worker-dwt98 1/1 Running 0 2m41s # redis-6cf756c755-qljcv 1/1 Running 0 2m41s # Step 4.2: List all ConfigMaps in the `default` namespace. kubectl get configmaps # [Example output] # NAME DATA AGE # kube-root-ca.crt 1 3m4s # ray-example 2 5m45s # redis-config 1 5m45s ``` The [ray-cluster.external-redis.yaml](https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.external-redis.yaml) file defines Kubernetes resources for RayCluster, Redis, and ConfigMaps. There are two ConfigMaps in this example: `ray-example` and `redis-config`. The `ray-example` ConfigMap houses two Python scripts: `detached_actor.py` and `increment_counter.py`. * `detached_actor.py` is a Python script that creates a detached actor with the name, `counter_actor`. ```python import ray @ray.remote(num_cpus=1) class Counter: def __init__(self): self.value = 0 def increment(self): self.value += 1 return self.value ray.init(namespace="default_namespace") Counter.options(name="counter_actor", lifetime="detached").remote() ``` * `increment_counter.py` is a Python script that increments the counter. ```python import ray ray.init(namespace="default_namespace") counter = ray.get_actor("counter_actor") print(ray.get(counter.increment.remote())) ``` ### Step 5: Create a detached actor ```sh # Step 5.1: Create a detached actor with the name, `counter_actor`. export HEAD_POD=$(kubectl get pods --selector=ray.io/node-type=head -o custom-columns=POD:metadata.name --no-headers) kubectl exec -it $HEAD_POD -- python3 /home/ray/samples/detached_actor.py # 2025-04-18 02:51:25,359 INFO worker.py:1514 -- Using address 127.0.0.1:6379 set in the environment variable RAY_ADDRESS # 2025-04-18 02:51:25,361 INFO worker.py:1654 -- Connecting to existing Ray cluster at address: 10.244.0.8:6379... # 2025-04-18 02:51:25,557 INFO worker.py:1832 -- Connected to Ray cluster. View the dashboard at 10.244.0.8:8265 # Step 5.2: Increment the counter. kubectl exec -it $HEAD_POD -- python3 /home/ray/samples/increment_counter.py # 2025-04-18 02:51:29,069 INFO worker.py:1514 -- Using address 127.0.0.1:6379 set in the environment variable RAY_ADDRESS # 2025-04-18 02:51:29,072 INFO worker.py:1654 -- Connecting to existing Ray cluster at address: 10.244.0.8:6379... # 2025-04-18 02:51:29,198 INFO worker.py:1832 -- Connected to Ray cluster. 
View the dashboard at 10.244.0.8:8265 # 1 ``` (kuberay-external-storage-namespace-example)= ### Step 6: Check the data in Redis ```sh # Step 6.1: Check the RayCluster's UID. kubectl get rayclusters.ray.io raycluster-external-redis -o=jsonpath='{.metadata.uid}' # [Example output]: 864b004c-6305-42e3-ac46-adfa8eb6f752 # Step 6.2: Check the head Pod's environment variable `RAY_external_storage_namespace`. kubectl get pods $HEAD_POD -o=jsonpath='{.spec.containers[0].env}' | jq # [Example output]: # [ # { # "name": "RAY_external_storage_namespace", # "value": "864b004c-6305-42e3-ac46-adfa8eb6f752" # }, # ... # ] # Step 6.3: Log into the Redis Pod. # The password `5241590000000000` is defined in the `redis-config` ConfigMap. # Step 6.4: Check the keys in Redis. # Note: the schema changed in Ray 2.38.0. Previously we use a single HASH table, # now we use multiple HASH tables with a common prefix. export REDIS_POD=$(kubectl get pods --selector=app=redis -o custom-columns=POD:metadata.name --no-headers) kubectl exec -it $REDIS_POD -- env REDISCLI_AUTH="5241590000000000" redis-cli KEYS "*" # [Example output]: # 1) "RAY864b004c-6305-42e3-ac46-adfa8eb6f752@INTERNAL_CONFIG" # 2) "RAY864b004c-6305-42e3-ac46-adfa8eb6f752@KV" # 3) "RAY864b004c-6305-42e3-ac46-adfa8eb6f752@NODE" # [Example output Before Ray 2.38.0]: # 2) "864b004c-6305-42e3-ac46-adfa8eb6f752" # # Step 6.5: Check the value of the key. kubectl exec -it $REDIS_POD -- env REDISCLI_AUTH="5241590000000000" redis-cli HGETALL RAY864b004c-6305-42e3-ac46-adfa8eb6f752@NODE # Before Ray 2.38.0: # HGETALL 864b004c-6305-42e3-ac46-adfa8eb6f752 ``` In [ray-cluster.external-redis.yaml](https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.external-redis.yaml), the `gcsFaultToleranceOptions.externalStorageNamespace` option isn't set for the RayCluster. Therefore, KubeRay automatically injects the environment variable `RAY_external_storage_namespace` to all Ray Pods managed by the RayCluster with the RayCluster's UID as the external storage namespace by default. See [this section](kuberay-external-storage-namespace) to learn more about the option. ### Step 7: Kill the GCS process in the head Pod ```sh # Step 7.1: Check the `RAY_gcs_rpc_server_reconnect_timeout_s` environment variable in both the head Pod and worker Pod. kubectl get pods $HEAD_POD -o=jsonpath='{.spec.containers[0].env}' | jq # [Expected result]: # No `RAY_gcs_rpc_server_reconnect_timeout_s` environment variable is set. Hence, the Ray head uses its default value of `60`. export YOUR_WORKER_POD=$(kubectl get pods -l ray.io/group=small-group -o jsonpath='{.items[0].metadata.name}') kubectl get pods $YOUR_WORKER_POD -o=jsonpath='{.spec.containers[0].env}' | jq # [Expected result]: # KubeRay injects the `RAY_gcs_rpc_server_reconnect_timeout_s` environment variable with the value `600` to the worker Pod. # [ # { # "name": "RAY_gcs_rpc_server_reconnect_timeout_s", # "value": "600" # }, # ... # ] # Step 7.2: Kill the GCS process in the head Pod. kubectl exec -it $HEAD_POD -- pkill gcs_server # Step 7.3: The head Pod fails and restarts after `RAY_gcs_rpc_server_reconnect_timeout_s` (60) seconds. # In addition, the worker Pod isn't terminated by the new head after reconnecting because GCS fault # tolerance is enabled. 
kubectl get pods -l=ray.io/is-ray-node=yes # [Example output]: # NAME READY STATUS RESTARTS AGE # raycluster-external-redis-head 1/1 Running 1 (64s ago) xxm # raycluster-external-redis-worker-small-group-yyyyy 1/1 Running 0 xxm ``` In [ray-cluster.external-redis.yaml](https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.external-redis.yaml), the `RAY_gcs_rpc_server_reconnect_timeout_s` environment variable isn't set in the specifications for either the head Pod or the worker Pod within the RayCluster. Therefore, KubeRay automatically injects the `RAY_gcs_rpc_server_reconnect_timeout_s` environment variable with the value **600** to the worker Pod and uses the default value **60** for the head Pod. The timeout value for worker Pods must be longer than the timeout value for the head Pod so that the worker Pods don't terminate before the head Pod restarts from a failure. ### Step 8: Access the detached actor again ```sh kubectl exec -it $HEAD_POD -- python3 /home/ray/samples/increment_counter.py # 2023-09-07 17:31:17,793 INFO worker.py:1313 -- Using address 127.0.0.1:6379 set in the environment variable RAY_ADDRESS # 2023-09-07 17:31:17,793 INFO worker.py:1431 -- Connecting to existing Ray cluster at address: 10.244.0.21:6379... # 2023-09-07 17:31:17,875 INFO worker.py:1612 -- Connected to Ray cluster. View the dashboard at http://10.244.0.21:8265 # 2 ``` ```{admonition} The detached actor is always on the worker Pod in this example. The head Pod's `rayStartParams` is set to `num-cpus: "0"`. Hence, no tasks or actors will be scheduled on the head Pod. ``` With GCS fault tolerance enabled, you can still access the detached actor after the GCS process is dead and restarted. Note that the fault tolerance doesn't persist the actor's state. The reason why the result is 2 instead of 1 is that the detached actor is on the worker Pod which is always running. On the other hand, if the head Pod hosts the detached actor, the `increment_counter.py` script yields a result of 1 in this step. ### Step 9: Remove the key stored in Redis when deleting RayCluster ```shell # Step 9.1: Delete the RayCluster custom resource. kubectl delete raycluster raycluster-external-redis # Step 9.2: KubeRay operator deletes all Pods in the RayCluster. # Step 9.3: KubeRay operator creates a Kubernetes Job to delete the Redis key after the head Pod is terminated. # Step 9.4: Check whether the RayCluster has been deleted. kubectl get raycluster # [Expected output]: No resources found in default namespace. # Step 9.5: Check Redis keys after the Kubernetes Job finishes. export REDIS_POD=$(kubectl get pods --selector=app=redis -o custom-columns=POD:metadata.name --no-headers) kubectl exec -i $REDIS_POD -- env REDISCLI_AUTH="5241590000000000" redis-cli KEYS "*" # [Expected output]: (empty list or set) ``` In KubeRay v1.0.0, the KubeRay operator adds a Kubernetes finalizer to the RayCluster with GCS fault tolerance enabled to ensure Redis cleanup. KubeRay only removes this finalizer after the Kubernetes Job successfully cleans up Redis. * In other words, if the Kubernetes Job fails, the RayCluster won't be deleted. In that case, you should remove the finalizer and cleanup Redis manually. ```shell kubectl patch rayclusters.ray.io raycluster-external-redis --type json --patch='[ { "op": "remove", "path": "/metadata/finalizers" } ]' ``` Starting with KubeRay v1.1.0, KubeRay changes the Redis cleanup behavior from a mandatory to a best-effort basis. 
KubeRay still removes the Kubernetes finalizer from the RayCluster if the Kubernetes Job fails, thereby unblocking the deletion of the RayCluster. Users can turn this behavior off by setting the feature gate `ENABLE_GCS_FT_REDIS_CLEANUP` to `false`. Refer to the [KubeRay GCS fault tolerance configurations](kuberay-redis-cleanup-gate) section for more details.

### Step 10: Delete the Kubernetes cluster

```sh
kind delete cluster
```

## KubeRay GCS fault tolerance configurations

The [ray-cluster.external-redis.yaml](https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.external-redis.yaml) used in the quickstart example contains detailed comments about the configuration options.
***Read this section in conjunction with the YAML file.***

```{admonition} These configurations require KubeRay 1.3.0+
The following section uses the new `gcsFaultToleranceOptions` field introduced in KubeRay 1.3.0. For the old GCS fault tolerance configurations, including the `ray.io/ft-enabled` annotation, see [the old document](https://docs.ray.io/en/releases-2.42.1/cluster/kubernetes/user-guides/kuberay-gcs-ft.html).
```

### 1. Enable GCS fault tolerance

* **`gcsFaultToleranceOptions`**: Add the `gcsFaultToleranceOptions` field to the RayCluster custom resource to enable GCS fault tolerance.

```yaml
kind: RayCluster
metadata:
spec:
  gcsFaultToleranceOptions: # <- Add this field to enable GCS fault tolerance.
```

### 2. Connect to an external Redis

* **`redisAddress`**: Add `redisAddress` to the `gcsFaultToleranceOptions` field.
Use this option to specify the address of the Redis service so that the Ray head can connect to it.
In the [ray-cluster.external-redis.yaml](https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.external-redis.yaml), the RayCluster custom resource uses the `redis` Kubernetes ClusterIP service name as the connection point to the Redis server. The ClusterIP service is also created by the YAML file.

```yaml
kind: RayCluster
metadata:
spec:
  gcsFaultToleranceOptions:
    redisAddress: "redis:6379" # <- Add redis address here.
```

* **`redisPassword`**: Add `redisPassword` to the `gcsFaultToleranceOptions` field.
Use this option to specify the password of the Redis service so that the Ray head can connect to it.
In the [ray-cluster.external-redis.yaml](https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.external-redis.yaml), the RayCluster custom resource loads the password from a Kubernetes secret.

```yaml
kind: RayCluster
metadata:
spec:
  gcsFaultToleranceOptions:
    redisAddress: "redis:6379"
    redisPassword: # <- Add redis password from a Kubernetes secret.
      valueFrom:
        secretKeyRef:
          name: redis-password-secret
          key: password
```

(kuberay-external-storage-namespace)=
### 3. Use an external storage namespace

* **`externalStorageNamespace`** (**optional**): Add `externalStorageNamespace` to the `gcsFaultToleranceOptions` field.
KubeRay uses the value of this option to set the environment variable `RAY_external_storage_namespace` on all Ray Pods managed by the RayCluster.
In most cases, ***you don't need to set `externalStorageNamespace`*** because KubeRay automatically sets it to the UID of the RayCluster.
Only modify this option if you fully understand the behaviors of GCS fault tolerance and RayService, to avoid [this issue](kuberay-raysvc-issue10).
Refer to [this section](kuberay-external-storage-namespace-example) in the earlier quickstart example for more details.
```yaml kind: RayCluster metadata: spec: gcsFaultToleranceOptions: externalStorageNamespace: "my-raycluster-storage" # <- Add this option to specify a storage namespace ``` (kuberay-redis-cleanup-gate)= ### 4. Turn off Redis cleanup * `ENABLE_GCS_FT_REDIS_CLEANUP`: True by default. You can turn this feature off by setting the environment variable in the [KubeRay operator's Helm chart](https://github.com/ray-project/kuberay/blob/master/helm-chart/kuberay-operator/values.yaml). ```{admonition} Key eviction setup on Redis If you disable `ENABLE_GCS_FT_REDIS_CLEANUP` but want Redis to remove GCS metadata automatically, set these two options in your `redis.conf` or in the command line options of your redis-server command [(example)](https://github.com/ray-project/ray/pull/40949#issuecomment-1799057691): * `maxmemory=` * `maxmemory-policy=allkeys-lru` These two options instruct Redis to delete the least recently used keys when it reaches the `maxmemory` limit. See [Key eviction](https://redis.io/docs/reference/eviction/) from Redis for more information. Note that Redis does this eviction and it doesn't guarantee that Ray won't use the deleted keys. ``` ## Next steps * See {ref}`Ray Serve end-to-end fault tolerance documentation ` for more information. * See {ref}`Ray Core GCS fault tolerance documentation ` for more information. --- (kuberay-gcs-persistent-ft)= # Tuning Redis for a Persistent Fault Tolerant GCS Using Redis to back up the Global Control Store (GCS) with KubeRay provides fault tolerance in the event that Ray loses the Ray Head. It allows the new Ray Head to rebuild its state by reading Redis. However, if Redis loses data, the Ray Head state is also lost. Therefore, you may want further protection in the event that your Redis cluster experiences partial or total failure. This guide documents how to configure and tune Redis for a highly available Ray Cluster with KubeRay. Tuning your Ray cluster to be highly available safeguards long-running jobs against unexpected failures and allows you to run Ray on commodity hardware/pre-emptible machines. ## Solution overview KubeRay supports using Redis to persist the GCS, which allows you to move the point of failure (for data loss) outside Ray. However, you still have to configure Redis itself to be resilient to failures. This solution provisions a [Persistent Volume](https://kubernetes.io/docs/concepts/storage/persistent-volumes/) backed by hardware storage, which Redis will use to write regular snapshots. If you lose Redis or its host node, the Redis deployment can be restored from the snapshot. While Redis supports clustering, KubeRay only supports standalone (single replica) Redis, so it omits clustering. ## Persistent storage Specialty storage volumes (like Google Cloud Storage FUSE or S3) don't support append operations, which Redis uses to efficiently write its Append Only File (AOF) log. When using these options, it's recommended to disable AOF. 
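For example, if you place the Redis data directory on such a volume, you can run Redis with AOF turned off and rely on RDB snapshots alone. The following container snippet is only a sketch (the image tag and snapshot cadence are assumptions); `redis-server` accepts configuration directives as command-line flags:

```yaml
# Sketch: run Redis with the append-only file disabled and RDB snapshots only,
# for volumes that don't support append operations.
containers:
  - name: redis
    image: redis:7.4.0            # assumed image tag
    command: ["redis-server", "--appendonly", "no", "--save", "60", "1000"]
    volumeMounts:
      - name: redis-data
        mountPath: /data          # the official Redis image writes dump.rdb to /data
```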
With GCP GKE and Azure AKS, the default storage classes are [persistent disks](https://cloud.google.com/kubernetes-engine/docs/concepts/persistent-volumes) and [SSD Azure disks](https://learn.microsoft.com/en-us/azure/aks/azure-csi-disk-storage-provision) respectively, and the only configuration needed to provision a disk is as follows:

```
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: redis-data
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 8Gi
  storageClassName: standard-rwo
```

On AWS, you must [Create a storage class](https://docs.aws.amazon.com/eks/latest/userguide/create-storage-class.html) yourself as well.

## Tuning backups

Redis supports database dumps at set intervals, which is good for fast recovery and high performance during normal operation.

Redis also supports journaling at frequent intervals (or continuously), which can provide stronger durability at the cost of more disk writes (i.e., slower performance).

A good starting point for backups is to enable both as shown in the following:

```
# Dump a backup every 60s, if there are 1000 writes since the prev. backup.
save 60 1000
dbfilename dump.rdb

# Enable the append-only log file.
appendonly yes
appendfilename "appendonly.aof"
```

In this recommended configuration, Redis creates full backups every 60 seconds and updates the append-only file every second, which is a reasonable balance of disk space, latency, and data safety.

There are more options to configure the AOF; the defaults are shown here:

```
# Sync the log to disk every second.
# Alternatives are "no" and "always" (every write).
appendfsync everysec

auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb
```

You can view the full reference [here](https://raw.githubusercontent.com/redis/redis/refs/tags/7.4.0/redis.conf).

If your job is generally idempotent and can resume from several minutes of state loss, you may prefer to disable the append-only log. If you prefer your job to lose as little state as possible, then you may prefer to set `appendfsync` to `always` so Redis stores all writes immediately.

## Putting it together

Edit [the full YAML](https://github.com/ray-project/kuberay/blob/release-1.3/ray-operator/config/samples/ray-cluster.persistent-redis.yaml) to your satisfaction and apply it:

```
kubectl apply -f config/samples/ray-cluster.persistent-redis.yaml
```

Verify that Kubernetes provisioned a disk and Redis is running:

```
kubectl get persistentvolumes
kubectl get pods
# Should see redis-0 running.
```

After running a job with some state in GCS, you can delete the Ray head Pod as well as the Redis Pod without data loss.

## Verifying

Forward connections to the Ray cluster you just created with the {ref}`Ray kubectl plugin `:

```
$ kubectl ray session raycluster-external-redis
```

Then submit any Ray job of your choosing and let it run. When finished, delete all your pods:

```
$ kubectl delete pods --all
```

Wait for Kubernetes to provision the Ray head and enter a ready state. Then restart your port forwarding and view the Ray dashboard. You should find that Ray and Redis have persisted your job's metadata, despite the loss of the Ray head as well as the Redis replica.

---

(kuberay-label-scheduling)=

# KubeRay label-based scheduling

This guide explains how to use label-based scheduling for Ray clusters on Kubernetes. This feature allows you to direct Ray workloads (tasks, actors, or placement groups) to specific Ray nodes running on Pods using labels.
Label selectors enable fine-grained control of where your workloads run in a heterogeneous cluster, helping to optimize both performance and cost. Label-based scheduling is an essential tool for heterogeneous clusters, where your RayCluster might contain different types of nodes for different purposes, such as: * Nodes with different accelerator types like A100 GPUs or Trillium TPU. * Nodes with different CPU families like Intel or AMD. * Nodes with different instance types related to cost and availability, such as spot or on-demand instances. * Nodes in different failure domains or with region or zone requirements. The Ray scheduler uses a `label_selector` specified in the `@ray.remote` decorator to filter on labels defined on the Ray nodes. In KubeRay, set Ray node labels using labels defined in the RayCluster custom resource. ```{admonition} Label selectors are an experimental feature in Ray 2.49.1. Full autoscaling support for tasks, actors, and placement groups with label selectors is available in Ray 2.51.0 and KubeRay v1.5.1. ``` ## Overview There are three scheduling steps to understand when using KubeRay with label-based scheduling: 1. **The Ray workload**: A Ray application requests resources with a `label_selector`, specifying that you want to schedule on a node with those labels. Example: ```py @ray.remote(num_gpus=1, label_selector={"ray.io/accelerator-type": "A100"}) def gpu_task(): pass ``` 2. **The RayCluster CR**: The RayCluster CRD defines the types of nodes available for scheduling (or scaling with autoscaling) through `HeadGroupSpec` and `WorkerGroupSpecs`. To set Ray node labels for a given group, you can specify them under a top-level `Labels` field. When KubeRay creates a Pod for this group, it sets these labels in the Ray runtime environment. For RayClusters with autoscaling enabled, KubeRay also adds these labels to the autoscaling configuration use for scheduling Ray workloads. Example: ```yaml headGroupSpec: labels: ray.io/region: us-central2 ... workerGroupSpecs: - replicas: 1 minReplicas: 1 maxReplicas: 10 groupName: intel-cpu-group labels: cpu-family: intel ray.io/market-type: on-demand ``` 3. **The Kubernetes scheduler**: To ensure the Ray Pods land on the correct physical hardware, add standard Kubernetes scheduling features like `nodeSelector` or `podAffinity` in the Pod template. Similar to how Ray treats label selectors, the Kubernetes scheduler filters the underlying nodes in the Kubernetes cluster based on these labels when scheduling the Pod. For example, you might add the following `nodeSelector` to the above `intel-cpu-group` to ensure both Ray and Kubernetes constrain scheduling: ```yaml nodeSelector: cloud.google.com/machine-family: "N4" cloud.google.com/gke-spot: "false" ``` This quickstart demonstrates all three steps working together. ## Quickstart ### Step 1: [Optional] Create a Kubernetes cluster with Kind If you don't already have a Kubernetes cluster, create a new cluster with Kind for testing. If you're already using a cloud provider's Kubernetes service such as GKE, skip this step. ```bash kind create cluster --image=kindest/node:v1.26.0 # Mock underlying nodes with GKE-related labels. This is necessary for the `nodeSelector` to be able to schedule Pods. kubectl label node kind-control-plane \ cloud.google.com/machine-family="N4" \ cloud.google.com/gke-spot="true" \ cloud.google.com/gke-accelerator="nvidia-tesla-a100" ``` ```{admonition} This quickstart uses Kind for simplicity. 
In a real-world scenario, you would use a cloud provider's Kubernetes service (like GKE or EKS) that has different machine types, like GPU nodes and spot instances, available. ``` ### Step 2: Install the KubeRay operator Follow [this document](kuberay-operator-deploy) to install the latest stable KubeRay operator via Helm repository. The minimum KubeRay version for this guide is v1.5.1. ### Step 3: Create a RayCluster CR with autoscaling enabled and labels specified ```bash kubectl apply -f https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/ray-cluster-label-selector.yaml ``` ### Step 4: Verify the Kubernetes cluster status ```bash # Step 4.1: List all Ray Pods in the `default` namespace. kubectl get pods -l=ray.io/is-ray-node=yes # [Example output] NAME READY STATUS RESTARTS AGE ray-label-cluster-head-5tkn2 2/2 Running 0 3s ray-label-cluster-large-cpu-group-worker-dhqmt 1/1 Running 0 3s # Step 4.2: Check the ConfigMap in the `default` namespace. kubectl get configmaps # [Example output] # NAME DATA AGE # ray-example 3 21s # ... ``` The RayCluster has 1 head Pod and 1 worker Pod already scaled. The head Pod has two containers: a Ray head container and a Ray autoscaler sidecar container. Additionally, the [ray-cluster-label-selector.yaml](https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/ray-cluster-label-selector.yaml) includes a ConfigMap named `ray-example` that contains three Python scripts: `example_task.py`, `example_actor.py`, and `example_placement_group.py`, which all showcase label-based scheduling. * `example_task.py` is a Python script that creates a simple task requiring a node with the `ray.io/market-type: on-demand` and `cpu-family: in(intel,amd)` labels. The `in` operator expresses that the cpu-family can be either Intel or AMD. ```py import ray @ray.remote(num_cpus=1, label_selector={"ray.io/market-type": "on-demand", "cpu-family": "in(intel,amd)"}) def test_task(): pass ray.init() ray.get(test_task.remote()) ``` * `example_actor.py` is a Python script that creates a simple actor requiring a node with the`ray.io/accelerator-type: A100` label. Ray sets the `ray.io/accelerator-type` label by default when it can detect the underlying compute. ```py import ray @ray.remote(num_gpus=1, label_selector={"ray.io/accelerator-type": "A100"}) class Actor: def ready(self): return True ray.init() my_actor = Actor.remote() ray.get(my_actor.ready.remote()) ``` * `example_placement_group.py` is a Python script that creates a placement group requiring two bundles of 1 CPU with the `ray.io/market-type: spot` label but NOT `ray.io/region: us-central2`. Since the strategy is "SPREAD", we expect two separate Ray nodes with the desired labels to scale up, one node for each placement group bundle. ```py import ray from ray.util.placement_group import placement_group ray.init() pg = placement_group( [{"CPU": 1}] * 2, bundle_label_selector=[{"ray.io/market-type": "spot", "ray.io/region": "!us-central2"},] * 2, strategy="SPREAD" ) ray.get(pg.ready()) ``` ### Step 5: Trigger RayCluster label-based scheduling ```bash # Step 5.1: Get the head pod name export HEAD_POD=$(kubectl get pods --selector=ray.io/node-type=head -o custom-columns=POD:metadata.name --no-headers) # Step 5.2: Run the task. The task should target the existing large-cpu-group and not require autoscaling. kubectl exec -it $HEAD_POD -- python3 /home/ray/samples/example_task.py # Step 5.3: Run the actor. 
This should cause the Ray autoscaler to scale a GPU node in accelerator-group. The Pod may not
# schedule unless you have GPU resources in your cluster.
kubectl exec -it $HEAD_POD -- python3 /home/ray/samples/example_actor.py

# Step 5.4: Create the placement group. This should cause the Ray autoscaler to scale two nodes in spot-group.
kubectl exec -it $HEAD_POD -- python3 /home/ray/samples/example_placement_group.py

# Step 5.5: List all nodes in the Ray cluster. The nodes scaled for the task, actor, and placement group should be annotated with
# the expected Ray node labels.
kubectl exec -it $HEAD_POD -- ray list nodes
```

### Step 6: Clean up the Kubernetes cluster

```bash
# Delete RayCluster and ConfigMap
kubectl delete -f https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/ray-cluster-label-selector.yaml

# Uninstall the KubeRay operator
helm uninstall kuberay-operator

# Delete the kind cluster
kind delete cluster
```

## Next steps

* See [Use labels to control scheduling](https://docs.ray.io/en/master/ray-core/scheduling/labels.html) for more details on label selectors in Ray.
* See [KubeRay Autoscaling](https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/configuring-autoscaling.html) for instructions on how to configure the Ray autoscaler with KubeRay.

---

(kuberay-observability)=

# KubeRay Observability

## KubeRay / Kubernetes Observability

### Check KubeRay operator's logs for errors

```bash
# Typically, the operator's Pod name is kuberay-operator-xxxxxxxxxx-yyyyy.
kubectl logs $KUBERAY_OPERATOR_POD -n $YOUR_NAMESPACE | tee operator-log
```

Use this command to redirect the operator's logs to a file called `operator-log`. Then search for errors in the file.

### Check the status and events of custom resources

```bash
kubectl describe [raycluster|rayjob|rayservice] $CUSTOM_RESOURCE_NAME -n $YOUR_NAMESPACE
```

After running this command, check the events as well as the `state` and `conditions` fields in the custom resource's status for any errors and progress.

#### RayCluster `.Status.State`

The `.Status.State` field summarizes the cluster's situation, but as a single field its expressiveness is limited. Prefer the newer `Status.Conditions` field instead.

| State     | Description |
|-----------|-------------|
| Ready     | KubeRay sets the state to `Ready` once all the Pods in the cluster are ready. The `State` remains `Ready` until KubeRay suspends the cluster. |
| Suspended | KubeRay sets the state to `Suspended` when you set `Spec.Suspend` to true and KubeRay deletes all Pods in the cluster. |

#### RayCluster `.Status.Conditions`

Although `Status.State` can represent the cluster situation, it's still only a single field. By enabling the `RayClusterStatusConditions` feature gate in KubeRay v1.2.1 or later, you can access the new `Status.Conditions` field for a more detailed cluster history and state.

:::{warning}
`RayClusterStatusConditions` is still an alpha feature and may change in the future.
:::

If you deployed KubeRay with Helm, enable the `RayClusterStatusConditions` gate in the `featureGates` of your Helm values.

```bash
helm upgrade kuberay-operator kuberay/kuberay-operator --version 1.2.2 \
    --set featureGates\[0\].name=RayClusterStatusConditions \
    --set featureGates\[0\].enabled=true
```

Alternatively, run the KubeRay operator executable with the `--feature-gates=RayClusterStatusConditions=true` argument.
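For reference, the `--set` flags above map to roughly the following `featureGates` entry if you keep the setting in a values file instead (a sketch; the values file name and any surrounding settings are up to you):

```yaml
# Sketch of a Helm values snippet equivalent to the --set flags above.
featureGates:
  - name: RayClusterStatusConditions
    enabled: true
```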
| Type | Status | Reason | Description | |--------------------------|--------|--------------------------------|----------------------------------------------------------------------------------------------------------------------| | RayClusterProvisioned | True | AllPodRunningAndReadyFirstTime | When all Pods in the cluster become ready, the system marks the condition as `True`. Even if some Pods fail later, the system maintains this `True` state. | | | False | RayClusterPodsProvisioning | | | RayClusterReplicaFailure | True | FailedDeleteAllPods | KubeRay sets this condition to `True` when there's a reconciliation error, otherwise KubeRay clears the condition. | | | True | FailedDeleteHeadPod | See the `Reason` and the `Message` of the condition for more detailed debugging information. | | | True | FailedCreateHeadPod | | | | True | FailedDeleteWorkerPod | | | | True | FailedCreateWorkerPod | | | HeadPodReady | True | HeadPodRunningAndReady | This condition is `True` only if the HeadPod is currently ready; otherwise, it's `False`. | | | False | HeadPodNotFound | | #### RayService `.Status.Conditions` From KubeRay v1.3.0, RayService also supports the `Status.Conditions` field. * `Ready`: If `Ready` is true, the RayService is ready to serve requests. * `UpgradeInProgress`: If `UpgradeInProgress` is true, the RayService is currently in the upgrade process and both active and pending RayCluster exist. ```sh kubectl describe rayservices.ray.io rayservice-sample # [Example output] # Conditions: # Last Transition Time: 2025-02-08T06:45:20Z # Message: Number of serve endpoints is greater than 0 # Observed Generation: 1 # Reason: NonZeroServeEndpoints # Status: True # Type: Ready # Last Transition Time: 2025-02-08T06:44:28Z # Message: Active Ray cluster exists and no pending Ray cluster # Observed Generation: 1 # Reason: NoPendingCluster # Status: False # Type: UpgradeInProgress ``` #### Kubernetes Events KubeRay creates Kubernetes events for every interaction between the KubeRay operator and the Kubernetes API server, such as creating a Kubernetes service, updating a RayCluster, and deleting a RayCluster. In addition, if the validation of the custom resource fails, KubeRay also creates a Kubernetes event. ```sh # Example: kubectl describe rayclusters.ray.io raycluster-kuberay # Events: # Type Reason Age From Message # ---- ------ ---- ---- ------- # Normal CreatedService 37m raycluster-controller Created service default/raycluster-kuberay-head-svc # Normal CreatedHeadPod 37m raycluster-controller Created head Pod default/raycluster-kuberay-head-l7v7q # Normal CreatedWorkerPod ... ``` ## Ray Observability ### Ray Dashboard * To view the [Ray dashboard](observability-getting-started) running on the head Pod, follow [these instructions](kuberay-port-forward-dashboard). * To integrate the Ray dashboard with Prometheus and Grafana, see [Using Prometheus and Grafana](kuberay-prometheus-grafana) for more details. * To enable the "CPU Flame Graph" and "Stack Trace" features, see [Profiling with py-spy](kuberay-pyspy-integration). ### Check logs of Ray Pods Check the Ray logs directly by accessing the log files on the Pods. See [Ray Logging](configure-logging) for more details. 
```bash kubectl exec -it $RAY_POD -n $YOUR_NAMESPACE -- bash # Check the logs under /tmp/ray/session_latest/logs/ ``` (kuberay-port-forward-dashboard)= ### Check Dashboard ```bash export HEAD_POD=$(kubectl get pods --selector=ray.io/node-type=head -o custom-columns=POD:metadata.name --no-headers) kubectl port-forward $HEAD_POD -n $YOUR_NAMESPACE 8265:8265 # Check $YOUR_IP:8265 in your browser to access the dashboard. # For most cases, 127.0.0.1:8265 or localhost:8265 should work. ``` ### Ray State CLI You can use the [Ray State CLI](state-api-cli-ref) on the head Pod to check the status of Ray Serve applications. ```bash # Log into the head Pod export HEAD_POD=$(kubectl get pods --selector=ray.io/node-type=head -o custom-columns=POD:metadata.name --no-headers) kubectl exec -it $HEAD_POD -- ray summary actors # [Example output]: # ======== Actors Summary: 2023-07-11 17:58:24.625032 ======== # Stats: # ------------------------------------ # total_actors: 14 # Table (group by class): # ------------------------------------ # CLASS_NAME STATE_COUNTS # 0 ... ALIVE: 1 # 1 ... ALIVE: 1 # 2 ... ALIVE: 3 # 3 ... ALIVE: 1 # 4 ... ALIVE: 1 # 5 ... ALIVE: 1 # 6 ... ALIVE: 1 # 7 ... ALIVE: 1 # 8 ... ALIVE: 1 # 9 ... ALIVE: 1 # 10 ... ALIVE: 1 # 11 ... ALIVE: 1 ``` --- (persist-kuberay-custom-resource-logs)= # Persist KubeRay custom resource logs Logs (both system and application logs) are useful for troubleshooting Ray applications and Clusters. For example, you may want to access system logs if a node terminates unexpectedly. Similar to Kubernetes, Ray does not provide a native storage solution for log data. Users need to manage the lifecycle of the logs by themselves. This page provides instructions on how to collect logs from Ray Clusters that are running on Kubernetes. ## Ray log directory By default, Ray writes logs to files in the directory `/tmp/ray/session_*/logs` on each Ray pod's file system, including application and system logs. Learn more about the {ref}`log directory and log files ` and the {ref}`log rotation configuration ` before you start to collect the logs. ## Log processing tools There are a number of open source log processing tools available within the Kubernetes ecosystem. This page shows how to extract Ray logs using [Fluent Bit][FluentBit]. Other popular tools include [Vector][Vector], [Fluentd][Fluentd], [Filebeat][Filebeat], and [Promtail][Promtail]. ## Log collection strategies To write collected logs to a pod's filesystem ,use one of two logging strategies: **sidecar containers** or **daemonsets**. Read more about these logging patterns in the [Kubernetes documentation][KubDoc]. ### Sidecar containers We provide an {ref}`example ` of the sidecar strategy in this guide. You can process logs by configuring a log-processing sidecar for each Ray pod. Ray containers should be configured to share the `/tmp/ray` directory with the logging sidecar via a volume mount. You can configure the sidecar to do either of the following: * Stream Ray logs to the sidecar's stdout. * Export logs to an external service. ### Daemonset Alternatively, it is possible to collect logs at the Kubernetes node level. To do this, one deploys a log-processing daemonset onto the Kubernetes cluster's nodes. With this strategy, it is key to mount the Ray container's `/tmp/ray` directory to the relevant `hostPath`. 
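As a rough sketch of that mount (the host path `/var/log/ray` is an assumption; point your log-processing daemonset at the same directory):

```yaml
# Sketch: expose the Ray log directory on the Kubernetes node so a node-level
# log agent (daemonset) can tail it. The hostPath location is an assumption.
containers:
  - name: ray-head
    image: rayproject/ray:2.46.0
    volumeMounts:
      - name: ray-logs
        mountPath: /tmp/ray
volumes:
  - name: ray-logs
    hostPath:
      path: /var/log/ray          # assumed node directory for Ray logs
      type: DirectoryOrCreate
```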
(kuberay-fluentbit)= ## Setting up logging sidecars with Fluent Bit In this section, we give an example of how to set up log-emitting [Fluent Bit][FluentBit] sidecars for Ray pods to send logs to [Grafana Loki][GrafanaLoki], enabling centralized log management and querying. See the full config for a single-pod RayCluster with a logging sidecar [here][ConfigLink]. We now discuss this configuration and show how to deploy it. ### Deploy Loki monolithic mode Follow [Deploy Loki monolithic mode](deploy-loki-monolithic-mode) to deploy Grafana Loki in monolithic mode. ### Deploy Grafana Follow [Deploy Grafana](deploy-grafana) to set up Grafana Loki datasource and deploy Grafana. ### Configuring log processing The first step is to create a ConfigMap with configuration for Fluent Bit. The following ConfigMap example configures a Fluent Bit sidecar to: * Tail Ray logs. * Send logs to a Grafana Loki endpoint. * Add metadata to the logs for filtering by labels, for example, `RayCluster`. ```yaml apiVersion: v1 kind: ConfigMap metadata: name: fluentbit-config data: fluent-bit.conf: | [INPUT] Name tail Path /tmp/ray/session_latest/logs/* Tag ray Path_Key true Refresh_Interval 5 [FILTER] Name modify Match ray Add POD_LABELS ${POD_LABELS} [OUTPUT] Name loki Match * Host loki-gateway Port 80 Labels RayCluster=${POD_LABELS} tenant_id test ``` A few notes on the above config: - The `Path_Key true` line above ensures that file names are included in the log records emitted by Fluent Bit. - The `Refresh_Interval 5` line asks Fluent Bit to refresh the list of files in the log directory once per 5 seconds, rather than the default 60. The reason is that the directory `/tmp/ray/session_latest/logs/` does not exist initially (Ray must create it first). Setting the `Refresh_Interval` low allows us to see logs in the Fluent Bit container's stdout sooner. - The [Kubernetes downward API][KubernetesDownwardAPI] populates the `POD_LABELS` variable used in the `FILTER` section. It pulls the label from the pod's metadata label `ray.io/cluster`, which is defined in the Fluent Bit sidecar container's environment. - The `tenant_id` field allows you to assign logs to different tenants. In this example, Fluent Bit sidecar sends the logs to the `test` tenant. You can adjust this configuration to match the tenant ID set up in your Grafana Loki instance, enabling multi-tenancy support in Grafana. - The `Host` field specifies the endpoint of the Loki gateway. If Loki and the RayCluster are in different namespaces, you need to append `.namespace` to the hostname, for example, `loki-gateway.monitoring` (replacing `monitoring` with the namespace where Loki resides). ### Adding logging sidecars to RayCluster Custom Resource (CR) #### Adding log and config volumes For each pod template in our RayCluster CR, we need to add two volumes: One volume for Ray's logs and another volume to store Fluent Bit configuration from the ConfigMap applied above. ```yaml volumes: - name: ray-logs emptyDir: {} - name: fluentbit-config configMap: name: fluentbit-config ``` #### Mounting the Ray log directory Add the following volume mount to the Ray container's configuration. ```yaml volumeMounts: - mountPath: /tmp/ray name: ray-logs ``` #### Adding the Fluent Bit sidecar Finally, add the Fluent Bit sidecar container to each Ray pod config in your RayCluster CR. 
```yaml - name: fluentbit image: fluent/fluent-bit:3.2.2 # Get Kubernetes metadata via downward API env: - name: POD_LABELS valueFrom: fieldRef: fieldPath: metadata.labels['ray.io/cluster'] # These resource requests for Fluent Bit should be sufficient in production. resources: requests: cpu: 100m memory: 128Mi limits: cpu: 100m memory: 128Mi volumeMounts: - mountPath: /tmp/ray name: ray-logs - mountPath: /fluent-bit/etc/fluent-bit.conf subPath: fluent-bit.conf name: fluentbit-config ``` Mounting the `ray-logs` volume gives the sidecar container access to Ray's logs. The `fluentbit-config` volume gives the sidecar access to logging configuration. #### Putting everything together Putting all of the above elements together, we have the following yaml configuration for a single-pod RayCluster will a log-processing sidecar. ```yaml # Fluent Bit ConfigMap apiVersion: v1 kind: ConfigMap metadata: name: fluentbit-config data: fluent-bit.conf: | [INPUT] Name tail Path /tmp/ray/session_latest/logs/* Tag ray Path_Key true Refresh_Interval 5 [FILTER] Name modify Match ray Add POD_LABELS ${POD_LABELS} [OUTPUT] Name loki Match * Host loki-gateway Port 80 Labels RayCluster=${POD_LABELS} tenant_id test --- # RayCluster CR with a FluentBit sidecar apiVersion: ray.io/v1 kind: RayCluster metadata: name: raycluster-fluentbit-sidecar-logs spec: rayVersion: '2.46.0' headGroupSpec: template: spec: containers: - name: ray-head image: rayproject/ray:2.46.0 # This config is meant for demonstration purposes only. # Use larger Ray containers in production! resources: limits: cpu: 1 memory: 2Gi requests: cpu: 500m memory: 1Gi # Share logs with Fluent Bit volumeMounts: - mountPath: /tmp/ray name: ray-logs # Fluent Bit sidecar - name: fluentbit image: fluent/fluent-bit:3.2.2 # Get Kubernetes metadata via downward API env: - name: POD_LABELS valueFrom: fieldRef: fieldPath: metadata.labels['ray.io/cluster'] # These resource requests for Fluent Bit should be sufficient in production. resources: requests: cpu: 100m memory: 128Mi limits: cpu: 100m memory: 128Mi volumeMounts: - mountPath: /tmp/ray name: ray-logs - mountPath: /fluent-bit/etc/fluent-bit.conf subPath: fluent-bit.conf name: fluentbit-config # Log and config volumes volumes: - name: ray-logs emptyDir: {} - name: fluentbit-config configMap: name: fluentbit-config ``` ### Deploying a RayCluster with logging sidecar To deploy the configuration described above, deploy the KubeRay Operator if you haven't yet: Refer to the {ref}`Getting Started guide ` for instructions on this step. Now, run the following commands to deploy the Fluent Bit ConfigMap and a single-pod RayCluster with a Fluent Bit sidecar. ```shell kubectl apply -f https://raw.githubusercontent.com/ray-project/kuberay/refs/heads/master/ray-operator/config/samples/ray-cluster.fluentbit.yaml ``` To access Grafana from your local machine, set up port forwarding by running: ```shell export POD_NAME=$(kubectl get pods --namespace default -l "app.kubernetes.io/name=grafana,app.kubernetes.io/instance=grafana" -o jsonpath="{.items[0].metadata.name}") kubectl --namespace default port-forward $POD_NAME 3000 ``` This command makes Grafana available locally at `http://localhost:3000`. 
* Username: "admin" * Password: Get the password using the following command: ```shell kubectl get secret --namespace default grafana -o jsonpath="{.data.admin-password}" | base64 --decode ; echo ``` Finally, use a LogQL query to view logs for a specific RayCluster or RayJob, and filter by `RayCluster`, as set in the FluentBit ConfigMap OUTPUT configuration in this example. ```shell {RayCluster="raycluster-fluentbit-sidecar-logs"} ``` ![Loki Logs](images/loki-sidecar-logs.png) [Vector]: https://vector.dev/ [FluentBit]: https://docs.fluentbit.io/manual [FluentBitStorage]: https://docs.fluentbit.io/manual [Filebeat]: https://www.elastic.co/guide/en/beats/filebeat/7.17/index.html [Fluentd]: https://docs.fluentd.org/ [Promtail]: https://grafana.com/docs/loki/latest/clients/promtail/ [GrafanaLoki]: https://grafana.com/oss/loki/ [KubDoc]: https://kubernetes.io/docs/concepts/cluster-administration/logging/ [ConfigLink]: https://raw.githubusercontent.com/ray-project/ray/releases/2.4.0/doc/source/cluster/kubernetes/configs/ray-cluster.log.yaml [KubernetesDownwardAPI]: https://kubernetes.io/docs/concepts/workloads/pods/downward-api/ (redirect-to-stderr)= ## Redirecting Ray logs to stderr By default, Ray writes logs to files in the `/tmp/ray/session_*/logs` directory. If your log processing tool is capable of capturing log records written to stderr, you can redirect Ray logs to the stderr stream of Ray containers by setting the environment variable `RAY_LOG_TO_STDERR=1` on all Ray nodes. ```{admonition} Alert: this practice isn't recommended. :class: caution If `RAY_LOG_TO_STDERR=1` is set, Ray doesn't write logs to files. Consequently, this behavior can cause some Ray features that rely on log files to malfunction. For instance, {ref}`worker log redirection to driver ` doesn't work if you redirect Ray logs to stderr. If you need these features, consider using the {ref}`Fluent Bit solution ` mentioned above. For clusters on VMs, don't redirect logs to stderr. Instead, follow {ref}`this guide ` to persist logs. ``` Redirecting logging to stderr also prepends a `({component})` prefix, for example, `(raylet)`, to each log record message. ```bash [2022-01-24 19:42:02,978 I 1829336 1829336] (gcs_server) grpc_server.cc:103: GcsServer server started, listening on port 50009. [2022-01-24 19:42:06,696 I 1829415 1829415] (raylet) grpc_server.cc:103: ObjectManager server started, listening on port 40545. 2022-01-24 19:42:05,087 INFO (dashboard) dashboard.py:95 -- Setup static dir for dashboard: /mnt/data/workspace/ray/python/ray/dashboard/client/build 2022-01-24 19:42:07,500 INFO (dashboard_agent) agent.py:105 -- Dashboard agent grpc address: 0.0.0.0:49228 ``` These prefixes allow you to filter the stderr stream of logs by the component of interest. Note, however, that multi-line log records **don't** have this component marker at the beginning of each line. Follow the steps below to set the environment variable ``RAY_LOG_TO_STDERR=1`` on all Ray nodes ::::{tab-set} :::{tab-item} Single-node local cluster **Start the cluster explicitly with CLI**
```bash env RAY_LOG_TO_STDERR=1 ray start ``` **Start the cluster implicitly with `ray.init`**
```python
import os

import ray

os.environ["RAY_LOG_TO_STDERR"] = "1"
ray.init()
```

:::

:::{tab-item} KubeRay

Set the `RAY_LOG_TO_STDERR` environment variable to `1` in the Ray container of each Ray Pod.
Use this [example YAML file](https://gist.github.com/kevin85421/3d676abae29ebd5677428ddbbd4c8d74) as a reference.

:::

::::

---

(persist-kuberay-operator-logs)=

# Persist KubeRay Operator Logs

The KubeRay Operator plays a vital role in managing Ray clusters on Kubernetes. Persisting its logs is essential for effective troubleshooting and monitoring. This guide describes methods to set up centralized logging for KubeRay Operator logs.

## Grafana Loki

[Grafana Loki][GrafanaLoki] is a log aggregation system optimized for Kubernetes, providing efficient log storage and querying. The following steps set up [Fluent Bit][FluentBit] as a DaemonSet to collect logs from Kubernetes containers and send them to Loki for centralized storage and analysis.

(deploy-loki-monolithic-mode)=
### Deploy Loki monolithic mode

Loki's Helm chart supports three deployment methods to fit different scalability and performance needs: Monolithic, Simple Scalable, and Microservices. This guide demonstrates the monolithic method. For details on each deployment mode, see the [Loki deployment modes](https://grafana.com/docs/loki/latest/get-started/deployment-modes/) documentation.

Deploy Loki with its [Helm chart](https://github.com/grafana/loki/tree/main/production/helm/loki):

```shell
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

# Install Loki with single replica mode
helm install loki grafana/loki --version 6.21.0 -f https://raw.githubusercontent.com/grafana/loki/refs/heads/main/production/helm/loki/single-binary-values.yaml
```

### Configure log processing

Create a `fluent-bit-config.yaml` file, which configures Fluent Bit to:

* Tail log files from Kubernetes containers.
* Parse multi-line logs for Docker and Container Runtime Interface (CRI) formats.
* Enrich logs with Kubernetes metadata such as namespace, pod, and container names.
* Send the logs to Loki for centralized storage and querying.

```{literalinclude} ../configs/loki.log.yaml
:language: yaml
:start-after: Fluent Bit Config
:end-before: ---
```

A few notes on the above config:

* Inputs: The `tail` input reads log files from `/var/log/containers/*.log`, with `multiline.parser` to handle complex log messages across multiple lines.
* Filters: The `kubernetes` filter adds metadata like namespace, pod, and container names to each log, enabling more efficient log management and querying in Loki.
* Outputs: The `loki` output block specifies Loki as the target. The `Host` and `Port` define the Loki service endpoint, and `Labels` adds metadata for easier querying in Grafana. Additionally, `tenant_id` allows for multi-tenancy if required by the Loki setup.

Deploy Fluent Bit with its [Helm chart](https://github.com/fluent/helm-charts/tree/main/charts/fluent-bit):

```shell
helm repo add fluent https://fluent.github.io/helm-charts
helm repo update
helm install fluent-bit fluent/fluent-bit --version 0.48.2 -f fluent-bit-config.yaml
```

### Install the KubeRay Operator

Follow [Deploy a KubeRay operator](kuberay-operator-deploy) to install the KubeRay operator.

### Deploy a RayCluster

Follow [Deploy a RayCluster custom resource](raycluster-deploy) to deploy a RayCluster.
(deploy-grafana)= ### Deploy Grafana Create a `datasource-config.yaml` file with the following configuration to set up Grafana's Loki datasource: ```{literalinclude} ../configs/loki.log.yaml :language: yaml :start-after: Grafana Datasource Config ``` Deploy the Grafana deployment with the [Helm chart repository](https://github.com/grafana/helm-charts/tree/main/charts/grafana). ```shell helm repo add grafana https://grafana.github.io/helm-charts helm repo update helm install grafana grafana/grafana --version 8.6.2 -f datasource-config.yaml ``` ### Check the Grafana Dashboard ```shell # Verify that the Grafana pod is running in the `default` namespace. kubectl get pods --namespace default -l "app.kubernetes.io/name=grafana" # NAME READY STATUS RESTARTS AGE # grafana-54d5d747fd-5fldc 1/1 Running 0 8m21s ``` To access Grafana from your local machine, set up port forwarding by running: ```shell export POD_NAME=$(kubectl get pods --namespace default -l "app.kubernetes.io/name=grafana,app.kubernetes.io/instance=grafana" -o jsonpath="{.items[0].metadata.name}") kubectl --namespace default port-forward $POD_NAME 3000 ``` This command makes Grafana available locally at `http://localhost:3000`. * Username: "admin" * Password: Get the password using the following command: ```shell kubectl get secret --namespace default grafana -o jsonpath="{.data.admin-password}" | base64 --decode ; echo ``` Finally, use a LogQL query to view logs for a specific pod, such as the KubeRay Operator, and filter logs by the `RayCluster_name`: ``` {pod="kuberay-operator-xxxxxxxx-xxxxx"} | json | RayCluster_name = `raycluster-kuberay` ``` ![Loki Logs](images/loki-logs.png) You can use LogQL's JSON syntax to filter logs based on specific fields, such as `RayCluster_name`. See [Log query language doc](https://grafana.com/docs/loki/latest/query/) for more information about LogQL filtering. [GrafanaLoki]: https://grafana.com/oss/loki/ [FluentBit]: https://docs.fluentbit.io/manual --- (kuberay-pod-command)= # Specify container commands for Ray head/worker Pods KubeRay generates a `ray start` command for each Ray Pod. Sometimes, you may want to execute certain commands either before or after the ray start command, or you may wish to define the container's command yourself. This document shows you how to do that. ## Part 1: Specify a custom container command, optionally including the generated `ray start` command Starting with KubeRay v1.1.0, if users add the annotation `ray.io/overwrite-container-cmd: "true"` to a RayCluster, KubeRay respects the container `command` and `args` as provided by the users, without including any generated command, including the `ulimit` and the `ray start` commands, with the latter stored in the environment variable `KUBERAY_GEN_RAY_START_CMD`. ```yaml apiVersion: ray.io/v1 kind: RayCluster metadata: annotations: # If this annotation is set to "true", KubeRay will respect the container `command` and `args`. ray.io/overwrite-container-cmd: "true" ... spec: headGroupSpec: # Pod template template: spec: containers: - name: ray-head image: rayproject/ray:2.46.0 # Because the annotation "ray.io/overwrite-container-cmd" is set to "true", # KubeRay will overwrite the generated container command with `command` and # `args` in the following. Hence, you need to specify the `ulimit` command # by yourself to avoid Ray scalability issues. command: ["/bin/bash", "-lc", "--"] # Starting from v1.1.0, KubeRay injects the environment variable `KUBERAY_GEN_RAY_START_CMD` # into the Ray container. 
This variable can be used to retrieve the generated Ray start command. # Note that this environment variable does not include the `ulimit` command. args: ["ulimit -n 65536; echo head; $KUBERAY_GEN_RAY_START_CMD"] ... ``` The preceding example YAML is a part of [ray-cluster.overwrite-command.yaml](https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.overwrite-command.yaml). * `metadata.annotations.ray.io/overwrite-container-cmd: "true"`: This annotation tells KubeRay to respect the container `command` and `args` as provided by the users, without including any generated command. Refer to Part 2 for the default behavior if you set the annotation to "false" or don't set it at all. * `ulimit -n 65536`: This command is necessary to avoid Ray scalability issues caused by running out of file descriptors. If you don't set the annotation, KubeRay automatically injects the `ulimit` command into the container. * `$KUBERAY_GEN_RAY_START_CMD`: Starting from KubeRay v1.1.0, KubeRay injects the environment variable `KUBERAY_GEN_RAY_START_CMD` into the Ray container for both head and worker Pods to store the `ray start` command generated by KubeRay. Note that this environment variable doesn't include the `ulimit` command. ```sh # Example of the environment variable `KUBERAY_GEN_RAY_START_CMD` in the head Pod. ray start --head --dashboard-host=0.0.0.0 --num-cpus=1 --block --metrics-export-port=8080 --memory=2147483648 ``` The head Pod's `command`/`args` looks like the following: ```yaml Command: /bin/bash -lc -- Args: ulimit -n 65536; echo head; $KUBERAY_GEN_RAY_START_CMD ``` ## Part 2: Execute commands before the generated `ray start` command If you only want to execute commands before the generated command, you don't need to set the annotation `ray.io/overwrite-container-cmd: "true"`. Some users employ this method to set up environment variables used by `ray start`. ```yaml # https://github.com/ray-project/kuberay/ray-operator/config/samples/ray-cluster.head-command.yaml #pod template template: spec: containers: - name: ray-head image: rayproject/ray:2.46.0 resources: ... ports: ... # `command` and `args` will become a part of `spec.containers.0.args` in the head Pod. command: ["echo 123"] args: ["456"] ``` * `spec.containers.0.command`: KubeRay hard codes `["/bin/bash", "-lc", "--"]` as the container's command. * `spec.containers.0.args` contains two parts: * **user-specified command**: A string concatenates `headGroupSpec.template.spec.containers.0.command` and `headGroupSpec.template.spec.containers.0.args` together. * **ray start command**: KubeRay creates the command based on `rayStartParams` specified in RayCluster. The command looks like `ulimit -n 65536; ray start ...`. * To summarize, `spec.containers.0.args` is `$(user-specified command) && $(ray start command)`. * Example ```sh # Prerequisite: There is a KubeRay operator in the Kubernetes cluster. # Create a RayCluster kubectl apply -f https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/ray-cluster.head-command.yaml # Check ${RAYCLUSTER_HEAD_POD} kubectl get pod -l ray.io/node-type=head # Check `spec.containers.0.command` and `spec.containers.0.args`. 
kubectl describe pod ${RAYCLUSTER_HEAD_POD} # Command: # /bin/bash # -lc # -- # Args: # echo 123 456 && ulimit -n 65536; ray start --head --dashboard-host=0.0.0.0 --num-cpus=1 --block --metrics-export-port=8080 --memory=2147483648 ``` --- (kuberay-rayservice-ha)= # RayService high availability [RayService](kuberay-rayservice) provides high availability to ensure services continue serving requests when the Ray head Pod fails. ## Prerequisites * Use RayService with KubeRay 1.3.0 or later. * Enable GCS fault tolerance in the RayService. ## Quickstart ### Step 1: Create a Kubernetes cluster with Kind ```sh kind create cluster --image=kindest/node:v1.26.0 ``` ### Step 2: Install the KubeRay operator Follow [this document](kuberay-operator-deploy) to install the latest stable KubeRay operator from the Helm repository. ### Step 3: Install a RayService with GCS fault tolerance ```sh kubectl apply -f https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/ray-service.high-availability.yaml ``` The [ray-service.high-availability.yaml](https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/ray-service.high-availability.yaml) file has several Kubernetes objects: * Redis: Redis is necessary to make GCS fault tolerant. See {ref}`GCS fault tolerance ` for more details. * RayService: This RayService custom resource includes a 3-node RayCluster and a simple [Ray Serve application](https://github.com/ray-project/test_dag). * Ray Pod: This Pod sends requests to the RayService. ### Step 4: Verify the Kubernetes Serve service Check the output of the following command to verify that you successfully started the Kubernetes Serve service: ```sh # Step 4.1: Wait until the RayService is ready to serve requests. kubectl describe rayservices.ray.io rayservice-ha # [Example output] # Conditions: # Last Transition Time: 2025-02-13T21:36:18Z # Message: Number of serve endpoints is greater than 0 # Observed Generation: 1 # Reason: NonZeroServeEndpoints # Status: True # Type: Ready # Step 4.2: `rayservice-ha-serve-svc` should have 3 endpoints, including the Ray head and two Ray workers. kubectl describe svc rayservice-ha-serve-svc # [Example output] # Endpoints: 10.244.0.29:8000,10.244.0.30:8000,10.244.0.32:8000 ``` ### Step 5: Verify the Serve applications In the [ray-service.high-availability.yaml](https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/ray-service.high-availability.yaml) file, the `serveConfigV2` parameter specifies `num_replicas: 2` and `max_replicas_per_node: 1` for each Ray Serve deployment. In addition, the YAML sets the `rayStartParams` parameter to `num-cpus: "0"` to ensure that the system doesn't schedule any Ray Serve replicas on the Ray head Pod. In total, each Ray Serve deployment has two replicas, and each Ray node can have at most one of those two Ray Serve replicas. Additionally, Ray Serve replicas can't schedule on the Ray head Pod. As a result, each worker node should have exactly one Ray Serve replica for each Ray Serve deployment. For Ray Serve, the Ray head always has a HTTPProxyActor whether it has a Ray Serve replica or not. The Ray worker nodes only have HTTPProxyActors when they have Ray Serve replicas. Thus, the `rayservice-ha-serve-svc` service in the previous step has 3 endpoints. ```sh # Port forward the Ray Dashboard. kubectl port-forward svc/rayservice-ha-head-svc 8265:8265 # Visit ${YOUR_IP}:8265 in your browser for the Dashboard (e.g. 
127.0.0.1:8265)
# Check:
# (1) Both head and worker nodes have HTTPProxyActors.
# (2) Only worker nodes have Ray Serve replicas.
# (3) Each worker node has one Ray Serve replica for each Ray Serve deployment.
```

### Step 6: Send requests to the RayService

```sh
# Log into the separate Ray Pod.
kubectl exec -it ray-pod -- bash

# Send requests to the RayService.
python3 samples/query.py

# This script sends the same request to the RayService consecutively, ensuring at most one in-flight request at a time.
# The request is equivalent to `curl -X POST -H 'Content-Type: application/json' localhost:8000/fruit/ -d '["PEAR", 12]'`.
# [Example output]
# req_index : 2197, num_fail: 0
# response: 12
# req_index : 2198, num_fail: 0
# response: 12
# req_index : 2199, num_fail: 0
```

### Step 7: Delete the Ray head Pod

```sh
# Step 7.1: Delete the Ray head Pod.
export HEAD_POD=$(kubectl get pods --selector=ray.io/node-type=head -o custom-columns=POD:metadata.name --no-headers)
kubectl delete pod $HEAD_POD
```

In this example, `query.py` ensures that at most one request is in-flight at any given time.
Furthermore, the Ray head Pod doesn't have any Ray Serve replicas.
Requests can fail only when the in-flight request is being handled by the HTTPProxyActor on the Ray head Pod.
Therefore, failures are highly unlikely to occur during the deletion and recovery of the Ray head Pod.
You can implement retry logic in Ray scripts to handle the failures.

```sh
# [Expected output]: The `num_fail` is highly likely to be 0.
req_index : 32503, num_fail: 0
response: 12
req_index : 32504, num_fail: 0
response: 12
```

### Step 8: Cleanup

```sh
kind delete cluster
```

---

(kuberay-rayservice-incremental-upgrade)=

# RayService Zero-Downtime Incremental Upgrades

This guide details how to configure and use the `NewClusterWithIncrementalUpgrade` strategy for a `RayService` with KubeRay. This feature was proposed in a [Ray Enhancement Proposal (REP)](https://github.com/ray-project/enhancements/blob/main/reps/2024-12-4-ray-service-incr-upgrade.md) and implemented with alpha support in KubeRay v1.5.1. If you're unfamiliar with RayServices and KubeRay, see the [RayService Quickstart](https://docs.ray.io/en/latest/cluster/kubernetes/getting-started/rayservice-quick-start.html).

In previous versions of KubeRay, zero-downtime upgrades were supported only through the `NewCluster` strategy. This upgrade strategy involved scaling up a pending RayCluster with capacity equal to the active cluster, waiting until the updated Serve applications were healthy, and then switching traffic to the new RayCluster. While this upgrade strategy is reliable, it required users to provision 200% of their original cluster's compute resources, which can be prohibitive when dealing with expensive accelerator resources.

The `NewClusterWithIncrementalUpgrade` strategy is designed for large-scale deployments, such as LLM serving, where duplicating resources for a standard blue/green deployment is not feasible due to resource constraints. This feature minimizes resource usage during RayService CR upgrades while maintaining service availability. The following sections explain the design and usage.

Rather than creating a new `RayCluster` at 100% capacity, this strategy creates a new cluster and gradually scales its capacity up while simultaneously shifting user traffic from the old cluster to the new one.
This gradual traffic migration enables users to safely scale their updated RayService while the old cluster auto-scales down, enabling users to save expensive compute resources and exert greater control over the pace of their upgrade. This process relies on the Kubernetes Gateway API for fine-grained traffic splitting. ## Quickstart: Performing an Incremental Upgrade ### 1. Prerequisites Before you can use this feature, you **must** have the following set up in your Kubernetes cluster: 1. **Gateway API CRDs:** The K8s Gateway API resources must be installed. You can typically install them with: ```bash kubectl apply --server-side -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.4.0/standard-install.yaml ``` The RayService controller utilizes GA Gateway API resources such as a [Gateway](https://kubernetes.io/docs/concepts/services-networking/gateway/#api-kind-gateway) and [HTTPRoute](https://kubernetes.io/docs/concepts/services-networking/gateway/#api-kind-httproute) to safely split traffic during the upgrade. 2. **A Gateway Controller:** Users must install a Gateway controller that implements the Gateway API, such as Istio, Contour, or a cloud-native implementation like GKE's Gateway controller. This feature should support any controller that implements Gateway API with support for `Gateway` and `HTTPRoute` CRDs, but is an alpha feature that's primarily been tested utilizing [Istio](https://istio.io/latest/docs/tasks/traffic-management/ingress/gateway-api/). 3. **A `GatewayClass` Resource:** Your cluster admin must create a `GatewayClass` resource that defines which controller to use. KubeRay will use this to create `Gateway` and `HTTPRoute` objects. **Example: Istio `GatewayClass`** ```yaml apiVersion: gateway.networking.k8s.io/v1 kind: GatewayClass metadata: name: istio spec: controllerName: istio.io/gateway-controller ``` You will need to use the `metadata.name` (e.g. `istio`) in the `gatewayClassName` field of the `RayService` spec. 4. **Ray Autoscaler:** Incremental upgrades require the Ray Autoscaler to be enabled in your `RayCluster` spec, as KubeRay manages the upgrade by adjusting the `target_capacity` for Ray Serve which adjusts the number of Serve replicas for each deployment. These Serve replicas are translated into a resource load which the Ray autoscaler considers when determining the number of Pods to provision with KubeRay. For information on enabling and configuring Ray autoscaling on Kubernetes, see [KubeRay Autoscaling](https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/configuring-autoscaling.html). #### Example: Setting up a RayService on kind: The following instructions detail the minimal steps to configure a cluster with KubeRay and trigger a zero-downtime incremental upgrade for a RayService. 1. Create a kind cluster ```bash kind create cluster --image=kindest/node:v1.29.0 ``` We use `v1.29.0` which is known to be compatible with recent Istio versions. 2. Install istio ``` istioctl install --set profile=demo -y ``` 3. Install Gateway API CRDs ```bash kubectl apply --server-side -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.3.0/standard-install.yaml ``` 4. 
Create a Gateway class with the following spec ```yaml echo "apiVersion: gateway.networking.k8s.io/v1 kind: GatewayClass metadata: name: istio spec: controllerName: istio.io/gateway-controller" | kubectl apply -f - ``` ```yaml kubectl get gatewayclass NAME CONTROLLER ACCEPTED AGE istio istio.io/gateway-controller True 4s istio-remote istio.io/unmanaged-gateway True 3s ``` 5. Install and Configure MetalLB for LoadBalancer on kind ```bash kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/v0.14.7/config/manifests/metallb-native.yaml ``` Create an `IPAddressPool` with the following spec for MetalLB ```yaml echo "apiVersion: metallb.io/v1beta1 kind: IPAddressPool metadata: name: kind-pool namespace: metallb-system spec: addresses: - 192.168.8.200-192.168.8.250 # adjust based on your subnets range --- apiVersion: metallb.io/v1beta1 kind: L2Advertisement metadata: name: default namespace: metallb-system spec: ipAddressPools: - kind-pool" | kubectl apply -f - ``` 6. Install the KubeRay operator, following [these instructions](https://docs.ray.io/en/latest/cluster/kubernetes/getting-started/kuberay-operator-installation.html). The minimum version for this guide is v1.5.1. To use this feature, the `RayServiceIncrementalUpgrade` feature gate must be enabled. To enable the feature gate when installing the kuberay operator, run the following command: ```bash helm install kuberay-operator kuberay/kuberay-operator --version v1.5.1 \ --set featureGates\[0\].name=RayServiceIncrementalUpgrade \ --set featureGates\[0\].enabled=true ``` 7. Create a RayService with incremental upgrade enabled. ```bash kubectl apply -f https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/ray-service.incremental-upgrade.yaml ``` 8. Update one of the fields under `rayClusterConfig` and re-apply the RayService to trigger a zero-downtime upgrade. ### 2. How it Works: The Upgrade Process Understanding the lifecycle of an incremental upgrade helps in monitoring and configuration. 1. **Trigger:** You trigger an upgrade by updating the `RayService` spec, such as changing the container `image` or updating the `resources` used by a worker group in the `rayClusterSpec`. 2. **Pending Cluster Creation:** KubeRay detects the change and creates a new, *pending* `RayCluster`. It sets this cluster's initial `target_capacity` (the percentage of serve replicas it should run) to `0%`. 3. **Gateway and Route Creation:** KubeRay creates a `Gateway` resource for your `RayService` and an `HTTPRoute` resource that initially routes 100% of traffic to the old, *active* cluster and 0% to the new, *pending* cluster. 4. **The Upgrade Loop Begins:** The KubeRay controller now enters a loop that repeats three phases until the upgrade is complete. This loop ensures that the total cluster capacity only exceeds 100% by at most `maxSurgePercent`, preventing resource starvation. Let's use an example: `maxSurgePercent: 20` and `stepSizePercent: 5`. * **Initial State:** * Active Cluster `target_capacity`: 100% * Pending Cluster `target_capacity`: 0% * **Total Capacity: 100%** --- **The Upgrade Cycle** * **Phase 1: Scale Up Pending Cluster (Capacity)** * KubeRay checks the total capacity (100%) and sees it's $\le$ 100%. It increases the **pending** cluster's `target_capacity` by `maxSurgePercent`. 
* Active `target_capacity`: 100% * Pending `target_capacity`: 0% $\rightarrow$ **20%** * **Total Capacity: 120%** * If the Ray Serve autoscaler is enabled, the Serve application will scale its `num_replicas` from `min_replicas` based on the new `target_capacity`. Without the Ray Serve autoscaler enabled, the new `target_capacity` value will directly adjust `num_replicas` for each Serve deployment. Depending on the updated value of`num_replicas`, the Ray Autoscaler will begin provisioning pods for the pending cluster to handle the updated resource load. * **Phase 2: Shift Traffic (HTTPRoute)** * KubeRay waits for the pending cluster's new pods to be ready. There may be a temporary drop in requests-per-second while worker Pods are being created for the updated Ray serve replicas. * Once ready, it begins to *gradually* shift traffic. Every `intervalSeconds`, it updates the `HTTPRoute` weights, moving `stepSizePercent` (5%) of traffic from the active to the pending cluster. * This continues until the *actual* traffic (`trafficRoutedPercent`) "catches up" to the *pending* cluster's `target_capacity` (20% in this example). * **Phase 3: Scale Down Active Cluster (Capacity)** * Once Phase 2 is complete (`trafficRoutedPercent` == 20%), the loop runs again. * KubeRay checks the total capacity (120%) and sees it's > 100%. It decreases the **active** cluster's `target_capacity` by `maxSurgePercent`. * Active `target_capacity`: 100% $\rightarrow$ **80%** * Pending `target_capacity`: 20% * **Total Capacity: 100%** * The Ray Autoscaler terminates pods on the active cluster as they become idle. --- 5. **Completion & Cleanup:** This cycle of **(Scale Up Pending $\rightarrow$ Shift Traffic $\rightarrow$ Scale Down Active)** continues until the pending cluster is at 100% `target_capacity` and 100% `trafficRoutedPercent`, and the active cluster is at 0%. KubeRay then promotes the pending cluster to active, updates the `HTTPRoute` to send 100% of traffic to it, and safely terminates the old `RayCluster`. ### 3. Example `RayService` Configuration To use the feature, set the `upgradeStrategy.type` to `NewClusterWithIncrementalUpgrade` and provide the required options. ```yaml apiVersion: ray.io/v1 kind: RayService metadata: name: rayservice-incremental-upgrade spec: # This is the main configuration block for the upgrade upgradeStrategy: # 1. Set the type to NewClusterWithIncrementalUpgrade type: "NewClusterWithIncrementalUpgrade" clusterUpgradeOptions: # 2. The name of your K8s GatewayClass gatewayClassName: "istio" # 3. Capacity scaling: Increase new cluster's target_capacity # by 20% in each scaling step. maxSurgePercent: 20 # 4. Traffic shifting: Move 5% of traffic from old to new # cluster every intervalSeconds. stepSizePercent: 5 # 5. Interval seconds controls the pace of traffic migration during the upgrade. intervalSeconds: 10 # This is your Serve config serveConfigV2: | applications: - name: my_app import_path: my_model:app route_prefix: / deployments: - name: MyModel num_replicas: 10 ray_actor_options: resources: { "GPU": 1 } autoscaling_config: min_replicas: 0 max_replicas: 20 # This is your RayCluster config (autoscaling must be enabled) rayClusterSpec: enableInTreeAutoscaling: true headGroupSpec: # ... head spec ... workerGroupSpecs: - groupName: gpu-worker replicas: 0 minReplicas: 0 maxReplicas: 20 template: # ... pod spec with GPU requests ... ``` ### 4. 
Trigger the Upgrade Incremental upgrades are triggered exactly like standard zero-downtime upgrades in KubeRay: by modifying the `spec.rayClusterConfig` in your RayService Custom Resource. When KubeRay detects a change in the cluster specification (such as a new container image, modified resource limits, or updated environment variables), it calculates a new hash. If the hash differs from the active cluster and incremental upgrades are enabled, the `NewClusterWithIncrementalUpgrade` strategy is automatically initiated. Updates to the cluster specifications can occur by running `kubectl apply -f` on the updated YAML configuration file, or by directly editing the CR using `kubectl edit rayservice `. ### 5. Monitoring the Upgrade You can monitor the progress of the upgrade by inspecting the `RayService` status and the `HTTPRoute` object. 1. **Check `RayService` Status:** ```bash kubectl describe rayservice rayservice-incremental-upgrade ``` Look at the `Status` section. You will see both `Active Service Status` and `Pending Service Status`, which show the state of both clusters. Pay close attention to these two new fields: * **`Target Capacity`:** The percentage of replicas KubeRay is *telling* this cluster to scale to. * **`Traffic Routed Percent`:** The percentage of traffic KubeRay is *currently* sending to this cluster via the Gateway. During an upgrade, you will see `Target Capacity` on the pending cluster increase in steps (e.g., 20%, 40%) and `Traffic Routed Percent` gradually climb to meet it. 2. **Check `HTTPRoute` Weights:** You can also see the traffic weights directly on the `HTTPRoute` resource KubeRay manages. ```bash kubectl get httproute rayservice-incremental-upgrade-httproute -o yaml ``` Look at the `spec.rules.backendRefs`. You will see the `weight` for the old and new services change in real-time as the traffic shift (Phase 2) progresses. For example: ```yaml apiVersion: gateway.networking.k8s.io/v1 kind: HTTPRoute metadata: creationTimestamp: "2025-12-07T07:42:24Z" generation: 10 name: stress-test-serve-httproute namespace: default ownerReferences: - apiVersion: ray.io/v1 blockOwnerDeletion: true controller: true kind: RayService name: stress-test-serve uid: 83a785cc-8745-4ccd-9973-2fc9f27000cc resourceVersion: "3714" uid: 660b14b5-78df-4507-b818-05989b1ef806 spec: parentRefs: - group: gateway.networking.k8s.io kind: Gateway name: stress-test-serve-gateway namespace: default rules: - backendRefs: - group: "" kind: Service name: stress-test-serve-f6z4w-serve-svc namespace: default port: 8000 weight: 90 - group: "" kind: Service name: stress-test-serve-xclvf-serve-svc namespace: default port: 8000 weight: 10 matches: - path: type: PathPrefix value: / status: parents: - conditions: - lastTransitionTime: "2025-12-07T07:42:24Z" message: Route was valid observedGeneration: 10 reason: Accepted status: "True" type: Accepted - lastTransitionTime: "2025-12-07T07:42:24Z" message: All references resolved observedGeneration: 10 reason: ResolvedRefs status: "True" type: ResolvedRefs controllerName: istio.io/gateway-controller parentRef: group: gateway.networking.k8s.io kind: Gateway name: stress-test-serve-gateway namespace: default ``` ## How to upgrade safely? Since this feature is alpha and rollback is not yet supported, we recommend conservative parameter settings to minimize risk during upgrades. ### Recommended Parameters To upgrade safely, you should: 1. Scale up 1 worker pod in the new cluster and scale down 1 worker pod in the old cluster at a time 2. 
Make the upgrade process gradual to allow the Ray Serve autoscaler and Ray autoscaler to adapt Based on these principles, we recommend: - **maxSurgePercent**: Calculate based on the formula below - **stepSizePercent**: Set to a value less than `maxSurgePercent` - **intervalSeconds**: 60 ### Calculating maxSurgePercent The `maxSurgePercent` determines the maximum percentage of additional resources that can be provisioned during the upgrade. To calculate the minimum safe value: \begin{equation} \text{maxSurgePercent} = \frac{\text{resources per pod}}{\text{total cluster resources}} \times 100 \end{equation} #### Example Consider a RayCluster with the following configuration: - `excludeHeadService`: true - Head pod: No GPU - 5 worker pods, each with 1 GPU (total: 5 GPUs) For this cluster: \begin{equation} \text{maxSurgePercent} = \frac{1 \text{ GPU}}{5 \text{ GPUs}} \times 100 = 20\% \end{equation} With `maxSurgePercent: 20`, the upgrade process ensures: - The new cluster scales up **1 worker pod at a time** (20% of 5 = 1 pod) - The old cluster scales down **1 worker pod at a time** - Your cluster temporarily uses 6 GPUs during the transition (5 original + 1 new) This configuration guarantees you have sufficient resources to run at least one additional worker pod during the upgrade without resource contention. ### Understanding intervalSeconds Set `intervalSeconds` to 60 seconds to give the Ray Serve autoscaler and Ray autoscaler sufficient time to: - Detect load changes - Immediately scale replicas up or down to enforce new min_replicas and max_replicas limits (via target_capacity) - Scale down replicas immediately if they exceed the new max_replicas - Scale up replicas immediately if they fall below the new min_replicas - Provision resources A larger interval prevents the upgrade controller from making changes faster than the autoscaler can react, reducing the risk of service disruption. ### Example Configuration ```yaml upgradeStrategy: maxSurgePercent: 20 # Calculated: (1 GPU / 5 GPUs) × 100 stepSizePercent: 10 # Less than maxSurgePercent intervalSeconds: 60 # Wait 1 minute between steps ``` ## API Overview (Reference) This section details the new and updated fields in the `RayService` CRD. ### `RayService.spec.upgradeStrategy` | Field | Type | Description | Required | Default | | :--- | :--- | :--- | :--- | :--- | | `type` | `string` | The strategy to use for upgrades. Can be `NewCluster`, `None`, or `NewClusterWithIncrementalUpgrade`. | No | `NewCluster` | | `clusterUpgradeOptions` | `object` | Container for incremental upgrade settings. **Required if `type` is `NewClusterWithIncrementalUpgrade`.** The `RayServiceIncrementalUpgrade` feature gate must be enabled. | No | `nil` | ### `RayService.spec.upgradeStrategy.clusterUpgradeOptions` This block is required *only* if `type` is set to `NewClusterWithIncrementalUpgrade`. | Field | Type | Description | Required | Default | | :--- | :--- | :--- | :--- | :--- | | `maxSurgePercent` | `int32` | The percentage of *capacity* (Serve replicas) to add to the new cluster in each scaling step. For example, a value of `20` means the new cluster's `target_capacity` will increase in 20% increments (0% -> 20% -> 40%...). Must be between 0 and 100. | No | `100` | | `stepSizePercent` | `int32` | The percentage of *traffic* to shift from the old to the new cluster during each interval. Must be between 0 and 100. | **Yes** | N/A | | `intervalSeconds` | `int32` | The time in seconds to wait between shifting traffic by `stepSizePercent`. 
| **Yes** | N/A |
| `gatewayClassName` | `string` | The `metadata.name` of the `GatewayClass` resource KubeRay should use to create `Gateway` and `HTTPRoute` objects. | **Yes** | N/A |

### `RayService.status.activeServiceStatus` & `RayService.status.pendingServiceStatus`

Three new fields are added to both the `activeServiceStatus` and `pendingServiceStatus` blocks to provide visibility into the upgrade process.

| Field | Type | Description |
| :--- | :--- | :--- |
| `targetCapacity` | `int32` | The target percentage of Serve replicas this cluster is *configured* to handle (from 0 to 100). This is controlled by KubeRay based on `maxSurgePercent`. |
| `trafficRoutedPercent` | `int32` | The *actual* percentage of traffic (from 0 to 100) currently being routed to this cluster's endpoint. This is controlled by KubeRay during an upgrade based on `stepSizePercent` and `intervalSeconds`. |
| `lastTrafficMigratedTime` | `metav1.Time` | A timestamp indicating the last time `trafficRoutedPercent` was updated. |

#### Next steps:

* See [Deploy on Kubernetes](https://docs.ray.io/en/latest/serve/production-guide/kubernetes.html) for more information about deploying Ray Serve with KubeRay.
* See [Ray Serve Autoscaling](https://docs.ray.io/en/latest/serve/autoscaling-guide.html) to configure your Serve deployments to scale based on traffic load.

---

(kuberay-rayservice-no-ray-serve-replica)=

# RayService worker Pods aren't ready

This guide explores a specific scenario in KubeRay's RayService API where a Ray worker Pod remains in an unready state due to the absence of a Ray Serve replica.

To better understand this section, you should be familiar with the following Ray Serve components: the [Ray Serve replica and ProxyActor](https://docs.ray.io/en/latest/serve/architecture.html#high-level-view).

ProxyActor is responsible for forwarding incoming requests to the corresponding Ray Serve replicas. Hence, if a Ray Pod without a running ProxyActor receives requests, those requests fail. By default, Ray Serve only creates a ProxyActor on Ray Pods with running Ray Serve replicas. On Pods without a ProxyActor, KubeRay's readiness probe fails, which renders those Pods unready and prevents the Kubernetes serve service from routing requests to them.

To illustrate, the following example serves one simple Ray Serve app using RayService.

## Step 1: Create a Kubernetes cluster with Kind

```sh
kind create cluster --image=kindest/node:v1.26.0
```

## Step 2: Install the KubeRay operator

Follow [this document](kuberay-operator-deploy) to install the latest stable KubeRay operator using the Helm repository.

## Step 3: Install a RayService

```sh
curl -O https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/ray-service.no-ray-serve-replica.yaml
kubectl apply -f ray-service.no-ray-serve-replica.yaml
```

Look at the Ray Serve configuration `serveConfigV2` embedded in the RayService YAML. Notice the only deployment in `deployments` of the application named `simple_app`:

* `num_replicas`: Controls the number of replicas that handle requests to this deployment. Initialize it to 1 so that the overall number of Ray Serve replicas is 1.
* `max_replicas_per_node`: Controls the maximum number of replicas on a single Pod. See [Ray Serve Documentation](https://docs.ray.io/en/master/serve/configure-serve-deployment.html) for more details.
```yaml
serveConfigV2: |
  applications:
    - name: simple_app
      import_path: ray-operator.config.samples.ray-serve.single_deployment_dag:DagNode
      route_prefix: /basic
      runtime_env:
        working_dir: "https://github.com/ray-project/kuberay/archive/master.zip"
      deployments:
        - name: BaseService
          num_replicas: 1
          max_replicas_per_node: 1
          ray_actor_options:
            num_cpus: 0.1
```

Look at the head Pod configuration `rayClusterConfig:headGroupSpec` embedded in the RayService YAML. The configuration sets the CPU resources for the head Pod to 0 by passing the option `num-cpus: "0"` to `rayStartParams`. This setup avoids Ray Serve replicas running on the head Pod. See [rayStartParams](https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md) for more details.

```yaml
headGroupSpec:
  rayStartParams:
    num-cpus: "0"
  template:
    ...
```

## Step 4: Why isn't one worker Pod ready?

```sh
# Step 4.1: Wait until the RayService is ready to serve requests.
kubectl describe rayservices.ray.io rayservice-no-ray-serve-replica

# [Example output]
# Conditions:
#   Last Transition Time:  2025-03-18T14:14:43Z
#   Message:               Number of serve endpoints is greater than 0
#   Observed Generation:   1
#   Reason:                NonZeroServeEndpoints
#   Status:                True
#   Type:                  Ready
#   Last Transition Time:  2025-03-18T14:12:03Z
#   Message:               Active Ray cluster exists and no pending Ray cluster
#   Observed Generation:   1
#   Reason:                NoPendingCluster
#   Status:                False
#   Type:                  UpgradeInProgress

# Step 4.2: List all Ray Pods in the `default` namespace.
kubectl get pods -l=ray.io/is-ray-node=yes

# [Example output]
# NAME                                                              READY   STATUS    RESTARTS   AGE
# rayservice-no-ray-serve-replica-raycluster-dnm28-head-9h2qt       1/1     Running   0          2m21s
# rayservice-no-ray-serve-replica-raycluster-dnm28-s-worker-46t7l   1/1     Running   0          2m21s
# rayservice-no-ray-serve-replica-raycluster-dnm28-s-worker-77rzk   0/1     Running   0          2m20s

# Step 4.3: Check the unready worker Pod's events.
kubectl describe pods {YOUR_UNREADY_WORKER_POD_NAME}

# [Example output]
# Events:
#   Type     Reason     Age                   From               Message
#   ----     ------     ----                  ----               -------
#   Normal   Scheduled  3m4s                  default-scheduler  Successfully assigned default/rayservice-no-ray-serve-replica-raycluster-dnm28-s-worker-77rzk to kind-control-plane
#   Normal   Pulled     3m3s                  kubelet            Container image "rayproject/ray:2.46.0" already present on machine
#   Normal   Created    3m3s                  kubelet            Created container wait-gcs-ready
#   Normal   Started    3m3s                  kubelet            Started container wait-gcs-ready
#   Normal   Pulled     2m57s                 kubelet            Container image "rayproject/ray:2.46.0" already present on machine
#   Normal   Created    2m57s                 kubelet            Created container ray-worker
#   Normal   Started    2m57s                 kubelet            Started container ray-worker
#   Warning  Unhealthy  78s (x19 over 2m43s)  kubelet            Readiness probe failed: success
```

Look at the output of Step 4.2. One worker Pod is running and ready, while the other is running but not ready. Starting from Ray 2.8, a Ray worker Pod that doesn't have any Ray Serve replica won't have a Proxy actor. Starting from KubeRay v1.1.0, KubeRay adds a readiness probe to every worker Pod's Ray container to check whether the worker Pod has a Proxy actor. If the worker Pod lacks a Proxy actor, the readiness probe fails, rendering the worker Pod unready, and thus it doesn't receive any traffic.

With this `spec.serveConfigV2`, KubeRay creates only one Ray Serve replica and schedules it on one of the worker Pods. The worker Pod with the Ray Serve replica also gets a Proxy actor, so KubeRay marks it as ready. KubeRay marks the other worker Pod, which has neither a Ray Serve replica nor a Proxy actor, as unready.
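If you want to confirm from the Ray side which worker Pod hosts the Serve replica and the Proxy actor, you can query the Ray state API from the head Pod. The following is a minimal sketch; it assumes the `ray` state CLI commands (`ray list actors`, `ray list nodes`) are available in the head Pod's container, and actor class names can vary slightly between Ray versions.

```sh
# Find the head Pod, using the default KubeRay node-type label.
export HEAD_POD=$(kubectl get pods --selector=ray.io/node-type=head -o custom-columns=POD:metadata.name --no-headers)

# List the alive actors. Look for the Proxy actor and the Serve replica actors,
# and note the node IDs that they run on.
kubectl exec -it $HEAD_POD -- ray list actors --filter "state=ALIVE"

# Map those node IDs back to node IPs to see which worker Pod hosts each actor.
kubectl exec -it $HEAD_POD -- ray list nodes
```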
## Step 5: Verify the status of the Serve apps

```sh
kubectl port-forward svc/rayservice-no-ray-serve-replica-head-svc 8265:8265
```

See [rayservice-troubleshooting.md](kuberay-raysvc-troubleshoot) for more details on RayService observability. Below is a screenshot example of the Serve page in the Ray dashboard. Note that a `ray::ServeReplica::simple_app::BaseService` and a `ray::ProxyActor` are running on one of the worker Pods, while the other worker Pod has no Ray Serve replica or Proxy actor. KubeRay marks the former as ready and the latter as unready.

![Ray Serve Dashboard](../images/rayservice-no-ray-serve-replica-dashboard.png)

## Step 6: Send requests to the Serve apps by the Kubernetes serve service

`rayservice-no-ray-serve-replica-serve-svc` does traffic routing among all the workers that have Ray Serve replicas. Although one worker Pod is unready, Ray Serve can still route the traffic to the ready worker Pod with a Ray Serve replica running. Therefore, users can still send requests to the app and receive responses from it.

```sh
# Step 6.1: Run a curl Pod.
# If you already have a curl Pod, you can use `kubectl exec -it -- sh` to access the Pod.
kubectl run curl --image=radial/busyboxplus:curl -i --tty

# Step 6.2: Send a request to the simple_app.
curl -X POST -H 'Content-Type: application/json' rayservice-no-ray-serve-replica-serve-svc:8000/basic
# [Expected output]: hello world
```

## Step 7: In-place update for Ray Serve apps

Update the `num_replicas` for the app from `1` to `2` in `ray-service.no-ray-serve-replica.yaml`. This change reconfigures the existing RayCluster.

```sh
# Step 7.1: Update the num_replicas of the app from 1 to 2.
# [ray-service.no-ray-serve-replica.yaml]
# deployments:
#   - name: BaseService
#     num_replicas: 2
#     max_replicas_per_node: 1
#     ray_actor_options:
#       num_cpus: 0.1

# Step 7.2: Apply the updated RayService config.
kubectl apply -f ray-service.no-ray-serve-replica.yaml

# Step 7.3: List all Ray Pods in the `default` namespace.
kubectl get pods -l=ray.io/is-ray-node=yes

# [Example output]
# NAME                                                              READY   STATUS    RESTARTS   AGE
# rayservice-no-ray-serve-replica-raycluster-dnm28-head-9h2qt       1/1     Running   0          46m
# rayservice-no-ray-serve-replica-raycluster-dnm28-s-worker-46t7l   1/1     Running   0          46m
# rayservice-no-ray-serve-replica-raycluster-dnm28-s-worker-77rzk   1/1     Running   0          46m
```

After reconfiguration, KubeRay requests the head Pod to create an additional Ray Serve replica to match the `num_replicas` configuration. Because `max_replicas_per_node` is `1`, the new Ray Serve replica runs on the worker Pod without any replicas. After that, KubeRay marks the worker Pod as ready.

## Step 8: Clean up the Kubernetes cluster

```sh
# Delete the RayService.
kubectl delete -f ray-service.no-ray-serve-replica.yaml

# Uninstall the KubeRay operator.
helm uninstall kuberay-operator

# Delete the curl Pod.
kubectl delete pod curl
```

## Next steps

* See [RayService troubleshooting guide](kuberay-raysvc-troubleshoot) if you encounter any issues.
* See [Examples](kuberay-examples) for more RayService examples.

---

(kuberay-rayservice)=

# Deploy Ray Serve Applications

## Prerequisites

This guide mainly focuses on the behavior of KubeRay v1.4.0 and Ray 2.46.0.

## What's a RayService?

A RayService manages two components:

* **RayCluster**: Manages resources in a Kubernetes cluster.
* **Ray Serve Applications**: Manages users' applications.

## What does the RayService provide?
* **Kubernetes-native support for Ray clusters and Ray Serve applications:** After using a Kubernetes configuration to define a Ray cluster and its Ray Serve applications, you can use `kubectl` to create the cluster and its applications.
* **In-place updates for Ray Serve applications:** Users can update the Ray Serve configuration in the RayService CR configuration and use `kubectl apply` to update the applications. See [Step 7](#step-7-in-place-update-for-ray-serve-applications) for more details.
* **Zero downtime upgrades for Ray clusters:** Users can update the Ray cluster configuration in the RayService CR configuration and use `kubectl apply` to update the cluster. RayService temporarily creates a pending cluster and waits for it to be ready, then switches traffic to the new cluster and terminates the old one. See [Step 8](#step-8-zero-downtime-upgrade-for-ray-clusters) for more details.
* **Highly available services:** See [RayService high availability](kuberay-rayservice-ha) for more details.

## Example: Serve two simple Ray Serve applications using RayService

## Step 1: Create a Kubernetes cluster with Kind

```sh
kind create cluster --image=kindest/node:v1.26.0
```

## Step 2: Install the KubeRay operator

Follow [this document](kuberay-operator-deploy) to install the latest stable KubeRay operator using the Helm repository.

## Step 3: Install a RayService

```sh
curl -O https://raw.githubusercontent.com/ray-project/kuberay/v1.5.1/ray-operator/config/samples/ray-service.sample.yaml
kubectl apply -f ray-service.sample.yaml
```

Look at the Ray Serve configuration `serveConfigV2` embedded in the RayService YAML. Notice two high-level applications: a fruit stand application and a calculator application. Take note of some details about the fruit stand application:

* `import_path`: The path to import the Serve application. For `fruit_app`, [fruit.py](https://github.com/ray-project/test_dag/blob/master/fruit.py) defines the application in the `deployment_graph` variable.
* `route_prefix`: See [Ray Serve API](serve-api) for more details.
* `working_dir`: The working directory points to the [test_dag](https://github.com/ray-project/test_dag/) repository, which RayService downloads at runtime and uses to start your application. See {ref}`Runtime Environments ` for more details.
* `deployments`: See [Ray Serve Documentation](https://docs.ray.io/en/master/serve/configure-serve-deployment.html).

```yaml
serveConfigV2: |
  applications:
    - name: fruit_app
      import_path: fruit.deployment_graph
      route_prefix: /fruit
      runtime_env:
        working_dir: "https://github.com/ray-project/test_dag/archive/....zip"
      deployments: ...
    - name: math_app
      import_path: conditional_dag.serve_dag
      route_prefix: /calc
      runtime_env:
        working_dir: "https://github.com/ray-project/test_dag/archive/....zip"
      deployments: ...
```

## Step 4: Verify the Kubernetes cluster status

```sh
# Step 4.1: List all RayService custom resources in the `default` namespace.
kubectl get rayservice

# [Example output]
# NAME                SERVICE STATUS   NUM SERVE ENDPOINTS
# rayservice-sample   Running          1

# Step 4.2: List all RayCluster custom resources in the `default` namespace.
kubectl get raycluster

# [Example output]
# NAME                                 DESIRED WORKERS   AVAILABLE WORKERS   CPUS    MEMORY   GPUS   STATUS   AGE
# rayservice-sample-raycluster-fj2gp   1                 1                   2500m   4Gi      0      ready    75s

# Step 4.3: List all Ray Pods in the `default` namespace.
kubectl get pods -l=ray.io/is-ray-node=yes # [Example output] # NAME READY STATUS RESTARTS AGE # rayservice-sample-raycluster-fj2gp-head-6wwqp 1/1 Running 0 93s # rayservice-sample-raycluster-fj2gp-small-group-worker-hxrxc 1/1 Running 0 93s # Step 4.4: Check whether the RayService is ready to serve requests. kubectl describe rayservices.ray.io rayservice-sample # [Example output] # Conditions: # Last Transition Time: 2025-02-13T18:28:51Z # Message: Number of serve endpoints is greater than 0 # Observed Generation: 1 # Reason: NonZeroServeEndpoints # Status: True <--- RayService is ready to serve requests # Type: Ready # Step 4.5: List services in the `default` namespace. kubectl get services # NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE # ... # rayservice-sample-head-svc ClusterIP 10.96.34.90 10001/TCP,8265/TCP,52365/TCP,6379/TCP,8080/TCP,8000/TCP 4m58s # rayservice-sample-raycluster-6mj28-head-svc ClusterIP 10.96.171.184 10001/TCP,8265/TCP,52365/TCP,6379/TCP,8080/TCP,8000/TCP 6m21s # rayservice-sample-serve-svc ClusterIP 10.96.161.84 8000/TCP 4m58s ``` KubeRay creates a RayCluster based on `spec.rayClusterConfig` defined in the RayService YAML for a RayService custom resource. Next, once the head Pod is running and ready, KubeRay submits a request to the head's dashboard port to create the Ray Serve applications defined in `spec.serveConfigV2`. Users can access the head Pod through both RayService’s head service `rayservice-sample-head-svc` and RayCluster’s head service `rayservice-sample-raycluster-xxxxx-head-svc`. However, during a zero downtime upgrade, KubeRay creates a new RayCluster and a new head service `rayservice-sample-raycluster-yyyyy-head-svc` for the new RayCluster. If you don't use `rayservice-sample-head-svc`, you need to update the ingress configuration to point to the new head service. However, if you use `rayservice-sample-head-svc`, KubeRay automatically updates the selector to point to the new head Pod, eliminating the need to update the ingress configuration. > Note: Default ports and their definitions. | Port | Definition | |-------|---------------------| | 6379 | Ray GCS | | 8265 | Ray Dashboard | | 10001 | Ray Client | | 8000 | Ray Serve | ## Step 5: Verify the status of the Serve applications ```sh # Step 5.1: Check the status of the RayService. kubectl describe rayservices rayservice-sample # [Example output: Ray Serve application statuses] # Status: # Active Service Status: # Application Statuses: # fruit_app: # Serve Deployment Statuses: # Fruit Market: # Status: HEALTHY # ... # Status: RUNNING # math_app: # Serve Deployment Statuses: # Adder: # Status: HEALTHY # ... # Status: RUNNING # [Example output: RayService conditions] # Conditions: # Last Transition Time: 2025-02-13T18:28:51Z # Message: Number of serve endpoints is greater than 0 # Observed Generation: 1 # Reason: NonZeroServeEndpoints # Status: True # Type: Ready # Last Transition Time: 2025-02-13T18:28:00Z # Message: Active Ray cluster exists and no pending Ray cluster # Observed Generation: 1 # Reason: NoPendingCluster # Status: False # Type: UpgradeInProgress # Step 5.2: Check the Serve applications in the Ray dashboard. # (1) Forward the dashboard port to localhost. # (2) Check the Serve page in the Ray dashboard at http://localhost:8265/#/serve. kubectl port-forward svc/rayservice-sample-head-svc 8265:8265 ``` * See [rayservice-troubleshooting.md](kuberay-raysvc-troubleshoot) for more details on RayService observability. Below is a screenshot example of the Serve page in the Ray dashboard. 
![Ray Serve Dashboard](../images/dashboard_serve.png) ## Step 6: Send requests to the Serve applications by the Kubernetes serve service ```sh # Step 6.1: Run a curl Pod. # If you already have a curl Pod, you can use `kubectl exec -it -- sh` to access the Pod. kubectl run curl --image=radial/busyboxplus:curl -i --tty # Step 6.2: Send a request to the fruit stand app. curl -X POST -H 'Content-Type: application/json' rayservice-sample-serve-svc:8000/fruit/ -d '["MANGO", 2]' # [Expected output]: 6 # Step 6.3: Send a request to the calculator app. curl -X POST -H 'Content-Type: application/json' rayservice-sample-serve-svc:8000/calc/ -d '["MUL", 3]' # [Expected output]: "15 pizzas please!" ``` * `rayservice-sample-serve-svc` does traffic routing among all the workers which have Ray Serve replicas. (step-7-in-place-update-for-ray-serve-applications)= ## Step 7: In-place update for Ray Serve applications You can update the configurations for the applications by modifying `serveConfigV2` in the RayService configuration file. Reapplying the modified configuration with `kubectl apply` reapplies the new configurations to the existing RayCluster instead of creating a new RayCluster. Update the price of Mango from `3` to `4` for the fruit stand app in [ray-service.sample.yaml](https://github.com/ray-project/kuberay/blob/v1.5.1/ray-operator/config/samples/ray-service.sample.yaml). This change reconfigures the existing MangoStand deployment, and future requests are going to use the updated mango price. ```sh # Step 7.1: Update the price of mangos from 3 to 4. # [ray-service.sample.yaml] # - name: MangoStand # num_replicas: 1 # max_replicas_per_node: 1 # user_config: # price: 4 # Step 7.2: Apply the updated RayService config. kubectl apply -f ray-service.sample.yaml # Step 7.3: Check the status of the RayService. kubectl describe rayservices rayservice-sample # [Example output] # Serve Deployment Statuses: # Mango Stand: # Status: UPDATING # Step 7.4: Send a request to the fruit stand app again after the Serve deployment status changes from UPDATING to HEALTHY. # (Execute the command in the curl Pod from Step 6) curl -X POST -H 'Content-Type: application/json' rayservice-sample-serve-svc:8000/fruit/ -d '["MANGO", 2]' # [Expected output]: 8 ``` (step-8-zero-downtime-upgrade-for-ray-clusters)= ## Step 8: Zero downtime upgrade for Ray clusters This section describes the default `NewCluster` upgrade strategy. For large-scale deployments where duplicating resources isn't feasible, see [RayService incremental upgrade](kuberay-rayservice-incremental-upgrade) for the `NewClusterWithIncrementalUpgrade` strategy, which uses fewer resources during upgrades. In Step 7, modifying `serveConfigV2` doesn't trigger a zero downtime upgrade for Ray clusters. Instead, it reapplies the new configurations to the existing RayCluster. However, if you modify `spec.rayClusterConfig` in the RayService YAML file, it triggers a zero downtime upgrade for Ray clusters. RayService temporarily creates a new RayCluster and waits for it to be ready, then switches traffic to the new RayCluster by updating the selector of the head service managed by RayService `rayservice-sample-head-svc` and terminates the old one. During the zero downtime upgrade process, RayService creates a new RayCluster temporarily and waits for it to become ready. 
Once the new RayCluster is ready, RayService updates the selector of the head service it manages, `rayservice-sample-head-svc`, to point to the new RayCluster and switch the traffic to it. Finally, KubeRay deletes the old RayCluster.

Certain fields are exceptions that don't trigger a zero downtime upgrade: the fields managed by the Ray Autoscaler, namely `replicas`, `minReplicas`, `maxReplicas`, and `scaleStrategy.workersToDelete`. When you update these fields, KubeRay doesn't propagate the update from the RayService to the RayCluster custom resource, so nothing happens.

```sh
# Step 8.1: Update `spec.rayClusterConfig.workerGroupSpecs[0].replicas` in the RayService YAML file from 1 to 2.
# This field is an exception that doesn't trigger a zero-downtime upgrade, and KubeRay doesn't update the
# RayCluster as a result. Therefore, no changes occur.
kubectl apply -f ray-service.sample.yaml

# Step 8.2: Check the RayService CR.
kubectl describe rayservices rayservice-sample

# Worker Group Specs:
#   ...
#   Replicas:  2

# Step 8.3: Check the RayCluster CR. The update doesn't propagate to the RayCluster CR.
kubectl describe rayclusters $YOUR_RAY_CLUSTER

# Worker Group Specs:
#   ...
#   Replicas:  1

# Step 8.4: Update `spec.rayClusterConfig.rayVersion` to `2.100.0`.
# This field determines the Autoscaler sidecar image, and triggers a zero downtime upgrade.
kubectl apply -f ray-service.sample.yaml

# Step 8.5: List all RayCluster custom resources in the `default` namespace.
# Note that the new RayCluster is created based on the updated RayService config to have 2 workers.
kubectl get raycluster

# NAME                                 DESIRED WORKERS   AVAILABLE WORKERS   CPUS    MEMORY   GPUS   STATUS   AGE
# rayservice-sample-raycluster-fj2gp   1                 1                   2500m   4Gi      0      ready    40m
# rayservice-sample-raycluster-pddrb   2                 2                   3       6Gi      0               13s

# Step 8.6: Wait for the old RayCluster to terminate.

# Step 8.7: Submit a request to the fruit stand app via the same serve service.
curl -X POST -H 'Content-Type: application/json' rayservice-sample-serve-svc:8000/fruit/ -d '["MANGO", 2]'
# [Expected output]: 8
```

## Step 9: Clean up the Kubernetes cluster

```sh
# Delete the RayService.
kubectl delete -f ray-service.sample.yaml

# Uninstall the KubeRay operator.
helm uninstall kuberay-operator

# Delete the curl Pod.
kubectl delete pod curl
```

## Next steps

* See [RayService high availability](kuberay-rayservice-ha) for more details on RayService HA.
* See [RayService troubleshooting guide](kuberay-raysvc-troubleshoot) if you encounter any issues.
* See [Examples](kuberay-examples) for more RayService examples. The [MobileNet example](kuberay-mobilenet-rayservice-example) is a good example to start with because it doesn't require GPUs and is easy to run on a local machine.

---

(reduce-image-pull-latency)=

# Reducing image pull latency on Kubernetes

This guide outlines strategies to reduce image pull latency for Ray clusters on Kubernetes. Some of these strategies are provider-agnostic so you can use them on any Kubernetes cluster, while others leverage capabilities specific to certain cloud providers.

## Image pull latency

Ray container images can often be several gigabytes, primarily due to the Python dependencies included. Other factors can also contribute to image size. Pulling large images from remote repositories can slow down Ray cluster startup times. The time required to download an image depends on several factors, including:

* Whether image layers are already cached on the node.
* The overall size of the image.
* The reliability and throughput of the remote repository.

## Strategies for reducing image pulling latency

The following sections discuss strategies for reducing image pull latency.

### Preload images on every node using a DaemonSet

You can ensure that your Ray images are always cached on every node by running a DaemonSet that pre-pulls the images. This approach downloads the image to each node ahead of time, reducing the pull time when Kubernetes schedules a Ray Pod on that node. The following is an example DaemonSet configuration that uses the image `rayproject/ray:2.40.0`:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ray-image-preloader
  labels:
    k8s-app: ray-image-preloader
spec:
  selector:
    matchLabels:
      k8s-app: ray-image-preloader
  template:
    metadata:
      labels:
        name: ray-image-preloader
        k8s-app: ray-image-preloader
    spec:
      containers:
      - image: rayproject/ray:2.40.0
        name: ray-image-preloader
        command: [ "sleep", "inf" ]
```

> **Note:** Ensure that the image tag that you use in the DaemonSet is consistent with the Ray images your Ray cluster uses.

### Preload images into machine images

Some cloud providers allow you to build custom machine images for your Kubernetes nodes. Including your Ray images in these custom machine images ensures that the images are cached locally when your nodes start up, avoiding the need to pull them from a remote registry. While this approach can be effective, it's generally not recommended, as changing machine images often requires multiple steps and is tightly coupled to the lifecycle of your nodes.

### Use private image registries

For production environments, it's generally recommended to avoid pulling images from the public internet. Instead, host your images closer to your cluster to reduce pull times. Cloud providers like Google Cloud and AWS offer services such as Artifact Registry (AR) and Elastic Container Registry (ECR), respectively. Using these services ensures that traffic for image pulls remains within the provider's internal network, avoiding network hops on the public internet and resulting in faster pull times.

### Enable Image streaming (GKE only)

If you're using Google Kubernetes Engine (GKE), you can leverage [Image streaming](https://cloud.google.com/kubernetes-engine/docs/how-to/image-streaming).

With Image streaming, GKE uses a remote filesystem as the root filesystem for any containers that use eligible container images. GKE streams image data from the remote filesystem as needed by your workloads. While streaming the image data, GKE downloads the entire container image onto the local disk in the background and caches it. GKE then serves future data read requests from the cached image. When you deploy workloads that need to read specific files in the container image, the Image streaming backend serves only those requested files.

Only container images hosted on [Artifact Registry](https://cloud.google.com/artifact-registry/docs/overview) are eligible for Image streaming.

> **Note:** You might not notice the benefits of Image streaming during the first pull of an eligible image. However, after Image streaming caches the image, future image pulls on any cluster benefit from Image streaming.
You can enable Image streaming when creating a GKE cluster by setting the `--enable-image-streaming` flag:

```
gcloud container clusters create CLUSTER_NAME \
    --zone=COMPUTE_ZONE \
    --image-type="COS_CONTAINERD" \
    --enable-image-streaming
```

See [Enable Image streaming on clusters](https://cloud.google.com/kubernetes-engine/docs/how-to/image-streaming#enable_on_clusters) for more details.

### Enable secondary boot disks (GKE only)

If you're using Google Kubernetes Engine (GKE), you can enable the [secondary boot disk to preload data or container images](https://cloud.google.com/kubernetes-engine/docs/how-to/data-container-image-preloading). GKE enables secondary boot disks per node pool. Once enabled, GKE attaches a Persistent Disk to each node within the node pool. The images within the Persistent Disk are immediately accessible to containers once Ray schedules workloads on those nodes. Including Ray images in the secondary boot disk can significantly reduce image pull latency.

See [Prepare the secondary boot disk image](https://cloud.google.com/kubernetes-engine/docs/how-to/data-container-image-preloading#prepare) for detailed steps on how to prepare the secondary boot disk. See [Configure the secondary boot disk](https://cloud.google.com/kubernetes-engine/docs/how-to/data-container-image-preloading#configure) for how to enable secondary boot disks for your node pools.

---

(kuberay-storage)=

# Best Practices for Storage and Dependencies

This document contains recommendations for setting up storage and handling application dependencies for your Ray deployment on Kubernetes.

When you set up Ray on Kubernetes, the [KubeRay documentation](kuberay-quickstart) provides an overview of how to configure the operator to execute and manage the Ray cluster lifecycle. However, as an administrator, you may still have questions about actual user workflows. For example:

* How do you ship or run code on the Ray cluster?
* What type of storage system should you set up for artifacts?
* How do you handle package dependencies for your application?

The answers to these questions vary between development and production. This table summarizes the recommended setup for each situation:

| | Interactive Development | Production |
|---|---|---|
| Cluster Configuration | KubeRay YAML | KubeRay YAML |
| Code | Run driver or Jupyter notebook on head node | Bake code into Docker image |
| Artifact Storage | Set up an EFS or Cloud Storage (S3, GS) | Set up an EFS or Cloud Storage (S3, GS) |
| Package Dependencies | Install onto NFS or Use runtime environments | Bake into Docker image |

Table 1: Table comparing recommended setup for development and production.

## Interactive development

To provide an interactive development environment for data scientists and ML practitioners, we recommend setting up the code, storage, and dependencies in a way that reduces context switches for developers and shortens iteration times.

```{eval-rst}
.. image:: ../images/interactive-dev.png
   :align: center
.. Find the source document here (https://whimsical.com/clusters-P5Y6R23riCuNb6xwXVXN72)
```

### Storage

Use one of these two standard solutions for artifact and log storage during the development process, depending on your use case:

* POSIX-compliant network file storage, like Network File System (NFS) and Elastic File Service (EFS): This approach is useful when you want to have artifacts or dependencies accessible across different nodes with low latency. For example, experiment logs of different models trained on different Ray tasks.
* Cloud storage, like AWS Simple Storage Service (S3) or GCP Google Storage (GS): This approach is useful for large artifacts or datasets that you need to access with high throughput.

Ray's AI libraries such as Ray Data, Ray Train, and Ray Tune come with out-of-the-box capabilities to read and write from cloud storage and local or networked storage.

### Driver script

Run the main, or driver, script on the head node of the cluster. Ray Core and library programs often assume that the driver is on the head node and take advantage of the local storage. For example, Ray Tune generates log files on the head node by default.

A typical workflow can look like this:

* Start a Jupyter server on the head node
* SSH onto the head node and run the driver script or application there
* Use the Ray Job Submission client to submit code from a local machine onto a cluster

### Dependencies

For local dependencies, for example, if you’re working in a mono-repo, or external dependencies, like a pip package, use one of the following options:

* Put the code and install the packages onto your NFS. The benefit is that you can quickly interact with the rest of the codebase and dependencies without shipping it across a cluster every time.
* Use the [runtime env](runtime-environments) with the [Ray Job Submission Client](ray.job_submission.JobSubmissionClient), which can pull down code from S3 or ship code from your local working directory onto the remote cluster.
* Bake remote and local dependencies into a published Docker image for all nodes to use. See [Custom Docker Images](serve-custom-docker-images). This approach is the most common way to deploy applications onto [Kubernetes](https://kube.academy/courses/building-applications-for-kubernetes), but it's also the highest friction option.

## Production

The recommendations for production align with standard Kubernetes best practices. See the configuration in the following image:

```{eval-rst}
.. image:: ../images/production.png
   :align: center
.. Find the source document here (https://whimsical.com/clusters-P5Y6R23riCuNb6xwXVXN72)
```

### Storage

The choice of storage system remains the same across development and production.

### Code and dependencies

Bake your code, remote, and local dependencies into a published Docker image for all nodes in the cluster. This approach is the most common way to deploy applications onto [Kubernetes](https://kube.academy/courses/building-applications-for-kubernetes). See [Custom Docker Images](serve-custom-docker-images).
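As a minimal, illustrative sketch (the registry, image name, application directory, and requirements file below are hypothetical), you might layer your code and Python dependencies on top of an official Ray base image and push the result to the registry your cluster pulls from:

```sh
# Hypothetical example: bake application code and pip dependencies into a custom Ray image.
cat > Dockerfile <<'EOF'
FROM rayproject/ray:2.46.0
# Copy the application code into the image.
COPY my_app/ /home/ray/my_app/
# Install the application's Python dependencies.
RUN pip install --no-cache-dir -r /home/ray/my_app/requirements.txt
EOF

docker build -t my-registry.example.com/my-ray-app:1.0 .
docker push my-registry.example.com/my-ray-app:1.0

# Reference the pushed image in the head and worker group Pod templates of your KubeRay YAML.
```

Keep the Ray version of the base image consistent with the `rayVersion` field in your KubeRay cluster spec to avoid version mismatches.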
Using cloud storage and the [runtime env](runtime-environments) is a less preferred method as it may not be as reproducible as the container path, but it's still viable. In this case, use the runtime environment option to download zip files containing code and other private modules from cloud storage, in addition to specifying the pip packages needed to run your application. --- (kuberay-tls)= # TLS Authentication Ray can be configured to use TLS on its gRPC channels. This means that connecting to the Ray head will require an appropriate set of credentials and also that data exchanged between various processes (client, head, workers) will be encrypted ([Ray's document](https://docs.ray.io/en/latest/ray-core/configure.html?highlight=tls#tls-authentication)). This document provides detailed instructions for generating a public-private key pair and CA certificate for configuring KubeRay. > Warning: Enabling TLS will cause a performance hit due to the extra overhead of mutual authentication and encryption. Testing has shown that this overhead is large for small workloads and becomes relatively smaller for large workloads. The exact overhead will depend on the nature of your workload. # Prerequisites To fully understand this document, it's highly recommended that you have a solid understanding of the following concepts: * private/public key * CA (certificate authority) * CSR (certificate signing request) * self-signed certificate This [YouTube video](https://youtu.be/T4Df5_cojAs) is a good start. # TL;DR > Please note that this document is designed to support KubeRay version 0.5.0 or later. If you are using an older version of KubeRay, some of the instructions or configurations may not apply or may require additional modifications. > Warning: Please note that the `ray-cluster.tls.yaml` file is intended for demo purposes only. It is crucial that you **do not** store your CA private key in a Kubernetes Secret in your production environment. ```sh # Install KubeRay operator # `ray-cluster.tls.yaml` will cover from Step 1 to Step 3 # Download `ray-cluster.tls.yaml` curl -LO https://raw.githubusercontent.com/ray-project/kuberay/v1.5.1/ray-operator/config/samples/ray-cluster.tls.yaml # Create a RayCluster kubectl apply -f ray-cluster.tls.yaml # Jump to Step 4 "Verify TLS authentication" to verify the connection. ``` `ray-cluster.tls.yaml` will create: * A Kubernetes Secret containing the CA's private key (`ca.key`) and self-signed certificate (`ca.crt`) (**Step 1**) * A Kubernetes ConfigMap containing the scripts `gencert_head.sh` and `gencert_worker.sh`, which allow Ray Pods to generate private keys (`tls.key`) and self-signed certificates (`tls.crt`) (**Step 2**) * A RayCluster with proper TLS environment variables configurations (**Step 3**) The certificate (`tls.crt`) for a Ray Pod is encrypted using the CA's private key (`ca.key`). Additionally, all Ray Pods have the CA's public key included in `ca.crt`, which allows them to decrypt certificates from other Ray Pods. # Step 1: Generate a private key and self-signed certificate for CA In this document, a self-signed certificate is used, but users also have the option to choose a publicly trusted certificate authority (CA) for their TLS authentication. ```sh # Step 1-1: Generate a self-signed certificate and a new private key file for CA. 
openssl req -x509 \ -sha256 -days 3650 \ -nodes \ -newkey rsa:2048 \ -subj "/CN=*.kuberay.com/C=US/L=San Francisco" \ -keyout ca.key -out ca.crt # Step 1-2: Check the CA's public key from the self-signed certificate. openssl x509 -in ca.crt -noout -text # Step 1-3 # Method 1: Use `cat $FILENAME | base64` to encode `ca.key` and `ca.crt`. # Then, paste the encoding strings to the Kubernetes Secret in `ray-cluster.tls.yaml`. # Method 2: Use kubectl to encode the certificate as Kubernetes Secret automatically. # (Note: You should comment out the Kubernetes Secret in `ray-cluster.tls.yaml`.) kubectl create secret generic ca-tls --from-file=ca.key --from-file=ca.crt ``` * `ca.key`: CA's private key * `ca.crt`: CA's self-signed certificate This step is optional because the `ca.key` and `ca.crt` files have already been included in the Kubernetes Secret specified in [ray-cluster.tls.yaml](https://github.com/ray-project/kuberay/blob/v1.5.1/ray-operator/config/samples/ray-cluster.tls.yaml). # Step 2: Create separate private key and self-signed certificate for Ray Pods In [ray-cluster.tls.yaml](https://github.com/ray-project/kuberay/blob/v1.5.1/ray-operator/config/samples/ray-cluster.tls.yaml), each Ray Pod (both head and workers) generates its own private key file (`tls.key`) and self-signed certificate file (`tls.crt`) in its init container. We generate separate files for each Pod because worker Pods do not have deterministic DNS names, and we cannot use the same certificate across different Pods. In the YAML file, you'll find a ConfigMap named `tls` that contains two shell scripts: `gencert_head.sh` and `gencert_worker.sh`. These scripts are used to generate the private key and self-signed certificate files (`tls.key` and `tls.crt`) for the Ray head and worker Pods. An alternative approach for users is to prebake the shell scripts directly into the docker image that's utilized by the init containers, rather than relying on a ConfigMap. Please find below a brief explanation of what happens in each of these scripts: 1. A 2048-bit RSA private key is generated and saved as `/etc/ray/tls/tls.key`. 2. A Certificate Signing Request (CSR) is generated using the private key file (`tls.key`) and the `csr.conf` configuration file. 3. A self-signed certificate (`tls.crt`) is generated using the private key of the Certificate Authority (`ca.key`) and the previously generated CSR. The only difference between `gencert_head.sh` and `gencert_worker.sh` is the `[ alt_names ]` section in `csr.conf` and `cert.conf`. The worker Pods use the fully qualified domain name (FQDN) of the head Kubernetes Service to establish a connection with the head Pod. Therefore, the `[alt_names]` section for the head Pod needs to include the FQDN of the head Kubernetes Service. By the way, the head Pod uses `$POD_IP` to communicate with worker Pods. ```sh # gencert_head.sh [alt_names] DNS.1 = localhost DNS.2 = $FQ_RAY_IP IP.1 = 127.0.0.1 IP.2 = $POD_IP # gencert_worker.sh [alt_names] DNS.1 = localhost IP.1 = 127.0.0.1 IP.2 = $POD_IP ``` In [Kubernetes networking model](https://github.com/kubernetes/design-proposals-archive/blob/main/network/networking.md#pod-to-pod), the IP that a Pod sees itself as is the same IP that others see it as. That's why Ray Pods can self-register for the certificates. # Step 3: Configure environment variables for Ray TLS authentication To enable TLS authentication in your Ray cluster, set the following environment variables: - `RAY_USE_TLS`: Either 1 or 0 to use/not-use TLS. 
If this is set to 1 then all of the environment variables below must be set. Default: 0. - `RAY_TLS_SERVER_CERT`: Location of a certificate file which is presented to other endpoints so as to achieve mutual authentication (i.e. `tls.crt`). - `RAY_TLS_SERVER_KEY`: Location of a private key file which is the cryptographic means to prove to other endpoints that you are the authorized user of a given certificate (i.e. `tls.key`). - `RAY_TLS_CA_CERT`: Location of a CA certificate file which allows TLS to decide whether an endpoint’s certificate has been signed by the correct authority (i.e. `ca.crt`). For more information on how to configure Ray with TLS authentication, please refer to [Ray's document](https://docs.ray.io/en/latest/ray-core/configure.html#tls-authentication). # Step 4: Verify TLS authentication ```sh # Log in to the worker Pod kubectl exec -it ${WORKER_POD} -- bash # Since the head Pod has the certificate of $FQ_RAY_IP, the connection to the worker Pods # will be established successfully, and the exit code of the ray health-check command # should be 0. ray health-check --address $FQ_RAY_IP:6379 echo $? # 0 # Since the head Pod has the certificate of $RAY_IP, the connection will fail and an error # message similar to the following will be displayed: "Peer name raycluster-tls-head-svc is # not in peer certificate". ray health-check --address $RAY_IP:6379 # If you add `DNS.3 = $RAY_IP` to the [alt_names] section in `gencert_head.sh`, # the head Pod will generate the certificate of $RAY_IP. # # For KubeRay versions prior to 0.5.0, this step is necessary because Ray workers in earlier # versions use $RAY_IP to connect with Ray head. ``` --- (kuberay-tpu)= # Use TPUs with KubeRay This document provides tips on TPU usage with KubeRay. TPUs are available on Google Kubernetes Engine (GKE). To use TPUs with Kubernetes, configure both the Kubernetes setup and add additional values to the RayCluster CR configuration. Configure TPUs on GKE by referencing the {ref}`kuberay-gke-tpu-cluster-setup`. ## About TPUs TPUs are custom-designed AI accelerators, which are optimized for training and inference of large AI models. A TPU host is a VM that runs on a physical computer connected to TPU hardware. TPU workloads can run on one or multiple hosts. A TPU Pod slice is a collection of chips all physically colocated and connected by high-speed inter chip interconnects (ICI). Single-host TPU Pod slices contain independent TPU VM hosts and communicate over the Data Center Network (DCN) rather than ICI interconnects. Multi-host TPU Pod slices contain two or more interconnected TPU VM hosts. In GKE, multi-host TPU Pod slices run on their own node pools and GKE scales them atomically by node pools, rather than individual nodes. Ray enables single-host and multi-host TPU Pod slices to be scaled seamlessly to multiple slices, enabling greater parallelism to support larger workloads. ## Quickstart: Serve a Stable Diffusion model on GKE with TPUs After setting up a GKE cluster with TPUs and the Ray TPU initialization webhook, run a workload on Ray with TPUs. {ref}`Serve a Stable Diffusion model on GKE with TPUs ` shows how to serve a model with KubeRay on single-host TPUs. ## Configuring Ray Pods for TPU usage Using any TPU accelerator requires specifying `google.com/tpu` resource `limits` and `requests` in the container fields of your `RayCluster`'s `workerGroupSpecs`. This resource specifies the number of TPU chips for GKE to allocate each Pod. 
KubeRay v1.1.0 adds a `numOfHosts` field to the RayCluster custom resource, specifying the number of TPU hosts to create per worker group replica. For multi-host worker groups, Ray treats replicas as Pod slices rather than individual workers, and creates `numOfHosts` worker nodes per replica. Additionally, GKE uses `gke-tpu` node selectors to schedule TPU Pods on the node matching the desired TPU accelerator and topology. Below is a config snippet for a RayCluster worker group with 2 Ray TPU worker Pods. Ray schedules each worker on its own GKE v4 TPU node belonging to the same TPU Pod slice. ``` groupName: tpu-group replicas: 1 minReplicas: 0 maxReplicas: 1 numOfHosts: 2 ... template: spec: ... containers: - name: ray-worker image: rayproject/ray:2.9.0-py310 ... resources: google.com/tpu: "4" # Required to use TPUs. ... limits: google.com/tpu: "4" # The resources and limits value is expected to be equal. ... nodeSelector: cloud.google.com/gke-tpu-accelerator: tpu-v4-podslice cloud.google.com/gke-tpu-topology: 2x2x2 ... ``` ## TPU workload scheduling After Ray deploys a Ray Pod with TPU resources, the Ray Pod can execute tasks and actors annotated with TPU requests. Ray supports TPUs as a [custom resource](https://docs.ray.io/en/latest/ray-core/scheduling/resources.html#custom-resources). Tasks or actors request TPUs using the decorator `@ray.remote(resources={"TPU": NUM_TPUS})`. ### TPU default labels When running on Google Cloud TPUs with KubeRay, Ray automatically detects and adds the following labels to describe the underlying compute. These are critical for scheduling distributed workloads that must span an entire TPU "slice" (a set of interconnected hosts). * `ray.io/accelerator-type`: The type of TPU accelerator, such as TPU-V6E. * `ray.io/tpu-slice-name`: The name of the TPU Pod or slice. Ray uses this to ensure all workers of a job land on the *same* slice. * `ray.io/tpu-worker-id`: The integer worker ID within the slice. * `ray.io/tpu-topology`: The physical topology of the slice. * `ray.io/tpu-pod-type`: The TPU pod type, which defines the size and TPU generation such as `v4-8` or `v5p-16`. You can use these labels to schedule a `placement_group` that requests an entire TPU slice. For example, to request all TPU devices on a `v6e-16` slice: ```py # Request 4 bundles, one for each TPU VM in the v6e-16 slice. pg = placement_group( [{"TPU": 4}] * 4, strategy="SPREAD", bundle_label_selector=[{ "ray.io/tpu-pod-type": "v6e-16" }] * 4 ) ray.get(pg.ready()) ``` ### TPU scheduling utility library The `ray.util.tpu` package introduces a number of TPU utilities related to scheduling that streamline the process of utilizing TPUs in multi-host and/or multi-slice configurations. These utilities utilize default Ray node labels that are set when running on Google Kubernetes Engine (GKE) with KubeRay. Previously, when simply requesting TPU using something like `resources={"TPU": 4}` over multiple tasks or actors, it was never guaranteed that the Ray nodes scheduled were part of the same slice or even the same TPU generation. To address the latter, Ray introduced the {ref}`label selector API ` and default labels (like `ray.io/accelerator-type`) to describe the underlying compute. Going even further, the new TPU utility library leverages the default node labels and the label selector API to abstract away the complexities of TPU scheduling, particularly for multi-host slices. The core abstraction is the `SlicePlacementGroup`. 
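For contrast, the snippet below sketches the bare `resources={"TPU": 4}` pattern described above: it reserves TPU chips for each task, but on its own gives no guarantee that the tasks land on the same slice or even the same TPU generation. This is a minimal illustrative sketch; the task name and chip count are assumptions, not part of any Ray API.

```python
import ray

ray.init()

# Baseline pattern: each task asks for 4 TPU chips, but nothing ties the
# tasks to the same TPU Pod slice or the same TPU generation.
@ray.remote(resources={"TPU": 4})
def tpu_task(rank: int) -> int:
    # A real workload would run TPU computation here (for example, JAX code).
    return rank

print(ray.get([tpu_task.remote(i) for i in range(2)]))
```

The `SlicePlacementGroup` abstraction described next removes this uncertainty by pinning every bundle to a reserved slice.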
#### `SlicePlacementGroup` The `SlicePlacementGroup` class provides a high-level interface to reserve one or more complete, available TPU slices and create a Ray [Placement Group](https://docs.ray.io/en/latest/ray-core/scheduling/placement-group.html) constrained to those slices. This guarantees that all bundles within the placement group (and thus the tasks/actors scheduled on them) run on workers belonging to the reserved physical TPU slices. **How it works:** 1. **Reservation:** When you create a `SlicePlacementGroup`, it first interacts with the Ray scheduler to find an available TPU slice matching your specified `topology` and `accelerator_version`. It does this by temporarily reserving the "head" TPU node (worker ID 0) of a slice using a small, internal placement group. 2. **Slice Identification:** From the reserved head node, it retrieves the unique slice name (using the `ray.io/tpu-slice-name` default label). This label is set using an environment variable injected by a GKE webhook. The GKE webhook also ensures that KubeRay pods with `numOfHosts > 1` are scheduled with affinity on the same GKE nodepool, which is 1:1 with a TPU multi-host slice. 3. **Main Placement Group Creation:** It then creates the main placement group you requested. This group contains bundles representing each host (VM) in the slice(s). For each slice, it uses `bundle_label_selector` to target the specific `ray.io/tpu-slice-name` identified in the previous step. This ensures all bundles for a given slice land on workers within that exact slice. 4. **Handle:** It returns a `SlicePlacementGroup` handle which exposes the underlying Ray `PlacementGroup` object (`.placement_group`) along with useful properties like the number of workers (`.num_workers`) and chips per host (`.chips_per_host`). **Usage:** You typically create a `SlicePlacementGroup` using the `slice_placement_group` function: ```python import ray from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy from ray.util.tpu import slice_placement_group # Reserve two v6e TPU slices, each with a 4x4 topology (16 chips each). # This topology typically has 4 VM workers, each with 4 chips. slice_handle = slice_placement_group(topology="4x4", accelerator_version="v6e", num_slices=2) slice_pg = slice_handle.placement_group print("Waiting for placement group to be ready...") ray.get(slice_pg.ready(), timeout=600) # Increased timeout for potential scaling print("Placement group ready.") @ray.remote(num_cpus=0, resources={"TPU": 4}) def spmd_task(world_size, rank): pod_name = ray.util.tpu.get_current_pod_name() chips_on_node = ray.util.tpu.get_num_tpu_chips_on_node() print(f"Worker Rank {rank}/{world_size}: Running on slice '{pod_name}' with {chips_on_node} chips.") return rank # Launch one task per VM in the reserved slices. The num_workers field describes the total # number of VMs across all slices in the SlicePlacementGroup. tasks = [ spmd_task.options( scheduling_strategy=PlacementGroupSchedulingStrategy( placement_group=slice_pg, ) ).remote(world_size=slice_handle.num_workers, rank=i) for i in range(slice_handle.num_workers) ] results = ray.get(tasks) print(f"Task results: {results}") ``` #### TPU Pod Information Utilities These functions provide information about the TPU pod that a given worker is a part of. They return None if the worker is not running on a TPU. * `ray.util.tpu.get_current_pod_name() -> Optional[str]` Returns the name of the TPU pod that the worker is a part of. 
* `ray.util.tpu.get_current_pod_worker_count() -> Optional[int]` Counts the number of workers associated with the TPU pod that the worker belongs to.
* `ray.util.tpu.get_num_tpu_chips_on_node() -> int` Returns the total number of TPU chips on the current node. Returns 0 if none are found.

## Multi-Host TPU autoscaling

Multi-host TPU autoscaling is supported in KubeRay versions 1.1.0 or later and Ray versions 2.32.0 or later. Ray multi-host TPU worker groups are worker groups that specify `google.com/tpu` Kubernetes container limits or requests and have `numOfHosts` greater than 1. Ray treats each replica of a multi-host TPU worker group as a TPU Pod slice and scales it atomically. When scaling up, multi-host worker groups create `numOfHosts` Ray workers per replica. Likewise, Ray scales down multi-host worker group replicas by `numOfHosts` workers. When Ray schedules the deletion of a single Ray worker in a multi-host TPU worker group, it terminates the entire replica to which the worker belongs. When scheduling TPU workloads on multi-host worker groups, ensure that Ray tasks or actors run on every TPU VM host in a worker group replica to prevent Ray from scaling down idle TPU workers.

## Further reference and discussion

* See [TPUs in GKE](https://cloud.google.com/kubernetes-engine/docs/how-to/tpus) for more details on using TPUs.
* [TPU availability](https://cloud.google.com/tpu/docs/regions-zones)
* [TPU System Architecture](https://cloud.google.com/tpu/docs/system-architecture-tpu-vm)

---

(kuberay-upgrade-guide)=
# KubeRay upgrade guide

## KubeRay / Ray compatibility

KubeRay CI tests the nightly KubeRay against the three most recent major or minor releases of Ray, as well as against the nightly Ray build. For example, if the latest Ray release is 2.7.0, KubeRay CI tests the nightly KubeRay against Ray 2.7.0, 2.6.0, 2.5.0, and the nightly Ray build.

```{admonition} Don't use Ray versions between 2.11.0 and 2.37.0.
The [commit](https://github.com/ray-project/ray/pull/44658) introduces a bug in Ray 2.11.0. When a Ray job is created, the Ray dashboard agent process on the head node gets stuck, causing the readiness and liveness probes, which send health check requests for the Raylet to the dashboard agent, to fail.
```

* KubeRay v0.6.0: Supports all Ray versions > Ray 2.0.0
* KubeRay v1.0.0: Supports all Ray versions > Ray 2.0.0
* KubeRay v1.1.0: Supports Ray 2.8.0 and later.
* KubeRay v1.2.X: Supports Ray 2.8.0 and later.
* KubeRay v1.3.X: Supports Ray 2.38.0 and later.
* KubeRay v1.4.X: Supports Ray 2.38.0 and later.

The preceding compatibility plan is closely tied to the KubeRay CRD versioning plan.

## CRD versioning

Typically, while new fields are added to the KubeRay CRD in each release, KubeRay doesn't bump the CRD version for every release.

* KubeRay v0.6.0 and older: CRD v1alpha1
* KubeRay v1.0.0: CRD v1alpha1 and v1
* KubeRay v1.1.0 and later: CRD v1

If you want to understand the reasoning behind the CRD versioning plan, see [ray-project/ray#40357](https://github.com/ray-project/ray/pull/40357) for more details.

## Upgrade KubeRay

Upgrading the KubeRay version is the best strategy if you have any issues with KubeRay. Due to the reliability and security implications of webhooks, KubeRay doesn't support a conversion webhook to convert v1alpha1 to v1 APIs.

To upgrade the KubeRay version, follow these steps in order:

1. Upgrade the CRD manifest, which contains new fields added to the v1 CRDs.
2. Upgrade the kuberay-operator image to the new version.
3. Verify the success of the upgrade.
The following is an example of upgrading KubeRay to v1.5.1:

```
# Upgrade the CRD to v1.5.1.
# Note: This example uses kubectl because Helm doesn't support lifecycle management of CRDs.
# See the Helm documentation for more details: https://helm.sh/docs/chart_best_practices/custom_resource_definitions/#some-caveats-and-explanations
$ kubectl replace -k "github.com/ray-project/kuberay/ray-operator/config/crd?ref=v1.5.1"

# Upgrade kuberay-operator to v1.5.1. This step doesn't upgrade the CRDs.
$ helm upgrade kuberay-operator kuberay/kuberay-operator --version v1.5.1

# Install a RayCluster using the v1.5.1 helm chart to verify the success of the upgrade.
$ helm install raycluster kuberay/ray-cluster --version 1.5.1
```

---

(kuberay-uv)=
# Using `uv` for Python package management in KubeRay

[uv](https://github.com/astral-sh/uv) is a modern Python package manager written in Rust. Starting with Ray 2.45, the `rayproject/ray:2.45.0` image includes `uv` as one of its dependencies. This guide provides a simple example of using `uv` to manage Python dependencies on KubeRay.

To learn more about the `uv` integration in Ray, refer to:

* [Environment Dependencies](https://docs.ray.io/en/latest/ray-core/handling-dependencies.html#using-uv-for-package-management)
* [uv + Ray: Pain-Free Python Dependencies in Clusters](https://www.anyscale.com/blog/uv-ray-pain-free-python-dependencies-in-clusters)

# Example

## Step 1: Create a Kind cluster

```sh
kind create cluster
```

## Step 2: Install KubeRay operator

Follow the [KubeRay Operator Installation](kuberay-operator-deploy) guide to install the latest stable KubeRay operator via the Helm repository.

## Step 3: Create a RayCluster with `uv` enabled

The `ray-cluster.uv.yaml` YAML file contains a RayCluster custom resource and a ConfigMap that includes a sample Ray Python script.

* The `RAY_RUNTIME_ENV_HOOK` feature flag enables the `uv` integration in Ray. Future versions may enable this by default.

```yaml
env:
- name: RAY_RUNTIME_ENV_HOOK
  value: ray._private.runtime_env.uv_runtime_env_hook.hook
```

* `sample_code.py` is a simple Ray Python script that uses the `emoji` package.

```python
import emoji
import ray

@ray.remote
def f():
    return emoji.emojize('Python is :thumbs_up:')

# Execute 10 copies of f across a cluster.
print(ray.get([f.remote() for _ in range(10)]))
```

```sh
kubectl apply -f https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/ray-cluster.uv.yaml
```

## Step 4: Execute a Ray Python script with `uv`

```sh
export HEAD_POD=$(kubectl get pods --selector=ray.io/node-type=head -o custom-columns=POD:metadata.name --no-headers)

kubectl exec -it $HEAD_POD -- /bin/bash -c "cd samples && uv run --with emoji /home/ray/samples/sample_code.py"

# [Example output]:
#
# Installed 1 package in 1ms
# 2025-06-01 14:49:15,021 INFO worker.py:1554 -- Using address 127.0.0.1:6379 set in the environment variable RAY_ADDRESS
# 2025-06-01 14:49:15,024 INFO worker.py:1694 -- Connecting to existing Ray cluster at address: 10.244.0.6:6379...
# 2025-06-01 14:49:15,035 INFO worker.py:1879 -- Connected to Ray cluster. View the dashboard at 10.244.0.6:8265
# 2025-06-01 14:49:15,040 INFO packaging.py:576 -- Creating a file package for local module '/home/ray/samples'.
# 2025-06-01 14:49:15,041 INFO packaging.py:368 -- Pushing file package 'gcs://_ray_pkg_d4da2ce33cf6d176.zip' (0.00MiB) to Ray cluster...
# 2025-06-01 14:49:15,042 INFO packaging.py:381 -- Successfully pushed file package 'gcs://_ray_pkg_d4da2ce33cf6d176.zip'. # ['Python is 👍', 'Python is 👍', 'Python is 👍', 'Python is 👍', 'Python is 👍', 'Python is 👍', 'Python is 👍', 'Python is 👍', 'Python is 👍', 'Python is 👍'] ``` > NOTE: Use `/bin/bash -c` to execute the command while changing the current directory to `/home/ray/samples`. By default, `working_dir` is set to the current directory. This prevents uploading all files under `/home/ray`, which can take a long time when executing `uv run`. Alternatively, you can use `ray job submit --runtime-env-json ...` to specify the `working_dir` manually. --- (kuberay-guides)= # User Guides ```{toctree} :hidden: Deploy Ray Serve Apps user-guides/rayservice-no-ray-serve-replica user-guides/rayservice-high-availability user-guides/rayservice-incremental-upgrade user-guides/observability user-guides/upgrade-guide user-guides/k8s-cluster-setup user-guides/storage user-guides/config user-guides/configuring-autoscaling user-guides/label-based-scheduling user-guides/kuberay-gcs-ft user-guides/kuberay-gcs-persistent-ft user-guides/gke-gcs-bucket user-guides/persist-kuberay-custom-resource-logs user-guides/persist-kuberay-operator-logs user-guides/gpu user-guides/tpu user-guides/pod-command user-guides/helm-chart-rbac user-guides/tls user-guides/k8s-autoscaler user-guides/kubectl-plugin user-guides/kuberay-auth user-guides/reduce-image-pull-latency user-guides/uv user-guides/kuberay-dashboard ``` :::{note} To learn the basics of Ray on Kubernetes, we recommend taking a look at the {ref}`introductory guide ` first. ::: * {ref}`kuberay-rayservice` * {ref}`kuberay-rayservice-no-ray-serve-replica` * {ref}`kuberay-rayservice-ha` * {ref}`kuberay-rayservice-incremental-upgrade` * {ref}`kuberay-observability` * {ref}`kuberay-upgrade-guide` * {ref}`kuberay-k8s-setup` * {ref}`kuberay-storage` * {ref}`kuberay-config` * {ref}`kuberay-autoscaling` * {ref}`kuberay-gpu` * {ref}`kuberay-tpu` * {ref}`kuberay-gcs-ft` * {ref}`kuberay-gcs-persistent-ft` * {ref}`persist-kuberay-custom-resource-logs` * {ref}`persist-kuberay-operator-logs` * {ref}`kuberay-pod-command` * {ref}`kuberay-helm-chart-rbac` * {ref}`kuberay-tls` * {ref}`kuberay-gke-bucket` * {ref}`ray-k8s-autoscaler-comparison` * {ref}`kubectl-plugin` * {ref}`kuberay-auth` * {ref}`reduce-image-pull-latency` * {ref}`kuberay-uv` * {ref}`kuberay-dashboard` --- (collect-metrics)= # Collecting and monitoring metrics Metrics are useful for monitoring and troubleshooting Ray applications and Clusters. For example, you may want to access a node's metrics if it terminates unexpectedly. Ray records and emits time-series metrics using the [Prometheus format](https://prometheus.io/docs/instrumenting/exposition_formats/). Ray doesn't provide a native storage solution for metrics. Users need to manage the lifecycle of the metrics by themselves. This page provides instructions on how to collect and monitor metrics from Ray Clusters. For Kubernetes users, see {ref}`Using Prometheus and Grafana ` with KubeRay. ## System and application metrics Ray exports metrics if you use `ray[default]` or {ref}`other installation commands ` that include Dashboard component. Dashboard agent process is responsible for aggregating and reporting metrics to the endpoints for Prometheus to scrape. **System metrics**: Ray exports a number of system metrics. View {ref}`system metrics ` for more details about the emitted metrics. 
**Application metrics**: Application-specific metrics are useful for monitoring your application states. View {ref}`adding application metrics ` for how to record metrics. (prometheus-setup)= ## Setting up Prometheus You can use Prometheus to scrape metrics from Ray Clusters. Ray doesn't start Prometheus servers for you. You need to decide where to host and configure it to scrape the metrics from Clusters. For a quick demo, you can run Prometheus locally on your machine. Follow the quickstart instructions below to set up Prometheus and scrape metrics from a local single-node Ray Cluster. ### Quickstart: Running Prometheus locally ```{admonition} Note :class: note If you need to change the root temporary directory by using "--temp-dir" in your Ray cluster setup, follow these [manual steps](#optional-manual-running-prometheus-locally) to set up Prometheus locally. ``` Run the following command to download and start Prometheus locally with a configuration that scrapes metrics from a local Ray Cluster. ```bash ray metrics launch-prometheus ``` You should see the following output: ```text 2024-01-11 16:08:45,805 - INFO - Prometheus installed successfully. 2024-01-11 16:08:45,810 - INFO - Prometheus has started. Prometheus is running with PID 1234. To stop Prometheus, use the command: 'kill 1234', or if you need to force stop, use 'kill -9 1234'. ``` You should also see some logs from Prometheus: ```shell [...] ts=2024-01-12T00:47:29.761Z caller=main.go:1009 level=info msg="Server is ready to receive web requests." ts=2024-01-12T00:47:29.761Z caller=manager.go:1012 level=info component="rule manager" msg="Starting rule manager..." ``` Now you can access Ray metrics from the default Prometheus URL, http://localhost:9090. To demonstrate that Prometheus is scraping metrics from Ray, run the following command: ```shell ray start --head --metrics-export-port=8080 ``` Then go to the Prometheus UI and run the following query: ```shell ray_dashboard_api_requests_count_requests_total ``` You can then see the number of requests to the Ray Dashboard API over time. To stop Prometheus, run the following commands: ```sh # case 1: Ray > 2.40 ray metrics shutdown-prometheus # case 2: Otherwise # Run `ps aux | grep prometheus` to find the PID of the Prometheus process. Then, kill the process. kill ``` ### [Optional] Manual: Running Prometheus locally If the preceding automatic script doesn't work or you would prefer to install and start Prometheus manually, follow these instructions. First, [download Prometheus](https://prometheus.io/download/). Make sure to download the correct binary for your operating system. For example, Darwin for macOS X. Then, unzip the archive into a local directory using the following command: ```bash tar xvfz prometheus-*.tar.gz cd prometheus-* ``` Ray provides a Prometheus config that works out of the box. After running Ray, you can find the config at `/tmp/ray/session_latest/metrics/prometheus/prometheus.yml`. If you specify the `--temp-dir={your_temp_path}` when starting the Ray cluster, the config file is at `{your_temp_path}/session_latest/metrics/prometheus/prometheus.yml` ```yaml global: scrape_interval: 15s evaluation_interval: 15s scrape_configs: # Scrape from each Ray node as defined in the service_discovery.json provided by Ray. 
- job_name: 'ray' file_sd_configs: - files: - '/tmp/ray/prom_metrics_service_discovery.json' # or '${your_temp_path}/prom_metrics_service_discovery.json' if --temp-dir is specified ``` Next, start Prometheus: ```shell # With default settings ./prometheus --config.file=/tmp/ray/session_latest/metrics/prometheus/prometheus.yml # With specified --temp-dir ./prometheus --config.file={your_temp_path}/session_latest/metrics/prometheus/prometheus.yml ``` ```{admonition} Note :class: note If you are using macOS, you may receive an error at this point about trying to launch an application where the developer has not been verified. See the "Troubleshooting" guide below to fix the issue. ``` Now, you can access Ray metrics from the default Prometheus URL, `http://localhost:9090`. ### Running Prometheus in production For a production environment, view [Prometheus documentation](https://prometheus.io/docs/introduction/overview/) for the best strategy to set up your Prometheus server. The Prometheus server should live outside of the Ray Cluster, so that metrics are still accessible if the Cluster is down. For KubeRay users, follow [these instructions](kuberay-prometheus-grafana) to set up Prometheus. ### Troubleshooting #### Using Ray configurations in Prometheus with Homebrew on macOS X Homebrew installs Prometheus as a service that is automatically launched for you. To configure these services, you cannot simply pass in the config files as command line arguments. Instead, change the --config-file line in `/usr/local/etc/prometheus.args` to read `--config.file /tmp/ray/session_latest/metrics/prometheus/prometheus.yml`. You can then start or restart the services with `brew services start prometheus`. #### macOS does not trust the developer to install Prometheus You may receive the following error: ![trust error](https://raw.githubusercontent.com/ray-project/Images/master/docs/troubleshooting/prometheus-trusted-developer.png) When downloading binaries from the internet, macOS requires that the binary be signed by a trusted developer ID. Many developers are not on macOS's trusted list. Users can manually override this requirement. See [these instructions](https://support.apple.com/guide/mac-help/open-a-mac-app-from-an-unidentified-developer-mh40616/mac) for how to override the restriction and install or run the application. #### Loading Ray Prometheus configurations with Docker Compose In the Ray container, the symbolic link "/tmp/ray/session_latest/metrics" points to the latest active Ray session. However, Docker does not support the mounting of symbolic links on shared volumes and you may fail to load the Prometheus configuration files. To fix this issue, employ an automated shell script for seamlessly transferring the Prometheus configurations from the Ray container to a shared volume. To ensure a proper setup, mount the shared volume on the respective path for the container, which contains the recommended configurations to initiate the Prometheus servers. (scrape-metrics)= ## Scraping metrics Ray runs a metrics agent per node to export system and application metrics. Each metrics agent collects metrics from the local node and exposes them in a Prometheus format. You can then scrape each endpoint to access the metrics. To scrape the endpoints, we need to ensure service discovery, which allows Prometheus to find the metrics agents' endpoints on each node. 
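Before configuring discovery, it can help to confirm that a node's metrics agent is actually serving Prometheus text. The sketch below is illustrative only: it assumes you started Ray with `--metrics-export-port=8080` as in the quickstart above and that the agent serves the standard Prometheus `/metrics` path.

```python
import urllib.request

# Address of one Ray node's metrics agent: <NodeManagerAddress>:<MetricsExportPort>.
# 8080 matches the `ray start --head --metrics-export-port=8080` example above.
url = "http://127.0.0.1:8080/metrics"

with urllib.request.urlopen(url, timeout=5) as response:
    body = response.read().decode("utf-8")

# Print a few Ray metric lines to confirm the endpoint is live.
ray_lines = [line for line in body.splitlines() if line.startswith("ray_")]
print("\n".join(ray_lines[:10]))
```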
### Auto-discovering metrics endpoints

You can allow Prometheus to dynamically find the endpoints to scrape by using Prometheus' [file based service discovery](https://prometheus.io/docs/guides/file-sd/#installing-configuring-and-running-prometheus).
Use auto-discovery to export Prometheus metrics when using the Ray {ref}`cluster launcher `, as node IP addresses can often change as the cluster scales up and down.

Ray auto-generates a Prometheus [service discovery file](https://prometheus.io/docs/guides/file-sd/#installing-configuring-and-running-prometheus) on the head node to facilitate metrics agents' service discovery. This allows you to scrape all metrics in the cluster without knowing their IPs. The following information guides you on the setup.

The service discovery file is generated on the {ref}`head node `. On this node, look for ``/tmp/ray/prom_metrics_service_discovery.json`` (or the equivalent file if using a custom Ray ``temp_dir``). Ray periodically updates this file with the addresses of all metrics agents in the cluster.

Ray automatically produces a Prometheus config that uses this file for service discovery, located at `/tmp/ray/session_latest/metrics/prometheus/prometheus.yml`. You can use this config or modify your own config to enable this behavior. See the details of the config below. Find the full documentation [here](https://prometheus.io/docs/prometheus/latest/configuration/configuration/). With this config, Prometheus automatically updates the addresses that it scrapes based on the contents of Ray's service discovery file.

```yaml
# Prometheus config file

# my global config
global:
  scrape_interval: 2s
  evaluation_interval: 2s

# Scrape from Ray.
scrape_configs:
- job_name: 'ray'
  file_sd_configs:
  - files:
    - '/tmp/ray/prom_metrics_service_discovery.json'
```

#### HTTP service discovery

Ray also exposes the same list of addresses to scrape over an HTTP endpoint, compatible with [Prometheus HTTP Service Discovery](https://prometheus.io/docs/prometheus/latest/http_sd/). Use the following in your Prometheus config to use the HTTP endpoint for service discovery ([HTTP SD docs](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#http_sd_config)):

```yaml
scrape_configs:
- job_name: 'ray'
  http_sd_configs:
  - url: 'http://<head-node-ip>:<dashboard-port>/api/prometheus/sd'
    refresh_interval: 60s
```

- `<dashboard-port>` is `8265` by default. See [Configuring and Managing Ray Dashboard](https://docs.ray.io/en/latest/cluster/configure-manage-dashboard.html) for more details.
- The endpoint returns a JSON list of targets for Prometheus metrics. When no targets are available, it returns `[]`.

### Manually discovering metrics endpoints

If you know the IP addresses of the nodes in your Ray Cluster, you can configure Prometheus to read metrics from a static list of endpoints.

Set a fixed port that Ray should use to export metrics. If you're using the VM Cluster Launcher, pass ``--metrics-export-port=<port>`` to ``ray start``. If you're using KubeRay, specify ``rayStartParams.metrics-export-port`` in the RayCluster configuration file. You must specify the port on all nodes in the cluster.

If you do not know the IP addresses of the nodes in your Ray Cluster, you can also programmatically discover the endpoints by reading the Ray Cluster information. The following example uses a Python script and the {py:obj}`ray.nodes` API to find the metrics agents' URLs, by combining the ``NodeManagerAddress`` with the ``MetricsExportPort``.
```python
# On a cluster node:
import ray
ray.init()
from pprint import pprint
pprint(ray.nodes())

"""
Pass the <NodeManagerAddress>:<MetricsExportPort> from each of these entries
to Prometheus.

[{'Alive': True,
  'MetricsExportPort': 8080,
  'NodeID': '2f480984702a22556b90566bdac818a4a771e69a',
  'NodeManagerAddress': '192.168.1.82',
  'NodeManagerHostname': 'host2.attlocal.net',
  'NodeManagerPort': 61760,
  'ObjectManagerPort': 61454,
  'ObjectStoreSocketName': '/tmp/ray/session_2020-08-04_18-18-16_481195_34255/sockets/plasma_store',
  'RayletSocketName': '/tmp/ray/session_2020-08-04_18-18-16_481195_34255/sockets/raylet',
  'Resources': {'CPU': 1.0,
                'memory': 123.0,
                'node:192.168.1.82': 1.0,
                'object_store_memory': 2.0},
  'alive': True},
 {'Alive': True,
  'MetricsExportPort': 8080,
  'NodeID': 'ce6f30a7e2ef58c8a6893b3df171bcd464b33c77',
  'NodeManagerAddress': '192.168.1.82',
  'NodeManagerHostname': 'host1.attlocal.net',
  'NodeManagerPort': 62052,
  'ObjectManagerPort': 61468,
  'ObjectStoreSocketName': '/tmp/ray/session_2020-08-04_18-18-16_481195_34255/sockets/plasma_store.1',
  'RayletSocketName': '/tmp/ray/session_2020-08-04_18-18-16_481195_34255/sockets/raylet.1',
  'Resources': {'CPU': 1.0,
                'memory': 134.0,
                'node:192.168.1.82': 1.0,
                'object_store_memory': 2.0},
  'alive': True}]
"""
```

## Processing and exporting metrics

If you need to process and export metrics into other storage or management systems, check out open source metric processing tools like [Vector][Vector].

[Vector]: https://vector.dev/

## Monitoring metrics

To visualize and monitor collected metrics, there are three common paths:

1. **Simplest**: Use Grafana with Ray-provided configurations, which include default Grafana dashboards showing some of the most valuable metrics for debugging Ray applications.
2. **Recommended**: Use the Ray Dashboard, which embeds Grafana visualizations, and look at metrics together with logs, Job info, and so on in a single pane of glass.
3. **Manual**: Set up Grafana or other tools like CloudWatch, Cloud Monitoring, and Datadog from scratch.

Here are some instructions for each of the paths:

(grafana)=
### Simplest: Setting up Grafana with Ray-provided configurations

Grafana is a tool that supports advanced visualizations of Prometheus metrics and allows you to create custom dashboards with your favorite metrics.

::::{tab-set}

:::{tab-item} Creating a new Grafana server

```{admonition} Note
:class: note
The instructions below describe one way of starting a Grafana server on a macOS machine. Refer to the [Grafana documentation](https://grafana.com/docs/grafana/latest/setup-grafana/start-restart-grafana/#start-the-grafana-server) for how to start Grafana servers in different systems.

For KubeRay users, follow [these instructions](kuberay-prometheus-grafana) to set up Grafana.
```

First, [download Grafana](https://grafana.com/grafana/download). Follow the instructions on the download page to download the right binary for your operating system.

Go to the location of the binary and run Grafana using the built-in configuration found in the `/tmp/ray/session_latest/metrics/grafana` folder.

```shell
./bin/grafana-server --config /tmp/ray/session_latest/metrics/grafana/grafana.ini web
```

Access Grafana using the default Grafana URL, `http://localhost:3000`.
See the default dashboard by going to dashboards -> manage -> Ray -> Default Dashboard. The same {ref}`metric graphs ` are accessible in {ref}`Ray Dashboard ` after you integrate Grafana with Ray Dashboard.
```{admonition} Note
:class: note
If this is your first time using Grafana, log in with the username `admin` and password `admin`.
```

![grafana login](images/graphs.png)

**Troubleshooting**

***Using Ray configurations in Grafana with Homebrew on macOS X***

Homebrew installs Grafana as a service that is automatically launched for you. Therefore, to configure these services, you cannot simply pass in the config files as command line arguments.

Instead, update the `/usr/local/etc/grafana/grafana.ini` file so that it matches the contents of `/tmp/ray/session_latest/metrics/grafana/grafana.ini`. You can then start or restart the services with `brew services start grafana` and `brew services start prometheus`.

***Loading Ray Grafana configurations with Docker Compose***

In the Ray container, the symbolic link "/tmp/ray/session_latest/metrics" points to the latest active Ray session. However, Docker does not support the mounting of symbolic links on shared volumes, and you may fail to load the Grafana configuration files and default dashboards.

To fix this issue, employ an automated shell script for seamlessly transferring the necessary Grafana configurations and dashboards from the Ray container to a shared volume. To ensure a proper setup, mount the shared volume on the respective path for the container, which contains the recommended configurations and default dashboards to initiate Grafana servers.

:::

:::{tab-item} Using an existing Grafana server

After your Grafana server is running, start a Ray Cluster and find the Ray-provided default Grafana dashboard JSONs at `/tmp/ray/session_latest/metrics/grafana/dashboards`. [Copy the JSONs over and import the Grafana dashboards](https://grafana.com/docs/grafana/latest/dashboards/manage-dashboards/#import-a-dashboard) to your Grafana.

If Grafana reports that the datasource is not found, [add a datasource variable](https://grafana.com/docs/grafana/latest/dashboards/variables/add-template-variables/?pg=graf&plcmt=data-sources-prometheus-btn-1#add-a-data-source-variable). The datasource's name must be the same as the value of the `RAY_PROMETHEUS_NAME` environment variable. By default, `RAY_PROMETHEUS_NAME` equals `Prometheus`.

:::

::::

### Recommended: Use Ray Dashboard with embedded Grafana visualizations

1. Follow the instructions above to set up Grafana with Ray-provided visualizations.
2. View {ref}`configuring and managing Ray Dashboard ` for how to embed Grafana visualizations into the Dashboard.
3. View {ref}`Dashboard's metrics view` for how to inspect the metrics in Ray Dashboard.

### Manual: Set up Grafana, or other tools like CloudWatch, Cloud Monitoring and Datadog from scratch

Refer to the documentation of these tools for how to query and visualize the metrics.

```{admonition} Tip
:class: tip
If you need to write Prometheus queries manually, check out the Prometheus queries in the Ray-provided Grafana dashboard JSON at `/tmp/ray/session_latest/metrics/grafana/dashboards/default_grafana_dashboard.json` for inspiration.
```

---

# Application guide

This section introduces the main differences in running a Ray application on your laptop vs. on a Ray Cluster. To get started, check out the [job submissions](jobs-quickstart) page.

```{toctree}
:maxdepth: '2'

job-submission/index
autoscaling/reference
```

---

(jobs-overview)=
# Ray Jobs Overview

Once you have deployed a Ray cluster (on [VMs](vm-cluster-quick-start) or [Kubernetes](kuberay-quickstart)), you are ready to run a Ray application!
![A diagram that shows three ways of running a job on a Ray cluster.](../../images/ray-job-diagram.svg "Three ways of running a job on a Ray cluster.") ## Ray Jobs API The recommended way to run a job on a Ray cluster is to use the *Ray Jobs API*, which consists of a CLI tool, Python SDK, and a REST API. The Ray Jobs API allows you to submit locally developed applications to a remote Ray Cluster for execution. It simplifies the experience of packaging, deploying, and managing a Ray application. A submission to the Ray Jobs API consists of: 1. An entrypoint command, like `python my_script.py`, and 2. A [runtime environment](runtime-environments), which specifies the application's file and package dependencies. A job can be submitted by a remote client that lives outside of the Ray Cluster. We will show this workflow in the following user guides. After a job is submitted, it runs once to completion or failure, regardless of the original submitter's connectivity. Retries or different runs with different parameters should be handled by the submitter. Jobs are bound to the lifetime of a Ray cluster, so if the cluster goes down, all running jobs on that cluster will be terminated. To get started with the Ray Jobs API, check out the [quickstart](jobs-quickstart) guide, which walks you through the CLI tools for submitting and interacting with a Ray Job. This is suitable for any client that can communicate over HTTP to the Ray Cluster. If needed, the Ray Jobs API also provides APIs for [programmatic job submission](ray-job-sdk) and [job submission using REST](ray-job-rest-api). ## Running Jobs Interactively If you would like to run an application *interactively* and see the output in real time (for example, during development or debugging), you can: - (Recommended) Run your script directly on a cluster node (e.g. after SSHing into the node using [`ray attach`](ray-attach-doc)), or - (For Experts only) Use [Ray Client](ray-client-ref) to run a script from your local machine while maintaining a connection to the cluster. Note that jobs started in these ways are not managed by the Ray Jobs API, so the Ray Jobs API will not be able to see them or interact with them (with the exception of `ray job list` and `JobSubmissionClient.list_jobs()`). ## Contents ```{toctree} :maxdepth: '1' quickstart sdk jobs-package-ref cli rest ray-client ``` --- (vm-cluster-examples)= # Examples ```{toctree} :hidden: ml-example ``` :::{note} To learn the basics of Ray on Cloud VMs, we recommend taking a look at the {ref}`introductory guide ` first. ::: This section presents example Ray workloads to try out on your cloud cluster. More examples will be added in the future. Running the distributed XGBoost example below is a great way to start experimenting with production Ray workloads in the cloud. - {ref}`clusters-vm-ml-example` --- (clusters-vm-ml-example)= # Ray Train XGBoostTrainer on VMs :::{note} To learn the basics of Ray on VMs, we recommend taking a look at the {ref}`introductory guide ` first. ::: In this guide, we show you how to run a sample Ray machine learning workload on AWS. The similar steps can be used to deploy on GCP or Azure as well. We will run Ray's {ref}`XGBoost training benchmark ` with a 100 gigabyte training set. To learn more about using Ray's XGBoostTrainer, check out {ref}`the XGBoostTrainer documentation `. ## VM cluster setup For the workload in this guide, it is recommended to use the following setup: - 10 nodes total - A capacity of 16 CPU and 64 Gi memory per node. 
  For the major cloud providers, suitable instance types include:

  * m5.4xlarge (Amazon Web Services)
  * Standard_D5_v2 (Azure)
  * e2-standard-16 (Google Cloud)

- Each node should be configured with 1000 gigabytes of disk space (to store the training set).

The corresponding cluster configuration file is as follows:

```{literalinclude} ../configs/xgboost-benchmark.yaml
:language: yaml
```

```{admonition} Optional: Set up an autoscaling cluster
**If you would like to try running the workload with autoscaling enabled**, change ``min_workers`` of worker nodes to 0.
After the workload is submitted, 9 worker nodes will scale up to accommodate the workload. These nodes will scale back down after the workload is complete.
```

## Deploy a Ray cluster

Now we're ready to deploy the Ray cluster with the configuration that's defined above.
Before running the command, make sure your AWS credentials are configured correctly.

```shell
ray up -y cluster.yaml
```

A Ray head node and 9 Ray worker nodes will be created.

## Run the workload

We will use {ref}`Ray Job Submission ` to kick off the workload.

### Connect to the cluster

First, we connect to the Job server. Run the following blocking command in a separate shell.

```shell
ray dashboard cluster.yaml
```

This will forward remote port 8265 to port 8265 on localhost.

### Submit the workload

We'll use the {ref}`Ray Job Python SDK ` to submit the XGBoost workload.

```{literalinclude} /cluster/doc_code/xgboost_submit.py
:language: python
```

To submit the workload, run the above Python script. The script is available [in the Ray repository][XGBSubmit].

```shell
# Download the above script.
curl https://raw.githubusercontent.com/ray-project/ray/releases/2.0.0/doc/source/cluster/doc_code/xgboost_submit.py -o xgboost_submit.py

# Run the script.
python xgboost_submit.py
```

### Observe progress

The benchmark may take up to 30 minutes to run. Use the following tools to observe its progress.

#### Job logs

To follow the job's logs, use the command printed by the above submission script.

```shell
# Substitute the Ray Job's submission id.
ray job logs 'raysubmit_xxxxxxxxxxxxxxxx' --address="http://localhost:8265" --follow
```

#### Ray Dashboard

View `localhost:8265` in your browser to access the Ray Dashboard.

#### Ray Status

Observe autoscaling status and Ray resource usage with

```shell
ray exec cluster.yaml 'ray status'
```

### Job completion

#### Benchmark results

Once the benchmark is complete, the job log will display the results:

```
Results: {'training_time': 1338.488839321999, 'prediction_time': 403.36653568099973}
```

The performance of the benchmark is sensitive to the underlying cloud infrastructure -- you might not match {ref}`the numbers quoted in the benchmark docs `.

#### Model parameters

The file `model.json` in the Ray head node contains the parameters for the trained model. Other result data will be available in the directory `ray_results` in the head node. Refer to the {ref}`XGBoostTrainer documentation ` for details.

```{admonition} Scale-down
If autoscaling is enabled, Ray worker nodes will scale down after the specified idle timeout.
```

#### Clean-up

Delete your Ray cluster with the following command:

```shell
ray down -y cluster.yaml
```

[XGBSubmit]: https://github.com/ray-project/ray/blob/releases/2.0.0/doc/source/cluster/doc_code/xgboost_submit.py

---

# Ray on Cloud VMs

(cloud-vm-index)=

```{toctree}
:hidden:

getting-started
User Guides
Examples
references/index
```

## Overview

In this section we cover how to launch Ray clusters on Cloud VMs.
Ray ships with built-in support for launching AWS, GCP, and Azure clusters, and also has community-maintained integrations for Aliyun and vSphere. Each Ray cluster consists of a head node and a collection of worker nodes. Optional [autoscaling](vms-autoscaling) support allows the Ray cluster to be sized according to the requirements of your Ray workload, adding and removing worker nodes as needed. Ray supports clusters composed of multiple heterogeneous compute nodes (including GPU nodes). Concretely, you will learn how to: - Set up and configure Ray in public clouds - Deploy applications and monitor your cluster ## Learn More The Ray docs present all the information you need to start running Ray workloads on VMs. ```{eval-rst} .. grid:: 1 2 2 2 :gutter: 1 :class-container: container pb-3 .. grid-item-card:: **Getting Started** ^^^ Learn how to start a Ray cluster and deploy Ray applications in the cloud. +++ .. button-ref:: vm-cluster-quick-start :color: primary :outline: :expand: Get Started with Ray on Cloud VMs .. grid-item-card:: **Examples** ^^^ Try example Ray workloads in the Cloud +++ .. button-ref:: vm-cluster-examples :color: primary :outline: :expand: Try example workloads .. grid-item-card:: **User Guides** ^^^ Learn best practices for configuring cloud clusters +++ .. button-ref:: vm-cluster-guides :color: primary :outline: :expand: Read the User Guides .. grid-item-card:: **API Reference** ^^^ Find API references for cloud clusters +++ .. button-ref:: vm-cluster-api-references :color: primary :outline: :expand: Check API references ``` --- (vm-cluster-api-references)= # API References The following pages provide reference documentation for using Ray Clusters on virtual machines. ```{toctree} :caption: "Reference documentation for Ray Clusters on VMs:" :maxdepth: '2' :name: ray-clusters-vms-reference ray-cluster-cli ray-cluster-configuration ``` --- (vm-cluster-guides)= # User Guides ```{toctree} :hidden: launching-clusters/index large-cluster-best-practices configuring-autoscaling logging Community-supported Cluster Managers ``` :::{note} To learn the basics of Ray on Cloud VMs, we recommend taking a look at the {ref}`introductory guide ` first. ::: In these guides, we go into further depth on several topics related to deployments of Ray on cloud VMs or on-premises. * {ref}`launching-vm-clusters` * {ref}`vms-large-cluster` * {ref}`vms-autoscaling` * {ref}`ref-cluster-setup` * {ref}`cluster-FAQ` --- # Launching Ray Clusters on AWS This guide details the steps needed to start a Ray cluster on AWS. To start an AWS Ray cluster, you should use the Ray cluster launcher with the AWS Python SDK. ## Install Ray cluster launcher The Ray cluster launcher is part of the `ray` CLI. Use the CLI to start, stop and attach to a running ray cluster using commands such as `ray up`, `ray down` and `ray attach`. You can use pip to install the ray CLI with cluster launcher support. Follow [the Ray installation documentation](installation) for more detailed instructions. ```bash # install ray pip install -U ray[default] ``` ## Install and Configure AWS Python SDK (Boto3) Next, install AWS SDK using `pip install -U boto3` and configure your AWS credentials following [the AWS guide](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html). 
```bash # install AWS Python SDK (boto3) pip install -U boto3 # setup AWS credentials using environment variables export AWS_ACCESS_KEY_ID=foo export AWS_SECRET_ACCESS_KEY=bar export AWS_SESSION_TOKEN=baz # alternatively, you can setup AWS credentials using ~/.aws/credentials file echo "[default] aws_access_key_id=foo aws_secret_access_key=bar aws_session_token=baz" >> ~/.aws/credentials ``` ## Start Ray with the Ray cluster launcher Once Boto3 is configured to manage resources in your AWS account, you should be ready to launch your cluster using the cluster launcher. The provided [cluster config file](https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/aws/example-full.yaml) will create a small cluster with an m5.large head node (on-demand) configured to autoscale to up to two m5.large [spot-instance](https://aws.amazon.com/ec2/spot/) workers. Test that it works by running the following commands from your local machine: ```bash # Download the example-full.yaml wget https://raw.githubusercontent.com/ray-project/ray/master/python/ray/autoscaler/aws/example-full.yaml # Create or update the cluster. When the command finishes, it will print # out the command that can be used to SSH into the cluster head node. ray up example-full.yaml # Get a remote shell on the head node. ray attach example-full.yaml # Try running a Ray program. python -c 'import ray; ray.init()' exit # Tear down the cluster. ray down example-full.yaml ``` Congrats, you have started a Ray cluster on AWS! If you want to learn more about the Ray cluster launcher, see this blog post for a [step by step guide](https://medium.com/distributed-computing-with-ray/a-step-by-step-guide-to-scaling-your-first-python-application-in-the-cloud-8761fe331ef1). ## AWS Configurations (aws-cluster-efs)= ### Using Amazon EFS To utilize Amazon EFS in the Ray cluster, you will need to install some additional utilities and mount the EFS in `setup_commands`. Note that these instructions only work if you are using the Ray cluster launcher on AWS. ```yaml # Note You need to replace the {{FileSystemId}} with your own EFS ID before using the config. # You may also need to modify the SecurityGroupIds for the head and worker nodes in the config file. setup_commands: - sudo kill -9 `sudo lsof /var/lib/dpkg/lock-frontend | awk '{print $2}' | tail -n 1`; sudo pkill -9 apt-get; sudo pkill -9 dpkg; sudo dpkg --configure -a; sudo apt-get -y install binutils; cd $HOME; git clone https://github.com/aws/efs-utils; cd $HOME/efs-utils; ./build-deb.sh; sudo apt-get -y install ./build/amazon-efs-utils*deb; cd $HOME; mkdir efs; sudo mount -t efs {{FileSystemId}}:/ efs; sudo chmod 777 efs; ``` ### Configuring IAM Role and EC2 Instance Profile By default, Ray nodes in a Ray AWS cluster have full EC2 and S3 permissions (i.e. `arn:aws:iam::aws:policy/AmazonEC2FullAccess` and `arn:aws:iam::aws:policy/AmazonS3FullAccess`). This is a good default for trying out Ray clusters but you may want to change the permissions Ray nodes have for various reasons (e.g. to reduce the permissions for security reasons). You can do so by providing a custom `IamInstanceProfile` to the related `node_config`: ```yaml available_node_types: ray.worker.default: node_config: ... IamInstanceProfile: Arn: arn:aws:iam::YOUR_AWS_ACCOUNT:YOUR_INSTANCE_PROFILE ``` Please refer to this [discussion](https://github.com/ray-project/ray/issues/9327) for more details on configuring IAM role and EC2 instance profile. 
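If you manage the instance profile yourself, a small script can fetch the ARN to paste into the `IamInstanceProfile` field above. The following is a hedged sketch using boto3, with `my-custom-ray-profile` as a hypothetical profile name that you would replace with your own.

```python
import boto3

# Hypothetical instance profile name; replace with the profile you created.
PROFILE_NAME = "my-custom-ray-profile"

iam = boto3.client("iam")
profile = iam.get_instance_profile(InstanceProfileName=PROFILE_NAME)

# Paste this ARN into node_config.IamInstanceProfile.Arn in your cluster config.
print(profile["InstanceProfile"]["Arn"])

# List the roles attached to the profile to double-check its permissions.
for role in profile["InstanceProfile"]["Roles"]:
    print("Attached role:", role["RoleName"])
```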
(aws-cluster-s3)= ### Accessing S3 In various scenarios, worker nodes may need write access to an S3 bucket, e.g., Ray Tune has an option to write checkpoints to S3 instead of syncing them directly back to the driver. If you see errors like “Unable to locate credentials”, make sure that the correct `IamInstanceProfile` is configured for worker nodes in your cluster config file. This may look like: ```yaml available_node_types: ray.worker.default: node_config: ... IamInstanceProfile: Arn: arn:aws:iam::YOUR_AWS_ACCOUNT:YOUR_INSTANCE_PROFILE ``` You can verify if the set up is correct by SSHing into a worker node and running ```bash aws configure list ``` You should see something like ```bash Name Value Type Location ---- ----- ---- -------- profile None None access_key ****************XXXX iam-role secret_key ****************YYYY iam-role region None None ``` Please refer to this [discussion](https://github.com/ray-project/ray/issues/9327) for more details on accessing S3. ## Monitor Ray using Amazon CloudWatch ```{eval-rst} Amazon CloudWatch is a monitoring and observability service that provides data and actionable insights to monitor your applications, respond to system-wide performance changes, and optimize resource utilization. CloudWatch integration with Ray requires an AMI (or Docker image) with the Unified CloudWatch Agent pre-installed. AMIs with the Unified CloudWatch Agent pre-installed are provided by the Amazon Ray Team, and are currently available in the us-east-1, us-east-2, us-west-1, and us-west-2 regions. Please direct any questions, comments, or issues to the `Amazon Ray Team `_. The table below lists AMIs with the Unified CloudWatch Agent pre-installed in each region, and you can also find AMIs at `DLAMI Release Notes `_. Each DLAMI (Deep Learning AMI) is pre-installed with the Unified CloudWatch Agent, and its corresponding release notes include AWS CLI commands to query the latest AMI ID. .. list-table:: All available unified CloudWatch agent images * - Base AMI - AMI ID - Region - Unified CloudWatch Agent Version * - AWS Deep Learning AMI (Ubuntu 24.04, 64-bit) - ami-087feac195f30e722 - us-east-1 - v1.300057.1b1167 * - AWS Deep Learning AMI (Ubuntu 24.04, 64-bit) - ami-0ed6c422a7c93278a - us-east-2 - v1.300057.1b1167 * - AWS Deep Learning AMI (Ubuntu 24.04, 64-bit) - ami-0c5ddf2c101267018 - us-west-1 - v1.300057.1b1167 * - AWS Deep Learning AMI (Ubuntu 24.04, 64-bit) - ami-0cfd95c6c87d00570 - us-west-2 - v1.300057.1b1167 .. note:: Using Amazon CloudWatch will incur charges, please refer to `CloudWatch pricing `_ for details. Getting started --------------- 1. Create a minimal cluster config YAML named ``cloudwatch-basic.yaml`` with the following contents: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. code-block:: yaml provider: type: aws region: us-west-2 availability_zone: us-west-2a # Start by defining a `cloudwatch` section to enable CloudWatch integration with your Ray cluster. 
cloudwatch: agent: # Path to Unified CloudWatch Agent config file config: "cloudwatch/example-cloudwatch-agent-config.json" dashboard: # CloudWatch Dashboard name name: "example-dashboard-name" # Path to the CloudWatch Dashboard config file config: "cloudwatch/example-cloudwatch-dashboard-config.json" auth: ssh_user: ubuntu available_node_types: ray.head.default: node_config: InstanceType: c5a.large ImageId: ami-0cfd95c6c87d00570 # Unified CloudWatch agent pre-installed AMI, us-west-2 resources: {} ray.worker.default: node_config: InstanceType: c5a.large ImageId: ami-0cfd95c6c87d00570 # Unified CloudWatch agent pre-installed AMI, us-west-2 IamInstanceProfile: Name: ray-autoscaler-cloudwatch-v1 resources: {} min_workers: 0 2. Download CloudWatch Agent and Dashboard config. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ First, create a ``cloudwatch`` directory in the same directory as ``cloudwatch-basic.yaml``. Then, download the example `CloudWatch Agent `_ and `CloudWatch Dashboard `_ config files to the ``cloudwatch`` directory. .. code-block:: console $ mkdir cloudwatch $ cd cloudwatch $ wget https://raw.githubusercontent.com/ray-project/ray/master/python/ray/autoscaler/aws/cloudwatch/example-cloudwatch-agent-config.json $ wget https://raw.githubusercontent.com/ray-project/ray/master/python/ray/autoscaler/aws/cloudwatch/example-cloudwatch-dashboard-config.json 3. Run ``ray up cloudwatch-basic.yaml`` to start your Ray Cluster. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ This will launch your Ray cluster in ``us-west-2`` by default. When launching a cluster for a different region, you'll need to change your cluster config YAML file's ``region`` AND ``ImageId``. See the "Unified CloudWatch Agent Images" table above for available AMIs by region. 4. Check out your Ray cluster's logs, metrics, and dashboard in the `CloudWatch Console `_! ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ A tail can be acquired on all logs written to a CloudWatch log group by ensuring that you have the `AWS CLI V2+ installed `_ and then running: .. code-block:: bash aws logs tail $log_group_name --follow Advanced Setup -------------- Refer to `example-cloudwatch.yaml `_ for a complete example. 1. Choose an AMI with the Unified CloudWatch Agent pre-installed. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Ensure that you're launching your Ray EC2 cluster in the same region as the AMI, then specify the ``ImageId`` to use with your cluster's head and worker nodes in your cluster config YAML file. The following CLI command returns the latest available Unified CloudWatch Agent Image for ``us-west-2``: .. code-block:: bash aws ec2 describe-images --region us-west-2 --filters "Name=owner-id,Values=160082703681" "Name=name,Values=*cloudwatch*" --query 'Images[*].[ImageId,CreationDate]' --output text | sort -k2 -r | head -n1 .. code-block:: yaml available_node_types: ray.head.default: node_config: InstanceType: c5a.large ImageId: ami-0cfd95c6c87d00570 ray.worker.default: node_config: InstanceType: c5a.large ImageId: ami-0cfd95c6c87d00570 To build your own AMI with the Unified CloudWatch Agent installed: 1. Follow the `CloudWatch Agent Installation `_ user guide to install the Unified CloudWatch Agent on an EC2 instance. 2. Follow the `EC2 AMI Creation `_ user guide to create an AMI from this EC2 instance. 2. Define your own CloudWatch Agent, Dashboard, and Alarm JSON config files. 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ You can start by using the example `CloudWatch Agent `_, `CloudWatch Dashboard `_ and `CloudWatch Alarm `_ config files. These example config files include the following features: **Logs and Metrics**: Logs written to ``/tmp/ray/session_*/logs/**.out`` will be available in the ``{cluster_name}-ray_logs_out`` log group, and logs written to ``/tmp/ray/session_*/logs/**.err`` will be available in the ``{cluster_name}-ray_logs_err`` log group. Log streams are named after the EC2 instance ID that emitted their logs. Extended EC2 metrics including CPU/Disk/Memory usage and process statistics can be found in the ``{cluster_name}-ray-CWAgent`` metric namespace. **Dashboard**: You will have a cluster-level dashboard showing total cluster CPUs and available object store memory. Process counts, disk usage, memory usage, and CPU utilization will be displayed as both cluster-level sums and single-node maximums/averages. **Alarms**: Node-level alarms tracking prolonged high memory, disk, and CPU usage are configured. Alarm actions are NOT set, and must be manually provided in your alarm config file. For more advanced options, see the `Agent `_, `Dashboard `_ and `Alarm `_ config user guides. CloudWatch Agent, Dashboard, and Alarm JSON config files support the following variables: ``{instance_id}``: Replaced with each EC2 instance ID in your Ray cluster. ``{region}``: Replaced with your Ray cluster's region. ``{cluster_name}``: Replaced with your Ray cluster name. See CloudWatch Agent `Configuration File Details `_ for additional variables supported natively by the Unified CloudWatch Agent. .. note:: Remember to replace the ``AlarmActions`` placeholder in your CloudWatch Alarm config file! .. code-block:: json "AlarmActions":[ "TODO: Add alarm actions! See https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html" ] 3. Reference your CloudWatch JSON config files in your cluster config YAML. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Specify the file path to your CloudWatch JSON config files relative to the working directory that you will run ``ray up`` from: .. code-block:: yaml provider: cloudwatch: agent: config: "cloudwatch/example-cloudwatch-agent-config.json" 4. Set your IAM Role and EC2 Instance Profile. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ By default the ``ray-autoscaler-cloudwatch-v1`` IAM role and EC2 instance profile is created at Ray cluster launch time. This role contains all additional permissions required to integrate CloudWatch with Ray, namely the ``CloudWatchAgentAdminPolicy``, ``AmazonSSMManagedInstanceCore``, ``ssm:SendCommand``, ``ssm:ListCommandInvocations``, and ``iam:PassRole`` managed policies. Ensure that all worker nodes are configured to use the ``ray-autoscaler-cloudwatch-v1`` EC2 instance profile in your cluster config YAML: .. code-block:: yaml ray.worker.default: node_config: InstanceType: c5a.large IamInstanceProfile: Name: ray-autoscaler-cloudwatch-v1 5. Export Ray system metrics to CloudWatch. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ To export Ray's Prometheus system metrics to CloudWatch, first ensure that your cluster has the Ray Dashboard installed, then uncomment the ``head_setup_commands`` section in `example-cloudwatch.yaml file `_ file. You can find Ray Prometheus metrics in the ``{cluster_name}-ray-prometheus`` metric namespace. .. 
code-block:: yaml head_setup_commands: # Make `ray_prometheus_waiter.sh` executable. - >- RAY_INSTALL_DIR=`pip show ray | grep -Po "(?<=Location:).*"` && sudo chmod +x $RAY_INSTALL_DIR/ray/autoscaler/aws/cloudwatch/ray_prometheus_waiter.sh # Copy `prometheus.yml` to Unified CloudWatch Agent folder - >- RAY_INSTALL_DIR=`pip show ray | grep -Po "(?<=Location:).*"` && sudo cp -f $RAY_INSTALL_DIR/ray/autoscaler/aws/cloudwatch/prometheus.yml /opt/aws/amazon-cloudwatch-agent/etc # First get current cluster name, then let the Unified CloudWatch Agent restart and use `AmazonCloudWatch-ray_agent_config_{cluster_name}` parameter at SSM Parameter Store. - >- nohup sudo sh -c "`pip show ray | grep -Po "(?<=Location:).*"`/ray/autoscaler/aws/cloudwatch/ray_prometheus_waiter.sh `cat ~/ray_bootstrap_config.yaml | jq '.cluster_name'` >> '/opt/aws/amazon-cloudwatch-agent/logs/ray_prometheus_waiter.out' 2>> '/opt/aws/amazon-cloudwatch-agent/logs/ray_prometheus_waiter.err'" & 6. Update CloudWatch Agent, Dashboard and Alarm config files. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ You can apply changes to the CloudWatch Logs, Metrics, Dashboard, and Alarms for your cluster by simply modifying the CloudWatch config files referenced by your Ray cluster config YAML and re-running ``ray up example-cloudwatch.yaml``. The Unified CloudWatch Agent will be automatically restarted on all cluster nodes, and your config changes will be applied. ``` --- # Launching Ray Clusters on Azure This guide details the steps needed to start a Ray cluster on Azure. There are two ways to start an Azure Ray cluster. - Launch through Ray cluster launcher. - Deploy a cluster using Azure portal. ## Using Ray cluster launcher ### Install Ray cluster launcher The Ray cluster launcher is part of the `ray` CLI. Use the CLI to start, stop and attach to a running ray cluster using commands such as `ray up`, `ray down` and `ray attach`. You can use pip to install the ray CLI with cluster launcher support. Follow [the Ray installation documentation](installation) for more detailed instructions. ```bash # install ray pip install -U ray[default] ``` ### Install and Configure Azure CLI Next, install the Azure CLI (`pip install -U azure-cli azure-identity`) and login using `az login`. ```bash # Install packages to use azure CLI. pip install azure-cli azure-identity # Login to azure. This will redirect you to your web browser. az login ``` ### Install Azure SDK libraries Now, install the Azure SDK libraries that enable the Ray cluster launcher to build Azure infrastructure. ```bash # Install azure SDK libraries. pip install azure-core azure-mgmt-network azure-mgmt-common azure-mgmt-resource azure-mgmt-compute msrestazure ``` ### Start Ray with the Ray cluster launcher The provided [cluster config file](https://github.com/ray-project/ray/tree/eacc763c84d47c9c5b86b26a32fd62c685be84e6/python/ray/autoscaler/azure/example-full.yaml) will create a small cluster with a Standard DS2v3 on-demand head node that is configured to autoscale to up to two Standard DS2v3 [spot-instance](https://docs.microsoft.com/en-us/azure/virtual-machines/windows/spot-vms) worker nodes. Note that you'll need to fill in your Azure [resource_group](https://github.com/ray-project/ray/blob/eacc763c84d47c9c5b86b26a32fd62c685be84e6/python/ray/autoscaler/azure/example-full.yaml#L42) and [location](https://github.com/ray-project/ray/blob/eacc763c84d47c9c5b86b26a32fd62c685be84e6/python/ray/autoscaler/azure/example-full.yaml#L41) in those templates. 
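If you're not sure which values to use, the Azure CLI can list the options for you. This is only a convenience sketch and assumes you've already run `az login`:

```bash
# List the Azure locations you can deploy into.
az account list-locations --output table

# List your existing resource groups (or create one with `az group create`).
az group list --output table
```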
You also need to set the subscription to use. You can do this from the command line with `az account set -s <subscription_id>` or by filling in the [subscription_id](https://github.com/ray-project/ray/blob/eacc763c84d47c9c5b86b26a32fd62c685be84e6/python/ray/autoscaler/azure/example-full.yaml#L44) in the cluster config file.

#### Download and configure the example configuration

Download the reference example locally:

```bash
# Download the example-full.yaml
wget https://raw.githubusercontent.com/ray-project/ray/master/python/ray/autoscaler/azure/example-full.yaml
```

##### Automatic SSH Key Generation

To connect to the provisioned head node VM, Ray generates SSH keys automatically if none are specified in the config. This is the simplest approach and requires no manual key management. The default configuration in `example-full.yaml` uses automatic key generation:

```yaml
auth:
    ssh_user: ubuntu
    # SSH keys are auto-generated if not specified
    # Uncomment and specify custom paths if you want to use existing keys:
    # ssh_private_key: /path/to/your/key.pem
    # ssh_public_key: /path/to/your/key.pub
```

##### (Optional) Manual SSH Key Configuration

If you prefer to use your own existing SSH keys, uncomment and specify both of the key paths in the `auth` section. For example, to use an existing `ed25519` key pair:

```yaml
auth:
    ssh_user: ubuntu
    ssh_private_key: ~/.ssh/id_ed25519
    ssh_public_key: ~/.ssh/id_ed25519.pub
```

Or for RSA keys:

```yaml
auth:
    ssh_user: ubuntu
    ssh_private_key: ~/.ssh/id_rsa
    ssh_public_key: ~/.ssh/id_rsa.pub
```

Both methods inject the public key directly into the VM's `~/.ssh/authorized_keys` via Azure ARM templates.

#### Launch the Ray cluster on Azure

```bash
# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to SSH into the cluster head node.
ray up example-full.yaml

# Get a remote screen on the head node.
ray attach example-full.yaml
# Try running a Ray program.

# Tear down the cluster.
ray down example-full.yaml
```

Congratulations, you have started a Ray cluster on Azure!

## Using Azure portal

Alternatively, you can deploy a cluster using the Azure portal directly. Note that autoscaling is done using Azure VM Scale Sets and not through the Ray autoscaler. This will deploy [Azure Data Science VMs (DSVM)](https://azure.microsoft.com/en-us/services/virtual-machines/data-science-virtual-machines/) for both the head node and the auto-scalable cluster managed by [Azure Virtual Machine Scale Sets](https://azure.microsoft.com/en-us/services/virtual-machine-scale-sets/). The head node conveniently exposes both SSH and JupyterLab.

Once the template is successfully deployed, the deployment Outputs page provides the SSH command to connect and the link to the JupyterHub on the head node (username/password as specified in the template input). Use the following code in a Jupyter notebook (using the conda environment specified in the template input, `py38_tensorflow` by default) to connect to the Ray cluster.

```python
import ray; ray.init()
```

Under the hood, the [azure-init.sh](https://github.com/ray-project/ray/blob/master/doc/azure/azure-init.sh) script is executed and performs the following actions:

1. Activates one of the conda environments available on DSVM
2. Installs Ray and any other user-specified dependencies
3. Sets up a systemd task (``/lib/systemd/system/ray.service``) to start Ray in head or worker mode

---

# Launching Ray Clusters on GCP

This guide details the steps needed to start a Ray cluster in GCP.
To start a GCP Ray cluster, you will use the Ray cluster launcher with the Google API client. ## Install Ray cluster launcher The Ray cluster launcher is part of the `ray` CLI. Use the CLI to start, stop and attach to a running ray cluster using commands such as `ray up`, `ray down` and `ray attach`. You can use pip to install the ray CLI with cluster launcher support. Follow [the Ray installation documentation](installation) for more detailed instructions. ```bash # install ray pip install -U ray[default] ``` ## Install and Configure Google API Client If you have never created a Google APIs Console project, read google Cloud's [Managing Projects page](https://cloud.google.com/resource-manager/docs/creating-managing-projects?visit_id=637952351450670909-433962807&rd=1) and create a project in the [Google API Console](https://console.developers.google.com/). Next, install the Google API Client using `pip install -U google-api-python-client`. ```bash # Install the Google API Client. pip install google-api-python-client ``` ## Start Ray with the Ray cluster launcher Once the Google API client is configured to manage resources on your GCP account, you should be ready to launch your cluster. The provided [cluster config file](https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/gcp/example-full.yaml) will create a small cluster with an on-demand n1-standard-2 head node and is configured to autoscale to up to two n1-standard-2 [preemptible workers](https://cloud.google.com/preemptible-vms/). Note that you'll need to fill in your GCP [project_id](https://github.com/ray-project/ray/blob/eacc763c84d47c9c5b86b26a32fd62c685be84e6/python/ray/autoscaler/gcp/example-full.yaml#L42) in those templates. Test that it works by running the following commands from your local machine: ```bash # Download the example-full.yaml wget https://raw.githubusercontent.com/ray-project/ray/master/python/ray/autoscaler/gcp/example-full.yaml # Edit the example-full.yaml to update project_id. # vi example-full.yaml # Create or update the cluster. When the command finishes, it will print # out the command that can be used to SSH into the cluster head node. ray up example-full.yaml # Get a remote screen on the head node. ray attach example-full.yaml # Try running a Ray program. python -c 'import ray; ray.init()' exit # Tear down the cluster. ray down example-full.yaml ``` Congrats, you have started a Ray cluster on GCP! ## GCP Configurations ### Running workers with Service Accounts By default, only the head node runs with a Service Account (`ray-autoscaler-sa-v1@.iam.gserviceaccount.com`). To enable workers to run with this same Service Account (to access Google Cloud Storage, or GCR), add the following configuration to the worker_node configuration: ```yaml available_node_types: ray.worker.default: node_config: ... serviceAccounts: - email: ray-autoscaler-sa-v1@.iam.gserviceaccount.com scopes: - https://www.googleapis.com/auth/cloud-platform ``` --- (on-prem)= # Launching an On-Premise Cluster This document describes how to set up an on-premise Ray cluster, i.e., to run Ray on bare metal machines, or in a private cloud. We provide two ways to start an on-premise cluster. * You can [manually set up](manual-setup-cluster) the Ray cluster by installing the Ray package and starting the Ray processes on each node. * Alternatively, if you know all the nodes in advance and have SSH access to them, you should start the Ray cluster using the [cluster-launcher](manual-cluster-launcher). 
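Whichever approach you choose, each machine needs a compatible Ray installation of the same version. Before wiring nodes together, it can help to confirm what's installed on each node, for example:

```bash
# Print the installed Ray version on this node. Run this on every machine
# that will join the cluster; mismatched versions prevent nodes from connecting.
ray --version
```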
(manual-setup-cluster)=
## Manually Set up a Ray Cluster

This section assumes that you have a list of machines and that the nodes in the cluster share the same network. It also assumes that Ray is installed on each machine. You can use pip to install the ray command line tool with cluster launcher support. Follow the [Ray installation instructions](installation) for more details.

```bash
# install ray
pip install -U "ray[default]"
```

### Start the Head Node

Choose any node to be the head node and run the following. If the `--port` argument is omitted, Ray first chooses port 6379, falling back to a random port if 6379 is in use.

```bash
ray start --head --port=6379
```

The command will print out the Ray cluster address, which can be passed to `ray start` on other machines to start the worker nodes (see below). If you receive a ConnectionError, check your firewall settings and network configuration.

### Start Worker Nodes

Then on each of the other nodes, run the following command to connect to the head node you just created.

```bash
ray start --address=<head-node-address:port>
```

Make sure to replace `head-node-address:port` with the value printed by the command on the head node (it should look something like 123.45.67.89:6379).

Note that if your compute nodes are on their own subnetwork with Network Address Translation, the address printed by the head node will not work if connecting from a machine outside that subnetwork. You will need to use a head node address reachable from the remote machine. If the head node has a domain address like compute04.berkeley.edu, you can simply use that in place of an IP address and rely on DNS.

Ray auto-detects the resources (e.g., CPU) available on each node, but you can also manually override this by passing custom resources to the `ray start` command. For example, if you wish to specify that a machine has 10 CPUs and 1 GPU available for use by Ray, you can do this with the flags `--num-cpus=10` and `--num-gpus=1`. See the [Configuration page](configuring-ray) for more information.

### Troubleshooting

If you see `Unable to connect to GCS at ...`, this means the head node is inaccessible at the given `--address`. Some possible causes include:

- the head node is not actually running;
- a different version of Ray is running at the specified address;
- the specified address is wrong;
- or there are firewall settings preventing access.

If the connection fails, use a tool such as `nmap` or `nc` to check whether each port can be reached from a node.

```bash
$ nmap -sV --reason -p $PORT $HEAD_ADDRESS
Nmap scan report for compute04.berkeley.edu (123.456.78.910)
Host is up, received echo-reply ttl 60 (0.00087s latency).
rDNS record for 123.456.78.910: compute04.berkeley.edu
PORT     STATE SERVICE REASON  VERSION
6379/tcp open  redis?  syn-ack
Service detection performed. Please report any incorrect results at https://nmap.org/submit/ .
$ nc -vv -z $HEAD_ADDRESS $PORT
Connection to compute04.berkeley.edu 6379 port [tcp/*] succeeded!
```

If the node cannot access that port at that IP address, you might see

```bash
$ nmap -sV --reason -p $PORT $HEAD_ADDRESS
Nmap scan report for compute04.berkeley.edu (123.456.78.910)
Host is up (0.0011s latency).
rDNS record for 123.456.78.910: compute04.berkeley.edu
PORT     STATE  SERVICE REASON         VERSION
6379/tcp closed redis   reset ttl 60
Service detection performed. Please report any incorrect results at https://nmap.org/submit/ .
$ nc -vv -z $HEAD_ADDRESS $PORT
nc: connect to compute04.berkeley.edu port 6379 (tcp) failed: Connection refused
```

(manual-cluster-launcher)=
## Using Ray cluster launcher

The Ray cluster launcher is part of the `ray` command line tool. It allows you to start, stop and attach to a running ray cluster using commands such as `ray up`, `ray down` and `ray attach`. You can use pip to install it, or follow [install ray](installation) for more detailed instructions.

```bash
# install ray
pip install "ray[default]"
```

### Start Ray with the Ray cluster launcher

The provided [example-full.yaml](https://github.com/ray-project/ray/tree/eacc763c84d47c9c5b86b26a32fd62c685be84e6/python/ray/autoscaler/local/example-full.yaml) cluster config file will create a Ray cluster given a list of nodes. Note that you'll need to fill in your [head_ip](https://github.com/ray-project/ray/blob/eacc763c84d47c9c5b86b26a32fd62c685be84e6/python/ray/autoscaler/local/example-full.yaml#L20), a list of [worker_ips](https://github.com/ray-project/ray/blob/eacc763c84d47c9c5b86b26a32fd62c685be84e6/python/ray/autoscaler/local/example-full.yaml#L26), and the [ssh_user](https://github.com/ray-project/ray/blob/eacc763c84d47c9c5b86b26a32fd62c685be84e6/python/ray/autoscaler/local/example-full.yaml#L34) field in those templates.

Test that it works by running the following commands from your local machine:

```bash
# Download the example-full.yaml
wget https://raw.githubusercontent.com/ray-project/ray/master/python/ray/autoscaler/local/example-full.yaml

# Update the example-full.yaml to update head_ip, worker_ips, and ssh_user.
# vi example-full.yaml

# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to SSH into the cluster head node.
ray up example-full.yaml

# Get a remote screen on the head node.
ray attach example-full.yaml
# Try running a Ray program.

# Tear down the cluster.
ray down example-full.yaml
```

Congrats, you have started a local Ray cluster!

---

# Launching Ray Clusters on vSphere

This guide details the steps needed to launch a Ray cluster in a vSphere environment.

To start a vSphere Ray cluster, you will use the Ray cluster launcher along with the supervisor service (control plane) deployed on vSphere.

## Prepare the vSphere environment

If you don't already have a vSphere deployment, you can learn more about it by reading the [vSphere documentation](https://techdocs.broadcom.com/us/en/vmware-cis/vsphere/vsphere-supervisor/7-0/vsphere-with-tanzu-configuration-and-management-7-0/configuring-and-managing-a-supervisor-cluster/deploy-a-supervisor-with-nsx-networking.html). The vSphere Ray cluster launcher requires vSphere version 9.0 or later, along with the following prerequisites for creating Ray clusters.

* [A vSphere cluster with Workload Control Plane (WCP) enabled](https://techdocs.broadcom.com/us/en/vmware-cis/vsphere/vsphere-supervisor/7-0/vsphere-with-tanzu-configuration-and-management-7-0/configuring-and-managing-a-supervisor-cluster/deploy-a-supervisor-with-nsx-networking.html)

## Installing supervisor service for Ray on vSphere

Refer to the [build and installation guide](https://github.com/vmware/ray-on-vcf) to install the Ray control plane as a supervisor service on vSphere. The vSphere Ray cluster launcher requires the vSphere environment to have a control plane installed as a supervisor service for deploying a Ray cluster. This service installs all the k8s CRDs used to rapidly create head and worker nodes.
The details of the Ray cluster provisioning process using supervisor service can be found in this [Ray on vSphere architecture document](https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/_private/vsphere/ARCHITECTURE.md). ## Install Ray cluster launcher The Ray cluster launcher is part of the `ray` CLI. Use the CLI to start, stop and attach to a running ray cluster using commands such as `ray up`, `ray down` and `ray attach`. You can use pip to install the ray CLI with cluster launcher support. Follow [the Ray installation documentation](installation) for more detailed instructions. ```bash # install ray pip install -U ray[default] ``` ## Start Ray with the Ray cluster launcher Once the Ray supervisor service is active, you should be ready to launch your cluster using the cluster launcher. The provided [cluster config file](https://raw.githubusercontent.com/ray-project/ray/master/python/ray/autoscaler/vsphere/example-full.yaml) will create a small cluster with a head node configured to autoscale to up to two workers. Note that you need to configure your vSphere credentials and vCenter server address either via setting environment variables or adding them to the Ray cluster configuration YAML file. Test that it works by running the following commands from your local machine: ```bash # Download the example-full.yaml wget https://raw.githubusercontent.com/ray-project/ray/master/python/ray/autoscaler/vsphere/example-full.yaml # Create or update the cluster. When the command finishes, it will print # out the command that can be used to SSH into the cluster head node. ray up example-full.yaml # Get a remote screen on the head node. ray attach example-full.yaml # Try running a Ray program. python -c 'import ray; ray.init()' exit # Tear down the cluster. ray down example-full.yaml ``` Congrats, you have started a Ray cluster on vSphere! ## Configure vSAN File Service as persistent storage for Ray AI Libraries Starting in Ray 2.7, Ray AI Libraries (Train and Tune) will require users to provide a cloud storage or NFS path when running distributed training or tuning jobs. In a vSphere environment with a vSAN datastore, you can utilize the vSAN File Service feature to employ vSAN as a shared persistent storage. You can refer to [this vSAN File Service document](https://techdocs.broadcom.com/us/en/vmware-cis/vsan/vsan/8-0/vsan-administration.html) to create and configure NFS file shares supported by vSAN. The general steps are as follows: 1. Enable vSAN File Service and configure it with domain information and IP address pools. 2. Create a vSAN file share with NFS as the protocol. 3. View the file share information to get NFS export path. Once a file share is created, you can mount it into the head and worker node and use the mount path as the `storage_path` for the `RunConfig` parameter in Ray Train and Tune. --- (vm-logging)= # Log Persistence Logs are useful for troubleshooting Ray applications and Clusters. For example, you may want to access system logs if a node terminates unexpectedly. Ray does not provide a native storage solution for log data. Users need to manage the lifecycle of the logs by themselves. The following sections provide instructions on how to collect logs from Ray Clusters running on VMs. ## Ray log directory By default, Ray writes logs to files in the directory `/tmp/ray/session_*/logs` on each Ray node's file system, including application logs and system logs. 
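For a quick look at what's there, you can list the log directory of the most recent session on a node. This is only a sketch; it assumes the default `/tmp/ray` root and uses the `session_latest` symlink that Ray maintains:

```bash
# List the log files for the most recent Ray session on this node.
ls /tmp/ray/session_latest/logs
```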
Learn more about the {ref}`log directory and log files ` and the {ref}`log rotation configuration ` before you start to collect logs. ## Log processing tools A number of open source log processing tools are available, such as [Vector][Vector], [FluentBit][FluentBit], [Fluentd][Fluentd], [Filebeat][Filebeat], and [Promtail][Promtail]. [Vector]: https://vector.dev/ [FluentBit]: https://docs.fluentbit.io/manual [Filebeat]: https://www.elastic.co/guide/en/beats/filebeat/7.17/index.html [Fluentd]: https://docs.fluentd.org/ [Promtail]: https://grafana.com/docs/loki/latest/clients/promtail/ ## Log collection After choosing a log processing tool based on your needs, you may need to perform the following steps: 1. Ingest log files on each node of your Ray Cluster as sources. 2. Parse and transform the logs. You may want to use {ref}`Ray's structured logging ` to simplify this step. 3. Ship the transformed logs to log storage or management systems. --- # Ray Data Benchmarks This page documents benchmark results and methodologies for evaluating Ray Data performance across a variety of data modalities and workloads. --- ## Workload Summary - **Image Classification**: Processing 800k ImageNet images using ResNet18. The pipeline downloads images, deserializes them, applies transformations, runs ResNet18 inference on GPU, and outputs predicted labels. - **Document Embedding**: Processing 10k PDF documents from Digital Corpora. The pipeline reads PDF documents, extracts text page-by-page, splits into chunks with overlap, embeds using a `all-MiniLM-L6-v2` model on GPU, and outputs embeddings with metadata. - **Audio Transcription**: Transcribing 113,800 audio files from Mozilla Common Voice 17 dataset using a Whisper-tiny model. The pipeline loads FLAC audio files, resamples to 16kHz, extracts features using Whisper's processor, runs GPU-accelerated batch inference with the model, and outputs transcriptions with metadata. - **Video Object Detection**: Processing 10k video frames from Hollywood2 action videos dataset using YOLOv11n for object detection. The pipeline loads video frames, resizes them to 640x640, runs batch inference with YOLO to detect objects, extracts individual object crops, and outputs object metadata and cropped images in Parquet format. - **Large-scale Image Embedding**: Processing 4TiB of base64-encoded images from a Parquet dataset using ViT for image embedding. The pipeline decodes base64 images, converts to RGB, preprocesses using ViTImageProcessor (resizing, normalization), runs GPU-accelerated batch inference with ViT to generate embeddings, and outputs results to Parquet format. Ray Data 2.50 is compared with Daft 0.6.2, an open source multimodal data processing library built on Ray. --- ## Results Summary ![Multimodal Inference Benchmark Results](/data/images/multimodal_inference_results.png) ```{list-table} :header-rows: 1 :name: benchmark-results-summary - - Workload - **Daft (s)** - **Ray Data (s)** - - **Image Classification** - 195.3 ± 2.5 - **111.2 ± 1.2** - - **Document Embedding** - 51.3 ± 1.3 - **29.4 ± 0.8** - - **Audio Transcription** - 510.5 ± 10.4 - **312.6 ± 3.1** - - **Video Object Detection** - 735.3 ± 7.6 - **623 ± 1.4** - - **Large Scale Image Embedding** - 752.75 ± 5.5 - **105.81 ± 0.79** ``` All benchmark results are taken from an average/std across 4 runs. A warmup was also run to download the model and remove any startup overheads that would affect the result. 
## Workload Configuration ```{list-table} :header-rows: 1 :name: workload-configuration - - Workload - Dataset - Data Path - Cluster Configuration - Code - - **Image Classification** - 800k images from ImageNet - s3://ray-example-data/imagenet/metadata_file.parquet - 1 head / 8 workers of varying instance types - [Link](https://github.com/ray-project/ray/tree/master/release/nightly_tests/multimodal_inference_benchmarks/image_classification) - - **Document Embedding** - 10k PDFs from Digital Corpora - s3://ray-example-data/digitalcorpora/metadata - g6.xlarge head, 8 g6.xlarge workers - [Link](https://github.com/ray-project/ray/tree/master/release/nightly_tests/multimodal_inference_benchmarks/document_embedding) - - **Audio Transcription** - 113,800 audio files from Mozilla Common Voice 17 en dataset - s3://air-example-data/common_voice_17/parquet/ - g6.xlarge head, 8 g6.xlarge workers - [Link](https://github.com/ray-project/ray/tree/master/release/nightly_tests/multimodal_inference_benchmarks/audio_transcription) - - **Video Object Detection** - 1,000 videos from Hollywood-2 Human Actions dataset - s3://ray-example-data/videos/Hollywood2-actions-videos/Hollywood2/AVIClips/ - 1 head, 8 workers of varying instance types - [Link](https://github.com/ray-project/ray/tree/master/release/nightly_tests/multimodal_inference_benchmarks/video_object_detection) - - **Large-scale Image Embedding** - 4 TiB of Parquet files containing base64 encoded images - s3://ray-example-data/image-datasets/10TiB-b64encoded-images-in-parquet-v3/ - m5.24xlarge (head), 40 g6e.xlarge (gpu workers), 64 r6i.8xlarge (cpu workers) - [Link](https://github.com/ray-project/ray/tree/master/release/nightly_tests/multimodal_inference_benchmarks/large_image_embedding) ``` ## Image Classification across different instance types This experiment compares the performance of Ray Data with Daft on the image classification workload across a variety of instance types. Each run is an average/std across 3 runs. A warmup was also run to download the model and remove any startup overheads that would affect the result. ```{list-table} :header-rows: 1 :name: image-classification-results - - - g6.xlarge (4 CPUs) - g6.2xlarge (8 CPUs) - g6.4xlarge (16 CPUs) - g6.8xlarge (32 CPUs) - - **Ray Data (s)** - 456.2 ± 39.9 - **195.5 ± 7.6** - **144.8 ± 1.9** - **111.2 ± 1.2** - - **Daft (s)** - **315.0 ± 31.2** - 202.0 ± 2.2 - 195.0 ± 6.6 - 195.3 ± 2.5 ``` ## Video Object Detection across different instance types This experiment compares the performance of Ray Data with Daft on the video object detection workload across a variety of instance types. Each run is an average/std across 4 runs. A warmup was also run to download the model and remove any startup overheads that would affect the result. ```{list-table} :header-rows: 1 :name: video-object-detection-results - - - g6.xlarge (4 CPUs) - g6.2xlarge (8 CPUs) - g6.4xlarge (16 CPUs) - g6.8xlarge (32 CPUs) - - **Ray Data (s)** - 922 ± 13.8 - **704.8 ± 25.0** - **629 ± 1.8** - **623 ± 1.4** - - **Daft (s)** - **758.8 ± 10.4** - 735.3 ± 7.6 - 747.5 ± 13.4 - 771.3 ± 25.6 ``` --- # Contributing Guide If you want your changes to be reviewed and merged quickly, following a few key practices makes a big difference. Clear, focused, and well-structured contributions help reviewers understand your intent and ensure your improvements land smoothly. :::{seealso} This guide covers contributing to Ray Data in specific. 
For information on contributing to the Ray project in general, see the {ref}`general Ray contributing guide`. ::: ## Find something to work on Start by solving a problem you encounter, like fixing a bug or adding a missing feature. If you're unsure where to start: * Browse the issue tracker for problems you understand. * Look for labels like ["good first issue"](https://github.com/ray-project/ray/issues?q=is%3Aissue%20state%3Aopen%20label%3Agood-first-issue%20label%3Adata) for approachable tasks. * [Join the Ray Slack](https://www.ray.io/join-slack) and post in #data-contributors. ## Get early feedback If you’re adding a new public API or making a substantial refactor, **share your plan early**. Discussing changes before you invest a lot of work can save time and align your work with the project’s direction. You can open a draft PR, discuss on an Issue, or post in Slack for early feedback. It won’t affect acceptance and often improves the final design. ## Write good tests Most changes to Ray Data require tests. For tips on how to write good tests, see {ref}`How to write tests `. ## Write simple, clear code Ray Data values **readable, maintainable, and extendable** code over clever tricks. For guidance on how to write code that aligns with Ray Data's design taste, see [A Philosophy of Software Design](https://web.stanford.edu/~ouster/cgi-bin/aposd2ndEdExtract.pdf). ## Test your changes locally To test your changes locally, build [Ray from source](https://docs.ray.io/en/latest/ray-contribute/development.html). For Ray Data development, you typically only need the Python environment—you can skip the C++ build unless you’re also contributing to Ray Core. Before submitting a PR, run `pre-commit` to lint your changes and `pytest` to execute your tests. Note that the full Ray Data test suite can be heavy to run locally, start with tests directly related to your changes. For example, if you modified `map`, from `python/ray/data/tests` run: `pytest test_map.py`. ## Open a pull request ### Write a clear pull request description Explain **why the change exists and what it achieves**. Clear descriptions reduce back-and-forth and speed up reviews. Here's an example of a PR with a good description: [[Data] Refactor PhysicalOperator.completed to fix side effects ](https://github.com/ray-project/ray/pull/58915). ### Keep pull requests small Review difficulty scales non-linearly with PR size. For fast reviews, do the following: * **Keep PRs under ~200 lines** of change when possible. * **Split large PRs** into multiple incremental PRs. * Avoid mixing refactors and new features in the same PR. Here's an example of a PR that keeps its scope small: [[Data] Support Non-String Items for ApproximateTopK Aggregator](https://github.com/ray-project/ray/pull/58659). While the broader effort focuses on optimizing preprocessors, this change was deliberately split out as a small, incremental PR, which made it much easier to review. ### Make CI pass Ray's CI runs lint and a small set of tests first in the `buildkite/microcheck` check. Start by making that pass. Once it’s green, tag your reviewer. They can add the go label to trigger the full test suite. --- (how-to-write-tests)= # How to write tests :::{note} **Disclaimer**: There are no hard rules in software engineering. Use your judgment when applying these. ::: Flaky or brittle tests (the kind that break when assumptions shift) slow development. Nobody likes getting stuck on a PR because a test failed for reasons unrelated to their change. 
This guide is a collection of practices to help you write tests that support the Ray Data project, not slow it down.

## General good practices

### Prefer unit tests over integration tests

Unit tests give faster feedback and make it easier to pinpoint failures. They run in milliseconds, not seconds, and don't depend on Ray clusters, external systems, or timing. This keeps the test suite fast, reliable, and easy to maintain.

:::{note}
Put unit tests in `python/ray/data/tests/unit`.
:::

### Use fixtures, skip try-finally

Fixtures make tests cleaner, more reusable, and better isolated. They're the right tool for setup and teardown, especially for things like `monkeypatch`. `try-finally` works, but fixtures make intent clearer and avoid boilerplate.

**Original code**

```python
def test_dynamic_block_split(ray_start_regular_shared):
    ctx = ray.data.context.DataContext.get_current()
    original_target_max_block_size = ctx.target_max_block_size
    ctx.target_max_block_size = 1
    try:
        ...
    finally:
        ctx.target_max_block_size = original_target_max_block_size
```

**Better**

```python
def test_dynamic_block_split(ray_start_regular_shared, restore_data_context):
    ctx = ray.data.context.DataContext.get_current()
    ctx.target_max_block_size = 1
    ...
    # No need for try-finally: `restore_data_context` resets the context.
```

## Ray-specific practices

### Don't assume Datasets produce outputs in a specific order

Unless you set `preserve_order=True` in the `DataContext`, Ray Data doesn't guarantee an output order. If your test relies on order without explicitly asking for it, you're setting yourself up for brittle failures.

**Original code**

```python
ds_dfs = []
for path in os.listdir(out_path):
    assert path.startswith("data_") and path.endswith(".parquet")
    ds_dfs.append(pd.read_parquet(os.path.join(out_path, path)))
ds_df = pd.concat(ds_dfs).reset_index(drop=True)
df = pd.concat([df1, df2]).reset_index(drop=True)
assert ds_df.equals(df)
```

**Better**

```python
from ray.data._internal.util import rows_same

actual_data = pd.read_parquet(out_path)
expected_data = pd.concat([df1, df2])
assert rows_same(actual_data, expected_data)
```

:::{tip}
Use the `ray.data._internal.util.rows_same` utility function to compare pandas DataFrames for equality while ignoring indices and order.
:::

### Prefer shared cluster fixtures

Prefer shared cluster fixtures like `ray_start_regular_shared` over isolated cluster fixtures like `shutdown_only` and `ray_start_regular`. `shutdown_only` and `ray_start_regular` restart the Ray cluster after each test finishes. Starting and stopping Ray can take over a second — which sounds small, but across thousands of tests (plus parameterizations) it adds up fast. Only use isolated clusters when your test truly needs a fresh cluster.

:::{note}
There's an inherent tradeoff between isolation and speed here. For this specific case, prioritize speed.
:::

**Original code**

```python
@pytest.mark.parametrize("concurrency", [-1, 1.5], ids=["negative", "float"])
def test_invalid_concurrency_raises(shutdown_only, concurrency):
    ds = ray.data.range(1)  # Each parametrization restarts the Ray cluster!
    with pytest.raises(ValueError):
        ds.map(lambda row: row, concurrency=concurrency)
```

**Better**

```python
@pytest.mark.parametrize("concurrency", [-1, 1.5], ids=["negative", "float"])
def test_invalid_concurrency_raises(ray_start_regular_shared, concurrency):
    ds = ray.data.range(1)  # Each parametrization reuses the same Ray cluster.
with pytest.raises(ValueError): ds.map(lambda row: row, concurrency=concurrency) ``` ## Avoid testing against repr outputs to validate specific data `repr` output isn’t part of any interface contract — it can change at any time. Besides, tests that assert against repr often hide the real intent: are you trying to check the data, or just how it happens to print? Be explicit about what you care about. **Original code** ```python assert str(ds) == "Dataset(num_rows=6, schema={one: int64, two: string})", ds ``` **Better** ```python assert ds.schema() == Schema(pa.schema({"one": pa.int64(), "two": pa.string()})) assert ds.count() == 6 ``` ## Avoid assumptions about the number or size of blocks Unless you’re testing an API like `repartition`, don’t lock your test to a specific number or size of blocks. Both can change depending on the implementation or the cluster config — and that’s usually fine. **Original code** ```python ds = ray.data.read_parquet(paths + [txt_path], filesystem=fs) assert ds._plan.initial_num_blocks() == 2 # Where does 2 come from? assert rows_same(ds.to_pandas(), expected_data) ``` **Better** ```python ds = ray.data.read_parquet(paths + [txt_path], filesystem=fs) # Assertion about number of blocks has been removed. assert rows_same(ds.to_pandas(), expected_data) ``` **Original code** ```python ds2 = ds.repartition(5) assert ds2._plan.initial_num_blocks() == 5 assert ds2._block_num_rows() == [10, 10, 0, 0, 0] # Magic numbers? ``` **Better** ```python ds2 = ds.repartition(5) assert sum(len(bundle.blocks) for bundle in ds.iter_internal_ref_bundles()) == 5 # Assertion about the number of rows in each block has been removed. ``` ## Avoid testing that the DAG looks a particular way The operators in the execution plan can shift over time as the implementation evolves. Unless you’re specifically testing optimization rules or working at the operator level, tests shouldn’t expect a particular DAG structure. **Original code** ```python # Check that metadata fetch is included in stats. assert "FromArrow" in ds.stats() # Underlying implementation uses `FromArrow` operator assert ds._plan._logical_plan.dag.name == "FromArrow" ``` **Better** ```python # (Assertions removed). ``` --- # Contributing to the Ray Documentation There are many ways to contribute to the Ray documentation, and we're always looking for new contributors. Even if you just want to fix a typo or expand on a section, please feel free to do so! This document walks you through everything you need to do to get started. ## Editorial style We follow the [Google developer documentation style guide](https://developers.google.com/style). Here are some highlights: * [Use present tense](https://developers.google.com/style/tense) * [Use second person](https://developers.google.com/style/person) * [Use contractions](https://developers.google.com/style/contractions) * [Use active voice](https://developers.google.com/style/voice) * [Use sentence case](https://developers.google.com/style/capitalization) The editorial style is enforced in CI by Vale. For more information, see [How to use Vale](vale). ## Building the Ray documentation If you want to contribute to the Ray documentation, you need a way to build it. Don't install Ray in the environment you plan to use to build documentation. The requirements for the docs build system are generally not compatible with those you need to run Ray itself. Follow these instructions to build the documentation: ### Fork Ray 1. 
[Fork the Ray repository](https://docs.ray.io/en/master/ray-contribute/development.html#fork-the-ray-repository)
2. [Clone the forked repository](https://docs.ray.io/en/master/ray-contribute/development.html#fork-the-ray-repository) to your local machine

Next, change into the `ray/doc` directory:

```shell
cd ray/doc
```

### Install dependencies

If you haven't done so already, create a Python environment separate from the one you use to build and run Ray, preferably using the latest version of Python. For example, if you're using `conda`:

```shell
conda create -n docs python=3.12
```

Next, activate the Python environment you are using (e.g., venv, conda, etc.). With `conda` this would be:

```shell
conda activate docs
```

Install the documentation dependencies with the following command:

```shell
pip install -r requirements-doc.lock.txt
```

Don't use `-U` in this step; `requirements-doc.lock.txt` is a lock file that pins the exact versions of all the required dependencies.

### Build documentation

Before building, clean your environment first by running:

```shell
make clean
```

Choose from the following two options to build documentation locally:

- Incremental build
- Full build

#### 1. Incremental build with global cache and live rendering

To use this option, you can run:

```shell
make local
```

This option is recommended if you need to make frequent, small, uncomplicated changes like editing text or adding content within existing files. In this approach, Sphinx only builds the changes you made in your branch compared to your last pull from upstream master. The rest of the doc is cached with pre-built doc pages from your last commit from upstream (for every new commit pushed to Ray, CI builds all the documentation pages from that commit and stores them on S3 as a cache).

The build first traces your commit tree to find the latest commit that CI already cached on S3. Once the build finds the commit, it fetches the corresponding cache from S3 and extracts it into the `doc/` directory. Simultaneously, the build tracks all the files that have changed from that commit to the current `HEAD`, including any unstaged changes. Sphinx then rebuilds only the pages that your changes affect, leaving the rest untouched from the cache.

When the build finishes, the doc page automatically opens in your browser. If you make any change in the `doc/` directory, Sphinx automatically rebuilds and reloads your doc page. You can stop it by interrupting with `Ctrl+C`.

For more complicated changes that involve adding or removing files, always run `make develop` first; afterwards you can use `make local` to iterate on the cache that `make develop` produces.

#### 2. Full build from scratch

In the full build option, Sphinx rebuilds all files in the `doc/` directory, ignoring all caches and the saved environment. Because of this behavior, you get a completely clean build, but it's much slower.

```shell
make develop
```

Find the documentation build in the `_build` directory. After the build finishes, you can simply open the `_build/html/index.html` file in your browser. It's considered good practice to check the output of your build to make sure everything is working as expected.

Before committing any changes, make sure you run the [linter](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting) with `pre-commit run` from the `doc` folder, to make sure your changes are formatted correctly.
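Putting the pieces together, a typical first-time build followed by an incremental edit loop might look like the following sketch (the commands are the ones described above; adjust paths to your checkout):

```shell
cd ray/doc

# Start from a clean slate and do one full build to seed the cache.
make clean
make develop

# Inspect the result of the full build.
open _build/html/index.html   # or xdg-open on Linux

# From then on, iterate with the cached incremental build and live reload.
make local

# Before committing, lint your changes from the doc folder.
pre-commit run
```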
### Code completion and other developer tooling If you find yourself working with documentation often, you might find the [esbonio](https://github.com/swyddfa/esbonio) language server to be useful. Esbonio provides context-aware syntax completion, definitions, diagnostics, document links, and other information for RST documents. If you're unfamiliar with [language servers](https://en.wikipedia.org/wiki/Language_Server_Protocol), they are important pieces of a modern developer's toolkit; if you've used `pylance` or `python-lsp-server` before, you'll know how useful these tools can be. Esbonio also provides a vscode extension which includes a live preview. Simply install the `esbonio` vscode extension to start using the tool: ![esbonio](esbonio.png) As an example of Esbonio's autocompletion capabilities, you can type `..` to pull up an autocomplete menu for all RST directives: ![completion](completion.png) Esbonio also can be used with neovim - [see the lspconfig repository for installation instructions](https://github.com/neovim/nvim-lspconfig/blob/master/doc/server_configurations.md#esbonio). ## The basics of our build system The Ray documentation is built using the [`sphinx`](https://www.sphinx-doc.org/) build system. We're using the [PyData Sphinx Theme](https://pydata-sphinx-theme.readthedocs.io/en/stable/) for the documentation. We use [`myst-parser`](https://myst-parser.readthedocs.io/en/latest/) to allow you to write Ray documentation in either Sphinx's native [reStructuredText (rST)](https://www.sphinx-doc.org/en/master/usage/restructuredtext/index.html) or in [Markedly Structured Text (MyST)](https://myst-parser.readthedocs.io/en/latest/). The two formats can be converted to each other, so the choice is up to you. Having said that, it's important to know that MyST is [common markdown compliant](https://myst-parser.readthedocs.io/en/latest/syntax/reference.html#commonmark-block-tokens). Past experience has shown that most developers are familiar with `md` syntax, so if you intend to add a new document, we recommend starting from an `.md` file. The Ray documentation also fully supports executable formats like [Jupyter Notebooks](https://jupyter.org/). Many of our examples are notebooks with [MyST markdown cells](https://myst-nb.readthedocs.io/en/latest/index.html). ## What to contribute? If you take Ray Tune as an example, you can see that our documentation is made up of several types of documentation, all of which you can contribute to: - [a project landing page](https://docs.ray.io/en/latest/tune/index.html), - [a getting started guide](https://docs.ray.io/en/latest/tune/getting-started.html), - [a key concepts page](https://docs.ray.io/en/latest/tune/key-concepts.html), - [user guides for key features](https://docs.ray.io/en/latest/tune/tutorials/overview.html), - [practical examples](https://docs.ray.io/en/latest/tune/examples/index.html), - [a detailed FAQ](https://docs.ray.io/en/latest/tune/faq.html), - [and API references](https://docs.ray.io/en/latest/tune/api/api.html). This structure is reflected in the [Ray documentation source code](https://github.com/ray-project/ray/tree/master/doc/source/tune) as well, so you should have no problem finding what you're looking for. All other Ray projects share a similar structure, but depending on the project there might be minor differences. Each type of documentation listed above has its own purpose, but at the end our documentation comes down to _two types_ of documents: - Markup documents, written in MyST or rST. 
If you don't have a lot of (executable) code to contribute or use more complex features such as [tabbed content blocks](https://docs.ray.io/en/latest/ray-core/walkthrough.html#starting-ray), this is the right choice. Most of the documents in Ray Tune are written in this way, for instance the [key concepts](https://github.com/ray-project/ray/blob/master/doc/source/tune/key-concepts.rst) or [API documentation](https://github.com/ray-project/ray/blob/master/doc/source/tune/api/api.rst). - Notebooks, written in `.ipynb` format. All Tune examples are written as notebooks. These notebooks render in the browser like `.md` or `.rst` files, but have the added benefit that users can easily run the code themselves. ## Fixing typos and improving explanations If you spot a typo in any document, or think that an explanation is not clear enough, please consider opening a pull request. In this scenario, just run the linter as described above and submit your pull request. ## Adding API references We use [Sphinx's autodoc extension](https://www.sphinx-doc.org/en/master/usage/extensions/autodoc.html) to generate our API documentation from our source code. In case we're missing a reference to a function or class, please consider adding it to the respective document in question. For example, here's how you can add a function or class reference using `autofunction` and `autoclass`: ```markdown .. autofunction:: ray.tune.integration.docker.DockerSyncer .. autoclass:: ray.tune.integration.keras.TuneReportCallback ``` The above snippet was taken from the [Tune API documentation](https://github.com/ray-project/ray/blob/master/doc/source/tune/api/integration.rst), which you can look at for reference. If you want to change the content of the API documentation, you will have to edit the respective function or class signatures directly in the source code. For example, in the above `autofunction` call, to change the API reference for `ray.tune.integration.docker.DockerSyncer`, you would have to [change the following source file](https://github.com/ray-project/ray/blob/7f1bacc7dc9caf6d0ec042e39499bbf1d9a7d065/python/ray/tune/integration/docker.py#L15-L38). To show the usage of APIs, it is important to have small usage examples embedded in the API documentation. These should be self-contained and run out of the box, so a user can copy and paste them into a Python interpreter and play around with them (e.g., if applicable, they should point to example data). Users often rely on these examples to build their applications. To learn more about writing examples, read [How to write code snippets](writing-code-snippets). ## Adding code to an `.rST` or `.md` file Modifying text in an existing documentation file is easy, but you need to be careful when it comes to adding code. The reason is that we want to ensure every code snippet on our documentation is tested. This requires us to have a process for including and testing code snippets in documents. To learn how to write testable code snippets, read [How to write code snippets](writing-code-snippets). ```python from ray import train def objective(x, a, b): # Define an objective function. return a * (x ** 0.5) + b def trainable(config): # Pass a "config" dictionary into your trainable. for x in range(20): # "Train" for 20 iterations and compute intermediate scores. score = objective(x, config["a"], config["b"]) train.report({"score": score}) # Send the score to Tune. ``` This code is imported by `literalinclude` from a file called `doc_code/key_concepts.py`. 
Every Python file in the `doc_code` directory will automatically get tested by our CI system, but make sure to run scripts that you change (or new scripts) locally first. You do not need to run the testing framework locally. In rare situations, when you're adding _obvious_ pseudo-code to demonstrate a concept, it is ok to add it literally into your `.rST` or `.md` file, e.g. using a `.. code-cell:: python` directive. But if your code is supposed to run, it needs to be tested. ## Creating a new document from scratch Sometimes you might want to add a completely new document to the Ray documentation, like adding a new user guide or a new example. For this to work, you need to make sure to add the new document explicitly to a parent document's toctree, which determines the structure of the Ray documentation. See [the sphinx documentation](https://www.sphinx-doc.org/en/master/usage/restructuredtext/directives.html#directive-toctree) for more information. Depending on the type of document you're adding, you might also have to make changes to an existing overview page that curates the list of documents in question. For instance, for Ray Tune each user guide is added to the [user guide overview page](https://docs.ray.io/en/latest/tune/tutorials/overview.html) as a panel, and the same goes for [all Tune examples](https://docs.ray.io/en/latest/tune/examples/index.html). Always check the structure of the Ray sub-project whose documentation you're working on to see how to integrate it within the existing structure. In some cases you may be required to choose an image for the panel. Images are located in `doc/source/images`. ## Creating a notebook example To add a new executable example to the Ray documentation, you can start from our [MyST notebook template](https://github.com/ray-project/ray/tree/master/doc/source/_templates/template.md) or [Jupyter notebook template](https://github.com/ray-project/ray/tree/master/doc/source/_templates/template.ipynb). You could also simply download the document you're reading right now (click on the respective download button at the top of this page to get the `.ipynb` file) and start modifying it. All the example notebooks in Ray Tune get automatically tested by our CI system, provided you place them in the [`examples` folder](https://github.com/ray-project/ray/tree/master/doc/source/tune/examples). If you have questions about how to test your notebook when contributing to other Ray sub-projects, please make sure to ask a question in [the Ray community Slack](https://www.ray.io/join-slack) or directly on GitHub, when opening your pull request. To work off of an existing example, you could also have a look at the [Ray Tune Hyperopt example (`.ipynb`)](https://github.com/ray-project/ray/blob/master/doc/source/tune/examples/hyperopt_example.ipynb) or the [Ray Serve guide for text classification (`.md`)](https://github.com/ray-project/ray/blob/master/doc/source/serve/tutorials/text-classification.md). We recommend that you start with an `.md` file and convert your file to an `.ipynb` notebook at the end of the process. We'll walk you through this process below. What makes these notebooks different from other documents is that they combine code and text in one document, and can be launched in the browser. We also make sure they are tested by our CI system, before we add them to our documentation. To make this work, notebooks need to define a _kernel specification_ to tell a notebook server how to interpret and run the code. 
For instance, here's the kernel specification of a Python notebook:

```markdown
---
jupytext:
  text_representation:
    extension: .md
    format_name: myst
kernelspec:
  display_name: Python 3
  language: python
  name: python3
---
```

If you write a notebook in `.md` format, you need this YAML front matter at the top of the file. To add code to your notebook, you can use the `code-cell` directive. Here's an example:

````markdown
```{code-cell} python3
:tags: [hide-cell]

import ray
import ray.rllib.agents.ppo as ppo
from ray import serve

def train_ppo_model():
    trainer = ppo.PPOTrainer(
        config={"framework": "torch", "num_workers": 0},
        env="CartPole-v0",
    )
    # Train for one iteration
    trainer.train()
    trainer.save("/tmp/rllib_checkpoint")
    return "/tmp/rllib_checkpoint/checkpoint_000001/checkpoint-1"

checkpoint_path = train_ppo_model()
```
````

Putting this markdown block into your document will render as follows in the browser:

```python
import ray
import ray.rllib.agents.ppo as ppo
from ray import serve

def train_ppo_model():
    trainer = ppo.PPOTrainer(
        config={"framework": "torch", "num_workers": 0},
        env="CartPole-v0",
    )
    # Train for one iteration
    trainer.train()
    trainer.save("/tmp/rllib_checkpoint")
    return "/tmp/rllib_checkpoint/checkpoint_000001/checkpoint-1"

checkpoint_path = train_ppo_model()
```

### Tags for your notebook

What makes this work is the `:tags: [hide-cell]` directive in the `code-cell`. The reason we suggest starting with `.md` files is that it's much easier to add tags to them, as you've just seen. You can also add tags to `.ipynb` files, but you'll need to start a notebook server for that first, which you may not want to do just to contribute a piece of documentation.

Apart from `hide-cell`, you also have `hide-input` and `hide-output` tags that hide the input and output of a cell. Also, if you need code that gets executed in the notebook, but you don't want to show it in the documentation, you can use the `remove-cell`, `remove-input`, and `remove-output` tags in the same way.

### Reference section labels

[Reference section labels](https://jupyterbook.org/en/stable/content/references.html#reference-section-labels) are a way to link to specific parts of the documentation from within a notebook. Creating one inside a markdown cell is simple:

```markdown
(my-label)=
# The thing to label
```

Then, you can link to it in `.rst` files with the following syntax:

```rst
See {ref}`the thing that I labeled <my-label>` for more information.
```

### Testing notebooks

Removing cells can be particularly interesting for compute-intensive notebooks. We want you to contribute notebooks that use _realistic_ values, not just toy examples. At the same time we want our notebooks to be tested by our CI system, and running them should not take too long. What you can do to address this is to have notebook cells with the parameters you want the users to see first:

````markdown
```{code-cell} python3
num_workers = 8
num_gpus = 2
```
````

which will render as follows in the browser:

```python
num_workers = 8
num_gpus = 2
```

But then in your notebook you follow that up with a _removed_ cell that won't get rendered, but has much smaller values and makes the notebook run faster:

````markdown
```{code-cell} python3
:tags: [remove-cell]

num_workers = 0
num_gpus = 0
```
````

### Converting markdown notebooks to ipynb

Once you're finished writing your example, you can convert it to an `.ipynb` notebook using `jupytext`:

```shell
jupytext your-example.md --to ipynb
```

In the same way, you can convert `.ipynb` notebooks to `.md` notebooks with `--to myst`.
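For example, assuming your file is named `your-example.ipynb`, converting it back to a MyST markdown notebook looks like this:

```shell
# Convert an .ipynb notebook back into a MyST markdown notebook.
jupytext your-example.ipynb --to myst
```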
And if you want to convert your notebook to a Python file, e.g. to test if your whole script runs without errors, you can use `--to py` instead. (vale)= ## How to use Vale ### What is Vale? [Vale](https://vale.sh/) checks if your writing adheres to the [Google developer documentation style guide](https://developers.google.com/style). It's only enforced on the Ray Data documentation. Vale catches typos and grammatical errors. It also enforces stylistic rules like “use contractions” and “use second person.” For the full list of rules, see the [configuration in the Ray repository](https://github.com/ray-project/ray/tree/master/.vale/styles/Google). ### How do you run Vale? #### How to use the VS Code extension 1. Install Vale. If you use macOS, use Homebrew. ```bash brew install vale ``` Otherwise, use PyPI. ```bash pip install vale ``` For more information on installation, see the [Vale documentation](https://vale.sh/docs/vale-cli/installation/). 2. Install the Vale VS Code extension by following these [installation instructions](https://marketplace.visualstudio.com/items?itemName=ChrisChinchilla.vale-vscode). 3. VS Code should show warnings in your code editor and in the “Problems” panel. ![Vale](../images/vale.png) #### How to run Vale on the command-line 1. Install Vale. If you use macOS, use Homebrew. ```bash brew install vale ``` Otherwise, use PyPI. ```bash pip install vale ``` For more information on installation, see the [Vale documentation](https://vale.sh/docs/vale-cli/installation/). 2. Run Vale in your terminal ```bash vale doc/source/data/overview.rst ``` 3. Vale should show warnings in your terminal. ``` ❯ vale doc/source/data/overview.rst doc/source/data/overview.rst 18:1 warning Try to avoid using Google.We first-person plural like 'We'. 18:46 error Did you really mean Vale.Spelling 'distrbuted'? 24:10 suggestion In general, use active voice Google.Passive instead of passive voice ('is built'). 28:14 warning Use 'doesn't' instead of 'does Google.Contractions not'. ✖ 1 error, 2 warnings and 1 suggestion in 1 file. ``` ### How to handle false Vale.Spelling errors To add custom terminology, complete the following steps: 1. If it doesn’t already exist, create a directory for your team in `.vale/styles/Vocab`. For example, `.vale/styles/Vocab/Data`. 2. If it doesn’t already exist, create a text file named `accept.txt`. For example, `.vale/styles/Vocab/Data/accept.txt`. 3. Add your term to `accept.txt`. Vale accepts Regex. For more information, see [Vocabularies](https://vale.sh/docs/topics/vocab/) in the Vale documentation. ### How to handle false Google.WordList errors Vale errors if you use a word that isn't on [Google's word list](https://developers.google.com/style/word-list). ``` 304:52 error Use 'select' instead of Google.WordList 'check'. ``` If you want to use the word anyway, modify the appropriate field in the [WordList configuration](https://github.com/ray-project/ray/blob/81c169bde2414fe4237f3d2f05fc76fccfd52dee/.vale/styles/Google/WordList.yml#L41). ## Troubleshooting If you run into a problem building the docs, following these steps can help isolate or eliminate most issues: 1. **Clean out build artifacts.** Use `make clean` to clean out docs build artifacts in the working directory. Sphinx uses caching to avoid doing work, and this sometimes causes problems. This is particularly true if you build the docs, then `git pull origin master` to pull in recent changes, and then try to build docs again. 2. 
2. **Check your environment.** Use `pip list` to check the installed dependencies. Compare them to `doc/requirements-doc.txt`. The documentation build system doesn't have the same dependency requirements as Ray. You don't need to run ML models or execute code on distributed systems in order to build the docs. In fact, it's best to use a completely separate docs build environment from the environment you use to run Ray to avoid dependency conflicts. When installing requirements, do `pip install -r doc/requirements-doc.txt`. Don't use `-U` because you don't want to upgrade any dependencies during the installation.

3. **Ensure a modern version of Python.** The docs build system doesn't keep the same dependency and Python version requirements as Ray. Use a modern version of Python when building docs. Newer versions of Python can be substantially faster than preceding versions. Consult the latest Python version support information.

4. **Enable breakpoints in Sphinx.** Add `-P` to the `SPHINXOPTS` in `doc/Makefile` to tell `sphinx` to stop when it encounters a breakpoint, and remove `-j auto` to disable parallel builds. Now you can put breakpoints in the modules you're trying to import, or in `sphinx` code itself, which can help isolate stubborn build issues.

5. **[Incremental build] Side navigation bar doesn't reflect new pages.** If you are adding new pages, they should always show up in the side navigation bar on index pages. However, incremental builds with `make local` skip rebuilding many other pages, so Sphinx doesn't update the side navigation bar on those pages. To build docs with the correct side navigation bar on all pages, consider using `make develop`.

## Where to go from here?

There are many ways to contribute to Ray other than documentation. See [our contributor guide](./getting-involved.rst) for more information.

---

(runtime-env-auth)=
# Authenticating Remote URIs in runtime_env

This section helps you:

* Avoid leaking remote URI credentials in your `runtime_env`
* Provide credentials safely in KubeRay
* Understand best practices for authenticating your remote URI

## Authenticating Remote URIs

You can add dependencies to your `runtime_env` with [remote URIs](remote-uris). This is straightforward for files hosted publicly, because you simply paste the public URI into your `runtime_env`:

```python
runtime_env = {"working_dir": (
        "https://github.com/"
        "username/repo/archive/refs/heads/master.zip"
    )
}
```

However, dependencies hosted privately, in a private GitHub repo for example, require authentication. One common way to authenticate is to insert credentials into the URI itself:

```python
runtime_env = {"working_dir": (
        "https://username:personal_access_token@github.com/"
        "username/repo/archive/refs/heads/master.zip"
    )
}
```

In this example, `personal_access_token` is a secret credential that authenticates this URI. While Ray can successfully access your dependencies using authenticated URIs, **you should not include secret credentials in your URIs** for two reasons:

1. Ray may log the URIs used in your `runtime_env`, which means the Ray logs could contain your credentials.

2. Ray stores your remote dependency package in a local directory, and it uses a parsed version of the remote URI (including your credential) as the directory's name.

In short, your remote URI is not treated as a secret, so it should not contain secret info. Instead, use a `netrc` file.
## Running on VMs: the netrc File The [netrc file](https://www.gnu.org/software/inetutils/manual/html_node/The-_002enetrc-file.html) contains credentials that Ray uses to automatically log into remote servers. Set your credentials in this file instead of in the remote URI: ```bash # "$HOME/.netrc" machine github.com login username password personal_access_token ``` In this example, the `machine github.com` line specifies that any access to `github.com` should be authenticated using the provided `login` and `password`. :::{note} On Unix, name the `netrc` file as `.netrc`. On Windows, name the file as `_netrc`. ::: The `netrc` file requires owner read/write access, so make sure to run the `chmod` command after creating the file: ```bash chmod 600 "$HOME/.netrc" ``` Add the `netrc` file to your VM container's home directory, so Ray can access the `runtime_env`'s private remote URIs, even when they don't contain credentials. ## Running on KubeRay: Secrets with netrc [KubeRay](kuberay-index) can also obtain credentials from a `netrc` file for remote URIs. Supply your `netrc` file using a Kubernetes secret and a Kubernetes volume with these steps: 1\. Launch your Kubernetes cluster. 2\. Create the `netrc` file locally in your home directory. 3\. Store the `netrc` file's contents as a Kubernetes secret on your cluster: ```bash kubectl create secret generic netrc-secret --from-file=.netrc="$HOME/.netrc" ``` 4\. Expose the secret to your KubeRay application using a mounted volume, and update the `NETRC` environment variable to point to the `netrc` file. Include the following YAML in your KubeRay config. ```yaml headGroupSpec: ... containers: - name: ... image: rayproject/ray:latest ... volumeMounts: - mountPath: "/home/ray/netrcvolume/" name: netrc-kuberay readOnly: true env: - name: NETRC value: "/home/ray/netrcvolume/.netrc" volumes: - name: netrc-kuberay secret: secretName: netrc-secret workerGroupSpecs: ... containers: - name: ... image: rayproject/ray:latest ... volumeMounts: - mountPath: "/home/ray/netrcvolume/" name: netrc-kuberay readOnly: true env: - name: NETRC value: "/home/ray/netrcvolume/.netrc" volumes: - name: netrc-kuberay secret: secretName: netrc-secret ``` 5\. Apply your KubeRay config. Your KubeRay application can use the `netrc` file to access private remote URIs, even when they don't contain credentials. --- --- description: "Learn about using labels to control how Ray schedules tasks, actors, and placement groups to nodes in your Kubernetes cluster." --- (labels)= # Use labels to control scheduling In Ray version 2.49.0 and above, you can use labels to control scheduling for KubeRay. Labels are a beta feature. This page provides a conceptual overview and usage instructions for labels. Labels are key-value pairs that provide a human-readable configuration for users to control how Ray schedules tasks, actors, and placement group bundles to specific nodes. ```{note} Ray labels share the same syntax and formatting restrictions as Kubernetes labels, but are conceptually distinct. See the [Kubernetes docs on labels and selectors](https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/#syntax-and-character-set). ``` ## How do labels work? The following is a high-level overview of how you use labels to control scheduling: - Ray sets default labels that describe the underlying compute. See [](defaults). - You define custom labels as key-value pairs. See [](custom). - You specify *label selectors* in your Ray code to define label requirements. 
You can specify these requirements at the task, actor, or placement group bundle level. See [](label-selectors).
- Ray schedules tasks, actors, or placement group bundles based on the specified label selectors.
- In Ray 2.50.0 and above, if you're using a dynamic cluster with autoscaler V2 enabled, the cluster scales up to add new nodes from a designated worker group to fulfill label requirements.

(defaults)=
## Default node labels

```{note}
Ray reserves all labels under the `ray.io` namespace.
```

During cluster initialization or as autoscaling events add nodes to your cluster, Ray assigns the following default labels to each node:

| Label | Description |
| --- | --- |
| `ray.io/node-id` | A unique ID generated for the node. |
| `ray.io/accelerator-type` | The accelerator type of the node, for example `L4`. CPU-only machines don't have this label. See {ref}`accelerator types ` for a mapping of values. |

```{note}
You can override default values using `ray start` parameters.
```

The following is an example of a default label:

```python
"ray.io/accelerator-type": "L4"  # Default label indicating the machine has an Nvidia L4 GPU
```

(custom)=
## Define custom labels

You can add custom labels to your nodes using the `--labels` or `--labels-file` parameter when running `ray start`.

```bash
# Example 1: Start a head node with cpu-family and test-label labels
ray start --head --labels="cpu-family=amd,test-label=test-value"

# Example 2: Start a head node with labels from a label file
ray start --head --labels-file='./test-labels-file'
# The file content can be the following (should be a valid YAML file):
# "test-label": "test-value"
# "test-label-2": "test-value-2"
```

```{note}
You can't set labels using `ray.init()`. Local Ray clusters don't support labels.
```

(label-selectors)=
## Specify label selectors

You add label selector logic to your Ray code when defining Ray tasks, actors, or placement group bundles. Label selectors define the label requirements for matching your Ray code to a node in your Ray cluster.

Label selectors specify the following:

- The key of the label.
- Operator logic for matching.
- The value or values to match on.

The following table shows the basic syntax for label selector operator logic:

| Operator | Description | Example syntax |
| --- | --- | --- |
| Equals | Label matches exactly one value. | `{"key": "value"}` |
| Not equal | Label matches any value other than the specified one. | `{"key": "!value"}` |
| In | Label matches one of the provided values. | `{"key": "in(val1,val2)"}` |
| Not in | Label matches none of the provided values. | `{"key": "!in(val1,val2)"}` |

You can specify one or more label selectors as a dict. When specifying multiple label selectors, the candidate node must meet all requirements.
The following example configuration uses a custom label to require an `m5.16xlarge` EC2 instance and a default label to require the node ID to be 123:

```python
label_selector={"instance_type": "m5.16xlarge", "ray.io/node-id": "123"}
```

## Specify label requirements for tasks and actors

Use the following syntax to add label selectors to tasks and actors:

```python
# An example of specifying label_selector in a task's @ray.remote annotation
@ray.remote(label_selector={"label_name": "label_value"})
def f():
    pass

# An example of specifying label_selector in an actor's @ray.remote annotation
@ray.remote(label_selector={"ray.io/accelerator-type": "H100"})
class Actor:
    pass

# An example of specifying label_selector in a task's options
@ray.remote
def test_task_label_in_options():
    pass

test_task_label_in_options.options(label_selector={"test-label-key": "test-label-value"}).remote()

# An example of specifying label_selector in an actor's options
@ray.remote
class Actor:
    pass

actor_1 = Actor.options(
    label_selector={"ray.io/accelerator-type": "H100"},
).remote()
```

## Specify label requirements for placement group bundles

Use the `bundle_label_selector` option to add label selectors to placement group bundles. See the following examples:

```python
# All bundles require the same labels:
ray.util.placement_group(
    bundles=[{"GPU": 1}, {"GPU": 1}],
    bundle_label_selector=[{"ray.io/accelerator-type": "H100"}] * 2,
)

# Bundles require different labels:
ray.util.placement_group(
    bundles=[{"CPU": 1}] + [{"GPU": 1}] * 2,
    bundle_label_selector=[{"ray.io/market-type": "spot"}] + [{"ray.io/accelerator-type": "H100"}] * 2
)
```

## Using labels with the autoscaler

Autoscaler V2 supports label-based scheduling. To enable the autoscaler to scale up nodes to fulfill label requirements, you need to create multiple worker groups for different label requirement combinations and specify all the corresponding labels in the `rayStartParams` field in the Ray cluster configuration. For example:

```yaml
rayStartParams:
  labels: "region=me-central1,ray.io/accelerator-type=H100"
```

## Monitor nodes using labels

The Ray dashboard automatically shows the following information:

- Labels for each node. See {py:attr}`ray.util.state.common.NodeState.labels`.
- Label selectors set for each task, actor, or placement group bundle. See {py:attr}`ray.util.state.common.TaskState.label_selector` and {py:attr}`ray.util.state.common.ActorState.label_selector`.

Within a task, you can programmatically obtain the node labels from the RuntimeContext API using `ray.get_runtime_context().get_node_labels()`. This returns a Python dict. See the following example:

```python
@ray.remote
def test_task_label():
    node_labels = ray.get_runtime_context().get_node_labels()
    print(f"[test_task_label] node labels: {node_labels}")

"""
Example output:
(test_task_label pid=68487) [test_task_label] node labels: {'test-label-1': 'test-value-1', 'test-label-key': 'test-label-value', 'test-label-2': 'test-value-2'}
"""
```

You can also access node label and label selector information using the state API and state CLI.

---

(core-type-hint)=
# Type hints in Ray

As of Ray 2.48, Ray provides comprehensive support for Python type hints with both remote functions and actors. This enables better IDE support, static type checking, and improved code maintainability in distributed Ray applications.

## Overview

In most cases, Ray applications can use type hints without any modifications to existing code.
Ray automatically handles type inference for standard remote functions and basic actor usage patterns.

For example, remote functions support standard Python type annotations without additional configuration. The `@ray.remote` decorator preserves the original function signature and type information.

```python
import ray

@ray.remote
def add_numbers(x: int, y: int) -> int:
    return x + y

# Type hints work seamlessly with remote function calls
a = add_numbers.remote(5, 3)
print(ray.get(a))
```

However, certain patterns, especially when working with actors, require specific approaches to ensure proper type annotation.

## Pattern 1: Use `ray.remote` as a function to build an actor

Use the `ray.remote` function directly to create an actor class, instead of using the `@ray.remote` decorator. This preserves the original class type and allows type inference to work correctly. For example, in this case, the original class type is `DemoRay`, and the actor class type is `ActorClass[DemoRay]`.

```python
import ray
from ray.actor import ActorClass

class DemoRay:
    def __init__(self, init: int):
        self.init = init

    @ray.method
    def calculate(self, v1: int, v2: int) -> int:
        return self.init + v1 + v2

ActorDemoRay: ActorClass[DemoRay] = ray.remote(DemoRay)
# DemoRay is the original class type, ActorDemoRay is the ActorClass[DemoRay] type
```

After creating the `ActorClass[DemoRay]` type, you can use it to instantiate an actor by calling `ActorDemoRay.remote(1)`. This call returns an `ActorProxy[DemoRay]` type, which represents an actor handle. The handle provides type hints for the actor methods, including their arguments and return types.

```python
from ray import ObjectRef
from ray.actor import ActorProxy

actor: ActorProxy[DemoRay] = ActorDemoRay.remote(1)

def func(actor: ActorProxy[DemoRay]) -> int:
    b: ObjectRef[int] = actor.calculate.remote(1, 2)
    return ray.get(b)

a = func(actor)
print(a)
```

**Why do we need to do this?**

In Ray, the `@ray.remote` decorator indicates that instances of the class `T` are actors, with each actor running in its own Python process. However, the `@ray.remote` decorator transforms the class `T` into an `ActorClass[T]` type, which is not the original class type. Unfortunately, IDEs and static type checkers aren't able to infer the original type `T` from the `ActorClass[T]`. To solve this problem, calling `ray.remote(T)` explicitly returns a new generic `ActorClass[T]` type while preserving the original class type.

## Pattern 2: Use the `@ray.method` decorator for remote methods

Add the `@ray.method` decorator to the actor methods to obtain type hints for the remote methods of the actor through the `ActorProxy[T]` type, including their arguments and return types.

```python
import ray
from ray import ObjectRef
from ray.actor import ActorClass, ActorProxy

class DemoRay:
    def __init__(self, init: int):
        self.init = init

    @ray.method
    def calculate(self, v1: int, v2: int) -> int:
        return self.init + v1 + v2

ActorDemoRay: ActorClass[DemoRay] = ray.remote(DemoRay)

actor: ActorProxy[DemoRay] = ActorDemoRay.remote(1)

# IDEs can correctly list the remote methods of the actor
# and provide type hints for the arguments and return values of the remote methods.
a: ObjectRef[int] = actor.calculate.remote(1, 2)
print(ray.get(a))
```

:::{note}
We would love to make the typing of remote methods work without the `@ray.method` decorator. If any community member has an idea, we welcome PRs.
::: --- --- orphan: true --- # Distributed Data Processing in Data-Juicer Data-Juicer supports large-scale distributed data processing based on [Ray](https://github.com/ray-project/ray) and [Platform for AI](https://www.alibabacloud.com/en/product/machine-learning) of Alibaba Cloud. With a dedicated design, you can seamlessly execute almost all operators that Data-Juicer implements in standalone mode, in Ray distributed mode. The Data-Juicer team continuously conducts engine-specific optimizations for large-scale scenarios, such as data subset splitting strategies that balance the number of files and workers, and streaming I/O patches for JSON files to Ray and Apache Arrow. For reference, in experiments with 25 to 100 Alibaba Cloud nodes, Data-Juicer in Ray mode processes datasets containing 70 billion samples on 6400 CPU cores in 2 hours and 7 billion samples on 3200 CPU cores in 0.45 hours. Additionally, a MinHash-LSH-based deduplication operator in Ray mode can deduplicate terabyte-sized datasets on 8 nodes with 1280 CPU cores in 3 hours. See the [Data-Juicer 2.0: Cloud-Scale Adaptive Data Processing for Foundation Models](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/data_juicer/DJ2.0_arXiv_preview.pdf) paper for more details. ## Implementation and optimizations ### Ray mode in Data-Juicer - For most implementations of Data-Juicer [operators](https://github.com/modelscope/data-juicer/blob/main/docs/Operators.md), the core processing functions are engine-agnostic. Operator interoperability is managed primarily in [RayDataset](https://github.com/modelscope/data-juicer/blob/main/data_juicer/core/data/ray_dataset.py) and [RayExecutor](https://github.com/modelscope/data-juicer/blob/main/data_juicer/core/executor/ray_executor.py), which are subclasses of the base `DJDataset` and `BaseExecutor`, respectively, and support both Ray [Tasks](ray-remote-functions) and [Actors](actor-guide). - The exception is the deduplication operators, which are challenging to scale in standalone mode. The names of these operators follow the pattern of [`ray_xx_deduplicator`](https://github.com/modelscope/data-juicer/blob/main/data_juicer/ops/deduplicator/). ### Subset splitting When a cluster has tens of thousands of nodes but only a few dataset files, Ray splits the dataset files according to available resources and distributes the blocks across all nodes, incurring high network communication costs and reducing CPU utilization. For more details, see [Ray's `_autodetect_parallelism` function](https://github.com/ray-project/ray/blob/2dbd08a46f7f08ea614d8dd20fd0bca5682a3078/python/ray/data/_internal/util.py#L201-L205) and [tuning output blocks for Ray](read_output_blocks). This default execution plan can be quite inefficient especially for scenarios with a large number of nodes. To optimize performance for such cases, Data-Juicer automatically splits the original dataset into smaller files in advance, taking into consideration the features of Ray and Arrow. When you encounter such performance issues, you can use this feature or split the dataset according to your own preferences. In this auto-split strategy, the single file size is about 128 MB, and the result should ensure that the number of sub-files after splitting is at least twice the total number of CPU cores available in the cluster. ### Streaming reading of JSON files Streaming reading of JSON files is a common requirement in data processing for foundation models, as many datasets are in JSONL format and large in size. 
However, the current implementation in Ray Datasets, which depends on the underlying Arrow library (up to Ray version 2.40 and Arrow version 18.1.0), doesn't support streaming reading of JSON files.

To address the lack of native support for streaming JSON data, the Data-Juicer team developed a streaming loading interface and contributed an in-house [patch](https://github.com/modelscope/data-juicer/pull/515) for Apache Arrow ([PR to the repository](https://github.com/apache/arrow/pull/45084)). This patch helps alleviate out-of-memory issues. With this patch, Data-Juicer in Ray mode uses the streaming loading interface to load JSON files by default. In addition, streaming-read support for CSV and Parquet files is already enabled.

### Deduplication

Data-Juicer provides an optimized MinHash-LSH-based deduplication operator in Ray mode. It uses a multiprocessing union-find set in Ray Actors and a load-balanced distributed algorithm, [BTS](https://ieeexplore.ieee.org/document/10598116), to complete equivalence class merging. This operator can deduplicate terabyte-sized datasets on 1280 CPU cores in 3 hours. The Data-Juicer team's ablation study shows 2x to 3x speedups with their dedicated optimizations for Ray mode compared to the vanilla version of this deduplication operator.

## Performance results

### Data Processing with Varied Scales

The Data-Juicer team conducted experiments on datasets with billions of samples. They prepared a 560k-sample multimodal dataset and expanded it by different factors (1x to 125000x) to create datasets of varying sizes. The experimental results demonstrate good scalability.

### Distributed Deduplication on Large-Scale Datasets

The Data-Juicer team tested the MinHash-based RayDeduplicator on datasets sized at 200 GB, 1 TB, and 5 TB, using CPU counts ranging from 640 to 1280 cores. As the table below shows, when the data size increases by 5x, the processing time increases by 4.02x to 5.62x. When the number of CPU cores doubles, the processing time decreases to 58.9% to 67.1% of the original time.

| # CPU   | 200 GB Time | 1 TB Time | 5 TB Time  |
|---------|-------------|-----------|------------|
| 4 * 160 | 11.13 min   | 50.83 min | 285.43 min |
| 8 * 160 | 7.47 min    | 30.08 min | 168.10 min |

## Quick Start

Before starting, you should install Data-Juicer and its `dist` requirements:

```shell
pip install -v -e .           # Install the minimal requirements of Data-Juicer
pip install -v -e ".[dist]"   # Include dependencies on Ray and other distributed libraries
```

Then start a Ray cluster (refer to the [Ray doc](start-ray) for more details):

```shell
# Start a cluster as the head node
ray start --head

# (Optional) Connect to the cluster on other nodes/machines.
ray start --address='{head_ip}:6379'
```

Data-Juicer provides simple demos in the directory `demos/process_on_ray/`, which includes two config files and two test datasets.

```text
demos/process_on_ray
├── configs
│   ├── demo.yaml
│   └── dedup.yaml
└── data
    ├── demo-dataset.json
    └── demo-dataset.jsonl
```

> **Important:**
> If you run these demos on multiple nodes, you need to put the demo dataset on a shared disk (for example, network-attached storage) and export the result dataset to it as well, by modifying the `dataset_path` and `export_path` in the config files.

### Running Example of Ray Mode

The `demo.yaml` config file sets the executor type to "ray" and specifies an automatic Ray address.

```yaml
...
dataset_path: './demos/process_on_ray/data/demo-dataset.jsonl'
export_path: './outputs/demo/demo-processed'

executor_type: 'ray'    # Set the executor type to "ray"
ray_address: 'auto'     # Set an automatic Ray address
...
```

Run the demo to process the dataset with 12 regular OPs:

```shell
# Run the tool from source
python tools/process_data.py --config demos/process_on_ray/configs/demo.yaml

# Use the command-line tool
dj-process --config demos/process_on_ray/configs/demo.yaml
```

Data-Juicer processes the demo dataset with the demo config file and exports the result datasets to the directory specified by the `export_path` argument in the config file.

### Running Example of Distributed Deduplication

The `dedup.yaml` config file sets the executor type to "ray" and specifies an automatic Ray address. It also uses a dedicated distributed version of the MinHash deduplication operator to deduplicate the dataset.

```yaml
project_name: 'demo-dedup'
dataset_path: './demos/process_on_ray/data/'
export_path: './outputs/demo-dedup/demo-ray-bts-dedup-processed'

executor_type: 'ray'    # Set the executor type to "ray"
ray_address: 'auto'     # Set an automatic Ray address

# process schedule
# a list of several process operators with their arguments
process:
  - ray_bts_minhash_deduplicator:    # a distributed version of minhash deduplicator
      tokenization: 'character'
```

Run the demo to deduplicate the dataset:

```shell
# Run the tool from source
python tools/process_data.py --config demos/process_on_ray/configs/dedup.yaml

# Use the command-line tool
dj-process --config demos/process_on_ray/configs/dedup.yaml
```

Data-Juicer deduplicates the demo dataset with the demo config file and exports the result datasets to the directory specified by the `export_path` argument in the config file.

---

(observability)=
# Monitoring and Debugging

```{toctree}
:hidden:

getting-started
ray-distributed-debugger
key-concepts
User Guides
Reference
```

This section covers how to **monitor and debug Ray applications and clusters** with Ray's observability features.

## What is observability

In general, observability is a measure of how well the internal states of a system can be inferred from knowledge of its external outputs. In Ray's context, observability refers to the ability for users to observe and infer the internal states of Ray applications and Ray clusters from external outputs, such as logs, metrics, and events.

![what is ray's observability](./images/what-is-ray-observability.png)

## Importance of observability

Debugging a distributed system can be challenging because of its scale and complexity. Good observability makes it easier for Ray users to monitor and debug their Ray applications and clusters.

![Importance of observability](./images/importance-of-observability.png)

## Monitoring and debugging workflow and tools

Monitoring and debugging Ray applications consists of four major steps:

1. Monitor the clusters and applications.
2. Identify the surfaced problems or errors.
3. Debug with various tools and data.
4. Form a hypothesis, implement a fix, and validate it.

The remainder of this section covers the observability tools that Ray provides to accelerate your monitoring and debugging workflow.

---

(observability-reference)=
# Reference

```{toctree}
:hidden:

api
cli
system-metrics
```

Monitor and debug your Ray applications and clusters using the API and CLI documented in these references.
The guides include: * {ref}`state-api-ref` * {ref}`state-api-cli-ref` * {ref}`system-metrics` --- (configure-logging)= # Configuring Logging This guide helps you understand and modify the configuration of Ray's logging system. (logging-directory)= ## Logging directory By default, Ray stores the log files in a `/tmp/ray/session_*/logs` directory. View the {ref}`log files in logging directory ` below to understand how Ray organizes the log files within the logs folder. :::{note} For Linux and macOS, Ray uses ``/tmp/ray`` as the default temp directory. To change the temp and the logging directory, specify it when you call ``ray start`` or ``ray.init()``. ::: A new Ray session creates a new folder to the temp directory. Ray symlinks the latest session folder to `/tmp/ray/session_latest`. Here is an example temp directory: ``` ├── tmp/ray │ ├── session_latest │ │ ├── logs │ │ ├── ... │ ├── session_2023-05-14_21-19-58_128000_45083 │ │ ├── logs │ │ ├── ... │ ├── session_2023-05-15_21-54-19_361265_24281 │ ├── ... ``` Usually, Ray clears up the temp directories whenever the machines reboot. As a result, log files may get lost whenever your cluster or some of the nodes are stopped. If you need to inspect logs after the clusters stop, you need to store and persist the logs. See the instructions for how to process and export logs for {ref}`Log persistence ` and {ref}`KubeRay Clusters `. (logging-directory-structure)= ## Log files in logging directory Below are the log files in the logging directory. Broadly speaking, two types of log files exist: system log files and application log files. Note that ``.out`` logs are from stdout/stderr and ``.err`` logs are from stderr. Ray doesn't guarantee the backward compatibility of log directories. :::{note} System logs may include information about your applications. For example, ``runtime_env_setup-[job_id].log`` may include information about your application's environment and dependency. ::: ### Application logs - ``job-driver-[submission_id].log``: The stdout of a job submitted with the {ref}`Ray Jobs API `. - ``worker-[worker_id]-[job_id]-[pid].[out|err]``: Python or Java part of Ray drivers and workers. Ray streams all stdout and stderr from Tasks or Actors to these files. Note that job_id is the ID of the driver. ### System/component logs - ``dashboard.[log|out|err]``: A log file of a Ray Dashboard. ``.log`` files contain logs generated from the dashboard's logger. ``.out`` and ``.err`` files contain stdout and stderr printed from the dashboard respectively. They're usually empty except when the dashboard crashes unexpectedly. - ``dashboard_agent.[log|out|err]``: Every Ray node has one dashboard agent. ``.log`` files contain logs generated from the dashboard agent's logger. ``.out`` and ``.err`` files contain stdout and stderr printed from the dashboard agent respectively. They're usually empty except when the dashboard agent crashes unexpectedly. - ``dashboard_[module_name].[log|out|err]``: The log files for the Ray Dashboard child processes, one per each module. ``.log`` files contain logs generated from the module's logger. ``.out`` and ``.err`` files contain stdout and stderr printed from the module respectively. They're usually empty except when the module crashes unexpectedly. - ``gcs_server.[out|err]``: The GCS server is a stateless server that manages Ray cluster metadata. It exists only in the head node. - ``io-worker-[worker_id]-[pid].[out|err]``: Ray creates IO workers to spill/restore objects to external storage by default from Ray 1.3+. 
This is a log file of IO workers. - ``log_monitor.[log|out|err]``: The log monitor is in charge of streaming logs to the driver. ``.log`` files contain logs generated from the log monitor's logger. ``.out`` and ``.err`` files contain the stdout and stderr printed from the log monitor respectively. They're usually empty except when the log monitor crashes unexpectedly. - ``monitor.[log|out|err]``: Log files of the Autoscaler. ``.log`` files contain logs generated from the autoscaler's logger. ``.out`` and ``.err`` files contain stdout and stderr printed from the autoscaler respectively. They're usually empty except when the autoscaler crashes unexpectedly. - ``python-core-driver-[worker_id]_[pid].log``: Ray drivers consist of C++ core and a Python or Java frontend. C++ code generates this log file. - ``python-core-worker-[worker_id]_[pid].log``: Ray workers consist of C++ core and a Python or Java frontend. C++ code generates this log file. - ``raylet.[out|err]``: A log file of raylets. - ``runtime_env_agent.[log|out|err]``: Every Ray node has one agent that manages {ref}`Runtime Environment ` creation, deletion, and caching. ``.log`` files contain logs generated from the runtime env agent's logger. ``.out`` and ``.err`` files contain stdout and stderr printed from the runtime env agent respectively. They're usually empty except when the runtime env agent crashes unexpectedly. The logs of the actual installations for ``pip install`` logs are in the following ``runtime_env_setup-[job_id].log`` file. - ``runtime_env_setup-ray_client_server_[port].log``: Logs from installing {ref}`Runtime Environments ` for a job when connecting with {ref}`Ray Client `. - ``runtime_env_setup-[job_id].log``: Logs from installing {ref}`runtime environments ` for a Task, Actor, or Job. This file is only present if you install a runtime environment. (log-redirection-to-driver)= ## Redirecting Worker logs to the Driver By default, Worker stdout and stderr for Tasks and Actors stream to the Ray Driver (the entrypoint script that calls ``ray.init``). It helps users aggregate the logs for the distributed Ray application in a single place. ```{literalinclude} ../doc_code/app_logging.py ``` Ray prints all stdout emitted from the ``print`` method to the driver with a ``(Task or Actor repr, process ID, IP address)`` prefix. ``` bash (pid=45601) task (Actor pid=480956) actor ``` ### Customizing prefixes for Actor logs It's often useful to distinguish between log messages from different Actors. For example, if you have a large number of worker Actors, you may want to easily see the index of the Actor that logged a particular message. Define the `__repr__ `__ method for the Actor class to replace the Actor name with the Actor repr. For example: ```{literalinclude} /ray-core/doc_code/actor-repr.py ``` The resulting output follows: ```bash (MyActor(index=2) pid=482120) hello there (MyActor(index=1) pid=482119) hello there ``` ### Coloring Actor log prefixes By default, Ray prints Actor log prefixes in light blue. Turn color logging off by setting the environment variable ``RAY_COLOR_PREFIX=0`` - for example, when outputting logs to a file or other location that doesn't support ANSI codes. Or activate multi-color prefixes by setting the environment variable ``RAY_COLOR_PREFIX=1``; this indexes into an array of colors modulo the PID of each process. 
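For example, here's a minimal sketch of toggling prefix coloring from the shell when launching a driver script; the script name `my_driver.py` is a placeholder, and the environment variables are the ones described above:

```bash
# Turn prefix coloring off, for example when redirecting logs to a file
# or another destination that doesn't understand ANSI codes.
RAY_COLOR_PREFIX=0 python my_driver.py

# Use multiple prefix colors, indexed by each worker process's PID.
RAY_COLOR_PREFIX=1 python my_driver.py
```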
![coloring-actor-log-prefixes](../images/coloring-actor-log-prefixes.png) ### Disable logging to the driver In large scale runs, you may not want to route all worker logs to the driver. Disable this feature by setting ``log_to_driver=False`` in `ray.init`: ```python import ray # Task and Actor logs are not copied to the driver stdout. ray.init(log_to_driver=False) ``` ## Log deduplication By default, Ray deduplicates logs that appear redundantly across multiple processes. The first instance of each log message is always immediately printed. However, Ray buffers subsequent log messages of the same pattern for up to five seconds and prints them in batch. Note that Ray also ignores words with numeric components. For example, for the following code snippet: ```python import ray import random @ray.remote def task(): print("Hello there, I am a task", random.random()) ray.get([task.remote() for _ in range(100)]) ``` The output is as follows: ```bash 2023-03-27 15:08:34,195 INFO worker.py:1603 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265 (task pid=534172) Hello there, I am a task 0.20583517821231412 (task pid=534174) Hello there, I am a task 0.17536720316370757 [repeated 99x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication) ``` This feature is useful when importing libraries such as `tensorflow` or `numpy`, which may emit many verbose warning messages when you import them. Configure the following environment variables on the driver process **before importing Ray** to customize log deduplication: * Set ``RAY_DEDUP_LOGS=0`` to turn off this feature entirely. * Set ``RAY_DEDUP_LOGS_AGG_WINDOW_S=`` to change the aggregation window. * Set ``RAY_DEDUP_LOGS_ALLOW_REGEX=`` to specify log messages to never deduplicate. * Example: ```python import os os.environ["RAY_DEDUP_LOGS_ALLOW_REGEX"] = "ABC" import ray @ray.remote def f(): print("ABC") print("DEF") ray.init() ray.get([f.remote() for _ in range(5)]) # 2024-10-10 17:54:19,095 INFO worker.py:1614 -- Connecting to existing Ray cluster at address: 172.31.13.10:6379... # 2024-10-10 17:54:19,102 INFO worker.py:1790 -- Connected to Ray cluster. View the dashboard at 127.0.0.1:8265 # (f pid=1574323) ABC # (f pid=1574323) DEF # (f pid=1574321) ABC # (f pid=1574318) ABC # (f pid=1574320) ABC # (f pid=1574322) ABC # (f pid=1574322) DEF [repeated 4x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.) ``` * Set ``RAY_DEDUP_LOGS_SKIP_REGEX=`` to specify log messages to skip printing. * Example: ```python import os os.environ["RAY_DEDUP_LOGS_SKIP_REGEX"] = "ABC" import ray @ray.remote def f(): print("ABC") print("DEF") ray.init() ray.get([f.remote() for _ in range(5)]) # 2024-10-10 17:55:05,308 INFO worker.py:1614 -- Connecting to existing Ray cluster at address: 172.31.13.10:6379... # 2024-10-10 17:55:05,314 INFO worker.py:1790 -- Connected to Ray cluster. View the dashboard at 127.0.0.1:8265 # (f pid=1574317) DEF # (f pid=1575229) DEF [repeated 4x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.) 
``` ## Distributed progress bars with tqdm When using [tqdm](https://tqdm.github.io) in Ray remote Tasks or Actors, you may notice that the progress bar output is corrupted. To avoid this problem, use the Ray distributed tqdm implementation at ``ray.experimental.tqdm_ray``: ```{literalinclude} /ray-core/doc_code/tqdm.py ``` This tqdm implementation works as follows: 1. The ``tqdm_ray`` module translates tqdm calls into special JSON log messages written to the worker stdout. 2. The Ray log monitor routes these log messages to a tqdm singleton, instead of copying them directly to the driver stdout. 3. The tqdm singleton determines the positions of progress bars from various Ray Tasks or Actors, ensuring they don't collide or conflict with each other. Limitations: - Ray only supports a subset of tqdm features. Refer to the ray_tqdm [implementation](https://github.com/ray-project/ray/blob/master/python/ray/experimental/tqdm_ray.py) for more details. - Performance may be poor if there are more than a couple thousand updates per second because Ray doesn't batch updates. By default, the built-in print is also patched to use `ray.experimental.tqdm_ray.safe_print` when you use `tqdm_ray`. This avoids progress bar corruption on driver print statements. To turn off this, set `RAY_TQDM_PATCH_PRINT=0`. ## Using Ray's logger When Ray executes ``import ray``, Ray initializes Ray's logger, generating a default configuration given in ``python/ray/_private/log.py``. The default logging level is ``logging.INFO``. All Ray loggers are automatically configured in ``ray._private.ray_logging``. To modify the Ray logger: ```python import logging logger = logging.getLogger("ray") logger.setLevel(logging.WARNING) # Modify the Ray logging config ``` Similarly, to modify the logging configuration for Ray libraries, specify the appropriate logger name: ```python import logging # First, get the handle for the logger you want to modify ray_data_logger = logging.getLogger("ray.data") ray_tune_logger = logging.getLogger("ray.tune") ray_rllib_logger = logging.getLogger("ray.rllib") ray_train_logger = logging.getLogger("ray.train") ray_serve_logger = logging.getLogger("ray.serve") # Modify the ray.data logging level ray_data_logger.setLevel(logging.WARNING) # Other loggers can be modified similarly. # Here's how to add an additional file handler for Ray Tune: ray_tune_logger.addHandler(logging.FileHandler("extra_ray_tune_log.log")) ``` ### Using Ray logger for application logs A Ray app includes both driver and worker processes. For Python apps, use Python loggers to format your logs. As a result, you need to set up Python loggers for both driver and worker processes. ::::{tab-set} :::{tab-item} Ray Core ```{admonition} Caution :class: caution This is an experimental feature. It doesn't support [Ray Client](ray-client-ref) yet. ``` Set up the Python logger for driver and worker processes separately: 1. Set up the logger for the driver process after importing `ray`. 2. Use `worker_process_setup_hook` to configure the Python logger for all worker processes. ![Set up python loggers](../images/setup-logger-application.png) If you want to control the logger for particular actors or tasks, view the following [customizing logger for individual worker process](#customizing-worker-process-loggers). ::: :::{tab-item} Ray libraries If you are using any of the Ray libraries, follow the instructions provided in the documentation for the library. 
:::

::::

### Customizing worker process loggers

Ray executes Tasks and Actors remotely in Ray's worker processes. To provide your own logging configuration for the worker processes, customize the worker loggers with the instructions below:

::::{tab-set}

:::{tab-item} Ray Core: individual worker process

Customize the logger configuration when you define the Tasks or Actors.

```python
import ray
import logging
# Initiate a driver.
ray.init()

@ray.remote
class Actor:
    def __init__(self):
        # Basic config automatically configures logs to
        # stream to stdout and stderr.
        # Set the severity to INFO so that info logs are printed to stdout.
        logging.basicConfig(level=logging.INFO)

    def log(self, msg):
        logger = logging.getLogger(__name__)
        logger.info(msg)

actor = Actor.remote()
ray.get(actor.log.remote("A log message for an actor."))

@ray.remote
def f(msg):
    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger(__name__)
    logger.info(msg)

ray.get(f.remote("A log message for a task."))
```

```bash
(Actor pid=179641) INFO:__main__:A log message for an actor.
(f pid=177572) INFO:__main__:A log message for a task.
```
:::

:::{tab-item} Ray Core: all worker processes of a job

```{admonition} Caution
:class: caution
This is an experimental feature. The semantics of the API are subject to change.
It doesn't support [Ray Client](ray-client-ref) yet.
```

Use `worker_process_setup_hook` to apply the new logging configuration to all worker processes within a job.

```python
# driver.py
import logging
import warnings

import ray

def logging_setup_func():
    logger = logging.getLogger("ray")
    logger.setLevel(logging.DEBUG)
    warnings.simplefilter("always")

ray.init(runtime_env={"worker_process_setup_hook": logging_setup_func})

logging_setup_func()
```
:::

:::{tab-item} Ray libraries
If you use any of the Ray libraries, follow the instructions provided in the documentation for the library.
:::

::::

(structured-logging)=
## Structured logging

Implement structured logging to enable downstream users and applications to consume the logs efficiently.

### Application logs

Ray enables users to configure the Python logging library to output logs in a structured format. This setup standardizes log entries, making them easier to handle.

#### Configure structured logging for Ray Core

```{admonition} Ray libraries
If you are using any of the Ray libraries, follow the instructions provided in the documentation for the library.
```

The following methods are ways to configure Ray Core's structured logging format:

##### Method 1: Configure structured logging with `ray.init`

```python
ray.init(
    log_to_driver=False,
    logging_config=ray.LoggingConfig(encoding="JSON", log_level="INFO")
)
```

You can configure the following parameters:

* `encoding`: The encoding format for the logs. The default is `TEXT` for plain text logs. The other option is `JSON` for structured logs. In both `TEXT` and `JSON` encoding formats, the logs include Ray-specific fields such as `job_id`, `worker_id`, `node_id`, `actor_id`, `actor_name`, `task_id`, `task_name`, and `task_function_name`, if available.

* `log_level`: The log level for the driver process. The default is `INFO`. Available log levels are defined in the [Python logging library](https://docs.python.org/3/library/logging.html#logging-levels).

* `additional_log_standard_attrs`: Since Ray version 2.43. A list of additional standard Python logger attributes to include in the log record. The default is an empty list. The list of already included standard attributes is: `asctime`, `levelname`, `message`, `filename`, `lineno`, `exc_text`.
The list of all valid attributes are specified in the [Python logging library](http://docs.python.org/library/logging.html#logrecord-attributes). When you set up `logging_config` in `ray.init`, it configures the root loggers for the driver process, Ray actors, and Ray tasks. ```{admonition} note The `log_to_driver` parameter is set to `False` to disable logging to the driver process as the redirected logs to the driver will include prefixes that made the logs not JSON parsable. ``` ##### Method 2: Configure structured logging with an environment variable You can set the `RAY_LOGGING_CONFIG_ENCODING` environment variable to `TEXT` or `JSON` to set the encoding format for the logs. Note that you need to set the environment variables before `import ray`. ```python import os os.environ["RAY_LOGGING_CONFIG_ENCODING"] = "JSON" import ray import logging ray.init(log_to_driver=False) # Use the root logger to print log messages. ``` #### Example The following example configures the `LoggingConfig` to output logs in a structured JSON format and sets the log level to `INFO`. It then logs messages with the root loggers in the driver process, Ray tasks, and Ray actors. The logs include Ray-specific fields such as `job_id`, `worker_id`, `node_id`, `actor_id`, `actor_name`, `task_id`, `task_name` and `task_function_name` when applicable. ```python import ray import logging ray.init( logging_config=ray.LoggingConfig(encoding="JSON", log_level="INFO", additional_log_standard_attrs=['name']) ) def init_logger(): """Get the root logger""" return logging.getLogger() logger = logging.getLogger() logger.info("Driver process") @ray.remote def f(): logger = init_logger() logger.info("A Ray task") @ray.remote class actor: def print_message(self): logger = init_logger() logger.info("A Ray actor") task_obj_ref = f.remote() ray.get(task_obj_ref) actor_instance = actor.remote() ray.get(actor_instance.print_message.remote()) """ {"asctime": "2025-02-25 22:06:00,967", "levelname": "INFO", "message": "Driver process", "filename": "test-log-config-doc.py", "lineno": 13, "name": "root", "job_id": "01000000", "worker_id": "01000000ffffffffffffffffffffffffffffffffffffffffffffffff", "node_id": "543c939946ec1321c9c1a10899bfb72f59aa6eab7655719f2611da04", "timestamp_ns": 1740549960968002000} {"asctime": "2025-02-25 22:06:00,974", "levelname": "INFO", "message": "A Ray task", "filename": "test-log-config-doc.py", "lineno": 18, "name": "root", "job_id": "01000000", "worker_id": "162f2bd846e84685b4c07eb75f2c1881b9df1cdbf58ffbbcccbf2c82", "node_id": "543c939946ec1321c9c1a10899bfb72f59aa6eab7655719f2611da04", "task_id": "c8ef45ccd0112571ffffffffffffffffffffffff01000000", "task_name": "f", "task_func_name": "test-log-config-doc.f", "timestamp_ns": 1740549960974027000} {"asctime": "2025-02-25 22:06:01,314", "levelname": "INFO", "message": "A Ray actor", "filename": "test-log-config-doc.py", "lineno": 24, "name": "root", "job_id": "01000000", "worker_id": "b7fd965bb12b1046ddfa3d73ead5ed54eb7678d97e743d98dfab852b", "node_id": "543c939946ec1321c9c1a10899bfb72f59aa6eab7655719f2611da04", "actor_id": "43b5d1828ad0a003ca6ebcfc01000000", "task_id": "c2668a65bda616c143b5d1828ad0a003ca6ebcfc01000000", "task_name": "actor.print_message", "task_func_name": "test-log-config-doc.actor.print_message", "actor_name": "", "timestamp_ns": 1740549961314391000} """ ``` #### Add metadata to structured logs Add extra fields to the log entries by using the `extra` parameter in the `logger.info` method. 
```python
import ray
import logging

ray.init(
    log_to_driver=False,
    logging_config=ray.LoggingConfig(encoding="JSON", log_level="INFO")
)

logger = logging.getLogger()

logger.info("Driver process with extra fields", extra={"username": "anyscale"})

# The log entry includes the extra field "username" with the value "anyscale".

# {"asctime": "2024-07-17 21:57:50,891", "levelname": "INFO", "message": "Driver process with extra fields", "filename": "test.py", "lineno": 9, "username": "anyscale", "job_id": "04000000", "worker_id": "04000000ffffffffffffffffffffffffffffffffffffffffffffffff", "node_id": "76cdbaa32b3938587dcfa278201b8cef2d20377c80ec2e92430737ae"}
```

If needed, you can fetch the metadata of Jobs, Tasks, or Actors with Ray’s {py:obj}`ray.runtime_context.get_runtime_context` API.

::::{tab-set}

:::{tab-item} Ray Job
Get the job ID.

```python
import ray
# Initiate a driver.
ray.init()

job_id = ray.get_runtime_context().get_job_id()
```

```{admonition} Note
:class: note
The job submission ID is not supported yet. This [GitHub issue](https://github.com/ray-project/ray/issues/28089#issuecomment-1557891407) tracks the work to support it.
```
:::

:::{tab-item} Ray Actor
Get the actor ID.

```python
import ray
# Initiate a driver.
ray.init()

@ray.remote
class actor():
    actor_id = ray.get_runtime_context().get_actor_id()
```
:::

:::{tab-item} Ray Task
Get the task ID.

```python
import ray
# Initiate a driver.
ray.init()

@ray.remote
def task():
    task_id = ray.get_runtime_context().get_task_id()
```
:::

:::{tab-item} Node
Get the node ID.

```python
import ray
# Initiate a driver.
ray.init()

# Get the ID of the node where the driver process is running
driver_process_node_id = ray.get_runtime_context().get_node_id()

@ray.remote
def task():
    # Get the ID of the node where the worker process is running
    worker_process_node_id = ray.get_runtime_context().get_node_id()
```

```{admonition} Tip
:class: tip
If you need the node IP, use the {py:obj}`ray.nodes` API to fetch all nodes and map the node ID to the corresponding IP.
```
:::

::::

### System logs

Ray structures most system or component logs by default.
The logging format for Python logs is as follows:
```bash %(asctime)s\t%(levelname)s %(filename)s:%(lineno)s -- %(message)s ``` Example:
```
2023-06-01 09:15:34,601	INFO job_manager.py:408 -- Submitting job with RAY_ADDRESS = 10.0.24.73:6379
```

The logging format for C++ logs is as follows:
```bash [year-month-day, time, pid, thread_id] (component) [file]:[line] [message] ``` Example:
```bash
[2023-06-01 08:47:47,457 I 31009 225171] (gcs_server) gcs_node_manager.cc:42: Registering node info, node id = 8cc65840f0a332f4f2d59c9814416db9c36f04ac1a29ac816ad8ca1e, address = 127.0.0.1, node name = 127.0.0.1
```

:::{note}
As of Ray 2.5, some system component logs aren't structured as suggested in the preceding format. The migration of system logs to structured logs is ongoing.
:::

(log-rotation)=
## Log rotation

Ray supports log rotation of log files. Note that not all components support log rotation. (Raylet, Python, and Java worker logs don't rotate.)

By default, logs rotate when they reach 512 MB (maxBytes), and have a maximum of five backup files (backupCount). Ray appends indexes to all backup files - for example, `raylet.out.1`. To change the log rotation configuration, specify environment variables. For example,

```bash
RAY_ROTATION_MAX_BYTES=1024 ray start --head   # Start a ray instance with maxBytes 1KB.
RAY_ROTATION_BACKUP_COUNT=1 ray start --head   # Start a ray instance with backupCount 1.
```

The max size of a log file, including its backups, is `RAY_ROTATION_MAX_BYTES * RAY_ROTATION_BACKUP_COUNT + RAY_ROTATION_MAX_BYTES`.

## Log persistence

To process and export logs to external storage or management systems, see {ref}`log persistence on Kubernetes ` and {ref}`log persistence on VMs ` for more details.

---

(observability-debug-apps)=
# Debugging Applications

```{toctree}
:hidden:

general-debugging
debug-memory
debug-hangs
debug-failures
optimize-performance
../../ray-distributed-debugger
ray-debugging
```

These guides help you perform common debugging or optimization tasks for your distributed application on Ray:

* {ref}`observability-general-debugging`
* {ref}`ray-core-mem-profiling`
* {ref}`observability-debug-hangs`
* {ref}`observability-debug-failures`
* {ref}`observability-optimize-performance`
* {ref}`ray-distributed-debugger`
* {ref}`ray-debugger` (deprecated)

---

(observability-user-guides)=
# User Guides

```{toctree}
:hidden:

Debugging Applications
cli-sdk
configure-logging
profiling
add-app-metrics
ray-tracing
ray-event-export
```

These guides help you monitor and debug your Ray applications and clusters.

The guides include:

* {ref}`observability-debug-apps`
* {ref}`observability-programmatic`
* {ref}`configure-logging`
* {ref}`application-level-metrics`
* {ref}`ray-tracing`
* {ref}`ray-event-export`

---

(profiling)=
# Profiling

Profiling is one of the most important debugging tools for diagnosing performance, out-of-memory, hanging, or other application issues. Here is a list of common profiling tools you may use when debugging Ray applications.

- CPU profiling
  - py-spy
- Memory profiling
  - memray
- GPU profiling
  - PyTorch Profiler
  - Nsight System
- Ray Task / Actor timeline

If Ray doesn't work with certain profiling tools, try running them without Ray to debug the issues.

(profiling-cpu)=
## CPU profiling

Profile the CPU usage for Driver and Worker processes. This helps you understand the CPU usage by different processes and debug unexpectedly high or low usage.

(profiling-pyspy)=
### py-spy

[py-spy](https://github.com/benfred/py-spy/tree/master) is a sampling profiler for Python programs. The Ray Dashboard has native integration with py-spy:

- It lets you visualize what your Python program is spending time on without restarting the program or modifying the code in any way.
- It dumps the stacktrace of the running process so that you can see what the process is doing at a certain time. This is useful when a program hangs.
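If you prefer the command line to the dashboard integration, a minimal sketch of pointing py-spy at a single Ray worker process looks like the following; the PID is a placeholder that you can take, for example, from the `(pid=...)` prefix of the worker's log output:

```bash
# Dump the current stack traces of a (possibly hanging) worker process.
py-spy dump --pid 45601

# Continuously sample the worker to see where it spends CPU time.
py-spy top --pid 45601
```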
:::{note}
You may run into permission errors when using py-spy in Docker containers. To fix the issue:

- If you start Ray manually in a Docker container, follow the [py-spy documentation](https://github.com/benfred/py-spy) to resolve it.
- If you are a KubeRay user, follow the {ref}`guide to configure KubeRay ` and resolve it.
:::

Here are the {ref}`steps to use py-spy with Ray and Ray Dashboard `.

(profiling-cprofile)=
### cProfile

cProfile is Python’s native profiling module to profile the performance of your Ray application.

Here are the {ref}`steps to use cProfile `.

(profiling-memory)=
## Memory profiling

Profile the memory usage for Driver and Worker processes. This helps you analyze memory allocations in applications, trace memory leaks, and debug high/low memory or out-of-memory issues.

(profiling-memray)=
### memray

memray is a memory profiler for Python. It can track memory allocations in Python code, in native extension modules, and in the Python interpreter itself.

Here are the {ref}`steps to profile the memory usage of Ray Tasks and Actors `.

#### Ray Dashboard View

You can now do memory profiling for Ray Driver or Worker processes in the Ray Dashboard by clicking on the "Memory profiling" actions for active Worker processes, Tasks, Actors, and a Job's driver process.

![memory profiling action](../images/memory-profiling-dashboard-view.png)

Additionally, you can specify the following Memray profiling parameters from the dashboard view:

- **Format:** Format of the profiling result. The value is either "flamegraph" or "table".
- **Duration:** Duration to track for (in seconds).
- **Leaks:** Enables the Memory Leaks View, which displays memory that Ray didn't deallocate, instead of peak memory usage.
- **Natives:** Track native (C/C++) stack frames (only supported in Linux).
- **Python Allocator Tracing:** Record allocations made by the pymalloc allocator.

(profiling-gpu)=
## GPU profiling

Profile GPU and GRAM usage for your GPU workloads, such as distributed training. This helps you analyze performance and debug memory issues.

- PyTorch Profiler is supported out of the box when used with Ray Train.
- NVIDIA Nsight System is natively supported on Ray.

(profiling-pytorch-profiler)=
### PyTorch Profiler

PyTorch Profiler is a tool that allows the collection of performance metrics (especially GPU metrics) during training and inference.

Here are the {ref}`steps to use PyTorch Profiler with Ray Train or Ray Data `.

(profiling-nsight-profiler)=
### Nsight System Profiler

#### Installation

First, install the Nsight System CLI by following the [Nsight User Guide](https://docs.nvidia.com/nsight-systems/InstallationGuide/index.html).

Confirm that you installed Nsight correctly:

```bash
$ nsys --version
# NVIDIA Nsight Systems version 2022.4.1.21-0db2c85
```

(run-nsight-on-ray)=
#### Run Nsight on Ray

To enable GPU profiling, specify the config in the `runtime_env` as follows:

```python
import torch
import ray

ray.init()

@ray.remote(num_gpus=1, runtime_env={ "nsight": "default"})
class RayActor:
    def run(self):
        a = torch.tensor([1.0, 2.0, 3.0]).cuda()
        b = torch.tensor([4.0, 5.0, 6.0]).cuda()
        c = a * b
        print("Result on GPU:", c)

ray_actor = RayActor.remote()
# The Actor or Task process runs with: "nsys profile [default options] ..."
ray.get(ray_actor.run.remote())
```

You can find the `"default"` config in [nsight.py](https://github.com/ray-project/ray/blob/master/python/ray/_private/runtime_env/nsight.py#L20).
#### Custom options You can also add [custom options](https://docs.nvidia.com/nsight-systems/UserGuide/index.html#cli-profile-command-switch-options) for Nsight System Profiler by specifying a dictionary of option values, which overwrites the `default` config, however, Ray preserves the `--output` option of the default config. ```python import torch import ray ray.init() @ray.remote( num_gpus=1, runtime_env={ "nsight": { "t": "cuda,cudnn,cublas", "cuda-memory-usage": "true", "cuda-graph-trace": "graph", }}) class RayActor: def run(self): a = torch.tensor([1.0, 2.0, 3.0]).cuda() b = torch.tensor([4.0, 5.0, 6.0]).cuda() c = a * b print("Result on GPU:", c) ray_actor = RayActor.remote() # The Actor or Task process runs with : # "nsys profile -t cuda,cudnn,cublas --cuda-memory-usage=True --cuda-graph-trace=graph ..." ray.get(ray_actor.run.remote()) ``` **Note:**: The default report filename (`-o, --output`) is `worker_process_{pid}.nsys-rep` in the logs dir. (profiling-result)= #### Profiling result Find profiling results under the `/tmp/ray/session_*/logs/{profiler_name}` directory. This specific directory location may change in the future. You can download the profiling reports from the {ref}`Ray Dashboard `. ![Nsight System Profiler folder](../images/nsight-profiler-folder.png) To visualize the results, install the [Nsight System GUI](https://developer.nvidia.com/nsight-systems/get-started#latest-Platforms) on your laptop, which becomes the host. Transfer the .nsys-rep file to your host and open it using the GUI. You can now view the visual profiling info. **Note**: The Nsight System Profiler output (-o, --output) option allows you to set the path to a filename. Ray uses the logs directory as the base and appends the output option to it. For example: ``` --output job_name/ray_worker -> /tmp/ray/session_*/logs/nsight/job_name/ray_worker --output /Users/Desktop/job_name/ray_worker -> /Users/Desktop/job_name/ray_worker ``` The best practice is to only specify the filename in output option. (profiling-timeline)= ## Ray Task or Actor timeline Ray Timeline profiles the execution time of Ray Tasks and Actors. This helps you analyze performance, identify the stragglers, and understand the distribution of workloads. Open your Ray Job in Ray Dashboard and follow the {ref}`instructions to download and visualize the trace files ` generated by Ray Timeline. --- (gentle-intro)= # Getting Started Ray is an open source unified framework for scaling AI and Python applications. It provides a simple, universal API for building distributed applications that can scale from a laptop to a cluster. ## What's Ray? 
Ray simplifies distributed computing by providing: - **Scalable compute primitives**: Tasks and actors for painless parallel programming - **Specialized AI libraries**: Tools for common ML workloads like data processing, model training, hyperparameter tuning, and model serving - **Unified resource management**: Seamless scaling from laptop to cloud with automatic resource handling ## Choose Your Path Select the guide that matches your needs: * **Scale ML workloads**: [Ray Libraries Quickstart](#libraries-quickstart) * **Scale general Python applications**: [Ray Core Quickstart](#ray-core-quickstart) * **Deploy to the cloud**: [Ray Clusters Quickstart](#ray-cluster-quickstart) * **Debug and monitor applications**: [Debugging and Monitoring Quickstart](#debugging-and-monitoring-quickstart) ```{image} ../images/map-of-ray.svg :align: center :alt: Ray Framework Architecture ``` (libraries-quickstart)= ## Ray AI Libraries Quickstart Use individual libraries for ML workloads. Each library specializes in a specific part of the ML workflow, from data processing to model serving. Click on the dropdowns for your workload below. `````{dropdown} ray Data: Scalable Data Processing for AI Workloads :animate: fade-in-slide-down [Ray Data](data_quickstart) provides distributed data processing capabilities for AI workloads. It efficiently streams data through data pipelines. Here's an example of how to scale offline inference and training ingest with Ray Data. ````{note} To run this example, install Ray Data: ```bash pip install -U "ray[data]" ``` ```` ```{testcode} from typing import Dict import numpy as np import ray # Create datasets from on-disk files, Python objects, and cloud storage like S3. ds = ray.data.read_csv("s3://anonymous@ray-example-data/iris.csv") # Apply functions to transform data. Ray Data executes transformations in parallel. def compute_area(batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]: length = batch["petal length (cm)"] width = batch["petal width (cm)"] batch["petal area (cm^2)"] = length * width return batch transformed_ds = ds.map_batches(compute_area) # Iterate over batches of data. for batch in transformed_ds.iter_batches(batch_size=4): print(batch) # Save dataset contents to on-disk files or cloud storage. transformed_ds.write_parquet("local:///tmp/iris/") ``` ```{testoutput} :hide: ... ``` ```{button-ref} ../data/data :color: primary :outline: :expand: Learn more about Ray Data ``` ````` ``````{dropdown} ray Train: Distributed Model Training :animate: fade-in-slide-down **Ray Train** makes distributed model training simple. It abstracts away the complexity of setting up distributed training across popular frameworks like PyTorch and TensorFlow. `````{tab-set} ````{tab-item} PyTorch This example shows how you can use Ray Train with PyTorch. To run this example install Ray Train and PyTorch packages: :::{note} ```bash pip install -U "ray[train]" torch torchvision ``` ::: Set up your dataset and model. ```{literalinclude} /../../python/ray/train/examples/pytorch/torch_quick_start.py :language: python :start-after: __torch_setup_begin__ :end-before: __torch_setup_end__ ``` Now define your single-worker PyTorch training function. 
```{literalinclude} /../../python/ray/train/examples/pytorch/torch_quick_start.py :language: python :start-after: __torch_single_begin__ :end-before: __torch_single_end__ ``` This training function can be executed with: ```{literalinclude} /../../python/ray/train/examples/pytorch/torch_quick_start.py :language: python :start-after: __torch_single_run_begin__ :end-before: __torch_single_run_end__ :dedent: 4 ``` Convert this to a distributed multi-worker training function. Use the ``ray.train.torch.prepare_model`` and ``ray.train.torch.prepare_data_loader`` utility functions to set up your model and data for distributed training. This automatically wraps the model with ``DistributedDataParallel`` and places it on the right device, and adds ``DistributedSampler`` to the DataLoaders. ```{literalinclude} /../../python/ray/train/examples/pytorch/torch_quick_start.py :language: python :start-after: __torch_distributed_begin__ :end-before: __torch_distributed_end__ ``` Instantiate a ``TorchTrainer`` with 4 workers, and use it to run the new training function. ```{literalinclude} /../../python/ray/train/examples/pytorch/torch_quick_start.py :language: python :start-after: __torch_trainer_begin__ :end-before: __torch_trainer_end__ :dedent: 4 ``` To accelerate the training job using GPU, make sure you have GPU configured, then set `use_gpu` to `True`. If you don't have a GPU environment, Anyscale provides a development workspace integrated with an autoscaling GPU cluster for this purpose. ```` ````{tab-item} TensorFlow This example shows how you can use Ray Train to set up [Multi-worker training with Keras](https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras). To run this example install Ray Train and Tensorflow packages: :::{note} ```bash pip install -U "ray[train]" tensorflow ``` ::: Set up your dataset and model. ```{literalinclude} /../../python/ray/train/examples/tf/tensorflow_quick_start.py :language: python :start-after: __tf_setup_begin__ :end-before: __tf_setup_end__ ``` Now define your single-worker TensorFlow training function. ```{literalinclude} /../../python/ray/train/examples/tf/tensorflow_quick_start.py :language: python :start-after: __tf_single_begin__ :end-before: __tf_single_end__ ``` This training function can be executed with: ```{literalinclude} /../../python/ray/train/examples/tf/tensorflow_quick_start.py :language: python :start-after: __tf_single_run_begin__ :end-before: __tf_single_run_end__ :dedent: 0 ``` Now convert this to a distributed multi-worker training function. 1. Set the *global* batch size - each worker processes the same size batch as in the single-worker code. 2. Choose your TensorFlow distributed training strategy. This examples uses the ``MultiWorkerMirroredStrategy``. ```{literalinclude} /../../python/ray/train/examples/tf/tensorflow_quick_start.py :language: python :start-after: __tf_distributed_begin__ :end-before: __tf_distributed_end__ ``` Instantiate a ``TensorflowTrainer`` with 4 workers, and use it to run the new training function. ```{literalinclude} /../../python/ray/train/examples/tf/tensorflow_quick_start.py :language: python :start-after: __tf_trainer_begin__ :end-before: __tf_trainer_end__ :dedent: 0 ``` To accelerate the training job using GPU, make sure you have GPU configured, then set `use_gpu` to `True`. If you don't have a GPU environment, Anyscale provides a development workspace integrated with an autoscaling GPU cluster for this purpose. 
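For reference, enabling GPUs is typically a one-line change to the trainer's scaling configuration. A minimal sketch (assumes the cluster has GPUs available):

```python
from ray.train import ScalingConfig

# Request 4 training workers, each with one GPU.
scaling_config = ScalingConfig(num_workers=4, use_gpu=True)
```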
```{button-ref} ../train/train
:color: primary
:outline:
:expand:

Learn more about Ray Train
```

````

`````

``````

`````{dropdown} ray Tune: Hyperparameter Tuning at Scale
:animate: fade-in-slide-down

[Ray Tune](../tune/index.rst) is a library for hyperparameter tuning at any scale. It automatically finds the best hyperparameters for your models with efficient distributed search algorithms.

With Tune, you can launch a multi-node distributed hyperparameter sweep in less than 10 lines of code, supporting any deep learning framework including PyTorch, TensorFlow, and Keras.

````{note}
To run this example, install Ray Tune:

```bash
pip install -U "ray[tune]"
```
````

This example runs a small grid search with an iterative training function.

```{literalinclude} ../../../python/ray/tune/tests/example.py
:end-before: __quick_start_end__
:language: python
:start-after: __quick_start_begin__
```

If TensorBoard is installed (`pip install tensorboard`), you can automatically visualize all trial results:

```bash
tensorboard --logdir ~/ray_results
```

```{button-ref} ../tune/index
:color: primary
:outline:
:expand:

Learn more about Ray Tune
```

`````

`````{dropdown} ray Serve: Scalable Model Serving
:animate: fade-in-slide-down

[Ray Serve](../serve/index) provides scalable and programmable serving for ML models and business logic. Deploy models from any framework with production-ready performance.

````{note}
To run this example, install Ray Serve and scikit-learn:

```{code-block} bash
pip install -U "ray[serve]" scikit-learn
```
````

This example serves a scikit-learn gradient boosting classifier.

```{literalinclude} ../serve/doc_code/sklearn_quickstart.py
:language: python
:start-after: __serve_example_begin__
:end-before: __serve_example_end__
```

The response shows `{"result": "versicolor"}`.

```{button-ref} ../serve/index
:color: primary
:outline:
:expand:

Learn more about Ray Serve
```

`````

`````{dropdown} ray RLlib: Industry-Grade Reinforcement Learning
:animate: fade-in-slide-down

[RLlib](../rllib/index.rst) is a reinforcement learning (RL) library that offers high-performance implementations of popular RL algorithms and supports various training environments. RLlib offers high scalability and unified APIs for a variety of industry and research applications.

````{note}
To run this example, install `rllib` and either `tensorflow` or `pytorch`:

```bash
pip install -U "ray[rllib]" tensorflow  # or torch
```

You may also need CMake installed on your system.
````

```{literalinclude} ../rllib/doc_code/rllib_on_ray_readme.py
:end-before: __quick_start_end__
:language: python
:start-after: __quick_start_begin__
```

```{button-ref} ../rllib/index
:color: primary
:outline:
:expand:

Learn more about Ray RLlib
```

`````

## Ray Core Quickstart

Ray Core provides simple primitives for building and running distributed applications. It enables you to turn regular Python or Java functions and classes into distributed stateless tasks and stateful actors with just a few lines of code.

The examples below show you how to:

1. Convert Python functions to Ray tasks for parallel execution
2. Convert Python classes to Ray actors for distributed stateful computation

``````{dropdown} ray Core: Parallelizing Functions with Ray Tasks
:animate: fade-in-slide-down

`````{tab-set}

````{tab-item} Python

:::{note}
To run this example install Ray Core:

```bash
pip install -U "ray"
```
:::

Import Ray and initialize it with `ray.init()`. Then decorate the function with ``@ray.remote`` to declare that you want to run this function remotely. Lastly, call the function with ``.remote()`` instead of calling it normally. This remote call yields a future, a Ray _object reference_, that you can then fetch with ``ray.get``.

```{code-block} python
import ray
ray.init()

@ray.remote
def f(x):
    return x * x

futures = [f.remote(i) for i in range(4)]
print(ray.get(futures)) # [0, 1, 4, 9]
```

````

````{tab-item} Java

```{note}
To run this example, add the [ray-api](https://mvnrepository.com/artifact/io.ray/ray-api) and [ray-runtime](https://mvnrepository.com/artifact/io.ray/ray-runtime) dependencies in your project.
```

Use `Ray.init` to initialize Ray runtime. Then use `Ray.task(...).remote()` to convert any Java static method into a Ray task. The task runs asynchronously in a remote worker process. The `remote` method returns an ``ObjectRef``, and you can fetch the actual result with ``get``.

```{code-block} java
import io.ray.api.ObjectRef;
import io.ray.api.Ray;
import java.util.ArrayList;
import java.util.List;

public class RayDemo {
    public static int square(int x) {
        return x * x;
    }

    public static void main(String[] args) {
        // Initialize Ray runtime.
        Ray.init();
        List<ObjectRef<Integer>> objectRefList = new ArrayList<>();
        // Invoke the `square` method 4 times remotely as Ray tasks.
        // The tasks run in parallel in the background.
        for (int i = 0; i < 4; i++) {
            objectRefList.add(Ray.task(RayDemo::square, i).remote());
        }
        // Get the actual results of the tasks.
        System.out.println(Ray.get(objectRefList)); // [0, 1, 4, 9]
    }
}
```

In the above code block we defined some Ray Tasks. While these are great for stateless operations, sometimes you must maintain the state of your application. You can do that with Ray Actors.

```{button-ref} ../ray-core/walkthrough
:color: primary
:outline:
:expand:

Learn more about Ray Core
```

````

`````

``````

``````{dropdown} ray Core: Parallelizing Classes with Ray Actors
:animate: fade-in-slide-down

Ray provides actors to allow you to parallelize an instance of a class in Python or Java. When you instantiate a class that is a Ray actor, Ray starts a remote instance of that class in the cluster. This actor can then execute remote method calls and maintain its own internal state.

`````{tab-set}

````{tab-item} Python

:::{note}
To run this example install Ray Core:

```bash
pip install -U "ray"
```
:::

```{code-block} python
import ray
ray.init() # Only call this once.
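# The @ray.remote decorator below turns the class into a Ray actor: each
# instance runs in its own worker process and keeps `self.n` across calls.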
@ray.remote
class Counter(object):
    def __init__(self):
        self.n = 0

    def increment(self):
        self.n += 1

    def read(self):
        return self.n

counters = [Counter.remote() for i in range(4)]
[c.increment.remote() for c in counters]
futures = [c.read.remote() for c in counters]
print(ray.get(futures)) # [1, 1, 1, 1]
```

````

````{tab-item} Java

```{note}
To run this example, add the [ray-api](https://mvnrepository.com/artifact/io.ray/ray-api) and [ray-runtime](https://mvnrepository.com/artifact/io.ray/ray-runtime) dependencies in your project.
```

```{code-block} java
import io.ray.api.ActorHandle;
import io.ray.api.ObjectRef;
import io.ray.api.Ray;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

public class RayDemo {

    public static class Counter {
        private int value = 0;

        public void increment() {
            this.value += 1;
        }

        public int read() {
            return this.value;
        }
    }

    public static void main(String[] args) {
        // Initialize Ray runtime.
        Ray.init();
        List<ActorHandle<Counter>> counters = new ArrayList<>();
        // Create 4 actors from the `Counter` class.
        // These run in remote worker processes.
        for (int i = 0; i < 4; i++) {
            counters.add(Ray.actor(Counter::new).remote());
        }

        // Invoke the `increment` method on each actor.
        // This sends an actor task to each remote actor.
        for (ActorHandle<Counter> counter : counters) {
            counter.task(Counter::increment).remote();
        }
        // Invoke the `read` method on each actor, and print the results.
        List<ObjectRef<Integer>> objectRefList = counters.stream()
            .map(counter -> counter.task(Counter::read).remote())
            .collect(Collectors.toList());
        System.out.println(Ray.get(objectRefList)); // [1, 1, 1, 1]
    }
}
```

```{button-ref} ../ray-core/walkthrough
:color: primary
:outline:
:expand:

Learn more about Ray Core
```

````

`````

``````

## Ray Cluster Quickstart

Deploy your applications on Ray clusters on AWS, GCP, Azure, and more, often with minimal code changes to your existing code.

`````{dropdown} ray Clusters: Launching a Ray Cluster on AWS
:animate: fade-in-slide-down

Ray programs can run on a single machine, or seamlessly scale to large clusters.

:::{note}
To run this example install the following:

```bash
pip install -U "ray[default]" boto3
```

If you haven't already, configure your credentials as described in the [documentation for boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html#guide-credentials).
:::

Take this simple example that waits for individual nodes to join the cluster.

````{dropdown} example.py
:animate: fade-in-slide-down

```{literalinclude} ../../yarn/example.py
:language: python
```
````

You can also download this example from the [GitHub repository](https://github.com/ray-project/ray/blob/master/doc/yarn/example.py). Store it locally in a file called `example.py`.
To execute this script in the cloud, download [this configuration file](https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/aws/example-minimal.yaml), or copy it here: ````{dropdown} cluster.yaml :animate: fade-in-slide-down ```{literalinclude} ../../../python/ray/autoscaler/aws/example-minimal.yaml :language: yaml ``` ```` Assuming you have stored this configuration in a file called `cluster.yaml`, you can now launch an AWS cluster as follows: ```bash ray submit cluster.yaml example.py --start ``` ```{button-ref} cluster-index :color: primary :outline: :expand: Learn more about launching Ray Clusters on AWS, GCP, Azure, and more ``` ````` `````{dropdown} ray Clusters: Launching a Ray Cluster on Kubernetes :animate: fade-in-slide-down Ray programs can run on a single node Kubernetes cluster, or seamlessly scale to larger clusters. ```{button-ref} kuberay-index :color: primary :outline: :expand: Learn more about launching Ray Clusters on Kubernetes ``` ````` `````{dropdown} ray Clusters: Launching a Ray Cluster on Anyscale :animate: fade-in-slide-down Anyscale is the company behind Ray. The Anyscale platform provides an enterprise-grade Ray deployment on top of your AWS, GCP, Azure, or on-prem Kubernetes clusters. ```{button-link} https://console.anyscale.com/register/ha?render_flow=ray&utm_source=ray_docs&utm_medium=docs&utm_campaign=ray-doc-upsell&utm_content=get-started-launch-ray-cluster :color: primary :outline: :expand: Try Ray on Anyscale ``` ````` ## Debugging and Monitoring Quickstart Use built-in observability tools to monitor and debug Ray applications and clusters. These tools help you understand your application's performance and identify bottlenecks. `````{dropdown} ray Ray Dashboard: Web GUI to monitor and debug Ray :animate: fade-in-slide-down Ray dashboard provides a visual interface that displays real-time system metrics, node-level resource monitoring, job profiling, and task visualizations. The dashboard is designed to help users understand the performance of their Ray applications and identify potential issues. ```{image} https://raw.githubusercontent.com/ray-project/Images/master/docs/new-dashboard/Dashboard-overview.png :align: center ``` ````{note} To get started with the dashboard, install the default installation as follows: ```bash pip install -U "ray[default]" ``` ```` The dashboard automatically becomes available when running Ray scripts. Access the dashboard through the default URL, http://localhost:8265. ```{button-ref} observability-getting-started :color: primary :outline: :expand: Learn more about Ray Dashboard ``` ````` `````{dropdown} ray Ray State APIs: CLI to access cluster states :animate: fade-in-slide-down Ray state APIs allow users to conveniently access the current state (snapshot) of Ray through CLI or Python SDK. ````{note} To get started with the state API, install the default installation as follows: ```bash pip install -U "ray[default]" ``` ```` Run the following code. ```{code-block} python import ray import time ray.init(num_cpus=4) @ray.remote def task_running_300_seconds(): print("Start!") time.sleep(300) @ray.remote class Actor: def __init__(self): print("Actor created") # Create 2 tasks tasks = [task_running_300_seconds.remote() for _ in range(2)] # Create 2 actors actors = [Actor.remote() for _ in range(2)] ray.get(tasks) ``` See the summarized statistics of Ray tasks using ``ray summary tasks`` in a terminal. 
```{code-block} bash ray summary tasks ``` ```{code-block} text ======== Tasks Summary: 2022-07-22 08:54:38.332537 ======== Stats: ------------------------------------ total_actor_scheduled: 2 total_actor_tasks: 0 total_tasks: 2 Table (group by func_name): ------------------------------------ FUNC_OR_CLASS_NAME STATE_COUNTS TYPE 0 task_running_300_seconds RUNNING: 2 NORMAL_TASK 1 Actor.__init__ FINISHED: 2 ACTOR_CREATION_TASK ``` ```{button-ref} observability-programmatic :color: primary :outline: :expand: Learn more about Ray State APIs ``` ````` ## Learn More Ray has a rich ecosystem of resources to help you learn more about distributed computing and AI scaling. ### Blog and Press - [Modern Parallel and Distributed Python: A Quick Tutorial on Ray](https://medium.com/data-science/modern-parallel-and-distributed-python-a-quick-tutorial-on-ray-99f8d70369b8) - [Why Every Python Developer Will Love Ray](https://www.datanami.com/2019/11/05/why-every-python-developer-will-love-ray/) - [Ray: A Distributed System for AI (Berkeley Artificial Intelligence Research, BAIR)](http://bair.berkeley.edu/blog/2018/01/09/ray/) - [10x Faster Parallel Python Without Python Multiprocessing](https://medium.com/data-science/10x-faster-parallel-python-without-python-multiprocessing-e5017c93cce1) - [Implementing A Parameter Server in 15 Lines of Python with Ray](https://ray-project.github.io/2018/07/15/parameter-server-in-fifteen-lines.html) - [Ray Distributed AI Framework Curriculum](https://rise.cs.berkeley.edu/blog/ray-intel-curriculum/) - [RayOnSpark: Running Emerging AI Applications on Big Data Clusters with Ray and Analytics Zoo](https://medium.com/riselab/rayonspark-running-emerging-ai-applications-on-big-data-clusters-with-ray-and-analytics-zoo-923e0136ed6a) - [First user tips for Ray](https://rise.cs.berkeley.edu/blog/ray-tips-for-first-time-users/) - [Tune: a Python library for fast hyperparameter tuning at any scale](https://medium.com/data-science/fast-hyperparameter-tuning-at-scale-d428223b081c) - [Cutting edge hyperparameter tuning with Ray Tune](https://medium.com/riselab/cutting-edge-hyperparameter-tuning-with-ray-tune-be6c0447afdf) - [New Library Targets High Speed Reinforcement Learning](https://www.datanami.com/2018/02/01/rays-new-library-targets-high-speed-reinforcement-learning/) - [Scaling Multi Agent Reinforcement Learning](http://bair.berkeley.edu/blog/2018/12/12/rllib/) - [Functional RL with Keras and Tensorflow Eager](https://bair.berkeley.edu/blog/2019/10/14/functional-rl/) - [How to Speed up Pandas by 4x with one line of code](https://www.kdnuggets.com/2019/11/speed-up-pandas-4x.html) - [Quick Tip—Speed up Pandas using Modin](https://ericbrown.com/quick-tip-speed-up-pandas-using-modin.htm) - [Ray Blog](https://medium.com/distributed-computing-with-ray) ### Videos - [Unifying Large Scale Data Preprocessing and Machine Learning Pipelines with Ray Data \| PyData 2021](https://zoom.us/rec/share/0cjbk_YdCTbiTm7gNhzSeNxxTCCEy1pCDUkkjfBjtvOsKGA8XmDOx82jflHdQCUP.fsjQkj5PWSYplOTz?startTime=1635456658000) [(slides)](https://docs.google.com/presentation/d/19F_wxkpo1JAROPxULmJHYZd3sKryapkbMd0ib3ndMiU/edit?usp=sharing) - [Programming at any Scale with Ray \| SF Python Meetup Sept 2019](https://www.youtube.com/watch?v=LfpHyIXBhlE) - [Ray for Reinforcement Learning \| Data Council 2019](https://www.youtube.com/watch?v=Ayc0ca150HI) - [Scaling Interactive Pandas Workflows with Modin](https://www.youtube.com/watch?v=-HjLd_3ahCw) - [Ray: A Distributed Execution Framework for AI \| SciPy 
2018](https://www.youtube.com/watch?v=D_oz7E4v-U0) - [Ray: A Cluster Computing Engine for Reinforcement Learning Applications \| Spark Summit](https://www.youtube.com/watch?v=xadZRRB_TeI) - [RLlib: Ray Reinforcement Learning Library \| RISECamp 2018](https://www.youtube.com/watch?v=eeRGORQthaQ) - [Enabling Composition in Distributed Reinforcement Learning \| Spark Summit 2018](https://www.youtube.com/watch?v=jAEPqjkjth4) - [Tune: Distributed Hyperparameter Search \| RISECamp 2018](https://www.youtube.com/watch?v=38Yd_dXW51Q) ### Slides - [Talk given at UC Berkeley DS100](https://docs.google.com/presentation/d/1sF5T_ePR9R6fAi2R6uxehHzXuieme63O2n_5i9m7mVE/edit?usp=sharing) - [Talk given in October 2019](https://docs.google.com/presentation/d/13K0JsogYQX3gUCGhmQ1PQ8HILwEDFysnq0cI2b88XbU/edit?usp=sharing) - [Talk given at RISECamp 2019](https://docs.google.com/presentation/d/1v3IldXWrFNMK-vuONlSdEuM82fuGTrNUDuwtfx4axsQ/edit?usp=sharing) ### Papers - [Ray 2.0 Architecture white paper](https://docs.google.com/document/d/1tBw9A4j62ruI5omIJbMxly-la5w4q_TjyJgJL_jN2fI/preview) - [Ray 1.0 Architecture white paper (old)](https://docs.google.com/document/d/1lAy0Owi-vPz2jEqBSaHNQcy2IBSDEHyXNOQZlGuj93c/preview) - [Exoshuffle: large-scale data shuffle in Ray](https://arxiv.org/abs/2203.05072) - [RLlib paper](https://arxiv.org/abs/1712.09381) - [RLlib flow paper](https://arxiv.org/abs/2011.12719) - [Tune paper](https://arxiv.org/abs/1807.05118) - [Ray paper (old)](https://arxiv.org/abs/1712.05889) - [Ray HotOS paper (old)](https://arxiv.org/abs/1703.03924) If you encounter technical issues, post on the [Ray discussion forum](https://discuss.ray.io/). For general questions, announcements, and community discussions, join the [Ray community on Slack](https://www.ray.io/join-slack). --- (overview-overview)= # Overview Ray is an open-source unified framework for scaling AI and Python applications like machine learning. It provides the compute layer for parallel processing so that you don’t need to be a distributed systems expert. Ray minimizes the complexity of running your distributed individual workflows and end-to-end machine learning workflows with these components: * Scalable libraries for common machine learning tasks such as data preprocessing, distributed training, hyperparameter tuning, reinforcement learning, and model serving. * Pythonic distributed computing primitives for parallelizing and scaling Python applications. * Integrations and utilities for integrating and deploying a Ray cluster with existing tools and infrastructure such as Kubernetes, AWS, GCP, and Azure. For data scientists and machine learning practitioners, Ray lets you scale jobs without needing infrastructure expertise: * Easily parallelize and distribute ML workloads across multiple nodes and GPUs. * Leverage the ML ecosystem with native and extensible integrations. For ML platform builders and ML engineers, Ray: * Provides compute abstractions for creating a scalable and robust ML platform. * Provides a unified ML API that simplifies onboarding and integration with the broader ML ecosystem. * Reduces friction between development and production by enabling the same Python code to scale seamlessly from a laptop to a large cluster. For distributed systems engineers, Ray automatically handles key processes: * Orchestration: Managing the various components of a distributed system. * Scheduling: Coordinating when and where tasks are executed. * Fault tolerance: Ensuring tasks complete regardless of inevitable points of failure. 
* Auto-scaling: Adjusting the number of resources allocated to dynamic demand. ## What you can do with Ray These are some common ML workloads that individuals, organizations, and companies leverage Ray to build their AI applications: * [Batch inference on CPUs and GPUs](project:#ref-use-cases-batch-infer) * [Model serving](project:#ref-use-cases-model-serving) * [Distributed training of large models](project:#ref-use-cases-distributed-training) * [Parallel hyperparameter tuning experiments](project:#ref-use-cases-hyperparameter-tuning) * [Reinforcement learning](project:#ref-use-cases-reinforcement-learning) * [ML platform](project:#ref-use-cases-ml-platform) ## Ray framework || |:--:| |Stack of Ray libraries - unified toolkit for ML workloads.| Ray's unified compute framework consists of three layers: 1. **Ray AI Libraries**--An open-source, Python, domain-specific set of libraries that equip ML engineers, data scientists, and researchers with a scalable and unified toolkit for ML applications. 2. **Ray Core**--An open-source, Python, general purpose, distributed computing library that enables ML engineers and Python developers to scale Python applications and accelerate machine learning workloads. 3. **Ray Clusters**--A set of worker nodes connected to a common Ray head node. Ray clusters can be fixed-size, or they can autoscale up and down according to the resources requested by applications running on the cluster. ```{eval-rst} .. grid:: 1 2 3 3 :gutter: 1 :class-container: container pb-3 .. grid-item-card:: **Scale machine learning workloads** ^^^ Build ML applications with a toolkit of libraries for distributed :doc:`data processing <../data/data>`, :doc:`model training <../train/train>`, :doc:`tuning <../tune/index>`, :doc:`reinforcement learning <../rllib/index>`, :doc:`model serving <../serve/index>`, and :doc:`more <../ray-more-libs/index>`. +++ .. button-ref:: libraries-quickstart :color: primary :outline: :expand: Ray AI Libraries .. grid-item-card:: **Build distributed applications** ^^^ Build and run distributed applications with a :doc:`simple and flexible API <../ray-core/walkthrough>`. :doc:`Parallelize <../ray-core/walkthrough>` single machine code with little to zero code changes. +++ .. button-ref:: ../ray-core/walkthrough :color: primary :outline: :expand: Ray Core .. grid-item-card:: **Deploy large-scale workloads** ^^^ Deploy workloads on :doc:`AWS, GCP, Azure <../cluster/getting-started>` or :doc:`on premise <../cluster/vms/user-guides/launching-clusters/on-premises>`. Use Ray cluster managers to run Ray on existing :doc:`Kubernetes <../cluster/kubernetes/index>`, :doc:`YARN <../cluster/vms/user-guides/community/yarn>`, or :doc:`Slurm <../cluster/vms/user-guides/community/slurm>` clusters. +++ .. button-ref:: ../cluster/getting-started :color: primary :outline: :expand: Ray Clusters ``` Each of [Ray's](../ray-air/getting-started) five native libraries distributes a specific ML task: - [Data](../data/data): Scalable, framework-agnostic data loading and transformation across training, tuning, and prediction. - [Train](../train/train): Distributed multi-node and multi-core model training with fault tolerance that integrates with popular training libraries. - [Tune](../tune/index): Scalable hyperparameter tuning to optimize model performance. - [Serve](../serve/index): Scalable and programmable serving to deploy models for online inference, with optional microbatching to improve performance. 
- [RLlib](../rllib/index): Scalable distributed reinforcement learning workloads. Ray's libraries are for both data scientists and ML engineers. For data scientists, these libraries can be used to scale individual workloads and end-to-end ML applications. For ML engineers, these libraries provide scalable platform abstractions that can be used to easily onboard and integrate tooling from the broader ML ecosystem. For custom applications, the [Ray Core](../ray-core/walkthrough) library enables Python developers to easily build scalable, distributed systems that can run on a laptop, cluster, cloud, or Kubernetes. It's the foundation that Ray AI libraries and third-party integrations (Ray ecosystem) are built on. Ray runs on any machine, cluster, cloud provider, and Kubernetes, and features a growing [ecosystem of community integrations](ray-libraries). --- (security)= # Security ```{toctree} :hidden: token-auth ``` Ray is an easy-to-use framework to run arbitrary code across one or more nodes in a Ray Cluster. Ray provides fault-tolerance, optimized scheduling, task orchestration, and auto-scaling to run a given workload. To achieve performant and distributed workloads, Ray components require intra-cluster communication. This communication includes central tenets like distributed memory and node-heartbeats, as well as auxiliary functions like metrics and logs. Ray leverages gRPC for a majority of this communication. Ray offers additional services to improve the developer experience. These services include Ray Dashboard (to allow for cluster introspection and debugging), Ray Jobs (hosted alongside the Dashboard, which services Ray Job submissions), and Ray Client (to allow for local, interactive development with a remote cluster). These services provide complete access to the Ray Cluster and the underlying compute resources. :::{admonition} Ray allows any clients to run arbitrary code. Be extremely careful about what is allowed to access your Ray Cluster :class: caution If you expose these services (Ray Dashboard, Ray Jobs, Ray Client), anybody who can access the associated ports can execute arbitrary code on your Ray Cluster. This can happen: * Explicitly: By submitting a Ray Job, or using the Ray Client * Indirectly: By calling the Dashboard REST APIs of these services * Implicitly: Ray extensively uses cloudpickle for serialization of arbitrary Python objects. See [the pickle documentation](https://docs.python.org/3/library/pickle.html) for more details on Pickle's security model. The Ray Dashboard, Ray Jobs and Ray Client are developer tools that you should only use with the necessary access controls in place to restrict access to trusted parties only. ::: ## Personas When considering the security responsibilities of running Ray, think about the different personas interacting with Ray. * **Ray Developers** write code that relies on Ray. They either run a single-node Ray Cluster locally or multi-node clusters remotely on provided compute infrastructure. * **Platform providers** provide the compute environment on which **Developers** run Ray. * **Users** interact with the output of Ray-powered applications. ## Best practices **Security and isolation must be enforced outside of the Ray Cluster.** Ray expects to run in a safe network environment and to act upon trusted code. Developers and platform providers must maintain the following invariants to ensure the safe operation of Ray clusters. 
### Deploy Ray clusters in a controlled network environment * Network traffic between core Ray components and additional Ray components should always be in a controlled, isolated network. Access to additional services should be gated with strict network controls and/or external authentication/authorization proxies. * gRPC communication can be encrypted with TLS, but it's not a replacement for network isolation. * Platform providers are responsible for ensuring that Ray runs in sufficiently controlled network environments and that developers can access features like Ray Dashboard in a secure manner. ### Only execute trusted code within Ray * Ray faithfully executes code that is passed to it – Ray doesn’t differentiate between a tuning experiment, a rootkit install, or an S3 bucket inspection. * Ray developers are responsible for building their applications with this understanding in mind. ### Enforce isolation outside of Ray with multiple Ray clusters * If workloads require isolation from each other, use separate, isolated Ray clusters. Ray can schedule multiple distinct Jobs in a single Cluster, but doesn't attempt to enforce isolation between them. Similarly, Ray doesn't implement access controls for developers interacting with a given cluster. * Ray developers are responsible for determining which applications need to be separated and platform providers are responsible for providing this isolation. ### Enable token authentication * Starting in Ray 2.52.0, Ray supports built-in token authentication that provides an additional measure to prevent unauthorized access to the cluster (including untrusted code execution). See {ref}`Ray token authentication ` for details. * Token authentication is not an alternative to deploying Ray clusters in a controlled network environment. Rather, it is a defense-in-depth measure that adds to network-level security. --- (token-auth)= # Ray token authentication Enable token authentication in Ray to secure cluster access and prevent unauthorized use. This guide explains how authentication works and how to set it up for different deployment scenarios. :::{note} Token authentication is available in Ray 2.52.0 or later. ::: ## How Ray token authentication works To enable token authentication, set the environment variable `RAY_AUTH_MODE=token` before starting your Ray cluster. When you start a Ray cluster with authentication enabled, all external Ray APIs and internal communications are authenticated using the token as a shared secret. The process for generating and configuring authentication tokens differs depending on how you launch your Ray cluster. When you start a local instance of Ray using `ray.init()` with token authentication enabled, Ray automatically generates and uses a token. Other cluster launching methods require that you generate a token before starting the cluster. You can `ray get-auth-token [--generate]` to retrieve your existing token or generate a new one. :::{note} Authentication is disabled by default in Ray 2.52.0. Ray plans to enable token authentication by default in a future release. We recommend enabling token authentication to protect your cluster from unauthorized access. ::: ### What token does Ray use? You can configure authentication tokens using environment variables or the default path. We recommend using the default path when possible to reduce the chances of committing the token to version control. Ray checks for tokens in the following order, highest priority first: 1. `RAY_AUTH_TOKEN` environment variable. 2. 
`RAY_AUTH_TOKEN_PATH` environment variable, which provides a path to a token file. 3. The default location, `~/.ray/auth_token`. When managing multiple tokens, we recommend storing them in local files and using the `RAY_AUTH_TOKEN_PATH` environment variable rather than setting the `RAY_AUTH_TOKEN` value directly to avoid exposing the token to other code that reads environment variables. ## Security considerations Ray transmits the authentication token as an HTTP header, which is transmitted in plaintext when using insecure `http` connections. We recommend enabling some form of encryption whenever exposing a Ray cluster over the network. Consider the following: - **Local development**: Traffic doesn't leave your machine, so no additional security is needed. - **SSH tunneling**: Use SSH tunneling/port forwarding via the `ray dashboard` command or `kubectl port-forward`. - **TLS termination**: Deploy a TLS proxy in front of your Ray cluster. - **VPN/Overlay networks**: Use network-level encryption for all traffic into and within the cluster. :::{warning} Don't expose Ray clusters directly to the internet without encryption. Tokens alone don't protect against network eavesdropping. ::: Tokens have the following properties: - Ray stores tokens by default in plaintext at `~/.ray/auth_token`. - Use file permissions to keep token files secure, especially in shared environments. - Don't commit tokens to version control. - Tokens don't expire. Your local token remains valid unless you delete and regenerate the token. Ray clusters use the same token for the lifetime of the cluster. ## Configure token authentication for local development To enable authentication on your local machine for development, set the `RAY_AUTH_MODE=token` environment variable in your shell or IDE. You can persist this configuration in your `.bashrc` file or similar. ### Local development with ray.init() When you run a script that starts a local Ray instance with `ray.init()` after setting `RAY_AUTH_MODE=token` as an environment variable, Ray handles authentication automatically: - If a token doesn't already exist at `~/.ray/auth_token`, Ray generates a token and saves it to the file. A log message displays to confirm token creation. - If a token already exists at `~/.ray/auth_token`, Ray reuses the existing token automatically. The following example shows what happens on the first run: ```bash $ export RAY_AUTH_MODE=token $ python -c "import ray;ray.init()" ``` On the first run, this command (or any other script that initializes Ray) logs a line similar to the following: ```bash Generated new authentication token and saved to /Users//.ray/auth_token ``` ### Local development with ray start When you use `ray start --head` to start a local cluster after setting `RAY_AUTH_MODE=token` as an environment variable, you need to generate a token first: - If no token exists, `ray start` shows an error message with instructions. - Run `ray get-auth-token --generate` to generate a new token at the path `~/.ray/auth_token`. - Once generated, Ray uses the token every time you run `ray start`. The following example demonstrates this flow: ```bash # Set the environment variable. $ export RAY_AUTH_MODE=token # First attempt - an error is raised if no token exists. $ ray start --head ... ray.exceptions.AuthenticationError: Token authentication is enabled but no authentication token was found. 
Ensure that the token for the cluster is available in a local file (e.g., ~/.ray/auth_token or via RAY_AUTH_TOKEN_PATH) or as the `RAY_AUTH_TOKEN` environment variable. To generate a token for local development, use `ray get-auth-token --generate` For remote clusters, ensure that the token is propagated to all nodes of the cluster when token authentication is enabled. For more information, see: https://docs.ray.io/en/latest/ray-security/token-auth.html # Generate a token. $ ray get-auth-token --generate # Start local cluster again - works now. $ ray start --head ... Ray runtime started. ... ``` ## Configure token authentication for remote clusters When working with remote clusters you must ensure that all nodes in the remote cluster have token authentication enabled and access to the same token. Any clients that interact with the remote cluster, including your local machine, must also have the token configured. The following sections provide an overview of configuring this using the Ray cluster launcher and self-managed clusters. For instructions on configuring token authentication with KubeRay, see {ref}`Token authentication with KubeRay `. :::{note} If you're using a hosted version of Ray, contact your customer support for authentication questions. Anyscale manages authentication automatically for users. ::: ### Ray clusters on remote virtual machines This section provides instructions for using `ray up` to launch a remote cluster on virtual machines with token authentication enabled. #### Step 1: Generate a token You must generate a token on your local machine. Run the following command to generate a token: ```bash ray get-auth-token --generate ``` This command generates a token on your local machine at the path `~/.ray/auth_token` and outputs it to the terminal. #### Step 2: Specify token authentication values in your cluster configuration YAML To enable and configure token authentication, you add the following setting to your cluster configuration YAML: - Use `file_mounts` to mount your locally generated token file to all nodes in the cluster. - Use `initialization_commands` to set the environment variable `RAY_AUTH_MODE=token` for all virtual machines in the cluster. The following is an example cluster configuration YAML that includes the `file_mounts` and `initialization_commands` settings required to enable token authentication and use the token generated on your local machine at the path `~/.ray/auth_token`: ```yaml cluster_name: my-cluster-name provider: type: aws region: us-west-2 max_workers: 2 available_node_types: ray.head.default: resources: {} node_config: InstanceType: m5.large ray.worker.default: min_workers: 2 max_workers: 2 resources: {} node_config: InstanceType: m5.large # Mount a locally generated token file to all nodes in the Ray cluster. file_mounts: { "/home/ubuntu/.ray/auth_token": "~/.ray/auth_token", } # Set the RAY_AUTH_MODE environment variable for all shell sessions on the cluster. initialization_commands: - echo "export RAY_AUTH_MODE=token" >> ~/.bashrc ``` #### Step 3: Launch the Ray cluster Run the following command to launch a Ray cluster using your cluster configuration YAML: ```bash ray up cluster.yaml ``` #### Step 4: Configure the Ray dashboard and port forwarding Connecting to the Ray dashboard configures secure SSH port forwarding between your local machine and the Ray cluster. Complete this step even if you don't plan to use the dashboard for monitoring. 
Run the following command to set up port forwarding for the Ray dashboard port (`8265` by default): ```bash ray dashboard cluster.yaml ``` Upon opening the dashboard, a prompt displays requesting your authentication token. To display the token in plaintext, you can run the following on your local machine: ```bash export RAY_AUTH_MODE=token ray get-auth-token ``` Paste the token in the prompt and click **Submit**. The token gets stored as a cookie for a maximum of 30 days. When you open the dashboard for a cluster that uses a different token, a prompt appears to enter the token for that cluster. #### Step 5: Submit a Ray job You can submit a Ray job with token authentication using secure SSH port forwarding: ```bash export RAY_AUTH_MODE=token ray job submit --working-dir . -- python script.py ``` ### Self-managed clusters If you have a custom deployment where you run `ray start` on multiple nodes, you can use token authentication by generating a token and distributing it to all nodes in the cluster, as shown in the following steps. #### Step 1: Generate a token Generate a token on a single machine using the following command: ```bash ray get-auth-token --generate ``` :::{note} Any machine that needs to interact with the cluster must have the token used to configure authentication. ::: #### Step 2: Copy the token to all nodes Copy the same token to each node in your Ray cluster. For example, use `scp` to copy the token: ```bash scp ~/.ray/auth_token user@node1:~/.ray/auth_token; scp ~/.ray/auth_token user@node2:~/.ray/auth_token; ``` #### Step 3: Start Ray with token authentication You must set the environment variable `RAY_AUTH_MODE=token` on each node before running `ray start`, as in the following example: ```bash ssh user@node1 "RAY_AUTH_MODE=token ray start --head"; ssh user@node2 "RAY_AUTH_MODE=token ray start --address=node1:6379"; ``` ## Troubleshooting token authentication issues You might encounter the following problems with token authentication. ### Token authentication isn't enabled Make sure you've set the `RAY_AUTH_MODE=token` environment variable in the environment where you're launching Ray *and* in any shell where you are using a client to connect to Ray. ### Authentication token not found If running locally, run `ray get-auth-token --generate` to create a token on your local machine. If running a remote cluster, make sure you've followed instructions to copy your token into the cluster. ### Invalid authentication token Any client that tries to interact with a Ray cluster must have the same token as the Ray cluster. If the token on your local machine doesn't match the token in a Ray cluster, you can use the `RAY_AUTH_TOKEN_PATH` or `RAY_AUTH_TOKEN` environment variable to configure a token for interacting with that cluster. You must work with the creator of the cluster to get the token. :::{note} It's possible to stop and then restart a cluster using a different token. All clients connecting to the cluster must have the updated token to connect successfully. ::: ## Next steps - See {ref}`overall security guidelines `. - Read about {ref}`KubeRay authentication ` for Kubernetes-specific configuration. --- (serve-advanced-autoscaling)= # Advanced Ray Serve Autoscaling This guide goes over more advanced autoscaling parameters in [autoscaling_config](../api/doc/ray.serve.config.AutoscalingConfig.rst) and an advanced model composition example. 
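For orientation, every parameter described below is set through a deployment's `autoscaling_config`. The following is a minimal sketch with illustrative values, not tuned recommendations:

```python
from ray import serve

@serve.deployment(
    max_ongoing_requests=5,
    autoscaling_config={
        "target_ongoing_requests": 2,
        "min_replicas": 1,
        "max_replicas": 10,
        "upscale_delay_s": 30,
        "downscale_delay_s": 600,
    },
)
class MyModel:
    async def __call__(self, request) -> str:
        return "ok"

app = MyModel.bind()
```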
(serve-autoscaling-config-parameters)= ## Autoscaling config parameters In this section, we go into more detail about Serve autoscaling concepts as well as how to set your autoscaling config. ### [Required] Define the steady state of your system To define what the steady state of your deployments should be, set values for `target_ongoing_requests` and `max_ongoing_requests`. #### **[`target_ongoing_requests`](../api/doc/ray.serve.config.AutoscalingConfig.target_ongoing_requests.rst) [default=2]** :::{note} The default for `target_ongoing_requests` changed from 1.0 to 2.0 in Ray 2.32.0. You can continue to set it manually to override the default. ::: Serve scales the number of replicas for a deployment up or down based on the average number of ongoing requests per replica. Specifically, Serve compares the *actual* number of ongoing requests per replica with the target value you set in the autoscaling config and makes upscale or downscale decisions from that. Set the target value with `target_ongoing_requests`, and Serve attempts to ensure that each replica has roughly that number of requests being processed and waiting in the queue. Always load test your workloads. For example, if the use case is latency sensitive, you can lower the `target_ongoing_requests` number to maintain high performance. Benchmark your application code and set this number based on an end-to-end latency objective. :::{note} As an example, suppose you have two replicas of a synchronous deployment that has 100ms latency, serving a traffic load of 30 QPS. Then Serve assigns requests to replicas faster than the replicas can finish processing them; more and more requests queue up at the replica (these requests are "ongoing requests") as time progresses, and then the average number of ongoing requests at each replica steadily increases. Latency also increases because new requests have to wait for old requests to finish processing. If you set `target_ongoing_requests = 1`, Serve detects a higher than desired number of ongoing requests per replica, and adds more replicas. At 3 replicas, your system would be able to process 30 QPS with 1 ongoing request per replica on average. ::: #### **`max_ongoing_requests` [default=5]** :::{note} The default for `max_ongoing_requests` changed from 100 to 5 in Ray 2.32.0. You can continue to set it manually to override the default. ::: There is also a maximum queue limit that proxies respect when assigning requests to replicas. Define the limit with `max_ongoing_requests`. Set `max_ongoing_requests` to ~20 to 50% higher than `target_ongoing_requests`. - Setting it too low can throttle throughput. Instead of being forwarded to replicas for concurrent execution, requests will tend to queue up at the proxy, waiting for replicas to finish processing existing requests. :::{note} `max_ongoing_requests` should be tuned higher especially for lightweight requests, else the overall throughput will be impacted. ::: - Setting it too high can lead to imbalanced routing. Concretely, this can lead to very high tail latencies during upscale, because when the autoscaler is scaling a deployment up due to a traffic spike, most or all of the requests might be assigned to the existing replicas before the new replicas are started. ### [Required] Define upper and lower autoscaling limits To use autoscaling, you need to define the minimum and maximum number of resources allowed for your system. 
* **[`min_replicas`](../api/doc/ray.serve.config.AutoscalingConfig.min_replicas.rst) [default=1]**: This is the minimum number of replicas for the deployment. If you want to ensure your system can deal with a certain level of traffic at all times, set `min_replicas` to a positive number. On the other hand, if you anticipate periods of no traffic and want to scale to zero to save cost, set `min_replicas = 0`. Note that setting `min_replicas = 0` causes higher tail latencies; when you start sending traffic, the deployment scales up, and there will be a cold start time as Serve waits for replicas to be started to serve the request. * **[`max_replicas`](../api/doc/ray.serve.config.AutoscalingConfig.max_replicas.rst) [default=1]**: This is the maximum number of replicas for the deployment. This should be greater than `min_replicas`. Ray Serve Autoscaling relies on the Ray Autoscaler to scale up more nodes when the currently available cluster resources (CPUs, GPUs, etc.) are not enough to support more replicas. * **[`initial_replicas`](../api/doc/ray.serve.config.AutoscalingConfig.initial_replicas.rst)**: This is the number of replicas that are started initially for the deployment. This defaults to the value for `min_replicas`. ### [Optional] Define how the system reacts to changing traffic Given a steady stream of traffic and appropriately configured `min_replicas` and `max_replicas`, the steady state of your system is essentially fixed for a chosen configuration value for `target_ongoing_requests`. Before reaching steady state, however, your system is reacting to traffic shifts. How you want your system to react to changes in traffic determines how you want to set the remaining autoscaling configurations. * **[`upscale_delay_s`](../api/doc/ray.serve.config.AutoscalingConfig.upscale_delay_s.rst) [default=30s]**: This defines how long Serve waits before scaling up the number of replicas in your deployment. In other words, this parameter controls the frequency of upscale decisions. If the replicas are *consistently* serving more requests than desired for an `upscale_delay_s` number of seconds, then Serve scales up the number of replicas based on aggregated ongoing requests metrics. For example, if your service is likely to experience bursts of traffic, you can lower `upscale_delay_s` so that your application can react quickly to increases in traffic. Ray Serve allows you to use different delays for different downscaling scenarios, providing more granular control over when replicas are removed. This is particularly useful when you want different behavior for scaling down to zero versus scaling down to a non-zero number of replicas. * **[`downscale_delay_s`](../api/doc/ray.serve.config.AutoscalingConfig.downscale_delay_s.rst) [default=600s]**: This defines how long Serve waits before scaling down the number of replicas in your deployment. If the replicas are *consistently* serving fewer requests than desired for a `downscale_delay_s` number of seconds, Serve scales down the number of replicas based on aggregated ongoing requests metrics. This delay applies to all downscaling decisions except for the optional 1→0 transition (see below). For example, if your application initializes slowly, you can increase `downscale_delay_s` to make downscaling happen more infrequently and avoid reinitialization costs when the application needs to upscale again. 
* **[`downscale_to_zero_delay_s`](../api/doc/ray.serve.config.AutoscalingConfig.downscale_to_zero_delay_s.rst) [Optional]**: This defines how long Serve waits before scaling from one replica down to zero (only applies when `min_replicas = 0`). If not specified, the 1→0 transition uses the `downscale_delay_s` value. This is useful when you want more conservative scale-to-zero behavior. For example, you might set `downscale_delay_s = 300` for regular downscaling but `downscale_to_zero_delay_s = 1800` to wait 30 minutes before scaling to zero, avoiding cold starts for brief periods of inactivity.

* **[`upscale_smoothing_factor`](../api/doc/ray.serve.config.AutoscalingConfig.upscale_smoothing_factor.rst) [default_value=1.0] (DEPRECATED)**: This parameter is renamed to `upscaling_factor`. `upscale_smoothing_factor` will be removed in a future release.

* **[`downscale_smoothing_factor`](../api/doc/ray.serve.config.AutoscalingConfig.downscale_smoothing_factor.rst) [default_value=1.0] (DEPRECATED)**: This parameter is renamed to `downscaling_factor`. `downscale_smoothing_factor` will be removed in a future release.

* **[`upscaling_factor`](../api/doc/ray.serve.config.AutoscalingConfig.upscaling_factor.rst) [default_value=1.0]**: The multiplicative factor to amplify or moderate each upscaling decision. For example, when the application has high traffic volume in a short period of time, you can increase `upscaling_factor` to scale up the resource quickly. This parameter acts like a "gain" factor that amplifies the response of the autoscaling algorithm.

* **[`downscaling_factor`](../api/doc/ray.serve.config.AutoscalingConfig.downscaling_factor.rst) [default_value=1.0]**: The multiplicative factor to amplify or moderate each downscaling decision. For example, if you want your application to be less sensitive to drops in traffic and scale down more conservatively, you can decrease `downscaling_factor` to slow down the pace of downscaling.

* **[`metrics_interval_s`](../api/doc/ray.serve.config.AutoscalingConfig.metrics_interval_s.rst) [default_value=10]**: This controls how often each replica and handle reports its current number of ongoing requests to the autoscaler. Note that Ray plans to replace this deployment-level config with a global, cross-application config in a future release.

  :::{note}
  If metrics are reported infrequently, Ray Serve can take longer to notice a change in autoscaling metrics, so scaling can start later even if your delays are short. For example, if you set `upscale_delay_s = 3` but metrics are pushed every 10 seconds, Ray Serve might not see a change until the next push, so scaling up can be limited to about once every 10 seconds.
  :::

* **[`look_back_period_s`](../api/doc/ray.serve.config.AutoscalingConfig.look_back_period_s.rst) [default_value=30]**: This is the window over which the average number of ongoing requests per replica is calculated.

* **[`aggregation_function`](../api/doc/ray.serve.config.AutoscalingConfig.aggregation_function.rst) [default_value="mean"]**: This controls how metrics are aggregated over the `look_back_period_s` time window. The aggregation function determines how Ray Serve combines multiple metric measurements into a single value for autoscaling decisions. Supported values:

- `"mean"` (default): Uses time-weighted average of metrics. This provides smooth scaling behavior that responds to sustained traffic patterns.
- `"max"`: Uses the maximum metric value observed. This makes autoscaling more sensitive to spikes, scaling up quickly when any replica experiences high load.
- `"min"`: Uses the minimum metric value observed. This results in more conservative scaling behavior. For most workloads, the default `"mean"` aggregation provides the best balance. Use `"max"` if you need to react quickly to traffic spikes, or `"min"` if you prefer conservative scaling that avoids rapid fluctuations. ### How autoscaling metrics work Understanding how metrics flow through the autoscaling system helps you configure the parameters effectively. The metrics pipeline involves several stages, each with its own timing parameters: ``` ┌──────────────────────────────────────────────────────────────────────────┐ │ Metrics Pipeline Overview │ ├──────────────────────────────────────────────────────────────────────────┤ │ │ │ Replicas/Handles Controller Autoscaling Policy │ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │ │ Record │ Push │ Receive │ Decide │ Policy │ │ │ │ Metrics │────────────>│ Metrics │──────────>│ Runs │ │ │ │ (10s) │ (10s) │ │ (0.1s) │ │ │ │ └──────────┘ │ Aggregate│ └──────────┘ │ │ │ (30s) │ │ │ └──────────┘ │ │ │ └──────────────────────────────────────────────────────────────────────────┘ ``` #### Stage 1: Metric recording Replicas and deployment handles continuously record autoscaling metrics: - **What**: Number of ongoing requests (queued + running) - **Frequency**: Every 10s (configurable via [`metrics_interval_s`](../api/doc/ray.serve.config.AutoscalingConfig.metrics_interval_s.rst)) - **Storage**: Metrics are stored locally as a timeseries #### Stage 2: Metric pushing Periodically, replicas and handles push their metrics to the controller: - **Frequency**: Every 10s (configurable via `RAY_SERVE_REPLICA_AUTOSCALING_METRIC_PUSH_INTERVAL_S` and `RAY_SERVE_HANDLE_AUTOSCALING_METRIC_PUSH_INTERVAL_S`) - **Data sent**: Both raw timeseries data and pre-aggregated metrics - **Raw timeseries**: Data points are clipped to the [`look_back_period_s`](../api/doc/ray.serve.config.AutoscalingConfig.look_back_period_s.rst) window before sending (only recent measurements within the window are sent) - **Pre-aggregated metrics**: A simple average computed over the [`look_back_period_s`](../api/doc/ray.serve.config.AutoscalingConfig.look_back_period_s.rst) window at the replica/handle - **Controller usage**: The controller decides which data to use based on the `RAY_SERVE_AGGREGATE_METRICS_AT_CONTROLLER` setting (see Stage 3 below) #### Stage 3: Metric aggregation The controller aggregates metrics to compute total ongoing requests across all replicas. Ray Serve supports two aggregation modes (controlled by `RAY_SERVE_AGGREGATE_METRICS_AT_CONTROLLER`): **Simple mode (default - `RAY_SERVE_AGGREGATE_METRICS_AT_CONTROLLER=0`):** - **Input**: Pre-aggregated simple averages from replicas/handles (already clipped to [`look_back_period_s`](../api/doc/ray.serve.config.AutoscalingConfig.look_back_period_s.rst)) - **Method**: Sums the pre-aggregated values from all sources. Each component computes a simple average (arithmetic mean) before sending. - **Output**: Single value representing total ongoing requests - **Characteristics**: Lightweight and works well for most workloads. However, because it uses simple averages rather than time-weighted averages, it can be less accurate when replicas have different metric reporting intervals or when metrics arrive at different times. 
**Aggregate mode (experimental - `RAY_SERVE_AGGREGATE_METRICS_AT_CONTROLLER=1`):**

- **Input**: Raw timeseries data from replicas/handles (already clipped to [`look_back_period_s`](../api/doc/ray.serve.config.AutoscalingConfig.look_back_period_s.rst))
- **Method**: Time-weighted aggregation using the [`aggregation_function`](../api/doc/ray.serve.config.AutoscalingConfig.aggregation_function.rst) (mean, max, or min). Uses an instantaneous merge approach that treats metrics as right-continuous step functions.
- **Output**: Single value representing total ongoing requests
- **Characteristics**: Provides more mathematically accurate aggregation, especially when replicas report metrics at different intervals or you need precise time-weighted averages. The trade-off is increased controller overhead.

:::{note}
The [`aggregation_function`](../api/doc/ray.serve.config.AutoscalingConfig.aggregation_function.rst) parameter only applies in aggregate mode. In simple mode, the aggregation is always a sum of the pre-computed simple averages.
:::

:::{note}
The long-term plan is to deprecate simple mode in favor of aggregate mode. Aggregate mode provides more accurate metrics aggregation and will become the default in a future release. Consider testing aggregate mode (`RAY_SERVE_AGGREGATE_METRICS_AT_CONTROLLER=1`) in your deployments to prepare for this transition.
:::

#### Stage 4: Policy execution

The autoscaling policy runs frequently to make scaling decisions; see [Custom policy for deployment](#custom-policy-for-deployment) for details on implementing custom scaling logic:

- **Frequency**: Every 0.1s (configurable via `RAY_SERVE_CONTROL_LOOP_INTERVAL_S`)
- **Input**: [`AutoscalingContext`](../api/doc/ray.serve.config.AutoscalingContext.rst)
- **Output**: Tuple of `(target_replicas, updated_policy_state)`

#### Timing parameter interactions

The timing parameters interact in important ways:

**Recording vs pushing intervals:**
- The push interval should be greater than or equal to the recording interval.
- The recording interval (10s) determines the granularity of the data.
- The push interval (10s) determines how fresh the controller's data is.
- With default values, each push contains 1 data point (10s ÷ 10s).

**Push interval vs look-back period:**
- [`look_back_period_s`](../api/doc/ray.serve.config.AutoscalingConfig.look_back_period_s.rst) (30s) should be greater than the push interval (10s).
- If the look-back period is too short, you won't have enough data for stable decisions.
- If the look-back period is too long, autoscaling becomes less responsive.

**Push interval vs control loop:**
- The control loop (0.1s) runs much faster than metrics arrive (10s).
- Most control loop iterations reuse existing metrics.
- New scaling decisions only happen when fresh metrics arrive.

**Push interval vs upscale/downscale delays:**
- Delays control when Ray Serve applies a scale-up or scale-down decision.
- The metrics push interval controls how quickly Ray Serve receives fresh metrics.
- If the push interval is less than the delay, Ray Serve can use multiple metric updates before it scales.
- Example: pushing every 10s with `upscale_delay_s = 20` means up to 2 new metric updates arrive before scaling.

**Recommendation:** Keep the default values unless you have specific needs. If you need faster autoscaling, decrease the push intervals first, then adjust the delays.

### Environment variables

Several environment variables control autoscaling behavior at a lower level.
These variables affect metrics collection and the control loop timing: #### Control loop and timeout settings * **`RAY_SERVE_CONTROL_LOOP_INTERVAL_S`** (default: 0.1s): How often the Ray Serve controller runs the autoscaling control loop. Your autoscaling policy function executes at this frequency. The default value of 0.1s means policies run approximately 10 times per second. * **`RAY_SERVE_RECORD_AUTOSCALING_STATS_TIMEOUT_S`** (default: 10.0s): Maximum time allowed for the `record_autoscaling_stats()` method to complete in custom metrics collection. If this timeout is exceeded, the metrics collection fails and a warning is logged. * **`RAY_SERVE_MIN_HANDLE_METRICS_TIMEOUT_S`** (default: 10.0s): Minimum timeout for handle metrics collection. The system uses the maximum of this value and `2 * `[`metrics_interval_s`](../api/doc/ray.serve.config.AutoscalingConfig.metrics_interval_s.rst) to determine when to drop stale handle metrics. #### Advanced feature flags * **`RAY_SERVE_AGGREGATE_METRICS_AT_CONTROLLER`** (default: false): Enables an experimental metrics aggregation mode where the controller aggregates raw timeseries data instead of using pre-aggregated metrics. This mode provides more accurate time-weighted averages but may increase controller overhead. See Stage 3 in "How autoscaling metrics work" for details. ## Model composition example Determining the autoscaling configuration for a multi-model application requires understanding each deployment's scaling requirements. Every deployment has a different latency and differing levels of concurrency. As a result, finding the right autoscaling config for a model-composition application requires experimentation. This example is a simple application with three deployments composed together to build some intuition about multi-model autoscaling. Assume these deployments: * `HeavyLoad`: A mock 200ms workload with high CPU usage. * `LightLoad`: A mock 100ms workload with high CPU usage. * `Driver`: A driver deployment that fans out to the `HeavyLoad` and `LightLoad` deployments and aggregates the two outputs. ### Attempt 1: One `Driver` replica First consider the following deployment configurations. Because the driver deployment has low CPU usage and is only asynchronously making calls to the downstream deployments, allocating one fixed `Driver` replica is reasonable. 
::::{tab-set}

:::{tab-item} Driver
```yaml
- name: Driver
  num_replicas: 1
  max_ongoing_requests: 200
```
:::

:::{tab-item} HeavyLoad
```yaml
- name: HeavyLoad
  max_ongoing_requests: 3
  autoscaling_config:
    target_ongoing_requests: 1
    min_replicas: 0
    initial_replicas: 0
    max_replicas: 200
    upscale_delay_s: 3
    downscale_delay_s: 60
    upscaling_factor: 0.3
    downscaling_factor: 0.3
    metrics_interval_s: 2
    look_back_period_s: 10
```
:::

:::{tab-item} LightLoad
```yaml
- name: LightLoad
  max_ongoing_requests: 3
  autoscaling_config:
    target_ongoing_requests: 1
    min_replicas: 0
    initial_replicas: 0
    max_replicas: 200
    upscale_delay_s: 3
    downscale_delay_s: 60
    upscaling_factor: 0.3
    downscaling_factor: 0.3
    metrics_interval_s: 2
    look_back_period_s: 10
```
:::

:::{tab-item} Application Code
```{literalinclude} ../doc_code/autoscale_model_comp_example.py
:language: python
:start-after: __serve_example_begin__
:end-before: __serve_example_end__
```
:::

::::

Running the same Locust load test from the [Resnet workload](resnet-autoscaling-example) generates the following results:

*HeavyLoad and LightLoad number of replicas during the load test.*

As you might expect, the number of autoscaled `LightLoad` replicas is roughly half that of autoscaled `HeavyLoad` replicas. Although the same number of requests per second are sent to both deployments, `LightLoad` replicas can process twice as many requests per second as `HeavyLoad` replicas can, so the deployment should need half as many replicas to handle the same traffic load.

Unfortunately, the service latency rises from 230 ms to 400 ms when the number of Locust users increases to 100.

| P50 Latency | QPS |
| ------- | --- |
| ![comp_latency](https://raw.githubusercontent.com/ray-project/images/master/docs/serve/autoscaling-guide/model_comp_latency.svg) | ![comp_rps](https://raw.githubusercontent.com/ray-project/images/master/docs/serve/autoscaling-guide/model_comp_rps.svg) |

Note that the number of `HeavyLoad` replicas should roughly match the number of Locust users to adequately serve the Locust traffic. However, when the number of Locust users increased to 100, the `HeavyLoad` deployment struggled to reach 100 replicas and instead only reached 65 replicas. The per-deployment latencies reveal the root cause. While `HeavyLoad` and `LightLoad` latencies stayed steady at 200 ms and 100 ms, `Driver` latencies rose from 230 to 400 ms. This suggests that the high Locust workload may be overwhelming the `Driver` replica and impacting its asynchronous event loop's performance.

### Attempt 2: Autoscale `Driver`

For this attempt, set an autoscaling configuration for `Driver` as well, with the setting `target_ongoing_requests = 20`.
Now the deployment configurations are as follows:

::::{tab-set}

:::{tab-item} Driver
```yaml
- name: Driver
  max_ongoing_requests: 200
  autoscaling_config:
    target_ongoing_requests: 20
    min_replicas: 1
    initial_replicas: 1
    max_replicas: 10
    upscale_delay_s: 3
    downscale_delay_s: 60
    upscaling_factor: 0.3
    downscaling_factor: 0.3
    metrics_interval_s: 2
    look_back_period_s: 10
```
:::

:::{tab-item} HeavyLoad
```yaml
- name: HeavyLoad
  max_ongoing_requests: 3
  autoscaling_config:
    target_ongoing_requests: 1
    min_replicas: 0
    initial_replicas: 0
    max_replicas: 200
    upscale_delay_s: 3
    downscale_delay_s: 60
    upscaling_factor: 0.3
    downscaling_factor: 0.3
    metrics_interval_s: 2
    look_back_period_s: 10
```
:::

:::{tab-item} LightLoad
```yaml
- name: LightLoad
  max_ongoing_requests: 3
  autoscaling_config:
    target_ongoing_requests: 1
    min_replicas: 0
    initial_replicas: 0
    max_replicas: 200
    upscale_delay_s: 3
    downscale_delay_s: 60
    upscaling_factor: 0.3
    downscaling_factor: 0.3
    metrics_interval_s: 2
    look_back_period_s: 10
```
:::

::::

Running the same Locust load test again generates the following results:

*HeavyLoad and LightLoad number of replicas, and Driver number of replicas, during the load test.*

With up to 6 `Driver` replicas to receive and distribute the incoming requests, the `HeavyLoad` deployment successfully scales up to 90+ replicas, and `LightLoad` up to 47 replicas. This configuration helps the application latency stay consistent as the traffic load increases.

| Improved P50 Latency | Improved RPS |
| ---------------- | ------------ |
| ![comp_latency](https://raw.githubusercontent.com/ray-project/images/master/docs/serve/autoscaling-guide/model_composition_improved_latency.svg) | ![comp_rps](https://raw.githubusercontent.com/ray-project/images/master/docs/serve/autoscaling-guide/model_comp_improved_rps.svg) |

## Troubleshooting guide

### Unstable number of autoscaled replicas

If the number of replicas in your deployment keeps oscillating even though the traffic is relatively stable, try the following:

* Set a smaller `upscaling_factor` and `downscaling_factor`. Setting both values smaller than one helps the autoscaler make more conservative upscale and downscale decisions. It effectively smooths out the replica graph, so there are fewer "sharp edges".
* Set a `look_back_period_s` value that matches the rest of the autoscaling config. For longer upscale and downscale delay values, a longer look-back period can help stabilize the replica graph, but for shorter upscale and downscale delay values, a shorter look-back period may be more appropriate. For instance, the following replica graphs show how a deployment with `upscale_delay_s = 3` works with a longer versus a shorter look-back period.

| `look_back_period_s = 30` | `look_back_period_s = 3` |
| ------------------------------------------------ | ----------------------------------------------- |
| ![look-back-before](https://raw.githubusercontent.com/ray-project/images/master/docs/serve/autoscaling-guide/look_back_period_before.png) | ![look-back-after](https://raw.githubusercontent.com/ray-project/images/master/docs/serve/autoscaling-guide/look_back_period_after.png) |

### High spikes in latency during bursts of traffic

If you expect your application to receive bursty traffic, and at the same time want the deployments to scale down in periods of inactivity, you are likely concerned about how quickly the deployment can scale up and respond to bursts of traffic.
While an initial increase in latency during a burst of traffic may be unavoidable, you can try the following to improve latency during bursts of traffic.

* Set a lower `upscale_delay_s`. The autoscaler always waits `upscale_delay_s` seconds before making a decision to upscale, so lowering this delay allows the autoscaler to react more quickly to changes in traffic, especially bursts.
* Set a larger `upscaling_factor`. If `upscaling_factor > 1`, then the autoscaler scales up more aggressively than normal. This setting can make your deployment more responsive to bursts of traffic.
* Lower the [`metrics_interval_s`](../api/doc/ray.serve.config.AutoscalingConfig.metrics_interval_s.rst). Always set [`metrics_interval_s`](../api/doc/ray.serve.config.AutoscalingConfig.metrics_interval_s.rst) to be less than or equal to `upscale_delay_s`, otherwise upscaling is delayed because the autoscaler doesn't receive fresh information often enough.
* Set a lower `max_ongoing_requests`. If `max_ongoing_requests` is too high relative to `target_ongoing_requests`, then when traffic increases, Serve might assign most or all of the requests to the existing replicas before the new replicas are started. This can lead to very high latencies during upscale.

### Deployments scaling down too quickly

You may observe that deployments are scaling down too quickly. Instead, you may want the downscaling to be much more conservative to maximize the availability of your service.

* Set a longer `downscale_delay_s`. The autoscaler always waits `downscale_delay_s` seconds before making a decision to downscale, so by increasing this number, your system has a longer "grace period" after traffic drops before the autoscaler starts to remove replicas.
* Set a smaller `downscaling_factor`. If `downscaling_factor < 1`, then the autoscaler removes *fewer replicas* than it thinks it should remove to achieve the target number of ongoing requests. In other words, the autoscaler makes more conservative downscaling decisions.

| `downscaling_factor = 1` | `downscaling_factor = 0.5` |
| ------------------------------------------------ | ----------------------------------------------- |
| ![downscale-smooth-before](https://raw.githubusercontent.com/ray-project/images/master/docs/serve/autoscaling-guide/downscale_smoothing_factor_before.png) | ![downscale-smooth-after](https://raw.githubusercontent.com/ray-project/images/master/docs/serve/autoscaling-guide/downscale_smoothing_factor_after.png) |

(serve-custom-autoscaling-policies)=
## Custom autoscaling policies

:::{warning}
Custom autoscaling policies are experimental and may change in future releases.
:::

Ray Serve's built-in, request-driven autoscaling works well for most apps. Use **custom autoscaling policies** when you need more control, for example scaling on external metrics (CloudWatch, Prometheus), anticipating predictable traffic (scheduled batch jobs), or applying business logic that goes beyond queue thresholds. Custom policies let you implement scaling logic based on any metrics or rules you choose.

### Custom policy for deployment

A custom autoscaling policy is a user-provided Python function that takes an [`AutoscalingContext`](../api/doc/ray.serve.config.AutoscalingContext.rst) and returns a tuple `(target_replicas, policy_state)` for a single deployment. An `AutoscalingContext` object provides the following information to the custom autoscaling policy:

* **Current state:** Current replica count and deployment metadata.
* **Built-in metrics:** Total requests, queued requests, per-replica counts. * **Custom metrics:** Values your deployment reports via `record_autoscaling_stats()`. (See below.) * **Capacity bounds:** `min` / `max` replica limits adjusted for current cluster capacity. * **Policy state:** A `dict` you can use to persist arbitrary state across control-loop iterations. * **Timing:** Timestamps of the last scale actions and “now”. The following example showcases a policy that scales up during business hours and evening batch processing, and scales down during off-peak hours: `autoscaling_policy.py` file: ```{literalinclude} ../doc_code/autoscaling_policy.py :language: python :start-after: __begin_scheduled_batch_processing_policy__ :end-before: __end_scheduled_batch_processing_policy__ ``` `main.py` file: ```{literalinclude} ../doc_code/scheduled_batch_processing.py :language: python :start-after: __serve_example_begin__ :end-before: __serve_example_end__ ``` Policies are defined **per deployment**. If you don’t provide one, Ray Serve falls back to its built-in request-based policy. The policy function is invoked by the Ray Serve controller every `RAY_SERVE_CONTROL_LOOP_INTERVAL_S` seconds (default **0.1s**), so your logic runs against near-real-time state. :::{warning} Keep policy functions **fast and lightweight**. Slow logic can block the Serve controller and degrade cluster responsiveness. ::: ### Custom metrics You can make richer decisions by emitting your own metrics from the deployment. Implement `record_autoscaling_stats()` to return a `dict[str, float]`. Ray Serve will surface these values in the [`AutoscalingContext`](../api/doc/ray.serve.config.AutoscalingContext.rst). This example demonstrates how deployments can provide their own metrics (CPU usage, memory usage) and how autoscaling policies can use these metrics to make scaling decisions: `autoscaling_policy.py` file: ```{literalinclude} ../doc_code/autoscaling_policy.py :language: python :start-after: __begin_custom_metrics_autoscaling_policy__ :end-before: __end_custom_metrics_autoscaling_policy__ ``` `main.py` file: ```{literalinclude} ../doc_code/custom_metrics_autoscaling.py :language: python :start-after: __serve_example_begin__ :end-before: __serve_example_end__ ``` :::{note} The `record_autoscaling_stats()` method can be either synchronous or asynchronous. It must complete within the timeout specified by `RAY_SERVE_RECORD_AUTOSCALING_STATS_TIMEOUT_S` (default 10 seconds). ::: In your policy, access custom metrics via: * **`ctx.raw_metrics[metric_name]`** — A mapping of replica IDs to lists of raw metric values. The number of data points stored for each replica depends on the [`look_back_period_s`](../api/doc/ray.serve.config.AutoscalingConfig.look_back_period_s.rst) (the sliding window size) and [`metrics_interval_s`](../api/doc/ray.serve.config.AutoscalingConfig.metrics_interval_s.rst) (the metric recording interval). * **`ctx.aggregated_metrics[metric_name]`** — A time-weighted average computed from the raw metric values for each replica. ### Application level autoscaling By default, each deployment in Ray Serve autoscales independently. When you have multiple deployments that need to scale in a coordinated way—such as deployments that share backend resources, have dependencies on each other, or need load-aware routing—you can define an **application-level autoscaling policy**. This policy makes scaling decisions for all deployments within an application simultaneously. 
#### Define an application level policy

An application-level autoscaling policy is a function that takes a `dict[DeploymentID, AutoscalingContext]` (one context per deployment) and returns a tuple of `(decisions, policy_state)`. Each context contains metrics and bounds for one deployment, and the policy returns target replica counts for all deployments.

The `policy_state` returned from an application-level policy must be a `dict[DeploymentID, dict]`: a dictionary mapping each deployment ID to its own state dictionary. Serve stores this per-deployment state and, on the next control-loop iteration, injects each deployment's state back into that deployment's `AutoscalingContext.policy_state`. Serve itself doesn't interpret the contents of `policy_state`; all the keys in each deployment's state dictionary are user-controlled.

The following example shows a policy that scales deployments based on their relative load, ensuring that downstream deployments have enough capacity for upstream traffic:

`autoscaling_policy.py` file:

```{literalinclude} ../doc_code/autoscaling_policy.py
:language: python
:start-after: __begin_application_level_autoscaling_policy__
:end-before: __end_application_level_autoscaling_policy__
```

The following example shows a stateful application-level policy that persists state between control-loop iterations:

`autoscaling_policy.py` file:

```{literalinclude} ../doc_code/autoscaling_policy.py
:language: python
:start-after: __begin_stateful_application_level_policy__
:end-before: __end_stateful_application_level_policy__
```

#### Configure application level autoscaling

To use an application-level policy, you need to define your deployments:

`main.py` file:

```{literalinclude} ../doc_code/application_level_autoscaling.py
:language: python
:start-after: __serve_example_begin__
:end-before: __serve_example_end__
```

Then specify the application-level policy in your application config:

`serve.yaml` file:

```{literalinclude} ../doc_code/application_level_autoscaling.yaml
:language: yaml
:emphasize-lines: 4-5
```

:::{note}
Programmatic configuration of application-level autoscaling policies through `serve.run()` will be supported in a future release.
:::

:::{note}
When you specify both a deployment-level policy and an application-level policy, the application-level policy takes precedence. Ray Serve logs a warning if you configure both.
:::

:::{warning}
### Gotchas and limitations

When you provide a custom policy, Ray Serve can fully support it as long as it's simple, self-contained Python code that relies only on the standard library. Once the policy becomes more complex, such as depending on other custom modules or packages, you need to bundle those modules into the Docker image or environment. This is because Ray Serve uses `cloudpickle` to serialize custom policies and it doesn't vendor transitive dependencies: if your policy inherits from a superclass in another module or imports custom packages, those must exist in the target environment. Additionally, environment parity matters: differences in Python version, `cloudpickle` version, or library versions can affect deserialization.
#### Alternatives for complex policies When your custom autoscaling policy has complex dependencies or you want better control over versioning and deployment, you have several alternatives: - **Contribute to Ray Serve**: If your policy is general-purpose and might benefit others, consider contributing it to Ray Serve as a built-in policy by opening a feature request or pull request on the [Ray GitHub repository](https://github.com/ray-project/ray/issues). The recommended location for the implementation is `python/ray/serve/autoscaling_policy.py`. - **Ensure dependencies in your environment**: Make sure that the external dependencies are installed in your Docker image or environment. ::: (serve-external-scale-api)= ### External scaling API :::{warning} This API is in alpha and may change before becoming stable. ::: The external scaling API provides programmatic control over the number of replicas for any deployment in your Ray Serve application. Unlike Ray Serve's built-in autoscaling, which scales based on queue depth and ongoing requests, this API allows you to scale based on any external criteria you define. #### Example: Predictive scaling This example shows how to implement predictive scaling based on historical patterns or forecasts. You can preemptively scale up before anticipated traffic spikes by running an external script that adjusts replica counts based on time of day. ##### Define the deployment The following example creates a simple text processing deployment that you can scale externally. Save this code to a file named `external_scaler_predictive.py`: ```{literalinclude} ../doc_code/external_scaler_predictive.py :language: python :start-after: __serve_example_begin__ :end-before: __serve_example_end__ ``` ##### Configure external scaling Before using the external scaling API, enable it in your application configuration by setting `external_scaler_enabled: true`. Save this configuration to a file named `external_scaler_config.yaml`: ```{literalinclude} ../doc_code/external_scaler_config.yaml :language: yaml :start-after: __external_scaler_config_begin__ :end-before: __external_scaler_config_end__ ``` :::{warning} External scaling and built-in autoscaling are mutually exclusive. You can't use both for the same application. If you set `external_scaler_enabled: true`, you **must not** configure `autoscaling_config` on any deployment in that application. Attempting to use both results in an error. ::: ##### Implement the scaling logic The following script implements predictive scaling based on time of day and historical traffic patterns. Save this script to a file named `external_scaler_predictive_client.py`: ```{literalinclude} ../doc_code/external_scaler_predictive_client.py :language: python :start-after: __client_script_begin__ :end-before: __client_script_end__ ``` The script uses the external scaling API endpoint to scale deployments: - **API endpoint**: `POST http://localhost:8265/api/v1/applications/{application_name}/deployments/{deployment_name}/scale` - **Request body**: `{"target_num_replicas": }` (must conform to the [`ScaleDeploymentRequest`](../api/doc/ray.serve.schema.ScaleDeploymentRequest.rst) schema) The scaling client continuously adjusts the number of replicas based on the time of day: - Business hours (9 AM - 5 PM): 10 replicas - Off-peak hours: 3 replicas ##### Run the example Follow these steps to run the complete example: 1. Start the Ray Serve application with the configuration: ```bash serve run external_scaler_config.yaml ``` 2. 
Run the predictive scaling client in a separate terminal: ```bash python external_scaler_predictive_client.py ``` The client adjusts replica counts automatically based on the time of day. You can monitor the scaling behavior in the Ray dashboard or by checking the application logs. #### Important considerations Understanding how the external scaler interacts with your deployments helps you build reliable scaling logic: - **Idempotent API calls**: The scaling API is idempotent. You can safely call it multiple times with the same `target_num_replicas` value without side effects. This makes it safe to run your scaling logic on a schedule or in response to repeated metric updates. - **Interaction with serve deploy**: When you upgrade your service with `serve deploy`, the number of replicas you set through the external scaler API stays intact. This behavior matches what you'd expect from Ray Serve's built-in autoscaler—deployment updates don't reset replica counts. - **Query current replica count**: You can get the current number of replicas for any deployment by querying the GET `/applications` API: ```bash curl -X GET http://localhost:8265/api/serve/applications/ \ ``` The response follows the [`ServeInstanceDetails`](../api/doc/ray.serve.schema.ServeInstanceDetails.rst) schema, which includes an `applications` field containing a dictionary with application names as keys. Each application includes detailed information about all its deployments, including current replica counts. Use this information to make informed scaling decisions. For example, you might scale up gradually by adding a percentage of existing replicas rather than jumping to a fixed number. - **Initial replica count**: When you deploy an application for the first time, Ray Serve creates the number of replicas specified in the `num_replicas` field of your deployment configuration. The external scaler can then adjust this count dynamically based on your scaling logic. --- (serve-app-builder-guide)= # Pass Arguments to Applications This section describes how to pass arguments to your applications using an application builder function. ## Defining an application builder When writing an application, there are often parameters that you want to be able to easily change in development or production. For example, you might have a path to trained model weights and want to test out a newly trained model. In Ray Serve, these parameters are typically passed to the constructor of your deployments using `.bind()`. This pattern allows you to configure deployments using ordinary Python code, but it requires modifying the code whenever one of the parameters needs to change. To pass arguments without changing the code, define an "application builder" function that takes an arguments dictionary (or [Pydantic object](typed-app-builders)) and returns the built application to be run. ```{literalinclude} ../doc_code/app_builder.py :start-after: __begin_untyped_builder__ :end-before: __end_untyped_builder__ :language: python ``` You can use this application builder function as the import path in the `serve run` CLI command or the config file (as shown below). To avoid writing code to handle type conversions and missing arguments, use a [Pydantic object](typed-app-builders) instead. 
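For reference, here's a minimal sketch of what such a builder can look like. It assumes the `hello.py`, `HelloWorld`, and `app_builder` names used in the CLI examples below; the authoritative version of this code is the `app_builder.py` snippet included above.

```python
# hello.py -- a minimal sketch of an untyped application builder.
from typing import Dict

from ray import serve
from ray.serve import Application


@serve.deployment
class HelloWorld:
    def __init__(self, message: str):
        self._message = message
        # The CLI examples below show this line in the replica logs.
        print("Message:", message)

    def __call__(self, request) -> str:
        return self._message


def app_builder(args: Dict[str, str]) -> Application:
    # Read parameters from the args dictionary, falling back to a default.
    return HelloWorld.bind(args.get("message", "Hello world!"))
```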
### Passing arguments via `serve run` Pass arguments to the application builder from `serve run` using the following syntax: ```bash $ serve run hello:app_builder key1=val1 key2=val2 ``` The arguments are passed to the application builder as a dictionary, in this case `{"key1": "val1", "key2": "val2"}`. For example, to pass a new message to the `HelloWorld` app defined above (with the code saved in `hello.py`): ```bash % serve run hello:app_builder message="Hello from CLI" 2023-05-16 10:47:31,641 INFO scripts.py:404 -- Running import path: 'hello:app_builder'. 2023-05-16 10:47:33,344 INFO worker.py:1615 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265 (ServeController pid=56826) INFO 2023-05-16 10:47:35,115 controller 56826 deployment_state.py:1244 - Deploying new version of deployment default_HelloWorld. (ServeController pid=56826) INFO 2023-05-16 10:47:35,141 controller 56826 deployment_state.py:1483 - Adding 1 replica to deployment default_HelloWorld. (ProxyActor pid=56828) INFO: Started server process [56828] (ServeReplica:default_HelloWorld pid=56830) Message: Hello from CLI 2023-05-16 10:47:36,131 SUCC scripts.py:424 -- Deployed Serve app successfully. ``` Notice that the "Hello from CLI" message is printed from within the deployment constructor. ### Passing arguments via config file Pass arguments to the application builder in the config file's `args` field: ```yaml applications: - name: MyApp import_path: hello:app_builder args: message: "Hello from config" ``` For example, to pass a new message to the `HelloWorld` app defined above (with the code saved in `hello.py` and the config saved in `config.yaml`): ```bash % serve run config.yaml 2023-05-16 10:49:25,247 INFO scripts.py:351 -- Running config file: 'config.yaml'. 2023-05-16 10:49:26,949 INFO worker.py:1615 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265 2023-05-16 10:49:28,678 SUCC scripts.py:419 -- Submitted deploy config successfully. (ServeController pid=57109) INFO 2023-05-16 10:49:28,676 controller 57109 controller.py:559 - Building application 'MyApp'. (ProxyActor pid=57111) INFO: Started server process [57111] (ServeController pid=57109) INFO 2023-05-16 10:49:28,940 controller 57109 application_state.py:202 - Built application 'MyApp' successfully. (ServeController pid=57109) INFO 2023-05-16 10:49:28,942 controller 57109 deployment_state.py:1244 - Deploying new version of deployment MyApp_HelloWorld. (ServeController pid=57109) INFO 2023-05-16 10:49:29,016 controller 57109 deployment_state.py:1483 - Adding 1 replica to deployment MyApp_HelloWorld. (ServeReplica:MyApp_HelloWorld pid=57113) Message: Hello from config ``` Notice that the "Hello from config" message is printed from within the deployment constructor. (typed-app-builders)= ### Typing arguments with Pydantic To avoid writing logic to parse and validate the arguments by hand, define a [Pydantic model](https://pydantic-docs.helpmanual.io/usage/models/) as the single input parameter's type to your application builder function (the parameter must be type annotated). Arguments are passed the same way, but the resulting dictionary is used to construct the Pydantic model using `model.parse_obj(args_dict)`. ```{literalinclude} ../doc_code/app_builder.py :start-after: __begin_typed_builder__ :end-before: __end_typed_builder__ :language: python ``` ```bash % serve run hello:typed_app_builder message="Hello from CLI" 2023-05-16 10:47:31,641 INFO scripts.py:404 -- Running import path: 'hello:typed_app_builder'. 
2023-05-16 10:47:33,344 INFO worker.py:1615 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265 (ServeController pid=56826) INFO 2023-05-16 10:47:35,115 controller 56826 deployment_state.py:1244 - Deploying new version of deployment default_HelloWorld. (ServeController pid=56826) INFO 2023-05-16 10:47:35,141 controller 56826 deployment_state.py:1483 - Adding 1 replica to deployment default_HelloWorld. (ProxyActor pid=56828) INFO: Started server process [56828] (ServeReplica:default_HelloWorld pid=56830) Message: Hello from CLI 2023-05-16 10:47:36,131 SUCC scripts.py:424 -- Deployed Serve app successfully. ``` ## Common patterns ### Multiple parametrized applications using the same builder You can use application builders to run multiple applications with the same code but different parameters. For example, multiple applications may share preprocessing and HTTP handling logic but use many different trained model weights. The same application builder `import_path` can take different arguments to define multiple applications as follows: ```yaml applications: - name: Model1 import_path: my_module:my_model_code args: model_uri: s3://my_bucket/model_1 - name: Model2 import_path: my_module:my_model_code args: model_uri: s3://my_bucket/model_2 - name: Model3 import_path: my_module:my_model_code args: model_uri: s3://my_bucket/model_3 ``` ### Configuring multiple composed deployments You can use the arguments passed to an application builder to configure multiple deployments in a single application. For example a model composition application might take weights to two different models as follows: ```{literalinclude} ../doc_code/app_builder.py :start-after: __begin_composed_builder__ :end-before: __end_composed_builder__ :language: python ``` --- (serve-asyncio-best-practices)= # Asyncio and concurrency best practices in Ray Serve The code that runs inside of each replica in a Ray Serve deployment runs on an asyncio event loop. Asyncio enables efficient I/O bound concurrency but requires following a few best practices for optimal performance. This guide explains: - When to use `async def` versus `def` in Ray Serve. - How Ray Serve executes your code (loops, threads, and the router). - How `max_ongoing_requests` interacts with asyncio concurrency. - How to think about Python's GIL, native code, and true parallelism. The examples assume the following imports unless stated otherwise: ```{literalinclude} ../doc_code/asyncio_best_practices.py :start-after: __imports_begin__ :end-before: __imports_end__ :language: python ``` ## How to choose between `async def` and `def` Use this decision table as a starting point: | Workload type | Recommended handler | Reason | | --- | --- | --- | | I/O-bound (databases, HTTP calls, queues) | `async def` | Lets the event loop handle many requests while each waits on I/O. | | CPU-bound (model inference, heavy numeric compute) | `def` or `async def` with offload | Async alone doesn't make CPU work faster. You need more replicas, threads, or native parallelism. | | Streaming responses | `async def` generator | Integrates with backpressure and non-blocking iteration. | | FastAPI ingress (`@serve.ingress`) | `def` or `async def` | FastAPI runs `def` endpoints in a threadpool, so they don't block the loop. 
| ## How Ray Serve executes your code At a high level, requests go through a router to a replica actor that runs your code: ```text Client ↓ Serve router (asyncio loop A) ↓ Replica actor ├─ System / control loop └─ User code loop (your handlers) └─ Optional threadpool for sync methods ``` The following are the key ideas to consider when deciding to use `async def` or `def`: - Serve uses asyncio event loops for routing and for running replicas. - By default, user code runs on a separate event loop from the replica's main/control loop, so blocking user code doesn't interfere with health checks and autoscaling. - Depending on the value of `RAY_SERVE_RUN_SYNC_IN_THREADPOOL`, `def` handlers may run directly on the user event loop (blocking) or in a threadpool (non-blocking for the loop). ### Pure Serve deployments (no FastAPI ingress) For a simple deployment: ```{literalinclude} ../doc_code/asyncio_best_practices.py :start-after: __echo_async_begin__ :end-before: __echo_async_end__ :language: python ``` - `async def __call__` runs directly on the replica's user event loop. - While this handler awaits `asyncio.sleep`, the loop is free to start handling other requests. For a synchronous deployment: ```{literalinclude} ../doc_code/asyncio_best_practices.py :start-after: __blocking_echo_begin__ :end-before: __blocking_echo_end__ :language: python ``` How this method executes depends on configuration: - With `RAY_SERVE_RUN_SYNC_IN_THREADPOOL=0` (current default), `__call__` runs directly on the user event loop and blocks it for 1 second. - With `RAY_SERVE_RUN_SYNC_IN_THREADPOOL=1`, Serve offloads `__call__` to a threadpool so the event loop stays responsive. ### FastAPI ingress (`@serve.ingress`) When you use FastAPI ingress, FastAPI controls how endpoints run: ```{literalinclude} ../doc_code/asyncio_best_practices.py :start-after: __fastapi_deployment_begin__ :end-before: __fastapi_deployment_end__ :language: python ``` Important differences: - FastAPI always dispatches `def` endpoints to a threadpool. - In pure Serve, `def` methods run on the event loop unless you opt into threadpool behavior. ## Blocking versus non-blocking in practice Blocking code keeps the event loop from processing other work. Non-blocking code yields control back to the loop when it's waiting on something. ### Blocking I/O versus asynchronous I/O Blocking I/O example: ```{literalinclude} ../doc_code/asyncio_best_practices.py :start-after: __blocking_http_begin__ :end-before: __blocking_http_end__ :language: python ``` Even though the method is `async def`, `requests.get` blocks the loop. No other requests can run on this replica during the request call. Blocking in `async def` is still blocking. Non-blocking equivalent with async HTTP client: ```{literalinclude} ../doc_code/asyncio_best_practices.py :start-after: __async_http_begin__ :end-before: __async_http_end__ :language: python ``` Non-blocking equivalent using a threadpool: ```{literalinclude} ../doc_code/asyncio_best_practices.py :start-after: __threaded_http_begin__ :end-before: __threaded_http_end__ :language: python ``` ## Concurrency doesn't equal parallelism in Python It's common to expect `async` code to "use all the cores" or make CPU-heavy code faster. asyncio doesn't do that. ### Concurrency: Handling many waiting operations Asyncio gives you **concurrency** for I/O-bound workloads: - While one request waits on the database, another can wait on an HTTP call. - Handlers yield back to the event loop at each `await`. 
This is ideal for high-throughput APIs that mostly wait on external systems. ### Parallelism: Using multiple CPU cores True CPU parallelism usually comes from: - Multiple processes (for example, multiple Serve replicas). - Native code that releases the GIL and runs across cores. Python's GIL means that pure Python bytecode runs one thread at a time in a process, even if you use a threadpool. ### Using GIL-releasing native code Many numeric and ML libraries release the GIL while doing heavy work in native code: - NumPy, many linear algebra routines. - PyTorch and some other deep learning frameworks. - Some image-processing or compression libraries. In these cases, you can still get useful parallelism from threads inside a single replica process: ```{literalinclude} ../doc_code/asyncio_best_practices.py :start-after: __numpy_deployment_begin__ :end-before: __numpy_deployment_end__ :language: python ``` However: - GIL-releasing behavior is library-specific and sometimes operation-specific. - Some libraries use their own internal threadpools; combining them with your own threadpools can oversubscribe CPUs. - You should verify that your model stack is thread-safe before relying on this form of parallelism. For predictable CPU scaling, it's usually simpler to increase the number of replicas. ### Summary - `async def` improves **concurrency** for I/O-bound code. - CPU-bound code doesn't become faster merely because it's `async`. - Parallel CPU scaling comes mostly from **more processes** (replicas or tasks) and, in some cases, native code that releases the GIL. ## How `max_ongoing_requests` and replica concurrency work Each deployment has a `max_ongoing_requests` configuration that controls how many in-flight requests a replica handles at once. ```{literalinclude} ../doc_code/asyncio_best_practices.py :start-after: __max_ongoing_requests_begin__ :end-before: __max_ongoing_requests_end__ :language: python ``` Key points: - Ray Serve uses an internal semaphore to limit concurrent in-flight requests per replica to `max_ongoing_requests`. - Requests beyond that limit queue in the router or handle until capacity becomes available, or they fail with backpressure depending on configuration. How useful `max_ongoing_requests` is depends on how your handler behaves. ### `async` handlers and `max_ongoing_requests` With an `async def` handler that spends most of its time awaiting I/O, `max_ongoing_requests` directly controls concurrency: ```{literalinclude} ../doc_code/asyncio_best_practices.py :start-after: __async_io_bound_begin__ :end-before: __async_io_bound_end__ :language: python ``` - Up to 100 requests can be in-flight per replica. - While one request is waiting, the event loop can work on others. ### Blocking `def` handlers and `max_ongoing_requests` With a blocking `def` handler that runs on the event loop (threadpool disabled), `max_ongoing_requests` doesn't give you the concurrency you expect: ```{literalinclude} ../doc_code/asyncio_best_practices.py :start-after: __blocking_cpu_begin__ :end-before: __blocking_cpu_end__ :language: python ``` In this case: - The event loop can only run one handler at a time. - Even though `max_ongoing_requests=100`, the replica effectively processes requests serially. 
If you enable the sync-in-threadpool behavior (see the next section), each in-flight request can run in a thread:

```{literalinclude} ../doc_code/asyncio_best_practices.py
:start-after: __cpu_with_threadpool_begin__
:end-before: __cpu_with_threadpool_end__
:language: python
```

Now:

- Up to `max_ongoing_requests` calls can be running at once.
- Real throughput depends on:
  - How many threads the threadpool uses.
  - Whether your workload is CPU-bound or GIL-releasing.
  - Underlying native libraries and system resources.

For heavily CPU-bound workloads, it's usually better to:

- Keep `max_ongoing_requests` modest (to avoid queueing too many heavy tasks), and
- Scale **replicas** (`num_replicas`) rather than pushing a single replica's concurrency too high.

## Environment flags and sync-in-threadpool warning

Ray Serve exposes several environment variables that control how user code interacts with event loops and threads.

### `RAY_SERVE_RUN_SYNC_IN_THREADPOOL`

By default (`RAY_SERVE_RUN_SYNC_IN_THREADPOOL=0`), synchronous methods in a deployment run directly on the user event loop. To help you migrate to a safer model, Serve emits a warning like:

> `RAY_SERVE_RUN_SYNC_IN_THREADPOOL_WARNING`: Calling sync method '...' directly on the asyncio loop. In a future version, sync methods will be run in a threadpool by default...

This warning means:

- You have a `def` method that is currently running on the event loop.
- In a future version, that method runs in a threadpool instead.

You can opt in to the future behavior now by setting:

```bash
export RAY_SERVE_RUN_SYNC_IN_THREADPOOL=1
```

When this flag is `1`:

- Serve runs synchronous methods in a threadpool.
- The event loop is free to keep serving other requests while sync methods run.

Before enabling this in production, make sure:

- Your handler code and any shared state are thread-safe.
- Your model objects can safely be used from multiple threads, or you protect them with locks. See the sketch at the end of this section for one way to guard shared state with a lock.

### `RAY_SERVE_RUN_USER_CODE_IN_SEPARATE_THREAD`

By default, Serve runs user code in a separate event loop from the replica's main/control loop:

```bash
export RAY_SERVE_RUN_USER_CODE_IN_SEPARATE_THREAD=1  # default
```

This isolation:

- Protects system tasks (health checks, controller communication) from being blocked by user code.
- Adds some overhead to cross-loop communication, resulting in higher request latency.

For throughput-optimized configurations, see [High throughput optimization](serve-high-throughput). You can disable this behavior:

```bash
export RAY_SERVE_RUN_USER_CODE_IN_SEPARATE_THREAD=0
```

Only advanced users should change this. When user code and system tasks share a loop, any blocking operation in user code can interfere with replica health and control-plane operations.

### `RAY_SERVE_RUN_ROUTER_IN_SEPARATE_LOOP`

Serve's request router also runs on its own event loop by default:

```bash
export RAY_SERVE_RUN_ROUTER_IN_SEPARATE_LOOP=1  # default
```

This ensures that the router can continue routing and load balancing requests even if some replicas are running slow user code.

Disabling this:

```bash
export RAY_SERVE_RUN_ROUTER_IN_SEPARATE_LOOP=0
```

makes the router share an event loop with other work. This can reduce overhead in advanced, highly optimized scenarios, but makes the system more sensitive to blocking operations. See [High throughput optimization](serve-high-throughput).

For most production deployments, you should keep the defaults (`1`) for both separate-loop flags.
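The following is a minimal, hypothetical sketch (the `ThreadSafeCounter` deployment isn't part of the examples above) of the locking pattern mentioned under `RAY_SERVE_RUN_SYNC_IN_THREADPOOL`: a sync handler that guards shared state with a lock so it stays correct when Serve runs it in a threadpool.

```python
import threading

from ray import serve


@serve.deployment
class ThreadSafeCounter:
    def __init__(self):
        # With RAY_SERVE_RUN_SYNC_IN_THREADPOOL=1, this sync handler can run
        # concurrently in multiple threads, so guard shared state with a lock.
        self._lock = threading.Lock()
        self._count = 0

    def __call__(self, request) -> int:
        with self._lock:
            self._count += 1
            return self._count


app = ThreadSafeCounter.bind()
```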
## Batching and streaming semantics Batching and streaming both rely on the event loop to stay responsive. They don't change where your code runs: batched handlers and streaming handlers still run on the same user event loop as any other handler. This means that if you add batching or streaming on top of blocking code, you can make event loop blocking effects much worse. ### Batching When you enable batching, Serve groups multiple incoming requests together and passes them to your handler as a list. The handler still runs on the user event loop, but each call now processes many requests at once instead of just one. If that batched work is blocking, it blocks the event loop for all of those requests at the same time. The following example shows a batched deployment: ```{literalinclude} ../doc_code/asyncio_best_practices.py :start-after: __batched_model_begin__ :end-before: __batched_model_end__ :language: python ``` The batch handler runs on the user event loop: - If `_run_model` is CPU-heavy and runs inline, it blocks the loop for the duration of the batch. - You can offload the batch computation: ```{literalinclude} ../doc_code/asyncio_best_practices.py :start-after: __batched_model_offload_begin__ :end-before: __batched_model_offload_end__ :language: python :emphasize-lines: 9-16 ``` This keeps the event loop responsive while the model runs in a thread. #### `max_concurrent_batches` and event loop yielding The `@serve.batch` decorator accepts a `max_concurrent_batches` argument that controls how many batches can be processed concurrently. However, this argument only works effectively if your batch handler yields control back to the event loop during processing. If your batch handler blocks the event loop (for example, by doing heavy CPU work without awaiting or offloading), `max_concurrent_batches` won't provide the concurrency you expect. The event loop can only start processing a new batch when the current batch yields control. To get the benefit of `max_concurrent_batches`: - Use `async def` for your batch handler and `await` I/O operations or offloaded CPU work. - Offload CPU-heavy batch processing to a threadpool with `asyncio.to_thread()` or `loop.run_in_executor()`. - Avoid blocking operations that prevent the event loop from scheduling other batches. In the offloaded batch example above, the handler yields to the event loop when awaiting the threadpool executor, which allows multiple batches to be in flight simultaneously (up to the `max_concurrent_batches` limit). ### Streaming Streaming is different from a regular response because the client starts receiving data while your handler is still running. Serve calls your handler once, gets back a generator or async generator, and then repeatedly asks it for the next chunk. That generator code still runs on the user event loop (or in a worker thread if you offload it). Streaming is especially sensitive to blocking: - If you block between chunks, you delay the next piece of data to the client. - While the generator is blocked on the event loop, other requests on that loop can't make progress. - The system also cannot react quickly to slow clients (backpressure) or cancellation. 
Bad streaming example:

```{literalinclude} ../doc_code/asyncio_best_practices.py
:start-after: __blocking_stream_begin__
:end-before: __blocking_stream_end__
:language: python
```

Better streaming example:

```{literalinclude} ../doc_code/asyncio_best_practices.py
:start-after: __async_stream_begin__
:end-before: __async_stream_end__
:language: python
```

In streaming scenarios:

- Prefer `async def` generators that use `await` between yields.
- Avoid long CPU-bound loops between yields; offload them if needed.

## Offloading patterns: I/O, CPU

This section summarizes common offloading patterns you can use inside `async` handlers.

### Blocking I/O in `async def`

```{literalinclude} ../doc_code/asyncio_best_practices.py
:start-after: __offload_io_begin__
:end-before: __offload_io_end__
:language: python
```

### CPU-heavy code in `async def`

```{literalinclude} ../doc_code/asyncio_best_practices.py
:start-after: __offload_cpu_begin__
:end-before: __offload_cpu_end__
:language: python
```

### (Advanced) Using Ray tasks or remote actors for true parallelism

:::{note}
While you can spawn Ray tasks from Ray Serve deployments, this approach isn't recommended because it lacks tooling for observability and debugging.
:::

```{literalinclude} ../doc_code/asyncio_best_practices.py
:start-after: __ray_parallel_begin__
:end-before: __ray_parallel_end__
:language: python
```

This pattern:

- Uses multiple Ray workers and processes.
- Bypasses the GIL limitation of a single Python process.

## Summary

- Use `async def` for I/O-bound and streaming work so the event loop can stay responsive.
- Use `max_ongoing_requests` to bound concurrency per replica, but remember that blocking `def` handlers can still serialize work if they run on the event loop.
- Consider enabling `RAY_SERVE_RUN_SYNC_IN_THREADPOOL` once your code is thread-safe, and be aware of the sync-in-threadpool warning.
- For CPU-heavy workloads, scale replicas or rely on GIL-releasing native code for real parallelism.

---

(custom-request-router-guide)=
# Use Custom Algorithm for Request Routing

:::{warning}
This API is in alpha and may change before becoming stable.
:::

Different Ray Serve applications demand different load-balancing logic. For example, when serving LLMs you might want a policy other than balancing the number of requests across replicas, such as balancing ongoing input tokens or KV-cache utilization. [`RequestRouter`](../api/doc/ray.serve.request_router.RequestRouter.rst) is an abstraction in Ray Serve that allows extension and customization of load-balancing logic for each deployment. This guide shows how to use the [`RequestRouter`](../api/doc/ray.serve.request_router.RequestRouter.rst) API to achieve custom load balancing across replicas of a given deployment.
It covers the following:

- Define a simple uniform request router for load balancing
- Deploy an app with the uniform request router
- Utility mixins for request routing
- Define a complex throughput-aware request router
- Deploy an app with the throughput-aware request router

(simple-uniform-request-router)=
## Define simple uniform request router

Create a file `custom_request_router.py` with the following code:

```{literalinclude} ../doc_code/custom_request_router.py
:start-after: __begin_define_uniform_request_router__
:end-before: __end_define_uniform_request_router__
:language: python
```

This code defines a simple uniform request router that routes requests to a random replica to distribute the load evenly, regardless of the queue length of each replica or the body of the request. The router is defined as a class that inherits from [`RequestRouter`](../api/doc/ray.serve.request_router.RequestRouter.rst). It implements the [`choose_replicas`](../api/doc/ray.serve.request_router.RequestRouter.choose_replicas.rst) method, which returns a random replica for each incoming request. The return type is a list of lists of replicas, where each inner list represents a rank of replicas. The first rank is the most preferred and the last rank is the least preferred. Serve attempts to route the request to the replica with the shortest request queue in each rank, in order, until a replica is able to process the request. If none of the replicas are able to process the request, Serve calls [`choose_replicas`](../api/doc/ray.serve.request_router.RequestRouter.choose_replicas.rst) again with a backoff delay until a replica is able to process the request.

:::{note}
This request router also implements [`on_request_routed`](../api/doc/ray.serve.request_router.RequestRouter.on_request_routed.rst), which can help you update the state of the request router after a request is routed.
:::

(deploy-app-with-uniform-request-router)=
## Deploy an app with the uniform request router

To use a custom request router, pass the `request_router_class` argument to the [`deployment`](../api/doc/ray.serve.deployment_decorator.rst) decorator. You can pass `request_router_class` either as the imported class or as a string containing the import path of the class. Deploy a simple app that uses the uniform request router like this:

```{literalinclude} ../doc_code/custom_request_router_app.py
:start-after: __begin_deploy_app_with_uniform_request_router__
:end-before: __end_deploy_app_with_uniform_request_router__
:language: python
```

As the request is routed, both "UniformRequestRouter routing request" and "on_request_routed callback is called!!" messages are printed to the console. Each response is also randomly routed to one of the replicas. You can test this by sending more requests and observing that the distribution across replicas is roughly equal.

:::{note}
Currently, the only way to configure the request router is to pass it as an argument to the deployment decorator. This means that you can't change the request router for an existing deployment handle with a running router. If you have a particular use case where you need to reconfigure a request router on the deployment handle, open a feature request on the [Ray GitHub repository](https://github.com/ray-project/ray/issues).
:::

(utility-mixin)=
## Utility mixins for request routing

Ray Serve provides utility mixins that extend the functionality of the request router.
(utility-mixin)=
## Utility mixins for request routing

Ray Serve provides utility mixins that extend the functionality of the request router. You can use these mixins to implement common routing policies such as locality-aware routing, multiplexed model support, and FIFO request routing.

- [`FIFOMixin`](../api/doc/ray.serve.request_router.FIFOMixin.rst): This mixin implements first-in, first-out (FIFO) request routing. The default behavior of the request router is out-of-order (OOO) routing, which routes each request to the exact replica chosen by the [`choose_replicas`](../api/doc/ray.serve.request_router.RequestRouter.choose_replicas.rst) call for that request. This mixin is useful for routing algorithms that work independently of the request content, so Serve can route requests as soon as possible in the order it received them. By including this mixin in your custom request router, the request matching algorithm routes requests FIFO. This mixin requires no additional configuration flags and provides no additional helper methods.
- [`LocalityMixin`](../api/doc/ray.serve.request_router.LocalityMixin.rst): This mixin implements locality-aware request routing. It updates its internal state on replica updates to track which replicas are on the same node, in the same zone, and everywhere else. It offers the helpers [`apply_locality_routing`](../api/doc/ray.serve.request_router.LocalityMixin.apply_locality_routing.rst) and [`rank_replicas_via_locality`](../api/doc/ray.serve.request_router.LocalityMixin.rank_replicas_via_locality.rst) to route to and rank replicas based on their locality relative to the request, which can be useful for reducing latency and improving performance.
- [`MultiplexMixin`](../api/doc/ray.serve.request_router.MultiplexMixin.rst): When you use model multiplexing, you need to route requests based on which replica already has a hot version of the model. This mixin updates its internal state on replica updates to track the models loaded on each replica and the size of each replica's model cache. It offers the helpers [`apply_multiplex_routing`](../api/doc/ray.serve.request_router.MultiplexMixin.apply_multiplex_routing.rst) and [`rank_replicas_via_multiplex`](../api/doc/ray.serve.request_router.MultiplexMixin.rank_replicas_via_multiplex.rst) to route to and rank replicas based on the multiplexed model ID of the request.

(throughput-aware-request-router)=
## Define a complex throughput-aware request router

A fully featured request router can be more complex and should take into account the multiplexed model, locality, the request queue length on each replica, and custom statistics such as throughput to decide which replica to route the request to. The following class defines a throughput-aware request router that routes requests with these factors in mind. Add the following code to the `custom_request_router.py` file:

```{literalinclude} ../doc_code/custom_request_router.py
:start-after: __begin_define_throughput_aware_request_router__
:end-before: __end_define_throughput_aware_request_router__
:language: python
```

This request router inherits from [`RequestRouter`](../api/doc/ray.serve.request_router.RequestRouter.rst), as well as [`FIFOMixin`](../api/doc/ray.serve.request_router.FIFOMixin.rst) for FIFO request routing, [`LocalityMixin`](../api/doc/ray.serve.request_router.LocalityMixin.rst) for locality-aware request routing, and [`MultiplexMixin`](../api/doc/ray.serve.request_router.MultiplexMixin.rst) for multiplexed model support.
It implements [`choose_replicas`](../api/doc/ray.serve.request_router.RequestRouter.choose_replicas.rst) to take the highest-ranked replicas from [`rank_replicas_via_multiplex`](../api/doc/ray.serve.request_router.MultiplexMixin.rank_replicas_via_multiplex.rst) and [`rank_replicas_via_locality`](../api/doc/ray.serve.request_router.LocalityMixin.rank_replicas_via_locality.rst) and uses the [`select_available_replicas`](../api/doc/ray.serve.request_router.RequestRouter.select_available_replicas.rst) helper to filter out replicas that have reached their maximum request queue length. Finally, it takes the replicas with the minimum throughput and returns the top one.

(deploy-app-with-throughput-aware-request-router)=
## Deploy an app with the throughput-aware request router

To use the throughput-aware request router, you can deploy an app like this:

```{literalinclude} ../doc_code/custom_request_router_app.py
:start-after: __begin_deploy_app_with_throughput_aware_request_router__
:end-before: __end_deploy_app_with_throughput_aware_request_router__
:language: python
```

Similar to the uniform request router, you pass the custom request router in the `request_router_class` argument of the [`deployment`](../api/doc/ray.serve.deployment_decorator.rst) decorator. The Serve controller pulls statistics from the replicas of each deployment by calling `record_routing_stats`. The `request_routing_stats_period_s` and `request_routing_stats_timeout_s` arguments control how frequently the Serve controller pulls this information from each replica in its background thread, and the timeout for each pull. You can customize the emission of these statistics by overriding `record_routing_stats` in the definition of the deployment class. The custom request router can then read the updated routing stats from the `routing_stats` attribute of the running replicas and use them in its routing policy.

:::{warning}
## Gotchas and limitations

When you provide a custom router, Ray Serve can fully support it as long as it's simple, self-contained Python code that relies only on the standard library. Once the router becomes more complex, such as depending on other custom modules or packages, you need to ensure those modules are bundled into the Docker image or environment. This is because Ray Serve uses `cloudpickle` to serialize custom routers, and `cloudpickle` doesn't vendor transitive dependencies: if your router inherits from a superclass in another module or imports custom packages, those must exist in the target environment. Additionally, environment parity matters: differences in Python version, `cloudpickle` version, or library versions can affect deserialization.

### Alternatives for complex routers

When your custom request router has complex dependencies or you want better control over versioning and deployment, you have several alternatives:

- **Use built-in routers**: Consider using the routers shipped with Ray Serve, which are well-tested, production-ready, and guaranteed to work across different environments.
- **Contribute to Ray Serve**: If your router is general-purpose and might benefit others, consider contributing it to Ray Serve as a built-in router by opening a feature request or pull request on the [Ray GitHub repository](https://github.com/ray-project/ray/issues). The recommended location for the implementation is `python/ray/serve/_private/request_router/`.
- **Ensure dependencies in your environment**: Make sure that the external dependencies are installed in your Docker image or environment.
:::

---

(serve-in-production-deploying)=
# Deploy on VM

You can deploy your Serve application to production on a Ray cluster using the Ray Serve CLI.
`serve deploy` takes a config file path and deploys that file to a Ray cluster over HTTP. This could either be a local, single-node cluster as in this example or a remote, multi-node cluster started with the [Ray Cluster Launcher](cloud-vm-index).

This section should help you:

- understand how to deploy a Ray Serve config file using the CLI.
- understand how to update your application using the CLI.
- understand how to deploy to a remote cluster started with the [Ray Cluster Launcher](cloud-vm-index).

Start by deploying this [config](production-config-yaml) for the Text ML Application [example](serve-in-production-example):

```console
$ ls
text_ml.py
serve_config.yaml

$ ray start --head
...

$ serve deploy serve_config.yaml
2022-06-20 17:26:31,106 SUCC scripts.py:139 -- Sent deploy request successfully!
* Use `serve status` to check deployments' statuses.
* Use `serve config` to see the running app's config.
```

`ray start --head` starts a long-lived Ray cluster locally. `serve deploy serve_config.yaml` deploys the `serve_config.yaml` file to this local cluster. To stop the Ray cluster, run the CLI command `ray stop`.

The message `Sent deploy request successfully!` means:

* The Ray cluster has received your config file successfully.
* It will start a new Serve application if one hasn't already started.
* The Serve application will deploy the deployments from your deployment graph, updated with the configurations from your config file.

It does **not** mean that your Serve application, including your deployments, has already started running successfully. This happens asynchronously as the Ray cluster attempts to update itself to match the settings from your config file. See [Inspect an application](serve-in-production-inspecting) for how to get the current status.

(serve-in-production-remote-cluster)=
## Using a remote cluster

By default, `serve deploy` deploys to a cluster running locally. However, you should also use `serve deploy` whenever you want to deploy your Serve application to a remote cluster. `serve deploy` takes an optional `--address/-a` argument where you can specify your remote Ray cluster's dashboard address. This address should be of the form:

```
[RAY_CLUSTER_URI]:[DASHBOARD_PORT]
```

As an example, the address for the local cluster started by `ray start --head` is `http://127.0.0.1:8265`. We can explicitly deploy to this address using the command:

```console
$ serve deploy config_file.yaml -a http://127.0.0.1:8265
```

The Ray Dashboard's default port is 8265. To set it to a different value, use the `--dashboard-port` argument when running `ray start`.

:::{note}
When running on a remote cluster, you need to ensure that the import path is accessible. See [Handle Dependencies](serve-handling-dependencies) for how to add a runtime environment.
:::

:::{tip}
By default, all the Serve CLI commands assume that you're working with a local cluster. All Serve CLI commands, except `serve start` and `serve run`, use the Ray Dashboard address associated with a local cluster started by `ray start --head`. However, if the `RAY_DASHBOARD_ADDRESS` environment variable is set, these Serve CLI commands default to that value instead.

Similarly, `serve start` and `serve run` use the Ray head node address associated with a local cluster by default. If the `RAY_ADDRESS` environment variable is set, they use that value instead.
You can check `RAY_DASHBOARD_ADDRESS`'s value by running: ```console $ echo $RAY_DASHBOARD_ADDRESS ``` You can set this variable by running the CLI command: ```console $ export RAY_DASHBOARD_ADDRESS=[YOUR VALUE] ``` You can unset this variable by running the CLI command: ```console $ unset RAY_DASHBOARD_ADDRESS ``` Check for this variable in your environment to make sure you're using your desired Ray Dashboard address. ::: To inspect the status of the Serve application in production, see [Inspect an application](serve-in-production-inspecting). Make heavyweight code updates (like `runtime_env` changes) by starting a new Ray Cluster, updating your Serve config file, and deploying the file with `serve deploy` to the new cluster. Once the new deployment is finished, switch your traffic to the new cluster. --- (serve-dev-workflow)= # Development Workflow This page describes the recommended workflow for developing Ray Serve applications. If you're ready to go to production, jump to the [Production Guide](serve-in-production) section. ## Local Development using `serve.run` You can use `serve.run` in a Python script to run and test your application locally, using a handle to send requests programmatically rather than over HTTP. Benefits: - Self-contained Python is convenient for writing local integration tests. - No need to deploy to a cloud provider or manage infrastructure. Drawbacks: - Doesn't test HTTP endpoints. - Can't use GPUs if your local machine doesn't have them. Let's see a simple example. ```{literalinclude} ../doc_code/local_dev.py :start-after: __local_dev_start__ :end-before: __local_dev_end__ :language: python ``` We can add the code below to deploy and test Serve locally. ```{literalinclude} ../doc_code/local_dev.py :start-after: __local_dev_handle_start__ :end-before: __local_dev_handle_end__ :language: python ``` ## Local Development with HTTP requests You can use the `serve run` CLI command to run and test your application locally using HTTP to send requests (similar to how you might use the `uvicorn` command if you're familiar with [Uvicorn](https://www.uvicorn.org/)). Recall our example above: ```{literalinclude} ../doc_code/local_dev.py :start-after: __local_dev_start__ :end-before: __local_dev_end__ :language: python ``` Now run the following command in your terminal: ```bash serve run local_dev:app # 2022-08-11 11:31:47,692 INFO scripts.py:294 -- Deploying from import path: "local_dev:app". # 2022-08-11 11:31:50,372 INFO worker.py:1481 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265. # (ServeController pid=9865) INFO 2022-08-11 11:31:54,039 controller 9865 proxy_state.py:129 - Starting HTTP proxy with name 'SERVE_CONTROLLER_ACTOR:SERVE_PROXY_ACTOR-dff7dc5b97b4a11facaed746f02448224aa0c1fb651988ba7197e949' on node 'dff7dc5b97b4a11facaed746f02448224aa0c1fb651988ba7197e949' listening on '127.0.0.1:8000' # (ServeController pid=9865) INFO 2022-08-11 11:31:55,373 controller 9865 deployment_state.py:1232 - Adding 1 replicas to deployment 'Doubler'. # (ServeController pid=9865) INFO 2022-08-11 11:31:55,389 controller 9865 deployment_state.py:1232 - Adding 1 replicas to deployment 'HelloDeployment'. # (HTTPProxyActor pid=9872) INFO: Started server process [9872] # 2022-08-11 11:31:57,383 SUCC scripts.py:315 -- Deployed successfully. ``` The `serve run` command blocks the terminal and can be canceled with Ctrl-C. 
Typically, you shouldn't run `serve run` simultaneously from multiple terminals, unless each `serve run` is targeting a separate running Ray cluster.

Now that Serve is running, we can send HTTP requests to the application. For simplicity, we'll just use the `curl` command to send requests from another terminal.

```bash
curl -X PUT "http://localhost:8000/?name=Ray"
# Hello, Ray! Hello, Ray!
```

After you're done testing, you can shut down Ray Serve by interrupting the `serve run` command (e.g., with Ctrl-C):

```console
^C2022-08-11 11:47:19,829 INFO scripts.py:323 -- Got KeyboardInterrupt, shutting down...
(ServeController pid=9865) INFO 2022-08-11 11:47:19,926 controller 9865 deployment_state.py:1257 - Removing 1 replicas from deployment 'Doubler'.
(ServeController pid=9865) INFO 2022-08-11 11:47:19,929 controller 9865 deployment_state.py:1257 - Removing 1 replicas from deployment 'HelloDeployment'.
```

Note that rerunning `serve run` redeploys all deployments. To prevent redeploying the deployments whose code hasn't changed, use `serve deploy`; see the [Production Guide](serve-in-production) for details.

### Local Testing Mode

:::{note}
This is an experimental feature.
:::

Ray Serve supports a local testing mode that allows you to run your deployments locally in a single process. This mode is useful for unit testing and debugging your application logic without the overhead of a full Ray cluster. To enable this mode, use the `_local_testing_mode` flag in the `serve.run` function:

```{literalinclude} ../doc_code/local_dev.py
:start-after: __local_dev_testing_start__
:end-before: __local_dev_testing_end__
:language: python
```

This mode runs each deployment in a background thread and supports most of the same features as running on a full Ray cluster. Note that some features, such as converting `DeploymentResponses` to `ObjectRefs`, aren't supported in local testing mode. If you encounter limitations, consider filing a feature request on GitHub.

## Testing on a remote cluster

To test on a remote cluster, use `serve run` again, but this time, pass in an `--address` argument to specify the address of the Ray cluster to connect to. For remote clusters, this address has the form `ray://<head-node-ip-address>:10001`; see [Ray Client](ray-client-ref) for more information.

When making the transition from your local machine to a remote cluster, you'll need to make sure your cluster has a similar environment to your local machine--files, environment variables, and Python packages, for example.

Let's see a simple example that just packages the code. Run the following command on your local machine, with your remote cluster head node IP address substituted for `<head-node-ip-address>` in the command:

```bash
serve run --address=ray://<head-node-ip-address>:10001 --working-dir="./project/src" local_dev:app
```

This connects to the remote cluster with the Ray Client, uploads the `working_dir` directory, and runs your Serve application. Here, the local directory specified by `working_dir` must contain `local_dev.py` so that it can be uploaded to the cluster and imported by Ray Serve.

Once this is up and running, we can send requests to the application:

```bash
curl -X PUT http://<head-node-ip-address>:8000/?name=Ray
# Hello, Ray! Hello, Ray!
```

For more complex dependencies, including files outside the working directory, environment variables, and Python packages, you can use {ref}`Runtime Environments`.
This example uses the `--runtime-env-json` argument:

```bash
serve run --address=ray://<head-node-ip-address>:10001 --runtime-env-json='{"env_vars": {"MY_ENV_VAR": "my-value"}, "working_dir": "./project/src", "pip": ["requests", "chess"]}' local_dev:app
```

You can also specify the `runtime_env` in a YAML file; see [serve run](#serve-cli) for details.

## What's Next?

View details about your Serve application in the [Ray dashboard](dash-serve-view). Once you are ready to deploy to production, see the [Production Guide](serve-in-production).

---

(serve-performance-batching-requests)=
# Dynamic Request Batching

Serve offers a request batching feature that can improve your service throughput without sacrificing latency. This improvement is possible because ML models can utilize efficient vectorized computation to process a batch of requests at a time. Batching is also necessary when your model is expensive to use and you want to maximize the utilization of hardware.

Machine Learning (ML) frameworks such as TensorFlow, PyTorch, and Scikit-Learn support evaluating multiple samples at the same time. Ray Serve allows you to take advantage of this feature with dynamic request batching. When a request arrives, Serve puts the request in a queue. This queue buffers the requests to form a batch. The deployment picks up the batch and evaluates it. After the evaluation, Ray Serve splits up the resulting batch and returns each response individually.

## Enable batching for your deployment

You can enable batching by using the {mod}`ray.serve.batch` decorator. The following simple example modifies the `Model` class to accept a batch:

```{literalinclude} ../doc_code/batching_guide.py
---
start-after: __single_sample_begin__
end-before: __single_sample_end__
---
```

The batching decorator expects you to make the following changes in your method signature:

- Declare the method as an async method because the decorator batches requests on the asyncio event loop.
- Modify the method to accept a list of its original input types as input. For example, `arg1: int, arg2: str` should be changed to `arg1: List[int], arg2: List[str]`.
- Modify the method to return a list. The return list and the input list must have equal lengths for the decorator to split the output evenly and return a corresponding response back to its respective request.

```{literalinclude} ../doc_code/batching_guide.py
---
start-after: __batch_begin__
end-before: __batch_end__
emphasize-lines: 11-12
---
```

You can supply four optional parameters to the decorator:

- `max_batch_size` controls the size of the batch. The default value is 10.
- `batch_wait_timeout_s` controls how long Serve should wait for a batch once the first request arrives. The default value is 0.01 (10 milliseconds).
- `max_concurrent_batches` controls the maximum number of batches that can run concurrently. The default value is 1.
- `batch_size_fn` is an optional function to compute the effective batch size. If provided, this function takes a list of items and returns an integer representing the batch size. This is useful for batching based on custom metrics such as total nodes in graphs or total tokens in sequences. If `None` (the default), the batch size is computed as `len(batch)`.

Once the first request arrives, the batching decorator waits for a full batch (up to `max_batch_size`) until `batch_wait_timeout_s` is reached. If the timeout is reached, Serve sends the batch to the model regardless of the batch size.
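As a quick point of reference, here's a minimal, self-contained sketch of a batched deployment. The class, inputs, and parameter values are illustrative and aren't taken from the `batching_guide.py` example above:

```python
from typing import List

from ray import serve
from starlette.requests import Request


@serve.deployment
class BatchedModel:
    @serve.batch(max_batch_size=8, batch_wait_timeout_s=0.05)
    async def handle_batch(self, texts: List[str]) -> List[str]:
        # The whole batch arrives as one list; return exactly one output per input.
        return [text.upper() for text in texts]

    async def __call__(self, request: Request) -> str:
        # Callers still send single requests; the decorator assembles the batch
        # and routes each result back to its original caller.
        return await self.handle_batch(request.query_params["text"])


app = BatchedModel.bind()
```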
:::{tip} You can reconfigure your `batch_wait_timeout_s` and `max_batch_size` parameters using the `set_batch_wait_timeout_s` and `set_max_batch_size` methods: ```{literalinclude} ../doc_code/batching_guide.py --- start-after: __batch_params_update_begin__ end-before: __batch_params_update_end__ --- ``` Use these methods in the constructor or the `reconfigure` [method](serve-user-config) to control the `@serve.batch` parameters through your Serve configuration file. ::: ## Custom batch size functions By default, Ray Serve measures batch size as the number of items in the batch (`len(batch)`). However, in many workloads, the computational cost depends on properties of the items themselves rather than just the count. For example: - **Graph Neural Networks (GNNs)**: The cost depends on the total number of nodes across all graphs, not the number of graphs - **Natural Language Processing (NLP)**: Transformer models batch by total token count, not the number of sequences - **Variable-resolution images**: Memory usage depends on total pixels, not the number of images Use the `batch_size_fn` parameter to define a custom metric for batch size: ### Graph Neural Network example The following example shows how to batch graph data by total node count: ```{literalinclude} ../doc_code/batching_guide.py --- start-after: __batch_size_fn_begin__ end-before: __batch_size_fn_end__ emphasize-lines: 20 --- ``` In this example, `batch_size_fn=lambda graphs: sum(g.num_nodes for g in graphs)` ensures that the batch contains at most 10,000 total nodes, preventing GPU memory overflow regardless of how many individual graphs are in the batch. ### NLP token batching example The following example shows how to batch text sequences by total token count: ```{literalinclude} ../doc_code/batching_guide.py --- start-after: __batch_size_fn_nlp_begin__ end-before: __batch_size_fn_nlp_end__ emphasize-lines: 12 --- ``` This pattern ensures that the total number of tokens doesn't exceed the model's context window or memory limits. (serve-streaming-batched-requests-guide)= ## Streaming batched requests Use an async generator to stream the outputs from your batched requests. The following example converts the `StreamingResponder` class to accept a batch. ```{literalinclude} ../doc_code/batching_guide.py --- start-after: __single_stream_begin__ end-before: __single_stream_end__ --- ``` Decorate async generator functions with the {mod}`ray.serve.batch` decorator. Similar to non-streaming methods, the function takes in a `List` of inputs and in each iteration it `yield`s an iterable of outputs with the same length as the input batch size. ```{literalinclude} ../doc_code/batching_guide.py --- start-after: __batch_stream_begin__ end-before: __batch_stream_end__ --- ``` Calling the `serve.batch`-decorated function returns an async generator that you can `await` to receive results. Some inputs within a batch may generate fewer outputs than others. When a particular input has nothing left to yield, pass a `StopIteration` object into the output iterable. This action terminates the generator that Serve returns when it calls the `serve.batch` function with that input. When `serve.batch`-decorated functions return streaming generators over HTTP, this action allows the end client's connection to terminate once its call is done, instead of waiting until the entire batch is done. ## Tips for fine-tuning batching parameters `max_batch_size` ideally should be a power of 2 (2, 4, 8, 16, ...) 
because CPUs and GPUs are both optimized for data of these shapes. Large batch sizes incur a high memory cost as well as a latency penalty for the first few requests.

When using `batch_size_fn`, set `max_batch_size` based on your custom metric rather than the item count. For example, if batching by total nodes in graphs, set `max_batch_size` to your GPU's maximum node capacity (such as 10,000 nodes) rather than a count of graphs.

Set `batch_wait_timeout_s` considering the end-to-end latency SLO (Service Level Objective). For example, if your latency target is 150ms, and the model takes 100ms to evaluate the batch, set the `batch_wait_timeout_s` to a value much lower than 150ms - 100ms = 50ms.

When using batching in a Serve Deployment Graph, the relationship between an upstream node and a downstream node might affect the performance as well. Consider a chain of two models where the first model sets `max_batch_size=8` and the second model sets `max_batch_size=6`. In this scenario, after the first model finishes a full batch of 8, the second model processes one full batch of 6 and then a partial batch of the remaining 8 - 6 = 2 requests, which incurs extra latency while the partial batch waits to fill. The batch sizes of downstream models should ideally be multiples or divisors of the upstream models' batch sizes so the batches work optimally together.

---

(serve-set-up-grpc-service)=
# Set Up a gRPC Service

This section helps you understand how to:

- Build a user-defined gRPC service and protobuf
- Start Serve with gRPC enabled
- Deploy gRPC applications
- Send gRPC requests to Serve deployments
- Check proxy health
- Work with gRPC metadata
- Use streaming and model composition
- Handle errors
- Use gRPC context

(custom-serve-grpc-service)=
## Define a gRPC service

Running a gRPC server starts with defining gRPC services, RPC methods, and protobufs similar to the ones below.

```{literalinclude} ../doc_code/grpc_proxy/user_defined_protos.proto
:start-after: __begin_proto__
:end-before: __end_proto__
:language: proto
```

This example creates a file named `user_defined_protos.proto` with two gRPC services: `UserDefinedService` and `ImageClassificationService`. `UserDefinedService` has three RPC methods: `__call__`, `Multiplexing`, and `Streaming`. `ImageClassificationService` has one RPC method: `Predict`. Their corresponding input and output types are also defined specifically for each RPC method.

Once you define the `.proto` services, use `grpcio-tools` to compile Python code for those services. An example command looks like the following:

```bash
python -m grpc_tools.protoc -I=. --python_out=. --grpc_python_out=. ./user_defined_protos.proto
```

It generates two files: `user_defined_protos_pb2.py` and `user_defined_protos_pb2_grpc.py`. For more details on `grpcio-tools` see [https://grpc.io/docs/languages/python/basics/#generating-client-and-server-code](https://grpc.io/docs/languages/python/basics/#generating-client-and-server-code).

:::{note}
Ensure that the generated files are in the same directory as where the Ray cluster is running so that Serve can import them when starting the proxies.
:::

(start-serve-with-grpc-proxy)=
## Start Serve with gRPC enabled

The [Serve start](https://docs.ray.io/en/releases-2.7.0/serve/api/index.html#serve-start) CLI, [`ray.serve.start`](https://docs.ray.io/en/releases-2.7.0/serve/api/doc/ray.serve.start.html#ray.serve.start) API, and [Serve config files](https://docs.ray.io/en/releases-2.7.0/serve/production-guide/config.html#serve-config-files-serve-build) all support starting Serve with a gRPC proxy. Two options are related to Serve's gRPC proxy: `grpc_port` and `grpc_servicer_functions`. `grpc_port` is the port for gRPC proxies to listen to. It defaults to 9000. `grpc_servicer_functions` is a list of import paths for gRPC `add_servicer_to_server` functions to add to a gRPC proxy. It also serves as the flag that determines whether to start the gRPC server. The default is an empty list, meaning no gRPC server is started.

::::{tab-set}

:::{tab-item} CLI
```bash
ray start --head
serve start \
  --grpc-port 9000 \
  --grpc-servicer-functions user_defined_protos_pb2_grpc.add_UserDefinedServiceServicer_to_server \
  --grpc-servicer-functions user_defined_protos_pb2_grpc.add_ImageClassificationServiceServicer_to_server
```
:::

:::{tab-item} Python API
```{literalinclude} ../doc_code/grpc_proxy/grpc_guide.py
:start-after: __begin_start_grpc_proxy__
:end-before: __end_start_grpc_proxy__
:language: python
```
:::

:::{tab-item} Serve config file
```yaml
# config.yaml
grpc_options:
  port: 9000
  grpc_servicer_functions:
    - user_defined_protos_pb2_grpc.add_UserDefinedServiceServicer_to_server
    - user_defined_protos_pb2_grpc.add_ImageClassificationServiceServicer_to_server

applications:
  - name: app1
    route_prefix: /app1
    import_path: test_deployment_v2:g
    runtime_env: {}
  - name: app2
    route_prefix: /app2
    import_path: test_deployment_v2:g2
    runtime_env: {}
```

```bash
# Start Serve with the above config file.
serve run config.yaml
```
:::

::::

:::{note}
The default max gRPC message size is ~2GB. To adjust it, set `RAY_SERVE_GRPC_MAX_MESSAGE_SIZE` (in bytes) before starting Ray, e.g., `export RAY_SERVE_GRPC_MAX_MESSAGE_SIZE=104857600` for 100MB.
:::

(deploy-serve-grpc-applications)=
## Deploy gRPC applications

gRPC applications in Serve work similarly to HTTP applications. The only difference is that the input and output of the methods need to match what's defined in the `.proto` file and that the method name of the application needs to exactly match (case sensitive) the predefined RPC method. For example, if you want to deploy `UserDefinedService` with the `__call__` method, the method name needs to be `__call__`, the input type needs to be `UserDefinedMessage`, and the output type needs to be `UserDefinedResponse`. Serve passes the protobuf object into the method and expects the protobuf object back from the method.

Example deployment:

```{literalinclude} ../doc_code/grpc_proxy/grpc_guide.py
:start-after: __begin_grpc_deployment__
:end-before: __end_grpc_deployment__
:language: python
```

Deploy the application:

```{literalinclude} ../doc_code/grpc_proxy/grpc_guide.py
:start-after: __begin_deploy_grpc_app__
:end-before: __end_deploy_grpc_app__
:language: python
```

:::{note}
`route_prefix` is still a required field as of Ray 2.7.0 due to a shared code path with HTTP. Future releases will make it optional for gRPC.
:::

(send-serve-grpc-proxy-request)=
## Send gRPC requests to Serve deployments

Sending a gRPC request to a Serve deployment is similar to sending a gRPC request to any other gRPC server.
Create a gRPC channel and stub, then call the RPC method on the stub with the appropriate input. The output is the protobuf object that your Serve application returns.

Sending a gRPC request:

```{literalinclude} ../doc_code/grpc_proxy/grpc_guide.py
:start-after: __begin_send_grpc_requests__
:end-before: __end_send_grpc_requests__
:language: python
```

Read more about gRPC clients in Python: [https://grpc.io/docs/languages/python/basics/#client](https://grpc.io/docs/languages/python/basics/#client)

(serve-grpc-proxy-health-checks)=
## Check proxy health

Similar to the HTTP `/-/routes` and `/-/healthz` endpoints, Serve also provides gRPC service methods that you can use for health checks.

- `/ray.serve.RayServeAPIService/ListApplications` is used to list all applications deployed in Serve.
- `/ray.serve.RayServeAPIService/Healthz` is used to check the health of the proxy. It returns an `OK` status and a "success" message if the proxy is healthy.

The service methods and protobufs are defined as follows:

```proto
message ListApplicationsRequest {}

message ListApplicationsResponse {
  repeated string application_names = 1;
}

message HealthzRequest {}

message HealthzResponse {
  string message = 1;
}

service RayServeAPIService {
  rpc ListApplications(ListApplicationsRequest) returns (ListApplicationsResponse);
  rpc Healthz(HealthzRequest) returns (HealthzResponse);
}
```

You can call the service methods with the following code:

```{literalinclude} ../doc_code/grpc_proxy/grpc_guide.py
:start-after: __begin_health_check__
:end-before: __end_health_check__
:language: python
```

:::{note}
Serve provides the `RayServeAPIServiceStub` stub, and the `HealthzRequest` and `ListApplicationsRequest` protobufs, for you to use. You don't need to generate them from the proto file. They are available for your reference.
:::

(serve-grpc-metadata)=
## Work with gRPC metadata

Just like HTTP headers, gRPC also supports metadata to pass request-related information. You can pass metadata to Serve's gRPC proxy, and Serve knows how to parse and use it. Serve also passes trailing metadata back to the client.

Serve accepts the following metadata keys:

- `application`: The name of the Serve application to route to. If not passed and only one application is deployed, Serve routes to the only deployed app automatically.
- `request_id`: The request ID to track the request.
- `multiplexed_model_id`: The model ID to do model multiplexing.

Serve returns the following trailing metadata keys:

- `request_id`: The request ID to track the request.

Example of using metadata:

```{literalinclude} ../doc_code/grpc_proxy/grpc_guide.py
:start-after: __begin_metadata__
:end-before: __end_metadata__
:language: python
```

(serve-grpc-proxy-more-examples)=
## Use streaming and model composition

The gRPC proxy remains at feature parity with the HTTP proxy. Here are more examples of using the gRPC proxy to get streaming responses and do model composition.

### Streaming

The `Streaming` method is deployed with the app named "app1" above. The following code gets a streaming response.

```{literalinclude} ../doc_code/grpc_proxy/grpc_guide.py
:start-after: __begin_streaming__
:end-before: __end_streaming__
:language: python
```

### Model composition

Assume you have the following deployments. `ImageDownloader` and `DataPreprocessor` are two separate steps to download and process the image before PyTorch can run inference.
The `ImageClassifier` deployment initializes the model, calls both `ImageDownloader` and `DataPreprocessor`, and feeds the result into the ResNet model to get the classes and probabilities of the given image.

```{literalinclude} ../doc_code/grpc_proxy/grpc_guide.py
:start-after: __begin_model_composition_deployment__
:end-before: __end_model_composition_deployment__
:language: python
```

We can deploy the application with the following code:

```{literalinclude} ../doc_code/grpc_proxy/grpc_guide.py
:start-after: __begin_model_composition_deploy__
:end-before: __end_model_composition_deploy__
:language: python
```

The client code to call the application looks like the following:

```{literalinclude} ../doc_code/grpc_proxy/grpc_guide.py
:start-after: __begin_model_composition_client__
:end-before: __end_model_composition_client__
:language: python
```

:::{note}
At this point, two applications are running on Serve, "app1" and "app2". If more than one application is running, you need to pass `application` in the metadata so Serve knows which application to route to.
:::

(serve-grpc-proxy-error-handling)=
## Handle errors

Similar to any other gRPC server, a request raises a `grpc.RpcError` when the response code is not "OK". Put your request code in a try-except block and handle the error accordingly.

```{literalinclude} ../doc_code/grpc_proxy/grpc_guide.py
:start-after: __begin_error_handle__
:end-before: __end_error_handle__
:language: python
```

Serve uses the following gRPC error codes:

- `NOT_FOUND`: When multiple applications are deployed to Serve and the application is either not passed in the metadata or the passed name doesn't match any deployed application.
- `UNAVAILABLE`: Only on the health check methods when the proxy is in a draining state. When the health check returns `UNAVAILABLE`, it means the health check failed on this node and you should no longer route to this node.
- `DEADLINE_EXCEEDED`: The request took longer than the timeout setting and was cancelled.
- `INTERNAL`: Other unhandled errors during the request.

(serve-grpc-proxy-grpc-context)=
## Use gRPC context

Serve provides a [gRPC context object](https://grpc.github.io/grpc/python/grpc.html#grpc.ServicerContext) to the deployment replica to get information about the request, as well as to set response metadata such as the status code and details. If the handler function is defined with a `grpc_context` argument, Serve passes a [RayServegRPCContext](../api/doc/ray.serve.grpc_util.RayServegRPCContext.rst) object in for each request. Below is an example of how to set a custom status code, details, and trailing metadata.

```{literalinclude} ../doc_code/grpc_proxy/grpc_guide.py
:start-after: __begin_grpc_context_define_app__
:end-before: __end_grpc_context_define_app__
:language: python
```

The following client code gets those attributes.

```{literalinclude} ../doc_code/grpc_proxy/grpc_guide.py
:start-after: __begin_grpc_context_client__
:end-before: __end_grpc_context_client__
:language: python
```

:::{note}
If the handler raises an unhandled exception, Serve returns an `INTERNAL` error code with the stacktrace in the details, regardless of what code and details are set in the `RayServegRPCContext` object.
::: --- (serve-advanced-guides)= # Advanced Guides ```{toctree} :hidden: app-builder-guide advanced-autoscaling asyncio-best-practices performance dyn-req-batch inplace-updates dev-workflow grpc-guide replica-ranks replica-scheduling managing-java-deployments deploy-vm multi-app-container custom-request-router multi-node-gpu-troubleshooting ``` If you’re new to Ray Serve, start with the [Ray Serve Quickstart](serve-getting-started). Use these advanced guides for more options and configurations: - [Pass Arguments to Applications](app-builder-guide) - [Advanced Ray Serve Autoscaling](serve-advanced-autoscaling) - [Asyncio and Concurrency best practices in Ray Serve](serve-asyncio-best-practices) - [Performance Tuning](serve-perf-tuning) - [Dynamic Request Batching](serve-performance-batching-requests) - [In-Place Updates for Serve](serve-inplace-updates) - [Development Workflow](serve-dev-workflow) - [gRPC Support](serve-set-up-grpc-service) - [Replica Ranks](serve-replica-ranks) - [Replica Scheduling](serve-replica-scheduling) - [Ray Serve Dashboard](dash-serve-view) - [Experimental Java API](serve-java-api) - [Run Applications in Different Containers](serve-container-runtime-env-guide) - [Use Custom Algorithm for Request Routing](custom-request-router) - [Troubleshoot multi-node GPU setups for serving LLMs](multi-node-gpu-troubleshooting) --- (serve-inplace-updates)= # Updating Applications In-Place You can update your Serve applications once they're in production by updating the settings in your config file and redeploying it using the `serve deploy` command. In the redeployed config file, you can add new deployment settings or remove old deployment settings. This is because `serve deploy` is **idempotent**, meaning your Serve application's config always matches (or honors) the latest config you deployed successfully – regardless of what config files you deployed before that. (serve-in-production-lightweight-update)= ## Lightweight Config Updates Lightweight config updates modify running deployment replicas without tearing them down and restarting them, so there's less downtime as the deployments update. For each deployment, modifying the following values is considered a lightweight config update, and won't tear down the replicas for that deployment: - `num_replicas` - `autoscaling_config` - `user_config` - `max_ongoing_requests` - `graceful_shutdown_timeout_s` - `graceful_shutdown_wait_loop_s` - `health_check_period_s` - `health_check_timeout_s` (serve-updating-user-config)= ## Updating the user config This example uses the text summarization and translation application [from the production guide](production-config-yaml). Both of the individual deployments contain a `reconfigure()` method. This method allows you to issue lightweight updates to the deployments by updating the `user_config`. First let's deploy the graph. Make sure to stop any previous Ray cluster using the CLI command `ray stop` for this example: ```console $ ray start --head $ serve deploy serve_config.yaml ``` Then send a request to the application: ```{literalinclude} ../doc_code/production_guide/text_ml.py :language: python :start-after: __start_client__ :end-before: __end_client__ ``` Change the language that the text is translated into from French to German by changing the `language` attribute in the `Translator` user config: ```yaml ... 
applications: - name: default route_prefix: / import_path: text_ml:app runtime_env: pip: - torch - transformers deployments: - name: Translator num_replicas: 1 user_config: language: german ... ``` Without stopping the Ray cluster, redeploy the app using `serve deploy`: ```console $ serve deploy serve_config.yaml ... ``` We can inspect our deployments with `serve status`. Once the application's `status` returns to `RUNNING`, we can try our request one more time: ```console $ serve status proxies: cef533a072b0f03bf92a6b98cb4eb9153b7b7c7b7f15954feb2f38ec: HEALTHY applications: default: status: RUNNING message: '' last_deployed_time_s: 1694041157.2211847 deployments: Translator: status: HEALTHY replica_states: RUNNING: 1 message: '' Summarizer: status: HEALTHY replica_states: RUNNING: 1 message: '' ``` The language has updated. Now the returned text is in German instead of French. ```{literalinclude} ../doc_code/production_guide/text_ml.py :language: python :start-after: __start_second_client__ :end-before: __end_second_client__ ``` ## Code Updates Changing the following values in a deployment's config will trigger redeployment and restart all the deployment's replicas. - `ray_actor_options` - `placement_group_bundles` - `placement_group_strategy` Changing the following application-level config values is also considered a code update, and all deployments in the application will be restarted. - `import_path` - `runtime_env` :::{warning} Although you can update your Serve application by deploying an entirely new deployment graph using a different `import_path` and a different `runtime_env`, this is NOT recommended in production. The best practice for large-scale code updates is to start a new Ray cluster, deploy the updated code to it using `serve deploy`, and then switch traffic from your old cluster to the new one. ::: --- (serve-java-api)= # Experimental Java API :::{warning} Java API support is an experimental feature and subject to change. The Java API is not currently supported on KubeRay. ::: Java is a mainstream programming language for production services. Ray Serve offers a native Java API for creating, updating, and managing deployments. You can create Ray Serve deployments using Java and call them via Python, or vice versa. This section helps you to: - create, query, and update Java deployments - configure Java deployment resources - manage Python deployments using the Java API ```{contents} ``` ## Creating a Deployment By specifying the full name of the class as an argument to the `Serve.deployment()` method, as shown in the code below, you can create and deploy a deployment of the class. ```{literalinclude} ../../../../java/serve/src/test/java/io/ray/serve/docdemo/ManageDeployment.java :start-after: docs-create-start :end-before: docs-create-end :language: java ``` ## Accessing a Deployment Once a deployment is deployed, you can fetch its instance by name. ```{literalinclude} ../../../../java/serve/src/test/java/io/ray/serve/docdemo/ManageDeployment.java :start-after: docs-query-start :end-before: docs-query-end :language: java ``` ## Updating a Deployment You can update a deployment's code and configuration and then redeploy it. The following example updates the `"counter"` deployment's initial value to 2. 
```{literalinclude} ../../../../java/serve/src/test/java/io/ray/serve/docdemo/ManageDeployment.java :start-after: docs-update-start :end-before: docs-update-end :language: java ``` ## Configuring a Deployment Ray Serve lets you configure your deployments to: - scale out by increasing the number of [deployment replicas](serve-architecture-high-level-view) - assign [replica resources](serve-cpus-gpus) such as CPUs and GPUs. The next two sections describe how to configure your deployments. ### Scaling Out By specifying the `numReplicas` parameter, you can change the number of deployment replicas: ```{literalinclude} ../../../../java/serve/src/test/java/io/ray/serve/docdemo/ManageDeployment.java :start-after: docs-scale-start :end-before: docs-scale-end :language: java ``` ### Resource Management (CPUs, GPUs) Through the `rayActorOptions` parameter, you can reserve resources for each deployment replica, such as one GPU: ```{literalinclude} ../../../../java/serve/src/test/java/io/ray/serve/docdemo/ManageDeployment.java :start-after: docs-resource-start :end-before: docs-resource-end :language: java ``` ## Managing a Python Deployment A Python deployment can also be managed and called by the Java API. Suppose you have a Python file `counter.py` in the `/path/to/code/` directory: ```python from ray import serve @serve.deployment class Counter(object): def __init__(self, value): self.value = int(value) def increase(self, delta): self.value += int(delta) return str(self.value) ``` You can deploy it through the Java API and call it through a `RayServeHandle`: ```java import io.ray.api.Ray; import io.ray.serve.api.Serve; import io.ray.serve.deployment.Deployment; import io.ray.serve.generated.DeploymentLanguage; import java.io.File; public class ManagePythonDeployment { public static void main(String[] args) { System.setProperty( "ray.job.code-search-path", System.getProperty("java.class.path") + File.pathSeparator + "/path/to/code/"); Serve.start(true, false, null); Deployment deployment = Serve.deployment() .setDeploymentLanguage(DeploymentLanguage.PYTHON) .setName("counter") .setDeploymentDef("counter.Counter") .setNumReplicas(1) .setInitArgs(new Object[] {"1"}) .create(); deployment.deploy(true); System.out.println(Ray.get(deployment.getHandle().method("increase").remote("2"))); } } ``` :::{note} Before `Ray.init` or `Serve.start`, you need to specify a directory to find the Python code. For details, please refer to [Cross-Language Programming](cross_language). ::: ## Future Roadmap In the future, Ray Serve plans to provide more Java features, such as: - an improved Java API that matches the Python version - HTTP ingress support - bring-your-own Java Spring project as a deployment --- (serve-container-runtime-env-guide)= # Run Multiple Applications in Different Containers This section explains how to run multiple Serve applications on the same cluster in separate containers with different images. This feature is experimental and the API is subject to change. If you have additional feature requests or run into issues, please submit them on [Github](https://github.com/ray-project/ray/issues). ## Install Podman The `image_uri` runtime environment feature uses [Podman](https://podman.io/) to start and run containers. Follow the [Podman Installation Instructions](https://podman.io/docs/installation) to install Podman in the environment for all head and worker nodes. :::{note} For Ubuntu, the Podman package is only available in the official repositories for Ubuntu 20.10 and newer. 
To install Podman in Ubuntu 20.04 or older, you need to first add the software repository as a debian source. Follow these instructions to install Podman on Ubuntu 20.04 or older: ```bash sudo sh -c "echo 'deb http://download.opensuse.org/repositories/devel:/kubic:/libcontainers:/stable/xUbuntu_20.04/ /' > /etc/apt/sources.list.d/devel:kubic:libcontainers:stable.list" sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys 4D64390375060AA4 sudo apt-get update sudo apt-get install podman -y ``` ::: ## Run a Serve application in a container This example deploys two applications in separate containers: a Whisper model and a Resnet50 image classification model. First, install the required dependencies in the images. :::{warning} The Ray version and Python version in the container *must* match those of the host environment exactly. Note that for Python, the versions must match down to the patch number. ::: Save the following to files named `whisper.Dockerfile` and `resnet.Dockerfile`. ::::{tab-set} :::{tab-item} whisper.Dockerfile ```dockerfile # Use the latest Ray GPU image, `rayproject/ray:latest-py38-gpu`, so the Whisper model can run on GPUs. FROM rayproject/ray:latest-py38-gpu # Install the package `faster_whisper`, which is a dependency for the Whisper model. RUN pip install faster_whisper==0.10.0 RUN sudo apt-get update && sudo apt-get install curl -y # Download the source code for the Whisper application into `whisper_example.py`. RUN curl -O https://raw.githubusercontent.com/ray-project/ray/master/doc/source/serve/doc_code/whisper_example.py # Add /home/ray path to PYTHONPATH avoid import module error ENV PYTHONPATH "${PYTHONPATH}:/home/ray" ``` ::: :::{tab-item} resnet.Dockerfile ```dockerfile # Use the latest Ray CPU image, `rayproject/ray:latest-py38-cpu`. FROM rayproject/ray:latest-py38-cpu # Install the packages `torch` and `torchvision`, which are dependencies for the ResNet model. RUN pip install torch==2.0.1 torchvision==0.15.2 RUN sudo apt-get update && sudo apt-get install curl -y # Download the source code for the ResNet application into `resnet50_example.py`. RUN curl -O https://raw.githubusercontent.com/ray-project/ray/master/doc/source/serve/doc_code/resnet50_example.py # Add /home/ray path to PYTHONPATH avoid import module error ENV PYTHONPATH "${PYTHONPATH}:/home/ray" ``` ::: :::: Then, build the corresponding images and push it to your choice of container registry. This tutorial uses `alice/whisper_image:latest` and `alice/resnet_image:latest` as placeholder names for the images, but make sure to swap out `alice` for a repo name of your choice. ::::{tab-set} :::{tab-item} Whisper ```bash # Build the image from the Dockerfile using Podman export IMG1=alice/whisper_image:latest podman build -t $IMG1 -f whisper.Dockerfile . # Push to a registry. This step is unnecessary if you are deploying Serve locally. podman push $IMG1 ``` ::: :::{tab-item} Resnet ```bash # Build the image from the Dockerfile using Podman export IMG2=alice/resnet_image:latest podman build -t $IMG2 -f resnet.Dockerfile . # Push to a registry. This step is unnecessary if you are deploying Serve locally. podman push $IMG2 ``` ::: :::: Finally, you can specify the container image within which you want to run each application in the `image_uri` field of an application's runtime environment specification. :::{note} Previously you could access the feature through the `container` field of the runtime environment. That API is now deprecated in favor of `image_uri`. 
::: The following Serve config runs the `whisper` app with the image `IMG1`, and the `resnet` app with the image `IMG2`. `podman images` command can be used to list the names of the images. Concretely, all deployment replicas in the applications start and run in containers with the respective images. ```yaml applications: - name: whisper import_path: whisper_example:entrypoint route_prefix: /whisper runtime_env: image_uri: {IMG1} - name: resnet import_path: resnet50_example:app route_prefix: /resnet runtime_env: image_uri: {IMG2} ``` ### Send queries ```python >>> import requests >>> audio_file = "https://storage.googleapis.com/public-lyrebird-test/test_audio_22s.wav" >>> resp = requests.post("http://localhost:8000/whisper", json={"filepath": audio_file}) # doctest: +SKIP >>> resp.json() # doctest: +SKIP { "language": "en", "language_probability": 1, "duration": 21.775, "transcript_text": " Well, think about the time of our ancestors. A ping, a ding, a rustling in the bushes is like, whoo, that means an immediate response. Oh my gosh, what's that thing? Oh my gosh, I have to do it right now. And dude, it's not a tiger, right? Like, but our, our body treats stress as if it's life-threatening because to quote Robert Sapolsky or butcher his quote, he's a Robert Sapolsky is like one of the most incredible stress physiologists of", "whisper_alignments": [ [ 0.0, 0.36, " Well,", 0.3125 ], ... ] } >>> link_to_image = "https://serve-resnet-benchmark-data.s3.us-west-1.amazonaws.com/000000000019.jpeg" >>> resp = requests.post("http://localhost:8000/resnet", json={"uri": link_to_image}) # doctest: +SKIP >>> resp.text # doctest: +SKIP ox ``` ## Advanced ### Compatibility with other runtime environment fields Currently, use of the `image_uri` field is only supported with `config` and `env_vars`. If you have a use case for pairing `image_uri` with another runtime environment feature, submit a feature request on [Github](https://github.com/ray-project/ray/issues). ### Environment variables The following environment variables will be set for the process in your container, in order of highest to lowest priority: 1. Environment variables specified in `runtime_env["env_vars"]`. 2. All environment variables that start with the prefix `RAY_` (including the two special variables `RAY_RAYLET_PID` and `RAY_JOB_ID`) are inherited by the container at runtime. 3. Any environment variables set in the docker image. ### Running the Ray cluster in a Docker container If raylet is running inside a container, then that container needs the necessary permissions to start a new container. To setup correct permissions, you need to start the container that runs the raylet with the flag `--privileged`. ### Troubleshooting * **Permission denied: '/tmp/ray/session_2023-11-28_15-27-22_167972_6026/ports_by_node.json.lock'** * This error likely occurs because the user running inside the Podman container is different from the host user that started the Ray cluster. The folder `/tmp/ray`, which is volume mounted into the podman container, is owned by the host user that started Ray. The container, on the other hand, is started with the flag `--userns=keep-id`, meaning the host user is mapped into the container as itself. Therefore, permissions issues should only occur if the user inside the container is different from the host user. 
For instance, if the user on host is `root`, and you're using a container whose base image is a standard Ray image, then by default the container starts with user `ray(1000)`, who won't be able to access the mounted `/tmp/ray` volume. * **ERRO[0000] 'overlay' is not supported over overlayfs: backing file system is unsupported for this graph driver** * This error should only occur when you're running the Ray cluster inside a container. If you see this error when starting the replica actor, try volume mounting `/var/lib/containers` in the container that runs raylet. That is, add `-v /var/lib/containers:/var/lib/containers` to the command that starts the Docker container. * **cannot clone: Operation not permitted; Error: cannot re-exec process** * This error should only occur when you're running the Ray cluster inside a container. This error implies that you don't have the permissions to use Podman to start a container. You need to start the container that runs raylet, with privileged permissions by adding `--privileged`. --- (serve-multi-node-gpu-troubleshooting)= # Troubleshoot multi-node GPU serving on KubeRay This guide helps you diagnose and resolve common issues when deploying multi-node GPU workloads on KubeRay, particularly for large language model (LLM) serving with vLLM. ## Debugging strategy When encountering issues with multi-node GPU serving, use this systematic approach to isolate the problem: 1. **Test on different platforms** Compare behavior between: - Single node without KubeRay - Standalone vLLM server on KubeRay - Ray Serve LLM deployment on KubeRay 2. **Vary hardware configurations** Test with different GPU types—for example, A100s vs H100s—to identify hardware-specific issues 3. **Use minimal reproducers** Create simplified test cases that isolate specific components (NCCL, model loading, etc.) ## Common issues and solutions ### 1. Head pod scheduled on GPU node **Symptoms** - `ray status` shows duplicate GPU resources, for example, 24 GPUs when cluster only has 16 GPUs - Model serving hangs when using pipeline parallelism (PP > 1) - Resource allocation conflicts **Root Cause** The Ray head pod is incorrectly scheduled on a GPU worker node, causing resource accounting issues. **Solution** Configure the head pod to use zero GPUs in your RayCluster specification: ```yaml apiVersion: ray.io/v1 kind: RayCluster metadata: name: my-cluster spec: headGroupSpec: rayStartParams: num-cpus: "0" num-gpus: "0" # Ensure head pod doesn't claim GPU resources. # ... other head group configuration ``` ### 2. AWS OFI plugin version issues (H100-specific) **Symptoms** - NCCL initialization failures on H100 instances - Works fine on A100 but fails on H100 with identical configuration - Malformed topology files **Root Cause** Outdated `aws-ofi-plugin` in container images causes NCCL topology detection to fail on H100 instances. **Related issues** - [NVIDIA NCCL Issue #1726](https://github.com/NVIDIA/nccl/issues/1726) - [vLLM Issue #18997](https://github.com/vllm-project/vllm/issues/18997) - [AWS OFI NCCL Fix](https://github.com/aws/aws-ofi-nccl/pull/916) **Solution** - Update to a newer container image with an updated `aws-ofi-plugin` - Use the NCCL debugging script below to verify NCCL functions as expected - Consider hardware-specific configuration adjustments ## Further troubleshooting If you continue to experience issues after following this guide: 1. **Collect diagnostic information**: Run the NCCL debugging script below and save the output 2. 
**Check compatibility**: Verify Ray, vLLM, PyTorch, and CUDA versions are compatible 3. **Review logs**: Examine Ray cluster logs and worker pod logs for additional error details 4. **Hardware verification**: Test with different GPU types if possible 5. **Community support**: Share your findings with the Ray and vLLM communities for additional help ## Additional resources - [Ray Multi-Node GPU Guide](https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/gpu.html) - [vLLM Distributed Serving Documentation](https://docs.vllm.ai/en/latest/serving/distributed_serving.html) - [NCCL Troubleshooting Guide](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html) ## NCCL debugging script Use this diagnostic script to identify NCCL-related issues in your multi-node GPU setup: ```python #!/usr/bin/env python3 """ NCCL Diagnostic Script for Multi-Node GPU Serving This script helps identify NCCL configuration issues that can cause multi-node GPU serving failures. Run this script on each node to verify NCCL function before deploying distributed workloads. Usage: python3 multi-node-nccl-check.py """ import os import sys import socket import torch from datetime import datetime def log(msg): """Log messages with timestamp for better debugging.""" timestamp = datetime.now().strftime("%H:%M:%S") print(f"[{timestamp}] {msg}", flush=True) def print_environment_info(): """Print relevant environment information for debugging.""" log("=== Environment Information ===") log(f"Hostname: {socket.gethostname()}") log(f"CUDA_VISIBLE_DEVICES: {os.environ.get('CUDA_VISIBLE_DEVICES', 'not set')}") # Print all NCCL-related environment variables. nccl_vars = [var for var in os.environ.keys() if var.startswith('NCCL_')] if nccl_vars: log("NCCL Environment Variables:") for var in sorted(nccl_vars): log(f" {var}: {os.environ[var]}") else: log("No NCCL environment variables set") def check_cuda_availability(): """Verify CUDA is available and functional.""" log("\n=== CUDA Availability Check ===") if not torch.cuda.is_available(): log("ERROR: CUDA not available") return False device_count = torch.cuda.device_count() log(f"CUDA device count: {device_count}") log(f"PyTorch version: {torch.__version__}") # Check NCCL availability in PyTorch. try: import torch.distributed as dist if hasattr(torch.distributed, 'nccl'): log(f"PyTorch NCCL available: {torch.distributed.is_nccl_available()}") except Exception as e: log(f"Error checking NCCL availability: {e}") return True def test_individual_gpus(): """Test that each GPU is working individually.""" log("\n=== Individual GPU Tests ===") for gpu_id in range(torch.cuda.device_count()): log(f"\n--- Testing GPU {gpu_id} ---") try: torch.cuda.set_device(gpu_id) device = torch.cuda.current_device() log(f"Device {device}: {torch.cuda.get_device_name(device)}") # Print device properties. props = torch.cuda.get_device_properties(device) log(f" Compute capability: {props.major}.{props.minor}") log(f" Total memory: {props.total_memory / 1024**3:.2f} GB") # Test basic CUDA operations. log(" Testing basic CUDA operations...") tensor = torch.ones(1000, device=f'cuda:{gpu_id}') result = tensor.sum() log(f" Basic CUDA test passed: sum = {result.item()}") # Test cross-GPU operations if multiple GPUs are available. 
if torch.cuda.device_count() > 1: log(" Testing cross-GPU operations...") try: other_gpu = (gpu_id + 1) % torch.cuda.device_count() test_tensor = torch.randn(10, 10, device=f'cuda:{gpu_id}') tensor_copy = test_tensor.to(f'cuda:{other_gpu}') log(f" Cross-GPU copy successful: GPU {gpu_id} -> GPU {other_gpu}") except Exception as e: log(f" Cross-GPU copy failed: {e}") # Test memory allocation. log(" Testing large memory allocations...") try: large_tensor = torch.zeros(1000, 1000, device=f'cuda:{gpu_id}') log(" Large memory allocation successful") del large_tensor except Exception as e: log(f" Large memory allocation failed: {e}") except Exception as e: log(f"ERROR testing GPU {gpu_id}: {e}") import traceback log(f"Traceback:\n{traceback.format_exc()}") def test_nccl_initialization(): """Test NCCL initialization and basic operations.""" log("\n=== NCCL Initialization Test ===") try: import torch.distributed as dist # Set up single-process NCCL environment. os.environ['MASTER_ADDR'] = 'localhost' os.environ['MASTER_PORT'] = '29500' os.environ['RANK'] = '0' os.environ['WORLD_SIZE'] = '1' log("Attempting single-process NCCL initialization...") dist.init_process_group( backend='nccl', rank=0, world_size=1 ) log("Single-process NCCL initialization successful!") # Test basic NCCL operation. if torch.cuda.is_available(): device = torch.cuda.current_device() tensor = torch.ones(10, device=device) # This is a no-op with world_size=1 but exercises NCCL dist.all_reduce(tensor) log("NCCL all_reduce test successful!") dist.destroy_process_group() log("NCCL cleanup successful!") except Exception as e: log(f"NCCL initialization failed: {e}") import traceback log(f"Full traceback:\n{traceback.format_exc()}") def main(): """Main diagnostic routine.""" log("Starting NCCL Diagnostic Script") log("=" * 50) print_environment_info() if not check_cuda_availability(): sys.exit(1) test_individual_gpus() test_nccl_initialization() log("\n" + "=" * 50) log("NCCL diagnostic script completed") log("If you encountered errors, check the specific error messages above") log("and refer to the troubleshooting guide for solutions.") if __name__ == "__main__": main() --- (serve-perf-tuning)= # Performance Tuning This section should help you: - understand Ray Serve's performance characteristics - find ways to debug and tune your Serve application's performance :::{note} This section offers some tips and tricks to improve your Ray Serve application's performance. Check out the [architecture page](serve-architecture) for helpful context, including an overview of the HTTP proxy actor and deployment replica actors. ::: ```{contents} ``` ## Performance and benchmarks Ray Serve is built on top of Ray, so its scalability is bounded by Ray’s scalability. See Ray’s [scalability envelope](https://github.com/ray-project/ray/blob/master/release/benchmarks/README.md) to learn more about the maximum number of nodes and other limitations. ## Debugging performance issues in request path The performance issue you're most likely to encounter is high latency or low throughput for requests. Once you set up [monitoring](serve-monitoring) with Ray and Ray Serve, these issues may appear as: * `serve_num_router_requests_total` staying constant while your load increases * `serve_deployment_processing_latency_ms` spiking up as queries queue up in the background The following are ways to address these issues: 1. 
Make sure you are using the right hardware and resources:
   * Are you reserving GPUs for your deployment replicas using `ray_actor_options` (e.g., `ray_actor_options={"num_gpus": 1}`)?
   * Are you reserving one or more cores for your deployment replicas using `ray_actor_options` (e.g., `ray_actor_options={"num_cpus": 2}`)?
   * Are you setting [OMP_NUM_THREADS](serve-omp-num-threads) to increase the performance of your deep learning framework?
2. Try batching your requests. See [Dynamic Request Batching](serve-performance-batching-requests).
3. Consider using `async` methods in your callable. See [the section below](serve-performance-async-methods).
4. Set an end-to-end timeout for your HTTP requests. See [the section below](serve-performance-e2e-timeout).

(serve-performance-async-methods)=

### Using `async` methods

:::{note}
According to the [FastAPI documentation](https://fastapi.tiangolo.com/async/#very-technical-details), `def` endpoint functions are called in a separate threadpool, so you might observe many requests running at the same time inside one replica, which can cause OOM or resource starvation. In this case, try using `async def` to control the workload performance.
:::

Are you using `async def` in your callable? If you are using `asyncio` and hitting the same queuing issue mentioned above, you might want to increase `max_ongoing_requests`. By default, Serve sets this to a low value (5) to ensure clients receive proper backpressure. You can increase the value in the deployment decorator; for example, `@serve.deployment(max_ongoing_requests=1000)`.

(serve-performance-e2e-timeout)=

### Set an end-to-end request timeout

By default, Serve lets client HTTP requests run to completion no matter how long they take. However, slow requests can bottleneck replica processing and block other requests that are waiting. Set an end-to-end timeout so that slow requests can be terminated and retried.

You can set an end-to-end timeout for HTTP requests by setting the `request_timeout_s` parameter in the `http_options` field of the Serve config. HTTP Proxies wait for that many seconds before terminating an HTTP request. This config is global to your Ray cluster, and you can't update it at runtime. Use [client-side retries](serve-best-practices-http-requests) to retry requests that time out due to transient failures.

:::{note}
Serve returns a response with status code `408` when a request times out. Clients can retry when they receive this `408` response.
:::

### Set backoff time when choosing a replica

Ray Serve allows you to fine-tune the backoff behavior of the request router, which can help reduce latency when waiting for replicas to become ready. The router uses an exponential backoff strategy when retrying requests to replicas that are temporarily unavailable. You can tune this behavior for your workload by configuring the following environment variables:

- `RAY_SERVE_ROUTER_RETRY_INITIAL_BACKOFF_S`: The initial backoff time (in seconds) before retrying a request. Default is `0.025`.
- `RAY_SERVE_ROUTER_RETRY_BACKOFF_MULTIPLIER`: The multiplier applied to the backoff time after each retry. Default is `2`.
- `RAY_SERVE_ROUTER_RETRY_MAX_BACKOFF_S`: The maximum backoff time (in seconds) between retries. Default is `0.5`.

(serve-high-throughput)=

### Enable throughput-optimized serving

:::{note}
In Ray v2.54.0, the defaults for `RAY_SERVE_RUN_USER_CODE_IN_SEPARATE_THREAD` and `RAY_SERVE_RUN_ROUTER_IN_SEPARATE_LOOP` will change to `0` for improved performance.
:::

This section describes how to enable Ray Serve options that improve throughput and reduce latency. These configurations focus on the following:

- Reducing the overhead associated with frequent logging.
- Disabling behavior that allowed Serve applications to include blocking operations.

If your Ray Serve code includes thread-blocking operations, you must refactor it to benefit from these settings. The following examples contrast blocking and non-blocking code:
**Blocking operation (❌)**

```python
from ray import serve
from fastapi import FastAPI
import time

app = FastAPI()


@serve.deployment
@serve.ingress(app)
class BlockingDeployment:
    @app.get("/process")
    async def process(self):
        # ❌ Blocking operation
        time.sleep(2)
        return {"message": "Processed (blocking)"}


serve.run(BlockingDeployment.bind())
```

**Non-blocking operation (✅)**

```python
from ray import serve
from fastapi import FastAPI
import asyncio

app = FastAPI()


@serve.deployment
@serve.ingress(app)
class NonBlockingDeployment:
    @app.get("/process")
    async def process(self):
        # ✅ Non-blocking operation
        await asyncio.sleep(2)
        return {"message": "Processed (non-blocking)"}


serve.run(NonBlockingDeployment.bind())
```
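If a blocking call has no async equivalent, one common asyncio pattern (not specific to Ray Serve) is to offload it to a worker thread so the replica's event loop stays free. The following sketch assumes a hypothetical `slow_blocking_call` helper standing in for your own blocking code:

```python
import asyncio
import time

from fastapi import FastAPI
from ray import serve

app = FastAPI()


def slow_blocking_call() -> str:
    # Hypothetical stand-in for a library call that has no async API.
    time.sleep(2)
    return "done"


@serve.deployment
@serve.ingress(app)
class OffloadedDeployment:
    @app.get("/process")
    async def process(self):
        # Offload the blocking call to a worker thread so the replica's
        # event loop can keep serving other requests.
        result = await asyncio.to_thread(slow_blocking_call)
        return {"message": f"Processed ({result})"}


serve.run(OffloadedDeployment.bind())
```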
To configure all options to the recommended settings, set the environment variable `RAY_SERVE_THROUGHPUT_OPTIMIZED=1`. You can also configure each option individually. The following table details the recommended configurations and their impact:

| Configured value | Impact |
| --- | --- |
| `RAY_SERVE_RUN_USER_CODE_IN_SEPARATE_THREAD=0` | Your code runs in the same event loop as the replica's main event loop. You must avoid blocking operations in your request path. Set this configuration to `1` to run your code in a separate event loop, which protects the replica's ability to communicate with the Serve Controller if your code has blocking operations. |
| `RAY_SERVE_RUN_ROUTER_IN_SEPARATE_LOOP=0` | The request router runs in the same event loop as your code's event loop. You must avoid blocking operations in your request path. Set this configuration to `1` to run the router in a separate event loop, which protects Ray Serve's request routing ability when your code has blocking operations. |
| `RAY_SERVE_REQUEST_PATH_LOG_BUFFER_SIZE=1000` | Buffers request-path logs and writes them in batches of `1000`, flushing the buffer on each write. The system always flushes the buffer and writes logs when it detects a line with level ERROR. Set the buffer size to `1` to disable buffering and write logs immediately. |
| `RAY_SERVE_LOG_TO_STDERR=0` | Only write logs to files under the `logs/serve/` directory. Proxy, Controller, and Replica logs no longer appear in the console, worker files, or the Actor Logs section of the Ray Dashboard. Set this property to `1` to enable additional logging. |

You may want to enable throughput-optimized serving while customizing the options above. To do so, set `RAY_SERVE_THROUGHPUT_OPTIMIZED=1` and override the specific options. For example, to enable throughput-optimized serving and continue logging to stderr, set `RAY_SERVE_THROUGHPUT_OPTIMIZED=1` and override it with `RAY_SERVE_LOG_TO_STDERR=1`.

## Debugging performance issues in controller

The Serve Controller runs on the Ray head node and is responsible for a variety of tasks, including receiving autoscaling metrics from other Ray Serve components. If the Serve Controller becomes overloaded (symptoms might include high CPU usage and a large number of pending `ServeController.record_autoscaling_metrics_from_handle` tasks), you can tune the following environment variables:

- `RAY_SERVE_CONTROL_LOOP_INTERVAL_S`: The interval between cycles of the control loop (defaults to `0.1` seconds). Increasing this value gives the Controller more time to process requests and may help alleviate overload.
- `RAY_SERVE_CONTROLLER_MAX_CONCURRENCY`: The maximum number of concurrent requests the Controller can handle (defaults to `15000`). The Controller accepts one long poll request per handle, so its concurrency needs scale with the number of handles. Increase this value if you have a large number of deployment handles.

---

(serve-replica-ranks)=

# Replica ranks

:::{warning}
This API is experimental and may change between Ray minor versions.
:::

Replica ranks provide a unique identifier for **each replica within a deployment**. Each replica receives a **`ReplicaRank` object** containing rank information and **a world size (the total number of replicas)**. The rank object includes a global rank (an integer from 0 to N-1), a node rank, and a local rank on the node.
## Access replica ranks You can access the rank and world size from within a deployment through the replica context using [`serve.get_replica_context()`](../api/doc/ray.serve.get_replica_context.rst). The following example shows how to access replica rank information: ```{literalinclude} ../doc_code/replica_rank.py :start-after: __replica_rank_start__ :end-before: __replica_rank_end__ :language: python ``` ```{literalinclude} ../doc_code/replica_rank.py :start-after: __replica_rank_start_run_main__ :end-before: __replica_rank_end_run_main__ :language: python ``` The [`ReplicaContext`](../api/doc/ray.serve.context.ReplicaContext.rst) provides two key fields: - `rank`: A [`ReplicaRank`](../api/doc/ray.serve.schema.ReplicaRank.rst) object containing rank information for this replica. Access the integer rank value with `.rank`. - `world_size`: The target number of replicas for the deployment. The `ReplicaRank` object contains three fields: - `rank`: The global rank (an integer from 0 to N-1) representing this replica's unique identifier across all nodes. - `node_rank`: The rank of the node this replica runs on (an integer from 0 to M-1 where M is the number of nodes). - `local_rank`: The rank of this replica on its node (an integer from 0 to K-1 where K is the number of replicas on this node). :::{note} **Accessing rank values:** To use the rank in your code, access the `.rank` attribute to get the integer value: ```python context = serve.get_replica_context() my_rank = context.rank.rank # Get the integer rank value my_node_rank = context.rank.node_rank # Get the node rank my_local_rank = context.rank.local_rank # Get the local rank on this node ``` Most use cases only need the global `rank` value. The `node_rank` and `local_rank` are useful for advanced scenarios such as coordinating replicas on the same node. ::: ## Handle rank changes with reconfigure When a replica's rank changes (such as during downscaling), Ray Serve can automatically call the `reconfigure` method on your deployment class to notify it of the new rank. This allows you to update replica-specific state when ranks change. The following example shows how to implement `reconfigure` to handle rank changes: ```{literalinclude} ../doc_code/replica_rank.py :start-after: __reconfigure_rank_start__ :end-before: __reconfigure_rank_end__ :language: python ``` ```{literalinclude} ../doc_code/replica_rank.py :start-after: __reconfigure_rank_start_run_main__ :end-before: __reconfigure_rank_end_run_main__ :language: python ``` ### When reconfigure is called Ray Serve automatically calls your `reconfigure` method in the following situations: 1. **At replica startup:** When a replica starts, if your deployment has both a `reconfigure` method and a `user_config`, Ray Serve calls `reconfigure` after running `__init__`. This lets you initialize rank-aware state without duplicating code between `__init__` and `reconfigure`. 2. **When you update user_config:** When you redeploy with a new `user_config`, Ray Serve calls `reconfigure` on all running replicas. If your `reconfigure` method includes `rank` as a parameter, Ray Serve passes both the new `user_config` and the current rank as a `ReplicaRank` object. 3. **When a replica's rank changes:** During downscaling, ranks may be reassigned to maintain contiguity (0 to N-1). If your `reconfigure` method includes `rank` as a parameter and your deployment has a `user_config`, Ray Serve calls `reconfigure` with the existing `user_config` and the new rank as a `ReplicaRank` object. 
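The situations above assume your deployment is set up to receive rank updates. As a minimal, hypothetical sketch (the deployment name is made up, and the exact requirements are spelled out in the note that follows), such a deployment looks roughly like this:

```python
from ray import serve
from ray.serve.schema import ReplicaRank


@serve.deployment(num_replicas=2, user_config={})  # A user_config, even an empty dict, is required.
class RankAwareDeployment:
    def __init__(self):
        self.rank = None

    def reconfigure(self, user_config: dict, rank: ReplicaRank):
        # Called at startup, on user_config updates, and when this
        # replica's rank changes (for example, after downscaling).
        self.rank = rank.rank

    async def __call__(self) -> str:
        return f"Hello from replica rank {self.rank}"


app = RankAwareDeployment.bind()
```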
:::{note} **Requirements to receive rank updates:** To get rank changes through `reconfigure`, your deployment needs: - A class-based deployment (function deployments don't support `reconfigure`) - A `reconfigure` method with `rank` as a parameter: `def reconfigure(self, user_config, rank: ReplicaRank)` - A `user_config` in your deployment (even if it's just an empty dict: `user_config={}`) Without a `user_config`, Ray Serve won't call `reconfigure` for rank changes. ::: :::{tip} If you'd like different behavior for when `reconfigure` is called with rank changes, [open a GitHub issue](https://github.com/ray-project/ray/issues/new/choose) to discuss your use case with the Ray Serve team. ::: ## How replica ranks work :::{note} **Rank reassignment is eventually consistent** When replicas are removed during downscaling, rank reassignment to maintain contiguity (0 to N-1) doesn't happen immediately. The controller performs rank consistency checks and reassignment only when the deployment reaches a `HEALTHY` state in its update loop. This means there can be a brief period after downscaling where ranks are non-contiguous before the controller reassigns them. This design choice prevents rank reassignment from interfering with ongoing deployment updates and rollouts. If you need immediate rank reassignment or different behavior, [open a GitHub issue](https://github.com/ray-project/ray/issues/new/choose) to discuss your use case with the Ray Serve team. ::: :::{note} **Ranks don't influence scheduling or eviction decisions** Replica ranks are independent of scheduling and eviction decisions. The deployment scheduler doesn't consider ranks when placing replicas on nodes, so there's no guarantee that replicas with contiguous ranks (such as rank 0 and rank 1) will be on the same node. Similarly, during downscaling, the autoscaler's eviction decisions don't take replica ranks into account—any replica can be chosen for removal regardless of its rank. If you need rank-aware scheduling or eviction (for example, to colocate replicas with consecutive ranks), [open a GitHub issue](https://github.com/ray-project/ray/issues/new/choose) to discuss your requirements with the Ray Serve team. ::: Ray Serve manages replica ranks automatically throughout the deployment lifecycle. The system maintains these invariants: 1. Ranks are contiguous integers from 0 to N-1. 2. Each running replica has exactly one rank. 3. No two replicas share the same rank. ### Rank assignment lifecycle The following table shows how ranks and world size behave during different events: | Event | Local Rank | World Size | |-------|------------|------------| | Upscaling | No change for existing replicas | Increases to target count | | Downscaling | Can change to maintain contiguity | Decreases to target count | | Other replica dies(will be restarted) | No change | No change | | Self replica dies | No change | No change | :::{note} World size always reflects the target number of replicas configured for the deployment, not the current number of running replicas. During scaling operations, the world size updates immediately to the new target, even while replicas are still starting or stopping. 
::: ### Rank lifecycle state machine ``` ┌─────────────────────────────────────────────────────────────┐ │ DEPLOYMENT LIFECYCLE │ └─────────────────────────────────────────────────────────────┘ Initial Deployment / Upscaling: ┌──────────┐ assign ┌──────────┐ │ No Rank │ ───────────────> │ Rank: N-1│ └──────────┘ └──────────┘ (Contiguous: 0, 1, 2, ..., N-1) Replica Crash: ┌──────────┐ release ┌──────────┐ assign ┌──────────┐ │ Rank: K │ ───────────────> │ Released │ ────────────> │ Rank: K │ │ (Dead) │ │ │ │ (New) │ └──────────┘ └──────────┘ └──────────┘ (K can be any rank from 0 to N-1) :::{note} When a replica crashes, Ray Serve automatically starts a replacement replica and assigns it the **same rank** as the crashed replica. This ensures rank contiguity is maintained without reassigning other replicas. ::: Downscaling: ┌──────────┐ release ┌──────────┐ │ Rank: K │ ───────────────> │ Released │ │ (Stopped)│ │ │ └──────────┘ └──────────┘ │ └──> Remaining replicas may be reassigned to maintain contiguity: [0, 1, 2, ..., M-1] where M < N (K can be any rank from 0 to N-1) Controller Recovery: ┌──────────┐ recover ┌──────────┐ │ Running │ ───────────────> │ Rank: N │ │ Replicas │ │(Restored)│ └──────────┘ └──────────┘ (Controller queries replicas to reconstruct rank state) ``` ### Detailed lifecycle events 1. **Rank assignment on startup**: Ranks are assigned when replicas start, such as during initial deployment, cold starts, or upscaling. The controller assigns ranks and propagates them to replicas during initialization. New replicas receive the lowest available rank. 2. **Rank release on shutdown**: Ranks are released only after a replica fully stops, which occurs during graceful shutdown or downscaling. Ray Serve preserves existing rank assignments as much as possible to minimize disruption. 3. **Handling replica crashes**: If a replica crashes unexpectedly, the system releases its rank and assigns the **same rank** to the replacement replica. This means if replica with rank 3 crashes, the new replacement replica will also receive rank 3. The replacement receives its rank during initialization, and other replicas keep their existing ranks unchanged. 4. **Controller crash and recovery**: When the controller recovers from a crash, it reconstructs the rank state by querying all running replicas for their assigned ranks. Ranks aren't checkpointed; the system re-learns them directly from replicas during recovery. 5. **Maintaining rank contiguity**: After downscaling, the system may reassign ranks to remaining replicas to maintain contiguity (0 to N-1). Ray Serve minimizes reassignments by only changing ranks when necessary. --- (serve-replica-scheduling)= # Replica scheduling This guide explains how Ray Serve schedules deployment replicas across your cluster and the APIs and environment variables you can use to control placement behavior. 
## Quick reference: Choosing the right approach | Goal | Solution | Example | |------|----------|---------| | Multi-GPU inference with tensor parallelism | `placement_group_bundles` + `STRICT_PACK` | vLLM with `tensor_parallel_size=4` | | Target specific GPU types or zones | Custom resources in `ray_actor_options` | Schedule on A100 nodes only | | Limit replicas per node for high availability | `max_replicas_per_node` | Max 2 replicas of each deployment per node | | Reduce cloud costs by packing nodes | `RAY_SERVE_USE_PACK_SCHEDULING_STRATEGY=1` | Many small models sharing nodes | | Reserve resources for worker actors | `placement_group_bundles` | Replica spawns Ray Data workers | | Shard large embeddings across nodes | `placement_group_bundles` + `STRICT_SPREAD` | Recommendation model with distributed embedding table | | Simple deployment, no special needs | Default (just `ray_actor_options`) | Single-GPU model | ## How replica scheduling works When you deploy an application, Ray Serve's deployment scheduler determines where to place each replica actor across the available nodes in your Ray cluster. The scheduler runs on the Serve Controller and makes batch scheduling decisions during each update cycle. For information on configuring CPU, GPU, and other resource requirements for your replicas, see [Resource allocation](serve-resource-allocation). ```text ┌──────────────────────────────────┐ │ serve.run(app) │ └────────────────┬─────────────────┘ │ ▼ ┌─────────────────────────────────────────────────────────────────────────────────┐ │ Serve Controller │ │ ┌───────────────────────────────────────────────────────────────────────────┐ │ │ │ Deployment Scheduler │ │ │ │ │ │ │ │ 1. Check placement_group_bundles ──▶ PlacementGroupSchedulingStrategy │ │ │ │ 2. Check target node affinity ──▶ NodeAffinitySchedulingStrategy │ │ │ │ 3. Use default strategy ──▶ SPREAD (default) or PACK │ │ │ └───────────────────────────────────────────────────────────────────────────┘ │ └─────────────────────────────────────────────────────────────────────────────────┘ │ ┌─────────────────────────────────┴─────────────────────────────────┐ │ │ ▼ ▼ ┌─────────────────────────────────────┐ ┌─────────────────────────────────────┐ │ SPREAD Strategy (default) │ │ PACK Strategy │ │ │ │ │ │ Distributes replicas across nodes │ │ Packs replicas onto fewer nodes │ │ for fault tolerance │ │ to minimize resource waste │ │ │ │ │ │ ┌─────────┐ ┌─────────┐ ┌───────┐ │ │ ┌─────────┐ ┌─────────┐ ┌───────┐ │ │ │ Node 1 │ │ Node 2 │ │Node 3 │ │ │ │ Node 1 │ │ Node 2 │ │Node 3 │ │ │ │ ┌─────┐ │ │ ┌─────┐ │ │┌─────┐│ │ │ │ ┌─────┐ │ │ │ │ │ │ │ │ │ R1 │ │ │ │ R2 │ │ ││ R3 ││ │ │ │ │ R1 │ │ │ idle │ │ idle │ │ │ │ └─────┘ │ │ └─────┘ │ │└─────┘│ │ │ │ │ R2 │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ R3 │ │ │ │ │ │ │ │ └─────────┘ └─────────┘ └───────┘ │ │ └─────────┘ └─────────┘ └───────┘ │ │ │ │ ▲ ▲ │ │ ✓ High availability │ │ └───────────┘ │ │ ✓ Load balanced │ │ Can be released │ │ ✓ Reduced contention │ │ ✓ Fewer nodes = lower cloud costs │ └─────────────────────────────────────┘ └────────────────────────────────────┘ ``` By default, Ray Serve uses a **spread scheduling strategy** that distributes replicas across nodes with best effort. This approach: - Maximizes fault tolerance by avoiding concentration of replicas on a single node - Balances load across the cluster - Helps prevent resource contention between replicas ### Scheduling priority When scheduling a replica, the scheduler evaluates strategies in the following priority order: 1. 
**Placement groups**: If you specify `placement_group_bundles`, the scheduler uses a `PlacementGroupSchedulingStrategy` to co-locate the replica with its required resources. 2. **Pack scheduling with node affinity**: If pack scheduling is enabled, the scheduler identifies the best available node by preferring non-idle nodes (nodes already running replicas) and using a best-fit algorithm to minimize resource fragmentation. It then uses a `NodeAffinitySchedulingStrategy` with soft constraints to schedule the replica on that node. 3. **Default strategy**: Falls back to `SPREAD` when pack scheduling isn't enabled. ### Downscaling behavior When Ray Serve scales down a deployment, it intelligently selects which replicas to stop: 1. **Non-running replicas first**: Pending, launching, or recovering replicas are stopped before running replicas. 2. **Minimize node count**: Running replicas are stopped from nodes with the fewest total replicas across all deployments, helping to free up nodes faster. Among replicas on the same node, newer replicas are stopped before older ones. 3. **Head node protection**: Replicas on the head node have the lowest priority for removal since the head node can't be released. Among replicas on the head node, newer replicas are stopped before older ones. :::{note} Running replicas on the head node isn't recommended for production deployments. The head node runs critical cluster processes such as the GCS and Serve controller, and replica workloads can compete for resources. ::: ## APIs for controlling replica placement Ray Serve provides several options to control where replicas are scheduled. These parameters are configured through the [`@serve.deployment`](serve-configure-deployment) decorator. For the full API reference, see the [deployment decorator documentation](../api/doc/ray.serve.deployment_decorator.rst). ### Limit replicas per node with `max_replicas_per_node` Use [`max_replicas_per_node`](../api/doc/ray.serve.deployment_decorator.rst) to cap the number of replicas of a deployment that can run on a single node. This is useful when: - You want to ensure high availability by spreading replicas across nodes - You want to avoid resource contention between replicas of the same deployment ```{literalinclude} ../doc_code/replica_scheduling.py :start-after: __max_replicas_per_node_start__ :end-before: __max_replicas_per_node_end__ :language: python ``` In this example, if you have 6 replicas and `max_replicas_per_node=2`, Ray Serve requires at least 3 nodes to schedule all replicas. :::{note} Valid values for `max_replicas_per_node` are `None` (default, no limit) or an integer. You can't set `max_replicas_per_node` together with `placement_group_bundles`. ::: You can also specify this in a config file: ```yaml applications: - name: my_app import_path: my_module:app deployments: - name: MyDeployment num_replicas: 6 max_replicas_per_node: 2 ``` ### Reserve resources with placement groups For more details on placement group strategies, see the [Ray Core placement groups documentation](ray-placement-group-doc-ref). A **placement group** is a Ray primitive that reserves a group of resources (called **bundles**) across one or more nodes in your cluster. When you configure [`placement_group_bundles`](../api/doc/ray.serve.deployment_decorator.rst) for a Ray Serve deployment, Ray creates a dedicated placement group for *each replica*, ensuring those resources are reserved and available for that replica's use. 
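As a rough, hypothetical sketch of what this looks like on a deployment (the `ShardedModel` name and resource amounts are made up; a complete configuration example appears later in this guide):

```python
from ray import serve


@serve.deployment(
    ray_actor_options={"num_gpus": 1},  # Must fit within the first bundle.
    placement_group_bundles=[{"GPU": 1}, {"GPU": 1}, {"CPU": 4}],
    placement_group_strategy="STRICT_PACK",  # Keep all bundles on one node.
)
class ShardedModel:
    def __init__(self):
        # Worker actors or tasks spawned here are scheduled into the
        # remaining bundles of this replica's placement group.
        ...


app = ShardedModel.bind()
```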
A **bundle** is a dictionary specifying resource requirements, such as `{"CPU": 2, "GPU": 1}`. When you define multiple bundles, you're telling Ray to reserve multiple sets of resources that can be placed according to your chosen strategy. #### What placement groups and bundles mean The following diagram illustrates how a deployment with `placement_group_bundles=[{"GPU": 1}, {"GPU": 1}, {"CPU": 4}]` and [`placement_group_strategy`](../api/doc/ray.serve.deployment_decorator.rst)` set to "STRICT_PACK"` is scheduled: ```text ┌─────────────────────────────────────────────────────────────────────────────┐ │ Node (8 CPUs, 4 GPUs) │ │ ┌───────────────────────────────────────────────────────────────────────┐ │ │ │ Placement Group (per replica) │ │ │ │ │ │ │ │ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────────┐ │ │ │ │ │ Bundle 0 │ │ Bundle 1 │ │ Bundle 2 │ │ │ │ │ │ {"GPU": 1} │ │ {"GPU": 1} │ │ {"CPU": 4} │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ ┌─────────────┐ │ │ ┌─────────────┐ │ │ ┌─────────────────┐ │ │ │ │ │ │ │ Replica │ │ │ │ Worker │ │ │ │ Worker Tasks │ │ │ │ │ │ │ │ Actor │ │ │ │ Actor │ │ │ │ (preprocessing)│ │ │ │ │ │ │ │ (main GPU) │ │ │ │ (2nd GPU) │ │ │ │ │ │ │ │ │ │ │ └─────────────┘ │ │ └─────────────┘ │ │ └─────────────────┘ │ │ │ │ │ └─────────────────┘ └─────────────────┘ └─────────────────────┘ │ │ │ │ ▲ │ │ │ │ │ │ │ │ │ Replica runs in │ │ │ │ first bundle │ │ │ └───────────────────────────────────────────────────────────────────────┘ │ └─────────────────────────────────────────────────────────────────────────────┘ With STRICT_PACK: All bundles guaranteed on same node ``` Consider a deployment with `placement_group_bundles=[{"GPU": 1}, {"GPU": 1}, {"CPU": 4}]`: - Ray reserves 3 bundles of resources for each replica - The replica actor runs in the **first bundle** (so `ray_actor_options` must fit within it) - The remaining bundles are available for worker actors/tasks spawned by the replica - All child actors and tasks are automatically scheduled within the placement group This is different from simply requesting resources in `ray_actor_options`. With `ray_actor_options={"num_gpus": 2}`, your replica actor gets 2 GPUs but you have no control over where additional worker processes run. With placement groups, you explicitly reserve resources for both the replica and its workers. #### When to use placement groups | Scenario | Why placement groups help | |----------|---------------------------| | **Model parallelism** | Tensor parallelism or pipeline parallelism requires multiple GPUs that must communicate efficiently. Use `STRICT_PACK` to guarantee all GPUs are on the same node. For example, vLLM with `tensor_parallel_size=4` and the Ray distributed executor backend spawns 4 Ray worker actors (one per GPU shard), all of which must be on the same node for efficient inter-GPU communication via NVLink/NVSwitch. | | **Replica spawns workers** | Your deployment creates Ray actors or tasks for parallel processing. Placement groups reserve resources for these workers. For example, a video processing service that spawns Ray tasks to decode frames in parallel, or a batch inference service using Ray Data to preprocess inputs before model inference. | | **Cross-node distribution** | You need bundles spread across different nodes. Use `SPREAD` or `STRICT_SPREAD`. For example, serving a model with a massive embedding table (such as a recommendation model with billions of item embeddings) that must be sharded across multiple nodes because it exceeds single-node memory. 
Each bundle holds one shard, and `STRICT_SPREAD` ensures each shard is on a separate node. |

Don't use placement groups when:

- Your replica is self-contained and doesn't spawn additional actors/tasks
- You only need simple resource requirements (use `ray_actor_options` instead)
- You want to use `max_replicas_per_node`. Combining `max_replicas_per_node` with placement groups isn't supported today.

:::{note}
**How `max_replicas_per_node` works:** Ray Serve creates a synthetic custom resource for each deployment. Every node implicitly has 1.0 of this resource, and each replica requests `1.0 / max_replicas_per_node` of it. For example, with `max_replicas_per_node=3`, each replica requests ~0.33 of the resource, so only 3 replicas can fit on a node before the resource is exhausted. This mechanism relies on Ray's standard resource scheduling, which conflicts with placement group scheduling.
:::

#### Configuring placement groups

The following example reserves 2 GPUs for each replica using a strict pack strategy:

```{literalinclude} ../doc_code/replica_scheduling.py
:start-after: __placement_group_start__
:end-before: __placement_group_end__
:language: python
```

The replica actor is scheduled in the first bundle, so the resources specified in `ray_actor_options` must be a subset of the first bundle's resources. All actors and tasks created by the replica are scheduled in the placement group by default (`placement_group_capture_child_tasks=True`).

### Target nodes with custom resources

You can use custom resources in [`ray_actor_options`](../api/doc/ray.serve.deployment_decorator.rst) to target replicas to specific nodes. This is the recommended approach for controlling which nodes run your replicas.

First, start your Ray nodes with custom resources that identify their capabilities:

```{literalinclude} ../doc_code/replica_scheduling.py
:start-after: __custom_resources_main_start__
:end-before: __custom_resources_main_end__
:language: python
```

Then configure your deployment to require the specific resource:

```{literalinclude} ../doc_code/replica_scheduling.py
:start-after: __custom_resources_start__
:end-before: __custom_resources_end__
:language: python
```

Custom resources offer several advantages for Ray Serve deployments:

- **Quantifiable**: You can request specific amounts (such as `{"A100": 2}` for 2 GPUs or `{"A100": 0.5}` to share a GPU between 2 replicas), while labels are binary (present or absent).
- **Autoscaler-aware**: The Ray autoscaler understands custom resources and can provision nodes with the required resources automatically.
- **Scheduling guarantees**: Replicas won't be scheduled until nodes with the required custom resources are available, preventing placement on incompatible nodes.

:::{tip}
Use descriptive resource names that reflect the node's capabilities, such as GPU types, availability zones, or hardware generations.
:::

## Environment variables

These environment variables modify Ray Serve's scheduling behavior. Set them before starting Ray.

### `RAY_SERVE_USE_PACK_SCHEDULING_STRATEGY`

**Default**: `0` (disabled)

When enabled, switches from spread scheduling to **pack scheduling**.
Pack scheduling: - Packs replicas onto fewer nodes to minimize resource fragmentation - Sorts pending replicas by resource requirements (largest first) - Prefers scheduling on nodes that already have replicas (non-idle nodes) - Uses best-fit bin packing to find the optimal node for each replica ```bash export RAY_SERVE_USE_PACK_SCHEDULING_STRATEGY=1 ray start --head ``` **When to use pack scheduling:** When you run many small deployments (such as 10 models each needing 0.5 CPUs), spread scheduling scatters them across nodes, wasting capacity. Pack scheduling fills nodes efficiently before using new ones. Cloud providers bill per node-hour. Packing replicas onto fewer nodes allows idle nodes to be released by the autoscaler, directly reducing your bill. **When to avoid pack scheduling:** High availability is critical and you want replicas spread across nodes :::{note} Pack scheduling automatically falls back to spread scheduling when any deployment uses placement groups with `PACK`, `SPREAD`, or `STRICT_SPREAD` strategies. This happens because pack scheduling needs to predict where resources will be consumed to bin-pack effectively. With `STRICT_PACK`, all bundles are guaranteed to land on one node, making resource consumption predictable. With other strategies, bundles may spread across multiple nodes unpredictably, so the scheduler can't accurately track available resources per node. ::: ### `RAY_SERVE_HIGH_PRIORITY_CUSTOM_RESOURCES` **Default**: empty A comma-separated list of custom resource names that should be prioritized when sorting replicas for pack scheduling. Resources listed earlier have higher priority. ```bash export RAY_SERVE_HIGH_PRIORITY_CUSTOM_RESOURCES="TPU,custom_accelerator" ray start --head ``` When pack scheduling sorts replicas by resource requirements, the priority order is: 1. Custom resources in `RAY_SERVE_HIGH_PRIORITY_CUSTOM_RESOURCES` (in order) 2. GPU 3. CPU 4. Memory 5. Other custom resources This ensures that replicas requiring high-priority resources are scheduled first, reducing the chance of resource fragmentation. ## See also - [Resource allocation](serve-resource-allocation) for configuring CPU, GPU, and other resources - [Autoscaling](serve-autoscaling) for automatically adjusting replica count - [Ray placement groups](ray-placement-group-doc-ref) for advanced resource co-location --- (serve-api)= # Ray Serve API ## Python API (core-apis)= ```{eval-rst} .. currentmodule:: ray ``` ### Writing Applications ```{eval-rst} .. autosummary:: :nosignatures: :toctree: doc/ :template: autosummary/class_without_init_args.rst serve.Deployment serve.Application ``` #### Deployment Decorators ```{eval-rst} .. autosummary:: :nosignatures: :toctree: doc/ serve.deployment :noindex: serve.ingress serve.batch serve.multiplexed ``` #### Deployment Handles :::{note} The deprecated `RayServeHandle` and `RayServeSyncHandle` APIs have been fully removed as of Ray 2.10. See the [model composition guide](serve-model-composition) for how to update code to use the {mod}`DeploymentHandle ` API instead. ::: ```{eval-rst} .. autosummary:: :nosignatures: :toctree: doc/ :template: autosummary/class_without_init_args.rst serve.handle.DeploymentHandle serve.handle.DeploymentResponse serve.handle.DeploymentResponseGenerator ``` ### Running Applications ```{eval-rst} .. autosummary:: :nosignatures: :toctree: doc/ serve.start serve.run serve.delete serve.status serve.shutdown serve.shutdown_async ``` ### Configurations ```{eval-rst} .. 
autosummary:: :nosignatures: :toctree: doc/ serve.config.ProxyLocation serve.config.gRPCOptions serve.config.HTTPOptions serve.config.AutoscalingConfig serve.config.AutoscalingPolicy serve.config.AutoscalingContext serve.config.AggregationFunction serve.config.RequestRouterConfig ``` ### Schemas ```{eval-rst} .. autosummary:: :nosignatures: :toctree: doc/ :template: autosummary/class_without_init_args.rst serve.schema.ServeActorDetails serve.schema.ProxyDetails serve.schema.ApplicationStatusOverview serve.schema.ServeStatus serve.schema.DeploymentStatusOverview serve.schema.EncodingType serve.schema.AutoscalingMetricsHealth serve.schema.AutoscalingStatus serve.schema.ScalingDecision serve.schema.DeploymentAutoscalingDetail serve.schema.ReplicaRank ``` ### Request Router ```{eval-rst} .. autosummary:: :nosignatures: :toctree: doc/ serve.request_router.ReplicaID serve.request_router.PendingRequest serve.request_router.RunningReplica serve.request_router.FIFOMixin serve.request_router.LocalityMixin serve.request_router.MultiplexMixin serve.request_router.RequestRouter ``` #### Advanced APIs ```{eval-rst} .. autosummary:: :nosignatures: :toctree: doc/ serve.get_replica_context serve.context.ReplicaContext serve.get_multiplexed_model_id serve.get_app_handle serve.get_deployment_handle serve.grpc_util.RayServegRPCContext serve.exceptions.BackPressureError serve.exceptions.RayServeException serve.exceptions.RequestCancelledError serve.exceptions.DeploymentUnavailableError ``` (serve-cli)= ## Command Line Interface (CLI) ```{eval-rst} .. click:: ray.serve.scripts:cli :prog: serve :nested: full ``` (serve-rest-api)= ## Serve REST API The Serve REST API is exposed at the same port as the Ray Dashboard. The Dashboard port is `8265` by default. This port can be changed using the `--dashboard-port` argument when running `ray start`. All example requests in this section use the default port. ### `PUT "/api/serve/applications/"` Declaratively deploys a list of Serve applications. If Serve is already running on the Ray cluster, removes all applications not listed in the new config. If Serve is not running on the Ray cluster, starts Serve. See [multi-app config schema](serve-rest-api-config-schema) for the request's JSON schema. **Example Request**: ```http PUT /api/serve/applications/ HTTP/1.1 Host: http://localhost:8265/ Accept: application/json Content-Type: application/json { "applications": [ { "name": "text_app", "route_prefix": "/", "import_path": "text_ml:app", "runtime_env": { "working_dir": "https://github.com/ray-project/serve_config_examples/archive/HEAD.zip" }, "deployments": [ {"name": "Translator", "user_config": {"language": "french"}}, {"name": "Summarizer"}, ] }, ] } ``` **Example Response** ```http HTTP/1.1 200 OK Content-Type: application/json ``` ### `GET "/api/serve/applications/"` Gets cluster-level info and comprehensive details on all Serve applications deployed on the Ray cluster. See [metadata schema](serve-rest-api-response-schema) for the response's JSON schema. 
```http GET /api/serve/applications/ HTTP/1.1 Host: http://localhost:8265/ Accept: application/json ``` **Example Response (abridged JSON)**: ```http HTTP/1.1 200 OK Content-Type: application/json { "controller_info": { "node_id": "cef533a072b0f03bf92a6b98cb4eb9153b7b7c7b7f15954feb2f38ec", "node_ip": "10.0.29.214", "actor_id": "1d214b7bdf07446ea0ed9d7001000000", "actor_name": "SERVE_CONTROLLER_ACTOR", "worker_id": "adf416ae436a806ca302d4712e0df163245aba7ab835b0e0f4d85819", "log_file_path": "/serve/controller_29778.log" }, "proxy_location": "EveryNode", "http_options": { "host": "0.0.0.0", "port": 8000, "root_path": "", "request_timeout_s": null, "keep_alive_timeout_s": 5 }, "grpc_options": { "port": 9000, "grpc_servicer_functions": [], "request_timeout_s": null }, "proxies": { "cef533a072b0f03bf92a6b98cb4eb9153b7b7c7b7f15954feb2f38ec": { "node_id": "cef533a072b0f03bf92a6b98cb4eb9153b7b7c7b7f15954feb2f38ec", "node_ip": "10.0.29.214", "actor_id": "b7a16b8342e1ced620ae638901000000", "actor_name": "SERVE_CONTROLLER_ACTOR:SERVE_PROXY_ACTOR-cef533a072b0f03bf92a6b98cb4eb9153b7b7c7b7f15954feb2f38ec", "worker_id": "206b7fe05b65fac7fdceec3c9af1da5bee82b0e1dbb97f8bf732d530", "log_file_path": "/serve/http_proxy_10.0.29.214.log", "status": "HEALTHY" } }, "deploy_mode": "MULTI_APP", "applications": { "app1": { "name": "app1", "route_prefix": "/", "docs_path": null, "status": "RUNNING", "message": "", "last_deployed_time_s": 1694042836.1912267, "deployed_app_config": { "name": "app1", "route_prefix": "/", "import_path": "src.text-test:app", "deployments": [ { "name": "Translator", "num_replicas": 1, "user_config": { "language": "german" } } ] }, "deployments": { "Translator": { "name": "Translator", "status": "HEALTHY", "message": "", "deployment_config": { "name": "Translator", "num_replicas": 1, "max_ongoing_requests": 100, "user_config": { "language": "german" }, "graceful_shutdown_wait_loop_s": 2.0, "graceful_shutdown_timeout_s": 20.0, "health_check_period_s": 10.0, "health_check_timeout_s": 30.0, "ray_actor_options": { "runtime_env": { "env_vars": {} }, "num_cpus": 1.0 }, "is_driver_deployment": false }, "replicas": [ { "node_id": "cef533a072b0f03bf92a6b98cb4eb9153b7b7c7b7f15954feb2f38ec", "node_ip": "10.0.29.214", "actor_id": "4bb8479ad0c9e9087fee651901000000", "actor_name": "SERVE_REPLICA::app1#Translator#oMhRlb", "worker_id": "1624afa1822b62108ead72443ce72ef3c0f280f3075b89dd5c5d5e5f", "log_file_path": "/serve/deployment_Translator_app1#Translator#oMhRlb.log", "replica_id": "app1#Translator#oMhRlb", "state": "RUNNING", "pid": 29892, "start_time_s": 1694042840.577496 } ] }, "Summarizer": { "name": "Summarizer", "status": "HEALTHY", "message": "", "deployment_config": { "name": "Summarizer", "num_replicas": 1, "max_ongoing_requests": 100, "user_config": null, "graceful_shutdown_wait_loop_s": 2.0, "graceful_shutdown_timeout_s": 20.0, "health_check_period_s": 10.0, "health_check_timeout_s": 30.0, "ray_actor_options": { "runtime_env": {}, "num_cpus": 1.0 }, "is_driver_deployment": false }, "replicas": [ { "node_id": "cef533a072b0f03bf92a6b98cb4eb9153b7b7c7b7f15954feb2f38ec", "node_ip": "10.0.29.214", "actor_id": "7118ae807cffc1c99ad5ad2701000000", "actor_name": "SERVE_REPLICA::app1#Summarizer#cwiPXg", "worker_id": "12de2ac83c18ce4a61a443a1f3308294caf5a586f9aa320b29deed92", "log_file_path": "/serve/deployment_Summarizer_app1#Summarizer#cwiPXg.log", "replica_id": "app1#Summarizer#cwiPXg", "state": "RUNNING", "pid": 29893, "start_time_s": 1694042840.5789504 } ] } } } } } ``` ### `DELETE 
"/api/serve/applications/"` Shuts down Serve and all applications running on the Ray cluster. Has no effect if Serve is not running on the Ray cluster. **Example Request**: ```http DELETE /api/serve/applications/ HTTP/1.1 Host: http://localhost:8265/ Accept: application/json ``` **Example Response** ```http HTTP/1.1 200 OK Content-Type: application/json ``` (serve-rest-api-config-schema)= ## Config Schemas ```{eval-rst} .. currentmodule:: ray.serve ``` ```{eval-rst} .. autosummary:: :nosignatures: :toctree: doc/ schema.ServeDeploySchema schema.gRPCOptionsSchema schema.HTTPOptionsSchema schema.ServeApplicationSchema schema.DeploymentSchema schema.RayActorOptionsSchema schema.CeleryAdapterConfig schema.TaskProcessorConfig schema.TaskResult schema.ScaleDeploymentRequest schema.TaskProcessorAdapter ``` (serve-rest-api-response-schema)= ## Response Schemas ```{eval-rst} .. autosummary:: :nosignatures: :toctree: doc/ schema.ServeInstanceDetails schema.APIType schema.ApplicationStatus schema.ApplicationDetails schema.DeploymentDetails schema.ReplicaDetails schema.ProxyStatus schema.TargetGroup schema.Target schema.DeploymentNode schema.DeploymentTopology ``` ## Observability ```{eval-rst} .. autosummary:: :nosignatures: :toctree: doc/ metrics.Counter metrics.Histogram metrics.Gauge schema.LoggingConfig ``` (serve-llm-api)= ## LLM API ```{eval-rst} .. currentmodule:: ray ``` ### Builders ```{eval-rst} .. autosummary:: :nosignatures: :toctree: doc/ serve.llm.build_llm_deployment serve.llm.build_openai_app ``` ### Configs ```{eval-rst} .. autosummary:: :nosignatures: :toctree: doc/ :template: autosummary/autopydantic.rst serve.llm.LLMConfig serve.llm.LLMServingArgs serve.llm.ModelLoadingConfig serve.llm.CloudMirrorConfig serve.llm.LoraConfig ``` ### Deployments ```{eval-rst} .. autosummary:: :nosignatures: :toctree: doc/ serve.llm.LLMServer serve.llm.LLMRouter ``` --- (serve-architecture)= # Architecture In this section, we explore Serve's key architectural concepts and components. It will offer insight and overview into: - the role of each component in Serve and how they work - the different types of actors that make up a Serve application % Figure source: https://docs.google.com/drawings/d/1jSuBN5dkSj2s9-0eGzlU_ldsRa3TsswQUZM-cMQ29a0/edit?usp=sharing ```{image} architecture-2.0.svg :align: center :width: 600px ``` (serve-architecture-high-level-view)= ## High-Level View Serve runs on Ray and utilizes [Ray actors](actor-guide). There are three kinds of actors that are created to make up a Serve instance: - **Controller**: A global actor unique to each Serve instance that manages the control plane. The Controller is responsible for creating, updating, and destroying other actors. Serve API calls like creating or getting a deployment make remote calls to the Controller. - **HTTP Proxy**: By default there is one HTTP proxy actor on the head node. This actor runs a [Uvicorn](https://www.uvicorn.org/) HTTP server that accepts incoming requests, forwards them to replicas, and responds once they are completed. For scalability and high availability, you can also run a proxy on each node in the cluster via the `proxy_location` field inside [`serve.start()`](core-apis) or [the config file](serve-in-production-config-file). - **gRPC Proxy**: If Serve is started with valid `port` and `grpc_servicer_functions`, then the gRPC proxy is started alongside the HTTP proxy. This Actor runs a [grpcio](https://grpc.github.io/grpc/python/) server. 
The gRPC server accepts incoming requests, forwards them to replicas, and responds once they are completed. - **Replicas**: Actors that actually execute the code in response to a request. For example, they may contain an instantiation of an ML model. Each replica processes individual requests from the proxy. The replica may batch the requests using `@serve.batch`. See the [batching](serve-performance-batching-requests) docs. ## Lifetime of a request When an HTTP or gRPC request is sent to the corresponding HTTP or gRPC proxy, the following happens: 1. The request is received and parsed. 2. Ray Serve looks up the correct deployment associated with the HTTP URL path or application name metadata. Serve places the request in a queue. 3. For each request in a deployment's queue, an available replica is looked up and the request is sent to it. If no replicas are available (that is, more than `max_ongoing_requests` requests are outstanding at each replica), the request is left in the queue until a replica becomes available. Each replica maintains a queue of requests and executes requests one at a time, possibly using `asyncio` to process them concurrently. If the handler (the deployment function or the `__call__` method of the deployment class) is declared with `async def`, the replica will not wait for the handler to run. Otherwise, the replica blocks until the handler returns. When making a request via a [DeploymentHandle](serve-key-concepts-deployment-handle) instead of HTTP or gRPC for [model composition](serve-model-composition), the request is placed on a queue in the `DeploymentHandle`, and we skip to step 3 above. (serve-ft-detail)= ## Fault tolerance Application errors like exceptions in your model evaluation code are caught and wrapped. A 500 status code will be returned with the traceback information. The replica will be able to continue to handle requests. Machine errors and faults are handled by Ray Serve as follows: - When replica Actors fail, the Controller Actor replaces them with new ones. - When the proxy Actor fails, the Controller Actor restarts it. - When the Controller Actor fails, Ray restarts it. - When using the [KubeRay RayService](kuberay-rayservice-quickstart), KubeRay recovers crashed nodes or a crashed cluster. You can avoid cluster crashes by using the [GCS FT feature](kuberay-gcs-ft). - If you aren't using KubeRay, when the Ray cluster fails, Ray Serve cannot recover. When a machine hosting any of the actors crashes, those actors are automatically restarted on another available machine. All data in the Controller (routing policies, deployment configurations, etc) is checkpointed to the Ray Global Control Store (GCS) on the head node. Transient data in the router and the replica (like network connections and internal request queues) will be lost for this kind of failure. See [the end-to-end fault tolerance guide](serve-e2e-ft) for more details on how actor crashes are detected. (serve-autoscaling-architecture)= ## Ray Serve Autoscaling Ray Serve's autoscaling feature automatically increases or decreases a deployment's number of replicas based on its load. ![pic](https://raw.githubusercontent.com/ray-project/images/master/docs/serve/autoscaling.svg) - The Serve Autoscaler runs in the Serve Controller actor. - Each `DeploymentHandle` and each replica periodically pushes its metrics to the autoscaler. 
- For each deployment, the autoscaler periodically checks `DeploymentHandle` queues and in-flight queries on replicas to decide whether or not to scale the number of replicas. - Each `DeploymentHandle` continuously polls the controller to check for new deployment replicas. Whenever new replicas are discovered, it sends any buffered or new queries to the replica until `max_ongoing_requests` is reached. Queries are sent to replicas in round-robin fashion, subject to the constraint that no replica is handling more than `max_ongoing_requests` requests at a time. :::{note} When the controller dies, requests can still be sent via HTTP, gRPC and `DeploymentHandle`, but autoscaling is paused. When the controller recovers, the autoscaling resumes, but all previous metrics collected are lost. ::: ## Ray Serve API Server Ray Serve provides a [CLI](serve-cli) for managing your Ray Serve instance, as well as a [REST API](serve-rest-api). Each node in your Ray cluster provides a Serve REST API server that can connect to Serve and respond to Serve REST requests. ## FAQ ### How does Serve ensure horizontal scalability and availability? You can configure Serve to start one proxy Actor per node with the `proxy_location` field inside [`serve.start()`](core-apis) or [the config file](serve-in-production-config-file). Each proxy binds to the same port. You should be able to reach Serve and send requests to any models with any of the servers. You can use your own load balancer on top of Ray Serve. This architecture ensures horizontal scalability for Serve. You can scale your HTTP and gRPC ingress by adding more nodes. You can also scale your model inference by increasing the number of replicas via the `num_replicas` option of your deployment. ### How do DeploymentHandles work? {mod}`DeploymentHandles ` wrap a handle to a "router" on the same node which routes requests to replicas for a deployment. When a request is sent from one replica to another via the handle, the requests go through the same data path as incoming HTTP or gRPC requests. This enables the same deployment selection and batching procedures to happen. DeploymentHandles are often used to implement [model composition](serve-model-composition). ### What happens to large requests? Serve utilizes Ray’s [shared memory object store](plasma-store) and in process memory store. Small request objects are directly sent between actors via network call. Larger request objects (100KiB+) are written to the object store and the replica can read them via zero-copy read. --- (serve-asynchronous-inference)= :::{warning} This API is in alpha and may change before becoming stable. ::: # Asynchronous Inference This guide shows how to run long-running inference asynchronously in Ray Serve using background task processing. With asynchronous tasks, your HTTP APIs stay responsive while the system performs work in the background. ## Why asynchronous inference? Ray Serve customers need a way to handle long-running API requests asynchronously. Some inference workloads (such as video processing or large document indexing) take longer than typical HTTP timeouts, so when a user submits one of these requests the system should enqueue the work in a background queue for later processing and immediately return a quick response. This decouples request lifetime from compute time while the task executes asynchronously, while still leveraging Serve's scalability. 
## Use cases Common use cases include video inference (such as transcoding, detection, and transcription over long videos) and document indexing pipelines that ingest, parse, and vectorize large files or batches. More broadly, any long-running AI/ML workload where immediate results aren't required benefits from running asynchronously. ## Key concepts - **@task_consumer**: A Serve deployment that consumes and executes tasks from a queue. Requires a `TaskProcessorConfig` parameter to configure the task processor; by default it uses the Celery task processor, but you can provide your own implementation. - **@task_handler**: A decorator applied to a method inside a `@task_consumer` class. Each handler declares the task it handles via `name=...`; if `name` is omitted, the method's function name is used as the task name. All tasks with that name in the consumer's configured queue (set via the `TaskProcessorConfig` above) are routed to this method for execution. ## Components and APIs The following sections describe the core APIs for asynchronous inference, with minimal examples to get you started. ### `TaskProcessorConfig` Configures the task processor, including queue name, adapter (default is Celery), adapter config, retry limits, and dead-letter queues. The following example shows how to configure the task processor: ```python from ray.serve.schema import TaskProcessorConfig, CeleryAdapterConfig processor_config = TaskProcessorConfig( queue_name="my_queue", # Optional: Override default adapter string (default is Celery) # adapter="ray.serve.task_processor.CeleryTaskProcessorAdapter", adapter_config=CeleryAdapterConfig( broker_url="redis://localhost:6379/0", # Or "filesystem://" for local testing backend_url="redis://localhost:6379/1", # Result backend (optional for fire-and-forget) ), max_retries=5, failed_task_queue_name="failed_tasks", # Application errors after retries ) ``` :::{note} The filesystem broker is intended for local testing only and has limited functionality. For example, it doesn't support `cancel_tasks`. For production deployments, use a production-ready broker such as Redis or RabbitMQ. See the [Celery broker documentation](https://docs.celeryq.dev/en/stable/getting-started/backends-and-brokers/) for the full list of supported brokers. ::: ### `@task_consumer` Decorator that turns a Serve deployment into a task consumer using the provided `TaskProcessorConfig`. The following code creates a task consumer: ```python from ray import serve from ray.serve.task_consumer import task_consumer @serve.deployment @task_consumer(task_processor_config=processor_config) class SimpleConsumer: pass ``` ### `@task_handler` Decorator that registers a method on the consumer as a named task handler. The following example shows how to define a task handler: ```python from ray.serve.task_consumer import task_handler, task_consumer @serve.deployment @task_consumer(task_processor_config=processor_config) class SimpleConsumer: @task_handler(name="process_request") def process_request(self, data): return f"processed: {data}" ``` :::{note} Ray Serve currently supports only synchronous handlers. Declaring an `async def` handler raises `NotImplementedError`. ::: ### `instantiate_adapter_from_config` Factory function that returns a task processor adapter instance for the given `TaskProcessorConfig`. You can use the returned object to enqueue tasks, fetch status, retrieve metrics, and more. 
The following example demonstrates creating an adapter and enqueuing tasks: ```python from ray.serve.task_consumer import instantiate_adapter_from_config adapter = instantiate_adapter_from_config(task_processor_config=processor_config) # Enqueue synchronously (returns TaskResult) result = adapter.enqueue_task_sync(task_name="process_request", args=["hello"]) # Later, fetch status synchronously status = adapter.get_task_status_sync(result.id) ``` :::{note} All Ray actor options specified in the `@serve.deployment` decorator (such as `num_gpus`, `num_cpus`, `resources`, etc.) are applied to the task consumer replicas. This allows you to allocate specific hardware resources for your task processing workloads. ::: ## End-to-end example: Document indexing This example shows how to configure the processor, build a consumer with a handler, enqueue tasks from an ingress deployment, and check task status. ```python import io import logging import requests from fastapi import FastAPI from pydantic import BaseModel, HttpUrl from PyPDF2 import PdfReader from ray import serve from ray.serve.schema import CeleryAdapterConfig, TaskProcessorConfig from ray.serve.task_consumer import ( instantiate_adapter_from_config, task_consumer, task_handler, ) logger = logging.getLogger("ray.serve") fastapi_app = FastAPI(title="Async PDF Processing API") TASK_PROCESSOR_CONFIG = TaskProcessorConfig( queue_name="pdf_processing_queue", adapter_config=CeleryAdapterConfig( broker_url="redis://127.0.0.1:6379/0", backend_url="redis://127.0.0.1:6379/0", ), max_retries=3, failed_task_queue_name="failed_pdfs", ) class ProcessPDFRequest(BaseModel): pdf_url: HttpUrl max_summary_paragraphs: int = 3 @serve.deployment(num_replicas=2, max_ongoing_requests=5) @task_consumer(task_processor_config=TASK_PROCESSOR_CONFIG) class PDFProcessor: """Background worker that processes PDF documents asynchronously.""" @task_handler(name="process_pdf") def process_pdf(self, pdf_url: str, max_summary_paragraphs: int = 3): """Download PDF, extract text, and generate summary.""" try: response = requests.get(pdf_url, timeout=30) response.raise_for_status() pdf_reader = PdfReader(io.BytesIO(response.content)) if not pdf_reader.pages: raise ValueError("PDF contains no pages") full_text = "\n".join( page.extract_text() for page in pdf_reader.pages if page.extract_text() ) if not full_text.strip(): raise ValueError("PDF contains no extractable text") paragraphs = [p.strip() for p in full_text.split("\n\n") if p.strip()] summary = "\n\n".join(paragraphs[:max_summary_paragraphs]) return { "status": "success", "pdf_url": pdf_url, "page_count": len(pdf_reader.pages), "word_count": len(full_text.split()), "summary": summary, } except requests.exceptions.RequestException as e: raise ValueError(f"Failed to download PDF: {str(e)}") except Exception as e: raise ValueError(f"Failed to process PDF: {str(e)}") @serve.deployment() @serve.ingress(fastapi_app) class AsyncPDFAPI: """HTTP API for submitting and checking PDF processing tasks.""" def __init__(self, task_processor_config: TaskProcessorConfig, handler): self.adapter = instantiate_adapter_from_config(task_processor_config) @fastapi_app.post("/process") def process_pdf(self, request: ProcessPDFRequest): """Submit a PDF processing task and return task_id immediately.""" task_result = self.adapter.enqueue_task_sync( task_name="process_pdf", kwargs={ "pdf_url": str(request.pdf_url), "max_summary_paragraphs": request.max_summary_paragraphs, }, ) return { "task_id": task_result.id, "status": task_result.status, 
"message": "PDF processing task submitted successfully", } @fastapi_app.get("/status/{task_id}") def get_status(self, task_id: str): """Get task status and results.""" status = self.adapter.get_task_status_sync(task_id) return { "task_id": task_id, "status": status.status, "result": status.result if status.status == "SUCCESS" else None, "error": str(status.result) if status.status == "FAILURE" else None, } app = AsyncPDFAPI.bind(TASK_PROCESSOR_CONFIG, PDFProcessor.bind()) ``` In this example: - `DocumentIndexingConsumer` reads tasks from `document_indexing_queue` queue and processes them. - `API` enqueues tasks through `enqueue_task_sync` and fetches status through `get_task_status_sync`. - Passing `consumer` into `API.__init__` ensures both deployments are part of the Serve application graph. ## Concurrency and reliability Manage concurrency by setting `max_ongoing_requests` on the consumer deployment; this caps how many tasks each replica can process simultaneously. For at-least-once delivery, adapters should acknowledge a task only after the handler completes successfully. Failed tasks are retried up to `max_retries`; once exhausted, they are routed to the failed-task DLQ when configured. The default Celery adapter acknowledges on success, providing at-least-once processing. ## Dead letter queues (DLQs) Dead letter queues handle two types of problematic tasks: - **Unprocessable tasks**: The system routes tasks with no matching handler to `unprocessable_task_queue_name` if set. - **Failed tasks**: The system routes tasks that raise application exceptions after exhausting retries, have mismatched arguments, and other errors to `failed_task_queue_name` if set. ## Rollouts and compatibility During deployment upgrades, both old and new consumer replicas may run concurrently and pull from the same queue. If task schemas or names change, either version may see incompatible tasks. Recommendations: - **Version task names and payloads** to allow coexistence across versions. - **Don't remove handlers** until you drain old tasks. - **Monitor DLQs** for deserialization or handler resolution failures and re-enqueue or transform as needed. ## Limitations - Ray Serve supports only synchronous `@task_handler` methods. - External (non-Serve) workers are out of scope; all consumers run as Serve deployments. - Delivery guarantees ultimately depend on the configured broker. Results are optional when you don't configure a result backend. :::{note} The APIs in this guide reflect the alpha interfaces in `ray.serve.schema` and `ray.serve.task_consumer`. ::: --- (serve-autoscaling)= # Ray Serve Autoscaling Each [Ray Serve deployment](serve-key-concepts-deployment) has one [replica](serve-architecture-high-level-view) by default. This means there is one worker process running the model and serving requests. When traffic to your deployment increases, the single replica can become overloaded. To maintain high performance of your service, you need to scale out your deployment. ## Manual Scaling Before jumping into autoscaling, which is more complex, the other option to consider is manual scaling. You can increase the number of replicas by setting a higher value for [num_replicas](serve-configure-deployment) in the deployment options through [in place updates](serve-inplace-updates). By default, `num_replicas` is 1. Increasing the number of replicas will horizontally scale out your deployment and improve latency and throughput for increased levels of traffic. 
```yaml # Deploy with a single replica deployments: - name: Model num_replicas: 1 # Scale up to 10 replicas deployments: - name: Model num_replicas: 10 ``` ## Autoscaling Basic Configuration Instead of setting a fixed number of replicas for a deployment and manually updating it, you can configure a deployment to autoscale based on incoming traffic. The Serve autoscaler reacts to traffic spikes by monitoring queue sizes and making scaling decisions to add or remove replicas. Turn on autoscaling for a deployment by setting `num_replicas="auto"`. You can further configure it by tuning the [autoscaling_config](../serve/api/doc/ray.serve.config.AutoscalingConfig.rst) in deployment options. The following config is what we will use in the example in the following section. ```yaml - name: Model num_replicas: auto ``` Setting `num_replicas="auto"` is equivalent to the following deployment configuration. ```yaml - name: Model max_ongoing_requests: 5 autoscaling_config: target_ongoing_requests: 2 min_replicas: 1 max_replicas: 100 ``` :::{note} When you set `num_replicas="auto"`, Ray Serve applies the defaults shown above, including `max_replicas: 100`. However, if you configure autoscaling manually without using `num_replicas="auto"`, the base default for `max_replicas` is 1, which means autoscaling won't occur unless you explicitly set a higher value. You can override any of these defaults by specifying `autoscaling_config` even when using `num_replicas="auto"`. ::: Let's dive into what each of these parameters do. * **target_ongoing_requests** is the average number of ongoing requests per replica that the Serve autoscaler tries to ensure. You can adjust it based on your request processing length (the longer the requests, the smaller this number should be) as well as your latency objective (the shorter you want your latency to be, the smaller this number should be). * **max_ongoing_requests** is the maximum number of ongoing requests allowed for a replica. Note this parameter is not part of the autoscaling config because it's relevant to all deployments, but it's important to set it relative to the target value if you turn on autoscaling for your deployment. * **min_replicas** is the minimum number of replicas for the deployment. Set this to 0 if there are long periods of no traffic and some extra tail latency during upscale is acceptable. Otherwise, set this to what you think you need for low traffic. * **max_replicas** is the maximum number of replicas for the deployment. Set this to ~20% higher than what you think you need for peak traffic. These guidelines are a great starting point. If you decide to further tune your autoscaling config for your application, see [Advanced Ray Serve Autoscaling](serve-advanced-autoscaling). (resnet-autoscaling-example)= ## Basic example This example is a synchronous workload that runs ResNet50. The application code and its autoscaling configuration are below. Alternatively, see the second tab for specifying the autoscaling config through a YAML file. ::::{tab-set} :::{tab-item} Application Code ```{literalinclude} doc_code/resnet50_example.py :language: python :start-after: __serve_example_begin__ :end-before: __serve_example_end__ ``` ::: :::{tab-item} (Alternative) YAML config ```yaml applications: - name: default import_path: resnet:app deployments: - name: Model num_replicas: auto ``` ::: :::: This example uses [Locust](https://locust.io/) to run a load test against this application. 
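The following is a minimal sketch of the kind of Locust file such a test might use. The file name `locustfile.py`, the endpoint path, and the image URI payload are assumptions for illustration; they aren't taken from the example's actual load-testing code.

```python
# locustfile.py -- illustrative sketch of a Locust load test against the
# ResNet50 Serve application (endpoint path and payload are assumed).
from locust import HttpUser, constant, task


class ResNetUser(HttpUser):
    # A wait time of 0 means each simulated user always has exactly one
    # request in flight: send, wait for the response, send again.
    wait_time = constant(0)

    @task
    def classify_image(self):
        self.client.get("/", params={"uri": "https://example.com/dog.jpg"})
```

You can run it with `locust -f locustfile.py --host http://localhost:8000` and vary the number of users over time.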
The Locust load test runs a certain number of "users" that ping the ResNet50 service, where each user has a [constant wait time](https://docs.locust.io/en/stable/writing-a-locustfile.html#wait-time-attribute) of 0. Each user repeatedly sends a request, waits for a response, then immediately sends the next request. The number of users running over time is shown in the following graph:

![users](https://raw.githubusercontent.com/ray-project/images/master/docs/serve/autoscaling-guide/resnet50_users.png)

The results of the load test show three metrics over time: the number of replicas, QPS, and P50 latency.

Notice the following:

- Each Locust user constantly sends a single request and waits for a response. As a result, the number of autoscaled replicas is roughly half the number of Locust users over time as Serve attempts to satisfy the `target_ongoing_requests=2` setting.
- The throughput of the system increases with the number of users and replicas.
- The latency briefly spikes when traffic increases, but otherwise stays relatively steady.

## Ray Serve Autoscaler vs Ray Autoscaler

The Ray Serve Autoscaler is an application-level autoscaler that sits on top of the [Ray Autoscaler](cluster-index). Concretely, this means that the Ray Serve autoscaler asks Ray to start a number of replica actors based on the request demand. If the Ray Autoscaler determines there aren't enough available resources (for example, CPUs or GPUs) to place these actors, it responds by requesting more Ray nodes. The underlying cloud provider then responds by adding more nodes.

Similarly, when Ray Serve scales down and terminates replica Actors, it attempts to make as many nodes idle as possible so the Ray Autoscaler can remove them. To learn more about the architecture underlying Ray Serve Autoscaling, see [Ray Serve Autoscaling Architecture](serve-autoscaling-architecture).

---

(serve-configure-deployment)=

# Configure Ray Serve deployments

Ray Serve default values for deployments are a good starting point for exploration. To further tailor scaling behavior, resource management, or performance tuning, you can configure parameters to alter the default behavior of Ray Serve deployments.

Use this guide to learn the essentials of configuring deployments:

- What parameters you can configure for a Ray Serve deployment
- The different locations where you can specify the parameters.

## Configurable parameters

You can also refer to the [API reference](../serve/api/doc/ray.serve.deployment_decorator.rst) for the `@serve.deployment` decorator.

- `name` - Name uniquely identifying this deployment within the application. If not provided, the name of the class or function is used.
- `num_replicas` - Controls the number of replicas to run that handle requests to this deployment. This can be a positive integer, in which case the number of replicas stays constant, or `auto`, in which case the number of replicas autoscales with a default configuration (see [Ray Serve Autoscaling](serve-autoscaling) for more). Defaults to 1.
- `ray_actor_options` - Options to pass to the Ray Actor decorator, such as resource requirements. Valid options are: `accelerator_type`, `memory`, `num_cpus`, `num_gpus`, `object_store_memory`, `resources`, and `runtime_env`. For more details, see [Resource management in Serve](serve-cpus-gpus).
- `max_ongoing_requests` - Maximum number of queries that are sent to a replica of this deployment without receiving a response.
Defaults to 5 (note the default changed from 100 to 5 in Ray 2.32.0). This may be an important parameter to configure for [performance tuning](serve-perf-tuning). - `autoscaling_config` - Parameters to configure autoscaling behavior. If this is set, you can't set `num_replicas` to a number. For more details on configurable parameters for autoscaling, see [Ray Serve Autoscaling](serve-autoscaling). - `max_queued_requests` - [EXPERIMENTAL] Maximum number of requests to this deployment that will be queued at each caller (proxy or DeploymentHandle). Once this limit is reached, subsequent requests will raise a BackPressureError (for handles) or return an HTTP 503 status code (for HTTP requests). Defaults to -1 (no limit). - `user_config` - Config to pass to the reconfigure method of the deployment. This can be updated dynamically without restarting the replicas of the deployment. The user_config must be fully JSON-serializable. For more details, see [Serve User Config](serve-user-config). - `health_check_period_s` - Duration between health check calls for the replica. Defaults to 10s. The health check is by default a no-op Actor call to the replica, but you can define your own health check using the "check_health" method in your deployment that raises an exception when unhealthy. - `health_check_timeout_s` - Duration in seconds, that replicas wait for a health check method to return before considering it as failed. Defaults to 30s. - `graceful_shutdown_wait_loop_s` - Duration that replicas wait until there is no more work to be done before shutting down. Defaults to 2s. - `graceful_shutdown_timeout_s` - Duration to wait for a replica to gracefully shut down before being forcefully killed. Defaults to 20s. - `logging_config` - Logging Config for the deployment (e.g. log level, log directory, JSON log format and so on). See [LoggingConfig](../serve/api/doc/ray.serve.schema.LoggingConfig.rst) for details. ## How to specify parameters You can specify the above mentioned parameters in two locations: 1. In your application code. 2. In the Serve Config file, which is the recommended method for production. ### Specify parameters through the application code You can specify parameters in the application code in two ways: - In the `@serve.deployment` decorator when you first define a deployment - With the `options()` method when you want to modify a deployment Use the `@serve.deployment` decorator to specify deployment parameters when you are defining a deployment for the first time: ```{literalinclude} ../serve/doc_code/configure_serve_deployment/model_deployment.py :start-after: __deployment_start__ :end-before: __deployment_end__ :language: python ``` Use the [`.options()`](../serve/api/doc/ray.serve.Deployment.rst) method to modify deployment parameters on an already-defined deployment. Modifying an existing deployment lets you reuse deployment definitions and dynamically set parameters at runtime. ```{literalinclude} ../serve/doc_code/configure_serve_deployment/model_deployment.py :start-after: __deployment_end__ :end-before: __options_end__ :language: python ``` ### Specify parameters through the Serve config file In production, we recommend configuring individual deployments through the Serve config file. You can change parameter values without modifying your application code. Learn more about how to use the Serve Config in the [production guide](serve-in-production-config-file). 
```yaml applications: - name: app1 import_path: configure_serve:translator_app deployments: - name: Translator num_replicas: 2 max_ongoing_requests: 100 graceful_shutdown_wait_loop_s: 2.0 graceful_shutdown_timeout_s: 20.0 health_check_period_s: 10.0 health_check_timeout_s: 30.0 ray_actor_options: num_cpus: 0.2 num_gpus: 0.0 ``` ### Order of Priority You can set parameters to different values in various locations. For each individual parameter, the order of priority is (from highest to lowest): 1. Serve Config file 2. Application code (either through the `@serve.deployment` decorator or through `.options()`) 3. Serve defaults In other words, if you specify a parameter for a deployment in the config file and the application code, Serve uses the config file's value. If it's only specified in the code, Serve uses the value you specified in the code. If you don't specify the parameter anywhere, Serve uses the default for that parameter. For example, the following application code contains a single deployment `ExampleDeployment`: ```python @serve.deployment(num_replicas=2, graceful_shutdown_timeout_s=6) class ExampleDeployment: ... example_app = ExampleDeployment.bind() ``` Then you deploy the application with the following config file: ```yaml applications: - name: default import_path: models:example_app deployments: - name: ExampleDeployment num_replicas: 5 ``` Serve uses `num_replicas=5` from the value set in the config file and `graceful_shutdown_timeout_s=6` from the value set in the application code. All other deployment settings use Serve defaults because you didn't specify them in the code or the config. For instance, `health_check_period_s=10` because by default Serve health checks deployments once every 10 seconds. :::{tip} Remember that `ray_actor_options` counts as a single setting. The entire `ray_actor_options` dictionary in the config file overrides the entire `ray_actor_options` dictionary from the graph code. If you set individual options within `ray_actor_options` (e.g. `runtime_env`, `num_gpus`, `memory`) in the code but not in the config, Serve still won't use the code settings if the config has a `ray_actor_options` dictionary. It treats these missing options as though the user never set them and uses defaults instead. This dictionary overriding behavior also applies to `user_config` and `autoscaling_config`. ::: --- (serve-develop-and-deploy)= # Develop and Deploy an ML Application The flow for developing a Ray Serve application locally and deploying it in production covers the following steps: * Converting a Machine Learning model into a Ray Serve application * Testing the application locally * Building Serve config files for production deployment * Deploying applications using a config file ## Convert a model into a Ray Serve application This example uses a text-translation model: ```{literalinclude} ../serve/doc_code/getting_started/models.py :start-after: __start_translation_model__ :end-before: __end_translation_model__ :language: python ``` The Python file, called `model.py`, uses the `Translator` class to translate English text to French. - The `self.model` variable inside the `Translator`'s `__init__` method stores a function that uses the [t5-small](https://huggingface.co/t5-small) model to translate text. - When `self.model` is called on English text, it returns translated French text inside a dictionary formatted as `[{"translation_text": "..."}]`. - The `Translator`'s `translate` method extracts the translated text by indexing into the dictionary. 
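For reference, here's a minimal sketch of what `model.py` looks like, reconstructed from the description above; the `literalinclude` above is the authoritative version:

```python
# model.py -- illustrative sketch of the text-translation model.
from transformers import pipeline


class Translator:
    def __init__(self):
        # Load the t5-small English-to-French translation pipeline.
        self.model = pipeline("translation_en_to_fr", model="t5-small")

    def translate(self, text: str) -> str:
        # The pipeline returns [{"translation_text": "..."}]; index into it.
        model_output = self.model(text)
        return model_output[0]["translation_text"]


translator = Translator()
print(translator.translate("Hello world!"))
```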
Copy and paste the script and run it locally. It translates `"Hello world!"` into `"Bonjour Monde!"`. ```console $ python model.py Bonjour Monde! ``` Converting this model into a Ray Serve application with FastAPI requires three changes: 1. Import Ray Serve and FastAPI dependencies 2. Add decorators for Serve deployment with FastAPI: `@serve.deployment` and `@serve.ingress(app)` 3. `bind` the `Translator` deployment to the arguments that are passed into its constructor For other HTTP options, see [Set Up FastAPI and HTTP](serve-set-up-fastapi-http). ```{literalinclude} ../serve/doc_code/develop_and_deploy.py :start-after: __deployment_start__ :end-before: __deployment_end__ :language: python ``` Note that the code configures parameters for the deployment, such as `num_replicas` and `ray_actor_options`. These parameters help configure the number of copies of the deployment and the resource requirements for each copy. In this case, we set up 2 replicas of the model that take 0.2 CPUs and 0 GPUs each. For a complete guide on the configurable parameters on a deployment, see [Configure a Serve deployment](serve-configure-deployment). ## Test a Ray Serve application locally To test locally, run the script with the `serve run` CLI command. This command takes in an import path formatted as `module:application`. Run the command from a directory containing a local copy of the script saved as `model.py`, so it can import the application: ```console $ serve run model:translator_app ``` This command runs the `translator_app` application and then blocks, streaming logs to the console. You can kill it with `Ctrl-C`, which tears down the application. Now test the model over HTTP. Reach it at the following default URL: ``` http://127.0.0.1:8000/ ``` Send a POST request with JSON data containing the English text. This client script requests a translation for "Hello world!": ```{literalinclude} ../serve/doc_code/develop_and_deploy.py :start-after: __client_function_start__ :end-before: __client_function_end__ :language: python ``` While a Ray Serve application is deployed, use the `serve status` CLI command to check the status of the application and deployment. For more details on the output format of `serve status`, see [Inspect Serve in production](serve-in-production-inspecting). ```console $ serve status proxies: a85af35da5fcea04e13375bdc7d2c83c7d3915e290f1b25643c55f3a: HEALTHY applications: default: status: RUNNING message: '' last_deployed_time_s: 1693428451.894696 deployments: Translator: status: HEALTHY replica_states: RUNNING: 2 message: '' ``` ## Build Serve config files for production deployment To deploy Serve applications in production, you need to generate a Serve config YAML file. A Serve config file is the single source of truth for the cluster, allowing you to specify system-level configuration and your applications in one place. It also allows you to declaratively update your applications. The `serve build` CLI command takes as input the import path and saves to an output file using the `-o` flag. You can specify all deployment parameters in the Serve config files. ```console $ serve build model:translator_app -o config.yaml ``` The `serve build` command adds a default application name that can be modified. 
The resulting Serve config file is: ``` proxy_location: EveryNode http_options: host: 0.0.0.0 port: 8000 grpc_options: port: 9000 grpc_servicer_functions: [] applications: - name: app1 route_prefix: / import_path: model:translator_app runtime_env: {} deployments: - name: Translator num_replicas: 2 ray_actor_options: num_cpus: 0.2 num_gpus: 0.0 ``` You can also use the Serve config file with `serve run` for local testing. For example: ```console $ serve run config.yaml ``` ```console $ serve status proxies: 1894261b372d34854163ac5ec88405328302eb4e46ac3a2bdcaf8d18: HEALTHY applications: app1: status: RUNNING message: '' last_deployed_time_s: 1693430474.873806 deployments: Translator: status: HEALTHY replica_states: RUNNING: 2 message: '' ``` For more details, see [Serve Config Files](serve-in-production-config-file). ## Deploy Ray Serve in production Deploy the Ray Serve application in production on Kubernetes using the [KubeRay] operator. Copy the YAML file generated in the previous step directly into the Kubernetes configuration. KubeRay supports zero-downtime upgrades, status reporting, and fault tolerance for your production application. See [Deploying on Kubernetes](serve-in-production-kubernetes) for more information. For production usage, consider implementing the recommended practice of setting up [head node fault tolerance](serve-e2e-ft-guide-gcs). ## Monitor Ray Serve Use the Ray Dashboard to get a high-level overview of your Ray Cluster and Ray Serve application's states. The Ray Dashboard is available both during local testing and on a remote cluster in production. Ray Serve provides some in-built metrics and logging as well as utilities for adding custom metrics and logs in your application. For production deployments, exporting logs and metrics to your observability platforms is recommended. See [Monitoring](serve-monitoring) for more details. [KubeRay]: kuberay-index --- (serve-getting-started)= # Getting Started This tutorial will walk you through the process of writing and testing a Ray Serve application. It will show you how to * convert a machine learning model to a Ray Serve deployment * test a Ray Serve application locally over HTTP * compose multi-model machine learning models together into a single application We'll use two models in this tutorial: * [HuggingFace's TranslationPipeline](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.TranslationPipeline) as a text-translation model * [HuggingFace's SummarizationPipeline](https://huggingface.co/docs/transformers/v4.21.0/en/main_classes/pipelines#transformers.SummarizationPipeline) as a text-summarizer model You can also follow along using your own models from any Python framework. After deploying those two models, we'll test them with HTTP requests. :::{tip} If you have suggestions on how to improve this tutorial, please [let us know](https://github.com/ray-project/ray/issues/new/choose)! ::: To run this example, you will need to install the following: ```bash pip install "ray[serve]" transformers requests torch ``` ## Text Translation Model (before Ray Serve) First, let's take a look at our text-translation model. Here's its code: ```{literalinclude} ../serve/doc_code/getting_started/models.py :start-after: __start_translation_model__ :end-before: __end_translation_model__ :language: python ``` The Python file, called `model.py`, uses the `Translator` class to translate English text to French. 
- The `self.model` variable inside `Translator`'s `__init__` method stores a function that uses the [t5-small](https://huggingface.co/t5-small) model to translate text. - When `self.model` is called on English text, it returns translated French text inside a dictionary formatted as `[{"translation_text": "..."}]`. - The `Translator`'s `translate` method extracts the translated text by indexing into the dictionary. You can copy-paste this script and run it locally. It translates `"Hello world!"` into `"Bonjour Monde!"`. ```console $ python model.py Bonjour Monde! ``` Keep in mind that the `TranslationPipeline` is an example ML model for this tutorial. You can follow along using arbitrary models from any Python framework. Check out our tutorials on scikit-learn, PyTorch, and Tensorflow for more info and examples: - {ref}`serve-ml-models-tutorial` (converting-to-ray-serve-application)= ## Converting to a Ray Serve Application In this section, we'll deploy the text translation model using Ray Serve, so it can be scaled up and queried over HTTP. We'll start by converting `Translator` into a Ray Serve deployment. First, we open a new Python file and import `ray` and `ray.serve`: ```{literalinclude} ../serve/doc_code/getting_started/model_deployment.py :start-after: __import_start__ :end-before: __import_end__ :language: python ``` After these imports, we can include our model code from above: ```{literalinclude} ../serve/doc_code/getting_started/model_deployment.py :start-after: __model_start__ :end-before: __model_end__ :language: python ``` The `Translator` class has two modifications: 1. It has a decorator, `@serve.deployment`. 2. It has a new method, `__call__`. The decorator converts `Translator` from a Python class into a Ray Serve `Deployment` object. Each deployment stores a single Python function or class that you write and uses it to serve requests. You can scale and configure each of your deployments independently using parameters in the `@serve.deployment` decorator. The example configures a few common parameters: * `num_replicas`: an integer that determines how many copies of our deployment process run in Ray. Requests are load balanced across these replicas, allowing you to scale your deployments horizontally. * `ray_actor_options`: a dictionary containing configuration options for each replica. * `num_cpus`: a float representing the logical number of CPUs each replica should reserve. You can make this a fraction to pack multiple replicas together on a machine with fewer CPUs than replicas. * `num_gpus`: a float representing the logical number of GPUs each replica should reserve. You can make this a fraction to pack multiple replicas together on a machine with fewer GPUs than replicas. * `resources`: a dictionary containing other resource requirements for the replica, such as non-GPU accelerators like HPUs or TPUs. All these parameters are optional, so feel free to omit them: ```python ... @serve.deployment class Translator: ... ``` Deployments receive Starlette HTTP `request` objects [^f1]. By default, the deployment class's `__call__` method is called on this `request` object. The return value is sent back in the HTTP response body. This is why `Translator` needs a new `__call__` method. The method processes the incoming HTTP request by reading its JSON data and forwarding it to the `translate` method. The translated text is returned and sent back through the HTTP response. You can also use Ray Serve's FastAPI integration to avoid working with raw HTTP requests. 
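Before moving on, here's a rough sketch of the decorated class with its new `__call__` method, reconstructed from the description above. The tutorial's actual code is in `model_deployment.py`, so treat this sketch, including the specific decorator values, as illustrative only:

```python
# Illustrative sketch of the Serve deployment version of Translator.
from starlette.requests import Request
from transformers import pipeline

from ray import serve


@serve.deployment(num_replicas=2, ray_actor_options={"num_cpus": 0.2, "num_gpus": 0})
class Translator:
    def __init__(self):
        self.model = pipeline("translation_en_to_fr", model="t5-small")

    def translate(self, text: str) -> str:
        return self.model(text)[0]["translation_text"]

    async def __call__(self, http_request: Request) -> str:
        # Serve calls __call__ with the Starlette Request by default;
        # read the JSON body and forward it to translate().
        english_text: str = await http_request.json()
        return self.translate(english_text)
```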
Check out {ref}`serve-fastapi-http` for more info about FastAPI with Serve.

Next, we need to `bind` our `Translator` deployment to arguments that will be passed into its constructor. This defines a Ray Serve application that we can run locally or deploy to production (you'll see later that applications can consist of multiple deployments). Since `Translator`'s constructor doesn't take in any arguments, we can call the deployment's `bind` method without passing anything in:

```{literalinclude} ../serve/doc_code/getting_started/model_deployment.py
:start-after: __model_deploy_start__
:end-before: __model_deploy_end__
:language: python
```

With that, we are ready to test the application locally.

## Running a Ray Serve Application

Here's the full Ray Serve script that we built above:

```{literalinclude} ../serve/doc_code/getting_started/model_deployment_full.py
:start-after: __deployment_full_start__
:end-before: __deployment_full_end__
:language: python
```

To test locally, we run the script with the `serve run` CLI command. This command takes in an import path to our deployment formatted as `module:application`. Make sure to run the command from a directory containing a local copy of this script saved as `serve_quickstart.py`, so it can import the application:

```console
$ serve run serve_quickstart:translator_app
```

This command will run the `translator_app` application and then block, streaming logs to the console. It can be killed with `Ctrl-C`, which will tear down the application.

We can now test our model over HTTP. It can be reached at the following URL by default:

```
http://127.0.0.1:8000/
```

We'll send a POST request with JSON data containing our English text. `Translator`'s `__call__` method will unpack this text and forward it to the `translate` method. Here's a client script that requests a translation for "Hello world!":

```{literalinclude} ../serve/doc_code/getting_started/model_deployment.py
:start-after: __client_function_start__
:end-before: __client_function_end__
:language: python
```

To test our deployment, first make sure `Translator` is running:

```console
$ serve run serve_quickstart:translator_app
```

While `Translator` is running, we can open a separate terminal window and run the client script. This will get a response over HTTP:

```console
$ python model_client.py
Bonjour monde!
```

## Composing Multiple Models

Ray Serve allows you to compose multiple deployments into a single Ray Serve application. This makes it easy to combine multiple machine learning models along with business logic to serve a single request. We can use parameters like `autoscaling_config`, `num_replicas`, `num_cpus`, and `num_gpus` to independently configure and scale each deployment in the application.

For example, let's deploy a machine learning pipeline with two steps:

1. Summarize English text
2. Translate the summary into French

`Translator` already performs step 2. We can use [HuggingFace's SummarizationPipeline](https://huggingface.co/docs/transformers/v4.21.0/en/main_classes/pipelines#transformers.SummarizationPipeline) to accomplish step 1. Here's an example of the `SummarizationPipeline` that runs locally:

```{literalinclude} ../serve/doc_code/getting_started/models.py
:start-after: __start_summarization_model__
:end-before: __end_summarization_model__
:language: python
```

You can copy-paste this script and run it locally.
It summarizes the snippet from _A Tale of Two Cities_ to `it was the best of times, it was the worst of times .` ```console $ python summary_model.py it was the best of times, it was the worst of times . ``` Here's an application that chains the two models together. The graph takes English text, summarizes it, and then translates it: ```{literalinclude} ../serve/doc_code/getting_started/translator.py :start-after: __start_graph__ :end-before: __end_graph__ :language: python ``` This script contains our `Summarizer` class converted to a deployment and our `Translator` class with some modifications. In this script, the `Summarizer` class contains the `__call__` method since requests are sent to it first. It also takes in a handle to the `Translator` as one of its constructor arguments, so it can forward summarized texts to the `Translator` deployment. The `__call__` method also contains some new code: ```python translation = await self.translator.translate.remote(summary) ``` `self.translator.translate.remote(summary)` issues an asynchronous call to the `Translator`'s `translate` method and returns a `DeploymentResponse` object immediately. Calling `await` on the response waits for the remote method call to execute and returns its return value. The response could also be passed directly to another `DeploymentHandle` call. We define the full application as follows: ```python app = Summarizer.bind(Translator.bind()) ``` Here, we bind `Translator` to its (empty) constructor arguments, and then we pass in the bound `Translator` as the constructor argument for the `Summarizer`. We can run this deployment graph using the `serve run` CLI command. Make sure to run this command from a directory containing a local copy of the `serve_quickstart_composed.py` code: ```console $ serve run serve_quickstart_composed:app ``` We can use this client script to make requests to the graph: ```{literalinclude} ../serve/doc_code/getting_started/translator.py :start-after: __start_client__ :end-before: __end_client__ :language: python ``` While the application is running, we can open a separate terminal window and query it: ```console $ python composed_client.py c'était le meilleur des temps, c'était le pire des temps . ``` Composed Ray Serve applications let you deploy each part of your machine learning pipeline, such as inference and business logic steps, in separate deployments. Each of these deployments can be individually configured and scaled, ensuring you get maximal performance from your resources. See the guide on [model composition](serve-model-composition) to learn more. ## Next Steps - Dive into the {doc}`key-concepts` to get a deeper understanding of Ray Serve. - View details about your Serve application in the Ray Dashboard: {ref}`dash-serve-view`. - Learn more about how to deploy your Ray Serve application to production: {ref}`serve-in-production`. - Check more in-depth tutorials for popular machine learning frameworks: {doc}`examples`. ```{rubric} Footnotes ``` [^f1]: [Starlette](https://www.starlette.io/) is a web server framework used by Ray Serve. --- (serve-set-up-fastapi-http)= # Set Up FastAPI and HTTP This section helps you understand how to: - Send HTTP requests to Serve deployments - Use Ray Serve to integrate with FastAPI - Use customized HTTP adapters - Choose which feature to use for your use case - Set up keep alive timeout ## Choosing the right HTTP feature Serve offers a layered approach to expose your model with the right HTTP API. 
Considering your use case, you can choose the right level of abstraction: - If you are comfortable working with the raw request object, use [`starlette.request.Requests` API](serve-http). - If you want a fully fledged API server with validation and doc generation, use the [FastAPI integration](serve-fastapi-http). (serve-http)= ## Calling Deployments via HTTP When you deploy a Serve application, the [ingress deployment](serve-key-concepts-ingress-deployment) (the one passed to `serve.run`) is exposed over HTTP. ```{literalinclude} doc_code/http_guide/http_guide.py :start-after: __begin_starlette__ :end-before: __end_starlette__ :language: python ``` Requests to the Serve HTTP server at `/` are routed to the deployment's `__call__` method with a [Starlette Request object](https://www.starlette.io/requests/) as the sole argument. The `__call__` method can return any JSON-serializable object or a [Starlette Response object](https://www.starlette.io/responses/) (e.g., to return a custom status code or custom headers). A Serve app's route prefix can be changed from `/` to another string by setting `route_prefix` in `serve.run()` or the Serve config file. (serve-request-cancellation-http)= ### Request cancellation When processing a request takes longer than the [end-to-end timeout](serve-performance-e2e-timeout) or an HTTP client disconnects before receiving a response, Serve cancels the in-flight request: - If the proxy hasn't yet sent the request to a replica, Serve simply drops the request. - If the request has been sent to a replica, Serve attempts to interrupt the replica and cancel the request. The `asyncio.Task` running the handler on the replica is cancelled, raising an `asyncio.CancelledError` the next time it enters an `await` statement. See [the asyncio docs](https://docs.python.org/3/library/asyncio-task.html#task-cancellation) for more info. Handle this exception in a try-except block to customize your deployment's behavior when a request is cancelled: ```{literalinclude} doc_code/http_guide/disconnects.py :start-after: __start_basic_disconnect__ :end-before: __end_basic_disconnect__ :language: python ``` If no `await` statements are left in the deployment's code before the request completes, the replica processes the request as usual, sends the response back to the proxy, and the proxy discards the response. Use `await` statements for blocking operations in a deployment, so Serve can cancel in-flight requests without waiting for the blocking operation to complete. Cancellation cascades to any downstream deployment handle, task, or actor calls that were spawned in the deployment's request-handling method. These can handle the `asyncio.CancelledError` in the same way as the ingress deployment. To prevent an async call from being interrupted by `asyncio.CancelledError`, use `asyncio.shield()`: ```{literalinclude} doc_code/http_guide/disconnects.py :start-after: __start_shielded_disconnect__ :end-before: __end_shielded_disconnect__ :language: python ``` When the request is cancelled, a cancellation error is raised inside the `SnoringSleeper` deployment's `__call__()` method. However, the cancellation is not raised inside the `snore()` call, so `ZZZ` is printed even if the request is cancelled. Note that `asyncio.shield` cannot be used on a `DeploymentHandle` call to prevent the downstream handler from being cancelled. You need to explicitly handle the cancellation error in that handler as well. 
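For example, a downstream deployment might catch and handle the cancellation itself. The following sketch isn't taken from the guide's example code; the deployment name and workload are illustrative:

```python
# Illustrative: a downstream deployment that handles cancellation explicitly,
# since asyncio.shield() on the caller's DeploymentHandle call won't protect it.
import asyncio

from ray import serve


@serve.deployment
class Downstream:
    async def __call__(self) -> str:
        try:
            await asyncio.sleep(5)  # Stand-in for real work.
            return "finished"
        except asyncio.CancelledError:
            # Raised here when the upstream request is cancelled and the
            # cancellation cascades to this in-flight call.
            print("Downstream call cancelled; cleaning up.")
            raise
```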
(serve-fastapi-http)=

## FastAPI HTTP Deployments

If you want to define more complex HTTP handling logic, Serve integrates with [FastAPI](https://fastapi.tiangolo.com/). This allows you to define a Serve deployment using the {mod}`@serve.ingress ` decorator that wraps a FastAPI app with its full range of features. The most basic example of this is shown below, but for more details on all that FastAPI has to offer such as variable routes, automatic type validation, dependency injection (e.g., for database connections), and more, please check out [their documentation](https://fastapi.tiangolo.com/).

:::{note}
A Serve application that's integrated with FastAPI still respects the `route_prefix` set through Serve. The routes that are registered through the FastAPI `app` object are layered on top of the route prefix. For instance, if your Serve application has `route_prefix = /my_app` and you decorate a method with `@app.get("/fetch_data")`, then you can call that method by sending a GET request to the path `/my_app/fetch_data`.
:::

```{literalinclude} doc_code/http_guide/http_guide.py
:start-after: __begin_fastapi__
:end-before: __end_fastapi__
:language: python
```

Now if you send a request to `/hello`, this will be routed to the `root` method of our deployment. We can also easily leverage FastAPI to define multiple routes with different HTTP methods:

```{literalinclude} doc_code/http_guide/http_guide.py
:start-after: __begin_fastapi_multi_routes__
:end-before: __end_fastapi_multi_routes__
:language: python
```

You can also pass in an existing FastAPI app to a deployment to serve it as-is:

```{literalinclude} doc_code/http_guide/http_guide.py
:start-after: __begin_byo_fastapi__
:end-before: __end_byo_fastapi__
:language: python
```

This is useful for scaling out an existing FastAPI app with no modifications necessary. Existing middlewares, **automatic OpenAPI documentation generation**, and other advanced FastAPI features should work as-is.

### WebSockets

Serve supports WebSockets via FastAPI:

```{literalinclude} doc_code/http_guide/websockets_example.py
:start-after: __websocket_serve_app_start__
:end-before: __websocket_serve_app_end__
:language: python
```

Decorate the function that handles WebSocket requests with `@app.websocket`. Read more about FastAPI WebSockets in the [FastAPI documentation](https://fastapi.tiangolo.com/advanced/websockets/).

Query the deployment using the `websockets` package (`pip install websockets`):

```{literalinclude} doc_code/http_guide/websockets_example.py
:start-after: __websocket_serve_client_start__
:end-before: __websocket_serve_client_end__
:language: python
```

### FastAPI factory pattern

Ray Serve's object-based pattern, shown previously, requires FastAPI objects to be serializable via cloudpickle, which prevents the use of some standard libraries like `FastAPIInstrumentor` because they rely on non-serializable components such as thread locks. The factory pattern creates the FastAPI object directly on each replica, avoiding the need to serialize the FastAPI object.

```{literalinclude} doc_code/http_guide/http_guide.py
:start-after: __begin_fastapi_factory_pattern__
:end-before: __end_fastapi_factory_pattern__
:language: python
```

(serve-http-streaming-response)=

## Streaming Responses

Some applications must stream incremental results back to the caller. This is common for text generation using large language models (LLMs) or video processing applications.
The full forward pass may take multiple seconds, so providing incremental results as they're available provides a much better user experience. To use HTTP response streaming, return a [StreamingResponse](https://www.starlette.io/responses/#streamingresponse) that wraps a generator from your HTTP handler. This is supported for basic HTTP ingress deployments using a `__call__` method and when using the [FastAPI integration](serve-fastapi-http). The code below defines a Serve application that incrementally streams numbers up to a provided `max`. The client-side code is also updated to handle the streaming outputs. This code uses the `stream=True` option to the [requests](https://requests.readthedocs.io/en/latest/user/advanced.html#streaming-requests) library. ```{literalinclude} doc_code/http_guide/streaming_example.py :start-after: __begin_example__ :end-before: __end_example__ :language: python ``` Save this code in `stream.py` and run it: ```bash $ python stream.py [2023-05-25 10:44:23] INFO ray._private.worker::Started a local Ray instance. View the dashboard at http://127.0.0.1:8265 (ServeController pid=40401) INFO 2023-05-25 10:44:25,296 controller 40401 deployment_state.py:1259 - Deploying new version of deployment default_StreamingResponder. (ProxyActor pid=40403) INFO: Started server process [40403] (ServeController pid=40401) INFO 2023-05-25 10:44:25,333 controller 40401 deployment_state.py:1498 - Adding 1 replica to deployment default_StreamingResponder. Got result 0.0s after start: '0' Got result 0.1s after start: '1' Got result 0.2s after start: '2' Got result 0.3s after start: '3' Got result 0.4s after start: '4' Got result 0.5s after start: '5' Got result 0.6s after start: '6' Got result 0.7s after start: '7' Got result 0.8s after start: '8' Got result 0.9s after start: '9' (ServeReplica:default_StreamingResponder pid=41052) INFO 2023-05-25 10:49:52,230 default_StreamingResponder default_StreamingResponder#qlZFCa yomKnJifNJ / default replica.py:634 - __CALL__ OK 1017.6ms ``` ### Terminating the stream when a client disconnects In some cases, you may want to cease processing a request when the client disconnects before the full stream has been returned. If you pass an async generator to `StreamingResponse`, it is cancelled and raises an `asyncio.CancelledError` when the client disconnects. Note that you must `await` at some point in the generator for the cancellation to occur. In the example below, the generator streams responses forever until the client disconnects, then it prints that it was cancelled and exits. Save this code in `stream.py` and run it: ```{literalinclude} doc_code/http_guide/streaming_example.py :start-after: __begin_cancellation__ :end-before: __end_cancellation__ :language: python ``` ```bash $ python stream.py [2023-07-10 16:08:41] INFO ray._private.worker::Started a local Ray instance. View the dashboard at http://127.0.0.1:8265 (ServeController pid=50801) INFO 2023-07-10 16:08:42,296 controller 40401 deployment_state.py:1259 - Deploying new version of deployment default_StreamingResponder. (ProxyActor pid=50803) INFO: Started server process [50803] (ServeController pid=50805) INFO 2023-07-10 16:08:42,963 controller 50805 deployment_state.py:1586 - Adding 1 replica to deployment default_StreamingResponder. 
Got result 0.0s after start: '0' Got result 0.1s after start: '1' Got result 0.2s after start: '2' Got result 0.3s after start: '3' Got result 0.4s after start: '4' Got result 0.5s after start: '5' Got result 0.6s after start: '6' Got result 0.7s after start: '7' Got result 0.8s after start: '8' Got result 0.9s after start: '9' Got result 1.0s after start: '10' Client disconnecting (ServeReplica:default_StreamingResponder pid=50842) Cancelled! Exiting. (ServeReplica:default_StreamingResponder pid=50842) INFO 2023-07-10 16:08:45,756 default_StreamingResponder default_StreamingResponder#cmpnmF ahteNDQSWx / default replica.py:691 - __CALL__ OK 1019.1ms ``` (serve-http-guide-keep-alive-timeout)= ## Set keep alive timeout Serve uses a Uvicorn HTTP server internally to serve HTTP requests. By default, Uvicorn keeps HTTP connections alive for 5 seconds between requests. Modify the keep-alive timeout by setting the `keep_alive_timeout_s` in the `http_options` field of the Serve config files. This config is global to your Ray cluster, and you can't update it during runtime. See Uvicorn's keep alive timeout [guide](https://www.uvicorn.org/server-behavior/#timeouts) for more information. --- (rayserve)= # Ray Serve: Scalable and Programmable Serving ```{toctree} :hidden: getting_started key-concepts develop-and-deploy model_composition multi-app model-multiplexing configure-serve-deployment http-guide Serving LLMs Production Guide monitoring resource-allocation autoscaling-guide asynchronous-inference advanced-guides/index architecture examples api/index ``` ```{image} logo.svg :align: center :height: 250px :width: 400px ``` (rayserve-overview)= Ray Serve is a scalable model serving library for building online inference APIs. Serve is framework-agnostic, so you can use a single toolkit to serve everything from deep learning models built with frameworks like PyTorch, TensorFlow, and Keras, to Scikit-Learn models, to arbitrary Python business logic. It has several features and performance optimizations for serving Large Language Models such as response streaming, dynamic request batching, multi-node/multi-GPU serving, etc. Ray Serve is particularly well suited for [model composition](serve-model-composition) and multi-model serving, enabling you to build a complex inference service consisting of multiple ML models and business logic all in Python code. Ray Serve is built on top of Ray, so it easily scales to many machines and offers flexible scheduling support such as fractional GPUs so you can share resources and serve many machine learning models at low cost. ## Quickstart Install Ray Serve and its dependencies: ```bash pip install "ray[serve]" ``` Define a simple "hello world" application, run it locally, and query it over HTTP. ```{literalinclude} doc_code/quickstart.py :language: python ``` ## More examples ::::{tab-set} :::{tab-item} Model composition Use Serve's model composition API to combine multiple deployments into a single application. ```{literalinclude} doc_code/quickstart_composed.py :language: python ``` ::: :::{tab-item} FastAPI integration Use Serve's [FastAPI](https://fastapi.tiangolo.com/) integration to elegantly handle HTTP parsing and validation. ```{literalinclude} doc_code/fastapi_example.py :language: python ``` ::: :::{tab-item} Hugging Face Transformers model To run this example, install the following: ``pip install transformers`` Serve a pre-trained [Hugging Face Transformers](https://huggingface.co/docs/transformers/index) model using Ray Serve. 
The model we'll use is a sentiment analysis model: it will take a text string as input and return if the text was "POSITIVE" or "NEGATIVE." ```{literalinclude} doc_code/transformers_example.py :language: python ``` ::: :::: ## Why choose Serve? :::{dropdown} Build end-to-end ML-powered applications :animate: fade-in-slide-down Many solutions for ML serving focus on "tensor-in, tensor-out" serving: that is, they wrap ML models behind a predefined, structured endpoint. However, machine learning isn't useful in isolation. It's often important to combine machine learning with business logic and traditional web serving logic such as database queries. Ray Serve is unique in that it allows you to build and deploy an end-to-end distributed serving application in a single framework. You can combine multiple ML models, business logic, and expressive HTTP handling using Serve's FastAPI integration (see {ref}`serve-fastapi-http`) to build your entire application as one Python program. ::: :::{dropdown} Combine multiple models using a programmable API :animate: fade-in-slide-down Often solving a problem requires more than just a single machine learning model. For instance, image processing applications typically require a multi-stage pipeline consisting of steps like preprocessing, segmentation, and filtering to achieve their end goal. In many cases each model may use a different architecture or framework and require different resources (like CPUs vs GPUs). Many other solutions support defining a static graph in YAML or some other configuration language. This can be limiting and hard to work with. Ray Serve, on the other hand, supports multi-model composition using a programmable API where calls to different models look just like function calls. The models can use different resources and run across different machines in the cluster, but you can write it like a regular program. See {ref}`serve-model-composition` for more details. ::: :::{dropdown} Flexibly scale up and allocate resources :animate: fade-in-slide-down Machine learning models are compute-intensive and therefore can be very expensive to operate. A key requirement for any ML serving system is being able to dynamically scale up and down and allocate the right resources for each model to handle the request load while saving cost. Serve offers a number of built-in primitives to help make your ML serving application efficient. It supports dynamically scaling the resources for a model up and down by adjusting the number of replicas, batching requests to take advantage of efficient vectorized operations (especially important on GPUs), and a flexible resource allocation model that enables you to serve many models on limited hardware resources. ::: :::{dropdown} Avoid framework or vendor lock-in :animate: fade-in-slide-down Machine learning moves fast, with new libraries and model architectures being released all the time, it's important to avoid locking yourself into a solution that is tied to a specific framework. This is particularly important in serving, where making changes to your infrastructure can be time consuming, expensive, and risky. Additionally, many hosted solutions are limited to a single cloud provider which can be a problem in today's multi-cloud world. Ray Serve is not tied to any specific machine learning library or framework, but rather provides a general-purpose scalable serving layer. Because it's built on top of Ray, you can run it anywhere Ray can: on your laptop, Kubernetes, any major cloud provider, or even on-premise. 
::: ## How can Serve help me as a... :::{dropdown} Data scientist :animate: fade-in-slide-down Serve makes it easy to go from a laptop to a cluster. You can test your models (and your entire deployment graph) on your local machine before deploying it to production on a cluster. You don't need to know heavyweight Kubernetes concepts or cloud configurations to use Serve. ::: :::{dropdown} ML engineer :animate: fade-in-slide-down Serve helps you scale out your deployment and runs them reliably and efficiently to save costs. With Serve's first-class model composition API, you can combine models together with business logic and build end-to-end user-facing applications. Additionally, Serve runs natively on Kubernetes with minimal operation overhead. ::: :::{dropdown} ML platform engineer :animate: fade-in-slide-down Serve specializes in scalable and reliable ML model serving. As such, it can be an important plug-and-play component of your ML platform stack. Serve supports arbitrary Python code and therefore integrates well with the MLOps ecosystem. You can use it with model optimizers (ONNX, TVM), model monitoring systems (Seldon Alibi, Arize), model registries (MLFlow, Weights and Biases), machine learning frameworks (XGBoost, Scikit-learn), data app UIs (Gradio, Streamlit), and Web API frameworks (FastAPI, gRPC). ::: :::{dropdown} LLM developer :animate: fade-in-slide-down Serve enables you to rapidly prototype, develop, and deploy scalable LLM applications to production. Many large language model (LLM) applications combine prompt preprocessing, vector database lookups, LLM API calls, and response validation. Because Serve supports any arbitrary Python code, you can write all these steps as a single Python module, enabling rapid development and easy testing. You can then quickly deploy your Ray Serve LLM application to production, and each application step can independently autoscale to efficiently accommodate user traffic without wasting resources. In order to improve performance of your LLM applications, Ray Serve has features for batching and can integrate with any model optimization technique. Ray Serve also supports streaming responses, a key feature for chatbot-like applications. ::: ## How does Serve compare to ... :::{dropdown} TFServing, TorchServe, ONNXRuntime :animate: fade-in-slide-down Ray Serve is *framework-agnostic*, so you can use it alongside any other Python framework or library. We believe data scientists should not be bound to a particular machine learning framework. They should be empowered to use the best tool available for the job. Compared to these framework-specific solutions, Ray Serve doesn't perform any model-specific optimizations to make your ML model run faster. However, you can still optimize the models yourself and run them in Ray Serve. For example, you can run a model compiled by [PyTorch JIT](https://pytorch.org/docs/stable/jit.html) or [ONNXRuntime](https://onnxruntime.ai/). ::: :::{dropdown} AWS SageMaker, Azure ML, Google Vertex AI :animate: fade-in-slide-down As an open-source project, Ray Serve brings the scalability and reliability of these hosted offerings to your own infrastructure. You can use the Ray [cluster launcher](cluster-index) to deploy Ray Serve to all major public clouds, K8s, as well as on bare-metal, on-premise machines. Ray Serve is not a full-fledged ML Platform. Compared to these other offerings, Ray Serve lacks the functionality for managing the lifecycle of your models, visualizing their performance, etc. 
Ray Serve primarily focuses on model serving and providing the primitives for you to build your own ML platform on top. ::: :::{dropdown} Seldon, KServe, Cortex :animate: fade-in-slide-down You can develop Ray Serve on your laptop, deploy it on a dev box, and scale it out to multiple machines or a Kubernetes cluster, all with minimal or no changes to code. It's a lot easier to get started with when you don't need to provision and manage a K8s cluster. When it's time to deploy, you can use our [Kubernetes Operator](kuberay-quickstart) to transparently deploy your Ray Serve application to K8s. ::: :::{dropdown} BentoML, Comet.ml, MLflow :animate: fade-in-slide-down Many of these tools are focused on serving and scaling models independently. In contrast, Ray Serve is framework-agnostic and focuses on model composition. As such, Ray Serve works with any model packaging and registry format. Ray Serve also provides key features for building production-ready machine learning applications, including best-in-class autoscaling and naturally integrating with business logic. ::: We truly believe Serve is unique as it gives you end-to-end control over your ML application while delivering scalability and high performance. To achieve Serve's feature offerings with other tools, you would need to glue together multiple frameworks like Tensorflow Serving and SageMaker, or even roll your own micro-batching component to improve throughput. ## Learn More Check out {ref}`serve-getting-started` and {ref}`serve-key-concepts`, or head over to the {doc}`examples` to get started building your Ray Serve applications. ```{eval-rst} .. grid:: 1 2 2 2 :gutter: 1 :class-container: container pb-3 .. grid-item-card:: :class-img-top: pt-2 w-75 d-block mx-auto fixed-height-img **Getting Started** ^^^ Start with our quick start tutorials for :ref:`deploying a single model locally ` and how to :ref:`convert an existing model into a Ray Serve deployment `. +++ .. button-ref:: serve-getting-started :color: primary :outline: :expand: Get Started with Ray Serve .. grid-item-card:: :class-img-top: pt-2 w-75 d-block mx-auto fixed-height-img **Key Concepts** ^^^ Understand the key concepts behind Ray Serve. Learn about :ref:`Deployments `, :ref:`how to query them `, and using :ref:`DeploymentHandles ` to compose multiple models and business logic together. +++ .. button-ref:: serve-key-concepts :color: primary :outline: :expand: Learn Key Concepts .. grid-item-card:: :class-img-top: pt-2 w-75 d-block mx-auto fixed-height-img **Examples** ^^^ Follow the tutorials to learn how to integrate Ray Serve with :ref:`TensorFlow `, and :ref:`Scikit-Learn `. +++ .. button-ref:: examples :color: primary :outline: :expand: :ref-type: doc Serve Examples .. grid-item-card:: :class-img-top: pt-2 w-75 d-block mx-auto fixed-height-img **API Reference** ^^^ Get more in-depth information about the Ray Serve API. +++ .. 
button-ref:: serve-api :color: primary :outline: :expand: Read the API Reference ``` For more, see the following blog posts about Ray Serve: - [Serving ML Models in Production: Common Patterns](https://www.anyscale.com/blog/serving-ml-models-in-production-common-patterns) by Simon Mo, Edward Oakes, and Michael Galarnyk - [The Simplest Way to Serve your NLP Model in Production with Pure Python](https://medium.com/distributed-computing-with-ray/the-simplest-way-to-serve-your-nlp-model-in-production-with-pure-python-d42b6a97ad55) by Edward Oakes and Bill Chambers - [Machine Learning Serving is Broken](https://medium.com/distributed-computing-with-ray/machine-learning-serving-is-broken-f59aff2d607f) by Simon Mo - [How to Scale Up Your FastAPI Application Using Ray Serve](https://medium.com/distributed-computing-with-ray/how-to-scale-up-your-fastapi-application-using-ray-serve-c9a7b69e786) by Archit Kulkarni --- (serve-key-concepts)= # Key Concepts (serve-key-concepts-deployment)= ## Deployment Deployments are the central concept in Ray Serve. A deployment contains business logic or an ML model to handle incoming requests and can be scaled up to run across a Ray cluster. At runtime, a deployment consists of a number of *replicas*, which are individual copies of the class or function that are started in separate Ray Actors (processes). The number of replicas can be scaled up or down (or even autoscaled) to match the incoming request load. To define a deployment, use the {mod}`@serve.deployment ` decorator on a Python class (or function for simple use cases). Then, `bind` the deployment with optional arguments to the constructor to define an [application](serve-key-concepts-application). Finally, deploy the resulting application using `serve.run` (or the equivalent `serve run` CLI command, see [Development Workflow](serve-dev-workflow) for details). ```{literalinclude} ../serve/doc_code/key_concepts.py :start-after: __start_my_first_deployment__ :end-before: __end_my_first_deployment__ :language: python ``` (serve-key-concepts-application)= ## Application An application is the unit of upgrade in a Ray Serve cluster. An application consists of one or more deployments. One of these deployments is considered the [“ingress” deployment](serve-key-concepts-ingress-deployment), which handles all inbound traffic. Applications can be called via HTTP at the specified `route_prefix` or in Python using a `DeploymentHandle`. (serve-key-concepts-deployment-handle)= ## DeploymentHandle (composing deployments) Ray Serve enables flexible model composition and scaling by allowing multiple independent deployments to call into each other. When binding a deployment, you can include references to _other bound deployments_. Then, at runtime each of these arguments is converted to a {mod}`DeploymentHandle ` that can be used to query the deployment using a Python-native API. Below is a basic example where the `Ingress` deployment can call into two downstream models. For a more comprehensive guide, see the [model composition guide](serve-model-composition). ```{literalinclude} ../serve/doc_code/key_concepts.py :start-after: __start_deployment_handle__ :end-before: __end_deployment_handle__ :language: python ``` (serve-key-concepts-ingress-deployment)= ## Ingress deployment (HTTP handling) A Serve application can consist of multiple deployments that can be combined to perform model composition or complex business logic. 
However, one deployment is always the "top-level" one that is passed to `serve.run` to deploy the application. This deployment is called the "ingress deployment" because it serves as the entrypoint for all traffic to the application. Often, it then routes to other deployments or calls into them using the `DeploymentHandle` API, and composes the results before returning to the user. The ingress deployment defines the HTTP handling logic for the application. By default, the `__call__` method of the class is called and passed in a `Starlette` request object. The response will be serialized as JSON, but other `Starlette` response objects can also be returned directly. Here's an example: ```{literalinclude} ../serve/doc_code/key_concepts.py :start-after: __start_basic_ingress__ :end-before: __end_basic_ingress__ :language: python ``` After binding the deployment and running `serve.run()`, it is now exposed by the HTTP server and handles requests using the specified class. We can query the model using `requests` to verify that it's working. For more expressive HTTP handling, Serve also comes with a built-in integration with `FastAPI`. This allows you to use the full expressiveness of FastAPI to define more complex APIs: ```{literalinclude} ../serve/doc_code/key_concepts.py :start-after: __start_fastapi_ingress__ :end-before: __end_fastapi_ingress__ :language: python ``` ## What's next? Now that you have learned the key concepts, you can dive into these guides: - [Resource allocation](serve-resource-allocation) - [Autoscaling guide](serve-autoscaling) - [Configuring HTTP logic and integrating with FastAPI](http-guide) - [Development workflow for Serve applications](serve-dev-workflow) - [Composing deployments to perform model composition](serve-model-composition) --- (serve-llm-architecture-core)= # Core components This guide explains the technical implementation details of Ray Serve LLM's core components. You'll learn about the abstractions, protocols, and patterns that enable extensibility and modularity. ## Core abstractions Beyond `LLMServer` and `OpenAiIngress`, Ray Serve LLM defines several core abstractions that enable extensibility and modularity: ### LLMEngine protocol The `LLMEngine` abstract base class defines the contract for all inference engines. This abstraction allows Ray Serve LLM to support multiple engine implementations (vLLM, SGLang, TensorRT-LLM, etc.) with a consistent interface. The engine operates at the **OpenAI API level**, not at the raw prompt level. This means: - It accepts OpenAI-formatted requests (`ChatCompletionRequest`, `CompletionRequest`, etc.). - It returns OpenAI-formatted responses. - Engine-specific details (such as tokenization, sampling) are hidden behind this interface. #### Key methods ```python class LLMEngine(ABC): """Base protocol for all LLM engines.""" @abstractmethod async def chat( self, request: ChatCompletionRequest ) -> AsyncGenerator[Union[str, ChatCompletionResponse, ErrorResponse], None]: """Run a chat completion. Yields: - Streaming: yield "data: \\n\\n" for each chunk. - Non-streaming: yield single ChatCompletionResponse. - Error: yield ErrorResponse. - In all cases, it's still a generator to unify the upper-level logic. 
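
            For example, a caller can consume all three cases uniformly
            (illustrative sketch, not a specific Ray Serve LLM API):

                async for chunk in engine.chat(request):
                    ...  # str chunk, ChatCompletionResponse, or ErrorResponse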
""" @abstractmethod async def completions( self, request: CompletionRequest ) -> AsyncGenerator[Union[str, CompletionResponse, ErrorResponse], None]: """Run a text completion.""" @abstractmethod async def embeddings( self, request: EmbeddingRequest ) -> AsyncGenerator[Union[EmbeddingResponse, ErrorResponse], None]: """Generate embeddings.""" @abstractmethod async def start(self): """Start the engine (async initialization).""" @abstractmethod async def check_health(self) -> bool: """Check if engine is healthy.""" @abstractmethod async def shutdown(self): """Gracefully shutdown the engine.""" ``` #### Engine implementations Ray Serve LLM provides: - **VLLMEngine**: Production-ready implementation using vLLM. - Supports continuous batching and paged attention. - Supports all kinds of parallelism. - KV cache transfer for prefill-decode disaggregation. - Automatic prefix caching (APC). - LoRA adapter support. Future implementations could include: - **TensorRT-LLM**: NVIDIA's optimized inference engine. - **SGLang**: Fast serving with RadixAttention. Ray Serve LLM deeply integrates with vLLM since it has end-to-end Ray support in the engine, which gives benefits in fine-grained placement of workers and other optimizations. The engine abstraction makes it straightforward to add new implementations without changing the core serving logic. ### LLMConfig `LLMConfig` is the central configuration object that specifies everything needed to deploy an LLM: ```python @dataclass class LLMConfig: """Configuration for LLM deployment.""" # Model loading model_loading_config: Union[dict, ModelLoadingConfig] # Hardware requirements accelerator_type: Optional[str] = None # For example, "A10G", "L4", "H100" # Placement group configuration placement_group_config: Optional[dict] = None # Engine-specific arguments engine_kwargs: Optional[dict] = None # Ray Serve deployment configuration deployment_config: Optional[dict] = None # LoRA adapter configuration lora_config: Optional[Union[dict, LoraConfig]] = None # Runtime environment (env vars, pip packages) runtime_env: Optional[dict] = None ``` #### Model loading configuration The `ModelLoadingConfig` specifies where and how to load the model. The following code shows the configuration structure: ```python @dataclass class ModelLoadingConfig: """Configuration for model loading.""" # Model identifier (used for API requests) model_id: str # Model source (HuggingFace or cloud storage) model_source: Union[str, dict] # Examples: # - "Qwen/Qwen2.5-7B-Instruct" (HuggingFace) # - {"bucket_uri": "s3://my-bucket/models/qwen-7b"} (S3) ``` #### LoRA configuration The following code shows the configuration structure for serving multiple LoRA adapters with a shared base model: ```python @dataclass class LoraConfig: """Configuration for LoRA multiplexing.""" # Path to LoRA weights (local or S3/GCS) dynamic_lora_loading_path: Optional[str] = None # Maximum number of adapters per replica max_num_adapters_per_replica: int = 1 ``` Ray Serve's multiplexing feature automatically routes requests to replicas that have the requested LoRA adapter loaded, using an LRU cache for adapter management. ### Deployment protocols Ray Serve LLM defines two key protocols that components must implement: #### DeploymentProtocol The base protocol for all deployments: ```python class DeploymentProtocol(Protocol): """Base protocol for Ray Serve LLM deployments.""" @classmethod def get_deployment_options(cls, *args, **kwargs) -> dict: """Return Ray Serve deployment options. 
        Returns:
            dict: Options including:
                - placement_strategy: PlacementGroup configuration
                - num_replicas: Initial replica count
                - autoscaling_config: Autoscaling parameters
                - ray_actor_options: Ray actor options
        """
```

This protocol ensures that all deployments can provide their own configuration for placement, scaling, and resources.

#### LLMServerProtocol

Extended protocol for LLM server deployments:

```python
class LLMServerProtocol(DeploymentProtocol):
    """Protocol for LLM server deployments."""

    @abstractmethod
    async def chat(
        self,
        request: ChatCompletionRequest,
        raw_request: Optional[Request] = None
    ) -> AsyncGenerator[Union[str, ChatCompletionResponse, ErrorResponse], None]:
        """Handle chat completion request."""

    @abstractmethod
    async def completions(
        self,
        request: CompletionRequest,
        raw_request: Optional[Request] = None
    ) -> AsyncGenerator[Union[str, CompletionResponse, ErrorResponse], None]:
        """Handle text completion request."""

    @abstractmethod
    async def embeddings(
        self,
        request: EmbeddingRequest,
        raw_request: Optional[Request] = None
    ) -> AsyncGenerator[Union[EmbeddingResponse, ErrorResponse], None]:
        """Handle embedding request."""
```

This protocol ensures that all LLM server implementations (`LLMServer`, `DPServer`, `PDProxyServer`) provide consistent methods for handling requests.

## Builder pattern

Ray Serve LLM uses the builder pattern to separate class definition from deployment decoration. This provides flexibility and testability.

**Key principle**: Classes aren't decorated with `@serve.deployment`. Decoration happens in builder functions.

### Why use builders?

Builders provide two key benefits:

1. **Flexibility**: Different deployment configurations for the same class.
2. **Production readiness**: You can use builders in YAML files and run `serve run config.yaml` with the target builder module.

### Builder example

```python
def my_build_function(
    llm_config: LLMConfig,
    **kwargs,
) -> Deployment:
    # Get default options from the class
    serve_options = LLMServer.get_deployment_options(llm_config)
    # Merge with user-provided options
    serve_options.update(kwargs)
    # Decorate and bind
    return serve.deployment(LLMServer).options(
        **serve_options
    ).bind(llm_config)
```

You can use the builder function in two ways:

::::{tab-set}

:::{tab-item} Python
:sync: python

```python
# serve.py
from ray import serve
from ray.serve.llm import LLMConfig

from my_module import my_build_function

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="qwen-0.5b",
        model_source="Qwen/Qwen2.5-0.5B-Instruct",
    ),
    accelerator_type="A10G",
    deployment_config=dict(
        autoscaling_config=dict(
            min_replicas=1,
            max_replicas=2,
        )
    ),
)

app = my_build_function(llm_config)
serve.run(app)
```

Run the deployment:

```bash
python serve.py
```

:::

:::{tab-item} YAML
:sync: yaml

```yaml
# config.yaml
applications:
- args:
    llm_config:
      model_loading_config:
        model_id: qwen-0.5b
        model_source: Qwen/Qwen2.5-0.5B-Instruct
      accelerator_type: A10G
      deployment_config:
        autoscaling_config:
          min_replicas: 1
          max_replicas: 2
  import_path: my_module:my_build_function
  name: custom_llm_deployment
  route_prefix: /
```

Run the deployment:

```bash
serve run config.yaml
```

:::

::::

## Async constructor pattern

`LLMServer` uses an async constructor to handle engine initialization. This pattern ensures the engine is fully started before the deployment begins serving requests.
```python class LLMServer(LLMServerProtocol): """LLM server deployment.""" async def __init__(self, llm_config: LLMConfig, **kwargs): """Async constructor - returns fully started instance. Ray Serve calls this constructor when creating replicas. By the time this returns, the engine is ready to serve. """ super().__init__() self._init_shared(llm_config, **kwargs) await self.start() # Start engine immediately def _init_shared(self, llm_config: LLMConfig, **kwargs): """Shared initialization logic.""" self._llm_config = llm_config self._engine_cls = self._get_engine_class() # ... other initialization async def start(self): """Start the underlying engine.""" self.engine = self._engine_cls(self._llm_config) await asyncio.wait_for( self._start_engine(), timeout=600 ) @classmethod def sync_init(cls, llm_config: LLMConfig, **kwargs) -> "LLMServer": """Sync constructor for testing. Returns unstarted instance. Caller must call await start(). """ instance = cls.__new__(cls) LLMServerProtocol.__init__(instance) instance._init_shared(llm_config, **kwargs) return instance # Not started yet! ``` ### Why use async constructors? Async constructors provide several benefits: 1. **Engine initialization is async**: Loading models and allocating GPU memory takes time. 2. **Failure detection**: If the engine fails to start, the replica fails immediately. 3. **Explicit control**: Clear distinction between when the server is ready versus initializing. 4. **Testing flexibility**: `sync_init` allows testing without engine startup. ## Component relationships The following diagram shows how core components relate to each other: ``` ┌─────────────────────────────────────────────────────────┐ │ RAY SERVE (Foundation) │ │ @serve.deployment | DeploymentHandle | Routing │ └────────────────────────┬────────────────────────────────┘ │ ┌──────────────────┼──────────────────┐ │ │ │ ▼ ▼ ▼ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ Protocol │ │ Ingress │ │ Config │ │ │ │ │ │ │ │ • Deploy │ │ • OpenAI │ │ • LLM │ │ Proto │ │ API │ │ Config │ │ • Server │ │ • Model │ │ • Model │ │ Proto │ │ Routing│ │ Loading│ └─────┬────┘ └────┬─────┘ └────┬─────┘ │ │ │ └────────┬───────┴────────────────────┘ │ ▼ ┌─────────────┐ │ LLMServer │ │ │ │ Implements: │ │ • Protocol │ │ │ │ Uses: │ │ • Config │ │ • Engine │ └──────┬──────┘ │ ▼ ┌─────────────┐ │ LLMEngine │ │ (Protocol) │ │ │ │ Implemented │ │ by: │ │ • VLLMEngine│ │ • Future... │ └─────────────┘ ``` ## Extension points The core architecture provides several extension points: ### Custom engines Implement `LLMEngine` protocol to support new inference backends: ```python class MyCustomEngine(LLMEngine): """Custom engine implementation.""" async def chat(self, request): # Your implementation pass # ... implement other methods ``` ### Custom server implementations Extend `LLMServer` or implement `LLMServerProtocol` directly: ```python class CustomLLMServer(LLMServer): """Custom server with additional features.""" async def chat(self, request, raw_request=None): # Add custom preprocessing modified_request = self.preprocess(request) # Call parent implementation async for chunk in super().chat(modified_request, raw_request): yield chunk ``` ### Custom ingress Implement your own ingress for custom API formats: ```python from typing import List from ray import serve from ray.serve import DeploymentHandle # Define your FastAPI app or Ray Serve application. 
# For example, a FastAPI app:
from fastapi import FastAPI

app = FastAPI()


@serve.ingress(app)
class CustomIngress:
    """Custom ingress with non-OpenAI API."""

    def __init__(self, server_handles: List[DeploymentHandle]):
        self.handles = server_handles

    @app.post("/custom/endpoint")
    async def custom_endpoint(self, request: "CustomRequest"):
        # CustomRequest is a user-defined request model.
        # Your custom logic
        pass
```

### Custom builders

Create domain-specific builders for common patterns:

```python
def build_multimodal_deployment(
    model_config: dict,
    **kwargs
) -> Deployment:
    """Builder for multimodal models."""
    llm_config = LLMConfig(
        model_loading_config={
            "input_modality": InputModality.MULTIMODAL,
            **model_config
        },
        engine_kwargs={
            "task": "multimodal",
        }
    )
    return build_llm_deployment(llm_config, **kwargs)
```

These extension points allow you to customize Ray Serve LLM for specific use cases without modifying core code.

## See also

- {doc}`overview` - High-level architecture overview
- {doc}`serving-patterns/index` - Detailed serving pattern documentation
- {doc}`routing-policies` - Request routing architecture
- {doc}`../user-guides/index` - Practical deployment guides

---

# Architecture

Technical documentation for Ray Serve LLM architecture, components, and patterns.

```{toctree}
:maxdepth: 1

Architecture overview
Core components
Serving patterns
Request routing
```

---

(serve-llm-architecture-overview)=

# Architecture overview

Ray Serve LLM is a framework that specializes Ray Serve primitives for distributed LLM serving workloads. This guide explains the core components, serving patterns, and routing policies that enable scalable and efficient LLM inference.

## What Ray Serve LLM provides

Ray Serve LLM takes the performance of a single inference engine (such as vLLM) and extends it to support:

- **Horizontal scaling**: Replicate inference across multiple GPUs on the same node or across nodes.
- **Advanced distributed strategies**: Coordinate multiple engine instances for prefill-decode disaggregation, data parallel attention, and expert parallelism.
- **Modular deployment**: Separate infrastructure logic from application logic for clean, maintainable deployments.

Ray Serve LLM excels at highly distributed multi-node inference workloads where the unit of scale spans multiple nodes:

- **Pipeline parallelism across nodes**: Serve large models that don't fit on a single node.
- **Disaggregated prefill and decode**: Scale prefill and decode phases independently for better resource utilization.
- **Cluster-wide parallelism**: Combine data parallel attention with expert parallelism for serving large-scale sparse MoE architectures such as Deepseek-v3 and GPT OSS.

## Ray Serve primitives

Before diving into the architecture, you should understand these Ray Serve primitives:

- **Deployment**: A class that defines the unit of scale.
- **Replica**: An instance of a deployment, which corresponds to a Ray actor. Multiple replicas can be distributed across a cluster.
- **Deployment handle**: An object that allows one replica to call into replicas of other deployments.

For more details, see the {ref}`Ray Serve core concepts `.

## Core components

Ray Serve LLM provides two primary components that work together to serve LLM workloads:

### LLMServer

`LLMServer` is a Ray Serve _deployment_ that manages a single inference engine instance. _Replicas_ of this _deployment_ can operate in three modes:

- **Isolated**: Each _replica_ handles requests independently (horizontal scaling).
- **Coordinated within deployment**: Multiple _replicas_ work together (data parallel attention). - **Coordinated across deployments**: Replicas coordinate with different deployments (prefill-decode disaggregation). The following example demonstrates the sketch of how to use `LLMServer` standalone: ```python from ray import serve from ray.serve.llm import LLMConfig from ray.serve.llm.deployment import LLMServer llm_config = LLMConfig(...) # Get deployment options (placement groups, etc.) serve_options = LLMServer.get_deployment_options(llm_config) # Decorate with serve options server_cls = serve.deployment(LLMServer).options( stream=True, **serve_options) # Bind the decorated class to its constructor parameters server_app = server_cls.bind(llm_config) # Run the application serve_handle = serve.run(server_app) # Use the deployment handle result = serve_handle.chat.remote(request=...).result() ``` #### Physical placement `LLMServer` controls physical placement of its constituent actors through placement groups. By default, it uses: - `{CPU: 1}` for the replica actor itself (no GPU resources). - `world_size` number of `{GPU: 1}` bundles for the GPU workers. The `world_size` is computed as `tensor_parallel_size × pipeline_parallel_size`. The vLLM engine allocates TP and PP ranks based on bundle proximity, prioritizing TP ranks on the same node. The PACK strategy tries to place all resources on a single node, but provisions different nodes when necessary. This works well for most deployments, though heterogeneous model deployments might occasionally run TP across nodes. ```{figure} ../images/placement.png --- width: 600px name: placement --- Physical placement strategy for GPU workers ``` #### Engine management When `LLMServer` starts, it: 1. Creates a vLLM engine client. 2. Spawns a background process that uses Ray's distributed executor backend. 3. Uses the parent actor's placement group to instantiate child GPU worker actors. 4. Executes the model's forward pass on these GPU workers. ```{figure} ../images/llmserver.png --- width: 600px name: llmserver --- Illustration of `LLMServer` managing vLLM engine instance. ``` ### OpenAiIngress `OpenAiIngress` provides an OpenAI-compatible FastAPI ingress that routes traffic to the appropriate model. It handles: - **Standard endpoint definitions**: `/v1/chat/completions`, `/v1/completions`, `/v1/embeddings`, etc. - **Request routing logic**: The execution of custom router logic (for example, prefix-aware or session-aware routing). - **Model multiplexing**: LoRA adapter management and routing. The following example shows a complete deployment with `OpenAiIngress`: ```python from ray import serve from ray.serve.llm import LLMConfig from ray.serve.llm.deployment import LLMServer from ray.serve.llm.ingress import OpenAiIngress, make_fastapi_ingress llm_config = LLMConfig(...) 
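
# `LLMConfig(...)` above stands in for a real configuration. For illustration
# only (values borrowed from the quickstart; adjust for your model and
# hardware):
#
#     llm_config = LLMConfig(
#         model_loading_config=dict(
#             model_id="qwen-0.5b",
#             model_source="Qwen/Qwen2.5-0.5B-Instruct",
#         ),
#         accelerator_type="A10G",
#     )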
# Construct the LLMServer deployment serve_options = LLMServer.get_deployment_options(llm_config) server_cls = serve.deployment(LLMServer).options(**serve_options) llm_server = server_cls.bind(llm_config) # Get ingress default options ingress_options = OpenAiIngress.get_deployment_options([llm_config]) # Decorate with FastAPI app ingress_cls = make_fastapi_ingress(OpenAiIngress) # Make it a serve deployment with the right options ingress_cls = serve.deployment(ingress_cls, **ingress_options) # Bind with llm_server deployment handle ingress_app = ingress_cls.bind([llm_server]) # Run the application serve.run(ingress_app) ``` :::{note} You can create your own ingress deployments and connect them to existing LLMServer deployments. This is useful when you want to customize request tracing, authentication layers, etc. ::: #### Network topology and RPC patterns When the ingress makes an RPC call to `LLMServer` through the deployment handle, it can reach any replica across any node. However, the default request router prioritizes replicas on the same node to minimize cross-node RPC overhead, which is insignificant in LLM serving applications (only a few milliseconds impact on TTFT at high concurrency). The following figure illustrates the data flow: ```{figure} ../images/llmserver-ingress-rpc.png --- width: 600px name: llmserver-ingress-rpc --- Request routing from ingress to LLMServer replicas. Solid lines represent preferred local RPC calls; dashed lines represent potential cross-node RPC calls when local replicas are busy. ``` #### Scaling considerations **Ingress-to-LLMServer ratio**: The ingress event loop can become the bottleneck at high concurrency. In such situations, upscaling the number of ingress replicas can mitigate CPU contention. We recommend keeping at least a 2:1 ratio between the number of ingress replicas and LLMServer replicas. This architecture allows the system to dynamically scale the component that is the bottleneck. **Autoscaling coordination**: To maintain proper ratios during autoscaling, configure `target_ongoing_requests` proportionally: - Profile your vLLM configuration to find the maximum concurrent requests (for example, 64 requests). - Choose an ingress-to-LLMServer ratio (for example, 2:1). - Set LLMServer's `target_ongoing_requests` to say 75% of max capacity (for example, 48). - Set ingress's `target_ongoing_requests` to maintain the ratio (for example, 24). ## Architecture patterns Ray Serve LLM supports several deployment patterns for different scaling scenarios: ### Data parallel attention pattern Create multiple inference engine instances that process requests in parallel while coordinating across expert layers and sharding requests across attention layers. Useful for serving sparse MoE models for high-throughput workloads. **When to use**: High request volume, kv-cache limited, need to maximize throughput. See: {doc}`serving-patterns/data-parallel` ### Prefill-decode disaggregation Separate prefill and decode phases to optimize resource utilization and scale each phase independently. **When to use**: Prefill-heavy workloads where there's tension between prefill and decode, cost optimization with different GPU types. See: {doc}`serving-patterns/prefill-decode` ### Custom request routing Implement custom routing logic for specific optimization goals such as cache locality or session affinity. **When to use**: Workloads with repeated prompts, session-based interactions, or specific routing requirements. 
See: {doc}`routing-policies` ## Design principles Ray Serve LLM follows these key design principles: 1. **Engine-agnostic**: Support multiple inference engines (vLLM, SGLang, etc.) through the `LLMEngine` protocol. 2. **Composable patterns**: Combine serving patterns (data parallel attention, prefill-decode, custom routing) for complex deployments. 3. **Builder pattern**: Use builders to construct complex deployment graphs declaratively. 4. **Separation of concerns**: Keep infrastructure logic (placement, scaling) separate from application logic (routing, processing). 5. **Protocol-based extensibility**: Define clear protocols for engines, servers, and ingress to enable custom implementations. ## See also - {doc}`core` - Technical implementation details and extension points - {doc}`serving-patterns/index` - Detailed serving pattern documentation - {doc}`routing-policies` - Request routing architecture and patterns - {doc}`../user-guides/index` - Practical deployment guides --- # Request routing Ray Serve LLM provides customizable request routing to optimize request distribution across replicas for different workload patterns. Request routing operates at the **replica selection level**, distinct from ingress-level model routing. ## Routing versus ingress You need to distinguish between two levels of routing: **Ingress routing** (model-level): - Maps `model_id` to deployment - Example: `OpenAiIngress` gets `/v1/chat/completions` with `model="gptoss"` and maps it to the `gptoss` deployment. **Request routing** (replica-level): - Chooses which replica to send the request to - Example: The `gptoss` deployment handle inside the `OpenAiIngress` replica decides which replica of the deployment (1, 2, or 3) to send the request to. This document focuses on **request routing** (replica selection). ``` HTTP Request → Ingress (model routing) → Request Router (replica selection) → Server Replica ``` ## Request routing architecture Ray Serve LLM request routing operates at the deployment handle level: ``` ┌──────────────┐ │ Ingress │ │ (Replica 1) │ └──────┬───────┘ │ │ handle.remote(request) ↓ ┌──────────────────┐ │ Deployment Handle│ │ + Router │ ← Request routing happens here └──────┬───────────┘ │ │ Chooses replica based on policy ↓ ┌───┴────┬────────┬────────┐ │ │ │ │ ┌──▼──┐ ┌──▼──┐ ┌──▼──┐ ┌──▼──┐ │ LLM │ │ LLM │ │ LLM │ │ LLM │ │ 1 │ │ 2 │ │ 3 │ │ 4 │ └─────┘ └─────┘ └─────┘ └─────┘ ``` ## Available routing policies Ray Serve LLM provides multiple request routing policies to optimize for different workload patterns: ### Default routing: Power of Two Choices The default router uses the Power of Two Choices algorithm to: 1. Randomly sample two replicas. 2. Route to the replica with fewer ongoing requests. This provides good load balancing with minimal coordination overhead. ### Prefix-aware routing The `PrefixCacheAffinityRouter` optimizes for workloads with shared prefixes by routing requests with similar prefixes to the same replicas. This improves KV cache hit rates in vLLM's Automatic Prefix Caching (APC). The routing strategy: 1. **Check load balance**: If replicas are balanced (queue difference < threshold), use prefix matching. 2. **High match rate (≥10%)**: Route to replicas with highest prefix match. 3. **Low match rate (<10%)**: Route to replicas with lowest cache utilization. 4. **Fallback**: Use Power of Two Choices when load is imbalanced. For more details, see {ref}`prefix-aware-routing-guide`. 
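
To make the decision flow above concrete, the following sketch restates it in plain Python. The function name, its inputs (`queue_len`, `prefix_match`, `cache_util`), and the thresholds are illustrative assumptions, not the `PrefixCacheAffinityRouter` implementation or its API.

```python
import random


def choose_replica(replicas, queue_len, prefix_match, cache_util,
                   imbalance_threshold=4, match_threshold=0.10):
    """Illustrative sketch of the prefix-aware routing decision flow.

    `queue_len`, `prefix_match`, and `cache_util` map each replica to its
    ongoing request count, prefix match rate for the incoming request, and
    KV cache utilization. All names and thresholds here are hypothetical.
    """
    # 1. Check load balance: fall back to Power of Two Choices when the
    #    queue-length spread across replicas is too large.
    if max(queue_len.values()) - min(queue_len.values()) > imbalance_threshold:
        a, b = random.sample(replicas, 2)
        return min((a, b), key=lambda r: queue_len[r])

    # 2. High match rate: route to the replica with the highest prefix match.
    best_match = max(replicas, key=lambda r: prefix_match[r])
    if prefix_match[best_match] >= match_threshold:
        return best_match

    # 3. Low match rate: route to the replica with the lowest cache utilization.
    return min(replicas, key=lambda r: cache_util[r])
```

For example, `choose_replica(["r1", "r2"], {"r1": 3, "r2": 5}, {"r1": 0.4, "r2": 0.0}, {"r1": 0.8, "r2": 0.2})` returns `"r1"`: the load is balanced, so the replica with the high prefix match wins.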
## Design patterns for custom routing policies Customizing request routers is a feature in Ray Serve's native APIs that you can define per deployment. For each deployment, you can customize the routing logic that executes every time you call `.remote()` on the deployment handle from a caller. Because deployment handles are globally available objects across the cluster, you can call them from any actor or task in the Ray cluster. For more details on this API, see {ref}`custom-request-router-guide`. This allows you to run the same routing logic even if you have multiple handles. The default request router in Ray Serve is Power of Two Choices, which balances load equalization and prioritizes locality routing. However, you can customize this to use LLM-specific metrics. Ray Serve LLM includes prefix-aware routing in the framework. There are two common architectural patterns for customizing request routers. There are clear trade-offs between them, so choose the suitable one and balance simplicity with performance: ### Pattern 1: Centralized singleton metric store In this approach, you keep a centralized metric store (for example, a singleton actor) for tracking routing-related information. The request router logic physically runs on the process that owns the deployment handle, so there can be many such processes. Each one can query the singleton actor, creating a multi-tenant actor that provides a consistent view of the cluster state to the request routers. The single actor can provide atomic thread-safe operations such as `get()` for querying the global state and `set()` for updating the global state, which the router can use during `choose_replicas()` and `on_request_routed()`. ``` ┌─────────┐ ┌─────────┐ ┌─────────┐ │ Ingress │────►│ Metric │◄────│ Ingress │ │ 1 │ │ Store │ │ 2 │ └────┬────┘ └─────────┘ └────┬────┘ │ │ └────────────────┬──────────────┘ │ ┌──────────┴──────────┐ │ │ ┌────▼────┐ ┌────▼────┐ │ LLM │ │ LLM │ │ Server │ │ Server │ └─────────┘ └─────────┘ ``` ```{figure} ../images/routing_centralized_store.png --- width: 600px name: centralized_metric_store_pattern --- Centralized metric store pattern for custom routing ``` **Pros:** - Simple implementation - no need to modify deployment logic for recording replica statistics. - Request metrics are immediately available. - Strong consistency guarantees. **Cons:** - A single actor can become a bottleneck in high-throughput applications where TTFT is impacted by the RPC call (~1000s of requests/s). - Requires an additional network hop for every routing decision. ### Pattern 2: Metrics broadcasted from Serve controller In this approach, the Serve controller polls each replica for local statistics and then broadcasts them to all request routers on their deployment handles. The request router can then use this globally broadcasted information to pick the right replica. After a request reaches the replica, the replica updates its local statistics so it can send them back to the Serve controller when the controller polls it next time. 
``` ┌──────────────┐ │ Serve │ │ Controller │ └──────┬───────┘ │ (broadcast) ┌─────────┴─────────┐ │ │ ┌────▼────┐ ┌────▼────┐ │ Ingress │ │ Ingress │ │ +Cache │ │ +Cache │ └────┬────┘ └────┬────┘ │ │ └────────┬──────────┘ │ ┌──────┴──────┐ │ │ ┌────▼────┐ ┌────▼────┐ │ LLM │ │ LLM │ │ Server │ │ Server │ └─────────┘ └─────────┘ ``` ```{figure} ../images/routing_broadcast_metrics.png --- width: 600px name: broadcast_metrics_pattern --- Broadcast metrics pattern for custom routing ``` **Pros:** - Scalable to higher throughput. - No additional RPC overhead per routing decision. - Distributed routing decision making. **Cons:** - Time lag between the request router's view of statistics and the ground truth state of the replicas. - Eventual consistency - routers may base decisions on slightly stale data. - More complex implementation requiring coordination with the Serve controller. - **Use Pattern 1 (Centralized store)** when you need strong consistency, have moderate throughput requirements, or want simpler implementation. - **Use Pattern 2 (Broadcast metrics)** when you need very high throughput, can tolerate eventual consistency, or want to minimize per-request overhead. ## Custom routing policies You can implement custom routing policies by extending Ray Serve's [`RequestRouter`](../../api/doc/ray.serve.request_router.RequestRouter.rst) base class. For detailed examples and step-by-step guides on implementing custom routers, see {ref}`custom-request-router-guide`. Key methods to implement: - [`choose_replicas()`](../../api/doc/ray.serve.request_router.RequestRouter.choose_replicas.rst): Select which replicas should handle a request. - [`on_request_routed()`](../../api/doc/ray.serve.request_router.RequestRouter.on_request_routed.rst): Update the router state after a request is routed. - [`on_replica_actor_died()`](../../api/doc/ray.serve.request_router.RequestRouter.on_replica_actor_died.rst): Clean up the state when a replica dies. ### Utility mixins Ray Serve provides mixin classes that add common functionality to routers. See the {ref}`custom-request-router-guide` for examples: - [`LocalityMixin`](../../api/doc/ray.serve.request_router.LocalityMixin.rst): Prefers replicas on the same node to reduce network latency. - [`MultiplexMixin`](../../api/doc/ray.serve.request_router.MultiplexMixin.rst): Tracks which models are loaded on each replica for LoRA deployments. - [`FIFOMixin`](../../api/doc/ray.serve.request_router.FIFOMixin.rst): Ensures FIFO ordering of requests. ### Router lifecycle The typical lifecycle of request routers includes the following stages: 1. **Initialization**: Router created with list of replicas. 2. **Request routing**: `choose_replicas()` called for each request. 3. **Callback**: `on_request_routed()` called after successful routing. 4. **Replica failure**: `on_replica_actor_died()` called when replica dies. 5. **Cleanup**: Router cleaned up when deployment is deleted. #### Async operations Routers should use async operations for best performance. The following example demonstrates the recommended pattern: ```python # Recommended pattern: Async operation async def choose_replicas(self, ...): state = await self.state_actor.get.remote() return self._select(state) # Not recommended pattern: Blocking operation async def choose_replicas(self, ...): state = ray.get(self.state_actor.get.remote()) # Blocks! return self._select(state) ``` #### State management For routers with state, use appropriate synchronization. 
The following example shows the recommended pattern: ```python class StatefulRouter(RequestRouter): def __init__(self): self.lock = asyncio.Lock() # For async code self.state = {} async def choose_replicas(self, ...): async with self.lock: # Protect shared state # Update state self.state[...] = ... return [...] ``` ## See also - {ref}`prefix-aware-routing-guide` - user guide for deploying prefix-aware routing - {ref}`custom-request-router-guide` - Ray Serve guide for implementing custom routers - [`RequestRouter` API Reference](../../api/doc/ray.serve.request_router.RequestRouter.rst) - complete API documentation --- (serve-llm-architecture-data-parallel)= # Data parallel attention Data parallel attention (DP) is a serving pattern that creates multiple inference engine instances to process requests in parallel. This pattern is most useful when you combine it with expert parallelism for sparse MoE models. In this case, the experts are parallelized across multiple machines and attention (QKV) layers are replicated across GPUs, providing an opportunity to shard across requests. In this serving pattern, engine replicas aren't isolated. In fact, they need to run in sync with each other to serve a large number of requests concurrently. ## Architecture overview ```{figure} ../../images/dp.png --- width: 700px name: dp-architecture --- Data parallel attention architecture showing DPRankAssigner coordinating multiple LLMServer replicas. ``` In data parallel attention serving: - The system creates `dp_size` replicas of the LLM server. - Each replica runs an independent inference engine with the same model. - Requests are distributed across replicas through Ray Serve's routing. - All replicas work together as a cohesive unit. ### When to use DP Data parallel attention serving works best when: - **Large sparse MoE with MLA**: Allows reaching larger batch sizes by utilizing the sparsity of the experts more efficiently. MLA (Multi-head Latent Attention) reduces KV cache memory requirements. - **High throughput required**: You need to serve many concurrent requests. - **KV-cache limited**: Adding more KV cache capacity increases throughput, so that parallelization of experts could effectively increase the capacity of KV-cache for handling concurrent requests. ### When not to use DP Consider alternatives when: - **Low to medium throughput**: If you can't saturate the MoE layers, don't use DP. - **Non-MLA Attention with sufficient TP**: DP is most beneficial with MLA (Multi-head Latent Attention), where KV cache can't be sharded along the head dimension. For models with GQA (Grouped Query Attention), you can use TP to shard the KV cache up to the degree where `TP_size <= num_kv_heads`. Beyond that point, TP requires KV cache replication, which wastes memory—DP becomes a better choice to avoid duplication. For example, for Qwen-235b, using `DP=2, TP=4, EP=8` makes more sense than `DP=8, EP=8` because you can still shard the KV cache with TP=4 before needing to replicate it. Benchmark these configurations with your workload to determine the optimal setup. - **Non-MoE models**: The main reason for using DP at the cost of this complexity is to lift the effective batch size during decoding for saturating the experts. ## Components The following are the main components of DP deployments: ### DPServer `DPServer` extends `LLMServer` with data parallel attention coordination. 
The following pseudocode shows the structure: ```python from ray import serve class DPServer(LLMServer): """LLM server with data parallel attention coordination.""" async def __init__( self, llm_config: LLMConfig, rank_assigner_handle: DeploymentHandle, dp_size: int, **kwargs ): self.rank_assigner = rank_assigner_handle self.dp_size = dp_size # Get assigned rank from coordinator and pass it to engine. replica_id = serve.get_replica_context().replica_id llm_config.rank = await self.rank_assigner.assign_rank.remote(replica_id) # Call parent initialization await super().__init__(llm_config, **kwargs) ``` Key responsibilities: - Register with the rank assigner coordinator. - Obtain a unique rank (0 to `dp_size-1`). - Coordinate with other replicas for collective operations. - Handle replica failures and re-registration. ### DPRankAssigner `DPRankAssigner` is a singleton coordinator that manages rank assignment for data parallel attention replicas. The following pseudocode shows the structure: ```python class DPRankAssigner: """Coordinator for data parallel attention rank assignment.""" def __init__(self, dp_size: int): self.dp_size = dp_size self.assigned_ranks: Set[int] = set() self.rank_to_replica: Dict[int, str] = {} self.lock = asyncio.Lock() async def assign_rank(self, replica_id: str) -> int: """Assign a rank to a replica. Returns: int: Assigned rank (0 to dp_size-1) """ async with self.lock: # Find first available rank for rank in range(self.dp_size): if rank not in self.assigned_ranks: self.assigned_ranks.add(rank) self.rank_to_replica[rank] = replica_id return rank async def release_rank(self, rank: int): """Release a rank when replica dies.""" async with self.lock: self.assigned_ranks.discard(rank) self.rank_to_replica.pop(rank, None) ``` Key responsibilities: - Assign unique ranks to replicas. - Ensure exactly `dp_size` replicas are serving. ## Request flow ```{figure} ../../images/dp_flow.png --- width: 700px name: dp-flow --- Data parallel attention request flow from client to distributed replicas. ``` The following is the request flow through a data parallel attention deployment: 1. **Client request**: HTTP request arrives at ingress. 2. **Ingress routing**: Ingress uses deployment handle to call DPServer. 3. **Ray Serve routing**: Ray Serve's request router selects a replica. - Default: Power of Two Choices (load balancing). - Custom: Prefix-aware, session-aware, etc. 4. **Replica processing**: Selected DPServer replica processes request. 5. **Engine inference**: vLLM engine generates response. 6. **Streaming response**: Tokens stream back to client. The key difference from basic serving is that all the `dp_size` replicas are working in coordination with each other rather than in isolation. ## Scaling ### Scaling behavior Data parallel attention deployments require a fixed number of replicas equal to `dp_size`, as autoscaling isn't supported for this pattern. You must set `num_replicas` to `dp_size`, or if using `autoscaling_config`, both `min_replicas` and `max_replicas` must equal `dp_size`. ## Design considerations ### Coordination overhead The `DPRankAssigner` introduces minimal coordination overhead: - **Startup**: Each replica makes one RPC to get its rank. - **Runtime**: No coordination overhead during request processing. The singleton actor pattern ensures consistency during startup time. ### Placement strategy The PACK strategy places each replica's resources together: - Tensor parallel workers for one replica pack on the same node when possible. 
- Different replicas can be on different nodes. - This minimizes inter-node communication within each replica. ## Combining with other patterns ### DP + Prefill-decode disaggregation You can run data parallel attention on both prefill and decode phases: ``` ┌─────────────────────────────────────────────┐ │ OpenAiIngress │ └─────────────┬───────────────────────────────┘ │ ▼ ┌─────────────┐ │PDProxyServer│ └──┬───────┬──┘ │ │ ┌─────┘ └─────┐ ▼ ▼ ┌──────────┐ ┌──────────┐ │ Prefill │ │ Decode │ │ DP-2 │ │ DP-4 │ │ │ │ │ │ Replica0 │ │ Replica0 │ │ Replica1 │ │ Replica1 │ └──────────┘ │ Replica2 │ │ Replica3 │ └──────────┘ ``` ## See also - {doc}`../overview` - High-level architecture overview - {doc}`../core` - Core components and protocols - {doc}`prefill-decode` - Prefill-decode disaggregation architecture - {doc}`../routing-policies` - Request routing architecture --- # Serving patterns Architecture documentation for distributed LLM serving patterns. ```{toctree} :maxdepth: 1 Data parallel attention Prefill-decode disaggregation ``` ## Overview Ray Serve LLM supports several serving patterns that can be combined for complex deployment scenarios: - **Data parallel attention**: Scale throughput by running multiple coordinated engine instances that shard requests across attention layers. - **Prefill-decode disaggregation**: Optimize resource utilization by separating prompt processing from token generation. These patterns are composable and can be mixed to meet specific requirements for throughput, latency, and cost optimization. --- (serve-llm-architecture-prefill-decode)= # Prefill-decode disaggregation Prefill-decode (PD) disaggregation is a serving pattern that separates the prefill phase (processing input prompts) from the decode phase (generating tokens). This pattern was first pioneered in [DistServe](https://hao-ai-lab.github.io/blogs/distserve/) and optimizes resource utilization by scaling each phase independently based on its specific requirements. ## Architecture overview ```{figure} ../../images/pd_arch.png --- width: 700px name: pd-architecture --- Prefill-decode disaggregation architecture with PDProxyServer coordinating prefill and decode deployments. ``` In prefill-decode disaggregation: - **Prefill deployment**: Processes input prompts and generates initial KV cache. - **Decode deployment**: Uses transferred KV cache to generate output tokens. - **Independent scaling**: Each phase scales based on its own load. - **Resource optimization**: Different engine configurations for different phases. ## Why disaggregate? ### Resource characteristics Prefill and decode have different computational patterns: | Phase | Characteristics | Resource Needs | |-------|----------------|----------------| | Prefill | Processes the entire prompt at once | High compute, lower memory | | | Parallel token processing | Benefits from high FLOPS | | | Short duration per request | Can use fewer replicas when decode-limited | | Decode | Generates one token at a time | Lower compute, high memory | | | Auto-regressive generation | Benefits from large batch sizes | | | Long duration (many tokens) | Needs more replicas | ### Scaling benefits Disaggregation enables: - **Cost optimization**: The correct ratio of prefill to decode instances improves overall throughput per node. - **Dynamic traffic adjustment**: Scale prefill and decode independently depending on workloads (prefill-heavy versus decode-heavy) and traffic volume. 
- **Efficiency**: Prefill serves multiple requests while decode generates, allowing one prefill instance to feed multiple decode instances. ## Components ### PDProxyServer `PDProxyServer` orchestrates the disaggregated serving: ```python class PDProxyServer: """Proxy server for prefill-decode disaggregation.""" def __init__( self, prefill_handle: DeploymentHandle, decode_handle: DeploymentHandle, ): self.prefill_handle = prefill_handle self.decode_handle = decode_handle async def chat( self, request: ChatCompletionRequest, ) -> AsyncGenerator[str, None]: """Handle chat completion with PD flow. Flow: 1. Send request to prefill deployment 2. Prefill processes prompt, transfers KV to decode 3. Decode generates tokens, streams to client """ # Prefill phase prefill_result = await self.prefill_handle.chat.remote(request) # Extract KV cache metadata kv_metadata = prefill_result["kv_metadata"] # Decode phase with KV reference async for chunk in self.decode_handle.chat.remote( request, kv_metadata=kv_metadata ): yield chunk ``` Key responsibilities: - Route requests between prefill and decode. - Handle KV cache metadata transfer. - Stream responses from decode to client. - Manage errors in either phase per request. ### Prefill LLMServer Standard `LLMServer` configured for prefill: ```python prefill_config = LLMConfig( model_loading_config=dict( model_id="llama-3.1-8b", model_source="meta-llama/Llama-3.1-8B-Instruct" ), engine_kwargs=dict( # Prefill-specific configuration kv_transfer_config={ "kv_connector": "NixlConnector", "kv_role": "kv_both", }, ), ) ``` ### Decode LLMServer Standard `LLMServer` configured for decode: ```python decode_config = LLMConfig( model_loading_config=dict( model_id="llama-3.1-8b", model_source="meta-llama/Llama-3.1-8B-Instruct" ), engine_kwargs=dict( # Decode-specific configuration kv_transfer_config={ "kv_connector": "NixlConnector", "kv_role": "kv_both", }, ), ) ``` ### Request flow ```{figure} ../../images/pd.png --- width: 700px name: pd-flow --- Prefill-decode request flow showing KV cache transfer between phases. ``` Detailed request flow: 1. **Client request**: HTTP POST to `/v1/chat/completions`. 2. **Ingress**: Routes to `PDProxyServer`. 3. **Proxy → Prefill**: `PDProxyServer` calls prefill deployment. - Prefill server processes prompt. - Generates KV cache. - Transfers KV to storage backend. - Returns KV metadata (location, size, etc.). 4. **Proxy → Decode**: `PDProxyServer` calls decode deployment with KV metadata. - Decode server loads KV cache from storage. - Begins token generation. - Streams tokens back through proxy. 5. **Response streaming**: Client receives generated tokens. :::{note} The KV cache transfer is transparent to the client. From the client's perspective, it's a standard OpenAI API call. ::: ## Performance characteristics ### When to use PD disaggregation Prefill-decode disaggregation works best when: - **Long generations**: Decode phase dominates total end-to-end latency. - **Imbalanced phases**: Prefill and decode need different resources. - **Cost optimization**: Use different GPU types for each phase. - **High decode load**: Many requests are in decode phase simultaneously. - **Batch efficiency**: Prefill can batch multiple requests efficiently. ### When not to use PD Consider alternatives when: - **Short outputs**: Decode latency minimal, overhead not worth it. - **Network limitations**: KV transfer overhead too high. - **Small models**: Both phases fit comfortably on same resources. 
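
If disaggregation fits your workload, the following sketch shows one way the components on this page could be wired together, reusing the `serve.deployment(...).options(...).bind(...)` pattern from the architecture overview. It assumes the simplified `PDProxyServer` class and the `prefill_config`/`decode_config` objects defined above; the exact import paths, constructor arguments, and proxy deployment options in Ray Serve LLM may differ, so treat this as a sketch rather than the production recipe.

```python
from ray import serve
from ray.serve.llm.deployment import LLMServer

# `prefill_config` and `decode_config` are the LLMConfig objects shown above.

# Build the prefill and decode LLMServer deployments.
prefill_server = serve.deployment(LLMServer).options(
    **LLMServer.get_deployment_options(prefill_config)
).bind(prefill_config)

decode_server = serve.deployment(LLMServer).options(
    **LLMServer.get_deployment_options(decode_config)
).bind(decode_config)

# Passing bound deployments to `bind` gives the proxy DeploymentHandles at
# runtime, matching the `prefill_handle` / `decode_handle` constructor above.
pd_proxy = serve.deployment(PDProxyServer).bind(
    prefill_handle=prefill_server,
    decode_handle=decode_server,
)

serve.run(pd_proxy)
```

In a full deployment, an `OpenAiIngress` deployment typically sits in front of the proxy, exactly as in the request flow described above.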
## Design considerations ### KV cache transfer latency The latency of KV cache transfer between prefill and decode affects overall request latency and it's mostly determined by network bandwidth. NIXL has different backend plugins, but its performance on different network stacks isn't mature yet. You should inspect your deployment to verify NIXL uses the right network backend for your environment. ### Phase load balancing The system must balance load between prefill and decode phases. Mismatched scaling can lead to: - **Prefill bottleneck**: Requests queue at prefill, decode replicas idle. - **Decode bottleneck**: Prefill completes quickly, decode can't keep up. Monitor both phases and adjust replica counts and autoscaling policies accordingly. ## See also - {doc}`../overview` - High-level architecture overview - {doc}`../core` - Core components and protocols - {doc}`data-parallel` - Data parallel attention architecture - {doc}`../../user-guides/prefill-decode` - Practical deployment guide --- # Benchmarks Performance in LLM serving depends heavily on your specific workload characteristics and hardware stack. From a Ray Serve perspective, the focus is on orchestration overhead and the effectiveness of serving pattern implementations. The Ray team maintains the [ray-serve-llm-perf-examples](https://github.com/anyscale/ray-serve-llm-perf-examples) repository with benchmarking snapshots, tooling, and lessons learned. These benchmarks validate the correctness and effectiveness of different serving patterns. You can use these benchmarks to validate your production stack more systematically. ## Replica Startup Latency Replica startup times involving large models can be slow, leading to slow autoscaling and poor response to changing workloads. Experiments on replica startup can be found [here](https://github.com/anyscale/ray-serve-llm-perf-examples/tree/master/replica_initialization). The experiments illustrate the effects of the various techniques mentioned in [this guide](./user-guides/deployment-initialization.md), primarily targeting the latency cost of model loading and Torch Compile. As models grow larger, the effects of these optimizations become increasingly pronounced. As an example, we get nearly 3.88x reduction in latency on `Qwen/Qwen3-235B-A22B`. --- # Examples Production examples for deploying LLMs with Ray Serve. ## Tutorials Complete end-to-end tutorials for deploying different types of LLMs: - {doc}`Deploy a small-sized LLM <../tutorials/deployment-serve-llm/small-size-llm/README>` - {doc}`Deploy a medium-sized LLM <../tutorials/deployment-serve-llm/medium-size-llm/README>` - {doc}`Deploy a large-sized LLM <../tutorials/deployment-serve-llm/large-size-llm/README>` - {doc}`Deploy a vision LLM <../tutorials/deployment-serve-llm/vision-llm/README>` - {doc}`Deploy a reasoning LLM <../tutorials/deployment-serve-llm/reasoning-llm/README>` - {doc}`Deploy a hybrid reasoning LLM <../tutorials/deployment-serve-llm/hybrid-reasoning-llm/README>` - {doc}`Deploy gpt-oss <../tutorials/deployment-serve-llm/gpt-oss/README>` --- (serving-llms)= # Serving LLMs Ray Serve LLM provides a high-performance, scalable framework for deploying Large Language Models (LLMs) in production. It specializes Ray Serve primitives for distributed LLM serving workloads, offering enterprise-grade features with OpenAI API compatibility. ## Why Ray Serve LLM? 
Ray Serve LLM excels at highly distributed multi-node inference workloads: - **Advanced parallelism strategies**: Seamlessly combine pipeline parallelism, tensor parallelism, expert parallelism, and data parallel attention for models of any size. - **Prefill-decode disaggregation**: Separates and optimizes prefill and decode phases independently for better resource utilization and cost efficiency. - **Custom request routing**: Implements prefix-aware, session-aware, or custom routing logic to maximize cache hits and reduce latency. - **Multi-node deployments**: Serves massive models that span multiple nodes with automatic placement and coordination. - **Production-ready**: Has built-in autoscaling, monitoring, fault tolerance, and observability. ## Features - ⚡️ Automatic scaling and load balancing - 🌐 Unified multi-node multi-model deployment - 🔌 OpenAI-compatible API - 🔄 Multi-LoRA support with shared base models - 🚀 Engine-agnostic architecture (vLLM, SGLang, etc.) - 📊 Built-in metrics and Grafana dashboards - 🎯 Advanced serving patterns (PD disaggregation, data parallel attention) ## Requirements ```bash pip install ray[serve,llm] ``` ```{toctree} :hidden: Quickstart Examples User Guides Architecture Benchmarks Troubleshooting ``` ## Next steps - {doc}`Quickstart ` - Deploy your first LLM with Ray Serve - {doc}`Examples ` - Production-ready deployment tutorials - {doc}`User Guides ` - Practical guides for advanced features - {doc}`Architecture ` - Technical design and implementation details - {doc}`Troubleshooting ` - Common issues and solutions --- (quick-start)= # Quickstart examples ## Deployment through OpenAiIngress You can deploy LLM models using either the builder pattern or bind pattern. ::::{tab-set} :::{tab-item} Builder Pattern :sync: builder ```{literalinclude} ../../llm/doc_code/serve/qwen/qwen_example.py :language: python :start-after: __qwen_example_start__ :end-before: __qwen_example_end__ ``` ::: :::{tab-item} Bind Pattern :sync: bind ```python from ray import serve from ray.serve.llm import LLMConfig from ray.serve.llm.deployment import LLMServer from ray.serve.llm.ingress import OpenAiIngress, make_fastapi_ingress llm_config = LLMConfig( model_loading_config=dict( model_id="qwen-0.5b", model_source="Qwen/Qwen2.5-0.5B-Instruct", ), deployment_config=dict( autoscaling_config=dict( min_replicas=1, max_replicas=2, ) ), # Pass the desired accelerator type (e.g. A10G, L4, etc.) accelerator_type="A10G", # You can customize the engine arguments (e.g. 
vLLM engine kwargs) engine_kwargs=dict( tensor_parallel_size=2, ), ) # Deploy the application server_options = LLMServer.get_deployment_options(llm_config) server_deployment = serve.deployment(LLMServer).options( **server_options).bind(llm_config) ingress_options = OpenAiIngress.get_deployment_options( llm_configs=[llm_config]) ingress_cls = make_fastapi_ingress(OpenAiIngress) ingress_deployment = serve.deployment(ingress_cls).options( **ingress_options).bind([server_deployment]) serve.run(ingress_deployment, blocking=True) ``` ::: :::: You can query the deployed models with either cURL or the OpenAI Python client: ::::{tab-set} :::{tab-item} cURL :sync: curl ```bash curl -X POST http://localhost:8000/v1/chat/completions \ -H "Content-Type: application/json" \ -H "Authorization: Bearer fake-key" \ -d '{ "model": "qwen-0.5b", "messages": [{"role": "user", "content": "Hello!"}] }' ``` ::: :::{tab-item} Python :sync: python ```python from openai import OpenAI # Initialize client client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake-key") # Basic chat completion with streaming response = client.chat.completions.create( model="qwen-0.5b", messages=[{"role": "user", "content": "Hello!"}], stream=True ) for chunk in response: if chunk.choices[0].delta.content is not None: print(chunk.choices[0].delta.content, end="", flush=True) ``` ::: :::: For deploying multiple models, you can pass a list of {class}`LLMConfig ` objects to the {class}`OpenAiIngress ` deployment: ::::{tab-set} :::{tab-item} Builder Pattern :sync: builder ```python from ray import serve from ray.serve.llm import LLMConfig, build_openai_app llm_config1 = LLMConfig( model_loading_config=dict( model_id="qwen-0.5b", model_source="Qwen/Qwen2.5-0.5B-Instruct", ), deployment_config=dict( autoscaling_config=dict( min_replicas=1, max_replicas=2, ) ), accelerator_type="A10G", ) llm_config2 = LLMConfig( model_loading_config=dict( model_id="qwen-1.5b", model_source="Qwen/Qwen2.5-1.5B-Instruct", ), deployment_config=dict( autoscaling_config=dict( min_replicas=1, max_replicas=2, ) ), accelerator_type="A10G", ) app = build_openai_app({"llm_configs": [llm_config1, llm_config2]}) serve.run(app, blocking=True) ``` ::: :::{tab-item} Bind Pattern :sync: bind ```python from ray import serve from ray.serve.llm import LLMConfig from ray.serve.llm.deployment import LLMServer from ray.serve.llm.ingress import OpenAiIngress, make_fastapi_ingress llm_config1 = LLMConfig( model_loading_config=dict( model_id="qwen-0.5b", model_source="Qwen/Qwen2.5-0.5B-Instruct", ), deployment_config=dict( autoscaling_config=dict( min_replicas=1, max_replicas=2, ) ), accelerator_type="A10G", ) llm_config2 = LLMConfig( model_loading_config=dict( model_id="qwen-1.5b", model_source="Qwen/Qwen2.5-1.5B-Instruct", ), deployment_config=dict( autoscaling_config=dict( min_replicas=1, max_replicas=2, ) ), accelerator_type="A10G", ) # deployment #1 server_options1 = LLMServer.get_deployment_options(llm_config1) server_deployment1 = serve.deployment(LLMServer).options( **server_options1).bind(llm_config1) # deployment #2 server_options2 = LLMServer.get_deployment_options(llm_config2) server_deployment2 = serve.deployment(LLMServer).options( **server_options2).bind(llm_config2) # ingress ingress_options = OpenAiIngress.get_deployment_options( llm_configs=[llm_config1, llm_config2]) ingress_cls = make_fastapi_ingress(OpenAiIngress) ingress_deployment = serve.deployment(ingress_cls).options( **ingress_options).bind([server_deployment1, server_deployment2]) # run 
serve.run(ingress_deployment, blocking=True) ``` ::: :::: ## Production deployment For production deployments, Ray Serve LLM provides utilities for config-driven deployments. You can specify your deployment configuration with YAML files: ::::{tab-set} :::{tab-item} Inline Config :sync: inline ```{literalinclude} ../../llm/doc_code/serve/qwen/llm_config_example.yaml :language: yaml ``` ::: :::{tab-item} Standalone Config :sync: standalone ```yaml # config.yaml applications: - args: llm_configs: - models/qwen-0.5b.yaml - models/qwen-1.5b.yaml import_path: ray.serve.llm:build_openai_app name: llm_app route_prefix: "/" ``` ```yaml # models/qwen-0.5b.yaml model_loading_config: model_id: qwen-0.5b model_source: Qwen/Qwen2.5-0.5B-Instruct accelerator_type: A10G deployment_config: autoscaling_config: min_replicas: 1 max_replicas: 2 ``` ```yaml # models/qwen-1.5b.yaml model_loading_config: model_id: qwen-1.5b model_source: Qwen/Qwen2.5-1.5B-Instruct accelerator_type: A10G deployment_config: autoscaling_config: min_replicas: 1 max_replicas: 2 ``` ::: :::: To deploy with either configuration file: ```bash serve run config.yaml ``` For monitoring and observability, see {doc}`Observability `. ## Advanced usage patterns For each usage pattern, Ray Serve LLM provides a server and client code snippet. ### Cross-node parallelism Ray Serve LLM supports cross-node tensor parallelism (TP) and pipeline parallelism (PP), allowing you to distribute model inference across multiple GPUs and nodes. See {doc}`Cross-node parallelism ` for a comprehensive guide on configuring and deploying models with cross-node parallelism. --- # Troubleshooting Common issues and frequently asked questions for Ray Serve LLM. ## Frequently asked questions ### How do I use gated Hugging Face models? You can use `runtime_env` to specify the env variables that are required to access the model. To get the deployment options, you can use the `get_deployment_options` method on the {class}`LLMServer ` class. Each deployment class has its own `get_deployment_options` method. ```python from ray import serve from ray.serve.llm import LLMConfig from ray.serve.llm.deployment import LLMServer from ray.serve.llm.ingress import OpenAiIngress from ray.serve.llm.builders import build_openai_app import os llm_config = LLMConfig( model_loading_config=dict( model_id="llama-3-8b-instruct", model_source="meta-llama/Meta-Llama-3-8B-Instruct", ), deployment_config=dict( autoscaling_config=dict( min_replicas=1, max_replicas=2, ) ), # Pass the desired accelerator type (e.g., A10G, L4, etc.) accelerator_type="A10G", runtime_env=dict( env_vars=dict( HF_TOKEN=os.environ["HF_TOKEN"] ) ), ) app = build_openai_app({"llm_configs": [llm_config]}) serve.run(app, blocking=True) ``` ### Why is downloading the model so slow? If you're using Hugging Face models, you can enable fast download by setting `HF_HUB_ENABLE_HF_TRANSFER` and installing `pip install hf_transfer`. ```python from ray import serve from ray.serve.llm import LLMConfig from ray.serve.llm.deployment import LLMServer from ray.serve.llm.ingress import OpenAiIngress from ray.serve.llm.builders import build_openai_app import os llm_config = LLMConfig( model_loading_config=dict( model_id="llama-3-8b-instruct", model_source="meta-llama/Meta-Llama-3-8B-Instruct", ), deployment_config=dict( autoscaling_config=dict( min_replicas=1, max_replicas=2, ) ), # Pass the desired accelerator type (e.g., A10G, L4, etc.) 
accelerator_type="A10G", runtime_env=dict( env_vars=dict( HF_TOKEN=os.environ["HF_TOKEN"], HF_HUB_ENABLE_HF_TRANSFER="1" ) ), ) # Deploy the application app = build_openai_app({"llm_configs": [llm_config]}) serve.run(app, blocking=True) ``` ## Get help If you encounter issues not covered in this guide: - [Ray GitHub Issues](https://github.com/ray-project/ray/issues) - Report bugs or request features - [Ray Slack](https://ray-distributed.slack.com) - Get help from the community - [Ray Discourse Forum](https://discuss.ray.io) - Ask questions and share knowledge ## See also - {doc}`Quickstart examples ` - {doc}`Examples ` --- (cross-node-parallelism)= # Cross-node parallelism Ray Serve LLM supports cross-node tensor parallelism (TP) and pipeline parallelism (PP), allowing you to distribute model inference across multiple GPUs and nodes. This capability enables you to: - Deploy models that don't fit on a single GPU or node. - Scale model serving across your cluster's available resources. - Leverage Ray's placement group strategies to control worker placement for performance or fault tolerance. ::::{note} By default, Ray Serve LLM uses the `PACK` placement strategy, which tries to place workers on as few nodes as possible. If workers can't fit on a single node, they automatically spill to other nodes. This enables cross-node deployments when single-node resources are insufficient. :::: ## Tensor parallelism Tensor parallelism splits model weights across multiple GPUs, with each GPU processing a portion of the model's tensors for each forward pass. This approach is useful for models that don't fit on a single GPU. The following example shows how to configure tensor parallelism across 2 GPUs: ::::{tab-set} :::{tab-item} Python :sync: python ```{literalinclude} ../../doc_code/cross_node_parallelism_example.py :language: python :start-after: __cross_node_tp_example_start__ :end-before: __cross_node_tp_example_end__ ``` ::: :::: ## Pipeline parallelism Pipeline parallelism splits the model's layers across multiple GPUs, with each GPU processing a subset of the model's layers. This approach is useful for very large models where tensor parallelism alone isn't sufficient. The following example shows how to configure pipeline parallelism across 2 GPUs: ::::{tab-set} :::{tab-item} Python :sync: python ```{literalinclude} ../../doc_code/cross_node_parallelism_example.py :language: python :start-after: __cross_node_pp_example_start__ :end-before: __cross_node_pp_example_end__ ``` ::: :::: ## Combined tensor and pipeline parallelism For extremely large models, you can combine both tensor and pipeline parallelism. The total number of GPUs is the product of `tensor_parallel_size` and `pipeline_parallel_size`. The following example shows how to configure a model with both TP and PP (4 GPUs total): ::::{tab-set} :::{tab-item} Python :sync: python ```{literalinclude} ../../doc_code/cross_node_parallelism_example.py :language: python :start-after: __cross_node_tp_pp_example_start__ :end-before: __cross_node_tp_pp_example_end__ ``` ::: :::: ## Custom placement groups You can customize how Ray places vLLM engine workers across nodes through the `placement_group_config` parameter. This parameter accepts a dictionary with `bundles` (a list of resource dictionaries) and `strategy` (placement strategy). Ray Serve LLM uses the `PACK` strategy by default, which tries to place workers on as few nodes as possible. If workers can't fit on a single node, they automatically spill to other nodes. 
For more details on all available placement strategies, see {ref}`Ray Core's placement strategies documentation `. ::::{note} Data parallel deployments automatically override the placement strategy to `STRICT_PACK` because each replica must be co-located for correct data parallel behavior. :::: While you can specify the degree of tensor and pipeline parallelism, the specific assignment of model ranks to GPUs is managed by the vLLM engine and can't be directly configured through the Ray Serve LLM API. Ray Serve automatically injects accelerator type labels into bundles and merges the first bundle with replica actor resources (CPU, GPU, memory). The following example shows how to use the `SPREAD` strategy to distribute workers across multiple nodes for fault tolerance: ::::{tab-set} :::{tab-item} Python :sync: python ```{literalinclude} ../../doc_code/cross_node_parallelism_example.py :language: python :start-after: __custom_placement_group_spread_example_start__ :end-before: __custom_placement_group_spread_example_end__ ``` ::: :::: --- (data-parallel-attention-guide)= # Data parallel attention Deploy LLMs with data parallel attention for increased throughput and better resource utilization, especially for sparse MoE (Mixture of Experts) models. Data parallel attention creates multiple coordinated inference engine replicas that process requests in parallel. This pattern is most effective when combined with expert parallelism for sparse MoE models, where attention (QKV) layers are replicated across replicas while MoE experts are sharded. This separation provides: - **Increased throughput**: Process more concurrent requests by distributing them across multiple replicas. - **Better resource utilization**: Especially beneficial for sparse MoE models where not all experts are active for each request. - **KV cache scalability**: Add more KV cache capacity across replicas to handle larger batch sizes. - **Expert saturation**: Achieve higher effective batch sizes during decoding to better saturate MoE layers. ## When to use data parallel attention Consider this pattern when: - **Sparse MoE models with MLA**: You're serving models with Multi-head Latent Attention (MLA) where KV cache can't be sharded along the head dimension. MLA reduces KV cache memory requirements, making data parallel replication more efficient. - **High throughput requirements**: You need to serve many concurrent requests and want to maximize throughput. - **KV-cache limited**: Adding more KV cache capacity increases throughput, and data parallel attention effectively increases KV cache capacity across replicas. **When not to use data parallel attention:** - **Low to medium throughput**: If you can't saturate the MoE layers, data parallel attention adds unnecessary complexity. - **Non-MoE models**: The main benefit is lifting effective batch size for saturating experts, which doesn't apply to dense models. - **Sufficient tensor parallelism**: For models with GQA (Grouped Query Attention), use tensor parallelism (TP) first to shard KV cache up to `TP_size <= num_kv_heads`. Beyond that, TP requires KV cache replication—at that point, data parallel attention becomes a better choice. 
## Basic deployment The following example shows how to deploy with data parallel attention: ```{literalinclude} ../../../llm/doc_code/serve/multi_gpu/dp_basic_example.py :language: python :start-after: __dp_basic_example_start__ :end-before: __dp_basic_example_end__ ``` ## Production YAML configuration For production deployments, use a YAML configuration file: ```yaml applications: - name: dp_llm_app route_prefix: / import_path: ray.serve.llm:build_dp_openai_app args: llm_config: model_loading_config: model_id: Qwen/Qwen2.5-0.5B-Instruct engine_kwargs: data_parallel_size: 4 tensor_parallel_size: 2 experimental_configs: dp_size_per_node: 4 ``` Deploy with: ```bash serve deploy dp_config.yaml ``` :::{note} The `num_replicas` in `deployment_config` must equal `data_parallel_size` in `engine_kwargs`. Autoscaling is not supported for data parallel attention deployments since all replicas must be present and coordinated. ::: ## Configuration parameters ### Required parameters - `data_parallel_size`: Number of data parallel replicas to create. Must be a positive integer. - `dp_size_per_node`: Number of DP replicas per node. Must be set in `experimental_configs`. This controls how replicas are distributed across nodes. This is a temporary required config that we will remove in future versions. ### Deployment configuration - `num_replicas`: Must be set to `data_parallel_size`. Data parallel attention requires a fixed number of replicas. - `placement_group_strategy`: Automatically set to `"STRICT_PACK"` to ensure replicas are properly placed. ## Understanding replica coordination In data parallel attention, all replicas work together as a cohesive unit: 1. **Rank assignment**: Each replica receives a unique rank (0 to `dp_size-1`) from a coordinator. 2. **Request distribution**: Ray Serve's request router distributes requests across replicas using load balancing. 3. **Collective operations**: Replicas coordinate for collective operations (e.g., all-reduce) required by the model. 4. **Synchronization**: All replicas must be present and healthy for the deployment to function correctly. The coordination overhead is minimal: - **Startup**: Each replica makes one RPC call to get its rank. - **Runtime**: No coordination overhead during request processing. For more details, see {doc}`../architecture/serving-patterns/data-parallel`. ## Test your deployment Test with a chat completion request: ```bash curl -X POST "http://localhost:8000/v1/chat/completions" \ -H "Content-Type: application/json" \ -H "Authorization: Bearer fake-key" \ -d '{ "model": "Qwen/Qwen2.5-0.5B-Instruct", "messages": [ {"role": "user", "content": "Explain data parallel attention"} ], "max_tokens": 100, "temperature": 0.7 }' ``` You can also test programmatically: ```python from openai import OpenAI client = OpenAI( base_url="http://localhost:8000/v1", api_key="fake-key" ) response = client.chat.completions.create( model="Qwen/Qwen2.5-0.5B-Instruct", messages=[ {"role": "user", "content": "Explain data parallel attention"} ], max_tokens=100 ) print(response.choices[0].message.content) ``` ## Combining with other patterns ### Data parallel + Prefill-decode disaggregation You can combine data parallel attention with prefill-decode disaggregation to scale both phases independently while using DP within each phase. This pattern is useful when you need high throughput for both prefill and decode phases. 
The following example shows a complete, functional deployment: ```{literalinclude} ../../../llm/doc_code/serve/multi_gpu/dp_pd_example.py :language: python :start-after: __dp_pd_example_start__ :end-before: __dp_pd_example_end__ ``` This configuration creates: - **Prefill phase**: 2 data parallel replicas for processing input prompts - **Decode phase**: 2 data parallel replicas for generating tokens - **PDProxyServer**: Coordinates requests between prefill and decode phases - **OpenAI ingress**: Provides OpenAI-compatible API endpoints This allows you to: - Optimize prefill and decode phases independently based on workload characteristics - Use data parallel attention within each phase for increased throughput :::{note} This example uses 4 GPUs total (2 for prefill, 2 for decode). Adjust the `data_parallel_size` values based on your available GPU resources. ::: :::{note} For this example to work, you need to have NIXL installed. See the {doc}`prefill-decode` guide for prerequisites and installation instructions. ::: ## See also - {doc}`../architecture/serving-patterns/data-parallel` - Data parallel attention architecture details - {doc}`prefill-decode` - Prefill-decode disaggregation guide - {doc}`../architecture/serving-patterns/index` - Overview of serving patterns - {doc}`../quick-start` - Basic LLM deployment examples --- (deployment-initialization-guide)= # Deployment Initialization The initialization phase of a serve.llm deployment involves many steps, including preparation of model weights, engine (vLLM) initialization, and Ray serve replica autoscaling overheads. A detailed breakdown of the steps involved in using serve.llm with vLLM is provided below. ## Startup Breakdown - **Provisioning Nodes**: If a GPU node isn't available, a new instance must be provisioned. - **Image Download**: Downloading image to target instance incurs latency correlated with image size. - **Fixed Ray/Node Initialization**: Ray/vLLM incurs some fixed overhead when spawning new processes to handle a new replica, which involves importing large libraries (such as vLLM), preparing model and engine configurations, etc. - **Model Loading**: Retrieve model either from Hugging Face or cloud storage, including time spent downloading the model and moving it to GPU memory - **Torch Compile**: Torch compile is integral to vLLM's design and it is enabled by default. - **Memory Profiling**: vLLM runs some inference on the model to determine the amount of available memory it can dedicate to the KV cache - **CUDA Graph Capture**: vLLM captures the CUDA graphs for different input sizes ahead of time. More details are [here.](https://docs.vllm.ai/en/latest/design/cuda_graphs.html) - **Warmup**: Initialize KV cache, run model inference. This document will provide an overview of the numerous ways to customize your deployment initialization. ## Model Loading from Hugging Face By default, Ray Serve LLM loads models from Hugging Face Hub. Specify the model source with `model_source`: ```python from ray import serve from ray.serve.llm import LLMConfig, build_openai_app llm_config = LLMConfig( model_loading_config=dict( model_id="llama-3-8b", model_source="meta-llama/Meta-Llama-3-8B-Instruct", ), accelerator_type="A10G", ) app = build_openai_app({"llm_configs": [llm_config]}) serve.run(app, blocking=True) ``` ### Load gated models Gated Hugging Face models require authentication. 
Pass your Hugging Face token through the `runtime_env`: ```python from ray import serve from ray.serve.llm import LLMConfig, build_openai_app import os llm_config = LLMConfig( model_loading_config=dict( model_id="llama-3-8b-instruct", model_source="meta-llama/Meta-Llama-3-8B-Instruct", ), deployment_config=dict( autoscaling_config=dict( min_replicas=1, max_replicas=2, ) ), accelerator_type="A10G", runtime_env=dict( env_vars={ "HF_TOKEN": os.environ["HF_TOKEN"] } ), ) app = build_openai_app({"llm_configs": [llm_config]}) serve.run(app, blocking=True) ``` You can also set environment variables cluster-wide by passing them to `ray.init`: ```python import ray ray.init( runtime_env=dict( env_vars={ "HF_TOKEN": os.environ["HF_TOKEN"] } ), ) ``` ### Fast download from Hugging Face Enable fast downloads with Hugging Face's `hf_transfer` library: 1. Install the library: ```bash pip install hf_transfer ``` 2. Set the `HF_HUB_ENABLE_HF_TRANSFER` environment variable: ```python from ray import serve from ray.serve.llm import LLMConfig, build_openai_app llm_config = LLMConfig( model_loading_config=dict( model_id="llama-3-8b", model_source="meta-llama/Meta-Llama-3-8B-Instruct", ), accelerator_type="A10G", runtime_env=dict( env_vars={ "HF_HUB_ENABLE_HF_TRANSFER": "1" } ), ) app = build_openai_app({"llm_configs": [llm_config]}) serve.run(app, blocking=True) ``` ## Model Loading from remote storage Load models from S3 or GCS buckets instead of Hugging Face. This is useful for: - Private models not hosted on Hugging Face - Faster loading from cloud storage in the same region - Custom model formats or fine-tuned models ### S3 bucket structure Your S3 bucket should contain the model files in a Hugging Face-compatible structure: ```bash $ aws s3 ls air-example-data/rayllm-ossci/meta-Llama-3.2-1B-Instruct/ 2025-03-25 11:37:48 1519 .gitattributes 2025-03-25 11:37:48 7712 LICENSE.txt 2025-03-25 11:37:48 41742 README.md 2025-03-25 11:37:48 6021 USE_POLICY.md 2025-03-25 11:37:48 877 config.json 2025-03-25 11:37:48 189 generation_config.json 2025-03-25 11:37:48 2471645608 model.safetensors 2025-03-25 11:37:53 296 special_tokens_map.json 2025-03-25 11:37:53 9085657 tokenizer.json 2025-03-25 11:37:53 54528 tokenizer_config.json ``` ### Configure S3 loading (YAML) Use the `bucket_uri` parameter in `model_loading_config`: ```yaml # config.yaml applications: - args: llm_configs: - accelerator_type: A10G engine_kwargs: max_model_len: 8192 model_loading_config: model_id: my_llama model_source: bucket_uri: s3://anonymous@air-example-data/rayllm-ossci/meta-Llama-3.2-1B-Instruct import_path: ray.serve.llm:build_openai_app name: llm_app route_prefix: "/" ``` Deploy with: ```bash serve deploy config.yaml ``` ### Configure S3 loading (Python API) You can also configure S3 loading with Python: ```python from ray import serve from ray.serve.llm import LLMConfig, build_openai_app llm_config = LLMConfig( model_loading_config=dict( model_id="my_llama", model_source=dict( bucket_uri="s3://my-bucket/path/to/model" ) ), accelerator_type="A10G", engine_kwargs=dict( max_model_len=8192, ), ) app = build_openai_app({"llm_configs": [llm_config]}) serve.run(app, blocking=True) ``` ### Configure GCS bucket loading (YAML) For Google Cloud Storage, use the `gs://` protocol: ```yaml model_loading_config: model_id: my_model model_source: bucket_uri: gs://my-gcs-bucket/path/to/model ``` ### S3 credentials For private S3 buckets, configure AWS credentials: 1. 
**Option 1: Environment variables** ```python llm_config = LLMConfig( model_loading_config=dict( model_id="my_model", model_source=dict( bucket_uri="s3://my-private-bucket/model" ) ), runtime_env=dict( env_vars={ "AWS_ACCESS_KEY_ID": os.environ["AWS_ACCESS_KEY_ID"], "AWS_SECRET_ACCESS_KEY": os.environ["AWS_SECRET_ACCESS_KEY"], } ), ) ``` 2. **Option 2: IAM roles** (recommended for production) Use EC2 instance profiles or EKS service accounts with appropriate S3 read permissions. ### S3 and RunAI Streamer S3 can be combined with RunAI Streamer, an extension in vLLM that enables streaming the model weights directly from remote cloud storage into GPU memory, improving model load latency. More details can be found [here](https://docs.vllm.ai/en/stable/models/extensions/runai_model_streamer.html). ```python llm_config = LLMConfig( ... model_loading_config={ "model_id": "llama", "model_source": "s3://your-bucket/Meta-Llama-3-8B-Instruct", }, engine_kwargs={ "tensor_parallel_size": 1, "load_format": "runai_streamer", }, ... ) ``` ### Model Sharding Modern LLM model sizes often outgrow the memory capacity of a single GPU, requiring the use of tensor parallelism to split computation across multiple devices. In this paradigm, only a subset of weights are stored on each GPU, and model sharding ensures that each device only loads the relevant portion of the model. By sharding the model files in advance, we can reduce load times significantly, since GPUs avoid loading unneeded weights. vLLM provides a utility script for this purpose: [save_sharded_state.py](https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/save_sharded_state.py). Once the sharded weights have been saved, upload them to S3 and use RunAI streamer with a new flag to load the sharded weights ```python llm_config = LLMConfig( ... engine_kwargs={ "tensor_parallel_size": 4, "load_format": "runai_streamer_sharded", }, ... ) ``` ## Additional Optimizations ### Torch Compile Cache Torch.compile incurs some latency during initialization. This can be mitigated by keeping a torch compile cache, which is automatically generated by vLLM. To retrieve the torch compile cache, run vLLM and look for a log like below: ``` (RayWorkerWrapper pid=126782) INFO 10-15 11:57:04 [backends.py:608] Using cache directory: /home/ray/.cache/vllm/torch_compile_cache/131ee5c6d9/rank_1_0/backbone for vLLM's torch.compile ``` In this example the cache folder is located at `/home/ray/.cache/vllm/torch_compile_cache/131ee5c6d9`. Upload this directory to your S3 bucket. The cache folder can now be retrieved at startup. We provide a custom utility to download the compile cache from cloud storage. Specify the `CloudDownloader` callback in `LLMConfig` and supply the relevant arguments. Make sure to set the `cache_dir` in compilation_config correctly. ```python llm_config = LLMConfig( ... callback_config={ "callback_class": "ray.llm._internal.common.callbacks.cloud_downloader.CloudDownloader", "callback_kwargs": {"paths": [("s3://samplebucket/llama-3-8b-cache", "/home/ray/.cache/vllm/torch_compile_cache/llama-3-8b-cache")]}, }, engine_kwargs={ "tensor_parallel_size": 1, "compilation_config": { "cache_dir": "/home/ray/.cache/vllm/torch_compile_cache/llama-3-8b-cache", } }, ... ) ``` Other options for retrieving the compile cache (distributed filesystem, block storage) can be used, as long as the path to the cache is set in `compilation_config`. 
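For instance, if the compile cache already sits on a shared filesystem that's mounted on every node, you can skip the download callback and point `compilation_config` straight at that path. The following is a sketch only; the mount path is a placeholder and assumes the cache was produced with the same model and vLLM version.

```python
from ray.serve.llm import LLMConfig

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="llama-3-8b",
        model_source="meta-llama/Meta-Llama-3-8B-Instruct",
    ),
    accelerator_type="A10G",
    engine_kwargs=dict(
        tensor_parallel_size=1,
        compilation_config={
            # Hypothetical shared mount visible to all worker nodes.
            "cache_dir": "/mnt/shared/vllm_torch_compile_cache/llama-3-8b",
        },
    ),
)
```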
### Custom Initialization Behaviors We provide the ability to create custom node initialization behaviors with the API defined by [`CallbackBase`](https://github.com/ray-project/ray/blob/master/python/ray/llm/_internal/common/callbacks/base.py). Callback functions defined in the class are invoked at certain parts of the initialization process. An example is the above mentioned [`CloudDownloader`](https://github.com/ray-project/ray/blob/master/python/ray/llm/_internal/common/callbacks/cloud_downloader.py) which overrides the `on_before_download_model_files_distributed` function to distribute download tasks across nodes. To enable your custom callback, specify the classname inside `LLMConfig`. ```python from user_custom_classes import CustomCallback config = LLMConfig( ... callback_config={ "callback_class": CustomCallback, # or use string "user_custom_classes.CustomCallback" "callback_kwargs": {"kwargs_test_key": "kwargs_test_value"}, }, ... ) ``` > **Note:** Callbacks are a new feature. We may change the callback API and incorporate user feedback as we continue to develop this functionality. ## Best practices ### Model source selection - **Use Hugging Face** for publicly available models and quick prototyping - **Use remote storage** for private models, custom fine-tunes, or when co-located with compute - **Enable fast downloads** when downloading large models from Hugging Face ### Security - **Never commit tokens** to version control. Use environment variables or secrets management. - **Use IAM roles** instead of access keys for production deployments on AWS. - **Scope permissions** to read-only access for model loading. ### Performance - **Co-locate storage and compute** in the same cloud region to reduce latency and egress costs. - **Use fast download** (`HF_HUB_ENABLE_HF_TRANSFER`) for models larger than 10GB. - **Cache models** locally if you're repeatedly deploying the same model. - **See benchmarks** [here](../benchmarks.md) for detailed information about optimizations ## Troubleshooting ### Slow downloads from Hugging Face - Install `hf_transfer`: `pip install hf_transfer` - Set `HF_HUB_ENABLE_HF_TRANSFER=1` in `runtime_env` - Consider moving the model to S3/GCS in your cloud region and using RunAI streamer, and use sharding for large models ### S3/GCS access errors - Verify bucket URI format (for example, `s3://bucket/path` or `gs://bucket/path`) - Check AWS/GCP credentials and regions are configured correctly - Ensure your IAM role or service account has `s3:GetObject` or `storage.objects.get` permissions - Verify the bucket exists and is accessible from your deployment region ### Model files not found - Verify the model structure matches Hugging Face format (must include `config.json`, tokenizer files, and model weights) - Check that all required files are present in the bucket ## See also - {doc}`Quickstart <../quick-start>` - Basic LLM deployment examples --- (fractional-gpu-guide)= # Fractional GPU serving Serve multiple small models on the same GPU for cost-efficient deployments. :::{note} This feature hasn't been extensively tested in production. If you encounter any issues, report them on [GitHub](https://github.com/ray-project/ray/issues) with reproducible code. ::: Fractional GPU allocation allows you to run multiple model replicas on a single GPU by customizing placement groups. This approach maximizes GPU utilization and reduces costs when serving small models that don't require a full GPU's resources. 
## When to use fractional GPUs Consider fractional GPU allocation when: - You're serving small models with low concurrency that don't require a full GPU for model weights and KV cache. - You have multiple models that fit this profile. ## Deploy with fractional GPU allocation The following example shows how to serve 8 replicas of a small model on 4 L4 GPUs (2 replicas per GPU): ```python from ray.serve.llm import LLMConfig, ModelLoadingConfig from ray.serve.llm import build_openai_app from ray import serve llm_config = LLMConfig( model_loading_config=ModelLoadingConfig( model_id="HuggingFaceTB/SmolVLM-256M-Instruct", ), engine_kwargs=dict( gpu_memory_utilization=0.4, use_tqdm_on_load=False, enforce_eager=True, max_model_len=2048, ), deployment_config=dict( autoscaling_config=dict( min_replicas=8, max_replicas=8, ) ), accelerator_type="L4", placement_group_config=dict(bundles=[dict(GPU=0.49)]), runtime_env=dict( env_vars={ "VLLM_DISABLE_COMPILE_CACHE": "1", }, ), ) app = build_openai_app({"llm_configs": [llm_config]}) serve.run(app, blocking=True) ``` ## Configuration parameters Use the following parameters to configure fractional GPU allocation. The placement group defines the GPU share, and Ray Serve infers the matching `VLLM_RAY_PER_WORKER_GPUS` value for you. The memory management and performance settings are vLLM-specific optimizations that you can adjust based on your model and workload requirements. ### Placement group configuration - `placement_group_config`: Specifies the GPU fraction each replica uses. Set `GPU` to the fraction (for example, `0.49` for approximately half a GPU). Use slightly less than the theoretical fraction to account for system overhead—this headroom prevents out-of-memory errors. - `VLLM_RAY_PER_WORKER_GPUS`: Ray Serve derives this from `placement_group_config` when GPU bundles are fractional. Setting it manually is allowed but not recommended. ### Memory management - `gpu_memory_utilization`: Controls how much GPU memory vLLM pre-allocates. vLLM allocates memory based on this setting regardless of Ray's GPU scheduling. In the example, `0.4` means vLLM targets 40% of GPU memory for the model, KV cache, and CUDAGraph memory. ### Performance settings - `enforce_eager`: Set to `True` to disable CUDA graphs and reduce memory overhead. - `max_model_len`: Limits the maximum sequence length, reducing memory requirements. - `use_tqdm_on_load`: Set to `False` to disable progress bars during model loading. ### Workarounds - `VLLM_DISABLE_COMPILE_CACHE`: Set to `1` to avoid a [resource contention issue](https://github.com/vllm-project/vllm/issues/24601) among workers during torch compile caching. ## Best practices ### Calculate GPU allocation - **Leave headroom**: Use slightly less than the theoretical fraction (for example, `0.49` instead of `0.5`) to account for system overhead. - **Match memory to workload**: Ensure `gpu_memory_utilization` × GPU memory × number of replicas per GPU doesn't exceed total GPU memory. - **Account for all memory**: Consider model weights, KV cache, CUDA graphs, and framework overhead. ### Optimize for your models - **Test memory requirements**: Profile your model's actual memory usage before setting `gpu_memory_utilization`. This information often gets printed as part of the vLLM initialization. - **Start conservative**: Begin with fewer replicas per GPU and increase gradually while monitoring memory usage. - **Monitor OOM errors**: Watch for out-of-memory errors that indicate you need to reduce replicas or lower `gpu_memory_utilization`. 
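As a quick sanity check of the memory-budget rule above: with 2 replicas per 24 GB L4 GPU and `gpu_memory_utilization=0.4`, the replicas together pre-allocate about 0.4 × 24 GB × 2 ≈ 19.2 GB, leaving roughly 4.8 GB per GPU of headroom for CUDA context and other framework overhead.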
### Production considerations - **Validate performance**: Test throughput and latency with your actual workload before production deployment. - **Consider autoscaling carefully**: Fractional GPU deployments work best with fixed replica counts rather than autoscaling. ## Troubleshooting ### Out of memory errors - Reduce `gpu_memory_utilization` (for example, from `0.4` to `0.3`) - Decrease the number of replicas per GPU - Lower `max_model_len` to reduce KV cache size - Enable `enforce_eager=True` if not already set to ensure CUDA graph memory requirements don't cause issues ### Replicas fail to start - Verify that your fractional allocation matches your replica count (for example, 2 replicas with `GPU=0.49` each) - Confirm that `placement_group_config` matches the share you expect Ray to reserve - If you override `VLLM_RAY_PER_WORKER_GPUS` (not recommended) ensure it matches the GPU share from the placement group - Ensure your model size is appropriate for fractional GPU allocation ### Resource contention issues - Ensure `VLLM_DISABLE_COMPILE_CACHE=1` is set to avoid torch compile caching conflicts - Check Ray logs for resource allocation errors - Verify placement group configuration is applied correctly ## See also - {doc}`Quickstart <../quick-start>` - Basic LLM deployment examples - [Ray placement groups](https://docs.ray.io/en/latest/ray-core/scheduling/placement-group.html) - Ray Core placement group documentation --- # User guides How-to guides for deploying and configuring Ray Serve LLM features. ```{toctree} :maxdepth: 1 Cross-node parallelism Data parallel attention Deployment Initialization Prefill/decode disaggregation KV cache offloading Prefix-aware routing Multi-LoRA deployment vLLM compatibility Fractional GPU serving Observability and monitoring ``` --- (kv-cache-offloading-guide)= # KV cache offloading Extend KV cache capacity by offloading to CPU memory or local disk for larger batch sizes and reduced GPU memory pressure. :::{note} Ray Serve doesn't provide KV cache offloading out of the box, but integrates seamlessly with vLLM solutions. This guide demonstrates one such integration: LMCache. ::: Benefits of KV cache offloading: - **Increased capacity**: Store more KV caches by using CPU RAM or local storage instead of relying solely on GPU memory - **Cache reuse across requests**: Save and reuse previously computed KV caches for repeated or similar prompts, reducing prefill computation - **Flexible storage backends**: Choose from multiple storage options including local CPU, disk, or distributed systems Consider KV cache offloading when your application has repeated prompts or multi-turn conversations where you can reuse cached prefills. If consecutive conversation queries aren't sent immediately, the GPU evicts these caches to make room for other concurrent requests, causing cache misses. Offloading KV caches to CPU memory or other storage backends, which has much larger capacity, preserves them for longer periods. ## Deploy with LMCache LMCache provides KV cache offloading with support for multiple storage backends. 
### Prerequisites Install LMCache: ```bash uv pip install lmcache ``` ### Basic deployment The following example shows how to deploy with LMCache for local CPU offloading: ::::{tab-set} :::{tab-item} Python ```python from ray.serve.llm import LLMConfig, build_openai_app import ray.serve as serve llm_config = LLMConfig( model_loading_config={ "model_id": "qwen-0.5b", "model_source": "Qwen/Qwen2-0.5B-Instruct" }, engine_kwargs={ "tensor_parallel_size": 1, "kv_transfer_config": { "kv_connector": "LMCacheConnectorV1", "kv_role": "kv_both", } }, runtime_env={ "env_vars": { "LMCACHE_LOCAL_CPU": "True", "LMCACHE_CHUNK_SIZE": "256", "LMCACHE_MAX_LOCAL_CPU_SIZE": "100", # 100GB } } ) app = build_openai_app({"llm_configs": [llm_config]}) serve.run(app) ``` ::: :::{tab-item} YAML ```yaml applications: - name: llm-with-lmcache route_prefix: / import_path: ray.serve.llm:build_openai_app runtime_env: env_vars: LMCACHE_LOCAL_CPU: "True" LMCACHE_CHUNK_SIZE: "256" LMCACHE_MAX_LOCAL_CPU_SIZE: "100" args: llm_configs: - model_loading_config: model_id: qwen-0.5b model_source: Qwen/Qwen2-0.5B-Instruct engine_kwargs: tensor_parallel_size: 1 kv_transfer_config: kv_connector: LMCacheConnectorV1 kv_role: kv_both ``` Deploy with: ```bash serve run config.yaml ``` ::: :::: ## Compose multiple KV transfer backends with MultiConnector You can combine multiple KV transfer backends using `MultiConnector`. This is useful when you want both local offloading and cross-instance transfer in disaggregated deployments. ### When to use MultiConnector Use `MultiConnector` to combine multiple backends when you're using prefill/decode disaggregation and want both cross-instance transfer (NIXL) and local offloading. The following example shows how to combine NIXL (for cross-instance transfer) with LMCache (for local offloading) in a prefill/decode deployment: :::{note} The order of connectors matters. Since you want to prioritize local KV cache lookup through LMCache, it appears first in the list before the NIXL connector. 
::: ::::{tab-set} :::{tab-item} Python ```python from ray.serve.llm import LLMConfig, build_pd_openai_app import ray.serve as serve # Shared KV transfer config combining NIXL and LMCache kv_config = { "kv_connector": "MultiConnector", "kv_role": "kv_both", "kv_connector_extra_config": { "connectors": [ { "kv_connector": "LMCacheConnectorV1", "kv_role": "kv_both", }, { "kv_connector": "NixlConnector", "kv_role": "kv_both", "backends": ["UCX"], } ] } } prefill_config = LLMConfig( model_loading_config={ "model_id": "qwen-0.5b", "model_source": "Qwen/Qwen2-0.5B-Instruct" }, engine_kwargs={ "tensor_parallel_size": 1, "kv_transfer_config": kv_config, }, runtime_env={ "env_vars": { "LMCACHE_LOCAL_CPU": "True", "LMCACHE_CHUNK_SIZE": "256", "UCX_TLS": "all", } } ) decode_config = LLMConfig( model_loading_config={ "model_id": "qwen-0.5b", "model_source": "Qwen/Qwen2-0.5B-Instruct" }, engine_kwargs={ "tensor_parallel_size": 1, "kv_transfer_config": kv_config, }, runtime_env={ "env_vars": { "LMCACHE_LOCAL_CPU": "True", "LMCACHE_CHUNK_SIZE": "256", "UCX_TLS": "all", } } ) pd_config = { "prefill_config": prefill_config, "decode_config": decode_config, } app = build_pd_openai_app(pd_config) serve.run(app) ``` ::: :::{tab-item} YAML ```yaml applications: - name: pd-multiconnector route_prefix: / import_path: ray.serve.llm:build_pd_openai_app runtime_env: env_vars: LMCACHE_LOCAL_CPU: "True" LMCACHE_CHUNK_SIZE: "256" UCX_TLS: "all" args: prefill_config: model_loading_config: model_id: qwen-0.5b model_source: Qwen/Qwen2-0.5B-Instruct engine_kwargs: tensor_parallel_size: 1 kv_transfer_config: kv_connector: MultiConnector kv_role: kv_both kv_connector_extra_config: connectors: - kv_connector: LMCacheConnectorV1 kv_role: kv_both - kv_connector: NixlConnector kv_role: kv_both backends: ["UCX"] decode_config: model_loading_config: model_id: qwen-0.5b model_source: Qwen/Qwen2-0.5B-Instruct engine_kwargs: tensor_parallel_size: 1 kv_transfer_config: kv_connector: MultiConnector kv_role: kv_both kv_connector_extra_config: connectors: - kv_connector: LMCacheConnectorV1 kv_role: kv_both - kv_connector: NixlConnector kv_role: kv_both backends: ["UCX"] ``` Deploy with: ```bash serve run config.yaml ``` ::: :::: ## Configuration parameters ### LMCache environment variables - `LMCACHE_LOCAL_CPU`: Set to `"True"` to enable local CPU offloading - `LMCACHE_CHUNK_SIZE`: Size of KV cache chunks, in terms of tokens (default: 256) - `LMCACHE_MAX_LOCAL_CPU_SIZE`: Maximum CPU storage size in GB - `LMCACHE_PD_BUFFER_DEVICE`: Buffer device for prefill/decode scenarios (default: "cpu") For the full list of LMCache configuration options, see the [LMCache configuration reference](https://docs.lmcache.ai/api_reference/configurations.html). ### MultiConnector configuration - `kv_connector`: Set to `"MultiConnector"` to compose multiple backends - `kv_connector_extra_config.connectors`: List of connector configurations to compose. Order matters—connectors earlier in the list take priority. - Each connector in the list uses the same configuration format as standalone connectors ## Performance considerations Extending KV cache beyond local GPU memory introduces overhead for managing and looking up caches across different memory hierarchies. This creates a tradeoff: you gain larger cache capacity but may experience increased latency. Consider these factors: **Overhead in cache-miss scenarios**: When there are no cache hits, offloading adds modest overhead (~10-15%) compared to pure GPU caching, based on our internal experiments. 
This overhead comes from the additional hashing, data movement, and management operations. **Benefits with cache hits**: When caches can be reused, offloading significantly reduces prefill computation. For example, in multi-turn conversations where users return after minutes of inactivity, LMCache retrieves the conversation history from CPU rather than recomputing it, significantly reducing time to first token for follow-up requests. **Network transfer costs**: When combining MultiConnector with cross-instance transfer (such as NIXL), ensure that the benefits of disaggregation outweigh the network transfer costs. ## See also - {doc}`Prefill/decode disaggregation ` - Deploy LLMs with separated prefill and decode phases - [LMCache documentation](https://docs.lmcache.ai/) - Comprehensive LMCache configuration and features --- # Multi-LoRA deployment Deploy multiple fine-tuned LoRA adapters efficiently with Ray Serve LLM. ## Understand multi-LoRA deployment Multi-LoRA lets your model switch between different fine-tuned adapters at runtime without reloading the base model. Use multi-LoRA when your application needs to support multiple domains, users, or tasks using a single shared model backend. Following are the main reasons you might want to add adapters to your workflow: - **Parameter efficiency**: LoRA adapters are small, typically less than 1% of the base model's size. This makes them cheap to store, quick to load, and easy to swap in and out during inference, which is especially useful when memory is tight. - **Runtime adaptation**: With multi-LoRA, you can switch between different adapters at inference time without reloading the base model. This allows for dynamic behavior depending on user, task, domain, or context, all from a single deployment. - **Simpler MLOps**: Multi-LoRA cuts down on infrastructure complexity and cost by centralizing inference around one model. ### How request routing works When a request for a given LoRA adapter arrives, Ray Serve: 1. Checks if any replica has already loaded that adapter 2. Finds a replica with the adapter but isn't overloaded and routes the request to it 3. If all replicas with the adapter are overloaded, routes the request to a less busy replica, which loads the adapter 4. If no replica has the adapter loaded, routes the request to a replica according to the default request router logic (for example Power of 2) and loads it there Ray Serve LLM then caches the adapter for subsequent requests. Ray Serve LLM controls the cache of LoRA adapters on each replica through a Least Recently Used (LRU) mechanism with a max size, which you control with the `max_num_adapters_per_replica` variable. ## Configure Ray Serve LLM with multi-LoRA To enable multi-LoRA on your deployment, update your Ray Serve LLM configuration with these additional settings. ### LoRA configuration Set `dynamic_lora_loading_path` to your AWS or GCS storage path: ```python lora_config=dict( dynamic_lora_loading_path="s3://my_dynamic_lora_path", max_num_adapters_per_replica=16, # Optional: limit adapters per replica ) ``` - `dynamic_lora_loading_path`: Path to the directory containing LoRA checkpoint subdirectories. - `max_num_adapters_per_replica`: Maximum number of LoRA adapters cached per replica. Must match `max_loras`. 
### Engine arguments Forward these parameters to your vLLM engine: ```python engine_kwargs=dict( enable_lora=True, max_lora_rank=32, # Set to the highest LoRA rank you plan to use max_loras=16, # Must match max_num_adapters_per_replica ) ``` - `enable_lora`: Enable LoRA support in the vLLM engine. - `max_lora_rank`: Maximum LoRA rank supported. Set to the highest rank you plan to use. - `max_loras`: Maximum number of LoRAs per batch. Must match `max_num_adapters_per_replica`. ### Example The following example shows a complete multi-LoRA configuration: ```python from ray import serve from ray.serve.llm import LLMConfig, build_openai_app # Configure the model with LoRA llm_config = LLMConfig( model_loading_config=dict( model_id="qwen-0.5b", model_source="Qwen/Qwen2.5-0.5B-Instruct", ), lora_config=dict( # Assume this is where LoRA weights are stored on S3. # For example # s3://my_dynamic_lora_path/lora_model_1_ckpt # s3://my_dynamic_lora_path/lora_model_2_ckpt # are two of the LoRA checkpoints dynamic_lora_loading_path="s3://my_dynamic_lora_path", max_num_adapters_per_replica=16, # Need to set this to the same value as `max_loras`. ), engine_kwargs=dict( enable_lora=True, max_loras=16, # Need to set this to the same value as `max_num_adapters_per_replica`. ), deployment_config=dict( autoscaling_config=dict( min_replicas=1, max_replicas=2, ) ), accelerator_type="A10G", ) # Build and deploy the model app = build_openai_app({"llm_configs": [llm_config]}) serve.run(app, blocking=True) ``` ## Send requests to multi-LoRA adapters To query the base model, call your service as you normally would. To use a specific LoRA adapter at inference time, include the adapter name in your request using the following format: ``` : ``` where - `` is the `model_id` that you define in the Ray Serve LLM configuration - `` is the adapter's folder name in your cloud storage ### Example queries Query both the base model and different LoRA adapters: ```python from openai import OpenAI client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake-key") # Base model request (no adapter) response = client.chat.completions.create( model="qwen-0.5b", # No adapter messages=[{"role": "user", "content": "Hello!"}], ) # Adapter 1 response = client.chat.completions.create( model="qwen-0.5b:adapter_name_1", # Follow naming convention in your cloud storage messages=[{"role": "user", "content": "Hello!"}], stream=True, ) for chunk in response: if chunk.choices[0].delta.content is not None: print(chunk.choices[0].delta.content, end="", flush=True) # Adapter 2 response = client.chat.completions.create( model="qwen-0.5b:adapter_name_2", messages=[{"role": "user", "content": "Hello!"}], ) ``` ## See also - {doc}`Quickstart <../quick-start>` - [vLLM LoRA documentation](https://docs.vllm.ai/en/stable/models/lora.html) --- (observability-guide)= # Observability and monitoring Monitor your LLM deployments with built-in metrics, dashboards, and logging. Ray Serve LLM provides comprehensive observability with the following features: - **Service-level metrics**: Request and token behavior across deployed models. - **Engine metrics**: vLLM-specific performance metrics such as TTFT and TPOT. - **Grafana dashboards**: Pre-built dashboard for LLM-specific visualizations. - **Prometheus integration**: Export capability for all metrics for custom monitoring and alerting. ## Service-level metrics Ray enables LLM service-level logging by default, making these statistics available through Grafana and Prometheus. 
For more details on configuring Grafana and Prometheus, see {ref}`collect-metrics`. These higher-level metrics track request and token behavior across deployed models: - Average total tokens per request - Ratio of input tokens to generated tokens - Peak tokens per second - Request latency and throughput - Model-specific request counts ## Grafana dashboard Ray includes a Serve LLM-specific dashboard, which is automatically available in Grafana: ![](../images/serve_llm_dashboard.png) The dashboard includes visualizations for: - **Request metrics**: Throughput, latency, and error rates. - **Token metrics**: Input/output token counts and ratios. - **Performance metrics**: Time to first token (TTFT), time per output token (TPOT). - **Resource metrics**: GPU cache utilization, memory usage. ## Engine metrics All engine metrics, including vLLM, are available through the Ray metrics export endpoint and are queryable with Prometheus. See [vLLM metrics](https://docs.vllm.ai/en/stable/usage/metrics.html) for a complete list. The Serve LLM Grafana dashboard also visualizes these metrics. Key engine metrics include: - **Time to first token (TTFT)**: Latency before the first token is generated. - **Time per output token (TPOT)**: Average latency per generated token. - **GPU cache utilization**: KV cache memory usage. - **Batch size**: Current and average batch sizes. - **Throughput**: Requests per second and tokens per second. ### Configure engine metrics Engine metric logging is on by default as of Ray 2.51. To disable engine-level metric logging, set `log_engine_metrics: False` when configuring the LLM deployment: ::::{tab-set} :::{tab-item} Python :sync: builder ```python from ray import serve from ray.serve.llm import LLMConfig, build_openai_app llm_config = LLMConfig( model_loading_config=dict( model_id="qwen-0.5b", model_source="Qwen/Qwen2.5-0.5B-Instruct", ), deployment_config=dict( autoscaling_config=dict( min_replicas=1, max_replicas=2, ) ), log_engine_metrics=False # Disable engine metrics ) app = build_openai_app({"llm_configs": [llm_config]}) serve.run(app, blocking=True) ``` ::: :::{tab-item} YAML :sync: bind ```yaml # config.yaml applications: - args: llm_configs: - model_loading_config: model_id: qwen-0.5b model_source: Qwen/Qwen2.5-0.5B-Instruct accelerator_type: A10G deployment_config: autoscaling_config: min_replicas: 1 max_replicas: 2 log_engine_metrics: false # Disable engine metrics import_path: ray.serve.llm:build_openai_app name: llm_app route_prefix: "/" ``` ::: :::: ## Usage data collection The Ray Team collects usage data to improve Ray Serve LLM. The team collects data about the following features and attributes: - Model architecture used for serving. - Whether JSON mode is used. - Whether LoRA is used and how many LoRA weights are loaded initially at deployment time. - Whether autoscaling is used and the min and max replicas setup. - Tensor parallel size used. - Initial replicas count. - GPU type used and number of GPUs used. To opt out from usage data collection, see {ref}`Ray usage stats ` for how to disable it. ## See also - {ref}`collect-metrics` - Ray metrics collection guide - [vLLM metrics documentation](https://docs.vllm.ai/en/stable/usage/metrics.html) - {doc}`Troubleshooting <../troubleshooting>` - Common issues and solutions --- (prefill-decode-guide)= # Prefill/decode disaggregation Deploy LLMs with separated prefill and decode phases for better resource utilization and cost optimization. :::{warning} This feature requires vLLM v1, which is the default engine. 
For legacy deployments using vLLM v0, upgrade to v1 first. ::: Prefill/decode disaggregation separates the prefill phase (processing input prompts) from the decode phase (generating tokens). This separation provides: - **Independent optimization**: You can optimize prefill separately from decode with different configurations. - **Reduced interference**: Prefill operations can interfere with decode operations and vice versa, degrading performance during unpredictable traffic spikes. Disaggregation removes this contention. - **Independent scaling**: You can scale each phase independently based on demand. - **Cost optimization**: You can use different node types for different workloads, taking advantage of heterogeneous clusters. vLLM provides several KV transfer backends for disaggregated serving: 1. **NIXLConnector**: Network-based KV cache transfer using NVIDIA Inference Xfer Library (NIXL) with support for various backends such as UCX, libfabric, and EFA. Simple setup with minimal configuration. 2. **LMCacheConnectorV1**: Advanced caching solution with support for various storage backends, including integration with NIXL. ## When to use prefill/decode disaggregation Consider this pattern when: - You have variable workload patterns with different resource needs for prefill vs decode. - You want to optimize costs by using different hardware for different phases. - Your application has high throughput requirements that benefit from decoupling prefill and decode. ## Deploy with NIXLConnector NIXLConnector provides network-based KV cache transfer between prefill and decode servers with minimal configuration. ### Prerequisites If you use [ray-project/ray-llm](https://hub.docker.com/r/rayproject/ray-llm/tags) Docker images, NIXL is already installed. Otherwise, install it: ```bash uv pip install nixl ``` The NIXL wheel comes bundled with its supported backends (UCX, libfabric, EFA, etc.). These shared binaries may not be the latest version for your hardware and network stack. If you need the latest versions, install NIXL from source against the target backend library. See the [NIXL installation guide](https://github.com/ai-dynamo/nixl?tab=readme-ov-file#prerequisites-for-source-build) for details. ### Basic deployment The following example shows how to deploy with NIXLConnector: ```python from ray.serve.llm import LLMConfig, build_pd_openai_app import ray.serve as serve # Configure prefill instance prefill_config = LLMConfig( model_loading_config={ "model_id": "meta-llama/Llama-3.1-8B-Instruct" }, engine_kwargs={ "kv_transfer_config": { "kv_connector": "NixlConnector", "kv_role": "kv_both", } } ) # Configure decode instance decode_config = LLMConfig( model_loading_config={ "model_id": "meta-llama/Llama-3.1-8B-Instruct" }, engine_kwargs={ "kv_transfer_config": { "kv_connector": "NixlConnector", "kv_role": "kv_both", } } ) pd_config = dict( prefill_config=prefill_config, decode_config=decode_config, ) app = build_pd_openai_app(pd_config) serve.run(app) ``` ### Production YAML configuration For production deployments, use a YAML configuration file: ```{literalinclude} ../../doc_code/pd_dissagregation/nixl_example.yaml :language: yaml ``` Deploy with: ```bash serve deploy nixl_config.yaml ``` ### Configuration parameters - `kv_connector`: Set to `"NixlConnector"` to use NIXL. - `kv_role`: Set to `"kv_both"` for both prefill and decode instances. ## Deploy with LMCacheConnectorV1 LMCacheConnectorV1 provides advanced caching with support for multiple storage backends. 
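Before diving into the scenarios, the following minimal sketch shows how the prefill and decode `LLMConfig`s typically differ when using LMCache. The structure mirrors the NIXLConnector example above; only the `kv_transfer_config` changes, with the connector and roles taken from the configuration parameters listed later in this section. The model name is only an example, and the scenario YAML files below are the reference configuration.

```python
from ray import serve
from ray.serve.llm import LLMConfig, build_pd_openai_app

# Prefill side produces KV cache entries (kv_role per the configuration
# parameters below). kv_buffer_size and the LMCACHE_CONFIG_FILE environment
# variable are omitted here; see the scenario YAMLs for complete settings.
prefill_config = LLMConfig(
    model_loading_config={"model_id": "meta-llama/Llama-3.1-8B-Instruct"},
    engine_kwargs={
        "kv_transfer_config": {
            "kv_connector": "LMCacheConnectorV1",
            "kv_role": "kv_producer",
        }
    },
)

# Decode side consumes the transferred KV cache entries.
decode_config = LLMConfig(
    model_loading_config={"model_id": "meta-llama/Llama-3.1-8B-Instruct"},
    engine_kwargs={
        "kv_transfer_config": {
            "kv_connector": "LMCacheConnectorV1",
            "kv_role": "kv_consumer",
        }
    },
)

pd_config = dict(
    prefill_config=prefill_config,
    decode_config=decode_config,
)

app = build_pd_openai_app(pd_config)
serve.run(app)
```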
### Prerequisites Install LMCache: ```bash uv pip install lmcache ``` ### Scenario 1: LMCache with NIXL backend This configuration uses LMCache with a NIXL-based storage backend for network communication. The following is an example Ray Serve configuration for LMCache with NIXL: ```{literalinclude} ../../doc_code/pd_dissagregation/lmcache_nixl_example.yaml :language: yaml ``` Create the LMCache configuration for the prefill instance (`lmcache_prefiller.yaml`): ```{literalinclude} ../../doc_code/pd_dissagregation/lmcache/nixl/prefiller.yaml :language: yaml ``` Create the LMCache configuration for the decode instance (`lmcache_decoder.yaml`): ```{literalinclude} ../../doc_code/pd_dissagregation/lmcache/nixl/decoder.yaml :language: yaml ``` :::{note} The `LMCACHE_CONFIG_FILE` environment variable must point to an existing configuration file that's accessible within the Ray Serve container or worker environment. Ensure these configuration files are properly mounted or available in your deployment environment. ::: ### Scenario 2: LMCache with Mooncake store backend This configuration uses LMCache with Mooncake store, a high-performance distributed storage system. The following is an example Ray Serve configuration for LMCache with Mooncake: ```{literalinclude} ../../doc_code/pd_dissagregation/lmcache_mooncake_example.yaml :language: yaml ``` Create the LMCache configuration for Mooncake (`lmcache_mooncake.yaml`): ```{literalinclude} ../../doc_code/pd_dissagregation/lmcache/mooncake.yaml :language: yaml ``` :::{warning} For Mooncake deployments: - Ensure the etcd metadata server is running and accessible at the specified address. - Verify that you properly configured RDMA devices and storage servers and that they are accessible. - In containerized deployments, mount configuration files with appropriate read permissions (for example, `chmod 644`). - Ensure all referenced hostnames and IP addresses in configuration files are resolvable from the deployment environment. ::: ### Configuration parameters - `kv_connector`: Set to `"LMCacheConnectorV1"`. - `kv_role`: Set to `"kv_producer"` for prefill, `"kv_consumer"` for decode. - `kv_buffer_size`: Size of the KV cache buffer. - `LMCACHE_CONFIG_FILE`: Environment variable that specifies the configuration file path. ## Test your deployment Before deploying with LMCacheConnectorV1, start the required services: ```bash # Start etcd server if not already running docker run -d --name etcd-server \ -p 2379:2379 -p 2380:2380 \ quay.io/coreos/etcd:latest \ etcd --listen-client-urls http://0.0.0.0:2379 \ --advertise-client-urls http://localhost:2379 # For Mooncake backend, start the Mooncake master # See https://docs.lmcache.ai/kv_cache/mooncake.html for details mooncake_master --port 49999 ``` Test with a chat completion request: ```bash curl -X POST "http://localhost:8000/v1/chat/completions" \ -H "Content-Type: application/json" \ -d '{ "model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [ {"role": "user", "content": "Explain the benefits of prefill/decode disaggregation"} ], "max_tokens": 100, "temperature": 0.7 }' ``` ## Best practices - **Choose the right backend**: Use NIXLConnector for simpler deployments. Use LMCacheConnectorV1 when you need advanced caching or multiple storage backends. - **Monitor KV transfer overhead**: Ensure that the benefits of disaggregation outweigh the network transfer costs. Monitor latency and throughput. 
- **Scale independently**: Take advantage of independent scaling by monitoring resource utilization for each phase separately. - **Test with realistic workloads**: Validate performance improvements with your actual traffic patterns before production deployment. - **Ensure network connectivity**: For NIXLConnector, verify that prefill and decode instances can communicate over the network. - **Secure etcd access**: For LMCacheConnectorV1, ensure your etcd server is properly secured and accessible only to authorized services. ## Troubleshooting ### Prefill and decode instances can't communicate - Verify network connectivity between instances with sufficient bandwidth for KV transfer. - Check that your network supports the backend you're using (such as RDMA for high-performance deployments). - For NIXLConnector, ensure NIXL is properly installed on all nodes. - Verify firewall rules and security groups allow communication between prefill and decode instances. ### LMCache configuration not found - Verify the `LMCACHE_CONFIG_FILE` environment variable points to an existing file. - Ensure the configuration file is accessible from the Ray Serve worker environment. - Check that the file has appropriate read permissions. ## See also - [LMCache disaggregated serving documentation](https://docs.lmcache.ai/disaggregated_prefill/nixl/index.html) - [NIXLConnector usage guide](https://docs.vllm.ai/en/stable/features/nixl_connector_usage.html) - {doc}`Quickstart <../quick-start>` - Basic LLM deployment examples --- (prefix-aware-routing-guide)= # Prefix-aware routing Optimize LLM inference with cache locality using prefix-aware request routing. :::{warning} This API is in alpha and may change before becoming stable. ::: LLM inference can benefit significantly from cache locality optimization. When one replica processes multiple prompts that share a prefix, the engine can reuse previously computed KV-cache entries, reducing computation overhead and improving response times. This technique is known as [Automatic Prefix Caching (APC)](https://docs.vllm.ai/en/stable/features/automatic_prefix_caching.html) in vLLM. The `PrefixCacheAffinityRouter` routes requests with similar prefixes to the same replicas, maximizing KV cache hit rates. ## When to use prefix-aware routing Use prefix-aware routing when: - Your workload has many requests with shared prefixes (for example, same system prompts or few-shot examples) - You're using vLLM with Automatic Prefix Caching enabled - Cache hit rate is more important than perfect load balance in balanced scenarios ## How it works The `PrefixCacheAffinityRouter` implements a multi-tier routing strategy that balances cache locality with load distribution: ### 1. Load balance check First, it evaluates whether the current load is balanced across replicas by comparing queue lengths. If the difference between the highest and lowest queue lengths is below the `imbalanced_threshold`, it proceeds with prefix cache-aware routing. ### 2. Prefix matching strategy When load is balanced, the router uses a prefix tree to find replicas that have previously processed similar input text: - **High match rate (≥10%)**: Routes to replicas with the highest prefix match rate for better cache hit rates - **Low match rate (<10%)**: Falls back to replicas with the lowest prefix cache utilization to increase utilization - **No prefix data**: Uses the default Power of Two Choices selection ### 3. 
Imbalanced load fallback When load is imbalanced (queue length difference exceeds threshold), the router prioritizes load balancing over cache locality and falls back to the standard Power of Two Choices algorithm. ### Prefix tree management The router maintains a distributed prefix tree actor that: - Tracks input text prefixes processed by each replica - Supports automatic eviction of old entries to manage memory usage - Persists across router instances using Ray's detached actor pattern ## Deploy with prefix-aware routing The following example shows how to deploy an LLM with prefix-aware routing: ```{literalinclude} ../../../llm/doc_code/serve/prefix_aware_router/prefix_aware_example.py :start-after: __prefix_aware_example_start__ :end-before: __prefix_aware_example_end__ :language: python ``` ## Configuration parameters The `PrefixCacheAffinityRouter` provides several configuration parameters to tune its behavior: ### Core routing parameters - **`imbalanced_threshold`** (default: 10): Queue length difference threshold for considering load balanced. Lower values prioritize load balancing over cache locality. - **`match_rate_threshold`** (default: 0.1): Minimum prefix match rate (0.0-1.0) required to use prefix cache-aware routing. Higher values require stronger prefix matches before routing for cache locality. ### Memory management parameters - **`do_eviction`** (default: False): Enable automatic eviction of old prefix tree entries to approximate the LLM engine's eviction policy. - **`eviction_threshold_chars`** (default: 400,000): Maximum number of characters in the prefix tree before the LLM engine triggers an eviction. - **`eviction_target_chars`** (default: 360,000): Target number of characters to reduce the prefix tree to during eviction. - **`eviction_interval_secs`** (default: 10): Interval in seconds between eviction checks when eviction is enabled. ## Best practices - **Enable vLLM APC**: Make sure to set `enable_prefix_caching=True` in your `engine_kwargs` for the router to have any effect - **Tune thresholds**: Adjust `imbalanced_threshold` and `match_rate_threshold` based on your workload characteristics - **Monitor cache hit rates**: Track vLLM's cache hit metrics to verify the router is improving performance - **Start conservative**: Begin with default settings and tune incrementally based on observed behavior ## See also - {doc}`Architecture: Request routing <../architecture/routing-policies>` - [vLLM Automatic Prefix Caching](https://docs.vllm.ai/en/stable/features/automatic_prefix_caching.html) --- (vllm-compatibility-guide)= # vLLM compatibility Ray Serve LLM provides an OpenAI-compatible API that aligns with vLLM's OpenAI-compatible server. Most of the `engine_kwargs` that work with `vllm serve` work with Ray Serve LLM, giving you access to vLLM's feature set through Ray Serve's distributed deployment capabilities. This compatibility means you can: - Use the same model configurations and engine arguments as vLLM - Leverage vLLM's latest features (multimodal, structured output, reasoning models) - Switch between `vllm serve` and Ray Serve LLM with no code changes and scale - Take advantage of Ray Serve's production features (autoscaling, multi-model serving, advanced routing) This guide shows how to use vLLM features such as embeddings, structured output, vision language models, and reasoning models with Ray Serve. ## Embeddings You can generate embeddings by setting the `task` parameter to `"embed"` in the engine arguments. 
Models supporting this use case are listed in the [vLLM text embedding models documentation](https://docs.vllm.ai/en/stable/models/supported_models.html#text-embedding-task-embed). ### Deploy an embedding model ::::{tab-set} :::{tab-item} Server :sync: server ```python from ray import serve from ray.serve.llm import LLMConfig, build_openai_app llm_config = LLMConfig( model_loading_config=dict( model_id="qwen-0.5b", model_source="Qwen/Qwen2.5-0.5B-Instruct", ), engine_kwargs=dict( task="embed", ), ) app = build_openai_app({"llm_configs": [llm_config]}) serve.run(app, blocking=True) ``` ::: :::{tab-item} Python Client :sync: client ```python from openai import OpenAI # Initialize client client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake-key") # Generate embeddings response = client.embeddings.create( model="qwen-0.5b", input=["A text to embed", "Another text to embed"], ) for data in response.data: print(data.embedding) # List of float of len 4096 ``` ::: :::{tab-item} cURL :sync: curl ```bash curl -X POST http://localhost:8000/v1/embeddings \ -H "Content-Type: application/json" \ -H "Authorization: Bearer fake-key" \ -d '{ "model": "qwen-0.5b", "input": ["A text to embed", "Another text to embed"], "encoding_format": "float" }' ``` ::: :::: ## Transcriptions You can generate audio transcriptions using Speech-to-Text (STT) models trained specifically for Automatic Speech Recognition (ASR) tasks. Models supporting this use case are listed in the [vLLM transcription models documentation](https://docs.vllm.ai/en/stable/models/supported_models.html). ### Deploy a transcription model ::::{tab-set} :::{tab-item} Server :sync: server ```{literalinclude} ../../../llm/doc_code/serve/transcription/transcription_example.py :language: python :start-after: __transcription_example_start__ :end-before: __transcription_example_end__ ``` ::: :::{tab-item} Python Client :sync: client ```python from openai import OpenAI # Initialize client client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake-key") # Open audio file with open("/path/to/audio.wav", "rb") as f: # Make a request to the transcription model response = client.audio.transcriptions.create( model="whisper-large", file=f, temperature=0.0, language="en", ) print(response.text) ``` ::: :::{tab-item} cURL :sync: curl ```bash curl http://localhost:8000/v1/audio/transcriptions \ -X POST \ -H "Authorization: Bearer fake-key" \ -F "file=@/path/to/audio.wav" \ -F "model=whisper-large" \ -F "temperature=0.0" \ -F "language=en" ``` ::: :::: ## Structured output You can request structured JSON output similar to OpenAI's API using JSON mode or JSON schema validation with Pydantic models. 
### JSON mode ::::{tab-set} :::{tab-item} Server :sync: server ```python from ray import serve from ray.serve.llm import LLMConfig, build_openai_app llm_config = LLMConfig( model_loading_config=dict( model_id="qwen-0.5b", model_source="Qwen/Qwen2.5-0.5B-Instruct", ), deployment_config=dict( autoscaling_config=dict( min_replicas=1, max_replicas=2, ) ), accelerator_type="A10G", ) # Build and deploy the model app = build_openai_app({"llm_configs": [llm_config]}) serve.run(app, blocking=True) ``` ::: :::{tab-item} Client (JSON Object) :sync: client ```python from openai import OpenAI # Initialize client client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake-key") # Request structured JSON output response = client.chat.completions.create( model="qwen-0.5b", response_format={"type": "json_object"}, messages=[ { "role": "system", "content": "You are a helpful assistant that outputs JSON." }, { "role": "user", "content": "List three colors in JSON format" } ], stream=True, ) for chunk in response: if chunk.choices[0].delta.content is not None: print(chunk.choices[0].delta.content, end="", flush=True) # Example response: # { # "colors": [ # "red", # "blue", # "green" # ] # } ``` ::: :::: ### JSON schema with Pydantic You can specify the exact schema you want for the response using Pydantic models: ```python from openai import OpenAI from typing import List, Literal from pydantic import BaseModel # Initialize client client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake-key") # Define a pydantic model of a preset of allowed colors class Color(BaseModel): colors: List[Literal["cyan", "magenta", "yellow"]] # Request structured JSON output response = client.chat.completions.create( model="qwen-0.5b", response_format={ "type": "json_schema", "json_schema": Color.model_json_schema() }, messages=[ { "role": "system", "content": "You are a helpful assistant that outputs JSON." }, { "role": "user", "content": "List three colors in JSON format" } ], stream=True, ) for chunk in response: if chunk.choices[0].delta.content is not None: print(chunk.choices[0].delta.content, end="", flush=True) # Example response: # { # "colors": [ # "cyan", # "magenta", # "yellow" # ] # } ``` ## Vision language models You can deploy multimodal models that process both text and images. Ray Serve LLM supports vision models through vLLM's multimodal capabilities. ### Deploy a vision model ::::{tab-set} :::{tab-item} Server :sync: server ```python from ray import serve from ray.serve.llm import LLMConfig, build_openai_app # Configure a vision model llm_config = LLMConfig( model_loading_config=dict( model_id="pixtral-12b", model_source="mistral-community/pixtral-12b", ), deployment_config=dict( autoscaling_config=dict( min_replicas=1, max_replicas=2, ) ), accelerator_type="L40S", engine_kwargs=dict( tensor_parallel_size=1, max_model_len=8192, ), ) # Build and deploy the model app = build_openai_app({"llm_configs": [llm_config]}) serve.run(app, blocking=True) ``` ::: :::{tab-item} Client :sync: client ```python from openai import OpenAI # Initialize client client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake-key") # Create and send a request with an image response = client.chat.completions.create( model="pixtral-12b", messages=[ { "role": "user", "content": [ { "type": "text", "text": "What's in this image?" 
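                # Mix text and image_url content parts in a single user message (OpenAI multimodal format).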
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/image.jpg"
                    }
                }
            ]
        }
    ],
    stream=True,
)

for chunk in response:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)
```
:::

::::

### Supported models

For a complete list of supported vision models, see the [vLLM multimodal models documentation](https://docs.vllm.ai/en/stable/models/supported_models.html#multimodal-language-models).

## Reasoning models

Ray Serve LLM supports reasoning models such as DeepSeek-R1 and QwQ through vLLM. These models use extended thinking processes before generating final responses.

For reasoning model support and configuration, see the [vLLM reasoning models documentation](https://docs.vllm.ai/en/stable/models/supported_models.html).

## See also

- [vLLM supported models](https://docs.vllm.ai/en/stable/models/supported_models.html) - Complete list of supported models and features
- [vLLM OpenAI compatibility](https://docs.vllm.ai/en/stable/serving/openai_compatible_server.html) - vLLM's OpenAI-compatible server documentation
- {doc}`Quickstart <../quick-start>` - Basic LLM deployment examples

---

(serve-model-multiplexing)=

# Model Multiplexing

This section helps you understand how to write a multiplexed deployment by using the `serve.multiplexed` and `serve.get_multiplexed_model_id` APIs.

## Why model multiplexing?

Model multiplexing is a technique used to efficiently serve multiple models with similar input types from a pool of replicas. Traffic is routed to the corresponding model based on the request header. By serving multiple models from a shared pool of replicas, model multiplexing optimizes cost and load balances the traffic. This is useful in cases where you might have many models with the same shape but different weights that are sparsely invoked. If any replica for the deployment has the model loaded, incoming traffic for that model (based on the request header) is automatically routed to that replica, avoiding unnecessary load time.

## Writing a multiplexed deployment

To write a multiplexed deployment, use the `serve.multiplexed` and `serve.get_multiplexed_model_id` APIs.

Assume you have multiple PyTorch models inside an AWS S3 bucket with the following structure:

```
s3://my_bucket/1/model.pt
s3://my_bucket/2/model.pt
s3://my_bucket/3/model.pt
s3://my_bucket/4/model.pt
...
```

Define a multiplexed deployment:

```{literalinclude} doc_code/multiplexed.py
:language: python
:start-after: __serve_deployment_example_begin__
:end-before: __serve_deployment_example_end__
```

:::{note}
The `serve.multiplexed` API also has a `max_num_models_per_replica` parameter. Use it to configure how many models to load in a single replica. If the number of models is larger than `max_num_models_per_replica`, Serve uses the LRU policy to evict the least recently used model.
:::

:::{tip}
This code example uses the PyTorch Model object. You can also define your own model class and use it here. To release resources when the model is evicted, implement the `__del__` method; Ray Serve calls it internally when evicting the model.
:::

`serve.get_multiplexed_model_id` retrieves the model ID from the request header. This ID is then passed to the `get_model` function. If the model is not already cached in the replica, Serve loads it from the S3 bucket. Otherwise, the cached model is returned.

:::{note}
Internally, the Serve router uses the model ID in the request header to route traffic to a corresponding replica.
If all replicas that have the model are over-subscribed, Ray Serve routes the request to a new replica, which then loads and caches the model from the S3 bucket.
:::

To send a request to a specific model, include the `serve_multiplexed_model_id` field in the request header, and set the value to the model ID to which you want to send the request.

```{literalinclude} doc_code/multiplexed.py
:language: python
:start-after: __serve_request_send_example_begin__
:end-before: __serve_request_send_example_end__
```

:::{note}
`serve_multiplexed_model_id` is required in the request header, and the value should be the model ID you want to send the request to. If the `serve_multiplexed_model_id` isn't found in the request header, Serve treats the request as a normal request and routes it to a random replica.
:::

After you run the above code, you should see the following lines in the deployment logs:

```
INFO 2023-05-24 01:19:03,853 default_Model default_Model#EjYmnQ CUpzhwUUNw / default replica.py:442 - Started executing request CUpzhwUUNw
INFO 2023-05-24 01:19:03,854 default_Model default_Model#EjYmnQ CUpzhwUUNw / default multiplex.py:131 - Loading model '1'.
INFO 2023-05-24 01:19:04,859 default_Model default_Model#EjYmnQ CUpzhwUUNw / default replica.py:542 - __CALL__ OK 1005.8ms
```

If you continue to load more models and exceed the `max_num_models_per_replica`, Serve evicts the least recently used model and you see the following lines in the deployment logs:

```
INFO 2023-05-24 01:19:15,988 default_Model default_Model#rimNjA WzjTbJvbPN / default replica.py:442 - Started executing request WzjTbJvbPN
INFO 2023-05-24 01:19:15,988 default_Model default_Model#rimNjA WzjTbJvbPN / default multiplex.py:145 - Unloading model '3'.
INFO 2023-05-24 01:19:15,988 default_Model default_Model#rimNjA WzjTbJvbPN / default multiplex.py:131 - Loading model '4'.
INFO 2023-05-24 01:19:16,993 default_Model default_Model#rimNjA WzjTbJvbPN / default replica.py:542 - __CALL__ OK 1005.7ms
```

You can also send a request to a specific model by using the handle {mod}`options ` API.

```{literalinclude} doc_code/multiplexed.py
:language: python
:start-after: __serve_handle_send_example_begin__
:end-before: __serve_handle_send_example_end__
```

When using model composition, you can send requests from an upstream deployment to a multiplexed deployment using the Serve DeploymentHandle. You need to set the `multiplexed_model_id` in the options. For example:

```{literalinclude} doc_code/multiplexed.py
:language: python
:start-after: __serve_model_composition_example_begin__
:end-before: __serve_model_composition_example_end__
```

## Using model multiplexing with batching

You can combine model multiplexing with the `@serve.batch` decorator for efficient batched inference. When you use both features together, Ray Serve automatically splits batches by model ID to ensure each batch contains only requests for the same model. This prevents issues where a single batch would contain requests targeting different models.

The following example shows how to combine multiplexing with batching:

```{literalinclude} doc_code/multiplexed.py
:language: python
:start-after: __serve_multiplexed_batching_example_begin__
:end-before: __serve_multiplexed_batching_example_end__
```

:::{note}
`serve.get_multiplexed_model_id()` works correctly inside functions decorated with `@serve.batch`.
Ray Serve guarantees that all requests in a batch have the same `multiplexed_model_id`, so you can safely use this value to load and apply the appropriate model for the entire batch. ::: --- (serve-model-composition)= # Deploy Compositions of Models With this guide, you can: * Compose multiple {ref}`deployments ` containing ML models or business logic into a single {ref}`application ` * Independently scale and configure each of your ML models and business logic steps :::{note} The deprecated `RayServeHandle` and `RayServeSyncHandle` APIs have been fully removed as of Ray 2.10. ::: ## Compose deployments using DeploymentHandles When building an application, you can `.bind()` multiple deployments and pass them to each other's constructors. At runtime, inside the deployment code Ray Serve substitutes the bound deployments with {ref}`DeploymentHandles ` that you can use to call methods of other deployments. This capability lets you divide your application's steps, such as preprocessing, model inference, and post-processing, into independent deployments that you can independently scale and configure. Use {mod}`handle.remote ` to send requests to a deployment. These requests can contain ordinary Python args and kwargs, which DeploymentHandles can pass directly to the method. The method call returns a {mod}`DeploymentResponse ` that represents a future to the output. You can `await` the response to retrieve its result or pass it to another downstream {mod}`DeploymentHandle ` call. (serve-model-composition-deployment-handles)= ## Basic DeploymentHandle example This example has two deployments: ```{literalinclude} doc_code/model_composition/language_example.py :start-after: __hello_start__ :end-before: __hello_end__ :language: python :linenos: true ``` In line 42, the `LanguageClassifier` deployment takes in the `spanish_responder` and `french_responder` as constructor arguments. At runtime, Ray Serve converts these arguments into `DeploymentHandles`. `LanguageClassifier` can then call the `spanish_responder` and `french_responder`'s deployment methods using this handle. For example, the `LanguageClassifier`'s `__call__` method uses the HTTP request's values to decide whether to respond in Spanish or French. It then forwards the request's name to the `spanish_responder` or the `french_responder` on lines 19 and 21 using the `DeploymentHandle`s. The format of the calls is as follows: ```python response: DeploymentResponse = self.spanish_responder.say_hello.remote(name) ``` This call has a few parts: * `self.spanish_responder` is the `SpanishResponder` handle taken in through the constructor. * `say_hello` is the `SpanishResponder` method to invoke. * `remote` indicates that this is a `DeploymentHandle` call to another deployment. * `name` is the argument for `say_hello`. You can pass any number of arguments or keyword arguments here. This call returns a `DeploymentResponse` object, which is a reference to the result, rather than the result itself. This pattern allows the call to execute asynchronously. To get the actual result, `await` the response. `await` blocks until the asynchronous call executes and then returns the result. In this example, line 25 calls `await response` and returns the resulting string. (serve-model-composition-await-warning)= :::{warning} You can use the `response.result()` method to get the return value of remote `DeploymentHandle` calls. 
However, avoid calling `.result()` from inside a deployment because it blocks the deployment from executing any other code until the remote method call finishes. Using `await` lets the deployment process other requests while waiting for the remote method call to finish. You should use `await` instead of `.result()` inside deployments. ::: You can copy the preceding `hello.py` script and run it with `serve run`. Make sure to run the command from a directory containing `hello.py`, so it can locate the script: ```console $ serve run hello:language_classifier ``` You can use this client script to interact with the example: ```{literalinclude} doc_code/model_composition/language_example.py :start-after: __hello_client_start__ :end-before: __hello_client_end__ :language: python ``` While the `serve run` command is running, open a separate terminal window and run the script: ```console $ python hello_client.py Hola Dora ``` :::{note} Composition lets you break apart your application and independently scale each part. For instance, suppose this `LanguageClassifier` application's requests were 75% Spanish and 25% French. You could scale your `SpanishResponder` to have 3 replicas and your `FrenchResponder` to have 1 replica, so you can meet your workload's demand. This flexibility also applies to reserving resources like CPUs and GPUs, as well as any other configurations you can set for each deployment. With composition, you can avoid application-level bottlenecks when serving models and business logic steps that use different types and amounts of resources. ::: ## Chaining DeploymentHandle calls Ray Serve can directly pass the `DeploymentResponse` object that a `DeploymentHandle` returns, to another `DeploymentHandle` call to chain together multiple stages of a pipeline. You don't need to `await` the first response, Ray Serve manages the `await` behavior under the hood. When the first call finishes, Ray Serve passes the output of the first call, instead of the `DeploymentResponse` object, directly to the second call. For example, the code sample below defines three deployments in an application: - An `Adder` deployment that increments a value by its configured increment. - A `Multiplier` deployment that multiplies a value by its configured multiple. - An `Ingress` deployment that chains calls to the adder and multiplier together and returns the final response. Note how the response from the `Adder` handle passes directly to the `Multiplier` handle, but inside the multiplier, the input argument resolves to the output of the `Adder` call. ```{literalinclude} doc_code/model_composition/chaining_example.py :start-after: __chaining_example_start__ :end-before: __chaining_example_end__ :language: python ``` ## Streaming DeploymentHandle calls You can also use `DeploymentHandles` to make streaming method calls that return multiple outputs. To make a streaming call, the method must be a generator and you must set `handle.options(stream=True)`. Then, the handle call returns a {mod}`DeploymentResponseGenerator ` instead of a unary `DeploymentResponse`. You can use `DeploymentResponseGenerators` as a sync or async generator, like in an `async for` code block. Similar to `DeploymentResponse.result()`, avoid using a `DeploymentResponseGenerator` as a sync generator within a deployment, as that blocks other requests from executing concurrently on that replica. Note that you can't pass `DeploymentResponseGenerators` to other handle calls. 
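The following is a minimal sketch of this pattern using hypothetical `Tokenizer` and `Caller` deployments (names and the `stream_tokens` method are placeholders; the documented example follows):

```python
from ray import serve
from ray.serve.handle import DeploymentResponseGenerator


@serve.deployment
class Tokenizer:
    # A generator method: yields outputs one at a time.
    def stream_tokens(self, text: str):
        for token in text.split():
            yield token


@serve.deployment
class Caller:
    def __init__(self, tokenizer):
        # stream=True makes handle calls return a DeploymentResponseGenerator.
        self._tokenizer = tokenizer.options(stream=True)

    async def __call__(self, request) -> str:
        gen: DeploymentResponseGenerator = self._tokenizer.stream_tokens.remote(
            "hello streaming world"
        )
        # Consume the streaming call as an async generator.
        tokens = [token async for token in gen]
        return " ".join(tokens)


app = Caller.bind(Tokenizer.bind())
```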
Example:

```{literalinclude} doc_code/model_composition/streaming_example.py
:start-after: __streaming_example_start__
:end-before: __streaming_example_end__
:language: python
```

## Advanced: Pass a DeploymentResponse in a nested object (fully deprecated)

:::{warning}
Passing a `DeploymentResponse` to downstream handle calls in nested objects is fully deprecated and no longer supported. Instead, manually use `DeploymentResponse._to_object_ref()` to pass the corresponding object reference in nested objects. Passing a `DeploymentResponse` object as a top-level argument or keyword argument is still supported.
:::

## Advanced: Convert a DeploymentResponse to a Ray ObjectRef

Under the hood, each `DeploymentResponse` corresponds to a Ray `ObjectRef`, or an `ObjectRefGenerator` for streaming calls. To compose `DeploymentHandle` calls with Ray Actors or Tasks, you may want to resolve the response to its `ObjectRef`. For this purpose, you can use the {mod}`DeploymentResponse._to_object_ref ` and {mod}`DeploymentResponse._to_object_ref_sync ` developer APIs.

Example:

```{literalinclude} doc_code/model_composition/response_to_object_ref_example.py
:start-after: __response_to_object_ref_example_start__
:end-before: __response_to_object_ref_example_end__
:language: python
```

---

(serve-monitoring)=

# Monitor Your Application

This section helps you debug and monitor your Serve applications by:

* viewing the Ray dashboard
* viewing the `serve status` output
* using Ray logging and Loki
* inspecting built-in Ray Serve metrics
* exporting metrics into the Arize platform

## Ray Dashboard

You can use the Ray dashboard to get a high-level overview of your Ray cluster and Ray Serve application's states. This includes details such as:

* the number of deployment replicas currently running
* logs for your Serve controller, deployment replicas, and proxies
* the Ray nodes (i.e. machines) running in your Ray cluster.

You can access the Ray dashboard at port 8265 at your cluster's URI. For example, if you're running Ray Serve locally, you can access the dashboard by going to `http://localhost:8265` in your browser.

View important information about your application by accessing the [Serve page](dash-serve-view).

```{image} https://raw.githubusercontent.com/ray-project/Images/master/docs/new-dashboard-v2/serve.png
:align: center
```

This example has a single-node cluster running a deployment named `Translator`. This deployment has 2 replicas.

View details of these replicas by browsing to the Serve page and opening the details page for each replica. From there, you can view metadata about the replica and its logs, including the `logging` and `print` statements generated by the replica process.

Another useful view is the [Actors view](dash-actors-view). This example Serve application uses four [Ray actors](actor-guide):

- 1 Serve controller
- 1 HTTP proxy
- 2 `Translator` deployment replicas

You can see the details of these entities throughout the Serve page and in the Actors page. This page includes additional useful information like each actor's process ID (PID) and a link to each actor's logs. You can also see whether any particular actor is alive or dead to help you debug potential cluster failures.

:::{tip}
To learn more about the Serve controller actor, the HTTP proxy actor(s), the deployment replicas, and how they all work together, check out the [Serve Architecture](serve-architecture) documentation.
:::

For a detailed overview of the Ray dashboard, see the [dashboard documentation](observability-getting-started).

(serve-in-production-inspecting)=

## Inspect applications with the Serve CLI

Two Serve CLI commands help you inspect a Serve application in production: `serve config` and `serve status`.
If you have a remote cluster, `serve config` and `serve status` also have an `--address/-a` argument to access the cluster. See [VM deployment](serve-in-production-remote-cluster) for more information on this argument.

`serve config` gets the latest config file that the Ray Cluster received. This config file represents the Serve application's goal state. The Ray Cluster constantly strives to reach and maintain this state by deploying deployments, recovering failed replicas, and performing other relevant actions.

Using the `serve_config.yaml` example from [the production guide](production-config-yaml):

```console
$ ray start --head
$ serve deploy serve_config.yaml
...

$ serve config
name: default
route_prefix: /
import_path: text_ml:app
runtime_env:
  pip:
    - torch
    - transformers
deployments:
- name: Translator
  num_replicas: 1
  user_config:
    language: french
- name: Summarizer
  num_replicas: 1
```

`serve status` gets your Serve application's current status. This command reports the status of the `proxies` and the `applications` running on the Ray cluster.

`proxies` lists each proxy's status. Each proxy is identified by the node ID of the node that it runs on. A proxy has five possible statuses:

* `STARTING`: The proxy is starting up and is not yet ready to serve requests.
* `HEALTHY`: The proxy is capable of serving requests. It is behaving normally.
* `UNHEALTHY`: The proxy has failed its health-checks. It will be killed, and a new proxy will be started on that node.
* `DRAINING`: The proxy is healthy but is closed to new requests. It may contain pending requests that are still being processed.
* `DRAINED`: The proxy is closed to new requests. There are no pending requests.

`applications` contains a list of applications, their overall statuses, and their deployments' statuses. Each entry in `applications` maps an application's name to four fields:

* `status`: A Serve application has four possible overall statuses:
    * `"NOT_STARTED"`: No application has been deployed on this cluster.
    * `"DEPLOYING"`: The application is currently carrying out a `serve deploy` request. It is deploying new deployments or updating existing ones.
    * `"RUNNING"`: The application is at steady-state. It has finished executing any previous `serve deploy` requests, and is attempting to maintain the goal state set by the latest `serve deploy` request.
    * `"DEPLOY_FAILED"`: The latest `serve deploy` request has failed.
* `message`: Provides context on the current status.
* `deployment_timestamp`: A UNIX timestamp of when Serve received the last `serve deploy` request. The timestamp is calculated using the `ServeController`'s local clock.
* `deployments`: A list of entries representing each deployment's status. Each entry maps a deployment's name to three fields:
    * `status`: A Serve deployment has six possible statuses:
        * `"UPDATING"`: The deployment is updating to meet the goal state set by a previous `deploy` request.
        * `"HEALTHY"`: The deployment is healthy and running at the target replica count.
        * `"UNHEALTHY"`: The deployment has updated and has become unhealthy afterwards. This condition may be due to replicas failing to upscale, replicas failing health checks, or a general system or machine error.
        * `"DEPLOY_FAILED"`: The deployment failed to start or update. This condition is likely due to an error in the deployment's constructor.
        * `"UPSCALING"`: The deployment (with autoscaling enabled) is upscaling the number of replicas.
        * `"DOWNSCALING"`: The deployment (with autoscaling enabled) is downscaling the number of replicas.
    * `replica_states`: A list of the replicas' states and the number of replicas in that state. Each replica has five possible states:
        * `STARTING`: The replica is starting and not yet ready to serve requests.
        * `UPDATING`: The replica is undergoing a `reconfigure` update.
        * `RECOVERING`: The replica is recovering its state.
        * `RUNNING`: The replica is running normally and able to serve requests.
        * `STOPPING`: The replica is being stopped.
    * `message`: Provides context on the current status.

Use the `serve status` command to inspect your deployments after they are deployed and throughout their lifetime.

Using the `serve_config.yaml` example from [an earlier section](production-config-yaml):

```console
$ ray start --head
$ serve deploy serve_config.yaml
...

$ serve status
proxies:
  cef533a072b0f03bf92a6b98cb4eb9153b7b7c7b7f15954feb2f38ec: HEALTHY
applications:
  default:
    status: RUNNING
    message: ''
    last_deployed_time_s: 1694041157.2211847
    deployments:
      Translator:
        status: HEALTHY
        replica_states:
          RUNNING: 1
        message: ''
      Summarizer:
        status: HEALTHY
        replica_states:
          RUNNING: 1
        message: ''
```

For Kubernetes deployments with KubeRay, tighter integrations of `serve status` with Kubernetes are available. See [Getting the status of Serve applications in Kubernetes](serve-getting-status-kubernetes).

## Get application details in Python

Call the `serve.status()` API to get Serve application details in Python. `serve.status()` returns the same information as the `serve status` CLI command inside a `dataclass`. Use this method inside a deployment or a Ray driver script to obtain live information about the Serve applications on the Ray cluster. For example, this `monitoring_app` reports all the `RUNNING` Serve applications on the cluster:

```{literalinclude} doc_code/monitoring/monitor_deployment.py
:start-after: __monitor_start__
:end-before: __monitor_end__
:language: python
```

(serve-logging)=

## Ray logging

To understand system-level behavior and to surface application-level details during runtime, you can leverage Ray logging.

Ray Serve uses Python's standard `logging` module with a logger named `"ray.serve"`. By default, logs are emitted from actors both to `stderr` and on disk on each node at `/tmp/ray/session_latest/logs/serve/`. This includes both system-level logs from the Serve controller and proxy as well as access logs and custom user logs produced from within deployment replicas.

In development, logs are streamed to the driver Ray program (the Python script that calls `serve.run()` or the `serve run` CLI command), so it's convenient to keep the driver running while debugging.

For example, let's run a basic Serve application and view the logs that it emits. First, let's create a simple deployment that logs a custom log message when it's queried:

```{literalinclude} doc_code/monitoring/monitoring.py
:start-after: __start__
:end-before: __end__
:language: python
```

Run this deployment using the `serve run` CLI command:

```console
$ serve run monitoring:say_hello

2023-04-10 15:57:32,100 INFO scripts.py:380 -- Deploying from import path: "monitoring:say_hello".
[2023-04-10 15:57:33] INFO ray._private.worker::Started a local Ray instance.
View the dashboard at http://127.0.0.1:8265
(ServeController pid=63503) INFO 2023-04-10 15:57:35,822 controller 63503 deployment_state.py:1168 - Deploying new version of deployment SayHello.
(ProxyActor pid=63513) INFO:     Started server process [63513]
(ServeController pid=63503) INFO 2023-04-10 15:57:35,882 controller 63503 deployment_state.py:1386 - Adding 1 replica to deployment SayHello.
2023-04-10 15:57:36,840 SUCC scripts.py:398 -- Deployed Serve app successfully.
```

`serve run` prints a few log messages immediately. Note that a few of these messages start with identifiers such as

```
(ServeController pid=63881)
```

These messages are logs from Ray Serve [actors](actor-guide). They describe which actor (Serve controller, proxy, or deployment replica) created the log and what its process ID is (which is useful when distinguishing between different deployment replicas or proxies). The rest of these log messages are the actual log statements generated by the actor.

While `serve run` is running, we can query the deployment in a separate terminal window:

```
curl -X GET http://localhost:8000/
```

This causes the HTTP proxy and deployment replica to print log statements to the terminal running `serve run`:

```console
(ServeReplica:SayHello pid=63520) INFO 2023-04-10 15:59:45,403 SayHello SayHello#kTBlTj HzIYOzaEgN / monitoring.py:16 - Hello world!
(ServeReplica:SayHello pid=63520) INFO 2023-04-10 15:59:45,403 SayHello SayHello#kTBlTj HzIYOzaEgN / replica.py:527 - __CALL__ OK 0.5ms
```

:::{note}
Log messages include the logging level, timestamp, deployment name, replica tag, request ID, route, file name, and line number.
:::

Find a copy of these logs at `/tmp/ray/session_latest/logs/serve/`. You can parse these stored logs with a logging stack such as ELK or [Loki](serve-logging-loki) to be able to search by deployment or replica.

Serve supports [Log Rotation](log-rotation) of these logs by setting the environment variables `RAY_ROTATION_MAX_BYTES` and `RAY_ROTATION_BACKUP_COUNT`.

To silence the replica-level logs or otherwise configure logging, configure the `"ray.serve"` logger **inside the deployment constructor**:

```python
import logging

from ray import serve

logger = logging.getLogger("ray.serve")

@serve.deployment
class Silenced:
    def __init__(self):
        logger.setLevel(logging.ERROR)
```

This controls which logs are written to STDOUT or files on disk. In addition to the standard Python logger, Serve supports custom logging. Custom logging lets you control what messages are written to STDOUT/STDERR, files on disk, or both.

For a detailed overview of logging in Ray, see [Ray Logging](configure-logging).

### Configure Serve logging

Starting with Ray 2.9, you can configure logging for Ray Serve through the `logging_config` API. Pass a dictionary or a [LoggingConfig](../serve/api/doc/ray.serve.schema.LoggingConfig.rst) object to the `logging_config` argument of `serve.run` or `@serve.deployment`.
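As a rough sketch, both forms look like the following (the `Model` deployment and the chosen log levels are placeholders for illustration; the examples in the next sections are the reference):

```python
from ray import serve
from ray.serve.schema import LoggingConfig


# Deployment-level configuration with a LoggingConfig object.
@serve.deployment(logging_config=LoggingConfig(log_level="DEBUG"))
class Model:
    def __call__(self, request) -> str:
        return "hello"


# Application-level configuration with a plain dictionary.
serve.run(
    Model.bind(),
    logging_config={"encoding": "JSON", "log_level": "INFO"},
)
```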
#### Configure logging format

You can configure the JSON logging format by passing `encoding=JSON` to the `logging_config` argument of `serve.run` or `@serve.deployment`:

::::{tab-set}

:::{tab-item} serve.run
```{literalinclude} doc_code/monitoring/logging_config.py
:start-after: __serve_run_json_start__
:end-before: __serve_run_json_end__
:language: python
```
:::

:::{tab-item} @serve.deployment
```{literalinclude} doc_code/monitoring/logging_config.py
:start-after: __deployment_json_start__
:end-before: __deployment_json_end__
:language: python
```
:::

::::

In the replica `Model` log file, you should see the following:

```
# cat `ls /tmp/ray/session_latest/logs/serve/replica_default_Model_*`
{"levelname": "INFO", "asctime": "2024-02-27 10:36:08,908", "deployment": "default_Model", "replica": "rdofcrh4", "message": "replica.py:855 - Started initializing replica."}
{"levelname": "INFO", "asctime": "2024-02-27 10:36:08,908", "deployment": "default_Model", "replica": "rdofcrh4", "message": "replica.py:877 - Finished initializing replica."}
{"levelname": "INFO", "asctime": "2024-02-27 10:36:10,127", "deployment": "default_Model", "replica": "rdofcrh4", "request_id": "f4f4b3c0-1cca-4424-9002-c887d7858525", "route": "/", "application": "default", "message": "replica.py:1068 - Started executing request to method '__call__'."}
{"levelname": "INFO", "asctime": "2024-02-27 10:36:10,127", "deployment": "default_Model", "replica": "rdofcrh4", "request_id": "f4f4b3c0-1cca-4424-9002-c887d7858525", "route": "/", "application": "default", "message": "replica.py:373 - __CALL__ OK 0.6ms"}
```

#### Disable access log

:::{note}
The access log is the Ray Serve traffic log; Serve prints it to the proxy and replica log files for each request. It's often useful for debugging, but it can also be noisy.
:::

You can disable the access log by passing `disable_access_log=True` to the `logging_config` argument of `@serve.deployment`. For example:

```{literalinclude} doc_code/monitoring/logging_config.py
:start-after: __enable_access_log_start__
:end-before: __enable_access_log_end__
:language: python
```

The `Model` replica log file doesn't include the Serve traffic log; you should see only the application log in the log file.

```
# cat `ls /tmp/ray/session_latest/logs/serve/replica_default_Model_*`
INFO 2024-02-27 15:43:12,983 default_Model 4guj63jr replica.py:855 - Started initializing replica.
INFO 2024-02-27 15:43:12,984 default_Model 4guj63jr replica.py:877 - Finished initializing replica.
INFO 2024-02-27 15:43:13,492 default_Model 4guj63jr 2246c4bb-73dc-4524-bf37-c7746a6b3bba / :5 - hello world
```

#### Configure logging in different deployments and applications

You can also configure logging at the application level by passing `logging_config` to `serve.run`. For example:

```{literalinclude} doc_code/monitoring/logging_config.py
:start-after: __application_and_deployment_start__
:end-before: __application_and_deployment_end__
:language: python
```

In the Router log file, you should see the following:

```
# cat `ls /tmp/ray/session_latest/logs/serve/replica_default_Router_*`
INFO 2024-02-27 16:05:10,738 default_Router cwnihe65 replica.py:855 - Started initializing replica.
INFO 2024-02-27 16:05:10,739 default_Router cwnihe65 replica.py:877 - Finished initializing replica.
INFO 2024-02-27 16:05:11,233 default_Router cwnihe65 4db9445d-fc9e-490b-8bad-0a5e6bf30899 / replica.py:1068 - Started executing request to method '__call__'.
DEBUG 2024-02-27 16:05:11,234 default_Router cwnihe65 4db9445d-fc9e-490b-8bad-0a5e6bf30899 / :7 - This debug message is from the router.
INFO 2024-02-27 16:05:11,238 default_Router cwnihe65 4db9445d-fc9e-490b-8bad-0a5e6bf30899 / router.py:308 - Using router .
DEBUG 2024-02-27 16:05:11,240 default_Router cwnihe65 long_poll.py:157 - LongPollClient received updates for keys: [(LongPollNamespace.DEPLOYMENT_CONFIG, DeploymentID(name='Model', app='default')), (LongPollNamespace.RUNNING_REPLICAS, DeploymentID(name='Model', app='default'))].
INFO 2024-02-27 16:05:11,241 default_Router cwnihe65 pow_2_scheduler.py:255 - Got updated replicas for deployment 'Model' in application 'default': {'default#Model#256v3hq4'}.
DEBUG 2024-02-27 16:05:11,241 default_Router cwnihe65 long_poll.py:157 - LongPollClient received updates for keys: [(LongPollNamespace.DEPLOYMENT_CONFIG, DeploymentID(name='Model', app='default')), (LongPollNamespace.RUNNING_REPLICAS, DeploymentID(name='Model', app='default'))].
INFO 2024-02-27 16:05:11,245 default_Router cwnihe65 4db9445d-fc9e-490b-8bad-0a5e6bf30899 / replica.py:373 - __CALL__ OK 12.2ms
```

In the Model log file, you should see the following:

```
# cat `ls /tmp/ray/session_latest/logs/serve/replica_default_Model_*`
INFO 2024-02-27 16:05:10,735 default_Model 256v3hq4 replica.py:855 - Started initializing replica.
INFO 2024-02-27 16:05:10,735 default_Model 256v3hq4 replica.py:877 - Finished initializing replica.
INFO 2024-02-27 16:05:11,244 default_Model 256v3hq4 4db9445d-fc9e-490b-8bad-0a5e6bf30899 / replica.py:1068 - Started executing request to method '__call__'.
INFO 2024-02-27 16:05:11,244 default_Model 256v3hq4 4db9445d-fc9e-490b-8bad-0a5e6bf30899 / replica.py:373 - __CALL__ OK 0.6ms
```

When you set `logging_config` at the application level, Ray Serve applies it to all deployments in the application. When you also set `logging_config` at the deployment level, the deployment-level configuration overrides the application-level configuration.

#### Configure logging for Serve components

You can apply a similar logging configuration to the Serve controller and proxies by passing `logging_config` to `serve.start`.

```{literalinclude} doc_code/monitoring/logging_config.py
:start-after: __configure_serve_component_start__
:end-before: __configure_serve_component_end__
:language: python
```

#### Run custom initialization code in the controller

For advanced use cases, you can run custom initialization code when the Serve Controller starts by setting the `RAY_SERVE_CONTROLLER_CALLBACK_IMPORT_PATH` environment variable. This variable should point to a callback function that runs during controller initialization. The function doesn't need to return anything.

For example, to add a custom log handler:

```python
# mymodule/callbacks.py
import logging

def setup_custom_logging():
    logger = logging.getLogger("ray.serve")
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter("[CUSTOM] %(message)s"))
    logger.addHandler(handler)
```

Then set the environment variable before starting Ray:

```bash
export RAY_SERVE_CONTROLLER_CALLBACK_IMPORT_PATH="mymodule.callbacks:setup_custom_logging"
```

### Set Request ID

You can set a custom request ID for each HTTP request by including `X-Request-ID` in the request header and retrieve the request ID from the response.
For example ```{literalinclude} doc_code/monitoring/request_id.py :language: python ``` The custom request ID `123-234` can be seen in the access logs that are printed to the HTTP Proxy log files and deployment log files. HTTP proxy log file: ``` INFO 2023-07-20 13:47:54,221 http_proxy 127.0.0.1 123-234 / default http_proxy.py:538 - GET 200 8.9ms ``` Deployment log file: ``` (ServeReplica:default_Model pid=84006) INFO 2023-07-20 13:47:54,218 default_Model default_Model#yptKoo 123-234 / default replica.py:691 - __CALL__ OK 0.2ms ``` :::{note} The request ID is used to associate logs across the system. Avoid sending duplicate request IDs, which may lead to confusion when debugging. ::: (serve-logging-loki)= ### Filtering logs with Loki You can explore and filter your logs using [Loki](https://grafana.com/oss/loki/). Setup and configuration are straightforward on Kubernetes, but as a tutorial, let's set up Loki manually. For this walkthrough, you need both Loki and Promtail, which are both supported by [Grafana Labs](https://grafana.com). Follow the installation instructions at Grafana's website to get executables for [Loki](https://grafana.com/docs/loki/latest/installation/) and [Promtail](https://grafana.com/docs/loki/latest/clients/promtail/). For convenience, save the Loki and Promtail executables in the same directory, and then navigate to this directory in your terminal. Now let's get your logs into Loki using Promtail. Save the following file as `promtail-local-config.yaml`: ```yaml server: http_listen_port: 9080 grpc_listen_port: 0 positions: filename: /tmp/positions.yaml clients: - url: http://localhost:3100/loki/api/v1/push scrape_configs: - job_name: ray static_configs: - labels: job: ray __path__: /tmp/ray/session_latest/logs/serve/*.* ``` The relevant part for Ray Serve is the `static_configs` field, where we have indicated the location of our log files with `__path__`. The expression `*.*` will match all files, but it won't match directories since they cause an error with Promtail. We'll run Loki locally. Grab the default config file for Loki with the following command in your terminal: ```shell wget https://raw.githubusercontent.com/grafana/loki/v2.1.0/cmd/loki/loki-local-config.yaml ``` Now start Loki: ```shell ./loki-darwin-amd64 -config.file=loki-local-config.yaml ``` Here you may need to replace `./loki-darwin-amd64` with the path to your Loki executable file, which may have a different name depending on your operating system. Start Promtail and pass in the path to the config file we saved earlier: ```shell ./promtail-darwin-amd64 -config.file=promtail-local-config.yaml ``` Once again, you may need to replace `./promtail-darwin-amd64` with your Promtail executable. Run the following Python script to deploy a basic Serve deployment with a Serve deployment logger and to make some requests: ```{literalinclude} doc_code/monitoring/deployment_logger.py :start-after: __start__ :end-before: __end__ :language: python ``` Now [install and run Grafana](https://grafana.com/docs/grafana/latest/installation/) and navigate to `http://localhost:3000`, where you can log in with default credentials: * Username: admin * Password: admin On the welcome page, click "Add your first data source" and click "Loki" to add Loki as a data source. Now click "Explore" in the left-side panel. You are ready to run some queries! 
To filter all these Ray logs for the ones relevant to our deployment, use the following [LogQL](https://grafana.com/docs/loki/latest/logql/) query: ```shell {job="ray"} |= "Counter" ``` You should see something similar to the following: ```{image} https://raw.githubusercontent.com/ray-project/Images/master/docs/serve/loki-serve.png :align: center ``` You can use Loki to filter your Ray Serve logs and gather insights quicker. (serve-production-monitoring-metrics)= ## Built-in Ray Serve metrics Ray Serve exposes important system metrics like the number of successful and failed requests through the [Ray metrics monitoring infrastructure](dash-metrics-view). By default, metrics are exposed in Prometheus format on each node. :::{note} Different metrics are collected when deployments are called via Python `DeploymentHandle` versus HTTP/gRPC. See the markers below each table: - **[H]** - Available when using HTTP/gRPC proxy calls - **[D]** - Available when using Python `DeploymentHandle` calls - **[†]** - Internal metrics for advanced debugging; may change in future releases ::: :::{warning} **Histogram bucket configuration** Histogram metrics use predefined bucket boundaries to aggregate latency measurements. The default buckets are: `[1, 2, 5, 10, 20, 50, 100, 200, 300, 400, 500, 1000, 2000, 5000, 10000, 60000, 120000, 300000, 600000]` (in milliseconds). You can customize these buckets using environment variables: - **`RAY_SERVE_REQUEST_LATENCY_BUCKETS_MS`**: Controls bucket boundaries for request latency histograms: - `ray_serve_http_request_latency_ms` - `ray_serve_grpc_request_latency_ms` - `ray_serve_deployment_processing_latency_ms` - `ray_serve_health_check_latency_ms` - `ray_serve_replica_reconfigure_latency_ms` - **`RAY_SERVE_MODEL_LOAD_LATENCY_BUCKETS_MS`**: Controls bucket boundaries for model multiplexing latency histograms: - `ray_serve_multiplexed_model_load_latency_ms` - `ray_serve_multiplexed_model_unload_latency_ms` - **`RAY_SERVE_BATCH_UTILIZATION_BUCKETS_PERCENT`**: Controls bucket boundaries for batch utilization histogram. Default: `[5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 95, 99, 100]` (percentage). - `ray_serve_batch_utilization_percent` - **`RAY_SERVE_BATCH_SIZE_BUCKETS`**: Controls bucket boundaries for batch size histogram. Default: `[1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]`. - `ray_serve_actual_batch_size` - **`RAY_SERVE_REPLICA_STARTUP_SHUTDOWN_LATENCY_BUCKETS_MS`**: Controls bucket boundaries for replica lifecycle latency histograms: - `ray_serve_replica_startup_latency_ms` - `ray_serve_replica_initialization_latency_ms` - `ray_serve_replica_shutdown_duration_ms` - `ray_serve_proxy_shutdown_duration_ms` Note: `ray_serve_batch_wait_time_ms` and `ray_serve_batch_execution_time_ms` use the same buckets as `RAY_SERVE_REQUEST_LATENCY_BUCKETS_MS`. Set these as comma-separated values, for example: `RAY_SERVE_REQUEST_LATENCY_BUCKETS_MS="10,50,100,500,1000,5000"` or `RAY_SERVE_BATCH_SIZE_BUCKETS="1,4,8,16,32,64"`. **Histogram accuracy considerations** Prometheus histograms aggregate data into predefined buckets, which can affect the accuracy of percentile calculations (e.g., p50, p95, p99) displayed on dashboards: - **Values outside bucket range**: If your latencies exceed the largest bucket boundary (default: 600,000ms / 10 minutes), they all fall into the `+Inf` bucket and percentile estimates become inaccurate. 
- **Sparse bucket coverage**: If your actual latencies cluster between two widely-spaced buckets, the calculated percentiles are interpolated and may not reflect true values. - **Bucket boundaries are fixed at startup**: Changes to bucket environment variables (such as `RAY_SERVE_REQUEST_LATENCY_BUCKETS_MS`, `RAY_SERVE_BATCH_SIZE_BUCKETS`, etc.) require restarting Serve actors to take effect. For accurate percentile calculations, configure bucket boundaries that closely match your expected latency distribution. For example, if most requests complete in 10-100ms, use finer-grained buckets in that range. ::: ### Request lifecycle and metrics The following diagram shows where metrics are captured along the request path: ``` REQUEST FLOW ┌─────────────────────────────────────────────────────────────────────────────┐ │ │ │ ┌─────────────────────────────────────────────────────────────────────┐ │ │ │ HTTP/gRPC PROXY │ │ │ │ │ │ │ │ ○ ray_serve_num_ongoing_http_requests (while processing) │ │ │ │ ○ ray_serve_num_http_requests_total (on completion) │ │ │ │ ○ ray_serve_http_request_latency_ms (on completion) │ │ │ │ ○ ray_serve_num_http_error_requests_total (on error response) │ │ │ └──────────────────────────────┬──────────────────────────────────────┘ │ │ │ │ │ ▼ │ │ ┌─────────────────────────────────────────────────────────────────────┐ │ │ │ DEPLOYMENT HANDLE │ │ │ │ │ │ │ │ ○ ray_serve_handle_request_counter_total (on completion) │ │ │ └──────────────────────────────┬──────────────────────────────────────┘ │ │ │ │ │ ▼ │ │ ┌─────────────────────────────────────────────────────────────────────┐ │ │ │ ROUTER │ │ │ │ │ │ │ │ ○ ray_serve_num_router_requests_total (on request routed) │ │ │ │ ○ ray_serve_deployment_queued_queries (while in queue) │ │ │ │ ○ ray_serve_num_ongoing_requests_at_replicas (assigned to replica) │ │ │ └──────────────────────────────┬──────────────────────────────────────┘ │ │ │ │ │ ▼ │ │ ┌─────────────────────────────────────────────────────────────────────┐ │ │ │ REPLICA │ │ │ │ │ │ │ │ ○ ray_serve_replica_processing_queries (while processing) │ │ │ │ ○ ray_serve_deployment_processing_latency_ms (on completion) │ │ │ │ ○ ray_serve_deployment_request_counter_total (on completion) │ │ │ │ ○ ray_serve_deployment_error_counter_total (on exception) │ │ │ └─────────────────────────────────────────────────────────────────────┘ │ │ │ └─────────────────────────────────────────────────────────────────────────────┘ Legend: ───────────────────────────────────────────────────────────────────────────── • Counters (_total): Incremented once per event • Gauges (ongoing/queued): Show current count, increase on start, decrease on end • Histograms (latency_ms): Record duration when request completes ``` ### HTTP/gRPC proxy metrics These metrics track request throughput and latency at the proxy level (request entry point). | Metric | Type | Tags | Description | |--------|------|------|-------------| | `ray_serve_num_ongoing_http_requests` **[H]** | Gauge | `node_id`, `node_ip_address` | Current number of HTTP requests being processed by the proxy. | | `ray_serve_num_ongoing_grpc_requests` **[H]** | Gauge | `node_id`, `node_ip_address` | Current number of gRPC requests being processed by the proxy. | | `ray_serve_num_http_requests_total` **[H]** | Counter | `route`, `method`, `application`, `status_code` | Total number of HTTP requests processed by the proxy. 
| | `ray_serve_num_grpc_requests_total` **[H]** | Counter | `route`, `method`, `application`, `status_code` | Total number of gRPC requests processed by the proxy. | | `ray_serve_http_request_latency_ms` **[H]** | Histogram | `method`, `route`, `application`, `status_code` | Histogram of end-to-end HTTP request latency in milliseconds (measured from proxy receipt to response sent). | | `ray_serve_grpc_request_latency_ms` **[H]** | Histogram | `method`, `route`, `application`, `status_code` | Histogram of end-to-end gRPC request latency in milliseconds (measured from proxy receipt to response sent). | | `ray_serve_num_http_error_requests_total` **[H]** | Counter | `route`, `error_code`, `method`, `application` | Total number of HTTP requests that returned non-2xx/3xx status codes. | | `ray_serve_num_grpc_error_requests_total` **[H]** | Counter | `route`, `error_code`, `method`, `application` | Total number of gRPC requests that returned non-OK status codes. | | `ray_serve_num_deployment_http_error_requests_total` **[H]** | Counter | `deployment`, `error_code`, `method`, `route`, `application` | Total number of HTTP errors per deployment. Useful for identifying which deployment caused errors. | | `ray_serve_num_deployment_grpc_error_requests_total` **[H]** | Counter | `deployment`, `error_code`, `method`, `route`, `application` | Total number of gRPC errors per deployment. Useful for identifying which deployment caused errors. | ### Request routing metrics These metrics track request routing and queueing behavior. | Metric | Type | Tags | Description | |--------|------|------|-------------| | `ray_serve_handle_request_counter_total` **[D]** | Counter | `handle`, `deployment`, `route`, `application` | Total number of requests processed by this `DeploymentHandle`. | | `ray_serve_num_router_requests_total` **[H]** | Counter | `deployment`, `route`, `application`, `handle`, `actor_id` | Total number of requests routed to a deployment. | | `ray_serve_deployment_queued_queries` **[H]** | Gauge | `deployment`, `application`, `handle`, `actor_id` | Current number of requests waiting to be assigned to a replica. High values indicate backpressure. | | `ray_serve_num_ongoing_requests_at_replicas` **[H]** | Gauge | `deployment`, `application`, `handle`, `actor_id` | Current number of requests assigned and sent to replicas but not yet completed. | | `ray_serve_request_router_fulfillment_time_ms` **[H][D]** | Histogram | `deployment`, `actor_id`, `application`, `handle_source` | Histogram of time in milliseconds that a request spent waiting in the router queue before being assigned to a replica. This includes the time to resolve the pending request's arguments. | | `ray_serve_request_router_queue_len` **[H][D]** | Gauge | `deployment`, `replica_id`, `actor_id`, `application`, `handle_source` | Current number of requests running on a replica as tracked by the router's queue length cache. | | `ray_serve_num_scheduling_tasks` **[H][†]** | Gauge | `deployment`, `actor_id` | Current number of request scheduling tasks in the router. | | `ray_serve_num_scheduling_tasks_in_backoff` **[H][†]** | Gauge | `deployment`, `actor_id` | Current number of scheduling tasks in exponential backoff (waiting before retry). | ### Request processing metrics These metrics track request throughput, errors, and latency at the replica level. 
| Metric | Type | Tags | Description | |--------|------|------|-------------| | `ray_serve_replica_processing_queries` **[D]** | Gauge | `deployment`, `replica`, `application` | Current number of requests being processed by the replica. | | `ray_serve_deployment_request_counter_total` **[D]** | Counter | `deployment`, `replica`, `route`, `application` | Total number of requests processed by the replica. | | `ray_serve_deployment_processing_latency_ms` **[D]** | Histogram | `deployment`, `replica`, `route`, `application` | Histogram of request processing time in milliseconds (excludes queue wait time). | | `ray_serve_deployment_error_counter_total` **[D]** | Counter | `deployment`, `replica`, `route`, `application` | Total number of exceptions raised while processing requests. | ### Batching metrics These metrics track request batching behavior for deployments using `@serve.batch`. Use them to tune batching parameters and debug latency issues. | Metric | Type | Tags | Description | |--------|------|------|-------------| | `ray_serve_batch_wait_time_ms` | Histogram | `deployment`, `replica`, `application`, `function_name` | Time requests waited for the batch to fill in milliseconds. High values indicate batch timeout may be too long. | | `ray_serve_batch_execution_time_ms` | Histogram | `deployment`, `replica`, `application`, `function_name` | Time to execute the batch function in milliseconds. | | `ray_serve_batch_queue_length` | Gauge | `deployment`, `replica`, `application`, `function_name` | Current number of requests waiting in the batch queue. High values indicate a batching bottleneck. | | `ray_serve_batch_utilization_percent` | Histogram | `deployment`, `replica`, `application`, `function_name` | Batch utilization as percentage (`computed_batch_size / max_batch_size * 100`). Low utilization suggests `batch_wait_timeout_s` is too aggressive or traffic is too low. | | `ray_serve_actual_batch_size` | Histogram | `deployment`, `replica`, `application`, `function_name` | The computed size of each batch. When `batch_size_fn` is configured, this reports the custom computed size (such as total tokens). Otherwise, it reports the number of requests. | | `ray_serve_batches_processed_total` | Counter | `deployment`, `replica`, `application`, `function_name` | Total number of batches executed. Compare with request counter to measure batching efficiency. | ### Proxy health metrics These metrics track proxy health and lifecycle. | Metric | Type | Tags | Description | |--------|------|------|-------------| | `ray_serve_proxy_status` | Gauge | `node_id`, `node_ip_address` | Current status of the proxy as a numeric value: `1` = STARTING, `2` = HEALTHY, `3` = UNHEALTHY, `4` = DRAINING, `5` = DRAINED. | | `ray_serve_proxy_shutdown_duration_ms` | Histogram | `node_id`, `node_ip_address` | Time taken for a proxy to shut down in milliseconds. | ### Replica lifecycle metrics These metrics track replica health, restarts, and lifecycle timing. | Metric | Type | Tags | Description | |--------|------|------|-------------| | `ray_serve_deployment_replica_healthy` | Gauge | `deployment`, `replica`, `application` | Health status of the replica: `1` = healthy, `0` = unhealthy. | | `ray_serve_deployment_replica_starts_total` | Counter | `deployment`, `replica`, `application` | Total number of times the replica has started (including restarts due to failure). 
| | `ray_serve_replica_startup_latency_ms` | Histogram | `deployment`, `replica`, `application` | Total time from replica creation to ready state in milliseconds. Includes node provisioning (if needed on VM or Kubernetes), runtime environment bootstrap (pip install, Docker image pull, etc.), Ray actor scheduling, and actor constructor execution. Useful for debugging slow cold starts. | | `ray_serve_replica_initialization_latency_ms` | Histogram | `deployment`, `replica`, `application` | Time for the actor constructor to run in milliseconds. This is a subset of `ray_serve_replica_startup_latency_ms`. | | `ray_serve_replica_reconfigure_latency_ms` | Histogram | `deployment`, `replica`, `application` | Time in milliseconds for a replica to complete reconfiguration. Includes both reconfigure time and one control-loop iteration, so very low values may be unreliable. | | `ray_serve_health_check_latency_ms` | Histogram | `deployment`, `replica`, `application` | Duration of health check calls in milliseconds. Useful for identifying slow health checks blocking scaling. | | `ray_serve_health_check_failures_total` | Counter | `deployment`, `replica`, `application` | Total number of failed health checks. Provides early warning before replica is marked unhealthy. | | `ray_serve_replica_shutdown_duration_ms` | Histogram | `deployment`, `replica`, `application` | Time from shutdown signal to replica fully stopped in milliseconds. Useful for debugging slow draining during scale-down or rolling updates. | ### Autoscaling metrics These metrics provide visibility into autoscaling behavior and help debug scaling issues. | Metric | Type | Tags | Description | |--------|------|------|-------------| | `ray_serve_autoscaling_target_replicas` | Gauge | `deployment`, `application` | Target number of replicas the autoscaler is trying to reach. Compare with actual replicas to identify scaling lag. | | `ray_serve_autoscaling_desired_replicas` | Gauge | `deployment`, `application` | Raw autoscaling decision (number of replicas) from the policy *before* applying `min_replicas`/`max_replicas` bounds. | | `ray_serve_autoscaling_total_requests` | Gauge | `deployment`, `application` | Total number of requests (queued + in-flight) as seen by the autoscaler. This is the input to the scaling decision. | | `ray_serve_autoscaling_policy_execution_time_ms` | Gauge | `deployment`, `application`, `policy_scope` | Time taken to execute the autoscaling policy in milliseconds. `policy_scope` is `deployment` or `application`. | | `ray_serve_autoscaling_replica_metrics_delay_ms` | Gauge | `deployment`, `application`, `replica` | Time taken for replica metrics to reach the controller in milliseconds. High values may indicate controller overload. | | `ray_serve_autoscaling_handle_metrics_delay_ms` | Gauge | `deployment`, `application`, `handle` | Time taken for handle metrics to reach the controller in milliseconds. High values may indicate controller overload. | | `ray_serve_record_autoscaling_stats_failed_total` | Counter | `application`, `deployment`, `replica`, `exception_name` | Total number of failed attempts to collect autoscaling metrics on replica from user defined function. Non-zero values indicate error in user code. | | `ray_serve_user_autoscaling_stats_latency_ms` | Histogram | `application`, `deployment`, `replica` | Histogram of time taken to execute the user-defined autoscaling stats function in milliseconds. 
| ### Model multiplexing metrics These metrics track model loading and caching behavior for deployments using `@serve.multiplexed`. | Metric | Type | Tags | Description | |--------|------|------|-------------| | `ray_serve_multiplexed_model_load_latency_ms` | Histogram | `deployment`, `replica`, `application` | Histogram of time taken to load a model in milliseconds. | | `ray_serve_multiplexed_model_unload_latency_ms` | Histogram | `deployment`, `replica`, `application` | Histogram of time taken to unload a model in milliseconds. | | `ray_serve_num_multiplexed_models` | Gauge | `deployment`, `replica`, `application` | Current number of models loaded on the replica. | | `ray_serve_multiplexed_models_load_counter_total` | Counter | `deployment`, `replica`, `application` | Total number of model load operations. | | `ray_serve_multiplexed_models_unload_counter_total` | Counter | `deployment`, `replica`, `application` | Total number of model unload operations (evictions). | | `ray_serve_registered_multiplexed_model_id` | Gauge | `deployment`, `replica`, `application`, `model_id` | Indicates which model IDs are currently loaded. Value is `1` when the model is loaded. | | `ray_serve_multiplexed_get_model_requests_counter_total` | Counter | `deployment`, `replica`, `application` | Total number of `get_model()` calls. Compare with load counter to calculate cache hit rate. | ### Controller metrics These metrics track the Serve controller's performance. Useful for debugging control plane issues. | Metric | Type | Tags | Description | |--------|------|------|-------------| | `ray_serve_controller_control_loop_duration_s` | Gauge | — | Duration of the last control loop iteration in seconds. | | `ray_serve_controller_num_control_loops` | Gauge | `actor_id` | Total number of control loop iterations. Increases monotonically. | | `ray_serve_routing_stats_delay_ms` | Histogram | `deployment`, `replica`, `application` | Time taken for routing stats to propagate from replica to controller in milliseconds. | | `ray_serve_routing_stats_error_total` | Counter | `deployment`, `replica`, `application`, `error_type` | Total number of errors when getting routing stats from replicas. `error_type` is `exception` (replica raised an error) or `timeout` (replica didn't respond in time). | | `ray_serve_long_poll_host_transmission_counter_total` **[†]** | Counter | `namespace_or_state` | Total number of long poll updates transmitted to clients. | | `ray_serve_deployment_status` | Gauge | `deployment`, `application` | Numeric status of deployment: `0` = UNKNOWN, `1` = DEPLOY_FAILED, `2` = UNHEALTHY, `3` = UPDATING, `4` = UPSCALING, `5` = DOWNSCALING, `6` = HEALTHY. Use for state timeline visualization and lifecycle debugging. | | `ray_serve_application_status` | Gauge | `application` | Numeric status of application: `0` = UNKNOWN, `1` = DEPLOY_FAILED, `2` = UNHEALTHY, `3` = NOT_STARTED, `4` = DELETING, `5` = DEPLOYING, `6` = RUNNING. Use for state timeline visualization and lifecycle debugging. | | `ray_serve_long_poll_latency_ms` **[†]** | Histogram | `namespace` | Time for updates to propagate from controller to clients in milliseconds. `namespace` is the long poll namespace such as `ROUTE_TABLE`, `DEPLOYMENT_CONFIG`, or `DEPLOYMENT_TARGETS`. Debug slow config propagation; impacts autoscaling response time. | | `ray_serve_long_poll_pending_clients` **[†]** | Gauge | `namespace` | Number of clients waiting for updates. 
`namespace` is the long poll namespace such as `ROUTE_TABLE`, `DEPLOYMENT_CONFIG`, or `DEPLOYMENT_TARGETS`. Identify backpressure in notification system. | ### Event loop monitoring metrics These metrics track the health of asyncio event loops in Serve components. High scheduling latency indicates the event loop is blocked, which can cause request latency issues. Use these metrics to detect blocking code in handlers or system bottlenecks. | Metric | Type | Tags | Description | |--------|------|------|-------------| | `ray_serve_event_loop_scheduling_latency_ms` **[†]** | Histogram | `component`, `loop_type`, `actor_id`, `deployment`*, `application`* | Event loop scheduling delay in milliseconds. Measures how long the loop was blocked beyond the expected sleep interval. Values close to zero indicate a healthy loop; high values indicate either blocking code or a large number of tasks queued on the event loop. | | `ray_serve_event_loop_monitoring_iterations_total` **[†]** | Counter | `component`, `loop_type`, `actor_id`, `deployment`*, `application`* | Number of event loop monitoring iterations. Acts as a heartbeat; a stalled counter indicates the loop is completely blocked. | | `ray_serve_event_loop_tasks` **[†]** | Gauge | `component`, `loop_type`, `actor_id`, `deployment`*, `application`* | Number of pending asyncio tasks on the event loop. High values may indicate task accumulation. | *\* `deployment` and `application` tags are only present for replica `main` and `user_code` loops, not for proxy or router loops.* **Tag values:** - `component`: The Serve component type. - `proxy`: HTTP/gRPC proxy actor - `replica`: Deployment replica actor - `unknown`: When using `DeploymentHandle.remote()` - `loop_type`: The type of event loop being monitored. - `main`: Main event loop for the actor (always present) - `user_code`: Separate event loop for user handler code (replicas only, when `RAY_SERVE_RUN_USER_CODE_IN_SEPARATE_THREAD=1`, which is the default) - `router`: Separate event loop for request routing (replicas only, when `RAY_SERVE_RUN_ROUTER_IN_SEPARATE_LOOP=1`, which is the default) - `actor_id`: The Ray actor ID of the proxy or replica - `deployment`: The deployment name (replicas only, for `main` and `user_code` loops) - `application`: The application name (replicas only, for `main` and `user_code` loops) **Interpreting scheduling latency:** - **< 10ms**: Healthy event loop - **10-50ms**: Acceptable under load - **50-100ms**: Concerning; investigate for blocking code - **100-500ms**: Problematic; likely blocking I/O or CPU-bound code in async handlers - **> 500ms**: Severe blocking; definitely impacting request latency To see this in action, first run the following command to start Ray and set up the metrics export port: ```bash ray start --head --metrics-export-port=8080 ``` Then run the following script: ```{literalinclude} doc_code/monitoring/metrics_snippet.py :start-after: __start__ :end-before: __end__ :language: python ``` The requests loop until canceled with `Control-C`. While this script is running, go to `localhost:8080` in your web browser. In the output there, you can search for `serve_` to locate the metrics above. The metrics are updated once every ten seconds by default, so you need to refresh the page to see new values. 
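If you'd rather check the endpoint from a script than from a browser, the following minimal sketch (assuming the default local setup from the commands above, with metrics exported on port 8080) fetches the Prometheus-format output and prints only the Serve-related lines.

```python
import urllib.request

# Fetch the Prometheus-format metrics exposed by the local Ray node.
with urllib.request.urlopen("http://localhost:8080") as response:
    metrics_text = response.read().decode()

# Print only the Ray Serve metrics and their HELP/TYPE comment lines.
for line in metrics_text.splitlines():
    if "ray_serve_" in line:
        print(line)
```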
You can modify the metrics report interval with the following configuration option (note that this isn't a stable public API and is subject to change without warning):

```console
ray start --head --system-config='{"metrics_report_interval_ms": 1000}'
```

For example, after running the script for some time and refreshing `localhost:8080`, you should find metrics similar to the following:

```
ray_serve_deployment_processing_latency_ms_count{..., replica="sleeper#jtzqhX"} 48.0
ray_serve_deployment_processing_latency_ms_sum{..., replica="sleeper#jtzqhX"} 48160.6719493866
```

These values indicate that the average processing latency is just over one second (48160.7 ms over 48 requests, or about 1003 ms per request), as expected.

You can even define a [custom metric](application-level-metrics) for your deployment and tag it with deployment or replica metadata. Here's an example:

```{literalinclude} doc_code/monitoring/custom_metric_snippet.py
:start-after: __start__
:end-before: __end__
```

The emitted logs include:

```
# HELP ray_my_counter_total The number of odd-numbered requests to this deployment.
# TYPE ray_my_counter_total counter
ray_my_counter_total{..., deployment="MyDeployment",model="123",replica="MyDeployment#rUVqKh"} 5.0
```

See the [Ray Metrics documentation](collect-metrics) for more details, including instructions for scraping these metrics using Prometheus.

## Profiling memory

Ray provides two useful metrics to track memory usage: `ray_component_rss_mb` (resident set size) and `ray_component_mem_shared_bytes` (shared memory). Approximate a Serve actor's memory usage by subtracting its shared memory from its resident set size (that is, `ray_component_rss_mb` - `ray_component_mem_shared_bytes`).

If you notice a memory leak on a Serve actor, use `memray` to debug (`pip install memray`). Set the environment variable `RAY_SERVE_ENABLE_MEMORY_PROFILING=1` and run your Serve application. All the Serve actors run a `memray` tracker that logs their memory usage to `bin` files in the `/tmp/ray/session_latest/logs/serve/` directory. Run the `memray flamegraph [bin file]` command to generate a flamegraph of the memory usage. See the [memray docs](https://bloomberg.github.io/memray/overview.html) for more info.

## Exporting metrics into Arize

Besides using Prometheus to check Ray metrics, Ray Serve can also export metrics to other observability platforms. [Arize](https://docs.arize.com/arize/) is a machine learning observability platform that helps you monitor real-time model performance, root-cause model failures or performance degradation with explainability and slice analysis, and surface drift, data quality, and data consistency issues. To integrate with Arize, add the Arize client code directly into your Serve deployment code. ([Example code](https://docs.arize.com/arize/integrations/integrations/anyscale-ray-serve))

---

(serve-multi-application)=
# Deploy Multiple Applications

Serve supports deploying multiple independent Serve applications. This user guide walks through how to generate a multi-application config file, deploy it using the Serve CLI, and monitor your applications using the CLI and the Ray Serve dashboard.

## Context

### Background

With the introduction of multi-application Serve, this section walks you through the concept of applications and when to deploy a single application versus multiple applications per cluster.

An application consists of one or more deployments. The deployments in an application are tied into a directed acyclic graph through [model composition](serve-model-composition).
An application can be called via HTTP at the specified route prefix, and the ingress deployment handles all such inbound traffic. Due to the dependence between deployments in an application, one application is a unit of upgrade. ### When to use multiple applications You can solve many use cases by using either model composition or multi-application. However, both have their own individual benefits and can be used together. Suppose you have multiple models and/or business logic that all need to be executed for a single request. If they are living in one repository, then you most likely upgrade them as a unit, so we recommend having all those deployments in one application. On the other hand, if these models or business logic have logical groups, for example, groups of models that communicate with each other but live in different repositories, we recommend separating the models into applications. Another common use-case for multiple applications is separate groups of models that may not communicate with each other, but you want to co-host them to increase hardware utilization. Because one application is a unit of upgrade, having multiple applications allows you to deploy many independent models (or groups of models) each behind different endpoints. You can then easily add or delete applications from the cluster as well as upgrade applications independently of each other. ## Getting started Define a Serve application: ```{literalinclude} doc_code/image_classifier_example.py :language: python :start-after: __serve_example_begin__ :end-before: __serve_example_end__ ``` Copy this code to a file named `image_classifier.py`. Define a second Serve application: ```{literalinclude} doc_code/translator_example.py :language: python :start-after: __serve_example_begin__ :end-before: __serve_example_end__ ``` Copy this code to a file named `text_translator.py`. Generate a multi-application config file that contains both of these two applications and save it to `config.yaml`. ``` serve build image_classifier:app text_translator:app -o config.yaml ``` This generates the following config: ```yaml proxy_location: EveryNode http_options: host: 0.0.0.0 port: 8000 grpc_options: port: 9000 grpc_servicer_functions: [] logging_config: encoding: JSON log_level: INFO logs_dir: null enable_access_log: true applications: - name: app1 route_prefix: /classify import_path: image_classifier:app runtime_env: {} deployments: - name: downloader - name: ImageClassifier - name: app2 route_prefix: /translate import_path: text_translator:app runtime_env: {} deployments: - name: Translator ``` :::{note} The names for each application are auto-generated as `app1`, `app2`, etc. To give custom names to the applications, modify the config file before moving on to the next step. ::: ### Deploy the applications To deploy the applications, be sure to start a Ray cluster first. ```console $ ray start --head $ serve deploy config.yaml > Sent deploy request successfully! ``` Query the applications at their respective endpoints, `/classify` and `/translate`. ```{literalinclude} doc_code/image_classifier_example.py :language: python :start-after: __request_begin__ :end-before: __request_end__ ``` ```{literalinclude} doc_code/translator_example.py :language: python :start-after: __request_begin__ :end-before: __request_end__ ``` #### Development workflow with `serve run` You can also use the CLI command `serve run` to run and test your application easily, either locally or on a remote cluster. 
```console $ serve run config.yaml > 2023-04-04 11:00:05,901 INFO scripts.py:327 -- Deploying from config file: "config.yaml". > 2023-04-04 11:00:07,505 INFO worker.py:1613 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265 > 2023-04-04 11:00:09,012 SUCC scripts.py:393 -- Submitted deploy config successfully. ``` The `serve run` command blocks the terminal, which allows logs from Serve to stream to the console. This helps you test and debug your applications easily. If you want to change your code, you can hit Ctrl-C to interrupt the command and shutdown Serve and all its applications, then rerun `serve run`. :::{note} `serve run` only supports running multi-application config files. If you want to run applications by directly passing in an import path, `serve run` can only run one application import path at a time. ::: ### Check status Check the status of the applications by running `serve status`. ```console $ serve status proxies: 2e02a03ad64b3f3810b0dd6c3265c8a00ac36c13b2b0937cbf1ef153: HEALTHY applications: app1: status: RUNNING message: '' last_deployed_time_s: 1693267064.0735464 deployments: downloader: status: HEALTHY replica_states: RUNNING: 1 message: '' ImageClassifier: status: HEALTHY replica_states: RUNNING: 1 message: '' app2: status: RUNNING message: '' last_deployed_time_s: 1693267064.0735464 deployments: Translator: status: HEALTHY replica_states: RUNNING: 1 message: '' ``` ### Send requests between applications You can also make calls between applications without going through HTTP by using the Serve API `serve.get_app_handle` to get a handle to any live Serve application on the cluster. This handle can be used to directly execute a request on an application. Take the classifier and translator app above as an example. You can modify the `__call__` method of the `ImageClassifier` to check for another parameter in the HTTP request, and send requests to the translator application. ```{literalinclude} doc_code/image_classifier_example.py :language: python :start-after: __serve_example_modified_begin__ :end-before: __serve_example_modified_end__ ``` Then, send requests to the classifier application with the `should_translate` flag set to True: ```{literalinclude} doc_code/image_classifier_example.py :language: python :start-after: __second_request_begin__ :end-before: __second_request_end__ ``` ### Inspect deeper For more visibility into the applications running on the cluster, go to the Ray Serve dashboard at [`http://localhost:8265/#/serve`](http://localhost:8265/#/serve). You can see all applications that are deployed on the Ray cluster: ![applications](https://raw.githubusercontent.com/ray-project/images/master/docs/serve/multi-app/applications-dashboard.png) The list of deployments under each application: ![deployments](https://raw.githubusercontent.com/ray-project/images/master/docs/serve/multi-app/deployments-dashboard.png) As well as the list of replicas for each deployment: ![replicas](https://raw.githubusercontent.com/ray-project/images/master/docs/serve/multi-app/replica-dashboard.png) For more details on the Ray Serve dashboard, see the [Serve dashboard documentation](dash-serve-view). ## Add, delete, and update applications You can add, remove or update entries under the `applications` field to add, remove or update applications in the cluster. This doesn't affect other applications on the cluster. To update an application, modify the config options in the corresponding entry under the `applications` field. 
:::{note} The in-place update behavior for an application when you resubmit a config is the same as the single-application behavior. For how an application reacts to different config changes, see [Updating a Serve Application](serve-inplace-updates). ::: (serve-config-migration)= ### Migrating from a single-application config Migrating the single-application config `ServeApplicationSchema` to the multi-application config format `ServeDeploySchema` is straightforward. Each entry under the `applications` field matches the old, single-application config format. To convert a single-application config to the multi-application config format: * Copy the entire old config to an entry under the `applications` field. * Remove `host` and `port` from the entry and move them under the `http_options` field. * Name the application. * If you haven't already, set the application-level `route_prefix` to the route prefix of the ingress deployment in the application. In a multi-application config, you should set route prefixes at the application level instead of for the ingress deployment in each application. * When needed, add more applications. For more details on the multi-application config format, see the documentation for [`ServeDeploySchema`](serve-rest-api-config-schema). :::{note} You must remove `host` and `port` from the application entry. In a multi-application config, specifying cluster-level options within an individual application isn't applicable, and is not supported. ::: --- (serve-best-practices)= # Best practices in production This section helps you: * Understand best practices when operating Serve in production * Learn more about managing Serve with the Serve CLI * Configure your HTTP requests when querying Serve ## CLI best practices This section summarizes the best practices for deploying to production using the Serve CLI: * Use `serve run` to manually test and improve your Serve application locally. * Use `serve build` to create a Serve config file for your Serve application. * For development, put your Serve application's code in a remote repository and manually configure the `working_dir` or `py_modules` fields in your Serve config file's `runtime_env` to point to that repository. * For production, put your Serve application's code in a custom Docker image instead of a `runtime_env`. See [this tutorial](serve-custom-docker-images) to learn how to create custom Docker images and deploy them on KubeRay. * Use `serve status` to track your Serve application's health and deployment progress. See [the monitoring guide](serve-in-production-inspecting) for more info. * Use `serve config` to check the latest config that your Serve application received. This is its goal state. See [the monitoring guide](serve-in-production-inspecting) for more info. * Make lightweight configuration updates (e.g., `num_replicas` or `user_config` changes) by modifying your Serve config file and redeploying it with `serve deploy`. (serve-best-practices-http-requests)= ## Client-side HTTP requests Most examples in these docs use straightforward `get` or `post` requests using Python's `requests` library, such as: ```{literalinclude} ../doc_code/requests_best_practices.py :start-after: __prototype_code_start__ :end-before: __prototype_code_end__ :language: python ``` This pattern is useful for prototyping, but it isn't sufficient for production. 
In production, HTTP requests should use: * Retries: Requests may occasionally fail due to transient issues (e.g., slow network, node failure, power outage, spike in traffic, etc.). Retry failed requests a handful of times to account for these issues. * Exponential backoff: To avoid bombarding the Serve application with retries during a transient error, apply an exponential backoff on failure. Each retry should wait exponentially longer than the previous one before running. For example, the first retry may wait 0.1s after a failure, and subsequent retries wait 0.4s (4 x 0.1), 1.6s, 6.4s, 25.6s, etc. after the failure. * Timeouts: Add a timeout to each retry to prevent requests from hanging. The timeout should be longer than the application's latency to give your application enough time to process requests. Additionally, set an [end-to-end timeout](serve-performance-e2e-timeout) in the Serve application, so slow requests don't bottleneck replicas. ```{literalinclude} ../doc_code/requests_best_practices.py :start-after: __production_code_start__ :end-before: __production_code_end__ :language: python ``` ## Load shedding When a request is sent to a cluster, it's first received by the Serve proxy, which then forwards it to a replica for handling using a {mod}`DeploymentHandle `. Replicas can handle up to a configurable number of requests at a time. Configure the number using the `max_ongoing_requests` option. If all replicas are busy and cannot accept more requests, the request is queued in the {mod}`DeploymentHandle ` until one becomes available. Under heavy load, {mod}`DeploymentHandle ` queues can grow and cause high tail latency and excessive load on the system. To avoid instability, it's often preferable to intentionally reject some requests to avoid these queues growing indefinitely. This technique is called "load shedding," and it allows the system to gracefully handle excessive load without spiking tail latencies or overloading components to the point of failure. You can configure load shedding for your Serve deployments using the `max_queued_requests` parameter to the {mod}`@serve.deployment ` decorator. This controls the maximum number of requests that each {mod}`DeploymentHandle `, including the Serve proxy, will queue. Once the limit is reached, enqueueing any new requests immediately raises a {mod}`BackPressureError `. HTTP requests will return a `503` status code (service unavailable). The following example defines a deployment that emulates slow request handling and has `max_ongoing_requests` and `max_queued_requests` configured. ```{literalinclude} ../doc_code/load_shedding.py :start-after: __example_deployment_start__ :end-before: __example_deployment_end__ :language: python ``` To test the behavior, send HTTP requests in parallel to emulate multiple clients. Serve accepts `max_ongoing_requests` and `max_queued_requests` requests, and rejects further requests with a `503`, or service unavailable, status. ```{literalinclude} ../doc_code/load_shedding.py :start-after: __client_test_start__ :end-before: __client_test_end__ :language: python ``` ```bash 2024-02-28 11:12:22,287 INFO worker.py:1744 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265 (ProxyActor pid=21011) INFO 2024-02-28 11:12:24,088 proxy 127.0.0.1 proxy.py:1140 - Proxy actor 15b7c620e64c8c69fb45559001000000 starting on node ebc04d744a722577f3a049da12c9f83d9ba6a4d100e888e5fcfa19d9. 
(ProxyActor pid=21011) INFO 2024-02-28 11:12:24,089 proxy 127.0.0.1 proxy.py:1357 - Starting HTTP server on node: ebc04d744a722577f3a049da12c9f83d9ba6a4d100e888e5fcfa19d9 listening on port 8000 (ProxyActor pid=21011) INFO: Started server process [21011] (ServeController pid=21008) INFO 2024-02-28 11:12:24,199 controller 21008 deployment_state.py:1614 - Deploying new version of deployment SlowDeployment in application 'default'. Setting initial target number of replicas to 1. (ServeController pid=21008) INFO 2024-02-28 11:12:24,300 controller 21008 deployment_state.py:1924 - Adding 1 replica to deployment SlowDeployment in application 'default'. (ProxyActor pid=21011) WARNING 2024-02-28 11:12:27,141 proxy 127.0.0.1 544437ef-f53a-4991-bb37-0cda0b05cb6a / router.py:96 - Request dropped due to backpressure (num_queued_requests=2, max_queued_requests=2). (ProxyActor pid=21011) WARNING 2024-02-28 11:12:27,142 proxy 127.0.0.1 44dcebdc-5c07-4a92-b948-7843443d19cc / router.py:96 - Request dropped due to backpressure (num_queued_requests=2, max_queued_requests=2). (ProxyActor pid=21011) WARNING 2024-02-28 11:12:27,143 proxy 127.0.0.1 83b444ae-e9d6-4ac6-84b7-f127c48f6ba7 / router.py:96 - Request dropped due to backpressure (num_queued_requests=2, max_queued_requests=2). (ProxyActor pid=21011) WARNING 2024-02-28 11:12:27,144 proxy 127.0.0.1 f92b47c2-6bff-4a0d-8e5b-126d948748ea / router.py:96 - Request dropped due to backpressure (num_queued_requests=2, max_queued_requests=2). (ProxyActor pid=21011) WARNING 2024-02-28 11:12:27,145 proxy 127.0.0.1 cde44bcc-f3e7-4652-b487-f3f2077752aa / router.py:96 - Request dropped due to backpressure (num_queued_requests=2, max_queued_requests=2). (ServeReplica:default:SlowDeployment pid=21013) INFO 2024-02-28 11:12:28,168 default_SlowDeployment 8ey9y40a e3b77013-7dc8-437b-bd52-b4839d215212 / replica.py:373 - __CALL__ OK 2007.7ms (ServeReplica:default:SlowDeployment pid=21013) INFO 2024-02-28 11:12:30,175 default_SlowDeployment 8ey9y40a 601e7b0d-1cd3-426d-9318-43c2c4a57a53 / replica.py:373 - __CALL__ OK 4013.5ms (ServeReplica:default:SlowDeployment pid=21013) INFO 2024-02-28 11:12:32,183 default_SlowDeployment 8ey9y40a 0655fa12-0b44-4196-8fc5-23d31ae6fcb9 / replica.py:373 - __CALL__ OK 3987.9ms (ServeReplica:default:SlowDeployment pid=21013) INFO 2024-02-28 11:12:34,188 default_SlowDeployment 8ey9y40a c49dee09-8de1-4e7a-8c2f-8ce3f6d8ef34 / replica.py:373 - __CALL__ OK 3960.8ms Request finished with status code 200. Request finished with status code 200. Request finished with status code 200. Request finished with status code 200. ``` --- (serve-in-production-config-file)= # Serve Config Files This section should help you: - Understand the Serve config file format. - Learn how to deploy and update your applications in production using the Serve config. - Learn how to generate a config file for a list of Serve applications. The Serve config is the recommended way to deploy and update your applications in production. It allows you to fully configure everything related to Serve, including system-level components like the proxy and application-level options like individual deployment parameters (recall how to [configure Serve deployments](serve-configure-deployment)). One major benefit is you can dynamically update individual deployment parameters by modifying the Serve config, without needing to redeploy or restart your application. :::{tip} If you are deploying Serve on a VM, you can use the Serve config with the [serve deploy](serve-in-production-deploying) CLI command. 
If you are deploying Serve on Kubernetes, you can embed the Serve config in a [RayService](serve-in-production-kubernetes) custom resource in Kubernetes to deploy and update your applications.
:::

The Serve config is a YAML file with the following format:

```yaml
proxy_location: ...

http_options:
  host: ...
  port: ...
  request_timeout_s: ...
  keep_alive_timeout_s: ...

grpc_options:
  port: ...
  grpc_servicer_functions: ...
  request_timeout_s: ...

logging_config:
  log_level: ...
  logs_dir: ...
  encoding: ...
  enable_access_log: ...

applications:
- name: ...
  route_prefix: ...
  import_path: ...
  runtime_env: ...
  external_scaler_enabled: ...
  deployments:
  - name: ...
    num_replicas: ...
    ...
  - name: ...
```

The file contains `proxy_location`, `http_options`, `grpc_options`, `logging_config`, and `applications`.

(proxy-config)=
## Proxy config

The `proxy_location` field configures where to run proxies to handle traffic to the cluster. You can set `proxy_location` to the following values:

- EveryNode (default): Run a proxy on every node in the cluster that has at least one replica actor.
- HeadOnly: Only run a single proxy on the head node.
- Disabled: Don't run proxies at all. Set this value if you are only making calls to your applications using deployment handles.

(http-config)=
## HTTP config

The `http_options` are as follows. Note that the HTTP config is global to your Ray cluster, and you can't update it during runtime.

- **`host`**: The host IP address for Serve's HTTP proxies. This is optional and can be omitted. By default, the `host` is set to `0.0.0.0` to expose your deployments publicly. If you're using Kubernetes, you must set `host` to `0.0.0.0` to expose your deployments outside the cluster.
- **`port`**: The port for Serve's HTTP proxies. This parameter is optional and can be omitted. By default, the port is set to `8000`.
- **`request_timeout_s`**: Allows you to set the end-to-end timeout for a request before terminating and retrying at another replica. By default, there is no request timeout.
- **`keep_alive_timeout_s`**: Allows you to set the keep-alive timeout for the HTTP proxy. For more details, see [here](serve-http-guide-keep-alive-timeout).

(grpc-config)=
## gRPC config

The `grpc_options` are as follows. Note that the gRPC config is global to your Ray cluster, and you can't update it during runtime.

- **`port`**: The port that the gRPC proxies listen on. This setting is optional and can be omitted. By default, the port is set to `9000`.
- **`grpc_servicer_functions`**: List of import paths for gRPC `add_servicer_to_server` functions to add to Serve's gRPC proxy. The servicer functions need to be importable from the context of where Serve is running. This defaults to an empty list, which means the gRPC server isn't started.
- **`request_timeout_s`**: Allows you to set the end-to-end timeout for a request before terminating and retrying at another replica. By default, there is no request timeout.

(logging-config)=
## Logging config

The `logging_config` is a global config that applies to the controller, proxy, and replica logs. Note that you can also set application-level and deployment-level logging configs, which take precedence over the global config. See the logging config API [here](../../serve/api/doc/ray.serve.schema.LoggingConfig.rst) for more details.

(application-config)=
## Application config

You configure one or more deployments as part of your Serve application. See [deployment config](serve-configure-deployment).
These are the fields per `application`: - **`name`**: The names for each application that are auto-generated by `serve build`. The name of each application must be unique. - **`route_prefix`**: An application can be called via HTTP at the specified route prefix. It defaults to `/`. The route prefix for each application must be unique. - **`import_path`**: The path to your top-level Serve deployment (or the same path passed to `serve run`). The most minimal config file consists of only an `import_path`. - **`runtime_env`**: Defines the environment that the application runs in. Use this parameter to package application dependencies such as `pip` packages (see {ref}`Runtime Environments ` for supported fields). The `import_path` must be available _within_ the `runtime_env` if it's specified. The Serve config's `runtime_env` can only use [remote URIs](remote-uris) in its `working_dir` and `py_modules`; it can't use local zip files or directories. [More details on runtime env](serve-runtime-env). - **`external_scaler_enabled`**: Enables the external scaling API, which lets you scale deployments from outside the Ray cluster using a REST API. When enabled, you can't use built-in autoscaling (`autoscaling_config`) for any deployment in this application. Defaults to `False`. See [External Scaling API](serve-external-scale-api) for details. - **`deployments (optional)`**: A list of deployment options that allows you to override the `@serve.deployment` settings specified in the deployment graph code. Each entry in this list must include the deployment `name`, which must match one in the code. If this section is omitted, Serve launches all deployments in the graph with the parameters specified in the code. See how to [configure serve deployment options](serve-configure-deployment). - **`args`**: Arguments that are passed to the [application builder](serve-app-builder-guide). ## Example config Below is a config for the [`Text ML Model` example](serve-in-production-example) that follows the format explained above: ```yaml proxy_location: EveryNode http_options: host: 0.0.0.0 port: 8000 applications: - name: default route_prefix: / import_path: text_ml:app runtime_env: pip: - torch - transformers deployments: - name: Translator num_replicas: 1 user_config: language: french - name: Summarizer num_replicas: 1 ``` The file uses the same `text_ml:app` import path that was used with `serve run`, and has two entries in the `deployments` list for the translation and summarization deployments. Both entries contain a `name` setting and some other configuration options such as `num_replicas`. :::{tip} Each individual entry in the `deployments` list is optional. In the example config file above, you could omit the `Summarizer`, including its `name` and `num_replicas`, and the file would still be valid. When you deploy the file, the `Summarizer` deployment is still deployed, using the configurations set in the `@serve.deployment` decorator from the application's code. ::: ## Auto-generate the Serve config using `serve build` You can use a utility to auto-generate this config file from the code. The `serve build` command takes an import path to your application, and it generates a config file containing all the deployments and their parameters in the application code. Tweak these parameters to manage your deployments in production. 
```console
$ ls
text_ml.py

$ serve build text_ml:app -o serve_config.yaml

$ ls
text_ml.py
serve_config.yaml
```

(production-config-yaml)=

The `serve_config.yaml` file contains:

```yaml
proxy_location: EveryNode

http_options:
  host: 0.0.0.0
  port: 8000

grpc_options:
  port: 9000
  grpc_servicer_functions: []

logging_config:
  encoding: TEXT
  log_level: INFO
  logs_dir: null
  enable_access_log: true

applications:
- name: default
  route_prefix: /
  import_path: text_ml:app
  runtime_env: {}
  deployments:
  - name: Translator
    num_replicas: 1
    user_config:
      language: french
  - name: Summarizer
```

Note that the `runtime_env` field is always empty when using `serve build` and must be set manually. In this case, if `torch` and `transformers` aren't installed globally, you should include these two pip packages in the `runtime_env`.

Additionally, `serve build` includes the default HTTP and gRPC options in its autogenerated files. You can modify these parameters.

(serve-user-config)=
## Dynamically change parameters without restarting replicas (`user_config`)

You can use the `user_config` field to supply a structured configuration for your deployment. You can pass arbitrary JSON-serializable objects to the YAML configuration. Serve then applies it to all running and future deployment replicas. Applying the user configuration *doesn't* restart the replica. This deployment continuity means that you can use this field to dynamically:

- adjust model weights and versions without restarting the cluster.
- adjust traffic splitting percentage for your model composition graph.
- configure any feature flag, A/B tests, and hyperparameters for your deployments.

To enable the `user_config` feature, implement a `reconfigure` method that takes a JSON-serializable object (e.g., a Dictionary, List, or String) as its only argument:

```python
from typing import Any, Dict

from ray import serve


@serve.deployment
class Model:
    def reconfigure(self, config: Dict[str, Any]):
        self.threshold = config["threshold"]
```

If you set the `user_config` when you create the deployment (that is, in the decorator or the Serve config file), Ray Serve calls this `reconfigure` method right after the deployment's `__init__` method, and passes the `user_config` in as an argument. You can also trigger the `reconfigure` method by updating your Serve config file with a new `user_config` and reapplying it to the Ray cluster. See [In-place Updates](serve-inplace-updates) for more information.

The corresponding YAML snippet is:

```yaml
...
deployments:
- name: Model
  user_config:
    threshold: 1.5
```

---

(serve-custom-docker-images)=
# Custom Docker Images

This section helps you:

* Extend the official Ray Docker images with your own dependencies
* Package your Serve application in a custom Docker image instead of a `runtime_env`
* Use custom Docker images with KubeRay

To follow this tutorial, make sure to install [Docker Desktop](https://docs.docker.com/engine/install/) and create a [Dockerhub](https://hub.docker.com/) account where you can host custom Docker images.

## Working example

Create a Python file called `fake.py` and save the following Serve application to it:

```{literalinclude} ../doc_code/fake_email_creator.py
:start-after: __fake_start__
:end-before: __fake_end__
:language: python
```

This app creates and returns a fake email address. It relies on the [Faker package](https://github.com/joke2k/faker) to create the fake email address. Install the `Faker` package locally to run it:

```console
% pip install Faker==18.13.0

...

% serve run fake:app

...
# In another terminal window:
% curl localhost:8000
john24@example.org
```

This tutorial explains how to package and serve this code inside a custom Docker image.

## Extending the Ray Docker image

The [rayproject](https://hub.docker.com/u/rayproject) organization maintains Docker images with dependencies needed to run Ray. In fact, the [rayproject/ray](https://hub.docker.com/r/rayproject/ray) repo hosts the Docker images for this doc. For instance, [this RayService config](https://github.com/ray-project/kuberay/blob/release-1.1.0/ray-operator/config/samples/ray-service.sample.yaml) uses the [rayproject/ray:2.9.0](https://hub.docker.com/layers/rayproject/ray/2.9.0/images/sha256-e64546fb5c3233bb0f33608e186e285c52cddd7440cae1af18f7fcde1c04e49f2?context=explore) image hosted by `rayproject/ray`.

You can extend these images and add your own dependencies to them by using them as a base layer in a Dockerfile. For instance, the working example application uses Ray 2.9.0 and Faker 18.13.0. You can create a Dockerfile that extends the `rayproject/ray:2.9.0` image by adding the Faker package:

```dockerfile
# File name: Dockerfile
FROM rayproject/ray:2.9.0

RUN pip install Faker==18.13.0
```

In general, the `rayproject/ray` images contain only the dependencies needed to import Ray and the Ray libraries. You can extend these images to build your own custom images.

Then, you can build this image and push it to your Dockerhub account, so it can be pulled in the future:

```console
% docker build . -t your_dockerhub_username/custom_image_name:latest

...

% docker image push your_dockerhub_username/custom_image_name:latest

...
```

Make sure to replace `your_dockerhub_username` with your DockerHub username and `custom_image_name` with the name you want for your image. `latest` is this image's version. If you don't specify a version when you pull the image, Docker automatically pulls the `latest` version of the image. You can also replace `latest` with a specific version if you prefer.

## Adding your Serve application to the Docker image

During development, it's useful to package your Serve application into a zip file and pull it into your Ray cluster using `runtime_envs`. During production, it's more stable to put the Serve application in the Docker image instead of the `runtime_env` since new nodes won't need to dynamically pull and install the Serve application code before running it.

Use the [WORKDIR](https://docs.docker.com/engine/reference/builder/#workdir) and [COPY](https://docs.docker.com/engine/reference/builder/#copy) commands inside the Dockerfile to install the example Serve application code in your image:

```dockerfile
# File name: Dockerfile
FROM rayproject/ray:2.9.0

RUN pip install Faker==18.13.0

# Set the working dir for the container to /serve_app
WORKDIR /serve_app

# Copies the local `fake.py` file into the WORKDIR
COPY fake.py /serve_app/fake.py
```

KubeRay starts Ray with the `ray start` command inside the `WORKDIR` directory. All the Ray Serve actors are then able to import any dependencies in the directory. By `COPY`ing the Serve file into the `WORKDIR`, the Serve deployments have access to the Serve code without needing a `runtime_env`.

For your applications, you can also add any other dependencies needed for your Serve app to the `WORKDIR` directory.

Build and push this image to Dockerhub. Use the same version as before to overwrite the image stored at that version.
## Using custom Docker images in KubeRay Run these custom Docker images in KubeRay by adding them to the RayService config. Make the following changes: 1. Set the `rayVersion` in the `rayClusterConfig` to the Ray version used in your custom Docker image. 2. Set the `ray-head` container's `image` to the custom image's name on Dockerhub. 3. Set the `ray-worker` container's `image` to the custom image's name on Dockerhub. 4. Update the `serveConfigV2` field to remove any `runtime_env` dependencies that are in the container. A pre-built version of this image is available at [shrekrisanyscale/serve-fake-email-example](https://hub.docker.com/r/shrekrisanyscale/serve-fake-email-example). Try it out by running this RayService config: ```{literalinclude} ../doc_code/fake_email_creator.yaml :start-after: __fake_config_start__ :end-before: __fake_config_end__ :language: yaml ``` --- (serve-e2e-ft)= # Add End-to-End Fault Tolerance This section helps you: * Provide additional fault tolerance for your Serve application * Understand Serve's recovery procedures * Simulate system errors in your Serve application :::{admonition} Relevant Guides :class: seealso This section discusses concepts from: * Serve's [architecture guide](serve-architecture) * Serve's [Kubernetes production guide](serve-in-production-kubernetes) ::: (serve-e2e-ft-guide)= ## Guide: end-to-end fault tolerance for your Serve app Serve provides some [fault tolerance](serve-ft-detail) features out of the box. Two options to get end-to-end fault tolerance are the following: * tune these features and run Serve on top of [KubeRay] * use the [Anyscale platform](https://docs.anyscale.com/platform/services/head-node-ft?utm_source=ray_docs&utm_medium=docs&utm_campaign=tolerance), a managed Ray platform ### Replica health-checking By default, the Serve controller periodically health-checks each Serve deployment replica and restarts it on failure. You can define custom application-level health-checks and adjust their frequency and timeout. To define a custom health-check, add a `check_health` method to your deployment class. This method should take no arguments and return no result, and it should raise an exception if Ray Serve considers the replica unhealthy. If the health-check fails, the Serve controller logs the exception, kills the unhealthy replica(s), and restarts them. You can also use the deployment options to customize how frequently Serve runs the health-check and the timeout after which Serve marks a replica unhealthy. ```{literalinclude} ../doc_code/fault_tolerance/replica_health_check.py :start-after: __health_check_start__ :end-before: __health_check_end__ :language: python ``` In this example, `check_health` raises an error if the connection to an external database is lost. The Serve controller periodically calls this method on each replica of the deployment. If the method raises an exception for a replica, Serve marks that replica as unhealthy and restarts it. Health checks are configured and performed on a per-replica basis. :::{note} You shouldn't call ``check_health`` directly through a deployment handle (e.g., ``await deployment_handle.check_health.remote()``). This would invoke the health check on a single, arbitrary replica. The ``check_health`` method is designed as an interface for the Serve controller, not for direct user calls. ::: :::{note} In a composable deployment graph, each deployment is responsible for its own health, independent of the other deployments it's bound to. 
For example, in an application defined by ``app = ParentDeployment.bind(ChildDeployment.bind())``, ``ParentDeployment`` doesn't restart if ``ChildDeployment`` replicas fail their health checks. When the ``ChildDeployment`` replicas recover, the handle in ``ParentDeployment`` updates automatically to route requests to the healthy replicas. ::: ### Worker node recovery :::{admonition} KubeRay Required :class: caution, dropdown You **must** deploy your Serve application with [KubeRay] to use this feature. See Serve's [Kubernetes production guide](serve-in-production-kubernetes) to learn how you can deploy your app with KubeRay. ::: By default, Serve can recover from certain failures, such as unhealthy actors. When [Serve runs on Kubernetes](serve-in-production-kubernetes) with [KubeRay], it can also recover from some cluster-level failures, such as dead workers or head nodes. When a worker node fails, the actors running on it also fail. Serve detects that the actors have failed, and it attempts to respawn the actors on the remaining, healthy nodes. Meanwhile, KubeRay detects that the node itself has failed, so it attempts to restart the worker pod on another running node, and it also brings up a new healthy node to replace it. Once the node comes up, if the pod is still pending, it can be restarted on that node. Similarly, Serve can also respawn any pending actors on that node as well. The deployment replicas running on healthy nodes can continue serving traffic throughout the recovery period. (serve-e2e-ft-guide-gcs)= ### Head node recovery: Ray GCS fault tolerance :::{admonition} KubeRay Required :class: caution, dropdown You **must** deploy your Serve application with [KubeRay] to use this feature. See Serve's [Kubernetes production guide](serve-in-production-kubernetes) to learn how you can deploy your app with KubeRay. ::: In this section, you'll learn how to add fault tolerance to Ray's Global Control Store (GCS), which allows your Serve application to serve traffic even when the head node crashes. By default, the Ray head node is a single point of failure: if it crashes, the entire Ray cluster crashes and you must restart it. When running on Kubernetes, the `RayService` controller health-checks the Ray cluster and restarts it if this occurs, but this introduces some downtime. Starting with Ray 2.0+, KubeRay supports [Global Control Store (GCS) fault tolerance](kuberay-gcs-ft), preventing the Ray cluster from crashing if the head node goes down. While the head node is recovering, Serve applications can still handle traffic with worker nodes but you can't update or recover from other failures like Actors or Worker nodes crashing. Once the GCS recovers, the cluster returns to normal behavior. You can enable GCS fault tolerance on KubeRay by adding an external Redis server and modifying your `RayService` Kubernetes object with the following steps: #### Step 1: Add external Redis server GCS fault tolerance requires an external Redis database. You can choose to host your own Redis database, or you can use one through a third-party vendor. Use a highly available Redis database for resiliency. **For development purposes**, you can also host a small Redis database on the same Kubernetes cluster as your Ray cluster. 
For example, you can add a 1-node Redis cluster by prepending these three Redis objects to your Kubernetes YAML: (one-node-redis-example)= ```YAML kind: ConfigMap apiVersion: v1 metadata: name: redis-config labels: app: redis data: redis.conf: |- port 6379 bind 0.0.0.0 protected-mode no requirepass 5241590000000000 --- apiVersion: v1 kind: Service metadata: name: redis labels: app: redis spec: type: ClusterIP ports: - name: redis port: 6379 selector: app: redis --- apiVersion: apps/v1 kind: Deployment metadata: name: redis labels: app: redis spec: replicas: 1 selector: matchLabels: app: redis template: metadata: labels: app: redis spec: containers: - name: redis image: redis:5.0.8 command: - "sh" - "-c" - "redis-server /usr/local/etc/redis/redis.conf" ports: - containerPort: 6379 volumeMounts: - name: config mountPath: /usr/local/etc/redis/redis.conf subPath: redis.conf volumes: - name: config configMap: name: redis-config --- ``` **This configuration is NOT production-ready**, but it's useful for development and testing. When you move to production, it's highly recommended that you replace this 1-node Redis cluster with a highly available Redis cluster. #### Step 2: Add Redis info to RayService After adding the Redis objects, you also need to modify the `RayService` configuration. First, you need to update your `RayService` metadata's annotations: ::::{tab-set} :::{tab-item} Vanilla Config ```yaml ... apiVersion: ray.io/v1alpha1 kind: RayService metadata: name: rayservice-sample spec: ... ``` ::: :::{tab-item} Fault Tolerant Config :selected: ```yaml ... apiVersion: ray.io/v1alpha1 kind: RayService metadata: name: rayservice-sample annotations: ray.io/ft-enabled: "true" ray.io/external-storage-namespace: "my-raycluster-storage-namespace" spec: ... ``` ::: :::: The annotations are: * `ray.io/ft-enabled` REQUIRED: Enables GCS fault tolerance when true * `ray.io/external-storage-namespace` OPTIONAL: Sets the [external storage namespace] Next, you need to add the `RAY_REDIS_ADDRESS` environment variable to the `headGroupSpec`: ::::{tab-set} :::{tab-item} Vanilla Config ```yaml apiVersion: ray.io/v1alpha1 kind: RayService metadata: ... spec: ... rayClusterConfig: headGroupSpec: ... template: ... spec: ... env: ... ``` ::: :::{tab-item} Fault Tolerant Config :selected: ```yaml apiVersion: ray.io/v1alpha1 kind: RayService metadata: ... spec: ... rayClusterConfig: headGroupSpec: ... template: ... spec: ... env: ... - name: RAY_REDIS_ADDRESS value: redis:6379 ``` ::: :::: `RAY_REDIS_ADDRESS`'s value should be your Redis database's `redis://` address. It should contain your Redis database's host and port. An [example Redis address](https://www.iana.org/assignments/uri-schemes/prov/rediss) is `redis://user:secret@localhost:6379/0?foo=bar&qux=baz`. In the example above, the Redis deployment name (`redis`) is the host within the Kubernetes cluster, and the Redis port is `6379`. The example is compatible with the previous section's [example config](one-node-redis-example). After you apply the Redis objects along with your updated `RayService`, your Ray cluster can recover from head node crashes without restarting all the workers! :::{seealso} Check out the KubeRay guide on [GCS fault tolerance](kuberay-gcs-ft) to learn more about how Serve leverages the external Redis cluster to provide head node fault tolerance. 
:::

### Spreading replicas across nodes

One way to improve the availability of your Serve application is to spread deployment replicas across multiple nodes so that you still have enough running replicas to serve traffic even after a certain number of node failures.

By default, Serve soft spreads all deployment replicas, but this approach has a few limitations:

* The spread is soft and best-effort, with no guarantee that it's perfectly even.
* Serve tries to spread replicas among the existing nodes if possible instead of launching new nodes. For example, if you have a big enough single-node cluster, Serve schedules all replicas on that single node, assuming it has enough resources. However, that node then becomes a single point of failure.

You can change the spread behavior of your deployment with the `max_replicas_per_node` [deployment option](../../serve/api/doc/ray.serve.deployment_decorator.rst), which hard limits the number of replicas of a given deployment that can run on a single node. If you set it to 1, you're effectively enforcing a strict spread of the deployment replicas. If you don't set it, there's no hard spread constraint and Serve uses the default soft spread described in the preceding paragraph. The `max_replicas_per_node` option is per deployment and only affects the spread of replicas within a deployment. There's no spread constraint between replicas of different deployments.

The following code example shows how to set the `max_replicas_per_node` deployment option:

```{testcode}
import ray
from ray import serve

@serve.deployment(max_replicas_per_node=1)
class Deployment1:
    def __call__(self, request):
        return "hello"

@serve.deployment(max_replicas_per_node=2)
class Deployment2:
    def __call__(self, request):
        return "world"
```

This example has two Serve deployments with different `max_replicas_per_node`: `Deployment1` can have at most one replica on each node and `Deployment2` can have at most two replicas on each node. If you schedule two replicas of `Deployment1` and two replicas of `Deployment2`, Serve runs a cluster with at least two nodes, each running one replica of `Deployment1`. The two replicas of `Deployment2` may run on either a single node or across two nodes because either satisfies the `max_replicas_per_node` constraint.

(serve-e2e-ft-behavior)=
## Serve's recovery procedures

This section explains how Serve recovers from system failures. It uses the following Serve application and config as a working example.

::::{tab-set}

:::{tab-item} Python Code
```{literalinclude} ../doc_code/fault_tolerance/sleepy_pid.py
:start-after: __start__
:end-before: __end__
:language: python
```
:::

:::{tab-item} Kubernetes Config
```{literalinclude} ../doc_code/fault_tolerance/k8s_config.yaml
:language: yaml
```
:::

::::

Follow the [KubeRay quickstart guide](kuberay-quickstart) to:

* Install `kubectl` and `Helm`
* Prepare a Kubernetes cluster
* Deploy a KubeRay operator

Then, [deploy the Serve application](serve-deploy-app-on-kuberay) above:

```console
$ kubectl apply -f config.yaml
```

### Worker node failure

You can simulate a worker node failure in the working example.
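As a reminder of what the working example does: each replica reports its own process ID, which is how the walkthrough below tells replicas apart. Here is a minimal sketch of that idea (the real code is the `sleepy_pid.py` file included above; the class name and replica count match the actor listings later in this guide, everything else is illustrative):

```python
import os

from ray import serve


@serve.deployment(num_replicas=6)
class SleepyPid:
    def __call__(self, request) -> int:
        # Report this replica's process ID so you can tell which replica
        # handled the request.
        return os.getpid()


app = SleepyPid.bind()
```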
First, take a look at the nodes and pods running in your Kubernetes cluster: ```console $ kubectl get nodes NAME STATUS ROLES AGE VERSION gke-serve-demo-default-pool-ed597cce-nvm2 Ready 3d22h v1.22.12-gke.1200 gke-serve-demo-default-pool-ed597cce-m888 Ready 3d22h v1.22.12-gke.1200 gke-serve-demo-default-pool-ed597cce-pu2q Ready 3d22h v1.22.12-gke.1200 $ kubectl get pods -o wide NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES ervice-sample-raycluster-thwmr-worker-small-group-bdv6q 1/1 Running 0 3m3s 10.68.2.62 gke-serve-demo-default-pool-ed597cce-nvm2 ervice-sample-raycluster-thwmr-worker-small-group-pztzk 1/1 Running 0 3m3s 10.68.2.61 gke-serve-demo-default-pool-ed597cce-m888 rayservice-sample-raycluster-thwmr-head-28mdh 1/1 Running 1 (2m55s ago) 3m3s 10.68.0.45 gke-serve-demo-default-pool-ed597cce-pu2q redis-75c8b8b65d-4qgfz 1/1 Running 0 3m3s 10.68.2.60 gke-serve-demo-default-pool-ed597cce-nvm2 ``` Open a separate terminal window and port-forward to one of the worker nodes: ```console $ kubectl port-forward ervice-sample-raycluster-thwmr-worker-small-group-bdv6q 8000 Forwarding from 127.0.0.1:8000 -> 8000 Forwarding from [::1]:8000 -> 8000 ``` While the `port-forward` is running, you can query the application in another terminal window: ```console $ curl localhost:8000 418 ``` The output is the process ID of the deployment replica that handled the request. The application launches 6 deployment replicas, so if you run the query multiple times, you should see different process IDs: ```console $ curl localhost:8000 418 $ curl localhost:8000 256 $ curl localhost:8000 385 ``` Now you can simulate worker failures. You have two options: kill a worker pod or kill a worker node. Let's start with the worker pod. Make sure to kill the pod that you're **not** port-forwarding to, so you can continue querying the living worker while the other one relaunches. ```console $ kubectl delete pod ervice-sample-raycluster-thwmr-worker-small-group-pztzk pod "ervice-sample-raycluster-thwmr-worker-small-group-pztzk" deleted $ curl localhost:8000 6318 ``` While the pod crashes and recovers, the live pod can continue serving traffic! :::{tip} Killing a node and waiting for it to recover usually takes longer than killing a pod and waiting for it to recover. For this type of debugging, it's quicker to simulate failures by killing at the pod level rather than at the node level. ::: You can similarly kill a worker node and see that the other nodes can continue serving traffic: ```console $ kubectl get pods -o wide NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES ervice-sample-raycluster-thwmr-worker-small-group-bdv6q 1/1 Running 0 65m 10.68.2.62 gke-serve-demo-default-pool-ed597cce-nvm2 ervice-sample-raycluster-thwmr-worker-small-group-mznwq 1/1 Running 0 5m46s 10.68.1.3 gke-serve-demo-default-pool-ed597cce-m888 rayservice-sample-raycluster-thwmr-head-28mdh 1/1 Running 1 (65m ago) 65m 10.68.0.45 gke-serve-demo-default-pool-ed597cce-pu2q redis-75c8b8b65d-4qgfz 1/1 Running 0 65m 10.68.2.60 gke-serve-demo-default-pool-ed597cce-nvm2 $ kubectl delete node gke-serve-demo-default-pool-ed597cce-m888 node "gke-serve-demo-default-pool-ed597cce-m888" deleted $ curl localhost:8000 385 ``` ### Head node failure You can simulate a head node failure by either killing the head pod or the head node. 
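If you're not sure which pod is the head, a label selector is a quick way to find it. The following sketch assumes the default `ray.io/node-type` labels that KubeRay applies to the pods it manages; adjust it if your setup uses different labels:

```console
$ kubectl get pods -l ray.io/node-type=head
```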
First, take a look at the running pods in your cluster: ```console $ kubectl get pods -o wide NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES ervice-sample-raycluster-thwmr-worker-small-group-6f2pk 1/1 Running 0 6m59s 10.68.2.64 gke-serve-demo-default-pool-ed597cce-nvm2 ervice-sample-raycluster-thwmr-worker-small-group-bdv6q 1/1 Running 0 79m 10.68.2.62 gke-serve-demo-default-pool-ed597cce-nvm2 rayservice-sample-raycluster-thwmr-head-28mdh 1/1 Running 1 (79m ago) 79m 10.68.0.45 gke-serve-demo-default-pool-ed597cce-pu2q redis-75c8b8b65d-4qgfz 1/1 Running 0 79m 10.68.2.60 gke-serve-demo-default-pool-ed597cce-nvm2 ``` Port-forward to one of your worker pods. Make sure this pod is on a separate node from the head node, so you can kill the head node without crashing the worker: ```console $ kubectl port-forward ervice-sample-raycluster-thwmr-worker-small-group-bdv6q Forwarding from 127.0.0.1:8000 -> 8000 Forwarding from [::1]:8000 -> 8000 ``` In a separate terminal, you can make requests to the Serve application: ```console $ curl localhost:8000 418 ``` You can kill the head pod to simulate killing the Ray head node: ```console $ kubectl delete pod rayservice-sample-raycluster-thwmr-head-28mdh pod "rayservice-sample-raycluster-thwmr-head-28mdh" deleted $ curl localhost:8000 ``` If you have configured [GCS fault tolerance](serve-e2e-ft-guide-gcs) on your cluster, your worker pod can continue serving traffic without restarting when the head pod crashes and recovers. Without GCS fault tolerance, KubeRay restarts all worker pods when the head pod crashes, so you'll need to wait for the workers to restart and the deployments to reinitialize before you can port-forward and send more requests. ### Serve controller failure You can simulate a Serve controller failure by manually killing the Serve actor. If you're running KubeRay, `exec` into one of your pods: ```console $ kubectl get pods NAME READY STATUS RESTARTS AGE ervice-sample-raycluster-mx5x6-worker-small-group-hfhnw 1/1 Running 0 118m ervice-sample-raycluster-mx5x6-worker-small-group-nwcpb 1/1 Running 0 118m rayservice-sample-raycluster-mx5x6-head-bqjhw 1/1 Running 0 118m redis-75c8b8b65d-4qgfz 1/1 Running 0 3h36m $ kubectl exec -it rayservice-sample-raycluster-mx5x6-head-bqjhw -- bash ray@rayservice-sample-raycluster-mx5x6-head-bqjhw:~$ ``` You can use the [Ray State API](state-api-cli-ref) to inspect your Serve app: ```console $ ray summary actors ======== Actors Summary: 2022-10-04 21:06:33.678706 ======== Stats: ------------------------------------ total_actors: 10 Table (group by class): ------------------------------------ CLASS_NAME STATE_COUNTS 0 ProxyActor ALIVE: 3 1 ServeReplica:SleepyPid ALIVE: 6 2 ServeController ALIVE: 1 $ ray list actors --filter "class_name=ServeController" ======== List: 2022-10-04 21:09:14.915881 ======== Stats: ------------------------------ Total: 1 Table: ------------------------------ ACTOR_ID CLASS_NAME STATE NAME PID 0 70a718c973c2ce9471d318f701000000 ServeController ALIVE SERVE_CONTROLLER_ACTOR 48570 ``` You can then kill the Serve controller via the Python interpreter. Note that you'll need to use the `NAME` from the `ray list actor` output to get a handle to the Serve controller. 
```console $ python >>> import ray >>> controller_handle = ray.get_actor("SERVE_CONTROLLER_ACTOR", namespace="serve") >>> ray.kill(controller_handle, no_restart=True) >>> exit() ``` You can use the Ray State API to check the controller's status: ```console $ ray list actors --filter "class_name=ServeController" ======== List: 2022-10-04 21:36:37.157754 ======== Stats: ------------------------------ Total: 2 Table: ------------------------------ ACTOR_ID CLASS_NAME STATE NAME PID 0 3281133ee86534e3b707190b01000000 ServeController ALIVE SERVE_CONTROLLER_ACTOR 49914 1 70a718c973c2ce9471d318f701000000 ServeController DEAD SERVE_CONTROLLER_ACTOR 48570 ``` You should still be able to query your deployments while the controller is recovering: ``` # If you're running KubeRay, you # can do this from inside the pod: $ python >>> import requests >>> requests.get("http://localhost:8000").json() 347 ``` :::{note} While the controller is dead, replica health-checking and deployment autoscaling will not work. They'll continue working once the controller recovers. ::: ### Deployment replica failure You can simulate replica failures by manually killing deployment replicas. If you're running KubeRay, make sure to `exec` into a Ray pod before running these commands. ```console $ ray summary actors ======== Actors Summary: 2022-10-04 21:40:36.454488 ======== Stats: ------------------------------------ total_actors: 11 Table (group by class): ------------------------------------ CLASS_NAME STATE_COUNTS 0 ProxyActor ALIVE: 3 1 ServeController ALIVE: 1 2 ServeReplica:SleepyPid ALIVE: 6 $ ray list actors --filter "class_name=ServeReplica:SleepyPid" ======== List: 2022-10-04 21:41:32.151864 ======== Stats: ------------------------------ Total: 6 Table: ------------------------------ ACTOR_ID CLASS_NAME STATE NAME PID 0 39e08b172e66a5d22b2b4cf401000000 ServeReplica:SleepyPid ALIVE SERVE_REPLICA::SleepyPid#RlRptP 203 1 55d59bcb791a1f9353cd34e301000000 ServeReplica:SleepyPid ALIVE SERVE_REPLICA::SleepyPid#BnoOtj 348 2 8c34e675edf7b6695461d13501000000 ServeReplica:SleepyPid ALIVE SERVE_REPLICA::SleepyPid#SakmRM 283 3 a95405318047c5528b7483e701000000 ServeReplica:SleepyPid ALIVE SERVE_REPLICA::SleepyPid#rUigUh 347 4 c531188fede3ebfc868b73a001000000 ServeReplica:SleepyPid ALIVE SERVE_REPLICA::SleepyPid#gbpoFe 383 5 de8dfa16839443f940fe725f01000000 ServeReplica:SleepyPid ALIVE SERVE_REPLICA::SleepyPid#PHvdJW 176 ``` You can use the `NAME` from the `ray list actor` output to get a handle to one of the replicas: ```console $ python >>> import ray >>> replica_handle = ray.get_actor("SERVE_REPLICA::SleepyPid#RlRptP", namespace="serve") >>> ray.kill(replica_handle, no_restart=True) >>> exit() ``` While the replica is restarted, the other replicas can continue processing requests. Eventually the replica restarts and continues serving requests: ```console $ python >>> import requests >>> requests.get("http://localhost:8000").json() 383 ``` ### Proxy failure You can simulate Proxy failures by manually killing `ProxyActor` actors. If you're running KubeRay, make sure to `exec` into a Ray pod before running these commands. 
```console $ ray summary actors ======== Actors Summary: 2022-10-04 21:51:55.903800 ======== Stats: ------------------------------------ total_actors: 12 Table (group by class): ------------------------------------ CLASS_NAME STATE_COUNTS 0 ProxyActor ALIVE: 3 1 ServeController ALIVE: 1 2 ServeReplica:SleepyPid ALIVE: 6 $ ray list actors --filter "class_name=ProxyActor" ======== List: 2022-10-04 21:52:39.853758 ======== Stats: ------------------------------ Total: 3 Table: ------------------------------ ACTOR_ID CLASS_NAME STATE NAME PID 0 283fc11beebb6149deb608eb01000000 ProxyActor ALIVE SERVE_CONTROLLER_ACTOR:SERVE_PROXY_ACTOR-91f9a685e662313a0075efcb7fd894249a5bdae7ee88837bea7985a0 101 1 2b010ce28baeff5cb6cb161e01000000 ProxyActor ALIVE SERVE_CONTROLLER_ACTOR:SERVE_PROXY_ACTOR-cc262f3dba544a49ea617d5611789b5613f8fe8c86018ef23c0131eb 133 2 7abce9dd241b089c1172e9ca01000000 ProxyActor ALIVE SERVE_CONTROLLER_ACTOR:SERVE_PROXY_ACTOR-7589773fc62e08c2679847aee9416805bbbf260bee25331fa3389c4f 267 ``` You can use the `NAME` from the `ray list actor` output to get a handle to one of the replicas: ```console $ python >>> import ray >>> proxy_handle = ray.get_actor("SERVE_CONTROLLER_ACTOR:SERVE_PROXY_ACTOR-91f9a685e662313a0075efcb7fd894249a5bdae7ee88837bea7985a0", namespace="serve") >>> ray.kill(proxy_handle, no_restart=False) >>> exit() ``` While the proxy is restarted, the other proxies can continue accepting requests. Eventually the proxy restarts and continues accepting requests. You can use the `ray list actor` command to see when the proxy restarts: ```console $ ray list actors --filter "class_name=ProxyActor" ======== List: 2022-10-04 21:58:41.193966 ======== Stats: ------------------------------ Total: 3 Table: ------------------------------ ACTOR_ID CLASS_NAME STATE NAME PID 0 283fc11beebb6149deb608eb01000000 ProxyActor ALIVE SERVE_CONTROLLER_ACTOR:SERVE_PROXY_ACTOR-91f9a685e662313a0075efcb7fd894249a5bdae7ee88837bea7985a0 57317 1 2b010ce28baeff5cb6cb161e01000000 ProxyActor ALIVE SERVE_CONTROLLER_ACTOR:SERVE_PROXY_ACTOR-cc262f3dba544a49ea617d5611789b5613f8fe8c86018ef23c0131eb 133 2 7abce9dd241b089c1172e9ca01000000 ProxyActor ALIVE SERVE_CONTROLLER_ACTOR:SERVE_PROXY_ACTOR-7589773fc62e08c2679847aee9416805bbbf260bee25331fa3389c4f 267 ``` Note that the PID for the first ProxyActor has changed, indicating that it restarted. [KubeRay]: kuberay-index [external storage namespace]: kuberay-external-storage-namespace --- (serve-handling-dependencies)= # Handle Dependencies (serve-runtime-env)= ## Add a runtime environment The import path (e.g., `text_ml:app`) must be importable by Serve at runtime. When running locally, this path might be in your current working directory. However, when running on a cluster you also need to make sure the path is importable. Build the code into the cluster's container image (see [Cluster Configuration](kuberay-config) for more details) or use a `runtime_env` with a [remote URI](remote-uris) that hosts the code in remote storage. For an example, see the [Text ML Models application on GitHub](https://github.com/ray-project/serve_config_examples/blob/master/text_ml.py). 
You can use this config file to deploy the text summarization and translation application to your own Ray cluster even if you don't have the code locally: ```yaml import_path: text_ml:app runtime_env: working_dir: "https://github.com/ray-project/serve_config_examples/archive/HEAD.zip" pip: - torch - transformers ``` :::{note} You can also package a deployment graph into a standalone Python package that you can import using a [PYTHONPATH](https://docs.python.org/3.10/using/cmdline.html#envvar-PYTHONPATH) to provide location independence on your local machine. However, the best practice is to use a `runtime_env`, to ensure consistency across all machines in your cluster. ::: ## Dependencies per deployment Ray Serve also supports serving deployments with different (and possibly conflicting) Python dependencies. For example, you can simultaneously serve one deployment that uses legacy Tensorflow 1 and another that uses Tensorflow 2. This is supported on Mac OS and Linux using Ray's {ref}`runtime-environments` feature. As with all other Ray actor options, pass the runtime environment in via `ray_actor_options` in your deployment. Be sure to first run `pip install "ray[default]"` to ensure the Runtime Environments feature is installed. Example: ```{literalinclude} ../doc_code/varying_deps.py :language: python ``` :::{tip} Avoid dynamically installing packages that install from source: these can be slow and use up all resources while installing, leading to problems with the Ray cluster. Consider precompiling such packages in a private repository or Docker image. ::: The dependencies required in the deployment may be different than the dependencies installed in the driver program (the one running Serve API calls). In this case, you should use a delayed import within the class to avoid importing unavailable packages in the driver. This applies even when not using runtime environments. Example: ```{literalinclude} ../doc_code/delayed_import.py :language: python ``` --- (serve-in-production)= # Production Guide ```{toctree} :hidden: config kubernetes docker fault-tolerance handling-dependencies best-practices ``` The recommended way to run Ray Serve in production is on Kubernetes using the [KubeRay](kuberay-quickstart) [RayService](kuberay-rayservice-quickstart) custom resource. The RayService custom resource automatically handles important production requirements such as health checking, status reporting, failure recovery, and upgrades. If you're not running on Kubernetes, you can also run Ray Serve on a Ray cluster directly using the Serve CLI. This section will walk you through a quickstart of how to generate a Serve config file and deploy it using the Serve CLI. For more details, you can check out the other pages in the production guide: - Understand the [Serve config file format](serve-in-production-config-file). - Understand how to [deploy on Kubernetes using KubeRay](serve-in-production-kubernetes). - Understand how to [monitor running Serve applications](serve-monitoring). For deploying on VMs instead of Kubernetes, see [Deploy on VM](serve-in-production-deploying). (serve-in-production-example)= ## Working example: Text summarization and translation application Throughout the production guide, we will use the following Serve application as a working example. The application takes in a string of text in English, then summarizes and translates it into French (default), German, or Romanian. 
```{literalinclude} ../doc_code/production_guide/text_ml.py :language: python :start-after: __example_start__ :end-before: __example_end__ ``` Save this code locally in `text_ml.py`. In development, we would likely use the `serve run` command to iteratively run, develop, and repeat (see the [Development Workflow](serve-dev-workflow) for more information). When we're ready to go to production, we will generate a structured [config file](serve-in-production-config-file) that acts as the single source of truth for the application. This config file can be generated using `serve build`: ``` $ serve build text_ml:app -o serve_config.yaml ``` The generated version of this file contains an `import_path`, `runtime_env`, and configuration options for each deployment in the application. The application needs the `torch` and `transformers` packages, so modify the `runtime_env` field of the generated config to include these two pip packages. Save this config locally in `serve_config.yaml`. ```yaml proxy_location: EveryNode http_options: host: 0.0.0.0 port: 8000 applications: - name: default route_prefix: / import_path: text_ml:app runtime_env: pip: - torch - transformers deployments: - name: Translator num_replicas: 1 user_config: language: french - name: Summarizer num_replicas: 1 ``` You can use `serve deploy` to deploy the application to a local Ray cluster and `serve status` to get the status at runtime: ```console # Start a local Ray cluster. ray start --head # Deploy the Text ML application to the local Ray cluster. serve deploy serve_config.yaml 2022-08-16 12:51:22,043 SUCC scripts.py:180 -- Sent deploy request successfully! * Use `serve status` to check deployments' statuses. * Use `serve config` to see the running app's config. $ serve status proxies: cef533a072b0f03bf92a6b98cb4eb9153b7b7c7b7f15954feb2f38ec: HEALTHY applications: default: status: RUNNING message: '' last_deployed_time_s: 1694041157.2211847 deployments: Translator: status: HEALTHY replica_states: RUNNING: 1 message: '' Summarizer: status: HEALTHY replica_states: RUNNING: 1 message: '' ``` Test the application using Python `requests`: ```{literalinclude} ../doc_code/production_guide/text_ml.py :language: python :start-after: __start_client__ :end-before: __end_client__ ``` To update the application, modify the config file and use `serve deploy` again. ## Next Steps For a deeper dive into how to deploy, update, and monitor Serve applications, see the following pages: - Learn the details of the [Serve config file format](serve-in-production-config-file). - Learn how to [deploy on Kubernetes using KubeRay](serve-in-production-kubernetes). - Learn how to [build custom Docker images](serve-custom-docker-images) to use with KubeRay. - Learn how to [monitor running Serve applications](serve-monitoring). [KubeRay]: kuberay-index [RayService]: kuberay-rayservice-quickstart --- (serve-in-production-kubernetes)= # Deploy on Kubernetes This section should help you: - understand how to install and use the [KubeRay] operator. - understand how to deploy a Ray Serve application using a [RayService]. - understand how to monitor and update your application. Deploying Ray Serve on Kubernetes provides the scalable compute of Ray Serve and operational benefits of Kubernetes. This combination also allows you to integrate with existing applications that may be running on Kubernetes. When running on Kubernetes, use the [RayService] controller from [KubeRay]. 
> NOTE: [Anyscale](https://www.anyscale.com/get-started) is a managed Ray solution that provides high-availability, high-performance autoscaling, multi-cloud clusters, spot instance support, and more out of the box.

A [RayService] CR encapsulates a multi-node Ray Cluster and a Serve application that runs on top of it into a single Kubernetes manifest. Deploying, upgrading, and getting the status of the application can be done using standard `kubectl` commands. This section walks through how to deploy, monitor, and upgrade the [Text ML example](serve-in-production-example) on Kubernetes.

(serve-installing-kuberay-operator)=
## Installing the KubeRay operator

Follow the [KubeRay quickstart guide](kuberay-quickstart) to:

* Install `kubectl` and `Helm`
* Prepare a Kubernetes cluster
* Deploy a KubeRay operator

## Setting up a RayService custom resource (CR)

Once the KubeRay controller is running, manage your Ray Serve application by creating and updating a `RayService` CR ([example](https://github.com/ray-project/kuberay/blob/5b1a5a11f5df76db2d66ed332ff0802dc3bbff76/ray-operator/config/samples/ray-service.text-ml.yaml)).

Under the `spec` section in the `RayService` CR, set the following fields:

**`serveConfigV2`**: Represents the configuration that Ray Serve uses to deploy the application. Use `serve build` to print the Serve configuration and copy-paste it directly into your [Kubernetes config](serve-in-production-kubernetes) and `RayService` CR.

**`rayClusterConfig`**: Populate this field with the contents of the `spec` field from the `RayCluster` CR YAML file. Refer to [KubeRay configuration](kuberay-config) for more details.

:::{tip}
To enhance the reliability of your application, particularly when dealing with large dependencies that may require a significant amount of time to download, consider including the dependencies in your image's Dockerfile, so the dependencies are available as soon as the pods start.
:::

(serve-deploy-app-on-kuberay)=
## Deploying a Serve application

When the `RayService` is created, the `KubeRay` controller first creates a Ray cluster using the provided configuration. Then, once the cluster is running, it deploys the Serve application to the cluster using the [REST API](serve-in-production-deploying). The controller also creates a Kubernetes Service that can be used to route traffic to the Serve application.

To see an example, deploy the [Text ML example](serve-in-production-example). The Serve config for the example is embedded into [this sample `RayService` CR](https://github.com/ray-project/kuberay/blob/5b1a5a11f5df76db2d66ed332ff0802dc3bbff76/ray-operator/config/samples/ray-service.text-ml.yaml). Save this CR locally to a file named `ray-service.text-ml.yaml`:

:::{note}
- The example `RayService` uses very low `numCpus` values for demonstration purposes. In production, provide more resources to the Serve application. Learn more about how to configure KubeRay clusters [here](kuberay-config).
- If you have dependencies that must be installed during deployment, you can add them to the `runtime_env` in the deployment code. Learn more [here](serve-handling-dependencies).
:::

```console
$ curl -o ray-service.text-ml.yaml https://raw.githubusercontent.com/ray-project/kuberay/2ba0dd7bea387ac9df3681666bab3d622e89846c/ray-operator/config/samples/ray-service.text-ml.yaml
```

To deploy the example, simply `kubectl apply` the CR.
This creates the underlying Ray cluster, consisting of a head and worker node pod (see [Ray Clusters Key Concepts](../../cluster/key-concepts.rst) for more details on Ray clusters), as well as the service that can be used to query our application: ```console $ kubectl apply -f ray-service.text-ml.yaml $ kubectl get rayservices NAME SERVICE STATUS NUM SERVE ENDPOINTS rayservice-sample Running 1 $ kubectl get pods NAME READY STATUS RESTARTS AGE rayservice-sample-raycluster-7wlx2-head-hr8mg 1/1 Running 0 XXs rayservice-sample-raycluster-7wlx2-small-group-worker-tb8nn 1/1 Running 0 XXs $ kubectl get services NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE rayservice-sample-head-svc ClusterIP None 10001/TCP,8265/TCP,6379/TCP,8080/TCP,8000/TCP XXs rayservice-sample-raycluster-7wlx2-head-svc ClusterIP None 10001/TCP,8265/TCP,6379/TCP,8080/TCP,8000/TCP XXs rayservice-sample-serve-svc ClusterIP 192.168.145.219 8000/TCP XXs ``` Note that the `rayservice-sample-serve-svc` above is the one that can be used to send queries to the Serve application -- this will be used in the next section. ## Querying the application Once the `RayService` is running, we can query it over HTTP using the service created by the KubeRay controller. This service can be queried directly from inside the cluster, but to access it from your laptop you'll need to configure a [Kubernetes ingress](kuberay-networking) or use port forwarding as below: ```console $ kubectl port-forward service/rayservice-sample-serve-svc 8000 $ curl -X POST -H "Content-Type: application/json" localhost:8000/summarize_translate -d '"It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief"' c'était le meilleur des temps, c'était le pire des temps . ``` (serve-getting-status-kubernetes)= ## Getting the status of the application As the `RayService` is running, the `KubeRay` controller continually monitors it and writes relevant status updates to the CR. You can view the status of the application using `kubectl describe`. This includes the status of the cluster, events such as health check failures or restarts, and the application-level statuses reported by [`serve status`](serve-in-production-inspecting). ```console $ kubectl get rayservices NAME AGE rayservice-sample 7s $ kubectl describe rayservice rayservice-sample ... 
Status: Active Service Status: Ray Cluster Status: Available Worker Replicas: 1 Desired CPU: 2500m Desired GPU: 0 Desired Memory: 4Gi Desired TPU: 0 Desired Worker Replicas: 1 Endpoints: Client: 10001 Dashboard: 8265 Metrics: 8080 Redis: 6379 Serve: 8000 Head: Pod IP: 10.48.99.153 Pod Name: rayservice-sample-raycluster-7wlx2-head-dqv7t Service IP: 10.48.99.153 Service Name: rayservice-sample-raycluster-7wlx2-head-svc Last Update Time: 2025-04-28T06:32:13Z Max Worker Replicas: 5 Min Worker Replicas: 1 Observed Generation: 1 Observed Generation: 1 Pending Service Status: Application Statuses: text_ml_app: Health Last Update Time: 2025-04-28T06:39:02Z Serve Deployment Statuses: Summarizer: Health Last Update Time: 2025-04-28T06:39:02Z Status: HEALTHY Translator: Health Last Update Time: 2025-04-28T06:39:02Z Status: HEALTHY Status: RUNNING Ray Cluster Name: rayservice-sample-raycluster-7wlx2 Ray Cluster Status: Desired CPU: 0 Desired GPU: 0 Desired Memory: 0 Desired TPU: 0 Head: Service Status: Running Events: Type Reason Age From Message ---- ------ ---- ---- ------- Normal Running 2m15s (x29791 over 16h) rayservice-controller The Serve applicaton is now running and healthy. ``` ## Updating the application To update the `RayService`, modify the manifest and apply it use `kubectl apply`. There are two types of updates that can occur: - *Application-level updates*: when only the Serve config options are changed, the update is applied _in-place_ on the same Ray cluster. This enables [lightweight updates](serve-in-production-lightweight-update) such as scaling a deployment up or down or modifying autoscaling parameters. - *Cluster-level updates*: when the `RayCluster` config options are changed, such as updating the container image for the cluster, it may result in a cluster-level update. In this case, a new cluster is started, and the application is deployed to it. Once the new cluster is ready, the Kubernetes service is updated to point to the new cluster and the previous cluster is terminated. There should not be any downtime for the application, but note that this requires the Kubernetes cluster to be large enough to schedule both Ray clusters. ### Example: Serve config update In the Text ML example above, change the language of the Translator in the Serve config to German: ```yaml - name: Translator num_replicas: 1 user_config: language: german ``` Now to update the application we apply the modified manifest: ```console $ kubectl apply -f ray-service.text-ml.yaml $ kubectl describe rayservice rayservice-sample ... Serve Deployment Statuses: text_ml_app_Translator: Health Last Update Time: 2023-09-07T18:21:36Z Last Update Time: 2023-09-07T18:21:36Z Status: UPDATING ... ``` Query the application to see a different translation in German: ```console $ curl -X POST -H "Content-Type: application/json" localhost:8000/summarize_translate -d '"It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief"' Es war die beste Zeit, es war die schlimmste Zeit . ``` ### Updating the RayCluster config The process of updating the RayCluster config is the same as updating the Serve config. For example, we can update the number of worker nodes to 2 in the manifest: ```console workerGroupSpecs: # the number of pods in the worker group. - replicas: 2 ``` ```console $ kubectl apply -f ray-service.text-ml.yaml $ kubectl describe rayservice rayservice-sample ... 
  pendingServiceStatus:
    appStatus: {}
    dashboardStatus:
      healthLastUpdateTime: "2022-07-18T21:54:53Z"
      lastUpdateTime: "2022-07-18T21:54:54Z"
    rayClusterName: rayservice-sample-raycluster-bshfr
    rayClusterStatus: {}
...
```

In the status, you can see that the `RayService` is preparing a pending cluster. After the pending cluster is healthy, it becomes the active cluster and the previous cluster is terminated.

## Autoscaling

You can configure autoscaling for your Serve application by setting the autoscaling field in the Serve config. Learn more about the configuration options in the [Serve Autoscaling Guide](serve-autoscaling).

To enable autoscaling in a KubeRay cluster, you need to set `enableInTreeAutoscaling` to True. Additional options are available to configure the autoscaling behavior. For further details, refer to the documentation [here](serve-autoscaling).

:::{note}
In most use cases, it is recommended to enable Kubernetes autoscaling to fully utilize the resources in your cluster. If you are using GKE, you can utilize the AutoPilot Kubernetes cluster. For instructions, see [Create an Autopilot Cluster](https://cloud.google.com/kubernetes-engine/docs/how-to/creating-an-autopilot-cluster). For EKS, you can enable Kubernetes cluster autoscaling by utilizing the Cluster Autoscaler. For detailed information, see [Cluster Autoscaler on AWS](https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/README.md). To understand the relationship between Kubernetes autoscaling and Ray autoscaling, see [Ray Autoscaler with Kubernetes Cluster Autoscaler](kuberay-autoscaler-with-ray-autoscaler).
:::

## Load balancer

Set up ingress to expose your Serve application with a load balancer. See [this configuration](https://github.com/ray-project/kuberay/blob/v1.0.0/ray-operator/config/samples/ray-service-alb-ingress.yaml).

:::{note}
- Ray Serve runs an HTTP proxy on every node, allowing you to use `/-/routes` as the endpoint for node health checks.
- Ray Serve uses port 8000 as the default HTTP proxy traffic port. You can change the port by setting `http_options` in the Serve config. Learn more details [here](serve-multi-application).
:::

## Monitoring

Monitor your Serve application using the Ray Dashboard.

- Learn more about how to configure and manage the Dashboard [here](observability-configure-manage-dashboard).
- Learn about the Ray Serve Dashboard [here](serve-monitoring).
- Learn how to set up [Prometheus](prometheus-setup) and [Grafana](grafana) for the Dashboard.
- Learn about the [Ray Serve logs](serve-logging) and how to [persist logs](persist-kuberay-custom-resource-logs) on Kubernetes.

:::{note}
- To troubleshoot application deployment failures in Serve, you can check the KubeRay operator logs by running `kubectl logs -f <kuberay-operator-pod-name>` (e.g., `kubectl logs -f kuberay-operator-7447d85d58-lv7pf`). The KubeRay operator logs contain information about the Serve application deployment event and Serve application health checks.
- You can also check the controller log and deployment log, which are located under `/tmp/ray/session_latest/logs/serve/` in both the head node pod and worker node pod. These logs contain information about specific deployment failure reasons and autoscaling events.
:::

## Next Steps

See [Add End-to-End Fault Tolerance](serve-e2e-ft) to learn more about Serve's failure conditions and how to guard against them.
[KubeRay]: kuberay-quickstart [RayService]: kuberay-rayservice-quickstart --- (serve-resource-allocation)= # Resource Allocation This guide helps you configure Ray Serve to: - Scale your deployments horizontally by specifying a number of replicas - Scale up and down automatically to react to changing traffic - Allocate hardware resources (CPUs, GPUs, other accelerators, etc) for each deployment (serve-cpus-gpus)= ## Resource management (CPUs, GPUs, accelerators) You may want to specify a deployment's resource requirements to reserve cluster resources like GPUs or other accelerators. To assign hardware resources per replica, you can pass resource requirements to `ray_actor_options`. By default, each replica reserves one CPU. To learn about options to pass in, take a look at the [Resources with Actors guide](actor-resource-guide). For example, to create a deployment where each replica uses a single GPU, you can do the following: ```python @serve.deployment(ray_actor_options={"num_gpus": 1}) def func(*args): return do_something_with_my_gpu() ``` Or if you want to create a deployment where each replica uses another type of accelerator such as an HPU, follow the example below: ```python @serve.deployment(ray_actor_options={"resources": {"HPU": 1}}) def func(*args): return do_something_with_my_hpu() ``` (serve-fractional-resources-guide)= ### Fractional CPUs and fractional GPUs To do this, the resources specified in `ray_actor_options` can be *fractional*. For example, if you have two models and each doesn't fully saturate a GPU, you might want to have them share a GPU by allocating 0.5 GPUs each. ```python @serve.deployment(ray_actor_options={"num_gpus": 0.5}) def func_1(*args): return do_something_with_my_gpu() @serve.deployment(ray_actor_options={"num_gpus": 0.5}) def func_2(*args): return do_something_with_my_gpu() ``` In this example, each replica of each deployment will be allocated 0.5 GPUs. The same can be done to multiplex over CPUs, using `"num_cpus"`. ### Custom resources, accelerator types, and more You can also specify {ref}`custom resources ` in `ray_actor_options`, for example to ensure that a deployment is scheduled on a specific node. For example, if you have a deployment that requires 2 units of the `"custom_resource"` resource, you can specify it like this: ```python @serve.deployment(ray_actor_options={"resources": {"custom_resource": 2}}) def func(*args): return do_something_with_my_custom_resource() ``` You can also specify {ref}`accelerator types ` via the `accelerator_type` parameter in `ray_actor_options`. Below is the full list of supported options in `ray_actor_options`; please see the relevant Ray Core documentation for more details about each option: - `accelerator_type` - `memory` - `num_cpus` - `num_gpus` - `object_store_memory` - `resources` - `runtime_env` (serve-omp-num-threads)= ## Configuring parallelism with OMP_NUM_THREADS Deep learning models like PyTorch and Tensorflow often use multithreading when performing inference. The number of CPUs they use is controlled by the `OMP_NUM_THREADS` environment variable. Ray sets `OMP_NUM_THREADS=` by default. To [avoid contention](omp-num-thread-note), Ray sets `OMP_NUM_THREADS=1` if `num_cpus` is not specified on the tasks/actors, to reduce contention between actors/tasks which run in a single thread. 
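To see what this default looks like from inside a replica, here's a small illustrative sketch (the deployment name is made up for this example); it simply echoes the `OMP_NUM_THREADS` value that the replica process sees:

```python
import os

from ray import serve


@serve.deployment  # num_cpus isn't set, so Ray sets OMP_NUM_THREADS=1 for the replica
class ThreadCheck:
    def __call__(self, request) -> str:
        # Echo the value the replica process sees (expected to be "1" here).
        return os.environ.get("OMP_NUM_THREADS", "unset")


app = ThreadCheck.bind()
```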
If you *do* want to enable this parallelism in your Serve deployment, just set `num_cpus` (recommended) to the desired value, or manually set the `OMP_NUM_THREADS` environment variable when starting Ray or in your function/class definition. ```bash OMP_NUM_THREADS=12 ray start --head OMP_NUM_THREADS=12 ray start --address=$HEAD_NODE_ADDRESS ``` ```{literalinclude} doc_code/managing_deployments.py :start-after: __configure_parallism_start__ :end-before: __configure_parallism_end__ :language: python ``` :::{note} Some other libraries may not respect `OMP_NUM_THREADS` and have their own way to configure parallelism. For example, if you're using OpenCV, you'll need to manually set the number of threads using `cv2.setNumThreads(num_threads)` (set to 0 to disable multi-threading). You can check the configuration using `cv2.getNumThreads()` and `cv2.getNumberOfCPUs()`. ::: --- --- orphan: true --- # Serve an Inference with Stable Diffusion Model on AWS NeuronCores Using FastAPI This example uses a precompiled Stable Diffusion XL model and deploys on an AWS Inferentia2 (Inf2) instance using Ray Serve and FastAPI. :::{note} Before starting this example: * Set up [PyTorch Neuron](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/torch-neuronx.html#setup-torch-neuronx) * Install AWS NeuronCore drivers and tools, and torch-neuronx based on the instance-type ::: ```bash pip install "optimum-neuron==0.0.13" "diffusers==0.21.4" pip install "ray[serve]" requests transformers ``` This example uses the [Stable Diffusion-XL](https://huggingface.co/aws-neuron/stable-diffusion-xl-base-1-0-1024x1024) model and [FastAPI](https://fastapi.tiangolo.com/). This model is compiled with AWS Neuron and is ready to run inference. However, you can choose a different Stable Diffusion model and compile it to be compatible for running inference on AWS Inferentia2 instances. The model in this example is ready for deployment. Save the following code to a file named aws_neuron_core_inference_serve_stable_diffusion.py. Use `serve run aws_neuron_core_inference_serve_stable_diffusion:entrypoint` to start the Serve application. ```{literalinclude} ../doc_code/aws_neuron_core_inference_serve_stable_diffusion.py :language: python :start-after: __neuron_serve_code_start__ :end-before: __neuron_serve_code_end__ ``` You should see the following log messages when a deployment using RayServe is successful: ```text 2024-02-07 17:53:28,299 INFO worker.py:1715 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265 (ProxyActor pid=25282) INFO 2024-02-07 17:53:31,751 proxy 172.31.10.188 proxy.py:1128 - Proxy actor fd464602af1e456162edf6f901000000 starting on node 5a8e0c24b22976f1f7672cc54f13ace25af3664a51429d8e332c0679. (ProxyActor pid=25282) INFO 2024-02-07 17:53:31,755 proxy 172.31.10.188 proxy.py:1333 - Starting HTTP server on node: 5a8e0c24b22976f1f7672cc54f13ace25af3664a51429d8e332c0679 listening on port 8000 (ProxyActor pid=25282) INFO: Started server process [25282] (ServeController pid=25233) INFO 2024-02-07 17:53:31,921 controller 25233 deployment_state.py:1545 - Deploying new version of deployment StableDiffusionV2 in application 'default'. Setting initial target number of replicas to 1. (ServeController pid=25233) INFO 2024-02-07 17:53:31,922 controller 25233 deployment_state.py:1545 - Deploying new version of deployment APIIngress in application 'default'. Setting initial target number of replicas to 1. 
(ServeController pid=25233) INFO 2024-02-07 17:53:32,024 controller 25233 deployment_state.py:1829 - Adding 1 replica to deployment StableDiffusionV2 in application 'default'. (ServeController pid=25233) INFO 2024-02-07 17:53:32,029 controller 25233 deployment_state.py:1829 - Adding 1 replica to deployment APIIngress in application 'default'. Fetching 20 files: 100%|██████████| 20/20 [00:00<00:00, 195538.65it/s] (ServeController pid=25233) WARNING 2024-02-07 17:54:02,114 controller 25233 deployment_state.py:2171 - Deployment 'StableDiffusionV2' in application 'default' has 1 replicas that have taken more than 30s to initialize. This may be caused by a slow __init__ or reconfigure method. (ServeController pid=25233) WARNING 2024-02-07 17:54:32,170 controller 25233 deployment_state.py:2171 - Deployment 'StableDiffusionV2' in application 'default' has 1 replicas that have taken more than 30s to initialize. This may be caused by a slow __init__ or reconfigure method. (ServeController pid=25233) WARNING 2024-02-07 17:55:02,344 controller 25233 deployment_state.py:2171 - Deployment 'StableDiffusionV2' in application 'default' has 1 replicas that have taken more than 30s to initialize. This may be caused by a slow __init__ or reconfigure method. (ServeController pid=25233) WARNING 2024-02-07 17:55:32,418 controller 25233 deployment_state.py:2171 - Deployment 'StableDiffusionV2' in application 'default' has 1 replicas that have taken more than 30s to initialize. This may be caused by a slow __init__ or reconfigure method. 2024-02-07 17:55:46,263 SUCC scripts.py:483 -- Deployed Serve app successfully. ``` Use the following code to send requests: ```python import requests prompt = "a zebra is dancing in the grass, river, sunlit" input = "%20".join(prompt.split(" ")) resp = requests.get(f"http://127.0.0.1:8000/imagine?prompt={input}") print("Write the response to `output.png`.") with open("output.png", "wb") as f: f.write(resp.content) ``` You should see the following log messages when a request is sent to the endpoint: ```text (ServeReplica:default:StableDiffusionV2 pid=25320) Prompt: a zebra is dancing in the grass, river, sunlit 0%| | 0/50 [00:00. (ServeController pid=147087) INFO 2025-03-03 06:07:11,381 controller 147087 -- Adding 1 replica to Deployment(name='LlamaModel', app='default'). (ServeReplica:default:LlamaModel pid=147085) [WARNING|utils.py:212] 2025-03-03 06:07:15,251 >> optimum-habana v1.15.0 has been validated for SynapseAI v1.19.0 but habana-frameworks v1.20.0.543 was found, this could lead to undefined behavior! (ServeReplica:default:LlamaModel pid=147085) /usr/local/lib/python3.10/dist-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations (ServeReplica:default:LlamaModel pid=147085) warnings.warn( (ServeReplica:default:LlamaModel pid=147085) /usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py:796: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead. (ServeReplica:default:LlamaModel pid=147085) warnings.warn( (ServeReplica:default:LlamaModel pid=147085) /usr/local/lib/python3.10/dist-packages/transformers/models/auto/configuration_auto.py:991: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead. 
(ServeReplica:default:LlamaModel pid=147085) warnings.warn( (ServeReplica:default:LlamaModel pid=147085) /usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py:471: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead. (ServeReplica:default:LlamaModel pid=147085) warnings.warn( Loading checkpoint shards: 0%| | 0/2 [00:00. (DeepSpeedInferenceWorker pid=179962) [WARNING|utils.py:212] 2025-03-03 06:22:14,611 >> optimum-habana v1.15.0 has been validated for SynapseAI v1.19.0 but habana-frameworks v1.20.0.543 was found, this could lead to undefined behavior! (DeepSpeedInferenceWorker pid=179963) /usr/local/lib/python3.10/dist-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations (DeepSpeedInferenceWorker pid=179963) warnings.warn( (DeepSpeedInferenceWorker pid=179964) [WARNING|utils.py:212] 2025-03-03 06:22:14,613 >> optimum-habana v1.15.0 has been validated for SynapseAI v1.19.0 but habana-frameworks v1.20.0.543 was found, this could lead to undefined behavior! [repeated 3x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.) (DeepSpeedInferenceWorker pid=179962) [2025-03-03 06:22:23,502] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to hpu (auto detect) Loading 2 checkpoint shards: 0%| | 0/2 [00:00 io.ray ray-serve ${ray.version} provided ``` > NOTE: After installing Ray with Python, the local environment includes the Java jar of Ray Serve. The `provided` scope ensures that you can compile the Java code using Ray Serve without version conflicts when you deploy on the cluster. ## Example model This example use case is a production workflow for a financial application. The application needs to compute the best strategy to interact with different banks for a single task. ```{literalinclude} ../../../../java/serve/src/test/java/io/ray/serve/docdemo/Strategy.java :end-before: docs-strategy-end :language: java :start-after: docs-strategy-start ``` This example uses the `Strategy` class to calculate the indicators of a number of banks. * The `calc` method is the entry of the calculation. The input parameters are the time interval of calculation and the map of the banks and their indicators. The `calc` method contains a two-tier `for` loop, traversing each indicator list of each bank, and calling the `calcBankIndicators` method to calculate the indicators of the specified bank. - There is another layer of `for` loop in the `calcBankIndicators` method, which traverses each indicator, and then calls the `calcIndicator` method to calculate the specific indicator of the bank. - The `calcIndicator` method is a specific calculation logic based on the bank, the specified time interval and the indicator. This code uses the `Strategy` class: ```{literalinclude} ../../../../java/serve/src/test/java/io/ray/serve/docdemo/StrategyCalc.java :end-before: docs-strategy-calc-end :language: java :start-after: docs-strategy-calc-start ``` When the scale of banks and indicators expands, the three-tier `for` loop slows down the calculation. 
Even if you use a thread pool to calculate each indicator in parallel, you may still hit a single-machine performance bottleneck. Moreover, you can't run this `Strategy` object as a resident service.

## Converting to a Ray Serve Deployment

With Ray Serve, you can deploy the core computing logic of `Strategy` as a scalable, distributed computing service.

First, extract the indicator calculation of each institution into a separate `StrategyOnRayServe` class:

```{literalinclude} ../../../../java/serve/src/test/java/io/ray/serve/docdemo/StrategyOnRayServe.java
:end-before: docs-strategy-end
:language: java
:start-after: docs-strategy-start
```

Next, start the Ray Serve runtime and deploy `StrategyOnRayServe` as a deployment.

```{literalinclude} ../../../../java/serve/src/test/java/io/ray/serve/docdemo/StrategyCalcOnRayServe.java
:end-before: docs-deploy-end
:language: java
:start-after: docs-deploy-start
```

`Deployment.create` builds a `Deployment` object named `strategy`. After you execute `Deployment.deploy`, the Ray Serve instance deploys the `strategy` deployment with four replicas, and you can access it for distributed parallel computing.

## Testing the Ray Serve Deployment

You can test the `strategy` deployment using a `RayServeHandle` inside Ray:

```{literalinclude} ../../../../java/serve/src/test/java/io/ray/serve/docdemo/StrategyCalcOnRayServe.java
:end-before: docs-calc-end
:language: java
:start-after: docs-calc-start
```

This code iterates over the banks and indicators serially, sending each calculation to Ray one at a time. You can instead make the calls concurrent, which not only improves efficiency but also removes the single-machine bottleneck:

```{literalinclude} ../../../../java/serve/src/test/java/io/ray/serve/docdemo/StrategyCalcOnRayServe.java
:end-before: docs-parallel-calc-end
:language: java
:start-after: docs-parallel-calc-start
```

You can use `StrategyCalcOnRayServe` like the example in the `main` method:

```{literalinclude} ../../../../java/serve/src/test/java/io/ray/serve/docdemo/StrategyCalcOnRayServe.java
:end-before: docs-main-end
:language: java
:start-after: docs-main-start
```

## Calling Ray Serve Deployment with HTTP

Another way to test or call a deployment is through HTTP requests. However, two limitations exist for Java deployments:

- Only the `call` method of the user class can process HTTP requests.
- The `call` method can only have one input parameter, and both the input parameter and the return value must be of type `String`.
If you want to call the `strategy` deployment with HTTP, then you can rewrite the class like this code: ```{literalinclude} ../../../../java/serve/src/test/java/io/ray/serve/docdemo/HttpStrategyOnRayServe.java :end-before: docs-strategy-end :language: java :start-after: docs-strategy-start ``` After deploying this deployment, you can access it with the `curl` command: ```shell curl -d '{"time":1641038674, "bank":"test_bank", "indicator":"test_indicator"}' http://127.0.0.1:8000/strategy ``` You can also access it using HTTP Client in Java code: ```{literalinclude} ../../../../java/serve/src/test/java/io/ray/serve/docdemo/HttpStrategyCalcOnRayServe.java :end-before: docs-http-end :language: java :start-after: docs-http-start ``` The example of strategy calculation using HTTP to access deployment is as follows: ```{literalinclude} ../../../../java/serve/src/test/java/io/ray/serve/docdemo/HttpStrategyCalcOnRayServe.java :end-before: docs-calc-end :language: java :start-after: docs-calc-start ``` You can also rewrite this code to support concurrency: ```{literalinclude} ../../../../java/serve/src/test/java/io/ray/serve/docdemo/HttpStrategyCalcOnRayServe.java :end-before: docs-parallel-calc-end :language: java :start-after: docs-parallel-calc-start ``` Finally, the complete usage of `HttpStrategyCalcOnRayServe` is like this code: ```{literalinclude} ../../../../java/serve/src/test/java/io/ray/serve/docdemo/HttpStrategyCalcOnRayServe.java :end-before: docs-main-end :language: java :start-after: docs-main-start ``` --- --- orphan: true --- (serve-object-detection-tutorial)= # Building a Real-time Object Detection Service with Ray Serve ## Overview This tutorial demonstrates how to deploy a production-ready object detection service using Ray Serve. You will learn how to serve a YOLOv5 object detection model efficiently with automatic GPU resource management and scaling capabilities. ## Installation Install the required dependencies: ```bash pip install "ray[serve]" requests torch pillow numpy opencv-python-headless pandas "gitpython>=3.1.30" ``` ## Implementation This example uses the [ultralytics/yolov5](https://github.com/ultralytics/yolov5) model for object detection and [FastAPI](https://fastapi.tiangolo.com/) for creating the web API. ### Code Structure Save the following code to a file named `object_detection.py`: ```{literalinclude} ../doc_code/object_detection.py :language: python :start-after: __example_code_start__ :end-before: __example_code_end__ ``` The code consists of two main deployments: 1. **APIIngress**: A FastAPI-based frontend that handles HTTP requests 2. 
**ObjectDetection**: The backend deployment that loads the YOLOv5 model and performs inference on GPU :::{note} **Understanding Autoscaling** The configuration in this example sets `min_replicas` to 0, which means: - The deployment starts with no `ObjectDetection` replicas - Ray Serve creates replicas only when requests arrive - After a period of inactivity, Ray Serve scales down the replicas back to 0 - This "scale-to-zero" capability helps conserve GPU resources when the service isn't being actively used ::: ## Deployment Deploy the service with: ```bash serve run object_detection:entrypoint ``` When successfully deployed, you should see log messages similar to: ```text (ServeReplica:ObjectDection pid=4747) warnings.warn( (ServeReplica:ObjectDection pid=4747) Downloading: "https://github.com/ultralytics/yolov5/zipball/master" to /home/ray/.cache/torch/hub/master.zip (ServeReplica:ObjectDection pid=4747) YOLOv5 🚀 2023-3-8 Python-3.9.16 torch-1.13.0+cu116 CUDA:0 (Tesla T4, 15110MiB) (ServeReplica:ObjectDection pid=4747) (ServeReplica:ObjectDection pid=4747) Fusing layers... (ServeReplica:ObjectDection pid=4747) YOLOv5s summary: 213 layers, 7225885 parameters, 0 gradients (ServeReplica:ObjectDection pid=4747) Adding AutoShape... 2023-03-08 21:10:21,685 SUCC :93 -- Deployed Serve app successfully. ``` ## Troubleshooting :::{tip} **Common OpenCV Error** You might encounter this error when running the example: ``` ImportError: libGL.so.1: cannot open shared object file: No such file or directory ``` This typically happens when running `opencv-python` in headless environments like containers. The solution is to use the headless version: ```bash pip uninstall opencv-python; pip install opencv-python-headless ``` ::: ## Testing the Service Once the service is running, you can test it with the following Python code: ```python import requests # Sample image URL for testing image_url = "https://ultralytics.com/images/zidane.jpg" # Send request to the object detection service resp = requests.get(f"http://127.0.0.1:8000/detect?image_url={image_url}") # Save the annotated image with detected objects with open("output.jpeg", 'wb') as f: f.write(resp.content) ``` ## Example Output The service processes the image and returns it with bounding boxes around detected objects: ![Example of object detection output](https://raw.githubusercontent.com/ray-project/images/master/docs/serve/object_detection_output.jpeg) --- --- orphan: true --- (serve-deepseek-tutorial)= # Serve DeepSeek This example shows how to deploy DeepSeek R1 or V3 with Ray Serve LLM. ## Installation To run this example, install the following: ```bash pip install "ray[llm]==2.46.0" ``` Note: Deploying DeepSeek-R1 requires at least 720GB of free disk space per worker node to store model weights. ## Deployment ### Quick Deployment For quick deployment and testing, save the following code to a file named `deepseek.py`, and run `python3 deepseek.py`. ```python from ray import serve from ray.serve.llm import LLMConfig, build_openai_app llm_config = LLMConfig( model_loading_config={ "model_id": "deepseek", "model_source": "deepseek-ai/DeepSeek-R1", }, deployment_config={ "autoscaling_config": { "min_replicas": 1, "max_replicas": 1, } }, # Change to the accelerator type of the node accelerator_type="H100", runtime_env={"env_vars": {"VLLM_USE_V1": "1"}}, # Customize engine arguments as needed (e.g. 
vLLM engine kwargs) engine_kwargs={ "tensor_parallel_size": 8, "pipeline_parallel_size": 2, "gpu_memory_utilization": 0.92, "dtype": "auto", "max_num_seqs": 40, "max_model_len": 16384, "enable_chunked_prefill": True, "enable_prefix_caching": True, }, ) # Deploy the application llm_app = build_openai_app({"llm_configs": [llm_config]}) serve.run(llm_app) ``` ### Production Deployment For production deployments, save the following to a YAML file named `deepseek.yaml` and run `serve run deepseek.yaml`. ```yaml applications: - args: llm_configs: - model_loading_config: model_id: "deepseek" model_source: "deepseek-ai/DeepSeek-R1" accelerator_type: "H100" deployment_config: autoscaling_config: min_replicas: 1 max_replicas: 1 runtime_env: env_vars: VLLM_USE_V1: "1" engine_kwargs: tensor_parallel_size: 8 pipeline_parallel_size: 2 gpu_memory_utilization: 0.92 dtype: "auto" max_num_seqs: 40 max_model_len: 16384 enable_chunked_prefill: true enable_prefix_caching: true import_path: ray.serve.llm:build_openai_app name: llm_app route_prefix: "/" ``` ## Configuration You may need to adjust configurations in the above code based on your setup, specifically: * `accelerator_type`: for NVIDIA GPUs, DeepSeek requires Hopper GPUs or later ones. Therefore, you can specify `H200`, `H100`, `H20` etc. based on your hardware. * `tensor_parallel_size` and `pipeline_parallel_size`: DeepSeek requires a single node of 8xH200, or two nodes of 8xH100. The typical setup of using H100 is setting `tensor_parallel_size` to `8` and `pipeline_parallel_size` to `2` as in the code example. When using H200, you can set `tensor_parallel_size` to `8` and leave out the `pipeline_parallel_size` parameter (it is `1` by default). * `model_source`: although you could specify a HuggingFace model ID like `deepseek-ai/DeepSeek-R1` in the code example, it is recommended to pre-download the model because it is huge. You can download it to the local file system (e.g., `/path/to/downloaded/model`) or to a remote object store (e.g., `s3://my-bucket/path/to/downloaded/model`), and specify it as `model_source`. It is recommended to download it to a remote object store, using {ref}`Ray model caching utilities `. Note that if you have two nodes and would like to download to local file system, you need to download the model to the same path on both nodes. ## Testing the Service You can query the deployed model using the following request and get the corresponding response. ::::{tab-set} :::{tab-item} Request ```bash curl -X POST http://localhost:8000/v1/chat/completions \ -H "Content-Type: application/json" \ -H "Authorization: Bearer fake-key" \ -d '{ "model": "deepseek", "messages": [{"role": "user", "content": "Hello!"}] }' ``` ::: :::{tab-item} Response ```bash {"id":"deepseek-68b5d5c5-fd34-42fc-be26-0a36f8457ffe","object":"chat.completion","created":1743646776,"model":"deepseek","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"Hello! How can I assist you today? 
😊","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":6,"total_tokens":18,"completion_tokens":12,"prompt_tokens_details":null},"prompt_logprobs":null} ``` ::: :::: Another example request and response: ::::{tab-set} :::{tab-item} Request ```bash curl -X POST http://localhost:8000/v1/chat/completions \ -H "Content-Type: application/json" \ -H "Authorization: Bearer fake-key" \ -d '{ "model": "deepseek", "messages": [{"role": "user", "content": "The future of AI is"}] }' ``` ::: :::{tab-item} Response ```bash {"id":"deepseek-b81ff9be-3ffc-4811-80ff-225006eff27c","object":"chat.completion","created":1743646860,"model":"deepseek","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"The future of AI is multifaceted and holds immense potential across various domains. Here are some key aspects that are likely to shape its trajectory:\n\n1. **Advanced Automation**: AI will continue to automate routine and complex tasks across industries, increasing efficiency and productivity. This includes everything from manufacturing and logistics to healthcare and finance.\n\n2. **Enhanced Decision-Making**: AI systems will provide deeper insights and predictive analytics, aiding in better decision-making processes for businesses, governments, and individuals.\n\n3. **Personalization**: AI will drive more personalized experiences in areas such as shopping, education, and entertainment, tailoring services and products to individual preferences and behaviors.\n\n4. **Healthcare Revolution**: AI will play a significant role in diagnosing diseases, personalizing treatment plans, and even predicting health issues before they become critical, potentially transforming the healthcare industry.\n\n5. **Ethical and Responsible AI**: As AI becomes more integrated into society, there will be a growing focus on developing ethical guidelines and frameworks to ensure AI is used responsibly and transparently, addressing issues like bias, privacy, and security.\n\n6. **Human-AI Collaboration**: The future will see more seamless collaboration between humans and AI, with AI augmenting human capabilities rather than replacing them. This includes areas like creative industries, where AI can assist in generating ideas and content.\n\n7. **AI in Education**: AI will personalize learning experiences, adapt to individual learning styles, and provide real-time feedback, making education more accessible and effective.\n\n8. **Robotics and Autonomous Systems**: Advances in AI will lead to more sophisticated robots and autonomous systems, impacting industries like transportation (e.g., self-driving cars), agriculture, and home automation.\n\n9. **AI and Sustainability**: AI will play a crucial role in addressing environmental challenges by optimizing resource use, improving energy efficiency, and aiding in climate modeling and conservation efforts.\n\n10. **Regulation and Governance**: As AI technologies advance, there will be increased efforts to establish international standards and regulations to govern their development and use, ensuring they benefit society as a whole.\n\n11. **Quantum Computing and AI**: The integration of quantum computing with AI could revolutionize data processing capabilities, enabling the solving of complex problems that are currently intractable.\n\n12. 
**AI in Creative Fields**: AI will continue to make strides in creative domains such as music, art, and literature, collaborating with human creators to push the boundaries of innovation and expression.\n\nOverall, the future of AI is both promising and challenging, requiring careful consideration of its societal impact and the ethical implications of its widespread adoption.","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":9,"total_tokens":518,"completion_tokens":509,"prompt_tokens_details":null},"prompt_logprobs":null} ``` ::: :::: ## Deploying with KubeRay Create a KubeRay cluster using the {ref}`Ray Serve LLM KubeRay guide ` with sufficient GPU resources for DeepSeek R1. For example, two 8xH100 nodes. Deploy DeepSeek-R1 as a RayService with the following configuration file: ```bash kubectl apply -f https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/ray-service.deepseek.yaml ``` ## Troubleshooting ### Multi-Node GPU Issues Since DeepSeek typically requires multi-node GPU deployment, you may encounter issues specific to multi-node GPU serving. Common problems include: * **NCCL initialization failures**: Especially on H100 instances due to outdated `aws-ofi-plugin` versions * **Pipeline parallelism hangs**: When `pipeline_parallel_size > 1`, the model serving may hang due to resource conflicts For comprehensive troubleshooting of multi-node GPU serving issues, refer to {ref}`Troubleshooting multi-node GPU serving on KubeRay `. --- --- orphan: true --- (serve-ml-models-tutorial)= # Serve ML Models (Tensorflow, PyTorch, Scikit-Learn, others) This guide shows how to train models from various machine learning frameworks and deploy them to Ray Serve. See the [Key Concepts](serve-key-concepts) to learn more general information about Ray Serve. :::::{tab-set} ::::{tab-item} Keras and TensorFlow This example trains and deploys a simple TensorFlow neural net. In particular, it shows: - How to train a TensorFlow model and load the model from your file system in your Ray Serve deployment. - How to parse the JSON request and make a prediction. Ray Serve is framework-agnostic--you can use any version of TensorFlow. This tutorial uses TensorFlow 2 and Keras. You also need `requests` to send HTTP requests to your model deployment. If you haven't already, install TensorFlow 2 and requests by running: ```console $ pip install "tensorflow>=2.0" requests "ray[serve]" ``` Open a new Python file called `tutorial_tensorflow.py`. First, import Ray Serve and some other helpers. ```{literalinclude} ../doc_code/tutorial_tensorflow.py :start-after: __doc_import_begin__ :end-before: __doc_import_end__ ``` Next, train a simple MNIST model using Keras. ```{literalinclude} ../doc_code/tutorial_tensorflow.py :start-after: __doc_train_model_begin__ :end-before: __doc_train_model_end__ ``` Next, define a `TFMnistModel` class that accepts HTTP requests and runs the MNIST model that you trained. The `@serve.deployment` decorator makes it a deployment object that you can deploy onto Ray Serve. Note that Ray Serve exposes the deployment over an HTTP route. By default, when the deployment receives a request over HTTP, Ray Serve invokes the `__call__` method. 
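The tutorial code in the next block follows this general pattern. The sketch below is illustrative only, with placeholder names rather than the tutorial's actual class, to show where the `@serve.deployment` decorator and the `__call__` method fit:

```python
from starlette.requests import Request

from ray import serve


@serve.deployment
class SketchedModel:
    def __init__(self, model_path: str):
        # Load the model once per replica. `model_path` is a placeholder.
        self.model_path = model_path

    async def __call__(self, request: Request) -> dict:
        # Ray Serve routes HTTP requests to `__call__` by default.
        payload = await request.json()
        # Run inference on the parsed payload and return a JSON-serializable result.
        return {"received": payload}
```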
```{literalinclude} ../doc_code/tutorial_tensorflow.py :start-after: __doc_define_servable_begin__ :end-before: __doc_define_servable_end__ ``` :::{note} When you deploy and instantiate the `TFMnistModel` class, Ray Serve loads the TensorFlow model from your file system so that it can be ready to run inference on the model and serve requests later. ::: Now that you've defined the Serve deployment, prepare it so that you can deploy it. ```{literalinclude} ../doc_code/tutorial_tensorflow.py :start-after: __doc_deploy_begin__ :end-before: __doc_deploy_end__ ``` :::{note} `TFMnistModel.bind(TRAINED_MODEL_PATH)` binds the argument `TRAINED_MODEL_PATH` to the deployment and returns a `DeploymentNode` object, a wrapping of the `TFMnistModel` deployment object, that you can then use to connect with other `DeploymentNodes` to form a more complex [deployment graph](serve-model-composition). ::: Finally, deploy the model to Ray Serve through the terminal. ```console $ serve run tutorial_tensorflow:mnist_model ``` Next, query the model. While Serve is running, open a separate terminal window, and run the following in an interactive Python shell or a separate Python script: ```python import requests import numpy as np resp = requests.get( "http://localhost:8000/", json={"array": np.random.randn(28 * 28).tolist()} ) print(resp.json()) ``` You should get an output like the following, although the exact prediction may vary: ```bash { "prediction": [[-1.504277229309082, ..., -6.793371200561523]], "file": "/tmp/mnist_model.h5" } ``` :::: ::::{tab-item} PyTorch This example loads and deploys a PyTorch ResNet model. In particular, it shows: - How to load the model from PyTorch's pre-trained Model Zoo. - How to parse the JSON request, transform the payload and make a prediction. This tutorial requires PyTorch and Torchvision. Ray Serve is framework agnostic and works with any version of PyTorch. You also need `requests` to send HTTP requests to your model deployment. If you haven't already, install them by running: ```console $ pip install torch torchvision requests "ray[serve]" ``` Open a new Python file called `tutorial_pytorch.py`. First, import Ray Serve and some other helpers. ```{literalinclude} ../doc_code/tutorial_pytorch.py :start-after: __doc_import_begin__ :end-before: __doc_import_end__ ``` Define a class `ImageModel` that parses the input data, transforms the images, and runs the ResNet18 model loaded from `torchvision`. The `@serve.deployment` decorator makes it a deployment object that you can deploy onto Ray Serve. Note that Ray Serve exposes the deployment over an HTTP route. By default, when the deployment receives a request over HTTP, Ray Serve invokes the `__call__` method. ```{literalinclude} ../doc_code/tutorial_pytorch.py :start-after: __doc_define_servable_begin__ :end-before: __doc_define_servable_end__ ``` :::{note} When you deploy and instantiate an `ImageModel` class, Ray Serve loads the ResNet18 model from `torchvision` so that it can be ready to run inference on the model and serve requests later. ::: Now that you've defined the Serve deployment, prepare it so that you can deploy it. ```{literalinclude} ../doc_code/tutorial_pytorch.py :start-after: __doc_deploy_begin__ :end-before: __doc_deploy_end__ ``` :::{note} `ImageModel.bind()` returns a `DeploymentNode` object, a wrapping of the `ImageModel` deployment object, that you can then use to connect with other `DeploymentNodes` to form a more complex [deployment graph](serve-model-composition). 
:::

Finally, deploy the model to Ray Serve through the terminal.

```console
$ serve run tutorial_pytorch:image_model
```

Next, query the model. While Serve is running, open a separate terminal window, and run the following in an interactive Python shell or a separate Python script:

```python
import requests

ray_logo_bytes = requests.get(
    "https://raw.githubusercontent.com/ray-project/"
    "ray/master/doc/source/images/ray_header_logo.png"
).content

resp = requests.post("http://localhost:8000/", data=ray_logo_bytes)
print(resp.json())
```

You should get an output like the following, although the exact number may vary:

```bash
{'class_index': 919}
```

::::

::::{tab-item} Scikit-learn

This example trains and deploys a simple scikit-learn classifier. In particular, it shows:

- How to load the scikit-learn model from the file system in your Ray Serve definition.
- How to parse the JSON request and make a prediction.

Ray Serve is framework-agnostic. You can use any version of scikit-learn. You also need `requests` to send HTTP requests to your model deployment. If you haven't already, install scikit-learn and requests by running:

```console
$ pip install scikit-learn requests "ray[serve]"
```

Open a new Python file called `tutorial_sklearn.py`. Import Ray Serve and some other helpers.

```{literalinclude} ../doc_code/tutorial_sklearn.py
:start-after: __doc_import_begin__
:end-before: __doc_import_end__
```

**Train a Classifier**

Next, train a classifier with the [Iris dataset](https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html). First, instantiate a `GradientBoostingClassifier` loaded from scikit-learn.

```{literalinclude} ../doc_code/tutorial_sklearn.py
:start-after: __doc_instantiate_model_begin__
:end-before: __doc_instantiate_model_end__
```

Next, load the Iris dataset and split the data into training and validation sets.

```{literalinclude} ../doc_code/tutorial_sklearn.py
:start-after: __doc_data_begin__
:end-before: __doc_data_end__
```

Then, train the model and save it to a file.

```{literalinclude} ../doc_code/tutorial_sklearn.py
:start-after: __doc_train_model_begin__
:end-before: __doc_train_model_end__
```

**Deploy with Ray Serve**

Finally, you're ready to deploy the classifier using Ray Serve. Define a `BoostingModel` class that runs inference on the `GradientBoostingClassifier` model you trained and returns the resulting label. It's decorated with `@serve.deployment` to make it a deployment object so you can deploy it onto Ray Serve. Note that Ray Serve exposes the deployment over an HTTP route. By default, when the deployment receives a request over HTTP, Ray Serve invokes the `__call__` method.

```{literalinclude} ../doc_code/tutorial_sklearn.py
:start-after: __doc_define_servable_begin__
:end-before: __doc_define_servable_end__
```

:::{note}
When you deploy and instantiate a `BoostingModel` class, Ray Serve loads the classifier model that you trained from the file system so that it can be ready to run inference on the model and serve requests later.
:::

After you've defined the Serve deployment, prepare it so that you can deploy it.
```{literalinclude} ../doc_code/tutorial_sklearn.py
:start-after: __doc_deploy_begin__
:end-before: __doc_deploy_end__
```

:::{note}
`BoostingModel.bind(MODEL_PATH, LABEL_PATH)` binds the arguments `MODEL_PATH` and `LABEL_PATH` to the deployment and returns a `DeploymentNode` object, a wrapping of the `BoostingModel` deployment object, that you can then use to connect with other `DeploymentNodes` to form a more complex [deployment graph](serve-model-composition).
:::

Finally, deploy the model to Ray Serve through the terminal.

```console
$ serve run tutorial_sklearn:boosting_model
```

Next, query the model. While Serve is running, open a separate terminal window, and run the following in an interactive Python shell or a separate Python script:

```python
import requests

sample_request_input = {
    "sepal length": 1.2,
    "sepal width": 1.0,
    "petal length": 1.1,
    "petal width": 0.9,
}
response = requests.get("http://localhost:8000/", json=sample_request_input)
print(response.text)
```

You should get an output like the following, although the exact prediction may vary:

```python
{"result": "versicolor"}
```

::::
:::::

---

---
orphan: true
---

(serve-stable-diffusion-tutorial)=

# Serve a Stable Diffusion Model

This example runs a Stable Diffusion application with Ray Serve.

To run this example, install the following:

```bash
pip install "ray[serve]" requests torch diffusers==0.35.2 transformers
```

This example uses the [stabilityai/stable-diffusion-xl-base-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0) model and [FastAPI](https://fastapi.tiangolo.com/). Save the following Serve code to a file named `stable_diffusion.py`:

```{literalinclude} ../doc_code/stable_diffusion.py
:language: python
:start-after: __example_code_start__
:end-before: __example_code_end__
```

Use `serve run stable_diffusion:entrypoint` to start the Serve application.

:::{note}
The autoscaling config sets `min_replicas` to 0, which means the deployment starts with no replicas of the Stable Diffusion deployment. Serve creates replicas only when requests arrive. When no requests arrive for a certain period of time, Serve scales the deployment back down to 0 replicas to save GPU resources.
:::

You should see these messages in the output:

```text
(ServeController pid=362, ip=10.0.44.233) INFO 2023-03-08 16:44:57,579 controller 362 http_state.py:129 - Starting HTTP proxy with name 'SERVE_CONTROLLER_ACTOR:SERVE_PROXY_ACTOR-7396d5a9efdb59ee01b7befba448433f6c6fc734cfa5421d415da1b3' on node '7396d5a9efdb59ee01b7befba448433f6c6fc734cfa5421d415da1b3' listening on '127.0.0.1:8000'
(ServeController pid=362, ip=10.0.44.233) INFO 2023-03-08 16:44:57,588 controller 362 http_state.py:129 - Starting HTTP proxy with name 'SERVE_CONTROLLER_ACTOR:SERVE_PROXY_ACTOR-a30ea53938547e0bf88ce8672e578f0067be26a7e26d23465c46300b' on node 'a30ea53938547e0bf88ce8672e578f0067be26a7e26d23465c46300b' listening on '127.0.0.1:8000'
(ProxyActor pid=439, ip=10.0.44.233) INFO: Started server process [439]
(ProxyActor pid=5779) INFO: Started server process [5779]
(ServeController pid=362, ip=10.0.44.233) INFO 2023-03-08 16:44:59,362 controller 362 deployment_state.py:1333 - Adding 1 replica to deployment 'APIIngress'.
2023-03-08 16:45:01,316 SUCC :93 -- Deployed Serve app successfully.
```

Use the following code to send requests:

```python
import requests

prompt = "a cute cat is dancing on the grass."
input = "%20".join(prompt.split(" "))
resp = requests.get(f"http://127.0.0.1:8000/imagine?prompt={input}")
with open("output.png", 'wb') as f:
    f.write(resp.content)
```

The app saves the `output.png` file locally. The following is an example of an output image.

![image](https://raw.githubusercontent.com/ray-project/images/master/docs/serve/stable_diffusion_output.png)

---

---
orphan: true
---

(serve-streaming-tutorial)=

# Serve a Chatbot with Request and Response Streaming

This example deploys a chatbot that streams output back to the user. It shows:

* How to stream outputs from a Serve application
* How to use WebSockets in a Serve application
* How to combine batching requests with streaming outputs

This tutorial should help you with the following use cases:

* You want to serve a large language model and stream results back token-by-token.
* You want to serve a chatbot that accepts a stream of inputs from the user.

This tutorial serves the [DialoGPT](https://huggingface.co/microsoft/DialoGPT-small) language model. Install the Hugging Face library to access it:

```
pip install "ray[serve]" transformers torch
```

## Create a streaming deployment

Open a new Python file called `textbot.py`. First, add the imports and the [Serve logger](serve-logging).
```{literalinclude} ../doc_code/streaming_tutorial.py :language: python :start-after: __textbot_setup_start__ :end-before: __textbot_setup_end__ ``` Create a [FastAPI deployment](serve-fastapi-http), and initialize the model and the tokenizer in the constructor: ```{literalinclude} ../doc_code/streaming_tutorial.py :language: python :start-after: __textbot_constructor_start__ :end-before: __textbot_constructor_end__ ``` Note that the constructor also caches an `asyncio` loop. This behavior is useful when you need to run a model and concurrently stream its tokens back to the user. Add the following logic to handle requests sent to the `Textbot`: ```{literalinclude} ../doc_code/streaming_tutorial.py :language: python :start-after: __textbot_logic_start__ :end-before: __textbot_logic_end__ ``` `Textbot` uses three methods to handle requests: * `handle_request`: the entrypoint for HTTP requests. FastAPI automatically unpacks the `prompt` query parameter and passes it into `handle_request`. This method then creates a `TextIteratorStreamer`. Hugging Face provides this streamer as a convenient interface to access tokens generated by a language model. `handle_request` then kicks off the model in a background thread using `self.loop.run_in_executor`. This behavior lets the model generate tokens while `handle_request` concurrently calls `self.consume_streamer` to stream the tokens back to the user. `self.consume_streamer` is a generator that yields tokens one by one from the streamer. Lastly, `handle_request` passes the `self.consume_streamer` generator into a Starlette `StreamingResponse` and returns the response. Serve unpacks the Starlette `StreamingResponse` and yields the contents of the generator back to the user one by one. * `generate_text`: the method that runs the model. This method runs in a background thread kicked off by `handle_request`. It pushes generated tokens into the streamer constructed by `handle_request`. * `consume_streamer`: a generator method that consumes the streamer constructed by `handle_request`. This method keeps yielding tokens from the streamer until the model in `generate_text` closes the streamer. This method avoids blocking the event loop by calling `asyncio.sleep` with a brief timeout whenever the streamer is empty and waiting for a new token. Bind the `Textbot` to a language model. For this tutorial, use the `"microsoft/DialoGPT-small"` model: ```{literalinclude} ../doc_code/streaming_tutorial.py :language: python :start-after: __textbot_bind_start__ :end-before: __textbot_bind_end__ ``` Run the model with `serve run textbot:app`, and query it from another terminal window with this script: ```{literalinclude} ../doc_code/streaming_tutorial.py :language: python :start-after: __stream_client_start__ :end-before: __stream_client_end__ ``` You should see the output printed token by token. ## Stream inputs and outputs using WebSockets WebSockets let you stream input into the application and stream output back to the client. Use WebSockets to create a chatbot that stores a conversation with a user. Create a Python file called `chatbot.py`. 
First add the imports: ```{literalinclude} ../doc_code/streaming_tutorial.py :language: python :start-after: __chatbot_setup_start__ :end-before: __chatbot_setup_end__ ``` Create a FastAPI deployment, and initialize the model and the tokenizer in the constructor: ```{literalinclude} ../doc_code/streaming_tutorial.py :language: python :start-after: __chatbot_constructor_start__ :end-before: __chatbot_constructor_end__ ``` Add the following logic to handle requests sent to the `Chatbot`: ```{literalinclude} ../doc_code/streaming_tutorial.py :language: python :start-after: __chatbot_logic_start__ :end-before: __chatbot_logic_end__ ``` The `generate_text` and `consume_streamer` methods are the same as they were for the `Textbot`. The `handle_request` method has been updated to handle WebSocket requests. The `handle_request` method is decorated with a `fastapi_app.websocket` decorator, which lets it accept WebSocket requests. First it `awaits` to accept the client's WebSocket request. Then, until the client disconnects, it does the following: * gets the prompt from the client with `ws.receive_text` * starts a new `TextIteratorStreamer` to access generated tokens * runs the model in a background thread on the conversation so far * streams the model's output back using `ws.send_text` * stores the prompt and the response in the `conversation` string Each time `handle_request` gets a new prompt from a client, it runs the whole conversation–with the new prompt appended–through the model. When the model finishes generating tokens, `handle_request` sends the `"<>"` string to inform the client that the model has generated all tokens. `handle_request` continues to run until the client explicitly disconnects. This disconnect raises a `WebSocketDisconnect` exception, which ends the call. Read more about WebSockets in the [FastAPI documentation](https://fastapi.tiangolo.com/advanced/websockets/). Bind the `Chatbot` to a language model. For this tutorial, use the `"microsoft/DialoGPT-small"` model: ```{literalinclude} ../doc_code/streaming_tutorial.py :language: python :start-after: __chatbot_bind_start__ :end-before: __chatbot_bind_end__ ``` Run the model with `serve run chatbot:app`. Query it using the `websockets` package, using `pip install websockets`: ```{literalinclude} ../doc_code/streaming_tutorial.py :language: python :start-after: __ws_client_start__ :end-before: __ws_client_end__ ``` You should see the outputs printed token by token. ## Batch requests and stream the output for each Improve model utilization and request latency by batching requests together when running the model. Create a Python file called `batchbot.py`. First add the imports: ```{literalinclude} ../doc_code/streaming_tutorial.py :language: python :start-after: __batchbot_setup_start__ :end-before: __batchbot_setup_end__ ``` :::{warning} Hugging Face's support for `Streamers` is still under development and may change in the future. `RawQueue` is compatible with the `Streamers` interface in Hugging Face 4.30.2. However, the `Streamers` interface may change, making the `RawQueue` incompatible with Hugging Face models in the future. ::: Similar to `Textbot` and `Chatbot`, the `Batchbot` needs a streamer to stream outputs from batched requests, but Hugging Face `Streamers` don't support batched requests. 
Add this custom `RawStreamer` to process batches of tokens: ```{literalinclude} ../doc_code/streaming_tutorial.py :language: python :start-after: __raw_streamer_start__ :end-before: __raw_streamer_end__ ``` Create a FastAPI deployment, and initialize the model and the tokenizer in the constructor: ```{literalinclude} ../doc_code/streaming_tutorial.py :language: python :start-after: __batchbot_constructor_start__ :end-before: __batchbot_constructor_end__ ``` Unlike `Textbot` and `Chatbot`, the `Batchbot` constructor also sets a `pad_token`. You need to set this token to batch prompts with different lengths. Add the following logic to handle requests sent to the `Batchbot`: ```{literalinclude} ../doc_code/streaming_tutorial.py :language: python :start-after: __batchbot_logic_start__ :end-before: __batchbot_logic_end__ ``` `Batchbot` uses four methods to handle requests: * `handle_request`: the entrypoint method. This method simply takes in the request's prompt and calls the `run_model` method on it. `run_model` is a generator method that also handles batching the requests. `handle_request` passes `run_model` into a Starlette `StreamingResponse` and returns the response, so the bot can stream generated tokens back to the client. * `run_model`: a generator method that performs batching. Since `run_model` is decorated with `@serve.batch`, it automatically takes in a batch of prompts. See the [batching guide](serve-batch-tutorial) for more info. `run_model` creates a `RawStreamer` to access the generated tokens. It calls `generate_text` in a background thread, and passes in the `prompts` and the `streamer`, similar to the `Textbot`. Then it iterates through the `consume_streamer` generator, repeatedly yielding a batch of tokens generated by the model. * `generate_text`: the method that runs the model. It's mostly the same as `generate_text` in `Textbot`, with two differences. First, it takes in and processes a batch of prompts instead of a single prompt. Second, it sets `padding=True`, so prompts with different lengths can be batched together. * `consume_streamer`: a generator method that consumes the streamer constructed by `handle_request`. It's mostly the same as `consume_streamer` in `Textbot`, with one difference. It uses the `tokenizer` to decode the generated tokens. Usually, the Hugging Face streamer handles the decoding. Because this implementation uses the custom `RawStreamer`, `consume_streamer` must handle the decoding. :::{tip} Some inputs within a batch may generate fewer outputs than others. When a particular input has nothing left to yield, pass a `StopIteration` object into the output iterable to terminate that input's request. See [Streaming batched requests](serve-streaming-batched-requests-guide) for more details. ::: Bind the `Batchbot` to a language model. For this tutorial, use the `"microsoft/DialoGPT-small"` model: ```{literalinclude} ../doc_code/streaming_tutorial.py :language: python :start-after: __batchbot_bind_start__ :end-before: __batchbot_bind_end__ ``` Run the model with `serve run batchbot:app`. Query it from two other terminal windows with this script: ```{literalinclude} ../doc_code/streaming_tutorial.py :language: python :start-after: __stream_client_start__ :end-before: __stream_client_end__ ``` You should see the output printed token by token in both windows. 
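As an alternative to the provided client script, the following is a minimal hand-rolled streaming client. It assumes the app is reachable at `http://localhost:8000/` and takes a `prompt` query parameter, which matches how the tutorial's client script queries the `Textbot` and `Batchbot` apps; adjust the URL if your route differs:

```python
import requests

prompt = "Tell me a story about a dragon"

# Stream the HTTP response and print tokens as they arrive.
with requests.get(
    "http://localhost:8000/", params={"prompt": prompt}, stream=True
) as resp:
    for chunk in resp.iter_content(chunk_size=None, decode_unicode=True):
        print(chunk, end="", flush=True)
print()
```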
---

---
orphan: true
---

(serve-text-classification-tutorial)=

# Serve a Text Classification Model

This example uses a DistilBERT model to build an IMDB review classification application with Ray Serve.

To run this example, install the following:

```bash
pip install "ray[serve]" requests torch transformers
```

This example uses the [distilbert-base-uncased](https://huggingface.co/docs/transformers/tasks/sequence_classification) model and [FastAPI](https://fastapi.tiangolo.com/). Save the following Serve code to a file named `distilbert_app.py`:

```{literalinclude} ../doc_code/distilbert.py
:language: python
:start-after: __example_code_start__
:end-before: __example_code_end__
```

Use `serve run distilbert_app:entrypoint` to start the Serve application.

:::{note}
The autoscaling config sets `min_replicas` to 0, which means the deployment starts with no replicas of the model deployment. Serve creates replicas only when requests arrive. When no requests arrive for a certain period of time, Serve scales the model deployment back down to 0 replicas to save GPU resources.
:::

You should see the following messages in the logs:

```text
(ServeController pid=362, ip=10.0.44.233) INFO 2023-03-08 16:44:57,579 controller 362 http_state.py:129 - Starting HTTP proxy with name 'SERVE_CONTROLLER_ACTOR:SERVE_PROXY_ACTOR-7396d5a9efdb59ee01b7befba448433f6c6fc734cfa5421d415da1b3' on node '7396d5a9efdb59ee01b7befba448433f6c6fc734cfa5421d415da1b3' listening on '127.0.0.1:8000'
(ServeController pid=362, ip=10.0.44.233) INFO 2023-03-08 16:44:57,588 controller 362 http_state.py:129 - Starting HTTP proxy with name 'SERVE_CONTROLLER_ACTOR:SERVE_PROXY_ACTOR-a30ea53938547e0bf88ce8672e578f0067be26a7e26d23465c46300b' on node 'a30ea53938547e0bf88ce8672e578f0067be26a7e26d23465c46300b' listening on '127.0.0.1:8000'
(ProxyActor pid=439, ip=10.0.44.233) INFO: Started server process [439]
(ProxyActor pid=5779) INFO: Started server process [5779]
(ServeController pid=362, ip=10.0.44.233) INFO 2023-03-08 16:44:59,362 controller 362 deployment_state.py:1333 - Adding 1 replica to deployment 'APIIngress'.
2023-03-08 16:45:01,316 SUCC :93 -- Deployed Serve app successfully.
```

Use the following code to send requests:

```python
import requests

prompt = "This was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three."
input = "%20".join(prompt.split(" "))
resp = requests.get(f"http://127.0.0.1:8000/classify?sentence={input}")
print(resp.status_code, resp.json())
```

The output of the client code is the response status code, the label, which is positive in this example, and the label's score.

```text
200 [{'label': 'LABEL_1', 'score': 0.9994940757751465}]
```

---

---
orphan: true
---

# Serving models with Triton Server in Ray Serve

This guide shows how to build an application with a stable diffusion model using [NVIDIA Triton Server](https://github.com/triton-inference-server/server) in Ray Serve.

## Preparation

### Installation

It's recommended to use the `nvcr.io/nvidia/tritonserver:23.12-py3` image, which already has the Triton Server Python API library installed. Install Ray Serve inside the image with `pip install "ray[serve]"`.

### Build and export a model

For this application, the encoder is exported to ONNX format, and the stable diffusion model is exported to the TensorRT engine format, which is compatible with Triton Server. Here is an example of exporting the models to ONNX format.
```python import torch from pathlib import Path from diffusers import StableDiffusionPipeline # Load a specific model version that's known to work well with ONNX conversion model_id = "runwayml/stable-diffusion-v1-5" # This is often the most compatible model_path = Path("model_repository/stable_diffusion/1") device = "cuda" if torch.cuda.is_available() else "cpu" pipe = StableDiffusionPipeline.from_pretrained(model_id)\ .to(device) vae = pipe.vae unet = pipe.unet text_encoder = pipe.text_encoder hidden_size = text_encoder.config.hidden_size vae.forward = vae.decode torch.onnx.export( vae, (torch.randn(1, 4, 64, 64), False), "vae.onnx", input_names=["latent_sample", "return_dict"], output_names=["sample"], dynamic_axes={ "latent_sample": {0: "batch", 1: "channels", 2: "height", 3: "width"}, }, do_constant_folding=True, opset_version=14, ) dummy_text_input = torch.ones((1, 77), dtype=torch.int64, device=device) torch.onnx.export( text_encoder, dummy_text_input, "encoder.onnx", input_names=["input_ids"], output_names=["last_hidden_state", "pooler_output"], dynamic_axes={ "input_ids": {0: "batch", 1: "sequence"}, }, opset_version=14, do_constant_folding=True, ) ``` From the script, the outputs are `vae.onnx` and `encoder.onnx`. After the ONNX model exported, convert the ONNX model to the TensorRT engine serialized file. ([Details](https://github.com/NVIDIA/TensorRT/blob/release/9.2/samples/trtexec/README.md?plain=1#L22) about trtexec cli) ```bash trtexec --onnx=vae.onnx --saveEngine=vae.plan --minShapes=latent_sample:1x4x64x64 --optShapes=latent_sample:4x4x64x64 --maxShapes=latent_sample:8x4x64x64 --fp16 ``` ### Prepare the model repository Triton Server requires a [model repository](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_repository.md) to store the models, which is a local directory or remote blob store (e.g. AWS S3) containing the model configuration and the model files. In our example, we will use a local directory as the model repository to save all the model files. ```bash model_repo/ ├── stable_diffusion │   ├── 1 │   │   └── model.py │   └── config.pbtxt ├── text_encoder │   ├── 1 │   │   └── model.onnx │   └── config.pbtxt └── vae ├── 1 │   └── model.plan └── config.pbtxt ``` The model repository contains three models: `stable_diffusion`, `text_encoder` and `vae`. Each model has a `config.pbtxt` file and a model file. The `config.pbtxt` file contains the model configuration, which is used to describe the model type and input/output formats.(you can learn more about model config file [here](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_configuration.md)). To get config files for our example, you can download them from [here](https://github.com/triton-inference-server/tutorials/tree/main/Conceptual_Guide/Part_6-building_complex_pipelines/model_repository). We use `1` as the version of each model. The model files are saved in the version directory. ## Start the Triton Server inside a Ray Serve application In each serve replica, there is a single Triton Server instance running. The API takes the model repository path as the parameter, and the Triton Serve instance is started during the replica initialization. The models can be loaded during the inference requests, and the loaded models are cached in the Triton Server instance. 
Here is the inference code example for serving a model with Triton Server.([source](https://github.com/triton-inference-server/tutorials/blob/main/Triton_Inference_Server_Python_API/examples/rayserve/tritonserver_deployment.py)) ```python import numpy import requests import tritonserver from fastapi import FastAPI from PIL import Image from ray import serve app = FastAPI() @serve.deployment(ray_actor_options={"num_gpus": 1}) @serve.ingress(app) class TritonDeployment: def __init__(self): self._triton_server = tritonserver # NOTE: Each worker node needs to have access to this directory. # If you are using distributed multi-node setup, prefer to use # remote storage like S3 to save the model repository and use it. # # If triton server is not able to access this location, # the triton server will complain `failed to stat /workspace/models`. model_repository = ["/workspace/models"] self._triton_server = tritonserver.Server( model_repository=model_repository, model_control_mode=tritonserver.ModelControlMode.EXPLICIT, log_info=False, ) self._triton_server.start(wait_until_ready=True) @app.get("/generate") def generate(self, prompt: str, filename: str = "generated_image.jpg") -> None: if not self._triton_server.model("stable_diffusion").ready(): try: self._triton_server.load("text_encoder") self._triton_server.load("vae") self._stable_diffusion = self._triton_server.load("stable_diffusion") if not self._stable_diffusion.ready(): raise Exception("Model not ready") except Exception as error: print(f"Error can't load stable diffusion model, {error}") return for response in self._stable_diffusion.infer(inputs={"prompt": [[prompt]]}): generated_image = ( numpy.from_dlpack(response.outputs["generated_image"]) .squeeze() .astype(numpy.uint8) ) image_ = Image.fromarray(generated_image) image_.save(filename) if __name__ == "__main__": # Deploy the deployment. serve.run(TritonDeployment.bind()) # Query the deployment. requests.get( "http://localhost:8000/generate", params={"prompt": "dogs in new york, realistic, 4k, photograph"}, ) ``` Save the above code to a file named e.g. `triton_serve.py`, then run `python triton_serve.py` to start the server and send classify requests. After you run the above code, you should see the image generated `generated_image.jpg`. Check it out! ![image](https://raw.githubusercontent.com/ray-project/images/master/docs/serve/triton_server_stable_diffusion.jpg) :::{note} You can also use remote model repository, such as AWS S3, to store the model files. To use remote model repository, you need to set the `model_repository` variable to the remote model repository path. For example `model_repository = s3:///`. ::: If you find any bugs or have any suggestions, please let us know by [filing an issue](https://github.com/ray-project/ray/issues) on GitHub. --- # Getting Data in and out of Tune Often, you will find yourself needing to pass data into Tune [Trainables](tune_60_seconds_trainables) (datasets, models, other large parameters) and get data out of them (metrics, checkpoints, other artifacts). In this guide, we'll explore different ways of doing that and see in what circumstances they should be used. ```{contents} :local: :backlinks: none ``` Let's start by defining a simple Trainable function. We'll be expanding this function with different functionality as we go. ```python import random import time import pandas as pd def training_function(config): # For now, we have nothing here. 
data = None model = {"hyperparameter_a": None, "hyperparameter_b": None} epochs = 0 # Simulate training & evaluation - we obtain back a "metric" and a "trained_model". for epoch in range(epochs): # Simulate doing something expensive. time.sleep(1) metric = (0.1 + model["hyperparameter_a"] * epoch / 100) ** ( -1 ) + model["hyperparameter_b"] * 0.1 * data["A"].sum() trained_model = {"state": model, "epoch": epoch} ``` Our `training_function` function requires a pandas DataFrame, a model with some hyperparameters and the number of epochs to train the model for as inputs. The hyperparameters of the model impact the metric returned, and in each epoch (iteration of training), the `trained_model` state is changed. We will run hyperparameter optimization using the [Tuner API](tune-run-ref). ```python from ray.tune import Tuner from ray import tune tuner = Tuner(training_function, tune_config=tune.TuneConfig(num_samples=4)) ``` ## Getting data into Tune First order of business is to provide the inputs for the Trainable. We can broadly separate them into two categories - variables and constants. Variables are the parameters we want to tune. They will be different for every [Trial](tune_60_seconds_trials). For example, those may be the learning rate and batch size for a neural network, number of trees and the maximum depth for a random forest, or the data partition if you are using Tune as an execution engine for batch training. Constants are the parameters that are the same for every Trial. Those can be the number of epochs, model hyperparameters we want to set but not tune, the dataset and so on. Often, the constants will be quite large (e.g. the dataset or the model). ```{warning} Objects from the outer scope of the `training_function` will also be automatically serialized and sent to Trial Actors, which may lead to unintended behavior. Examples include global locks not working (as each Actor operates on a copy) or general errors related to serialization. Best practice is to not refer to any objects from outer scope in the `training_function`. ``` ### Passing data into a Tune run through search spaces ```{note} TL;DR - use the `param_space` argument to specify small, serializable constants and variables. ``` The first way of passing inputs into Trainables is the [*search space*](tune-key-concepts-search-spaces) (it may also be called *parameter space* or *config*). In the Trainable itself, it maps to the `config` dict passed in as an argument to the function. You define the search space using the `param_space` argument of the `Tuner`. The search space is a dict and may be composed of [*distributions*](), which will sample a different value for each Trial, or of constant values. The search space may be composed of nested dictionaries, and those in turn can have distributions as well. ```{warning} Each value in the search space will be saved directly in the Trial metadata. This means that every value in the search space **must** be serializable and take up a small amount of memory. ``` For example, passing in a large pandas DataFrame or an unserializable model object as a value in the search space will lead to unwanted behavior. At best it will cause large slowdowns and disk space usage as Trial metadata saved to disk will also contain this data. At worst, an exception will be raised, as the data cannot be sent over to the Trial workers. For more details, see {ref}`tune-bottlenecks`. 
Instead, use strings or other identifiers as your values, and initialize/load the objects inside your Trainable directly depending on those. ```{note} [Datasets](data_quickstart) can be used as values in the search space directly. ``` In our example, we want to tune the two model hyperparameters. We also want to set the number of epochs, so that we can easily tweak it later. For the hyperparameters, we will use the `tune.uniform` distribution. We will also modify the `training_function` to obtain those values from the `config` dictionary. ```python def training_function(config): # For now, we have nothing here. data = None model = { "hyperparameter_a": config["hyperparameter_a"], "hyperparameter_b": config["hyperparameter_b"], } epochs = config["epochs"] # Simulate training & evaluation - we obtain back a "metric" and a "trained_model". for epoch in range(epochs): # Simulate doing something expensive. time.sleep(1) metric = (0.1 + model["hyperparameter_a"] * epoch / 100) ** ( -1 ) + model["hyperparameter_b"] * 0.1 * data["A"].sum() trained_model = {"state": model, "epoch": epoch} tuner = Tuner( training_function, param_space={ "hyperparameter_a": tune.uniform(0, 20), "hyperparameter_b": tune.uniform(-100, 100), "epochs": 10, }, ) ``` ### Using `tune.with_parameters` access data in Tune runs ```{note} TL;DR - use the `tune.with_parameters` util function to specify large constant parameters. ``` If we have large objects that are constant across Trials, we can use the {func}`tune.with_parameters ` utility to pass them into the Trainable directly. The objects will be stored in the [Ray object store](serialization-guide) so that each Trial worker may access them to obtain a local copy to use in its process. ```{tip} Objects put into the Ray object store must be serializable. ``` Note that the serialization (once) and deserialization (for each Trial) of large objects may incur a performance overhead. In our example, we will pass the `data` DataFrame using `tune.with_parameters`. In order to do that, we need to modify our function signature to include `data` as an argument. ```python def training_function(config, data): model = { "hyperparameter_a": config["hyperparameter_a"], "hyperparameter_b": config["hyperparameter_b"], } epochs = config["epochs"] # Simulate training & evaluation - we obtain back a "metric" and a "trained_model". for epoch in range(epochs): # Simulate doing something expensive. time.sleep(1) metric = (0.1 + model["hyperparameter_a"] * epoch / 100) ** ( -1 ) + model["hyperparameter_b"] * 0.1 * data["A"].sum() trained_model = {"state": model, "epoch": epoch} tuner = Tuner( training_function, param_space={ "hyperparameter_a": tune.uniform(0, 20), "hyperparameter_b": tune.uniform(-100, 100), "epochs": 10, }, ) ``` Next step is to wrap the `training_function` using `tune.with_parameters` before passing it into the `Tuner`. Every keyword argument of the `tune.with_parameters` call will be mapped to the keyword arguments in the Trainable signature. ```python data = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}) tuner = Tuner( tune.with_parameters(training_function, data=data), param_space={ "hyperparameter_a": tune.uniform(0, 20), "hyperparameter_b": tune.uniform(-100, 100), "epochs": 10, }, tune_config=tune.TuneConfig(num_samples=4), ) ``` ### Loading data in a Tune Trainable You can also load data directly in Trainable from e.g. cloud storage, shared file storage such as NFS, or from the local disk of the Trainable worker. 
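For example, a Trainable can read its dataset itself at the start of each trial. The following is a minimal sketch; the S3 path is hypothetical, and reading `s3://` URLs with pandas requires the optional `s3fs` dependency:

```python
import pandas as pd


def training_function(config):
    # Each trial loads its own copy of the data from shared storage.
    # Hypothetical path; any location reachable from every node works.
    data = pd.read_csv("s3://my-bucket/datasets/training_data.csv")

    # Continue with model setup and the training loop as before.
    ...
```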
```{warning} When loading from disk, ensure that all nodes in your cluster have access to the file you are trying to load. ``` A common use-case is to load the dataset from S3 or any other cloud storage with pandas, arrow or any other framework. The working directory of the Trainable worker will be automatically changed to the corresponding Trial directory. For more details, see {ref}`tune-working-dir`. Our tuning run can now be run, though we will not yet obtain any meaningful outputs back. ```python results = tuner.fit() ``` ## Getting data out of Ray Tune We can now run our tuning run using the `training_function` Trainable. The next step is to report *metrics* to Tune that can be used to guide the optimization. We will also want to *checkpoint* our trained models so that we can resume the training after an interruption, and to use them for prediction later. The `ray.tune.report` API is used to get data out of the Trainable workers. It can be called multiple times in the Trainable function. Each call corresponds to one iteration (epoch, step, tree) of training. ### Reporting metrics with Tune *Metrics* are values passed through the `metrics` argument in a `tune.report` call. Metrics can be used by Tune [Search Algorithms](search-alg-ref) and [Schedulers](schedulers-ref) to direct the search. After the tuning run is complete, you can [analyze the results](tune-analysis-guide), which include the reported metrics. ```{note} Similarly to search space values, each value reported as a metric will be saved directly in the Trial metadata. This means that every value reported as a metric **must** be serializable and take up a small amount of memory. ``` ```{note} Tune will automatically include some metrics, such as the training iteration, timestamp and more. See [here](tune-autofilled-metrics) for the entire list. ``` In our example, we want to maximize the `metric`. We will report it each epoch to Tune, and set the `metric` and `mode` arguments in `tune.TuneConfig` to let Tune know that it should use it as the optimization objective. ```python from ray import tune def training_function(config, data): model = { "hyperparameter_a": config["hyperparameter_a"], "hyperparameter_b": config["hyperparameter_b"], } epochs = config["epochs"] # Simulate training & evaluation - we obtain back a "metric" and a "trained_model". for epoch in range(epochs): # Simulate doing something expensive. time.sleep(1) metric = (0.1 + model["hyperparameter_a"] * epoch / 100) ** ( -1 ) + model["hyperparameter_b"] * 0.1 * data["A"].sum() trained_model = {"state": model, "epoch": epoch} tune.report(metrics={"metric": metric}) tuner = Tuner( tune.with_parameters(training_function, data=data), param_space={ "hyperparameter_a": tune.uniform(0, 20), "hyperparameter_b": tune.uniform(-100, 100), "epochs": 10, }, tune_config=tune.TuneConfig(num_samples=4, metric="metric", mode="max"), ) ``` ### Logging metrics with Tune callbacks Every metric logged using `tune.report` can be accessed during the tuning run through Tune [Callbacks](tune-logging). Ray Tune provides [several built-in integrations](loggers-docstring) with popular frameworks, such as MLFlow, Weights & Biases, CometML and more. You can also use the [Callback API](tune-callbacks-docs) to create your own callbacks. Callbacks are passed in the `callback` argument of the `Tuner`'s `RunConfig`. In our example, we'll use the MLFlow callback to track the progress of our tuning run and the changing value of the `metric` (requires `mlflow` to be installed). 
```python
from ray import tune
from ray.tune.logger.mlflow import MLflowLoggerCallback


def training_function(config, data):
    model = {
        "hyperparameter_a": config["hyperparameter_a"],
        "hyperparameter_b": config["hyperparameter_b"],
    }
    epochs = config["epochs"]

    # Simulate training & evaluation - we obtain back a "metric" and a "trained_model".
    for epoch in range(epochs):
        # Simulate doing something expensive.
        time.sleep(1)
        metric = (0.1 + model["hyperparameter_a"] * epoch / 100) ** (
            -1
        ) + model["hyperparameter_b"] * 0.1 * data["A"].sum()
        trained_model = {"state": model, "epoch": epoch}
        tune.report(metrics={"metric": metric})


tuner = tune.Tuner(
    tune.with_parameters(training_function, data=data),
    param_space={
        "hyperparameter_a": tune.uniform(0, 20),
        "hyperparameter_b": tune.uniform(-100, 100),
        "epochs": 10,
    },
    tune_config=tune.TuneConfig(num_samples=4, metric="metric", mode="max"),
    run_config=tune.RunConfig(
        callbacks=[MLflowLoggerCallback(experiment_name="example")]
    ),
)
```

### Getting data out of Tune using checkpoints & other artifacts

Aside from metrics, you may want to save the state of your trained model and any other artifacts, both to allow resuming after a training failure and for later inspection and use. Those cannot be saved as metrics, as they are often far too large and may not be easily serializable. They should also be persisted on disk or in cloud storage so they remain accessible after the Tune run is interrupted or terminated.

Ray Train provides a {class}`Checkpoint ` API for that purpose. `Checkpoint` objects can be created from various sources (dictionaries, directories, cloud storage).

In Ray Tune, `Checkpoints` are created by the user in their Trainable functions and reported using the optional `checkpoint` argument of `tune.report`. `Checkpoints` can contain arbitrary data and can be freely passed around the Ray cluster. After a tuning run is over, `Checkpoints` can be [obtained from the results](tune-analysis-guide).

Ray Tune can be configured to [automatically sync checkpoints to cloud storage](tune-storage-options), keep only a certain number of checkpoints to save space (with {class}`ray.tune.CheckpointConfig`; see the sketch after the example output below), and more.

```{note}
The experiment state itself is checkpointed separately. See {ref}`tune-persisted-experiment-data` for more details.
```

In our example, we want to be able to resume training from the latest checkpoint, and to save the `trained_model` in a checkpoint every iteration. To accomplish this, we will use the `tune.get_checkpoint`, `tune.report`, and `Checkpoint` APIs.

```python
import os
import pickle
import tempfile

from ray import tune


def training_function(config, data):
    model = {
        "hyperparameter_a": config["hyperparameter_a"],
        "hyperparameter_b": config["hyperparameter_b"],
    }
    epochs = config["epochs"]

    # Load the checkpoint, if there is any.
    checkpoint = tune.get_checkpoint()
    start_epoch = 0
    if checkpoint:
        with checkpoint.as_directory() as checkpoint_dir:
            with open(os.path.join(checkpoint_dir, "model.pkl"), "rb") as f:
                checkpoint_dict = pickle.load(f)
        start_epoch = checkpoint_dict["epoch"] + 1
        model = checkpoint_dict["state"]

    # Simulate training & evaluation - we obtain back a "metric" and a "trained_model".
    for epoch in range(start_epoch, epochs):
        # Simulate doing something expensive.
        time.sleep(1)
        metric = (0.1 + model["hyperparameter_a"] * epoch / 100) ** (
            -1
        ) + model["hyperparameter_b"] * 0.1 * data["A"].sum()

        checkpoint_dict = {"state": model, "epoch": epoch}

        # Create the checkpoint.
        with tempfile.TemporaryDirectory() as temp_checkpoint_dir:
            with open(os.path.join(temp_checkpoint_dir, "model.pkl"), "wb") as f:
                pickle.dump(checkpoint_dict, f)
            tune.report(
                {"metric": metric},
                checkpoint=tune.Checkpoint.from_directory(temp_checkpoint_dir),
            )


tuner = tune.Tuner(
    tune.with_parameters(training_function, data=data),
    param_space={
        "hyperparameter_a": tune.uniform(0, 20),
        "hyperparameter_b": tune.uniform(-100, 100),
        "epochs": 10,
    },
    tune_config=tune.TuneConfig(num_samples=4, metric="metric", mode="max"),
    run_config=tune.RunConfig(
        callbacks=[MLflowLoggerCallback(experiment_name="example")]
    ),
)
```

With all of those changes implemented, we can now run our tuning and obtain meaningful metrics and artifacts.

```python
results = tuner.fit()
results.get_dataframe()
```

```text
2022-11-30 17:40:28,839 INFO tune.py:762 -- Total run time: 15.79 seconds (15.65 seconds for the tuning loop).
metric time_this_iter_s should_checkpoint done timesteps_total episodes_total training_iteration trial_id experiment_id date ... hostname node_ip time_since_restore timesteps_since_restore iterations_since_restore warmup_time config/epochs config/hyperparameter_a config/hyperparameter_b logdir
0 -58.399962 1.015951 True False NaN NaN 10 0b239_00000 acf38c19d59c4cf2ad7955807657b6ea 2022-11-30_17-40-26 ... ip-172-31-43-110 172.31.43.110 10.282120 0 10 0.003541 10 18.065981 -98.298928 /home/ubuntu/ray_results/training_function_202...
1 -24.461518 1.030420 True False NaN NaN 10 0b239_00001 5ca9e03d7cca46a7852cd501bc3f7b38 2022-11-30_17-40-28 ... ip-172-31-43-110 172.31.43.110 10.362581 0 10 0.004031 10 1.544918 -47.741455 /home/ubuntu/ray_results/training_function_202...
2 18.510299 1.034228 True False NaN NaN 10 0b239_00002 aa38dd786c714486a8d69fa5b372df48 2022-11-30_17-40-28 ... ip-172-31-43-110 172.31.43.110 10.333781 0 10 0.005286 10 8.129285 28.846415 /home/ubuntu/ray_results/training_function_202...
3 -16.138780 1.020072 True False NaN NaN 10 0b239_00003 5b401e15ab614332b631d552603a8d77 2022-11-30_17-40-28 ... ip-172-31-43-110 172.31.43.110 10.242707 0 10 0.003809 10 17.982020 -27.867871 /home/ubuntu/ray_results/training_function_202...

4 rows × 23 columns
```
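The run above retains every checkpoint from every trial. As a sketch of the `ray.tune.CheckpointConfig` option mentioned earlier (the specific numbers and score settings below are illustrative assumptions), you could keep only the two best checkpoints per trial:

```python
from ray import tune

run_config = tune.RunConfig(
    # This can be combined with the callbacks argument shown above.
    checkpoint_config=tune.CheckpointConfig(
        # Keep only the two checkpoints with the highest reported "metric".
        num_to_keep=2,
        checkpoint_score_attribute="metric",
        checkpoint_score_order="max",
    ),
)
```

Passing a `run_config` like this to the `Tuner` bounds the disk space used for checkpoints without changing anything else about the run.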

Checkpoints, metrics, and the log directory for each trial can be accessed through the `ResultGrid` output of a Tune experiment. For more information on how to interact with the returned `ResultGrid`, see {doc}`/tune/examples/tune_analyze_results`.

### How do I access Tune results after I am finished?

Even after your Python session has ended, you can still access the results and checkpoints. By default, Tune saves the experiment results to the local `~/ray_results` directory. You can also configure Tune to persist results in the cloud. See {ref}`tune-storage-options` for more information on how to configure storage options for persisting experiment results.

You can restore the Tune experiment by calling {meth}`Tuner.restore(path_or_cloud_uri, trainable) `, where `path_or_cloud_uri` points to the location on the filesystem or in cloud storage where the experiment was saved. After the `Tuner` has been restored, you can access the results and checkpoints by calling `Tuner.get_results()` to receive the `ResultGrid` object, and then proceed as outlined in the previous section; a minimal sketch follows.
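As a sketch, restoring a finished experiment and pulling out the best result could look like this. The experiment path is a placeholder; point it at your actual results directory or cloud URI.

```python
from ray import tune

# Placeholder path: replace with your experiment directory under ~/ray_results
# or the cloud URI you configured for the run.
restored_tuner = tune.Tuner.restore(
    "~/ray_results/my_experiment",
    trainable=tune.with_parameters(training_function, data=data),
)
results = restored_tuner.get_results()

# Pick the trial that maximized the reported metric and inspect its outputs.
best_result = results.get_best_result(metric="metric", mode="max")
print(best_result.config)
print(best_result.checkpoint)
```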