===============
Partitioner API
===============

The Neutron partitioner API allows for configuration of the model delegation to Neutron. Passing a ``NeutronPartitioner`` instance with no additional parameters will run as much of the model as possible on the Neutron backend. This is the most common use case. It has the following arguments:

* `compile_spec` - list of key-value pairs defining the compilation.
* `custom_delegation_options` - custom options for specifying node delegation.

--------------------
Compile Spec Options
--------------------

To generate the compile spec for the Neutron backend, you can use the `generate_neutron_compile_spec` function or call `NeutronCompileSpecBuilder().neutron_compile_spec()` directly. The following fields can be set:

* `config` - NXP platform defining the Neutron NPU configuration, e.g. "imxrt700".
* `neutron_converter_flavor` - Flavor of the neutron-converter module to use. For example, the neutron-converter module named 'neutron_converter_SDK_25_06' has flavor 'SDK_25_06'. Set the flavor to match the MCUXpresso SDK version you will use.
* `extra_flags` - Extra flags for the Neutron compiler.
* `operators_not_to_delegate` - List of operators that will not be delegated.

-------------------------
Custom Delegation Options
-------------------------

By default the Neutron backend is defensive, which means it does not delegate operators whose support cannot be determined statically during partitioning. As the model author you typically have insight into the model, so you can allow opportunistic delegation in some of these cases. For the list of options, see `CustomDelegationOptions`_.

================
Operator Support
================

Operators are the building blocks of the ML model. See `IRs`_ for more information on the PyTorch operator set.

This section lists the Edge operators supported by the Neutron backend. For the detailed constraints on each operator, see the conditions in the ``is_supported_*`` functions in the `Node converters`_.

.. csv-table:: Operator Support
   :file: op-support.csv
   :header-rows: 1
   :widths: 20 15 30 30
   :align: center

---

================
Operator Support
================

This page lists the PyTorch operators currently supported by the Samsung Exynos backend.

.. csv-table:: Operator Support
   :file: samsung-op-support-table.csv
   :header-rows: 1
   :widths: 25 15 55
   :align: center

---

==============================
{BACKEND_NAME} Partitioner API
==============================

Document the partitioner API for the backend, including configuration options and compile specs.

- ``option1``: Description of the option and values.
- ``option2``: Description of the second option.
- ``option3``: Description of the third option.

{ADDITIONAL_PARTITIONER_DETAILS}

================
Operator Support
================

This page lists the operators supported by the {BACKEND_NAME} backend.

Operators are the building blocks of the ML model. See `IRs`_ for more information on the PyTorch operator set.

{OPERATOR_SUPPORT_NOTES}

.. csv-table:: Operator Support
   :file: op-support.csv
   :header-rows: 1
   :widths: 20 15 30 30
   :align: center

---

================
Operator Support
================

This page lists the operators currently supported by the Vulkan backend.
The source of truth for this information is `op_registry.py `_, which is used by the Vulkan Partitioner to determine which operators should be lowered to the Vulkan backend and additionally describes the capabilities of each operator implementation. If an operator used in your model is not in this list, feel free to create a feature request on Github and we will do our best to add an implementation for the operator. The namespace of an operator describes where it originates from: * **aten** - operators in this namespace correspond 1:1 to operators in PyTorch's `ATen library `_. They all support fp16 and fp32 dtypes at a minimum. * **dim_order_op** - these operators are inserted when lowering to ExecuTorch in order to manage optimal tensor memory layouts. They are typically removed, since the Vulkan backend manages optimal tensor representations internally. * **llama** - custom ops targeted for LLM inference. These are typically inserted by model source transformations applied to a `nn.Module` and are not invoked directly by a PyTorch model. * **operator** - these operators work with symbolic integers, which are also supported by the Vulkan backend. * **quantized_decomposed** / **torchao** - these ops are introduced by quantization workflows (either torchao's `quantize_` API or the PT2E quantization flow). They typically represent quantizing/dequantizing a tensor, or choosing the quantization parameters for a tensor. In practice, most instances of these operators will be fused into a custom op in the **et_vk** namespace. * **et_vk** - these are custom operators implemented only in the Vulkan backend. They typically represent quantized variants of **aten** operators, or fusions of common operator patterns. They are inserted by operator fusion graph passes when lowering to the Vulkan backend. All operators support dynamic input shapes unless otherwise noted (i.e. "no resize support"). The expectation is that over time, all operators will be able to support dynamic shapes. .. csv-table:: Vulkan Backend Operator Support :file: vulkan-op-support-table.csv :header-rows: 1 :widths: 25 25 75 :align: left --- =============== Partitioner API =============== The XNNPACK partitioner API allows for configuration of the model delegation to XNNPACK. Passing an ``XnnpackPartitioner`` instance with no additional parameters will run as much of the model as possible on the XNNPACK backend. This is the most common use-case. For advanced use cases, the partitioner exposes the following options via the `constructor `_: - ``configs``: Control which operators are delegated to XNNPACK. By default, all available operators all delegated. See `../config/__init__.py `_ for an exhaustive list of available operator configs. - ``config_precisions``: Filter operators by data type. By default, delegate all precisions. One or more of ``ConfigPrecisionType.FP32``, ``ConfigPrecisionType.STATIC_QUANT``, or ``ConfigPrecisionType.DYNAMIC_QUANT``. See `ConfigPrecisionType `_. - ``per_op_mode``: If true, emit individual delegate calls for every operator. This is an advanced option intended to reduce memory overhead in some contexts at the cost of a small amount of runtime overhead. Defaults to false. - ``verbose``: If true, print additional information during lowering. ================ Operator Support ================ This section lists the operators supported by the XNNPACK backend. Operators are the building blocks of the ML model. See `IRs `_ for more information on the PyTorch operator set. 
All operators support dynamic input shapes unless otherwise noted. .. csv-table:: Operator Support :file: op-support.csv :header-rows: 1 :widths: 20 15 30 30 :align: center --- Prerequisite | ETRecord - ExecuTorch Record =========================================== Overview -------- ``ETRecord`` is intended to be the debug artifact that is generated by users ahead of time (when they export their model to run on ExecuTorch). To draw a rough equivalent to conventional software development, ``ETRecord`` can be considered as the binary built with debug symbols that is used for debugging in GNU Debugger (gdb). It is expected that the user will supply this to the ExecuTorch Developer Tools in order for them to debug and visualize their model. ``ETRecord`` contains numerous components such as: * Edge dialect graph with debug handles * Delegate debug handle maps The ``ETRecord`` object itself is intended to be opaque to users and they should not access any components inside it directly. It should be provided to the `Inspector API `__ to link back performance and debug data sourced from the runtime back to the Python source code. Generating an ``ETRecord`` -------------------------- There are multiple ways to generate an ``ETRecord`` for debugging purposes: Method 1: Using the ``generate_etrecord`` Parameter (Recommended) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The recommended approach is to enable ``ETRecord`` generation by passing ``generate_etrecord=True`` to your export API calls. This can be used with: * ``executorch.export()`` - High-level export API * ``to_edge()`` - Edge dialect conversion * ``to_edge_transform_and_lower()`` - Edge conversion with transformations and lowering After export completes, retrieve the ``ETRecord`` using the ``get_etrecord()`` method, and save it using the ``save()`` method: **Example with** ``executorch.export()``: .. code-block:: python import executorch from executorch.export import ExportRecipe # Export with ETRecord generation enabled session = executorch.export( model=model, example_inputs=[example_inputs], export_recipe=recipe, generate_etrecord=True # Enable ETRecord generation ) # Get and save the ETRecord etrecord = session.get_etrecord() etrecord.save("model_debug.etrecord") **Example with** ``to_edge()``: .. code-block:: python from executorch.exir.program import to_edge from torch.export import export # Export model first exported_program = export(model, example_inputs) # Convert to edge with ETRecord generation edge_manager = to_edge( exported_program, generate_etrecord=True # Enable ETRecord generation ) # Apply transformations edge_manager = edge_manager.to_backend() et_manager = edge_manager.to_executorch() # Get and save ETRecord etrecord = et_manager.get_etrecord() etrecord.save("edge_debug.etrecord") **Example with** ``to_edge_transform_and_lower()``: .. 
code-block:: python from executorch.exir.program import to_edge_transform_and_lower from torch.export import export # Export model first exported_program = export(model, example_inputs) # Transform and lower with ETRecord generation edge_manager = to_edge_transform_and_lower( exported_program, partitioner=[MyPartitioner()], generate_etrecord=True # Enable ETRecord generation ) et_manager = edge_manager.to_executorch() # Get and save ETRecord etrecord = et_manager.get_etrecord() etrecord.save("debug.etrecord") Method 2: Using the ``generate_etrecord()`` Function ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ You can also use the standalone ``generate_etrecord()`` function to generate an ``ETRecord``. This method requires you to provide the Edge Dialect program (returned by ``to_edge()``), the ExecuTorch program (returned by ``to_executorch()``), and optional models. .. warning:: When using the standalone function, users should do a deepcopy of the output of ``to_edge()`` and pass in the deepcopy to the ``generate_etrecord`` API. This is needed because the subsequent call, ``to_executorch()``, does an in-place mutation and will lose debug data in the process. **Example:** .. code-block:: python import copy from executorch.devtools import generate_etrecord from torch.export import export # Export and convert to edge aten_dialect = export(model, example_inputs, strict=True) edge_program = to_edge(aten_dialect) # Create copy for ETRecord (needed because to_executorch modifies in-place) edge_program_copy = copy.deepcopy(edge_program) # Convert to ExecutorchProgramManager executorch_program = edge_program_copy.to_executorch() # Generate ETRecord separately generate_etrecord( "debug.etrecord", edge_program, executorch_program, ) .. currentmodule:: executorch.devtools.etrecord._etrecord .. autofunction:: generate_etrecord Using an ``ETRecord`` --------------------- Pass the ``ETRecord`` as an optional argument into the `Inspector API `__ to access this data and do post-run analysis. --- Runtime API Reference ================================ The ExecuTorch C++ API provides an on-device execution framework for exported PyTorch models. For a tutorial style introduction to the runtime API, check out the `using executorch with cpp tutorial `__ and its `simplified `__ version. For detailed information on how APIs evolve and the deprecation process, please refer to the `ExecuTorch API Life Cycle and Deprecation Policy `__. Model Loading and Execution --------------------------- .. doxygenclass:: executorch::runtime::Program :members: .. doxygenclass:: executorch::runtime::Method :members: .. doxygenclass:: executorch::runtime::MethodMeta :members: .. doxygenclass:: executorch::runtime::DataLoader :members: .. doxygenclass:: executorch::runtime::MemoryAllocator :members: .. doxygenclass:: executorch::runtime::HierarchicalAllocator :members: .. doxygenclass:: executorch::runtime::MemoryManager :members: Values ------ .. doxygenstruct:: executorch::runtime::EValue :members: .. doxygenclass:: executorch::runtime::etensor::Tensor :members: --- Export API Reference ---------------------------------- For detailed information on how APIs evolve and the deprecation process, please refer to the `ExecuTorch API Life Cycle and Deprecation Policy `__. .. automodule:: executorch.exir .. autofunction:: to_edge .. automodule:: executorch.exir .. autofunction:: to_edge_transform_and_lower .. autoclass:: EdgeProgramManager :members: methods, config_methods, exported_program, transform, to_backend, to_executorch .. 
autoclass:: ExecutorchProgramManager :members: methods, config_methods, exported_program, buffer, debug_handle_map, dump_executorch_program .. automodule:: executorch.exir.backend.backend_api .. autofunction:: to_backend .. autoclass:: LoweredBackendModule :members: backend_id, processed_bytes, compile_specs, original_module, buffer, program --- Setting Up ExecuTorch ===================== This page is re-organized into the following pages: * `Getting Started with ExecuTorch `_ * `Building from Source `_ It will redirect in 3 seconds .. raw:: html --- Inspector APIs ============== Overview -------- The Inspector APIs provide a convenient interface for analyzing the contents of `ETRecord `__ and `ETDump `__, helping developers get insights about model architecture and performance statistics. It’s built on top of the `EventBlock Class <#eventblock-class>`__ data structure, which organizes a group of `Event <#event-class>`__\ s for easy access to details of profiling events. There are multiple ways in which users can interact with the Inspector APIs: * By using `public methods <#inspector-methods>`__ provided by the ``Inspector`` class. * By accessing the `public attributes <#inspector-attributes>`__ of the ``Inspector``, ``EventBlock``, and ``Event`` classes. * By using a `CLI <#cli>`__ tool for basic functionalities. Please refer to the `e2e use case doc `__ get an understanding of how to use these in a real world example. Inspector Methods ----------------- Constructor ~~~~~~~~~~~ .. autofunction:: executorch.devtools.Inspector.__init__ **Example Usage:** .. code:: python from executorch.devtools import Inspector inspector = Inspector(etdump_path="/path/to/etdump.etdp", etrecord="/path/to/etrecord.bin") to_dataframe ~~~~~~~~~~~~~~~~ .. autofunction:: executorch.devtools.Inspector.to_dataframe print_data_tabular ~~~~~~~~~~~~~~~~~~ .. autofunction:: executorch.devtools.Inspector.print_data_tabular .. _example-usage-1: **Example Usage:** .. code:: python inspector.print_data_tabular() .. image:: _static/img/print_data_tabular.png Note that the unit of delegate profiling events is "cycles". We're working on providing a way to set different units in the future. find_total_for_module ~~~~~~~~~~~~~~~~~~~~~ .. autofunction:: executorch.devtools.Inspector.find_total_for_module .. _example-usage-2: **Example Usage:** .. code:: python print(inspector.find_total_for_module("L__self___conv_layer")) :: 0.002 get_exported_program ~~~~~~~~~~~~~~~~~~~~ .. autofunction:: executorch.devtools.Inspector.get_exported_program .. _example-usage-3: **Example Usage:** .. code:: python print(inspector.get_exported_program()) :: ExportedProgram: class GraphModule(torch.nn.Module): def forward(self, arg0_1: f32[4, 3, 64, 64]): # No stacktrace found for following nodes _param_constant0 = self._param_constant0 _param_constant1 = self._param_constant1 ### ... Omit part of the program for documentation readability ... ### Graph signature: ExportGraphSignature(parameters=[], buffers=[], user_inputs=['arg0_1'], user_outputs=['aten_tan_default'], inputs_to_parameters={}, inputs_to_buffers={}, buffers_to_mutate={}, backward_signature=None, assertion_dep_token=None) Range constraints: {} Equality constraints: [] calculate_numeric_gap ~~~~~~~~~~~~~~~~~~ .. autofunction:: executorch.devtools.Inspector.calculate_numeric_gap .. _example-usage-4: **Example Usage:** .. code:: python print(inspector.calculate_numeric_gap("L1")) .. 
image:: _static/img/calculate_numeric_gap.png Inspector Attributes -------------------- ``EventBlock`` Class ~~~~~~~~~~~~~~~~~~~~ Access ``EventBlock`` instances through the ``event_blocks`` attribute of an ``Inspector`` instance, for example: .. code:: python inspector.event_blocks .. autoclass:: executorch.devtools.inspector.EventBlock ``Event`` Class ~~~~~~~~~~~~~~~ Access ``Event`` instances through the ``events`` attribute of an ``EventBlock`` instance. .. autoclass:: executorch.devtools.inspector.Event **Example Usage:** .. code:: python for event_block in inspector.event_blocks: for event in event_block.events: if event.name == "Method::execute": print(event.perf_data.raw) :: [175.748, 78.678, 70.429, 122.006, 97.495, 67.603, 70.2, 90.139, 66.344, 64.575, 134.135, 93.85, 74.593, 83.929, 75.859, 73.909, 66.461, 72.102, 84.142, 77.774, 70.038, 80.246, 59.134, 68.496, 67.496, 100.491, 81.162, 74.53, 70.709, 77.112, 59.775, 79.674, 67.54, 79.52, 66.753, 70.425, 71.703, 81.373, 72.306, 72.404, 94.497, 77.588, 79.835, 68.597, 71.237, 88.528, 71.884, 74.047, 81.513, 76.116] CLI --- Execute the following command in your terminal to display the data table. This command produces the identical table output as calling the `print_data_tabular <#print-data-tabular>`__ mentioned earlier: .. code:: bash python3 -m devtools.inspector.inspector_cli --etdump_path --etrecord_path Note that the `etrecord_path` argument is optional. We plan to extend the capabilities of the CLI in the future. --- Runtime Python API Reference ---------------------------------- The Python ``executorch.runtime`` module wraps the C++ ExecuTorch runtime. It can load and execute serialized ``.pte`` program files: see the `Export to ExecuTorch Tutorial `__ for how to convert a PyTorch ``nn.Module`` to an ExecuTorch ``.pte`` program file. Execution accepts and returns ``torch.Tensor`` values, making it a quick way to validate the correctness of the program. For detailed information on how APIs evolve and the deprecation process, please refer to the `ExecuTorch API Life Cycle and Deprecation Policy `__. .. automodule:: executorch.runtime .. autoclass:: Runtime :members: get, load_program .. autoclass:: OperatorRegistry :members: operator_names .. autoclass:: Program :members: method_names, load_method .. autoclass:: Method :members: execute, metadata --- (advanced-topics-section)= # Advanced Deep dive into ExecuTorch's advanced features for optimization, customization, and integration. This section covers advanced concepts for developers who need to customize ExecuTorch for specific use cases, optimize performance, or integrate with custom hardware backends. ## Quantization & Optimization Techniques for model compression and performance optimization. **→ {doc}`quantization-optimization` — Quantization strategies and performance optimization** Key topics: - Quantization strategies and techniques - Performance profiling and optimization ## Model Export Learn the core ExecuTorch workflow, exporting PyTorch models to the `.pte` format for edge deployment. **→ {doc}`using-executorch-export`** - Model Export & Lowering Key topics: - Export and Lowering Workflow - Hardware Backend Selection & Optimization - Dynamic Shapes & Advanced Model Features ## Kernel Library Deep dive into ExecuTorch's kernel implementation and customization. 
**→ {doc}`kernel-library-advanced` — Kernel library deep dive and customization** Key topics: - Kernel library architecture - Custom kernel implementation - Selective build and optimization ## Backend & Delegates **→ {doc}`backend-delegate-advanced` — Backend delegate integration** Key topics: - Learn how to integrate Backend Delegate into ExecuTorch and more - XNNPACK Delegate Internals - Debugging Delegation ## Runtime & Integration Advanced runtime features and backend integration. **→ {doc}`runtime-integration-advanced` — Runtime customization and backend integration** Key topics: - Backend delegate implementation - Platform abstraction layer - Custom runtime integration ## Compiler & IR Advanced compiler features and intermediate representation details. **→ {doc}`compiler-ir-advanced` — Compiler passes and IR specification** Key topics: - Custom compiler passes - Memory planning strategies - Backend dialect and EXIR - Ops set definition ## File Formats ExecuTorch file format specifications and internals. **→ {doc}`file-formats-advanced` — PTE and PTD file format specifications** Key topics: - PTE file format internals - PTD file format specification - Custom file format handling ## Next Steps After exploring advanced topics: - **{doc}`tools-sdk-section`** - Developer tools for debugging and profiling - **{doc}`api-section`** - Complete API reference documentation ```{toctree} :hidden: :maxdepth: 2 :caption: Advanced Topics quantization-optimization using-executorch-export kernel-library-advanced backend-delegate-advanced runtime-integration-advanced compiler-ir-advanced file-formats-advanced --- (android-backends)= # Backends Available hardware acceleration backends for Android deployment. ## CPU Acceleration - {doc}`android-xnnpack` — XNNPACK CPU acceleration ## GPU Acceleration - {doc}`android-vulkan` — Vulkan GPU acceleration ## NPU/Accelerator Backends - {doc}`android-qualcomm` — Qualcomm AI Engine (NPU) - {doc}`android-mediatek` — MediaTek NPU acceleration - {doc}`android-arm-vgf` — ARM VGF Backend - {doc}`backends/samsung/samsung-overview` — Samsung Exynos NPU ```{toctree} :hidden: android-xnnpack android-vulkan android-qualcomm android-mediatek android-arm-vgf backends/samsung/samsung-overview --- # Examples & Demos - [Working with LLMs - Android Examples](https://github.com/meta-pytorch/executorch-examples/blob/main/llm/android/LlamaDemo/README.md) - ExecuTorch Llama Android Demo App - [Demo Apps](https://github.com/meta-pytorch/executorch-examples/tree/main/dl3/android/DeepLabV3Demo#executorch-android-demo-app) - DeepLab v3 model for image segmentation - {doc}`tutorial-arm-vgf` — Export a simple PyTorch model for the ExecuTorch VGF backend ```{toctree} :hidden: tutorial-arm-vgf --- (android-section)= # Android Deploy ExecuTorch on Android devices with hardware acceleration support. ## Quick Start & Integration - {doc}`using-executorch-android` — Complete Android integration guide ## Backends - {doc}`android-backends` — Available Android backends and acceleration options ## Examples & Demos - {doc}`android-examples` — Explore Android Examples & Demos ```{toctree} :hidden: using-executorch-android android-backends android-examples --- # API Life Cycle and Deprecation Policy ## API Life Cycle ![name](_static/img/api_life_cycle.png) Each API of ExecuTorch falls into one of the following life cycle states: _Experimental_ - APIs in this stage are under active development and may change or be removed at any time. 
That said, the expectation is that we will eventually promote it to _Stable_, unless sufficient negative signals have been collected from the community or better alternatives have been found. - _Experimental_ APIs will be clearly marked (see the “How to Mark API State” section below). - _Experimental_ APIs may be changed or removed without notice, and developers should expect no stability guarantees. _Stable_ - APIs are considered to be _Stable_ if they are not marked as _Experimental_ or _Deprecated._ - APIs in this stage have been thoroughly tested and are considered ready for production use. - The recommended best practice is to not deprecate stable APIs. When writing an API, write it in such a way that it doesn’t need to be deprecated in the future. - _Stable_ APIs can be changed, but not in a breaking way. If breaking changes have to be made, _Stable_ APIs will always transition to _Deprecated_ before being broken/removed from the library. _Deprecated_ - APIs in this stage are no longer recommended for use and will be removed in a future version of ExecuTorch. - _Deprecated_ APIs will be clearly marked (see the “How to Mark API State” section below). - _Deprecated_ APIs will remain functional for at least the _deprecation period_ (see the “Deprecation Period” section below) to allow developers time to migrate to alternative APIs. _Deleted_ - APIs whose removal are made permanent. Cleaned up from both code and documentation. ## Deprecation Policy Follow these steps to deprecate and remove an API: 1. Discuss the change and collect initial feedback. 2. Clearly mark the API deprecated in code and documentation (See “How to Mark API State” below). 3. Listen to user feedback after the first release that deprecates the API. Users who weren't involved in the original discussion may have good arguments for not deprecating or removing the API. 4. Once the deprecation period has passed, the API may be removed (See “Deprecation Period” below). Be sure to also remove references from the documentation. We also use deprecation as a way to make breaking changes to an existing interface: for example, if adding a non-optional parameter to a method. To do this without breaking existing users: 1. In a single commit: - Create a new API that meets the new requirements. - Deprecate the old API and recommend that users move to the new API. 2. Migrate use cases from the old API to the new API. 3. Delete the old API after the deprecation period. ## How to Mark API State When possible, the ExecuTorch code uses language-standard ways to annotate API lifecycle state in the code. This makes it easier for IDEs and other tools to communicate state to developers.
**Python**

- Deprecated: use the `executorch.exir._warnings.deprecated` decorator in code, and add a `.. warning::` block to the docstring.
- Experimental: use the `executorch.exir._warnings.experimental` decorator in code, and add a `.. warning::` block to the docstring.

**C++**

- Deprecated: use the `ET_DEPRECATED` annotation macro, and start the Doxygen comment with `DEPRECATED:`.
- Experimental: use the `ET_EXPERIMENTAL` annotation macro, and start the Doxygen comment with `EXPERIMENTAL:`.

**Java**

- Deprecated: use `java.lang.Deprecated`, and document with `/** @deprecated Use {@link #newMethod()} instead. */`
- Experimental: use `androidx.annotation.RequiresOptIn`, and document with `/** Warning: This API is experimental. */`

**Objective-C**

- Deprecated: annotate with `__attribute__((deprecated("Use newMethod instead")));`, and document with `` /** @deprecated Use `newMethod` instead. */ ``
- Experimental: annotate with `__attribute__((deprecated("This API is experimental and may change without notice.")));`, and document with `/** @experimental This API is experimental. */`

**Swift**

- Deprecated: annotate with `@available(*, deprecated, message: "Use newMethod instead")`, and document with `` /// - Warning: Deprecated. Use `newMethod()` instead. ``
- Experimental: annotate with `@available(*, message: "This API is experimental")`, and document with `/// - Warning: This API is experimental.`
The annotations would trigger static and/or runtime warning that contains at least these information: 1. Clearly point to the non-deprecated alternative to migrate to, or be clear if there is no alternative; 2. Specify the earliest version in which the API may actually be removed (See “Deprecation Period” below). ## Deprecation Period Here we recommend waiting for at least 2 minor releases before the removal. For example, if a function is marked as _deprecated_ in release 1.3.x, then it can be _deleted_ in 1.5.x or later. --- (api-section)= # API In this section, find complete API documentation for ExecuTorch's export, runtime, and extension interfaces. Includes comprehensive references for Python, C++, and Java APIs across all supported platforms. - {doc}`export-to-executorch-api-reference` — Export to ExecuTorch API Reference - {doc}`executorch-runtime-api-reference` — ExecuTorch Runtime API Reference - {doc}`runtime-python-api-reference` — Runtime Python API Reference - {doc}`api-life-cycle` — API Life Cycle - [Android doc →](https://pytorch.org/executorch/main/javadoc/) — Android API Documentation - {doc}`extension-module` — Extension Module - {doc}`extension-tensor` — Extension Tensor - {doc}`running-a-model-cpp-tutorial` — Detailed C++ Runtime APIs Tutorial ```{toctree} :hidden: :maxdepth: 1 :caption: API Reference export-to-executorch-api-reference executorch-runtime-api-reference runtime-python-api-reference api-life-cycle extension-module extension-tensor running-a-model-cpp-tutorial --- # Cadence Xtensa Backend (Legacy / Outdated) ```{warning} **⚠️ THIS DOCUMENTATION IS OUTDATED AND NO LONGER MAINTAINED** **For current Cadence backend documentation and support:** - Please refer to the up-to-date documentation in [backends-cadence.md](../backends-cadence.md) ``` --- # Cadence Xtensa Backend In this tutorial we will walk you through the process of getting setup to build ExecuTorch for an Xtensa HiFi4 DSP and running a simple model on it. [Cadence](https://www.cadence.com/en_US/home.html) is both a hardware and software vendor, providing solutions for many computational workloads, including to run on power-limited embedded devices. The [Xtensa HiFi4 DSP](https://www.cadence.com/en_US/home/tools/ip/tensilica-ip/hifi-dsps/hifi-4.html) is a Digital Signal Processor (DSP) that is optimized for running audio based neural networks such as wake word detection, Automatic Speech Recognition (ASR), etc. In addition to the chip, the HiFi4 Neural Network Library ([nnlib](https://github.com/foss-xtensa/nnlib-hifi4)) offers an optimized set of library functions commonly used in NN processing that we utilize in this example to demonstrate how common operations can be accelerated. On top of being able to run on the Xtensa HiFi4 DSP, another goal of this tutorial is to demonstrate how portable ExecuTorch is and its ability to run on a low-power embedded device such as the Xtensa HiFi4 DSP. This workflow does not require any delegates, it uses custom operators and compiler passes to enhance the model and make it more suitable to running on Xtensa HiFi4 DSPs. A custom [quantizer](https://pytorch.org/tutorials/prototype/quantization_in_pytorch_2_0_export_tutorial.html) is used to represent activations and weights as `uint8` instead of `float`, and call appropriate operators. Finally, custom kernels optimized with Xtensa intrinsics provide runtime acceleration. 
::::{grid} 2 :::{grid-item-card} What you will learn in this tutorial: :class-card: card-prerequisites * In this tutorial you will learn how to export a quantized model with a linear operation targeted for the Xtensa HiFi4 DSP. * You will also learn how to compile and deploy the ExecuTorch runtime with the kernels required for running the quantized model generated in the previous step on the Xtensa HiFi4 DSP. ::: :::{grid-item-card} Tutorials we recommend you complete before this: :class-card: card-prerequisites * [Introduction to ExecuTorch](intro-how-it-works.md) * [Getting Started](getting-started.md) * [Building ExecuTorch with CMake](using-executorch-building-from-source.md) ::: :::: ```{note} The linux part of this tutorial has been designed and tested on Ubuntu 22.04 LTS, and requires glibc 2.34. Workarounds are available for other distributions, but will not be covered in this tutorial. ``` ## Prerequisites (Hardware and Software) In order to be able to succesfully build and run ExecuTorch on a Xtensa HiFi4 DSP you'll need the following hardware and software components. ### Hardware - [i.MX RT600 Evaluation Kit](https://www.nxp.com/design/development-boards/i-mx-evaluation-and-development-boards/i-mx-rt600-evaluation-kit:MIMXRT685-EVK) ### Software - x86-64 Linux system (For compiling the DSP binaries) - [MCUXpresso IDE](https://www.nxp.com/design/software/development-software/mcuxpresso-software-and-tools-/mcuxpresso-integrated-development-environment-ide:MCUXpresso-IDE) - This IDE is supported on multiple platforms including MacOS. You can use it on any of the supported platforms as you'll only be using this to flash the board with the DSP images that you'll be building later on in this tutorial. - [J-Link](https://www.segger.com/downloads/jlink/) - Needed to flash the board with the firmware images. You can install this on the same platform that you installed the MCUXpresso IDE on. - Note: depending on the version of the NXP board, another probe than JLink might be installed. In any case, flashing is done using the MCUXpresso IDE in a similar way. - [MCUXpresso SDK](https://mcuxpresso.nxp.com/en/select?device=EVK-MIMXRT685) - Download this SDK to your Linux machine, extract it and take a note of the path where you store it. You'll need this later. - [Xtensa compiler](https://tensilicatools.com/platform/i-mx-rt600/) - Download this to your Linux machine. This is needed to build ExecuTorch for the HiFi4 DSP. - For cases with optimized kernels, the [nnlib repo](https://github.com/foss-xtensa/nnlib-hifi4). ## Setting up Developer Environment Step 1. In order to be able to successfully install all the software components specified above users will need to go through the NXP tutorial linked below. Although the tutorial itself walks through a Windows setup, most of the steps translate over to a Linux installation too. [NXP tutorial on setting up the board and dev environment](https://www.nxp.com/document/guide/getting-started-with-i-mx-rt600-evaluation-kit:GS-MIMXRT685-EVK?section=plug-it-in) ```{note} Before proceeding forward to the next section users should be able to succesfullly flash the **dsp_mu_polling_cm33** sample application from the tutorial above and notice output on the UART console indicating that the Cortex-M33 and HiFi4 DSP are talking to each other. ``` Step 2. Make sure you have completed the ExecuTorch setup tutorials linked to at the top of this page. 
## Working Tree Description The working tree is: ``` executorch ├── backends │ └── cadence │ ├── aot │ ├── ops_registration │ ├── tests │ ├── utils │ ├── hifi │ │ ├── kernels │ │ ├── operators │ │ └── third-party │ │ └── hifi4-nnlib │ └── [other cadence DSP families] │ ├── kernels │ ├── operators │ └── third-party │ └── [any required lib] └── examples └── cadence ├── models └── operators ``` ***AoT (Ahead-of-Time) Components***: The AoT folder contains all of the python scripts and functions needed to export the model to an ExecuTorch `.pte` file. In our case, [export_example.py](https://github.com/pytorch/executorch/blob/main/backends/cadence/aot/export_example.py) is an API that takes a model (nn.Module) and representative inputs and runs it through the quantizer (from [quantizer.py](https://github.com/pytorch/executorch/blob/main/backends/cadence/aot/quantizer/quantizer.py)). Then a few compiler passes, also defined in [quantizer.py](https://github.com/pytorch/executorch/blob/main/backends/cadence/aot/quantizer/quantizer.py), will replace operators with custom ones that are supported and optimized on the chip. Any operator needed to compute things should be defined in [ops_registrations.py](https://github.com/pytorch/executorch/blob/main/backends/cadence/aot/ops_registrations.py) and have corresponding implemetations in the other folders. ***Operators***: The operators folder contains two kinds of operators: existing operators from the [ExecuTorch portable library](https://github.com/pytorch/executorch/tree/main/kernels/portable/cpu) and new operators that define custom computations. The former is simply dispatching the operator to the relevant ExecuTorch implementation, while the latter acts as an interface, setting up everything needed for the custom kernels to compute the outputs. ***Kernels***: The kernels folder contains the optimized kernels that will run on the HiFi4 chip. They use Xtensa intrinsics to deliver high performance at low-power. ## Build In this step, you will generate the ExecuTorch program from different models. You'll then use this Program (the `.pte` file) during the runtime build step to bake this Program into the DSP image. ***Simple Model***: The first, simple model is meant to test that all components of this tutorial are working properly, and simply does an add operation. The generated file is called `add.pte`. ```bash cd executorch python3 -m examples.portable.scripts.export --model_name="add" ``` ***Quantized Operators***: The other, more complex model are custom operators, including: - a quantized [linear](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html) operation. The model is defined [here](https://github.com/pytorch/executorch/blob/main/examples/cadence/operators/test_quantized_linear_op.py#L30). Linear is the backbone of most Automatic Speech Recognition (ASR) models. - a quantized [conv1d](https://pytorch.org/docs/stable/generated/torch.nn.Conv1d.html) operation. The model is defined [here](https://github.com/pytorch/executorch/blob/main/examples/cadence/operators/test_quantized_conv1d_op.py#L40). Convolutions are important in wake word and many denoising models. In both cases the generated file is called `CadenceDemoModel.pte`. 
```bash cd executorch python3 -m examples.cadence.operators.quantized__op ``` ***Small Model: RNNT predictor***: The torchaudio [RNNT-emformer](https://pytorch.org/audio/stable/tutorials/online_asr_tutorial.html) model is an Automatic Speech Recognition (ASR) model, comprised of three different submodels: an encoder, a predictor and a joiner. The [predictor](https://github.com/pytorch/executorch/blob/main/examples/cadence/models/rnnt_predictor.py) is a sequence of basic ops (embedding, ReLU, linear, layer norm) and can be exported using: ```bash cd executorch python3 -m examples.cadence.models.rnnt_predictor ``` The generated file is called `CadenceDemoModel.pte`. ### Runtime **Building the DSP firmware image** In this step, you'll be building the DSP firmware image that consists of the sample ExecuTorch runner along with the Program generated from the previous step. This image when loaded onto the DSP will run through the model that this Program consists of. ***Step 1***. Configure the environment variables needed to point to the Xtensa toolchain that you have installed in the previous step. The three environment variables that need to be set include: ```bash # Directory in which the Xtensa toolchain was installed export XTENSA_TOOLCHAIN=/home/user_name/cadence/XtDevTools/install/tools # The version of the toolchain that was installed. This is essentially the name of the directory # that is present in the XTENSA_TOOLCHAIN directory from above. export TOOLCHAIN_VER=RI-2021.8-linux # The Xtensa core that you're targeting. export XTENSA_CORE=nxp_rt600_RI2021_8_newlib ``` ***Step 2***. Clone the [nnlib repo](https://github.com/foss-xtensa/nnlib-hifi4), which contains optimized kernels and primitives for HiFi4 DSPs, with `git clone git@github.com:foss-xtensa/nnlib-hifi4.git`. ***Step 3***. Run the CMake build. In order to run the CMake build, you need the path to the following: - The Program generated in the previous step - Path to the NXP SDK root. This should have been installed already in the [Setting up Developer Environment](#setting-up-developer-environment) section. This is the directory that contains the folders such as boards, components, devices, and other. ```bash cd executorch ./install_executorch.sh --clean mkdir cmake-out # prebuild and install executorch library cmake -DCMAKE_TOOLCHAIN_FILE=/backends/cadence/cadence.cmake \ -DCMAKE_INSTALL_PREFIX=cmake-out \ -DCMAKE_BUILD_TYPE=Debug \ -DPYTHON_EXECUTABLE=python3 \ -DEXECUTORCH_BUILD_EXTENSION_RUNNER_UTIL=ON \ -DEXECUTORCH_BUILD_EXECUTOR_RUNNER=OFF \ -DEXECUTORCH_BUILD_PTHREADPOOL=OFF \ -DEXECUTORCH_BUILD_CPUINFO=OFF \ -Bcmake-out . cmake --build cmake-out -j --target install --config Debug # build cadence runner cmake -DCMAKE_BUILD_TYPE=Debug \ -DCMAKE_TOOLCHAIN_FILE=/examples/backends/cadence.cmake \ -DCMAKE_PREFIX_PATH=/cmake-out \ -DMODEL_PATH= \ -DNXP_SDK_ROOT_DIR= \ -DNN_LIB_BASE_DIR= \ -Bcmake-out/examples/cadence \ examples/cadence cmake --build cmake-out/examples/cadence -j8 -t cadence_executorch_example ``` After having succesfully run the above step you should see two binary files in their CMake output directory. ```bash > ls cmake-xt/*.bin cmake-xt/dsp_data_release.bin cmake-xt/dsp_text_release.bin ``` ## Deploying and Running on Device ***Step 1***. You now take the DSP binary images generated from the previous step and copy them over into your NXP workspace created in the [Setting up Developer Environment](#setting-up-developer-environment) section. 
Copy the DSP images into the `dsp_binary` section highlighted in the image below. ![MCUXpresso IDE](../_static/img/dsp_binary.png) ```{note} As long as binaries have been built using the Xtensa toolchain on Linux, flashing the board and running on the chip can be done only with the MCUXpresso IDE, which is available on all platforms (Linux, MacOS, Windows). ``` ***Step 2***. Clean your work space ***Step 3***. Click **Debug your Project** which will flash the board with your binaries. On the UART console connected to your board (at a default baud rate of 115200), you should see an output similar to this: ```bash > screen /dev/tty.usbmodem0007288234991 115200 Executed model Model executed successfully. First 20 elements of output 0 0.165528 0.331055 ... ``` ## Conclusion and Future Work In this tutorial, you have learned how to export a quantized operation, build the ExecuTorch runtime and run this model on the Xtensa HiFi4 DSP chip. The (quantized linear) model in this tutorial is a typical operation appearing in ASR models, and can be extended to a complete ASR model by creating the model as a new test and adding the needed operators/kernels to [operators](https://github.com/pytorch/executorch/blob/main/backends/cadence/hifi/operators) and [kernels](https://github.com/pytorch/executorch/blob/main/backends/cadence/hifi/kernels). Other models can be created following the same structure, always assuming that operators and kernels are available. --- (backend-delegate-advanced)= # Backend & Delegates ## Integration - {doc}`backend-delegates-integration` — Learn how to integrate a backend delegate into ExecuTorch ## Dependency Management - {doc}`backend-delegates-dependencies` — Manage third-party dependencies for backend delegates ## Overview - {doc}`compiler-delegate-and-partitioner` — Understanding backends, delegates, and the partitioner system ## Debugging - {doc}`debug-backend-delegate` — Tools and techniques for debugging delegation issues ```{toctree} :hidden: :maxdepth: 1 backend-delegates-integration backend-delegates-dependencies compiler-delegate-and-partitioner debug-backend-delegate --- # Third-Party Dependency Management for Backend Delegates Disclaimer: We are planning to restructure the repository around delegates. With that some of these guidelines will change in the future. A delegate may depend on external, third-party libraries to efficiently implement ahead-of-time (AOT) `partition()` or `preprocess()` functions, and/or to implement runtime functions such as `init()` or `execute()`, or to run tests in a specific manner. This guide aims to classify different types of third-party dependencies a delegate might depend on, and provide a high level guidance on how to include them. ## Ahead-of-Time Dependencies This includes dependencies used by the delegate's `partitioner()` and `preprocess()` functions to generate preprocessed result which will be used at later at runtime. Depending on how the `preprocess()` function is implemented this can be either Python or C++ dependency. This guide will talk about only Python AOT dependencies. **Guidelines:** * If ExecuTorch already includes a dependency you require, prefer to use that if possible. * If the dependency is only needed by the files inside the `executorch/backends//` directory, it should be introduced in a way such that it is used only by the code under that directory. * The dependency should not be installed by default when installing the ExecuTorch Python package. More details in the section [below](#python-dependencies). 
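As an illustration of keeping an AOT dependency scoped to the backend and optional at install time, a partitioner or `preprocess()` module can guard the import and fail with a clear message only when the delegate is actually used. This is a hedged sketch; `some_vendor_sdk`, the module path, and the error text are hypothetical.

```python
# Hypothetical module, e.g. executorch/backends/<delegate_name>/preprocess_utils.py.
# The third-party package `some_vendor_sdk` is a placeholder; it would be listed in
# executorch/backends/<delegate_name>/requirements.txt rather than installed with the
# core ExecuTorch package.

def _load_vendor_sdk():
    """Import the optional AOT dependency only when this delegate is used."""
    try:
        import some_vendor_sdk  # not installed by default with ExecuTorch
    except ImportError as e:
        raise ImportError(
            "This backend requires `some_vendor_sdk`. Install it with "
            "`pip install -r executorch/backends/<delegate_name>/requirements.txt`."
        ) from e
    return some_vendor_sdk
```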
## Runtime Dependencies This category covers C++ dependencies used by the delegate runtime code. It can be as simple as a third-party math library to implement some delegate operator, or can be a whole framework handling the lowered subgraph for the delegate. **Guidelines:** At a high level, "only pay for what you use" should be the desired approach for these third-party dependencies. * Similar to the AOT dependencies, the use of this should also be restricted to only the delegate runtime source files. * If a delegate has a dependency which is already part of `executorch/third-party` then try to use that if possible. This helps with reducing the binary size when the delegate is enabled. * The rest of the ExecuTorch code, outside of the delegate, should not depend on this. And it should build and run correctly without this dependency when the delegate is disabled at build time. More details in the section [below](#runtime-dependencies). ## Testing-Only Dependencies Some libraries or tools are only used for executing the delegate tests. These can either be a Python dependency or a C++ dependency depending on the type of the test. **Guidelines:** * For a Python test dependency, it should not be installed by default when installing the ExecuTorch Python package. * For a C++ test dependency, it should not be part of the ExecuTorch runtime even when the delegate is built/enabled. ## Other Considerations ### Versioning Explicit and specific is preferred. For example a PyPI version (or range) or a git tag/release. ### Documenting Dependencies At a minimum, some documentation under `executorch/backends//` should be provided when introducing a new dependency which includes, * Rationale for introducing a new third-party dependency * How to upgrade the dependency * Any special considerations for the new dependency *** After listing the high level guidelines, let's now talk about specific logistics to actually include a dependency for your delegate, ## Python Dependencies Python packaging is complicated and continuously evolving. For delegate dependencies, we recommend that a delegate specifies its third-party dependencies under `executorch/backends//requirements.txt` to be supplied to pip at installation time. The goal is to decouple them from the core ExecuTorch dependencies. Version conflicts should be avoided by trying to use the dependency already included by ExecuTorch or by some other backend if possible. Otherwise try some other [recommended](https://pip.pypa.io/en/latest/topics/dependency-resolution/#dealing-with-dependency-conflicts) ways to mitigate version conflicts. #### Local Python Packages If it is a git repository, it should be added as a git submodule. ## C++ Dependencies The recommended approach is to include a git submodule for a given C++ dependency in the `executorch/backends//third-party` directory. ### CMake Support At a minimum CMake support is required. --- # Integrating a Backend Delegate into ExecuTorch Disclaimer: We are planning to restructure the repository around delegates. With that some of these guidelines will change in the future. This is a high level guideline when integrating a backend delegate with ExecuTorch. ## Directory Structure Delegate files should be under this directory: `executorch/backends//`. The delegate name should be unique. 
## Python Source Files Delegate Python files such as those implementing `preprocess()` or `partition()` functions for ExecuTorch AOT flow, excluding any external third-party dependencies and their files, should be installed and available with the top level ExecuTorch package. For third-party dependencies, please refer to [this](backend-delegates-dependencies.md). ## C++ Source Files At a minimum, a delegate must provide CMake support for building its C++ sources. For the CMake setup: - The delegate directory should be included by the top-level `CMakeLists.txt` file using the `add_subdirectory` command. - It should be built conditionally using an ExecuTorch build flag like `EXECUTORCH_BUILD_`. (See `EXECUTORCH_BUILD_XNNPACK` for an example.) For third-party dependencies, please refer to [this](backend-delegates-dependencies.md). ## Tests Tests should be added under `executorch/backends//test`. Tests can be either python or C++ tests. For adding more complex end-to-end (e2e) tests, please reach out to us. Common test types: * Simple python unit tests that test AOT logic such as `partitioner()` or AOT export flow (generating a `.pte` file from an `nn.Module`) * Runtime C++ tests, using gtest, that test delegate `init()` or `execute()` runtime logic. ## Documentation A delegate must include: - `executorch/backends//README.md` – covering the basics of the delegate, its directory structure, features, and any known issues. - `executorch/backends//setup.md` – documenting any additional setup steps beyond the ones listed above. --- # Backend Development ```{toctree} :maxdepth: 1 backend-delegates-integration backend-delegates-dependencies compiler-delegate-and-partitioner debug-backend-delegate ``` --- # Arm Ethos-U Backend The Arm® Ethos™-U backend targets Edge/IoT-type AI use-cases by enabling optimal execution of quantized models on [Arm® Ethos™-U55 NPU](https://www.arm.com/products/silicon-ip-cpu/ethos/ethos-u55), [Arm® Ethos™-U65 NPU](https://www.arm.com/products/silicon-ip-cpu/ethos/ethos-u65), and [Arm® Ethos™-U85 NPU](https://www.arm.com/products/silicon-ip-cpu/ethos/ethos-u85), leveraging [TOSA](https://www.mlplatform.org/tosa/) and the [ethos-u-vela](https://pypi.org/project/ethos-u-vela/) graph compiler. This document is a technical reference for using the Ethos-U backend, for a top level view with code examples please refer to the [Arm Ethos-U Backend Tutorial](https://docs.pytorch.org/executorch/stable/tutorial-arm-ethos-u.html). ## Features - Wide operator support for delegating large parts of models to highly optimized and low power Ethos-U NPUs. - A quantizer that optimizes quantization for the NPU target. - Example runtime integration for easy hardware bringup. ## Target Requirements The target system must include an Ethos-U NPU. ## Development Requirements ```{tip} All requirements can be downloaded using `examples/arm/setup.sh --i-agree-to-the-contained-eula` and added to the path using set(CMAKE_INSTALL_PREFIX "${CMAKE_BINARY_DIR}") `source examples/arm/arm-scratch/setup_path.sh`. Note that this means accepting the End-User License Agreements (EULA:s) required for using the downloaded software. ``` For the AOT flow, compilation of a model to `.pte` format using the Ethos-U backend, the requirements are: - [TOSA Serialization Library](https://www.mlplatform.org/tosa/software.html) for serializing the Exir IR graph into TOSA IR. - [Ethos-U Vela graph compiler](https://pypi.org/project/ethos-u-vela/) for compiling TOSA flatbuffers into an Ethos-U command stream. 
And for building and running the example application available in `examples/arm/executor_runner/`:

- [Arm GNU Toolchain](https://developer.arm.com/Tools%20and%20Software/GNU%20Toolchain) for cross compilation.
- [Arm® Corstone™ SSE-300 FVP](https://developer.arm.com/documentation/100966/1128/Arm--Corstone-SSE-300-FVP) for testing on an Arm® Cortex®-M55+Ethos-U55 reference design.
- [Arm® Corstone™ SSE-320 FVP](https://developer.arm.com/documentation/109760/0000/SSE-320-FVP) for testing on an Arm® Cortex®-M85+Ethos-U85 reference design.

Fixed Virtual Platforms (FVPs) are freely available emulators provided by Arm for easy embedded development without the need for a physical development board.

## Using the Arm Ethos-U Backend

The main configuration point for the lowering is the `EthosUCompileSpec` consumed by the partitioner and quantizer. The full user-facing API is documented below.

```python
class EthosUCompileSpec(target: str, system_config: str | None = None, memory_mode: str | None = None, extra_flags: list[str] | None = None, config_ini: str | None = 'Arm/vela.ini')
```

Compile spec for the Ethos-U NPU.

Args:
- **target**: Ethos-U accelerator configuration, e.g. ethos-u55-128.
- **system_config**: System configuration to select from the Vela configuration file.
- **memory_mode**: Memory mode to select from the Vela configuration file.
- **extra_flags**: Extra flags for the Vela compiler.
- **config_ini**: Vela configuration file(s) in Python ConfigParser .ini file format.

```python
def EthosUCompileSpec.dump_debug_info(self, debug_mode: executorch.backends.arm.common.arm_compile_spec.ArmCompileSpec.DebugMode | None):
```

Dump debugging information into the intermediates path.

Args:
- **debug_mode**: The debug mode to use for dumping debug information.

```python
def EthosUCompileSpec.dump_intermediate_artifacts_to(self, output_path: str | None):
```

Sets a path for dumping intermediate results produced during lowering, such as TOSA and PTE files.

Args:
- **output_path**: Path to dump intermediate results to.

```python
def EthosUCompileSpec.get_intermediate_path(self) -> str | None:
```

Gets the path used for dumping intermediate results such as TOSA and PTE files.

Returns: Path where intermediate results are saved.

```python
def EthosUCompileSpec.get_output_format() -> str:
```

Returns a constant string that is the output format of the class.

### Partitioner API

See [Partitioner API](arm-ethos-u-partitioner.md) for more information about the Partitioner API.

## Quantization

Since the Ethos-U backend is integer-only, all operators intended to be executed on the NPU need to be quantized. The Ethos-U quantizer supports [Post Training Quantization (PT2E)](https://docs.pytorch.org/ao/main/tutorials_source/pt2e_quant_ptq.html) and [Quantization-Aware Training (QAT)](https://docs.pytorch.org/ao/main/tutorials_source/pt2e_quant_qat.html). For more information on quantization, see [Quantization](arm-ethos-u-quantization.md).

## Runtime Integration

An example runtime application is available in [examples/arm/executor_runner](https://github.com/pytorch/executorch/blob/main/examples/arm/executor_runner/), and the steps required for building and deploying it on an FVP are explained in the previously mentioned [Arm Ethos-U Backend Tutorial](https://docs.pytorch.org/executorch/stable/tutorial-arm-ethos-u.html). The example application is recommended for testing basic functionality of your lowered models, as well as a starting point for developing runtime integrations for your own targets.
For an in-depth explanation of the architecture of the executor_runner and the steps required for doing such an integration, please refer to [Ethos-U porting guide](https://github.com/pytorch/executorch/blob/main/examples/arm/ethos-u-porting-guide.md). ### Ethos-U memory modes The Ethos-U NPU provides two distinct memory interfaces: - One interface for **low-latency, high-bandwidth memory**. - On all Ethos-U NPUs(Ethos-U55, Ethos-U65, Ethos-U85), the low-latency memory is usually the SRAM of the SoC. - One interface for **higher-latency, lower-bandwidth memory**, typically external (off-chip) memory. - On a low-power microcontroller, the external memory is usually Flash. - On systems with Arm® Cortex™-A and a rich operating system, the external memory is typically DRAM. When running an inference, the Ethos-U compiler and Ethos-U driver make use of three logical memory regions: - Ethos-U scratch buffer - a contiguous block of memory used by the NPU to store the intermediate tensors produced and consumed during inference. - Neural Network - a contiguous block of memory holding constant data such as weights, biases, quantization parameters required to run an inference. - Ethos-U fast scratch buffer - a contiguous block of memory, assumed to reside in on-chip memory in order to hide the higher latency/lower bandwidth of external memory. Only applicable for Ethos-U65 and Ethos-U85 on systems with Cortex-A and the external memory is assumed to be DRAM. The placement of the scratch buffer and the Neural Network determine the memory mode to be used in the `EthosUCompileSpec` and when building the executor_runner. Three different memory modes are supported: | Memory Mode | Ethos-U Scratch Buffer Placement | Neural Network Placement | When to Use | Trade-off | |--------------------|----------------------------------|----------------------------|------------ |---------------------------------------------------------------------------| | **SRAM-Only** | On-chip SRAM | On-chip SRAM | When the ML model, the Ethos-U scratch buffer and the wider software stack fit within the SRAM of the SoC | Limited by SRAM size; often not feasible for larger NNs | | **Shared-SRAM** | On-chip SRAM | External memory (Flash/DRAM) | Most common mode on Cortex-M and Ethos-U systems; balances good performance and SRAM usage | Requires enough SRAM to hold the largest intermediate tensor | | **Dedicated-SRAM** | External memory | External memory (Flash/DRAM) | Most common mode for Cortex-A and Ethos-U systems. For very large models where the peak intermediates cannot fit in SRAM | Need high-bandwidth external memory to deliver good performance | Here is an in-depth explanation of the different modes: #### 1. Sram-Only Memory Mode - Ethos-U scratch buffer resides in the SRAM. - Neural Network resides in the SRAM. - Ethos-U fast scratch buffer is not used. - Characteristics: - Provides the best performance since all the memory traffic passes via the low-latency/high-bandwidth memory. - The performance uplift is especially noticeable on memory-bound workloads on the external interface. - Available on Ethos-U55, Ethos-U65 and Ethos-U85. - Limitations: - Embedded SoCs often have limited SRAM and NNs are becoming larger. This memory mode may be unsuitable for a system running a big model relative to the amount of SRAM available on the SoC. Below, you can see a visual representation of the placement of the two logical memory regions for the Sram Only configuration. ![](backend-arm-ethos-u-sram_only.png) #### 2. 
#### 2. Shared-Sram Memory Mode

- Ethos-U scratch buffer resides in the SRAM.
- Neural Network resides in the External memory.
- Ethos-U fast scratch buffer is not used.
- Characteristics:
  - Intermediate tensors are stored in the SRAM, leveraging its low latency and high bandwidth.
  - The Ethos-U compiler can prefetch weights from the external memory to the SRAM ahead of time so that when the NPU needs the data, it will already be available in the on-chip memory.
  - In this mode, the external interface is Read-Only, while the on-chip memory interface is Read/Write.
  - Shared-Sram offers a great balance between performance and low SRAM usage.
  - Available on Ethos-U55, Ethos-U65 and Ethos-U85.
- Limitations:
  - You need to have enough space in the SRAM to hold the peak intermediate tensor.

Below, you can see a visual representation of the placement of the two logical memory regions for the Shared_Sram configuration.

![](backend-arm-ethos-u-shared_sram.png)

#### 3. Dedicated-Sram Memory Mode

- Ethos-U scratch buffer resides in the External memory.
- Neural Network resides in the External memory.
- Ethos-U fast scratch buffer resides in the on-chip memory.
- Characteristics:
  - Used when the peak intermediate tensor is too big to fit into the on-chip memory.
  - Enables silicon acceleration of large models.
  - The NPU stores the results from the intermediate computations in the external memory.
  - The dedicated SRAM acts as a software-managed cache, improving performance by pre-fetching frequently accessed tensors to the on-chip memory.
  - Available on Ethos-U65 and Ethos-U85.
- Limitations:
  - The SRAM space must be dedicated exclusively to the Ethos-U (the host processor should not access it).
  - Not available on Ethos-U55.

Below, you can see a visual representation of the placement of the two logical memory regions for the Dedicated_Sram configuration.

![](backend-arm-ethos-u-dedicated_sram.png)

The memory modes are defined within the [vela.ini file](https://gitlab.arm.com/artificial-intelligence/ethos-u/ethos-u-vela/-/blob/main/ethosu/config_files/Arm/vela.ini?ref_type=heads). When you install ExecuTorch for the Ethos-U backend, you automatically install the compiler containing the vela.ini file, so you can directly create a compile specification with these memory modes.

## Reference

**→{doc}`/backends/arm-ethos-u/arm-ethos-u-partitioner` — Partitioner options.**

**→{doc}`/backends/arm-ethos-u/arm-ethos-u-quantization` — Supported quantization schemes.**

**→{doc}`/backends/arm-ethos-u/arm-ethos-u-troubleshooting` — Troubleshooting and common issues.**

**→{doc}`/backends/arm-ethos-u/tutorials/arm-ethos-u-tutorials` — Tutorials.**

```{toctree}
:maxdepth: 2
:hidden:
:caption: Arm Ethos-U Backend

arm-ethos-u-partitioner
arm-ethos-u-quantization
arm-ethos-u-troubleshooting
tutorials/arm-ethos-u-tutorials
```

---

# Partitioner API

The `EthosUPartitioner` controls which parts of a model are delegated to the Arm Ethos-U backend. Below is a reference of the various functions the partitioner provides:

```python
class EthosUPartitioner(compile_spec: executorch.backends.arm.ethosu.compile_spec.EthosUCompileSpec, additional_checks: Optional[Sequence[torch.fx.passes.operator_support.OperatorSupportBase]] = None) -> None
```

Partitions subgraphs supported by the Arm Ethos-U backend.

Args:
- **compile_spec**: List of CompileSpec objects for Ethos-U backend.
- **additional_checks**: Optional sequence of additional operator support checks.
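A minimal usage sketch is shown below. The target settings are only illustrative and mirror the ones used in the Ethos-U getting-started tutorial; `exported_program` is assumed to be an already-quantized `torch.export.ExportedProgram`.

```python
from executorch.backends.arm.ethosu import EthosUCompileSpec, EthosUPartitioner
from executorch.exir import EdgeCompileConfig, to_edge_transform_and_lower

# Illustrative target settings; see the EthosUCompileSpec reference for all options.
compile_spec = EthosUCompileSpec(
    target="ethos-u55-128",
    system_config="Ethos_U55_High_End_Embedded",
    memory_mode="Shared_Sram",
)
partitioner = EthosUPartitioner(compile_spec)

# exported_program is assumed to be a quantized torch.export.ExportedProgram
# (see the Quantization page for how to produce one).
edge_program_manager = to_edge_transform_and_lower(
    exported_program,
    partitioner=[partitioner],
    compile_config=EdgeCompileConfig(_check_ir_validity=False),
)
```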
```python
def EthosUPartitioner.ops_to_not_decompose(self, ep: torch.export.exported_program.ExportedProgram) -> Tuple[List[torch._ops.OpOverload], Optional[Callable[[torch.fx.node.Node], bool]]]:
```

Return operators and a filter that should not be decomposed.

Provide a base set of ops to preserve as-is and a predicate that keeps certain activations whole when surrounded by quantize/dequantize ops in a quantized graph. This helps downstream TOSA lowering and delegation.

Args:
- **ep (ExportedProgram)**: Program used to infer target-specific policy.

Returns:
- **Tuple[List[torch._ops.OpOverload], Optional[Callable[[torch.fx.Node], bool]]]**: A list of op overloads to keep intact, and an optional filter function that returns True when an op should not be decomposed.

```python
def EthosUPartitioner.partition(self, exported_program: torch.export.exported_program.ExportedProgram) -> executorch.exir.backend.partitioner.PartitionResult:
```

Partition the program and tag TOSA-compatible subgraphs.

Run the FX capability-based partitioner to propose subgraphs, then refine tags by removing boundary-only quantize/dequantize nodes and by rejecting partitions that would lower to no-ops. Emit a detailed report of rejected nodes and their reasons.

Args:
- **exported_program (ExportedProgram)**: Program to analyze and partition.

Returns:
- **PartitionResult**: The input program with nodes tagged for delegation and a mapping of partition tags to delegation specs.

---

# Quantization

The Arm Ethos-U delegate only supports the execution of quantized models. To quantize a model so that it is supported by this delegate, the `EthosUQuantizer` should be used. Currently, the symmetric `int8` config defined by `executorch.backends.arm.quantizer.arm_quantizer.get_symmetric_quantization_config` is the main config available to use with the Ethos-U quantizer.

### Supported Quantization Schemes

The Arm Ethos-U delegate supports the following quantization schemes:

- 8-bit symmetric weights with 8-bit asymmetric activations (via the PT2E quantization flow).
- Limited support for 16-bit quantization with 16-bit activations and 8-bit weights (a.k.a. 16x8 quantization). This is under development.
- Partial quantization is *not* supported on the Ethos-U backend. The entire model must be quantized.

### Quantization API

```python
class EthosUQuantizer(compile_spec: 'EthosUCompileSpec') -> 'None'
```

Quantizer supported by the Arm Ethos-U backend.

Args:
- **compile_spec (EthosUCompileSpec)**: Backend compile specification for Ethos-U targets.

```python
def EthosUQuantizer.quantize_with_submodules(self, model: 'GraphModule', calibration_samples: 'list[tuple]', is_qat: 'bool' = False):
```

Quantizes a GraphModule in a way such that conditional submodules are handled properly.

Args:
- **model (GraphModule)**: The model to quantize.
- **calibration_samples (list[tuple])**: A list of inputs used to calibrate the model during quantization. To properly calibrate a model with submodules, at least one sample per code path is needed.
- **is_qat (bool)**: Whether to do quantization aware training or not.

Returns:
- **GraphModule**: The quantized model.

```python
def EthosUQuantizer.set_global(self, quantization_config: 'QuantizationConfig') -> 'TOSAQuantizer':
```

Set quantization_config for submodules not matched by other filters.

Args:
- **quantization_config (QuantizationConfig)**: Configuration to apply to modules that are not captured by name or type filters.
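For orientation, the sketch below shows where `set_global` fits in the typical PT2E flow from the getting-started tutorial. It assumes `compile_spec`, `graph_module`, and `example_inputs` are already defined as in that tutorial.

```python
from executorch.backends.arm.quantizer import (
    EthosUQuantizer,
    get_symmetric_quantization_config,
)
from torchao.quantization.pt2e.quantize_pt2e import convert_pt2e, prepare_pt2e

# Assumed to exist: compile_spec (EthosUCompileSpec), graph_module, example_inputs.
quantizer = EthosUQuantizer(compile_spec)
# Apply the symmetric int8 config to all nodes not matched by other filters.
quantizer.set_global(get_symmetric_quantization_config())

# Standard PT2E flow: prepare, calibrate, convert.
prepared_module = prepare_pt2e(graph_module, quantizer)
prepared_module(*example_inputs)
quantized_graph_module = convert_pt2e(prepared_module)
```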
```python
def EthosUQuantizer.set_io(self, quantization_config: 'QuantizationConfig') -> 'TOSAQuantizer':
```

Set quantization_config for input and output nodes.

Args:
- **quantization_config (QuantizationConfig)**: Configuration describing activation quantization for model inputs and outputs.

```python
def EthosUQuantizer.set_module_name(self, module_name: 'str', quantization_config: 'Optional[QuantizationConfig]') -> 'TOSAQuantizer':
```

Set quantization_config for submodules with a given module name. For example, calling set_module_name("blocks.sub") quantizes supported patterns for that submodule with the provided quantization_config.

Args:
- **module_name (str)**: Fully qualified module name to configure.
- **quantization_config (QuantizationConfig)**: Configuration to apply to the named submodule.

```python
def EthosUQuantizer.set_module_type(self, module_type: 'Callable', quantization_config: 'QuantizationConfig') -> 'TOSAQuantizer':
```

Set quantization_config for submodules with a given module type. For example, calling set_module_type(Sub) quantizes supported patterns in each Sub instance with the provided quantization_config.

Args:
- **module_type (Callable)**: Type whose submodules should use the provided quantization configuration.
- **quantization_config (QuantizationConfig)**: Configuration to apply to submodules of the given type.

```python
def EthosUQuantizer.transform_for_annotation(self, model: 'GraphModule') -> 'GraphModule':
```

Transform the graph to prepare it for quantization annotation. Currently transforms scalar values to tensor attributes.

Args:
- **model (GraphModule)**: Model whose graph will be transformed.

Returns:
- **GraphModule**: Transformed model prepared for annotation.

---

# Arm Ethos-U Troubleshooting

This page describes common issues that you may encounter when using the Arm Ethos-U backend and how to debug and resolve them.

## Understanding memory footprint using the Ethos-U compiler

As part of the `to_edge_transform_and_lower` step, you will see memory footprint information presented as:

```
Total SRAM used 2467.27 KiB
Total Off-chip Flash used 12.20 KiB
```

The `Total SRAM used` indicates the peak SRAM utilization needed by the NPU in order to perform an inference. In the snippet above, the Ethos-U compiler requires 2467.27 KiB of SRAM in order to schedule the inference. Therefore, from an application standpoint, you need to ensure you have at least 2467.27 KiB of SRAM on the SoC to run this model.

The Ethos-U compiler provides a scheduling algorithm that can lower the peak SRAM usage within reasonable limits; to use it, add the `--optimise Size` or `--arena-cache-size` CLI options to the compile spec (for example via `extra_flags`). You can read more about the options of the Ethos-U compiler in the documentation [here](https://gitlab.arm.com/artificial-intelligence/ethos-u/ethos-u-vela/-/blob/main/OPTIONS.md#optimise).

If the peak SRAM usage remains too high in the Shared_Sram memory mode, you would need to use the Dedicated_Sram mode in order to store the Neural Network and the Ethos-U scratch buffer in the external memory. The main advantage of the Dedicated_Sram memory mode is that you can run large models and still benefit from the low latency/high bandwidth of the SRAM, which is used as a cache.

It is important to highlight that when you specify a memory mode in the compile spec, the runtime application is expected to place the scratch buffer and NN in the correct memory locations. In other words, when you specify, for example, the
Shared_Sram memory mode, the runtime application logic should place the Ethos-U scratch buffer in the on-chip memory and the NN in the external memory for optimal performance. You can see how this coupling between the memory mode and the runtime application is done in the [Ethos-U porting guide](https://github.com/pytorch/executorch/blob/main/examples/arm/ethos-u-porting-guide.md).

## Using Bundled.io and ETdump

The arm_executor_runner supports the [bundled-io](https://docs.pytorch.org/executorch/0.4/bundled-io.html) and [ETdump](https://docs.pytorch.org/executorch/stable/etdump.html) debugging tools.

To enable bundled-io, set `EXECUTORCH_BUILD_DEVTOOLS` when building ExecuTorch and `-DET_BUNDLE_IO` when building the executor_runner.

To enable ETdump, set `EXECUTORCH_BUILD_ARM_ETDUMP` when building ExecuTorch and `-DEXECUTORCH_ENABLE_EVENT_TRACER` when building the executor_runner.

## Issues with memory formats

Tensors of rank 4 and higher have two differing [memory format](https://pytorch.org/blog/tensor-memory-format-matters/) standards in use. PyTorch defaults to the contiguous/channels-first/NCHW memory format, whereas TOSA only supports the channels-last/NHWC memory format. To support this, the backend inserts a transpose at the beginning if the incoming memory format is contiguous, and correspondingly a transpose at the end if the outgoing memory format is contiguous. Note that this means that you may avoid transposing the data unnecessarily if the runtime integration and the full network are converted to use channels last. A word of caution must be given here, however: changing memory format has been noted to have side effects, such as unsupported ops being inserted into the graph, and it is currently not widely tested, so the feature must so far be viewed as experimental.

---

# Arm Ethos-U Backend Tutorials

**→{doc}`ethos-u-getting-started`**

```{toctree}
:maxdepth: 2
:hidden:
:caption: Tutorials

ethos-u-getting-started
```

---

# Getting Started Tutorial

::::{grid} 2

:::{grid-item-card} Tutorials we recommend you complete before this:
:class-card: card-prerequisites
* [Introduction to ExecuTorch](intro-how-it-works.md)
* [Getting Started](getting-started.md)
* [Building ExecuTorch with CMake](using-executorch-building-from-source.md)
:::

:::{grid-item-card} What you will learn in this tutorial:
:class-card: card-prerequisites
In this tutorial you will learn how to export a simple PyTorch model for the ExecuTorch Ethos-U backend.
:::

::::

```{tip}
If you are already familiar with this delegate, you may want to jump directly to the examples:

* [Examples in the ExecuTorch repository](https://github.com/pytorch/executorch/tree/main/examples/arm)
* [A commandline compiler for example models](https://github.com/pytorch/executorch/blob/main/examples/arm/aot_arm_compiler.py)
```

This tutorial serves as an introduction to using ExecuTorch to deploy PyTorch models on Arm® Ethos™-U targets. It is based on `ethos_u_minimal_example.ipynb`, provided in Arm's examples folder.

## Prerequisites

### Hardware

To successfully complete this tutorial, you will need a Linux machine with aarch64 or x86_64 processor architecture, or a macOS™ machine with Apple® Silicon.
To enable development without a specific development board, we will be using a [Fixed Virtual Platform (FVP)](https://www.arm.com/products/development-tools/simulation/fixed-virtual-platforms), simulating [Arm® Corstone™-300](https://developer.arm.com/Processors/Corstone-300) (cs300) and [Arm® Corstone™-320](https://developer.arm.com/Processors/Corstone-320) (cs320) systems. Think of it as virtual hardware.

### Software

First, you will need to install ExecuTorch. Please follow the recommended tutorials to set up a working ExecuTorch development environment.

In addition to this, you need to install a number of SDK dependencies for generating Ethos-U command streams. Scripts to automate this are available in the main [ExecuTorch repository](https://github.com/pytorch/executorch/tree/main/examples/arm/). To install Ethos-U dependencies, run

```bash
./examples/arm/setup.sh --i-agree-to-the-contained-eula
```

This will install:

- [TOSA Serialization Library](https://www.mlplatform.org/tosa/software.html) for serializing the Exir IR graph into TOSA IR.
- [Ethos-U Vela graph compiler](https://pypi.org/project/ethos-u-vela/) for compiling TOSA flatbuffers into an Ethos-U command stream.
- [Arm GNU Toolchain](https://developer.arm.com/Tools%20and%20Software/GNU%20Toolchain) for cross compilation.
- [Corstone SSE-300 FVP](https://developer.arm.com/documentation/100966/1128/Arm--Corstone-SSE-300-FVP) for testing on an Ethos-U55 reference design.
- [Corstone SSE-320 FVP](https://developer.arm.com/documentation/109760/0000/SSE-320-FVP) for testing on an Ethos-U85 reference design.

## Set Up the Developer Environment

The setup.sh script generates a setup_path.sh script that you need to source whenever you restart your shell. Run:

```bash
source examples/arm/arm-scratch/setup_path.sh
```

As a simple check that your environment is set up correctly, run `which FVP_Corstone_SSE-320` and make sure that the executable is located where you expect, in the `examples/arm` tree.

## Build

### Ahead-of-Time (AOT) components

The ExecuTorch Ahead-of-Time (AOT) pipeline takes a PyTorch Model (a `torch.nn.Module`) and produces a `.pte` binary file, which is then consumed by the ExecuTorch Runtime. This [document](getting-started-architecture.md) goes into much more depth about the ExecuTorch software stack for both AoT as well as Runtime.

The example below shows how to quantize a model consisting of a single addition, and export it through the AOT flow using the Ethos-U backend. For more details, see `examples/arm/ethos_u_minimal_example.ipynb`.

```python
import torch


class Add(torch.nn.Module):
    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        return x + y


example_inputs = (torch.ones(1, 1, 1, 1), torch.ones(1, 1, 1, 1))

model = Add()
model = model.eval()
exported_program = torch.export.export(model, example_inputs)
graph_module = exported_program.graph_module

from executorch.backends.arm.ethosu import EthosUCompileSpec
from executorch.backends.arm.quantizer import (
    EthosUQuantizer,
    get_symmetric_quantization_config,
)
from torchao.quantization.pt2e.quantize_pt2e import convert_pt2e, prepare_pt2e

# Create a compilation spec describing the target for configuring the quantizer.
# Some args are used by the Arm Vela graph compiler later in the example.
# Refer to Arm Vela documentation for an explanation of its flags:
# https://gitlab.arm.com/artificial-intelligence/ethos-u/ethos-u-vela/-/blob/main/OPTIONS.md
compile_spec = EthosUCompileSpec(
    target="ethos-u55-128",
    system_config="Ethos_U55_High_End_Embedded",
    memory_mode="Shared_Sram",
    extra_flags=["--output-format=raw", "--debug-force-regor"]
)

# Create and configure quantizer to use a symmetric quantization config globally on all nodes
quantizer = EthosUQuantizer(compile_spec)
operator_config = get_symmetric_quantization_config()
quantizer.set_global(operator_config)

# Post training quantization
quantized_graph_module = prepare_pt2e(graph_module, quantizer)
quantized_graph_module(*example_inputs)  # Calibrate the graph module with the example input
quantized_graph_module = convert_pt2e(quantized_graph_module)

# Create a new exported program using the quantized_graph_module
quantized_exported_program = torch.export.export(quantized_graph_module, example_inputs)

from executorch.backends.arm.ethosu import EthosUPartitioner
from executorch.exir import (
    EdgeCompileConfig,
    ExecutorchBackendConfig,
    to_edge_transform_and_lower,
)
from executorch.extension.export_util.utils import save_pte_program

# Create partitioner from compile spec
partitioner = EthosUPartitioner(compile_spec)

# Lower the exported program to the Ethos-U backend
edge_program_manager = to_edge_transform_and_lower(
    quantized_exported_program,
    partitioner=[partitioner],
    compile_config=EdgeCompileConfig(
        _check_ir_validity=False,
    ),
)

# Convert edge program to executorch
executorch_program_manager = edge_program_manager.to_executorch(
    config=ExecutorchBackendConfig(extract_delegate_segments=False)
)

# Save pte file
save_pte_program(executorch_program_manager, "ethos_u_minimal_example.pte")
```

```{tip}
For a quick start, you can use the script `examples/arm/aot_arm_compiler.py` to produce the pte. To produce a pte file equivalent to the one above, run
`python -m examples.arm.aot_arm_compiler --model_name=add --delegate --quantize --output=ethos_u_minimal_example.pte`
```

### Runtime

After the AOT compilation flow is done, the runtime can be cross compiled and linked to the produced `.pte` file using the Arm cross-compilation toolchain. This is done in two steps:

First, build and install the ExecuTorch libraries and EthosUDelegate:

```bash
# In ExecuTorch top-level, with sourced setup_path.sh
cmake -DCMAKE_BUILD_TYPE=Release --preset arm-baremetal -B cmake-out-arm .
cmake --build cmake-out-arm --target install -j$(nproc)
```

Second, build and link the `arm_executor_runner` and generate kernel bindings for any non-delegated ops. This is the actual program that will run on target.

```bash
# In ExecuTorch top-level, with sourced setup_path.sh
cmake -DCMAKE_TOOLCHAIN_FILE=`pwd`/examples/arm/ethos-u-setup/arm-none-eabi-gcc.cmake \
    -DCMAKE_BUILD_TYPE=Release \
    -DET_PTE_FILE_PATH=ethos_u_minimal_example.pte \
    -DTARGET_CPU=cortex-m55 \
    -DETHOSU_TARGET_NPU_CONFIG=ethos-u55-128 \
    -DMEMORY_MODE=Shared_Sram \
    -DSYSTEM_CONFIG=Ethos_U55_High_End_Embedded \
    -Bethos_u_minimal_example \
    examples/arm/executor_runner
cmake --build ethos_u_minimal_example -j$(nproc) -- arm_executor_runner
```

```{tip}
For a quick start, you can use the script `backends/arm/scripts/build_executor_runner.sh` to build the runner.
To build a runner equivalent to the one above, run
`./backends/arm/scripts/build_executor_runner.sh --pte=ethos_u_minimal_example.pte`
```

The block diagram below shows, at a high level, how the various build artifacts are generated and linked together to produce the final bare-metal executable.

![](arm-delegate-runtime-build.svg)

## Running on Corstone FVP Platforms

Finally, use the `backends/arm/scripts/run_fvp.sh` utility script to run the `.elf` file on simulated Arm hardware.

```bash
backends/arm/scripts/run_fvp.sh --elf=$(find ethos_u_minimal_example -name arm_executor_runner) --target=ethos-u55-128
```

The example application is by default built with an input of ones, so the expected result of the quantized addition should be close to 2.

## Takeaways

In this tutorial you have learned how to use ExecuTorch to export a PyTorch model to an executable that can run on an embedded target, and then run that executable on simulated hardware.

To learn more, check out these learning paths:

https://learn.arm.com/learning-paths/embedded-and-microcontrollers/rpi-llama3/

https://learn.arm.com/learning-paths/embedded-and-microcontrollers/visualizing-ethos-u-performance/

## FAQs

If you encounter any bugs or issues following this tutorial, please file a bug/issue on [Github](https://github.com/pytorch/executorch/issues/new).

```
Arm is a registered trademark of Arm Limited (or its subsidiaries or affiliates).
```

---

# Arm VGF Backend

The Arm® VGF backend is the ExecuTorch solution for lowering PyTorch models to VGF-compatible hardware. It leverages the TOSA operator set and the [ML SDK for Vulkan®](https://github.com/arm/ai-ml-sdk-for-vulkan?tab=readme-ov-file) to produce a .PTE file. The VGF backend also supports execution from a .PTE file and provides functionality to extract the corresponding VGF file for integration into various applications.

## Features

- Wide operator support for delegating large parts of models to the VGF target.
- A quantizer that optimizes quantization for the VGF target.

## Target Requirements

The target system must include ML SDK for Vulkan and a Vulkan driver with Vulkan API >= 1.3.

## Development Requirements

```{tip}
All requirements can be downloaded using `examples/arm/setup.sh --enable-mlsdk-deps --disable-ethos-u-deps` and added to the path using `source examples/arm/arm-scratch/setup_path.sh`
```

For the AOT flow (compilation of a model to `.pte` format using the VGF backend), the requirements are:

- [TOSA Serialization Library](https://www.mlplatform.org/tosa/software.html) for serializing the Exir IR graph into TOSA IR.
- [ML SDK Model Converter](https://github.com/arm/ai-ml-sdk-model-converter) for converting TOSA flatbuffers to VGF files.

And for building and running your application using the generic executor_runner:

- [Vulkan API](https://www.vulkan.org) should be set up locally for GPU execution support.
- [ML Emulation Layer for Vulkan](https://github.com/arm/ai-ml-emulation-layer-for-vulkan) for testing on Vulkan API.

## Using the Arm VGF Backend

The [VGF Minimal Example](https://github.com/pytorch/executorch/blob/main/examples/arm/vgf_minimal_example.ipynb) demonstrates how to lower a module using the VGF backend. The main configuration point for the lowering is the `VgfCompileSpec` consumed by the partitioner and quantizer. The full user-facing API is documented below.
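As a quick orientation before the full reference, here is a minimal sketch of creating a compile spec with default settings, mirroring the VGF Minimal Example:

```python
from executorch.backends.arm.vgf import VgfCompileSpec

# Defaults; a TOSA specification and extra compiler flags can also be passed explicitly.
compile_spec = VgfCompileSpec()
```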
```python
class VgfCompileSpec(tosa_spec: executorch.backends.arm.tosa.specification.TosaSpecification | str | None = None, compiler_flags: list[str] | None = None)
```

Compile spec for VGF compatible targets.

Args:
- **tosa_spec**: TOSA specification that should be targeted.
- **compiler_flags**: Extra compiler flags for converter_backend.

```python
def VgfCompileSpec.dump_debug_info(self, debug_mode: executorch.backends.arm.common.arm_compile_spec.ArmCompileSpec.DebugMode | None):
```

Dump debugging information into the intermediates path.

Args:
- **debug_mode**: The debug mode to use for dumping debug information.

```python
def VgfCompileSpec.dump_intermediate_artifacts_to(self, output_path: str | None):
```

Sets a path for dumping intermediate results produced during lowering, such as TOSA and PTE files.

Args:
- **output_path**: Path to dump intermediate results to.

```python
def VgfCompileSpec.get_intermediate_path(self) -> str | None:
```

Gets the path used for dumping intermediate results such as TOSA and PTE files.

Returns: Path where intermediate results are saved.

```python
def VgfCompileSpec.get_output_format() -> str:
```

Returns a constant string that is the output format of the class.

### Partitioner API

See [Partitioner API](arm-vgf-partitioner.md) for more information on the Partitioner API.

## Quantization

The VGF quantizer supports [Post Training Quantization (PT2E)](https://docs.pytorch.org/ao/main/tutorials_source/pt2e_quant_ptq.html) and [Quantization-Aware Training (QAT)](https://docs.pytorch.org/ao/main/tutorials_source/pt2e_quant_qat.html). Partial quantization is supported, allowing users to quantize only specific parts of the model while leaving others in floating-point. For more information on quantization, see [Quantization](arm-vgf-quantization.md).

## Runtime Integration

The VGF backend can use the default ExecuTorch runner. The steps required for building and running it are explained in the [VGF Backend Tutorial](tutorials/vgf-getting-started.md). The example application is recommended for testing basic functionality of your lowered models, as well as a starting point for developing runtime integrations for your own targets.

## Reference

**→{doc}`/backends/arm-vgf/arm-vgf-partitioner` — Partitioner options.**

**→{doc}`/backends/arm-vgf/arm-vgf-quantization` — Supported quantization schemes.**

**→{doc}`/backends/arm-vgf/arm-vgf-troubleshooting` — Debug common issues.**

**→{doc}`/backends/arm-vgf/tutorials/arm-vgf-tutorials` — Tutorials.**

```{toctree}
:maxdepth: 2
:hidden:
:caption: Arm VGF Backend

arm-vgf-partitioner
arm-vgf-quantization
arm-vgf-troubleshooting
tutorials/arm-vgf-tutorials
```

---

# Partitioner API

The `VgfPartitioner` controls which parts of a model are delegated to the Arm VGF backend. Below is a reference of the various functions the partitioner provides:

```python
class VgfPartitioner(compile_spec: executorch.backends.arm.vgf.compile_spec.VgfCompileSpec, additional_checks: Optional[Sequence[torch.fx.passes.operator_support.OperatorSupportBase]] = None) -> None
```

Partitions subgraphs supported by the Arm Vgf backend.

Args:
- **compile_spec**: The Vgf compilation specification.
- **additional_checks**: Optional sequence of additional operator support checks.

```python
def VgfPartitioner.ops_to_not_decompose(self, ep: torch.export.exported_program.ExportedProgram) -> Tuple[List[torch._ops.OpOverload], Optional[Callable[[torch.fx.node.Node], bool]]]:
```

Return operators and a filter that should not be decomposed.
Provide a base set of ops to preserve as-is and a predicate that keeps certain activations whole when surrounded by quantize/dequantize ops in a quantized graph. This helps downstream TOSA lowering and delegation.

Args:
- **ep (ExportedProgram)**: Program used to infer target-specific policy.

Returns:
- **Tuple[List[torch._ops.OpOverload], Optional[Callable[[torch.fx.Node], bool]]]**: A list of op overloads to keep intact, and an optional filter function that returns True when an op should not be decomposed.

```python
def VgfPartitioner.partition(self, exported_program: torch.export.exported_program.ExportedProgram) -> executorch.exir.backend.partitioner.PartitionResult:
```

Partition the program and tag TOSA-compatible subgraphs.

Run the FX capability-based partitioner to propose subgraphs, then refine tags by removing boundary-only quantize/dequantize nodes and by rejecting partitions that would lower to no-ops. Emit a detailed report of rejected nodes and their reasons.

Args:
- **exported_program (ExportedProgram)**: Program to analyze and partition.

Returns:
- **PartitionResult**: The input program with nodes tagged for delegation and a mapping of partition tags to delegation specs.

---

# Quantization

The Arm VGF delegate can be used to execute quantized models. To quantize a model so that it is supported by this delegate, the `VgfQuantizer` should be used. Currently, the symmetric `int8` config defined by `executorch.backends.arm.quantizer.arm_quantizer.get_symmetric_quantization_config` is the main config available to use with the VGF quantizer.

### Supported Quantization Schemes

The quantization schemes supported by the VGF Backend are:

- 8-bit symmetric weights with 8-bit asymmetric activations (via the PT2E quantization flow).
  - Supports both static and dynamic activations.
  - Supports per-channel and per-tensor schemes.

Weight-only quantization is not currently supported on the VGF backend.

### Partial Quantization

The VGF backend supports partial quantization, where only parts of the model are quantized while others remain in floating-point. This can be useful for models where certain layers are not well-suited for quantization or when a balance between performance and accuracy is desired.

For every node (op) in the graph, the quantizer looks at the *quantization configuration* set for that specific node. If the configuration is set to `None`, the node is left in floating-point; if it is provided (not `None`), the node is quantized according to that configuration.

With the [Quantization API](#quantization-api), users can specify the quantization configurations for specific layers or submodules of the model. The `set_global` method is first used to set a default quantization configuration (which could be `None`, as explained above) for all nodes in the model. Then, configurations for specific layers or submodules can override the global setting using the `set_module_name` or `set_module_type` methods.

### Quantization API

```python
class VgfQuantizer(compile_spec: 'VgfCompileSpec') -> 'None'
```

Quantizer supported by the Arm Vgf backend.

Args:
- **compile_spec (VgfCompileSpec)**: Backend compile specification for Vgf targets.

```python
def VgfQuantizer.quantize_with_submodules(self, model: 'GraphModule', calibration_samples: 'list[tuple]', is_qat: 'bool' = False):
```

Quantizes a GraphModule in a way such that conditional submodules are handled properly.

Args:
- **model (GraphModule)**: The model to quantize.
- **calibration_samples (list[tuple])**: A list of inputs used to calibrate the model during quantization. To properly calibrate a model with submodules, at least one sample per code path is needed.
- **is_qat (bool)**: Whether to do quantization aware training or not.

Returns:
- **GraphModule**: The quantized model.

```python
def VgfQuantizer.set_global(self, quantization_config: 'QuantizationConfig') -> 'TOSAQuantizer':
```

Set quantization_config for submodules not matched by other filters.

Args:
- **quantization_config (QuantizationConfig)**: Configuration to apply to modules that are not captured by name or type filters.

```python
def VgfQuantizer.set_io(self, quantization_config: 'QuantizationConfig') -> 'TOSAQuantizer':
```

Set quantization_config for input and output nodes.

Args:
- **quantization_config (QuantizationConfig)**: Configuration describing activation quantization for model inputs and outputs.

```python
def VgfQuantizer.set_module_name(self, module_name: 'str', quantization_config: 'Optional[QuantizationConfig]') -> 'TOSAQuantizer':
```

Set quantization_config for submodules with a given module name. For example, calling set_module_name("blocks.sub") quantizes supported patterns for that submodule with the provided quantization_config.

Args:
- **module_name (str)**: Fully qualified module name to configure.
- **quantization_config (QuantizationConfig)**: Configuration to apply to the named submodule.

```python
def VgfQuantizer.set_module_type(self, module_type: 'Callable', quantization_config: 'QuantizationConfig') -> 'TOSAQuantizer':
```

Set quantization_config for submodules with a given module type. For example, calling set_module_type(Sub) quantizes supported patterns in each Sub instance with the provided quantization_config.

Args:
- **module_type (Callable)**: Type whose submodules should use the provided quantization configuration.
- **quantization_config (QuantizationConfig)**: Configuration to apply to submodules of the given type.

```python
def VgfQuantizer.transform_for_annotation(self, model: 'GraphModule') -> 'GraphModule':
```

Transform the graph to prepare it for quantization annotation. Currently transforms scalar values to tensor attributes.

Args:
- **model (GraphModule)**: Model whose graph will be transformed.

Returns:
- **GraphModule**: Transformed model prepared for annotation.

---

# Arm VGF Troubleshooting

This page describes common issues that you may encounter when using the Arm VGF backend and how to debug and resolve them.

## How to visualize VGF files

The [VGF Adapter for Model Explorer](https://github.com/arm/vgf-adapter-model-explorer) enables visualization of VGF files and can be useful for debugging.

---

# Arm VGF Backend Tutorials

**→{doc}`vgf-getting-started`**

```{toctree}
:maxdepth: 2
:hidden:
:caption: Tutorials

vgf-getting-started
```

---

# Getting Started Tutorial

::::{grid} 2

:::{grid-item-card} Tutorials we recommend you complete before this:
:class-card: card-prerequisites
* [Introduction to ExecuTorch](intro-how-it-works.md)
* [Getting Started](getting-started.md)
* [Building ExecuTorch with CMake](using-executorch-building-from-source.md)
:::

:::{grid-item-card} What you will learn in this tutorial:
:class-card: card-prerequisites
In this tutorial you will learn how to export a simple PyTorch model for the ExecuTorch VGF backend.
:::

::::

```{warning}
This delegate is under active development; to get the best results, please use a recent version. The VGF backend support is in early development and you may encounter issues.
You may encounter some rough edges and features which may be documented or planned but not yet implemented; please refer to the in-tree documentation for the latest status of features.
```

```{tip}
If you are already familiar with this delegate, you may want to jump directly to the examples:

* [Examples in the ExecuTorch repository](https://github.com/pytorch/executorch/tree/main/examples/arm)
* [A commandline compiler for example models](https://github.com/pytorch/executorch/blob/main/examples/arm/aot_arm_compiler.py)
```

This tutorial serves as an introduction to using ExecuTorch to deploy PyTorch models on VGF targets. The tutorial is based on `vgf_minimal_example.ipynb`, provided in Arm's examples folder.

## Prerequisites

### Hardware

To successfully complete this tutorial, you will need a Linux machine with aarch64 or x86_64 processor architecture, or a macOS™ machine with Apple® Silicon.

To enable development without a specific development board, we will be using the [ML SDK for Vulkan®](https://github.com/arm/ai-ml-sdk-for-vulkan/) to emulate the program consumer.

### Software

First, you will need to install ExecuTorch. Please follow the recommended tutorials, if you haven't already, to set up a working ExecuTorch development environment. For the VGF backend it's recommended you [install from source](https://docs.pytorch.org/executorch/stable/using-executorch-building-from-source.html), or from a [nightly](https://download.pytorch.org/whl/nightly/executorch/).

In addition to this, you need to install a number of SDK dependencies for generating VGF files. Scripts to automate this are available in the main [ExecuTorch repository](https://github.com/pytorch/executorch/tree/main/examples/arm/). To install VGF dependencies, run

```bash
./examples/arm/setup.sh --i-agree-to-the-contained-eula --disable-ethos-u-deps --enable-mlsdk-deps
```

This will install:

- [TOSA Serialization Library](https://www.mlplatform.org/tosa/software.html) for serializing the Exir IR graph into TOSA IR.
- [ML SDK Model Converter](https://github.com/arm/ai-ml-sdk-model-converter) for converting TOSA flatbuffers to VGF files.
- [Vulkan API](https://www.vulkan.org) should be set up locally for GPU execution support.
- [ML Emulation Layer for Vulkan](https://github.com/arm/ai-ml-emulation-layer-for-vulkan) for testing on Vulkan API.

## Set Up the Developer Environment

The `setup.sh` script has generated a `setup_path.sh` script that you need to source whenever you restart your shell. Do this by running `source examples/arm/arm-scratch/setup_path.sh`.

As a simple check that your environment is set up correctly, run

```bash
which model-converter
```

Make sure the executable is located where you expect, in the `examples/arm` tree.

## Build

### Ahead-of-Time (AOT) components

The ExecuTorch Ahead-of-Time (AOT) pipeline takes a PyTorch Model (a `torch.nn.Module`) and produces a `.pte` binary file, which is then typically consumed by the ExecuTorch Runtime. This [document](https://github.com/pytorch/executorch/blob/main/docs/source/getting-started-architecture.md) goes into much more depth about the ExecuTorch software stack for both AoT as well as Runtime.

The example below shows how to quantize a model consisting of an addition followed by a sigmoid, and export it through the AOT flow using the VGF backend. For more details, see `examples/arm/vgf_minimal_example.ipynb`.
```python
import torch


class AddSigmoid(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.sigmoid = torch.nn.Sigmoid()

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        return self.sigmoid(x + y)


example_inputs = (torch.ones(1, 1, 1, 1), torch.ones(1, 1, 1, 1))

model = AddSigmoid()
model = model.eval()
exported_program = torch.export.export(model, example_inputs)
graph_module = exported_program.graph_module

from executorch.backends.arm.quantizer import (
    VgfQuantizer,
    get_symmetric_quantization_config,
)
from executorch.backends.arm.vgf import VgfCompileSpec
from torchao.quantization.pt2e.quantize_pt2e import convert_pt2e, prepare_pt2e

# Create a compilation spec describing the target for configuring the quantizer
compile_spec = VgfCompileSpec()

# Create and configure quantizer to use a symmetric quantization config globally on all nodes
quantizer = VgfQuantizer(compile_spec)
operator_config = get_symmetric_quantization_config(is_per_channel=False)

# Set default quantization config for the layers in the model.
# Can also be set to `None` to let layers run in FP as default.
quantizer.set_global(operator_config)

# OPTIONAL: skip quantizing all sigmoid ops (only one for this model); let it run in FP
quantizer.set_module_type(torch.nn.Sigmoid, None)

# Post training quantization
quantized_graph_module = prepare_pt2e(graph_module, quantizer)
quantized_graph_module(*example_inputs)  # Calibrate the graph module with the example input
quantized_graph_module = convert_pt2e(quantized_graph_module)

# Create a new exported program using the quantized_graph_module
quantized_exported_program = torch.export.export(quantized_graph_module, example_inputs)

import os

from executorch.backends.arm.vgf import VgfPartitioner
from executorch.exir import (
    EdgeCompileConfig,
    ExecutorchBackendConfig,
    to_edge_transform_and_lower,
)
from executorch.extension.export_util.utils import save_pte_program

# Create partitioner from compile spec
partitioner = VgfPartitioner(compile_spec)

# Lower the exported program to the VGF backend
edge_program_manager = to_edge_transform_and_lower(
    quantized_exported_program,
    partitioner=[partitioner],
    compile_config=EdgeCompileConfig(
        _check_ir_validity=False,
    ),
)

# Convert edge program to executorch
executorch_program_manager = edge_program_manager.to_executorch(
    config=ExecutorchBackendConfig(extract_delegate_segments=False)
)

# Save pte file
cwd_dir = os.getcwd()
pte_base_name = "simple_example"
pte_name = pte_base_name + ".pte"
pte_path = os.path.join(cwd_dir, pte_name)
save_pte_program(executorch_program_manager, pte_name)
assert os.path.exists(pte_path), "Build failed; no .pte-file found"
```

```{tip}
For a quick start, you can use the script `examples/arm/aot_arm_compiler.py` to produce the pte. To produce a pte file equivalent to the one above, run
`python -m examples.arm.aot_arm_compiler --model_name=add --delegate --quantize --output=simple_example.pte --target=vgf`
```

### Runtime

After the AOT compilation flow is done, we can build the executor runner target. For this tutorial, the default runner can be used.
Build it with the following configuration:

```bash
# In ExecuTorch top-level, with sourced setup_path.sh
cmake \
    -DCMAKE_INSTALL_PREFIX=cmake-out \
    -DCMAKE_BUILD_TYPE=Debug \
    -DEXECUTORCH_BUILD_EXTENSION_DATA_LOADER=ON \
    -DEXECUTORCH_BUILD_EXTENSION_MODULE=ON \
    -DEXECUTORCH_BUILD_EXTENSION_FLAT_TENSOR=ON \
    -DEXECUTORCH_BUILD_EXTENSION_TENSOR=ON \
    -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
    -DEXECUTORCH_BUILD_XNNPACK=OFF \
    -DEXECUTORCH_BUILD_VULKAN=ON \
    -DEXECUTORCH_BUILD_VGF=ON \
    -DEXECUTORCH_ENABLE_LOGGING=ON \
    -DPYTHON_EXECUTABLE=python \
    -Bcmake-out .
cmake --build cmake-out --target executor_runner
```

The block diagram below demonstrates, at a high level, how the various build artifacts are generated and linked together to produce the final executable.

![](arm-delegate-runtime-build.svg)

## Deploying and running on device

Since we are using the Vulkan emulation layer, we can run the executor runner with the VGF delegate on the host machine:

```bash
./cmake-out/executor_runner -model_path simple_example.pte
```

The example application is by default built with an input of ones; the quantized addition produces a value close to 2, so the final sigmoid output should be close to sigmoid(2) ≈ 0.88.

## Takeaways

In this tutorial you have learned how to use ExecuTorch to export a PyTorch model to an executable that can run on an embedded target, and then run that executable on simulated hardware.

## FAQs

Issue: glslc is not found when configuring the executor runner.

Solution: The Vulkan SDK is likely not in your path; check whether setup_path.sh contains something like `export PATH=$(pwd)/examples/arm/arm-scratch/vulkan_sdk/1.4.321.1/x86_64/bin:$PATH`. If not, add it and source the file.

If you encounter any bugs or issues following this tutorial, please file a bug/issue on [Github](https://github.com/pytorch/executorch/issues/new).

---

# Op support

The Core ML backend supports almost all PyTorch operators. If an operator in your model is not supported by Core ML, you will see a warning about this during lowering.

If you want to guarantee that your model fully delegates to Core ML, you can set [`lower_full_graph=True`](coreml-partitioner.md) in the `CoreMLPartitioner`. When set, lowering will fail if an unsupported operator is encountered.

---

# Core ML Backend

Core ML delegate is the ExecuTorch solution to take advantage of Apple's [Core ML framework](https://developer.apple.com/documentation/coreml) for on-device ML. With Core ML, a model can run on CPU, GPU, and the Apple Neural Engine (ANE).

## Features

- Dynamic dispatch to the CPU, GPU, and ANE.
- Supports fp32 and fp16 computation.

## Target Requirements

Below are the minimum OS requirements on various hardware for running a Core ML-delegated ExecuTorch model:
- [macOS](https://developer.apple.com/macos) >= 13.0
- [iOS](https://developer.apple.com/ios/) >= 16.0
- [iPadOS](https://developer.apple.com/ipados/) >= 16.0
- [tvOS](https://developer.apple.com/tvos/) >= 16.0

## Development Requirements

To develop, you need:

- [macOS](https://developer.apple.com/macos) >= 13.0
- [Xcode](https://developer.apple.com/documentation/xcode) >= 14.1

Before starting, make sure you install the Xcode Command Line Tools:

```bash
xcode-select --install
```

----

## Using the Core ML Backend

To target the Core ML backend during the export and lowering process, pass an instance of the `CoreMLPartitioner` to `to_edge_transform_and_lower`. The example below demonstrates this process using the MobileNet V2 model from torchvision.
```python
import torch
import torchvision.models as models
from torchvision.models.mobilenetv2 import MobileNet_V2_Weights
from executorch.backends.apple.coreml.partition import CoreMLPartitioner
from executorch.exir import to_edge_transform_and_lower

mobilenet_v2 = models.mobilenetv2.mobilenet_v2(weights=MobileNet_V2_Weights.DEFAULT).eval()
sample_inputs = (torch.randn(1, 3, 224, 224), )

et_program = to_edge_transform_and_lower(
    torch.export.export(mobilenet_v2, sample_inputs),
    partitioner=[CoreMLPartitioner()],
).to_executorch()

with open("mv2_coreml.pte", "wb") as file:
    et_program.write_to_file(file)
```

See [Partitioner API](coreml-partitioner.md) for a reference on available partitioner options.

----

## Quantization

The Core ML delegate can also be used as a backend to execute quantized models. See [Core ML Quantization](coreml-quantization.md) for more information on available quantization schemes and APIs.

## Backward compatibility

Core ML supports backward compatibility via the [`minimum_deployment_target`](coreml-partitioner.md#coreml-compilespec) option. A model exported with a specific deployment target is guaranteed to work on all deployment targets >= the specified deployment target. For example, a model exported with `coremltools.target.iOS17` will work on iOS 17 or higher.

----

## Runtime integration

To run the model on device, use the standard ExecuTorch runtime APIs. See [Running on Device](getting-started.md#running-on-device) for more information, including building the iOS frameworks.

When building from source, pass `-DEXECUTORCH_BUILD_COREML=ON` when configuring the CMake build to compile the Core ML backend.

Due to the use of static initializers for registration, it may be necessary to use whole-archive to link against the `coremldelegate` target. This can typically be done by passing `"$<LINK_LIBRARY:WHOLE_ARCHIVE,coremldelegate>"` to `target_link_libraries`.

```
# CMakeLists.txt
add_subdirectory("executorch")
...
target_link_libraries(
    my_target
    PRIVATE executorch
    extension_module_static
    extension_tensor
    optimized_native_cpu_ops_lib
    $<LINK_LIBRARY:WHOLE_ARCHIVE,coremldelegate>)
```

No additional steps are necessary to use the backend beyond linking the target. A Core ML-delegated .pte file will automatically run on the registered backend.

## Reference

**→{doc}`/backends/coreml/coreml-troubleshooting` — Debug common issues.**

**→{doc}`/backends/coreml/coreml-partitioner` — Partitioner options.**

**→{doc}`/backends/coreml/coreml-quantization` — Supported quantization schemes.**

**→{doc}`/backends/coreml/coreml-op-support` — Supported operators.**

```{toctree}
:maxdepth: 2
:hidden:
:caption: Core ML Backend

coreml-troubleshooting
coreml-partitioner
coreml-quantization
coreml-op-support
```

---

# Partitioner API

The Core ML partitioner API allows for configuration of the model delegation to Core ML. Passing a `CoreMLPartitioner` instance with no additional parameters will run as much of the model as possible on the Core ML backend with default settings. This is the most common use case.

For advanced use cases, the partitioner exposes the following options via the [constructor](https://github.com/pytorch/executorch/blob/14ff52ff89a89c074fc6c14d3f01683677783dcd/backends/apple/coreml/partition/coreml_partitioner.py#L60):

- `skip_ops_for_coreml_delegation`: Allows you to skip ops for delegation by Core ML. By default, all ops that Core ML supports will be delegated.
See [here](https://github.com/pytorch/executorch/blob/14ff52ff89a89c074fc6c14d3f01683677783dcd/backends/apple/coreml/test/test_coreml_partitioner.py#L42) for an example of skipping an op for delegation. - `compile_specs`: A list of `CompileSpec`s for the Core ML backend. These control low-level details of Core ML delegation, such as the compute unit (CPU, GPU, ANE), the iOS deployment target, and the compute precision (FP16, FP32). These are discussed more below. - `take_over_mutable_buffer`: A boolean that indicates whether PyTorch mutable buffers in stateful models should be converted to [Core ML `MLState`](https://developer.apple.com/documentation/coreml/mlstate). If set to `False`, mutable buffers in the PyTorch graph are converted to graph inputs and outputs to the Core ML lowered module under the hood. Generally, setting `take_over_mutable_buffer` to true will result in better performance, but using `MLState` requires iOS >= 18.0, macOS >= 15.0, and Xcode >= 16.0. - `take_over_constant_data`: A boolean that indicates whether PyTorch constant data like model weights should be consumed by the Core ML delegate. If set to False, constant data is passed to the Core ML delegate as inputs. By default, take_over_constant_data=True. - `lower_full_graph`: A boolean that indicates whether the entire graph must be lowered to Core ML. If set to True and Core ML does not support an op, an error is raised during lowering. If set to False and Core ML does not support an op, the op is executed on the CPU by ExecuTorch. Although setting `lower_full_graph`=False can allow a model to lower where it would otherwise fail, it can introduce performance overhead in the model when there are unsupported ops. You will see warnings about unsupported ops during lowering if there are any. By default, `lower_full_graph`=False. #### Core ML CompileSpec A list of `CompileSpec`s is constructed with [`CoreMLBackend.generate_compile_specs`](https://github.com/pytorch/executorch/blob/14ff52ff89a89c074fc6c14d3f01683677783dcd/backends/apple/coreml/compiler/coreml_preprocess.py#L210). Below are the available options: - `compute_unit`: this controls the compute units (CPU, GPU, ANE) that are used by Core ML. The default value is `coremltools.ComputeUnit.ALL`. The available options from coremltools are: - `coremltools.ComputeUnit.ALL` (uses the CPU, GPU, and ANE) - `coremltools.ComputeUnit.CPU_ONLY` (uses the CPU only) - `coremltools.ComputeUnit.CPU_AND_GPU` (uses both the CPU and GPU, but not the ANE) - `coremltools.ComputeUnit.CPU_AND_NE` (uses both the CPU and ANE, but not the GPU) - `minimum_deployment_target`: The minimum iOS deployment target (e.g., `coremltools.target.iOS18`). By default, the smallest deployment target needed to deploy the model is selected. During export, you will see a warning about the "Core ML specification version" that was used for the model, which maps onto a deployment target as discussed [here](https://apple.github.io/coremltools/mlmodel/Format/Model.html#model). If you need to control the deployment target, please specify it explicitly. - `compute_precision`: The compute precision used by Core ML (`coremltools.precision.FLOAT16` or `coremltools.precision.FLOAT32`). The default value is `coremltools.precision.FLOAT16`. Note that the compute precision is applied no matter what dtype is specified in the exported PyTorch model. For example, an FP32 PyTorch model will be converted to FP16 when delegating to the Core ML backend by default. Also note that the ANE only supports FP16 precision. 
- `model_type`: Whether the model should be compiled to the Core ML [mlmodelc format](https://developer.apple.com/documentation/coreml/downloading-and-compiling-a-model-on-the-user-s-device) during .pte creation ([`CoreMLBackend.MODEL_TYPE.COMPILED_MODEL`](https://github.com/pytorch/executorch/blob/14ff52ff89a89c074fc6c14d3f01683677783dcd/backends/apple/coreml/compiler/coreml_preprocess.py#L71)), or whether it should be compiled to mlmodelc on device ([`CoreMLBackend.MODEL_TYPE.MODEL`](https://github.com/pytorch/executorch/blob/14ff52ff89a89c074fc6c14d3f01683677783dcd/backends/apple/coreml/compiler/coreml_preprocess.py#L70)). Using `CoreMLBackend.MODEL_TYPE.COMPILED_MODEL` and doing compilation ahead of time should improve the first time on-device model load time. ### Dynamic and Enumerated Shapes in Core ML Export When exporting an `ExportedProgram` to Core ML, **dynamic shapes** are mapped to [`RangeDim`](https://apple.github.io/coremltools/docs-guides/source/flexible-inputs.html#set-the-range-for-each-dimension). This enables Core ML `.pte` files to accept inputs with varying dimensions at runtime. ⚠️ **Note:** The Apple Neural Engine (ANE) does not support true dynamic shapes. If a model relies on `RangeDim`, Core ML will fall back to scheduling the model on the CPU or GPU instead of the ANE. --- #### Enumerated Shapes To enable limited flexibility on the ANE—and often achieve better performance overall—you can export models using **[enumerated shapes](https://apple.github.io/coremltools/docs-guides/source/flexible-inputs.html#select-from-predetermined-shapes)**. - Enumerated shapes are *not fully dynamic*. - Instead, they define a **finite set of valid input shapes** that Core ML can select from at runtime. - This approach allows some adaptability while still preserving ANE compatibility. --- #### Specifying Enumerated Shapes Unlike `RangeDim`, **enumerated shapes are not part of the `ExportedProgram` itself.** They must be provided through a compile spec. For reference on how to do this, see: - The annotated code snippet below, and - The [end-to-end test in ExecuTorch](https://github.com/pytorch/executorch/blob/main/backends/apple/coreml/test/test_enumerated_shapes.py), which demonstrates how to specify enumerated shapes during export. ```python class Model(torch.nn.Module): def __init__(self): super().__init__() self.linear1 = torch.nn.Linear(10, 5) self.linear2 = torch.nn.Linear(11, 5) def forward(self, x, y): return self.linear1(x).sum() + self.linear2(y) model = Model() example_inputs = ( torch.randn((4, 6, 10)), torch.randn((5, 11)), ) # Specify the enumerated shapes. Below we specify that: # # * x can take shape [1, 5, 10] and y can take shape [3, 11], or # * x can take shape [4, 6, 10] and y can take shape [5, 11] # # Any other input shapes will result in a runtime error. 
#
# Note that we must export x and y with dynamic shapes in the ExportedProgram
# because some of their dimensions are dynamic
enumerated_shapes = {"x": [[1, 5, 10], [4, 6, 10]], "y": [[3, 11], [5, 11]]}
dynamic_shapes = [
    {
        0: torch.export.Dim.AUTO(min=1, max=4),
        1: torch.export.Dim.AUTO(min=5, max=6),
    },
    {0: torch.export.Dim.AUTO(min=3, max=5)},
]
ep = torch.export.export(
    model.eval(), example_inputs, dynamic_shapes=dynamic_shapes
)

# If enumerated shapes are specified for multiple inputs, we must export
# for iOS18+
compile_specs = CoreMLBackend.generate_compile_specs(
    minimum_deployment_target=ct.target.iOS18
)
compile_specs.append(
    CoreMLBackend.generate_enumerated_shapes_compile_spec(
        ep,
        enumerated_shapes,
    )
)

# When using an enumerated shape compile spec, you must specify lower_full_graph=True
# in the CoreMLPartitioner. We do not support using enumerated shapes
# for partially exported models
partitioner = CoreMLPartitioner(
    compile_specs=compile_specs, lower_full_graph=True
)

delegated_program = executorch.exir.to_edge_transform_and_lower(
    ep,
    partitioner=[partitioner],
)
et_prog = delegated_program.to_executorch()
```

---

# Quantization

To quantize a PyTorch model for the Core ML backend, use the `CoreMLQuantizer`. `Quantizers` are backend specific, which means the `CoreMLQuantizer` is configured to quantize models to leverage the quantized operators offered by the Core ML backend.

### Supported Quantization Schemes

The Core ML delegate supports the following quantization schemes:

- 8-bit static and weight-only quantization via the PT2E flow; dynamic quantization is not supported by Core ML.
- 4-bit weight-only affine quantization (per-group or per-channel) via the quantize_ flow.
- 1-8 bit weight-only LUT quantization (per grouped-channel) via the quantize_ flow.

### 8-bit Quantization using the PT2E Flow

Quantization with the Core ML backend requires exporting the model for iOS 17 or later. To perform 8-bit quantization with the PT2E flow, follow these steps:

1) Create a [`coremltools.optimize.torch.quantization.LinearQuantizerConfig`](https://apple.github.io/coremltools/source/coremltools.optimize.torch.quantization.html#coremltools.optimize.torch.quantization.LinearQuantizerConfig) and use it to create an instance of a `CoreMLQuantizer`.
2) Use `torch.export.export` to export a graph module that will be prepared for quantization.
3) Call `prepare_pt2e` to prepare the model for quantization.
4) Run the prepared model with representative samples to calibrate the quantized tensor activation ranges.
5) Call `convert_pt2e` to quantize the model.
6) Export and lower the model using the standard flow.

The output of `convert_pt2e` is a PyTorch model which can be exported and lowered using the normal flow. As it is a regular PyTorch model, it can also be used to evaluate the accuracy of the quantized model using standard PyTorch techniques.
```python import torch import coremltools as ct import torchvision.models as models from torchvision.models.mobilenetv2 import MobileNet_V2_Weights from executorch.backends.apple.coreml.quantizer import CoreMLQuantizer from executorch.backends.apple.coreml.partition import CoreMLPartitioner from torchao.quantization.pt2e.quantize_pt2e import convert_pt2e, prepare_pt2e from executorch.exir import to_edge_transform_and_lower from executorch.backends.apple.coreml.compiler import CoreMLBackend mobilenet_v2 = models.mobilenetv2.mobilenet_v2(weights=MobileNet_V2_Weights.DEFAULT).eval() sample_inputs = (torch.randn(1, 3, 224, 224), ) # Step 1: Define a LinearQuantizerConfig and create an instance of a CoreMLQuantizer # Note that "linear" here does not mean only linear layers are quantized, but that linear (aka affine) quantization # is being performed static_8bit_config = ct.optimize.torch.quantization.LinearQuantizerConfig( global_config=ct.optimize.torch.quantization.ModuleLinearQuantizerConfig( quantization_scheme="symmetric", activation_dtype=torch.quint8, weight_dtype=torch.qint8, weight_per_channel=True, ) ) quantizer = CoreMLQuantizer(static_8bit_config) # Step 2: Export the model for training training_gm = torch.export.export(mobilenet_v2, sample_inputs).module() # Step 3: Prepare the model for quantization prepared_model = prepare_pt2e(training_gm, quantizer) # Step 4: Calibrate the model on representative data # Replace with your own calibration data for calibration_sample in [torch.randn(1, 3, 224, 224)]: prepared_model(calibration_sample) # Step 5: Convert the calibrated model to a quantized model quantized_model = convert_pt2e(prepared_model) # Step 6: Export the quantized model to Core ML et_program = to_edge_transform_and_lower( torch.export.export(quantized_model, sample_inputs), partitioner=[ CoreMLPartitioner( # iOS17 is required for the quantized ops in this example compile_specs=CoreMLBackend.generate_compile_specs( minimum_deployment_target=ct.target.iOS17 ) ) ], ).to_executorch() ``` The above does static quantization (activations and weights are quantized). You can see a full description of available quantization configs in the [coremltools documentation](https://apple.github.io/coremltools/source/coremltools.optimize.torch.quantization.html#coremltools.optimize.torch.quantization.LinearQuantizerConfig). For example, the config below will perform weight-only quantization: ``` weight_only_8bit_config = ct.optimize.torch.quantization.LinearQuantizerConfig( global_config=ct.optimize.torch.quantization.ModuleLinearQuantizerConfig( quantization_scheme="symmetric", activation_dtype=torch.float32, weight_dtype=torch.qint8, weight_per_channel=True, ) ) quantizer = CoreMLQuantizer(weight_only_8bit_config) ``` Quantizing activations requires calibrating the model on representative data. Also note that PT2E currently requires passing at least 1 calibration sample before calling `convert_pt2e`, even for data-free weight-only quantization. See [PyTorch 2 Export Post Training Quantization](https://docs.pytorch.org/ao/main/tutorials_source/pt2e_quant_ptq.html) for more information. ### LLM quantization with quantize_ The Core ML backend also supports quantizing models with the [torchao](https://github.com/pytorch/ao) quantize_ API. This is most commonly used for LLMs, requiring more advanced quantization. 
Since quantize_ is not backend aware, it is important to use a config that is compatible with Core ML: * Quantize embedding/linear layers with IntxWeightOnlyConfig (with weight_dtype torch.int4 or torch.int8, using PerGroup or PerAxis granularity). Using 4-bit or PerGroup quantization requires exporting with minimum_deployment_target >= ct.target.iOS18. Using 8-bit quantization with per-axis granularity is supported on ct.target.IOS16+. See [Core ML `CompileSpec`](coreml-partitioner.md#coreml-compilespec) for more information on setting the deployment target. * Quantize embedding/linear layers with CodebookWeightOnlyConfig (with dtype torch.uint1 through torch.uint8, using various block sizes). Quantizing with CodebookWeightOnlyConfig requires exporting with minimum_deployment_target >= ct.target.iOS18, see [Core ML `CompileSpec`](coreml-partitioner.md#coreml-compilespec) for more information on setting the deployment target. Below is an example that quantizes embeddings to 8-bits per-axis and linear layers to 4-bits using group_size=32 with affine quantization: ```python from torchao.quantization.granularity import PerGroup, PerAxis from torchao.quantization.quant_api import ( IntxWeightOnlyConfig, quantize_, ) # Quantize embeddings with 8-bits, per channel embedding_config = IntxWeightOnlyConfig( weight_dtype=torch.int8, granularity=PerAxis(0), ) quantize_( eager_model, embedding_config, lambda m, fqn: isinstance(m, torch.nn.Embedding), ) # Quantize linear layers with 4-bits, per-group linear_config = IntxWeightOnlyConfig( weight_dtype=torch.int4, granularity=PerGroup(32), ) quantize_( eager_model, linear_config, ) ``` Below is another example that uses codebook quantization to quantize both embeddings and linear layers to 3-bits. In the coremltools documentation, this is called [palettization](https://apple.github.io/coremltools/docs-guides/source/opt-palettization-overview.html): ``` from torchao.quantization.quant_api import ( quantize_, ) from torchao.prototype.quantization.codebook_coreml import CodebookWeightOnlyConfig quant_config = CodebookWeightOnlyConfig( dtype=torch.uint3, # There is one LUT per 16 rows block_size=[16, -1], ) quantize_( eager_model, quant_config, lambda m, fqn: isinstance(m, torch.nn.Embedding) or isinstance(m, torch.nn.Linear), ) ``` Both of the above examples will export and lower to Core ML with the to_edge_transform_and_lower API. --- # Troubleshooting This page describes common issues that you may encounter when using the Core ML backend and how to debug and resolve them. ### Issues during lowering 1. "ValueError: In op, of type [X], named [Y], the named input [Z] must have the same data type as the named input x. However, [Z] has dtype fp32 whereas x has dtype fp16." This happens because the model is in FP16, but Core ML interprets some of the arguments as FP32, which leads to a type mismatch. The solution is to keep the PyTorch model in FP32. Note that the model will be still be converted to FP16 during lowering to Core ML unless specified otherwise in the compute_precision [Core ML `CompileSpec`](coreml-partitioner.md#coreml-compilespec). Also see the [related issue in coremltools](https://github.com/apple/coremltools/issues/2480). ### Issues during runtime 1. [ETCoreMLModelCompiler.mm:55] [Core ML] Failed to compile model, error = Error Domain=com.apple.mlassetio Code=1 "Failed to parse the model specification. Error: Unable to parse ML Program: at unknown location: Unknown opset 'CoreML7'." 
UserInfo={NSLocalizedDescription=Failed to par$ This means the model requires the Core ML opset 'CoreML7', which requires running the model on iOS >= 17 or macOS >= 14. ## Extracting the mlpackage for profiling and debugging [Core ML *.mlpackage files](https://apple.github.io/coremltools/docs-guides/source/convert-to-ml-program.html#save-ml-programs-as-model-packages) can be extracted from a Core ML-delegated *.pte file. This can help with debugging and profiling for users who are more familiar with *.mlpackage files: ```bash python examples/apple/coreml/scripts/extract_coreml_models.py -m /path/to/model.pte ``` Note that if the ExecuTorch model has graph breaks, there may be multiple extracted *.mlpackage files. --- # MPS Backend MPS delegate is the ExecuTorch solution to take advantage of Apple's GPU for on-device ML using the [MPS Graph](https://developer.apple.com/documentation/metalperformanceshadersgraph/mpsgraph?language=objc) framework and tuned kernels provided by [MPS](https://developer.apple.com/documentation/metalperformanceshaders?language=objc). ## Target Requirements Below are the minimum OS requirements on various hardware for running a MPS-delegated ExecuTorch model: - [macOS](https://developer.apple.com/macos) >= 12.4 - [iOS](https://www.apple.com/ios) >= 15.4 ## Development Requirements To develop you need: - [Xcode](https://developer.apple.com/xcode/) >= 14.1 Before starting, make sure you install the Xcode Command Line Tools: ```bash xcode-select --install ``` ## Using the MPS Backend In this step, you will generate a simple ExecuTorch program that lowers MobileNetV3 model to the MPS delegate. You'll then pass this Program (the `.pte` file) during the runtime to run it using the MPS backend. ```bash cd executorch # Note: `mps_example` script uses by default the MPSPartitioner for ops that are not yet supported by the MPS delegate. To turn it off, pass `--no-use_partitioner`. python3 -m examples.apple.mps.scripts.mps_example --model_name="mv3" --bundled --use_fp16 # To see all options, run following command: python3 -m examples.apple.mps.scripts.mps_example --help ``` ### Runtime **Building the MPS executor runner:** ```bash # In this step, you'll be building the `mps_executor_runner` that is able to run MPS lowered modules: cd executorch ./examples/apple/mps/scripts/build_mps_executor_runner.sh ``` ## Run the mv3 generated model using the mps_executor_runner ```bash ./cmake-out/examples/apple/mps/mps_executor_runner --model_path mv3_mps_float16_bundled.pte --bundled_program ``` - You should see the following results. Note that no output file will be generated in this example: ``` I 00:00:00.003290 executorch:mps_executor_runner.mm:286] Model file mv3_mps_float16_bundled.pte is loaded. I 00:00:00.003306 executorch:mps_executor_runner.mm:292] Program methods: 1 I 00:00:00.003308 executorch:mps_executor_runner.mm:294] Running method forward I 00:00:00.003311 executorch:mps_executor_runner.mm:349] Setting up non-const buffer 1, size 606112. I 00:00:00.003374 executorch:mps_executor_runner.mm:376] Setting up memory manager I 00:00:00.003376 executorch:mps_executor_runner.mm:392] Loading method name from plan I 00:00:00.018942 executorch:mps_executor_runner.mm:399] Method loaded. I 00:00:00.018944 executorch:mps_executor_runner.mm:404] Loading bundled program... I 00:00:00.018980 executorch:mps_executor_runner.mm:421] Inputs prepared. I 00:00:00.118731 executorch:mps_executor_runner.mm:438] Model executed successfully. 
I 00:00:00.122615 executorch:mps_executor_runner.mm:501] Model verified successfully.
```

### [Optional] Run the generated model directly using pybind

1. Make sure `pybind` MPS support was installed:
```bash
CMAKE_ARGS="-DEXECUTORCH_BUILD_MPS=ON" ./install_executorch.sh
```
2. Run the `mps_example` script to trace the model and run it directly from Python:
```bash
cd executorch
# Check correctness between PyTorch eager forward pass and ExecuTorch MPS delegate forward pass
python3 -m examples.apple.mps.scripts.mps_example --model_name="mv3" --no-use_fp16 --check_correctness
# You should see the following output: `Results between ExecuTorch forward pass with MPS backend and PyTorch forward pass for mv3_mps are matching!`

# Check performance between PyTorch MPS forward pass and ExecuTorch MPS forward pass
python3 -m examples.apple.mps.scripts.mps_example --model_name="mv3" --no-use_fp16 --bench_pytorch
```

### Profiling

1. [Optional] Generate an [ETRecord](etrecord.rst) while you're exporting your model.
```bash
cd executorch
python3 -m examples.apple.mps.scripts.mps_example --model_name="mv3" --generate_etrecord -b
```
2. Run your Program on the ExecuTorch runtime and generate an [ETDump](etdump.md).
```
./cmake-out/examples/apple/mps/mps_executor_runner --model_path mv3_mps_float16_bundled.pte --bundled_program --dump-outputs
```
3. Create an instance of the Inspector API by passing in the ETDump you have sourced from the runtime along with the optionally generated ETRecord from step 1.
```bash
python3 -m devtools.inspector.inspector_cli --etdump_path etdump.etdp --etrecord_path etrecord.bin
```

## Runtime integration

***Step 1***. Create the ExecuTorch core and MPS delegate frameworks to link on iOS
```bash
cd executorch
./scripts/build_apple_frameworks.sh --mps
```
`mps_delegate.xcframework` will be in the `cmake-out` folder, along with `executorch.xcframework` and `portable_delegate.xcframework`:
```bash
cd cmake-out && ls
```
***Step 2***. Link the frameworks into your Xcode project:
Go to the project target's `Build Phases` - `Link Binaries With Libraries`, click the **+** sign, and add the frameworks located in the `Release` folder:
- `executorch.xcframework`
- `portable_delegate.xcframework`
- `mps_delegate.xcframework`

From the same page, include the needed libraries for the MPS delegate:
- `MetalPerformanceShaders.framework`
- `MetalPerformanceShadersGraph.framework`
- `Metal.framework`

In this tutorial, you have learned how to lower a model to the MPS delegate, build the `mps_executor_runner`, and run a lowered model through the MPS delegate, or directly on device using the MPS delegate static library.

---

# NXP eIQ Neutron Backend

This page introduces the NXP eIQ Neutron backend. NXP offers accelerated machine learning model inference on edge devices. To learn more about NXP's machine learning acceleration platform, please refer to [the official NXP website](https://www.nxp.com/applications/technologies/ai-and-machine-learning:MACHINE-LEARNING).
For the up-to-date status of running ExecuTorch on the Neutron backend, please visit the manual page.
## Features

ExecuTorch v1.0 supports running machine learning models on selected NXP chips (for now only the i.MXRT700). Currently supported machine learning models include:
- Convolution-based neural networks
- Full support for MobileNetV2 and CifarNet

## Target Requirements

- Hardware with NXP's [i.MXRT700](https://www.nxp.com/products/i.MX-RT700) chip or an evaluation board such as the MIMXRT700-EVK.

## Development Requirements

- [MCUXpresso IDE](https://www.nxp.com/design/design-center/software/development-software/mcuxpresso-software-and-tools-/mcuxpresso-integrated-development-environment-ide:MCUXpresso-IDE) or [MCUXpresso Visual Studio Code extension](https://www.nxp.com/design/design-center/software/development-software/mcuxpresso-software-and-tools-/mcuxpresso-for-visual-studio-code:MCUXPRESSO-VSC)
- [MCUXpresso SDK 25.06](https://mcuxpresso.nxp.com/mcuxsdk/25.06.00/html/index.html)
- eIQ Neutron Converter for MCUXpresso SDK 25.06, which you can download from the eIQ PyPI repository:

```commandline
$ pip install --index-url https://eiq.nxp.com/repository neutron_converter_SDK_25_06
```

Instead of installing the requirements manually (except for the MCUXpresso IDE and SDK), you can use the setup script:

```commandline
$ ./examples/nxp/setup.sh
```

## Using NXP eIQ Backend

To test converting a neural network model for inference on the NXP eIQ Neutron backend, you can use our example script:

```shell
# cd to the root of executorch repository
./examples/nxp/aot_neutron_compile.sh [model (cifar10 or mobilenetv2)]
```

For a quick overview of how to convert a custom PyTorch model, take a look at our [example python script](https://github.com/pytorch/executorch/tree/release/1.0/examples/nxp/aot_neutron_compile.py).

## Runtime Integration

To learn how to run the converted model on NXP hardware, use one of the ExecuTorch runtime example projects from the MCUXpresso IDE example projects list. For a more fine-grained tutorial, visit [this manual page](https://mcuxpresso.nxp.com/mcuxsdk/latest/html/middleware/eiq/executorch/docs/nxp/topics/example_applications.html).

## Reference

**→{doc}`nxp-partitioner` — Partitioner options.**

**→{doc}`nxp-quantization` — Supported quantization schemes.**

**→{doc}`tutorials/nxp-tutorials` — Tutorials.**

```{toctree}
:maxdepth: 2
:hidden:
:caption: NXP Backend

nxp-partitioner
nxp-quantization
tutorials/nxp-tutorials
```

---

# NXP eIQ Neutron Quantization

The eIQ Neutron NPU requires the delegated operators to be quantized. To quantize the PyTorch model for the Neutron backend, use the `NeutronQuantizer` from `backends/nxp/quantizer/neutron_quantizer.py`. The `NeutronQuantizer` is configured to quantize the model with the quantization scheme supported by the eIQ Neutron NPU.

### Supported Quantization Schemes

The Neutron delegate supports the following quantization schemes:
- Static quantization with 8-bit symmetric weights and 8-bit asymmetric activations (via the PT2E quantization flow), per-tensor granularity.
- The following operators are currently supported:
  - `aten.abs.default`
  - `aten.adaptive_avg_pool2d.default`
  - `aten.addmm.default`
  - `aten.add.Tensor`
  - `aten.avg_pool2d.default`
  - `aten.cat.default`
  - `aten.conv1d.default`
  - `aten.conv2d.default`
  - `aten.dropout.default`
  - `aten.flatten.using_ints`
  - `aten.hardtanh.default`
  - `aten.hardtanh_.default`
  - `aten.linear.default`
  - `aten.max_pool2d.default`
  - `aten.mean.dim`
  - `aten.mul.Tensor`
  - `aten.pad.default`
  - `aten.permute.default`
  - `aten.relu.default` and `aten.relu_.default`
  - `aten.reshape.default`
  - `aten.view.default`
  - `aten.softmax.int`
  - `aten.tanh.default`, `aten.tanh_.default`
  - `aten.sigmoid.default`
  - `aten.slice_copy.Tensor`

### Static 8-bit Quantization Using the PT2E Flow

To perform 8-bit quantization with the PT2E flow, perform the following steps prior to exporting the model to edge:

1) Create an instance of the `NeutronQuantizer` class.
2) Use `torch.export.export` to export the model to ATen Dialect.
3) Call `prepare_pt2e` with the instance of the `NeutronQuantizer` to annotate the model with observers for quantization.
4) As static quantization is required, run the prepared model with representative samples to calibrate the quantized tensor activation ranges.
5) Call `convert_pt2e` to quantize the model.
6) Export and lower the model using the standard flow.

The output of `convert_pt2e` is a PyTorch model which can be exported and lowered using the normal flow. As it is a regular PyTorch model, it can also be used to evaluate the accuracy of the quantized model using standard PyTorch techniques.

To quantize the model, you can use the PT2E workflow:

```python
import torch
import torchvision.models as models
from torchvision.models.mobilenetv2 import MobileNet_V2_Weights

from executorch.backends.nxp.quantizer.neutron_quantizer import NeutronQuantizer
from executorch.backends.nxp.backend.neutron_target_spec import NeutronTargetSpec
from executorch.backends.nxp.neutron_partitioner import NeutronPartitioner
from executorch.backends.nxp.nxp_backend import generate_neutron_compile_spec
from executorch.exir import to_edge_transform_and_lower
from torchao.quantization.pt2e.quantize_pt2e import convert_pt2e, prepare_pt2e

model = models.mobilenetv2.mobilenet_v2(weights=MobileNet_V2_Weights.DEFAULT).eval()
sample_inputs = (torch.randn(1, 3, 224, 224), )

target_spec = NeutronTargetSpec(target="imxrt700", converter_flavor="SDK_25_09")
quantizer = NeutronQuantizer(neutron_target_spec=target_spec)  # (1)

training_ep = torch.export.export(model, sample_inputs).module()  # (2)
prepared_model = prepare_pt2e(training_ep, quantizer)  # (3)

for cal_sample in [torch.randn(1, 3, 224, 224)]:  # Replace with representative model inputs
    prepared_model(cal_sample)  # (4) Calibrate

quantized_model = convert_pt2e(prepared_model)  # (5)

compile_spec = generate_neutron_compile_spec(
    "imxrt700",
    operators_not_to_delegate=None,
    neutron_converter_flavor="SDK_25_06",
)
et_program = to_edge_transform_and_lower(  # (6)
    torch.export.export(quantized_model, sample_inputs),
    partitioner=[NeutronPartitioner(compile_spec=compile_spec)],
).to_executorch()
```

Alternatively, you can use the predefined post-training quantization helper from the NXP backend implementation:

```python
from executorch.backends.nxp.quantizer.neutron_quantizer import NeutronQuantizer
from executorch.backends.nxp.backend.neutron_target_spec import NeutronTargetSpec
from executorch.backends.nxp.quantizer.utils import calibrate_and_quantize

...
target_spec = NeutronTargetSpec(target="imxrt700", converter_flavor="SDK_25_09")

quantized_graph_module = calibrate_and_quantize(
    aten_model,
    calibration_inputs,
    NeutronQuantizer(neutron_target_spec=target_spec),
)
```

See [PyTorch 2 Export Post Training Quantization](https://docs.pytorch.org/ao/main/tutorials_source/pt2e_quant_ptq.html) for more information.

---

# Preparing a Model for NXP eIQ Neutron Backend

This guide demonstrates the use of the ExecuTorch AoT flow to convert a PyTorch model to the ExecuTorch format and delegate the model computation to the eIQ Neutron NPU using the eIQ Neutron Backend.

## Step 1: Environment Setup

This tutorial is intended to be run on Linux and uses Conda or a virtual environment for Python environment management. For full setup details and system requirements, see [Getting Started with ExecuTorch](/getting-started).

Create a Conda environment and install the ExecuTorch Python package.

```bash
conda create -y --name executorch python=3.12
conda activate executorch
conda install executorch
```

Run the setup.sh script to install the neutron-converter:

```commandline
$ ./examples/nxp/setup.sh
```

## Step 2: Model Preparation and Running the Model on Target

See the example `aot_neutron_compile.py` and its [README](https://github.com/pytorch/executorch/blob/release/1.0/examples/nxp/README.md) file.

---

# NXP Tutorials

**→{doc}`nxp-basic-tutorial` — Lower and run a model on the NXP eIQ Neutron backend.**

```{toctree}
:hidden:
:maxdepth: 1

nxp-basic-tutorial
```

---

# Samsung Exynos Backend

ExecuTorch's Samsung Exynos backend enables the execution of ExecuTorch models on Samsung SoCs via the NPU/DSP. The delegate is built on top of the [Samsung Exynos AI Litecore SDK](https://soc-developer.semiconductor.samsung.com/global/development/ai-litecore).

## Features

- Wide range of operator support
- Supported inference precisions:
  - FP16
  - 8-bit statically quantized (int8/uint8)
  - 16-bit statically quantized (int16/uint16)

## Target Requirements

Currently, the Samsung Exynos backend is supported only for devices with the following chipsets:
- Exynos 2500 (E9955)

## Development Requirements

The [Samsung Exynos AI Litecore SDK](https://soc-developer.semiconductor.samsung.com/global/development/ai-litecore) is required to build the Exynos backend from source, and is also required to export models to the Exynos delegate.

----

## Using the Samsung Exynos Backend

To target the Exynos backend during the export and lowering process, pass an instance of the `EnnPartitioner` to `to_edge_transform_and_lower`. The example below demonstrates this process using the MobileNet V2 model from torchvision.
```python
import torch
import torchvision.models as models
from torchvision.models.mobilenetv2 import MobileNet_V2_Weights

from executorch.backends.samsung.partition.enn_partitioner import EnnPartitioner
from executorch.backends.samsung.serialization.compile_options import (
    gen_samsung_backend_compile_spec,
)
from executorch.exir import to_edge_transform_and_lower

mobilenet_v2 = models.mobilenetv2.mobilenet_v2(weights=MobileNet_V2_Weights.DEFAULT).eval()
sample_inputs = (torch.randn(1, 3, 224, 224), )

chipset = "E9955"
compile_specs = [gen_samsung_backend_compile_spec(chipset)]

et_program = to_edge_transform_and_lower(
    torch.export.export(mobilenet_v2, sample_inputs),
    partitioner=[EnnPartitioner(compile_specs)],
).to_executorch()

with open("mv2_exynos.pte", "wb") as file:
    et_program.write_to_file(file)
```

See [Partitioner API](samsung-partitioner.md) for a reference on available partitioner options.

----

## Quantization

The Samsung Exynos backend supports statically quantized models with 8-bit and 16-bit integral types. See [Samsung Exynos Quantization](samsung-quantization.md) for more information on available quantization schemes and APIs.

----

## Runtime Integration

To run the model on-device, use the standard ExecuTorch runtime APIs. The Exynos backend is currently not available in any of ExecuTorch's published packages. To access it, build ExecuTorch from source.

When building from source, pass `-DEXECUTORCH_BUILD_EXYNOS=ON` when configuring the CMake build. See [Running on Device](/getting-started.md#running-on-device) for more information.

Then, to link against the backend, add the `executorch_backends` CMake target as a build dependency.

```
# CMakeLists.txt
add_subdirectory("executorch")
...
target_link_libraries(
    my_target
    PRIVATE
    executorch
    executorch_backends
    ...
)
```

No additional steps are necessary to use the backend beyond linking the target. Any Exynos-delegated .pte file will automatically run on the registered backend.

## Reference

**→{doc}`samsung-partitioner` — Partitioner options.**

**→{doc}`samsung-quantization` — Supported quantization schemes.**

**→{doc}`samsung-op-support` — Supported operators.**

```{toctree}
:maxdepth: 2
:hidden:
:caption: Exynos Backend

samsung-partitioner
samsung-quantization
samsung-op-support
```

---

# Partitioner API

The `EnnPartitioner` API is the primary entrypoint when exporting a model to the Samsung Exynos backend. The partitioner is responsible for determining which parts of the model should be lowered to the backend and also provides an interface for configuring the behaviour of the backend.

Currently, the configuration options for `EnnPartitioner` can be generated automatically using the `gen_samsung_backend_compile_spec` API. For instance,

```python
from executorch.backends.samsung.partition.enn_partitioner import EnnPartitioner
from executorch.backends.samsung.serialization.compile_options import (
    gen_samsung_backend_compile_spec,
)
from executorch.exir import to_edge_transform_and_lower

chipset = "E9955"
compile_specs = [gen_samsung_backend_compile_spec(chipset)]

et_program = to_edge_transform_and_lower(
    exported_program,
    partitioner=[EnnPartitioner(compile_specs)],
).to_executorch()
```

At the moment, only `"E9955"` is supported as a valid chipset name, which corresponds to the Exynos 2500 SoC. Support for additional chipsets will be added in the future.

---

# Quantization

The Exynos backend currently supports executing statically quantized 8-bit models.
### 8-bit quantization with the PT2E quantization flow To perform 8-bit quantization with the PT2E flow, perform the following steps prior to exporting the model: 1) Create an instance of the `EnnQuantizer` class and set the desired quantization behaviour. 2) Use `torch.export.export` to obtain a graph module representation of the source model. 3) Use `prepare_pt2e` to prepare the model for quantization. 4) Execute the prepared model with representative samples to calibrate the quantizated tensor activation ranges. 5) Use `convert_pt2e` to quantize the model. 6) Export and lower the model using the standard export flow. The output of `convert_pt2e` is a PyTorch model which can be exported and lowered using the same export flow as non-quantized models. As it is a regular PyTorch model, it can also be used to evaluate the accuracy of the quantized model using standard PyTorch techniques. The below example shows how to quantize a MobileNetV2 model using the PT2E quantization flow. ```python import torch import torchvision.models as models from torchvision.models.mobilenetv2 import MobileNet_V2_Weights from executorch.backends.samsung.partition.enn_partitioner import EnnPartitioner from executorch.backends.samsung.quantizer.quantizer import EnnQuantizer, Precision from executorch.exir import to_edge_transform_and_lower from torchao.quantization.pt2e.quantize_pt2e import convert_pt2e, prepare_pt2e model = models.mobilenetv2.mobilenet_v2(weights=MobileNet_V2_Weights.DEFAULT).eval() sample_inputs = (torch.randn(1, 3, 224, 224), ) # Currently, "A8W8" is the only supported precision mode precision = "A8W8" is_per_channel = True is_qat = False quantizer = EnnQuantizer() quantizer.set_quant_params(precision, is_per_channel, is_qat) # (1) training_ep = torch.export.export(model, sample_inputs).module() # (2) prepared_model = prepare_pt2e(training_ep, quantizer) # (3) for cal_sample in [torch.randn(1, 3, 224, 224)]: # Replace with representative model inputs prepared_model(cal_sample) # (4) Calibrate quantized_model = convert_pt2e(prepared_model) # (5) et_program = to_edge_transform_and_lower( # (6) torch.export.export(quantized_model, sample_inputs), partitioner=[EnnPartitioner()], ).to_executorch() ``` See [PyTorch 2 Export Post Training Quantization](https://docs.pytorch.org/ao/main/tutorials_source/pt2e_quant_ptq.html) for more information. --- # {BACKEND_NAME} Architecture and Internals This page covers internal implementation details of the backend, and is mainly aimed at contributors and heavy power users. This is an optional page for each backend and has no set structure. Some topics to consider: * High-level design of the backend * Details on the lowering flow * Internal debugging tools and techniques --- # Backend Template Provide a brief overview/description of the backend. At a high-level, what does it do? Consider linking to top-level vendor documentation for the target hardware family and/or framework (Core ML, XNNPACK, etc.). ## Features List high-level features of backend, such as operator and hardware support. ## Target Requirements What hardware and software is required to run the backend on a specific device? For example, does it require specific iOS or Android OS versions? If it's an NPU, what hardware models are supported? ## Development Requirements What software and hardware is needed to create a .PTE file targeting this backend? Are there any additional dependencies that need to be installed that are not included with the ExecuTorch pip package? How does the user install them? 
## Using *Backend Name* This section describes the steps users need to take in order to generate a .PTE targeting this backend. Include a full code sample for exporting and lowering a model to this backend. Make sure relevant imports for the backend partitioner are included. ## Runtime Integration This section is intended to tell the user all of the steps they'll need to take to be able to run a .PTE file on-device that is targeting the given backend. - What CMake targets should they link to? - How is this backend compiled from source? - Is the backend bundled by default in iOS and/or Android pre-built libraries? ## Reference **→{doc}`backend-partitioner` — Partitioner options.** **→{doc}`backend-quantization` — Supported quantization schemes.** **→{doc}`backend-troubleshooting` — Debug common issues.** **→{doc}`backend-arch-internals` — Backend internals.** **→{doc}`tutorials/backend-tutorials` — Tutorials.** **→{doc}`guides/backend-guides` — Tutorials.** ```{toctree} :maxdepth: 2 :hidden: :caption: {BACKEND} Backend backend-troubleshooting backend-partitioner backend-quantization backend-op-support backend-arch-internals tutorials/backend-tutorials guides/backend-guides ``` --- # {BACKEND_NAME} Quantization Document quantization schemes and flows for the backend. This should include a description of each scheme and a code example to perform quantization. Example sections for PT2E and quantize_ are included below, to be replaced with details for the target backend. For each supported quantization scheme, include the following: * What is the quantization scheme? * How are weights quantized? * How are activations quantized? Static or dynamic? * How many bits? * What is the granularity? Per-tensor, per-channel, group/block-wise? * What are the steps to quantize a model with this scheme? * Include a code sample. * If the quantization flow only supports a small set of operators - for example, linear only - note this. ### Supported Quantization Schemes The {BACKEND_NAME} delegate supports the following quantization schemes: - {QUANTIZATION_SCHEME_1} - {QUANTIZATION_SCHEME_2} ### {QUANTIZATION_METHOD_1} using the PT2E Flow [Description] [Code Sample] ### LLM Quantization with quantize_ [Description] [Code Sample] --- # {BACKEND_NAME} Troubleshooting This page describes common issues that you may encounter when using the {BACKEND_NAME} backend and how to debug and resolve them. ## {COMMON_ISSUE_1} {ISSUE_DESCRIPTION_1} {SOLUTION_STEPS_1} ## {COMMON_ISSUE_2} {ISSUE_DESCRIPTION_2} {SOLUTION_STEPS_2} --- # Using {FEATURE} on {BACKEND_NAME} This is a placeholder guide. --- # {BACKEND_NAME} Guides **→{doc}`{backend_name}-basic-guide` — Guide description.** ```{toctree} :hidden: :maxdepth: 1 {backend_name}-basic-guides ``` --- # Preparing a Model for {BACKEND_NAME} This is a placeholder tutorial. ## Step 1: Environment Setup This tutorial is intended to be run from a {SUPPORTED_HOST_OS} and uses Conda for Python environment management. For full setup details and system requirements, see [Getting Started with ExecuTorch](/getting-started). Create a Conda environment and install the ExecuTorch Python package. ```bash conda create -y --name executorch python=3.12 conda activate executorch conda install executorch ``` {ADDITIONAL_SETUP_STEPS} ## Step 2: Model Preparation Create a python file named `export_{model_filename}.py`. This script will be responsible for loading the {EXAMPLE_MODEL} model from {MODEL_SOURCE} and create a {BACKEND_NAME}-targeted .pte file. 
```py # export_{model_filename}.py from executorch.backends.{backend_name}.partition.{backend_name}_partitioner import {BackendName}Partitioner from executorch.exir import to_edge_transform_and_lower import torch import {MODEL_IMPORT} ``` ### Model Instantiation and Example Inputs Instantiate the {EXAMPLE_MODEL} model from [{MODEL_SOURCE}]({MODEL_SOURCE_URL}). The export process also needs an example model input to trace the model. The model takes {MODEL_INPUT_DESCRIPTION}, so we'll create {INPUT_TUPLE_DESCRIPTION}. ```py model = {MODEL_INSTANTIATION_CODE} example_inputs = ({EXAMPLE_INPUTS},) ``` ### Lower the Model Next, export and lower the model to ExecuTorch. Note that the `{BackendName}Partitioner` passed to the `partitioner` parameter tells ExecuTorch to target the {BACKEND_NAME} backend. ```py exported_program = torch.export.export(model, example_inputs) executorch_program = to_edge_transform_and_lower( exported_program, partitioner=[{BackendName}Partitioner()], ).to_executorch() executorch_program.save("{model_filename}_{backend_name}.pte") ``` ### Run the Script Save the above script to export_{model_filename}.py and run the script. You should see a file named `{model_filename}_{backend_name}.pte` in the current directory. ```bash python export_{model_filename}.py ``` ## Step 3: Running the Model The .pte file created in the previous step can be run on a variety of devices, including {SUPPORTED_PLATFORMS}. ExecuTorch provides runtime APIs and language bindings for a variety of platforms. This tutorial will demonstrate running the model on a desktop using the Python runtime. ### Smoke Test First, we'll verify that the model loads and runs correctly by running the model with {TEST_INPUT_DESCRIPTION}. Create a new script, named `run_{model_filename}.py`, and add the following code. ```py # run_{model_filename}.py from executorch.runtime import Runtime import torch runtime = Runtime.get() input_tensor = {TEST_INPUT_TENSOR} program = runtime.load_program("{model_filename}_{backend_name}.pte") method = program.load_method("forward") outputs = method.execute([input_tensor])[0] print(outputs) ``` When running the script with `python run_{model_filename}.py`, you should see {EXPECTED_OUTPUT_DESCRIPTION} printed to the console. ``` {EXPECTED_OUTPUT_EXAMPLE} ``` # Next Steps - See [Edge Platforms](/edge-platforms-section) to deploy the .pte file on {SUPPORTED_PLATFORMS}. - See [Model Export and Lowering](/using-executorch-export) for more information on model preparation. - See [{BACKEND_NAME} Overview](/backends/{backend_name}/{backend_name}-overview) for more information about the {BACKEND_NAME} backend. --- # {BACKEND_NAME} Tutorials **→{doc}`{backend_name}-basic-tutorial` — Lower and run a model on the {BACKEND_NAME} backend.** ```{toctree} :hidden: :maxdepth: 1 {backend_name}-basic-tutorial ``` --- # Exporting Llama 3.2 1B/3B Instruct to ExecuTorch Vulkan and running on device This tutorial assumes that you have a working local copy of the ExecuTorch repo, and have gone through the steps to install the executorch pip package or have installed it by building from source. This tutorial also assumes that you have the Android SDK tools installed and that you are able to connect to an Android device via `adb`. Finally, the Android NDK should also be installed, and your environment should have a variable `ANDROID_NDK` that points to the root directory of the NDK. 
```shell export ANDROID_NDK= ``` ## Download the Llama 3.2 1B/3B Instruct model checkpoint and tokenizer The model checkpoint and tokenizer can be downloaded from the [Meta Llama website](https://www.llama.com/llama-downloads/). The model files should be downloaded to `~/.llama/checkpoints/Llama3.2-1B-Instruct`. ## Export the Llama 3.2 1B/3B model First, navigate to the root of the ExecuTorch repo. ```shell # Navigate to executorch root cd ~/executorch ``` Then, set some environment variables to describe how the model should be exported. Feel free to tune the values to your preferences. ```shell export LLM_NAME=Llama3.2 && \ export LLM_SIZE=1B && \ export LLM_SUFFIX="-Instruct" && \ export QUANT=8da4w && \ export BACKEND=vulkan && \ export GROUP_SIZE=64 && \ export CONTEXT_LENGTH=2048 ``` Then, export the Llama 3.2 1B/3B Instruct model to ExecuTorch Vulkan. Note that that `--vulkan-force-fp16` flag is set, which will improve model inference latency at the cost of model accuracy. Feel free to remove this flag. ```shell python -m examples.models.llama.export_llama \ -c $HOME/.llama/checkpoints/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/consolidated.00.pth \ -p $HOME/.llama/checkpoints/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/params.json \ -d fp32 --${BACKEND} \ -qmode ${QUANT} -G ${GROUP_SIZE} \ --max_seq_length ${CONTEXT_LENGTH} \ --max_context_length ${CONTEXT_LENGTH} \ -kv --use_sdpa_with_kv_cache \ --metadata '{"append_eos_to_prompt": 0, "get_bos_id":128000, "get_eos_ids":[128009, 128001]}' \ --model "llama3_2" \ --output_name $HOME/.llama/checkpoints/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}_${BACKEND}_${QUANT}_g${GROUP_SIZE}_c${CONTEXT_LENGTH}.pte ``` After exporting the model, push the exported `.pte` file and the tokenizer to your device. ```shell adb shell mkdir -p /data/local/tmp/llama && \ adb push ~/.llama/checkpoints/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/tokenizer.model \ /data/local/tmp/llama/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}_tokenizer.model && \ adb push ~/.llama/checkpoints/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}_${BACKEND}_${QUANT}_g${GROUP_SIZE}_c${CONTEXT_LENGTH}.pte \ /data/local/tmp/llama/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}_${BACKEND}_${QUANT}_g${GROUP_SIZE}_c${CONTEXT_LENGTH}.pte ``` ## Build Core Executorch Components To be able to run the `.pte` file on device, first the core libraries, including the Vulkan backend, must be compiled for Android. ```shell cmake . \ -DCMAKE_INSTALL_PREFIX=cmake-out-android-so \ -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \ -DANDROID_SUPPORT_FLEXIBLE_PAGE_SIZES=ON \ --preset "android-arm64-v8a" \ -DANDROID_PLATFORM=android-28 \ -DPYTHON_EXECUTABLE=python \ -DCMAKE_BUILD_TYPE=Release \ -DEXECUTORCH_PAL_DEFAULT=posix \ -DEXECUTORCH_BUILD_LLAMA_JNI=ON \ -DEXECUTORCH_BUILD_EXTENSION_NAMED_DATA_MAP=ON \ -DEXECUTORCH_BUILD_VULKAN=ON \ -DEXECUTORCH_BUILD_TESTS=OFF \ -Bcmake-out-android-so && \ cmake --build cmake-out-android-so -j16 --target install --config Release ``` ## Build and push the llama runner binary to Android Then, build a binary that can be used to run the `.pte` file. 
```shell cmake examples/models/llama \ -DCMAKE_INSTALL_PREFIX=cmake-out-android-so \ -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \ -DANDROID_SUPPORT_FLEXIBLE_PAGE_SIZES=ON \ -DEXECUTORCH_ENABLE_LOGGING=ON \ -DANDROID_ABI=arm64-v8a \ -DANDROID_PLATFORM=android-28 \ -DCMAKE_BUILD_TYPE=Release \ -DPYTHON_EXECUTABLE=python \ -Bcmake-out-android-so/examples/models/llama && \ cmake --build cmake-out-android-so/examples/models/llama -j16 --config Release ``` Once the binary is built, it can be pushed to your Android device. ```shell adb shell mkdir /data/local/tmp/etvk/ && \ adb push cmake-out-android-so/examples/models/llama/llama_main /data/local/tmp/etvk/ ``` ## Execute the llama runner binary Finally, we can execute the lowered `.pte` file on your device. ```shell adb shell /data/local/tmp/etvk/llama_main \ --model_path=/data/local/tmp/llama/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}_${BACKEND}_${QUANT}_g${GROUP_SIZE}_c${CONTEXT_LENGTH}.pte \ --tokenizer_path=/data/local/tmp/llama/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}_tokenizer.model \ --temperature=0 --seq_len=400 --warmup \ --prompt=\"\<\|begin_of_text\|\>\<\|start_header_id\|\>system\<\|end_header_id\|\>Write me a short poem.\<\|eot_id\|\>\<\|start_header_id\|\>assistant\<\|end_header_id\|\>\" ``` Here is some sample output captured from a Galaxy S24: ```shell E tokenizers:hf_tokenizer.cpp:60] Error parsing json file: [json.exception.parse_error.101] parse error at line 1, column 1: syntax error while parsing value - invalid literal; last read: 'I' <|begin_of_text|><|start_header_id|>system<|end_header_id|>Write me a short poem.<|eot_id|><|start_header_id|>assistant<|end_header_id|> Here is a short poem I came up with: "Moonlight whispers secrets to the night A gentle breeze that rustles the light The stars up high, a twinkling show A peaceful world, where dreams grow slow" I hope you enjoy it!<|eot_id|> PyTorchObserver {"prompt_tokens":14,"generated_tokens":54,"model_load_start_ms":1760077800721,"model_load_end_ms":1760077802998,"inference_start_ms":1760077802998,"inference_end_ms":1760077804187,"prompt_eval_end_ms":1760077803162,"first_token_ms":1760077803162,"aggregate_sampling_time_ms":19,"SCALING_FACTOR_UNITS_PER_SECOND":1000} Prompt Tokens: 14 Generated Tokens: 54 Model Load Time: 2.277000 (seconds) Total inference time: 1.189000 (seconds) Rate: 45.416316 (tokens/second) Prompt evaluation: 0.164000 (seconds) Rate: 85.365854 (tokens/second) Generated 54 tokens: 1.025000 (seconds) Rate: 52.682927 (tokens/second) Time to first generated token: 0.164000 (seconds) Sampling time over 68 tokens: 0.019000 (seconds) ``` --- # Executing and profiling an ExecuTorch Vulkan model on device This tutorial assumes that you have a working local copy of the ExecuTorch repo, and have gone through the steps to install the executorch pip package or have installed it by building from source. This tutorial also assumes that you have the Android SDK tools installed and that you are able to connect to an Android device via `adb`. Finally, the Android NDK should also be installed, and your environment should have a variable `ANDROID_NDK` that points to the root directory of the NDK. ```shell export ANDROID_NDK= ``` ## Lower a model to ExecuTorch Vulkan and obtain the `.pte` file The commands in this tutorial are assumed to be executed from ExecuTorch's root directory. 
```shell cd ~/executorch ``` For this tutorial, we will use the export script in [`executorch/examples/vulkan/export.py`](https://github.com/pytorch/executorch/tree/main/examples/vulkan), however any method of generating a `.pte` file will suffice. In this tutorial, the InceptionV3 model is exported. ```shell python -m examples.vulkan.export --model_name=ic3 -o . -fp16 ``` After exporting, there should be a file called `ic3_vulkan.pte` in the root directory of ExecuTorch. Feel free to modify the `-o` argument of the script to control where the `.pte` file will be stored. Then, push the `.pte` file to device. ```shell adb shell mkdir -p /data/local/tmp/etvk/models/ && \ adb push ic3_vulkan.pte /data/local/tmp/etvk/models/ic3_vulkan.pte ``` ## Build the `executor_runner` binary and push to device To be able to run the `.pte` file on device, first the core libraries, including the Vulkan backend, must be compiled for Android. Note that `-DEXECUTORCH_ENABLE_EVENT_TRACER=ON` is used to turn on profiling, and `-DEXECUTORCH_BUILD_EXECUTOR_RUNNER=ON` is used to build the runner binary that will be used to execute and profile the `.pte` file. ```shell cmake . \ -DCMAKE_INSTALL_PREFIX=cmake-out-android-so \ -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \ -DANDROID_SUPPORT_FLEXIBLE_PAGE_SIZES=ON \ --preset "android-arm64-v8a" \ -DANDROID_PLATFORM=android-28 \ -DPYTHON_EXECUTABLE=python \ -DCMAKE_BUILD_TYPE=Release \ -DEXECUTORCH_PAL_DEFAULT=posix \ -DEXECUTORCH_BUILD_LLAMA_JNI=ON \ -DEXECUTORCH_BUILD_EXTENSION_NAMED_DATA_MAP=ON \ -DEXECUTORCH_BUILD_VULKAN=ON \ -DEXECUTORCH_BUILD_TESTS=OFF \ -DEXECUTORCH_BUILD_EXTENSION_EVALUE_UTIL=ON \ -DEXECUTORCH_BUILD_EXECUTOR_RUNNER=ON \ -DEXECUTORCH_ENABLE_EVENT_TRACER=ON \ -Bcmake-out-android-so && \ cmake --build cmake-out-android-so -j16 --target install --config Release ``` Once the build completes, we can push the runner binary to device. ```shell adb push cmake-out-android-so/executor_runner /data/local/tmp/etvk/executor_runner ``` ## Execute the `.pte` file Finally, we can execute the lowered `.pte` file on your device. 
To test run the model file without profiling: ```shell adb shell /data/local/tmp/etvk/executor_runner \ --model_path /data/local/tmp/etvk/models/ic3_vulkan.pte ``` Now, with profiling: ```shell MODEL_NAME=ic3 && \ BACKEND=vulkan && \ NUM_ITERS=3 && \ adb shell mkdir -p /data/local/tmp/etvk/etdumps/ && \ adb shell /data/local/tmp/etvk/executor_runner \ --model_path /data/local/tmp/etvk/models/${MODEL_NAME}_${BACKEND}.pte \ --num_executions=${NUM_ITERS} \ --etdump_path /data/local/tmp/etvk/etdumps/${MODEL_NAME}_${BACKEND}.etdp && \ adb pull /data/local/tmp/etvk/etdumps/${MODEL_NAME}_${BACKEND}.etdp ${MODEL_NAME}_${BACKEND}.etdp && \ adb shell rm /data/local/tmp/etvk/etdumps/${MODEL_NAME}_${BACKEND}.etdp && \ python devtools/inspector/inspector_cli.py \ --etdump_path ${MODEL_NAME}_${BACKEND}.etdp ``` Here is some sample (tailed) output from a Samsung Galaxy S24: ```shell ├─────┼────────────────────┼────────────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼────────────┼───────────────────┼─────────────────────────┼────────────────────────────────────────────────────────┤ │ 165 │ Execute │ conv2d_clamp_half_163 │ 0.345082 │ 0.346164 │ 0.346247 │ 0.345748 │ 0.344812 │ 0.346268 │ [] │ True │ │ [2081488974948084, 2081488995911052, 2081489016763676] │ ├─────┼────────────────────┼────────────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼────────────┼───────────────────┼─────────────────────────┼────────────────────────────────────────────────────────┤ │ 166 │ Execute │ conv2d_clamp_half_164 │ 0.306124 │ 0.30654 │ 0.306998 │ 0.306557 │ 0.30602 │ 0.307112 │ [] │ True │ │ [2081488975294716, 2081488996256228, 2081489017110204] │ ├─────┼────────────────────┼────────────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼────────────┼───────────────────┼─────────────────────────┼────────────────────────────────────────────────────────┤ │ 167 │ Execute │ set_zero_int32_165 │ 0.00240245 │ 0.00244403 │ 0.00248561 │ 0.00244403 │ 0.00239205 │ 0.002496 │ [] │ True │ │ [2081488975601100, 2081488996563132, 2081489017417680] │ ├─────┼────────────────────┼────────────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼────────────┼───────────────────┼─────────────────────────┼────────────────────────────────────────────────────────┤ │ 168 │ Execute │ concat_2_texture3d_half_166 │ 0.0122305 │ 0.01248 │ 0.0125634 │ 0.0124108 │ 0.0121682 │ 0.0125842 │ [] │ True │ │ [2081488975603960, 2081488996565940, 2081489017420436] │ ├─────┼────────────────────┼────────────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼────────────┼───────────────────┼─────────────────────────┼────────────────────────────────────────────────────────┤ │ 169 │ Execute │ set_zero_int32_167 │ 0.00157056 │ 0.00161195 │ 0.00161214 │ 0.00159478 │ 0.00156021 │ 0.00161219 │ [] │ True │ │ [2081488975616804, 2081488996578888, 2081489017432968] │ ├─────┼────────────────────┼────────────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼────────────┼───────────────────┼─────────────────────────┼────────────────────────────────────────────────────────┤ │ 170 │ Execute │ concat_3_texture3d_half_168 │ 0.0420369 │ 0.0423281 │ 0.0427857 │ 0.0423974 │ 
0.0419641 │ 0.0429001 │ [] │ True │ │ [2081488975618728, 2081488996580864, 2081489017434944] │ ├─────┼────────────────────┼────────────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼────────────┼───────────────────┼─────────────────────────┼────────────────────────────────────────────────────────┤ │ 171 │ Execute │ update_concat_offset_3_int32_169 │ 0.00261035 │ 0.00265193 │ 0.00265212 │ 0.00263468 │ 0.00259995 │ 0.00265217 │ [] │ True │ │ [2081488975661992, 2081488996623556, 2081489017477272] │ ├─────┼────────────────────┼────────────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼────────────┼───────────────────┼─────────────────────────┼────────────────────────────────────────────────────────┤ │ 172 │ Execute │ concat_1_texture3d_half_170 │ 0.00758157 │ 0.00774789 │ 0.00803914 │ 0.00779994 │ 0.00753999 │ 0.00811195 │ [] │ True │ │ [2081488975664956, 2081488996626572, 2081489017480288] │ ├─────┼────────────────────┼────────────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼────────────┼───────────────────┼─────────────────────────┼────────────────────────────────────────────────────────┤ │ 173 │ Execute │ mean2d_half_171 │ 0.0147889 │ 0.0148721 │ 0.0150384 │ 0.0149067 │ 0.0147681 │ 0.01508 │ [] │ True │ │ [2081488975673432, 2081488996634476, 2081489017488400] │ ├─────┼────────────────────┼────────────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼────────────┼───────────────────┼─────────────────────────┼────────────────────────────────────────────────────────┤ │ 174 │ Execute │ view_half_172 │ 0.00644803 │ 0.00644803 │ 0.00653119 │ 0.00648268 │ 0.00644803 │ 0.00655198 │ [] │ True │ │ [2081488975688876, 2081488996649712, 2081489017503532] │ ├─────┼────────────────────┼────────────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼────────────┼───────────────────┼─────────────────────────┼────────────────────────────────────────────────────────┤ │ 175 │ Execute │ view_half_173 │ 0.00488806 │ 0.00488806 │ 0.00488806 │ 0.00488806 │ 0.00488806 │ 0.00488806 │ [] │ True │ │ [2081488975695688, 2081488996656524, 2081489017510448] │ ├─────┼────────────────────┼────────────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼────────────┼───────────────────┼─────────────────────────┼────────────────────────────────────────────────────────┤ │ 176 │ Execute │ linear_naive_texture3d_half_174 │ 0.586726 │ 0.590096 │ 0.595338 │ 0.590876 │ 0.585884 │ 0.596648 │ [] │ True │ │ [2081488975700940, 2081488996661776, 2081489017515700] │ ├─────┼────────────────────┼────────────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼────────────┼───────────────────┼─────────────────────────┼────────────────────────────────────────────────────────┤ │ 177 │ Execute │ image_to_nchw_texture3d_half_float_175 │ 0.00270395 │ 0.00270414 │ 0.00274572 │ 0.00272139 │ 0.00270391 │ 0.00275612 │ [] │ True │ │ [2081488976297952, 2081488997248024, 2081489018106160] │ 
├─────┼────────────────────┼────────────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼────────────┼───────────────────┼─────────────────────────┼────────────────────────────────────────────────────────┤ │ 178 │ Execute │ DELEGATE_CALL │ 20.8864 │ 20.9461 │ 21.5925 │ 21.1906 │ 20.8715 │ 21.7541 │ [] │ False │ │ [358395625, 380178646, 401147657] │ ├─────┼────────────────────┼────────────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼────────────┼───────────────────┼─────────────────────────┼────────────────────────────────────────────────────────┤ │ 179 │ Execute │ Method::execute │ 20.8867 │ 20.9464 │ 21.593 │ 21.191 │ 20.8718 │ 21.7547 │ [] │ False │ │ [358395521, 380178542, 401147552] │ ╘═════╧════════════════════╧════════════════════════════════════════╧══════════════╧══════════════╧══════════════╧══════════════╧══════════════╧══════════════╧════════════╧═══════════════════╧═════════════════════════╧════════════════════════════════════════════════════════╛ ``` --- # Vulkan Backend Tutorials **→{doc}`etvk-profiling-tutorial`** **→{doc}`etvk-llama-tutorial`** ```{toctree} :maxdepth: 2 :hidden: :caption: Tutorials etvk-profiling-tutorial etvk-llama-tutorial --- # Vulkan Backend The ExecuTorch Vulkan (ET-VK) backend enables ExecuTorch models to execute on GPUs via the cross-platform [Vulkan API](https://www.vulkan.org/). Although the Vulkan API support is almost ubiquitous among modern GPUs, the ExecuTorch Vulkan backend is currently developed with a specific focus for **Android GPUs**. ## Features - Wide operator support via an in-tree [GLSL compute shader library](https://github.com/pytorch/executorch/tree/main/backends/vulkan/runtime/graph/ops/glsl) - Support for models that require dynamic shapes - Support for FP32 and FP16 inference modes - Support for quantized linear layers with 8-bit/4-bit weights and 8-bit dynamically quantized activations - Support for quantized linear layers with 8-bit/4-bit weights and FP32/FP16 activations Note that the Vulkan backend is under active development, and its GLSL compute shader library is being consistently expanded over time. Additional support for quantized operators (i.e. quantized convolution) and additional quantization modes is on the way. ## Target Requirements - Supports Vulkan 1.1 ## Development Requirements To contribute to the Vulkan delegate, the [Vulkan SDK](https://vulkan.lunarg.com/sdk/home#android) must be installed on the development system. After installation, the `glslc` binary must be found in your `PATH` in order to compile Vulkan shaders. This can be checked by running ```sh glslc --version ``` If this is not the case after completing the Vulkan SDK installation, you may have to go into `~/VulkanSDK//` and run ```sh source setup-env.sh ``` or alternatively, ```sh python install_vulkan.py ``` The [Android NDK](https://developer.android.com/ndk/downloads) must also be installed. Any NDK version past NDK r17c should suffice. ---- ## Using the Vulkan Backend To lower a model to the Vulkan backend during the export and lowering process, pass an instance of `VulkanPartitioner` to `to_edge_transform_and_lower`. The example below demonstrates this process using the MobileNet V2 model from torchvision. 
```python import torch import torchvision.models as models from executorch.backends.vulkan.partitioner.vulkan_partitioner import VulkanPartitioner from executorch.exir import to_edge_transform_and_lower from torchvision.models.mobilenetv2 import MobileNet_V2_Weights mobilenet_v2 = models.mobilenetv2.mobilenet_v2( weights=MobileNet_V2_Weights.DEFAULT ).eval() sample_inputs = (torch.randn(1, 3, 224, 224),) exported_program = torch.export.export(mobilenet_v2, sample_inputs) etvk_program = to_edge_transform_and_lower( exported_program, partitioner=[VulkanPartitioner()], ).to_executorch() with open("mv2_vulkan.pte", "wb") as file: etvk_program.write_to_file(file) ``` See [Partitioner API](vulkan-partitioner.md) for a reference on available partitioner options. ---- ## Quantization The Vulkan delegate currently supports execution of quantized linear layers. See [Vulkan Quantization](vulkan-quantization.md) for more information on available quantization schemes and APIs. ---- ## Runtime Integration To run the model on-device, use the standard ExecuTorch runtime APIs. For integration in Android applications, the Vulkan backend is included in the [executorch-android-vulkan](https://mvnrepository.com/artifact/org.pytorch/executorch-android-vulkan) package. When building from source, pass `-DEXECUTORCH_BUILD_VULKAN=ON` when configuring the CMake build to compile the Vulkan backend. See [Running on Device](/getting-started.md#running-on-device) for more information. To link against the backend, add the `executorch_backends` CMake target as a build dependency, or link directly against `libvulkan_backend`. Due to the use of static initialization to register available compute shaders and operators, it is required to ensure that the library is linked with `--whole-archive`. ```cmake # CMakeLists.txt find_package(executorch CONFIG REQUIRED COMPONENTS vulkan_backend executorch_backends) ... target_link_libraries( my_target PRIVATE executorch executorch_backends ... ) # Ensure that unused code is not discarded. The required linker options may be # different depending on the target platform. Typically, the # executorch_target_link_options_shared_lib function from # executorch/tools/cmake/Utils.cmake can be used to set the required linker # options. target_link_options( executorch_backends INTERFACE "SHELL:LINKER:--whole-archive \ $ \ LINKER:--no-whole-archive" ) ``` No additional steps are necessary to use the backend beyond linking the target. Any Vulkan-delegated .pte file will automatically run on the registered backend. ## Additional Resources **→{doc}`/backends/vulkan/vulkan-partitioner`** **→{doc}`/backends/vulkan/vulkan-quantization`** **→{doc}`/backends/vulkan/vulkan-troubleshooting`** ```{toctree} :maxdepth: 2 :hidden: :caption: Vulkan Backend vulkan-partitioner vulkan-quantization vulkan-op-support vulkan-troubleshooting tutorials/vulkan-tutorials --- # Partitioner API [VulkanPartitioner](https://github.com/pytorch/executorch/blob/main/backends/vulkan/partitioner/vulkan_partitioner.py) is a Python class that controls what operators in a model can or should be delegated to the Vulkan backend. It is the primary entrypoint to the Vulkan backend and is also used to configure the behaviour of the Vulkan backend. ## Usage For most use-cases, constructing `VulkanPartitioner()` with no arguments is sufficient. In this case, the partitioner will lower as much of the model to the Vulkan backend as possible. 
```python
etvk_program = to_edge_transform_and_lower(
    exported_program,
    partitioner=[VulkanPartitioner()],
).to_executorch()
```

## Common Config Options

Generally, the Vulkan backend is configured by passing a `compile_options` dictionary to `VulkanPartitioner()`, i.e.

```python
compile_options = {
    "require_dynamic_shapes": True,
    "force_fp16": True,
}

etvk_program = to_edge_transform_and_lower(
    exported_program,
    partitioner=[VulkanPartitioner(compile_options)],
).to_executorch()
```

### `require_dynamic_shapes`

If a model is expected to use dynamic shapes, then it is recommended to set the `"require_dynamic_shapes"` key in `compile_options`. Not all operators in Vulkan support dynamic shapes at the moment, although the majority do. This flag will prevent operators that don't support dynamic shapes from being lowered to Vulkan.

### `force_fp16`

This option causes the Vulkan backend to internally convert all FP32 tensors to FP16. This can improve inference latency and memory footprint at the cost of model accuracy. FP32 input tensors will automatically be converted to FP16 upon entering the Vulkan backend, and FP16 outputs will automatically be converted back to FP32 as they are returned.

---

# Quantization

The Vulkan backend currently supports execution of quantized linear layers, where weights are symmetrically quantized to 8-bit or 4-bit with per-output-channel or per-group quantization scales.

Support for additional quantized operators and quantization schemes (e.g. static and dynamic quantized convolution, statically quantized linear) is under active development and will be added soon.

### 4-bit quantization with torchao `quantize_`

The `quantize_` API from [torchao](https://github.com/pytorch/ao) allows for more advanced quantization schemes, and is the quantization workflow needed to access 4-bit quantization. 4-bit quantization is commonly used for LLMs.

Two options are available to execute linear layers with 4-bit quantization:

1. Dynamically quantized activations via `Int8DynamicActivationIntxWeightConfig`
2. Weight-only quantization via `IntxWeightOnlyConfig`

Dynamically quantized activations can provide a significant latency improvement over weight-only quantization, since they allow GPUs to leverage accelerated integer dot product instructions when computing matrix multiplication.

Below is a simple example of quantizing a sequence of linear layers using the `quantize_` API.
```python import torch from executorch.backends.vulkan.partitioner.vulkan_partitioner import VulkanPartitioner from executorch.exir import to_edge_transform_and_lower from torchao.quantization.granularity import PerGroup from torchao.quantization.quant_api import ( Int8DynamicActivationIntxWeightConfig, IntxWeightOnlyConfig, quantize_, ) from torchao.utils import unwrap_tensor_subclass class LinearSequenceModule(torch.nn.Module): def __init__(self): super().__init__() self.linear1 = torch.nn.Linear(128, 64, bias=False) self.linear2 = torch.nn.Linear(64, 32, bias=False) self.linear3 = torch.nn.Linear(32, 16, bias=False) def forward(self, x): x = self.linear1(x) x = self.linear2(x) x = self.linear3(x) return x linear_sequence_module = LinearSequenceModule() M = 32 sample_inputs = (torch.randn(M, 128),) group_size = 32 q_config_8da4w = Int8DynamicActivationIntxWeightConfig( weight_dtype=torch.int4, weight_granularity=PerGroup(group_size) ) q_config_4w = IntxWeightOnlyConfig( weight_dtype=torch.int4, granularity=PerGroup(group_size) ) quantize_(linear_sequence_module, q_config_8da4w) unwrap_tensor_subclass(linear_sequence_module) # Regular export path from here exported_program = torch.export.export(linear_sequence_module, sample_inputs) etvk_program = to_edge_transform_and_lower( exported_program, partitioner=[VulkanPartitioner()], ).to_executorch() ``` ### 8-bit quantization with PT2E quantization For 8-bit quantized linear layers, currently the only quantization scheme supported is weight only quantization, with weights that are symmetrically quantized to 8 bits with per output channel quantization scales. To access this quantization mode, the PT2E quantization flow must be used. At a high level, the steps to quantize a model are: 1) Create an instance of the `VulkanQuantizer` class and specify desired quantization behaviour 2) Use `torch.export.export` to prepare for quantization. 3) Call `prepare_pt2e` to prepare the exported graph for quantization. 4) Execute the prepared model with representative samples to calibrate the quantizated tensor activation ranges. 5) Call `convert_pt2e` to quantize the model. 6) Export and lower the model using the standard flow. For example: ```python import torch from executorch.backends.vulkan.partitioner.vulkan_partitioner import VulkanPartitioner from executorch.backends.vulkan.quantizer.vulkan_quantizer import ( get_symmetric_quantization_config, VulkanQuantizer, ) from executorch.exir import to_edge_transform_and_lower from torchao.quantization.pt2e.quantize_pt2e import convert_pt2e, prepare_pt2e from torchao.utils import unwrap_tensor_subclass class LinearSequenceModule(torch.nn.Module): def __init__(self): super().__init__() self.linear1 = torch.nn.Linear(128, 64, bias=False) self.linear2 = torch.nn.Linear(64, 32, bias=False) self.linear3 = torch.nn.Linear(32, 16, bias=False) def forward(self, x): x = self.linear1(x) x = self.linear2(x) x = self.linear3(x) return x linear_sequence_module = LinearSequenceModule() M = 32 # Create sample inputs sample_inputs = (torch.randn(M, 128),) # Setup quantizer quantizer = VulkanQuantizer() quantizer.set_global(get_symmetric_quantization_config(is_dynamic=False, weight_bits=8)) # Export the model exported_program = torch.export.export(linear_sequence_module, sample_inputs) graph_module = exported_program.module() # Quantize the exported program with PT2E quantization flow quantized_module = prepare_pt2e(graph_module, quantizer) # Calibrate. 
# In practice, this would be done by iterating over a real dataset
quantized_module(*sample_inputs)

quantized_module = convert_pt2e(quantized_module)

# Export once more
exported_program = torch.export.export(quantized_module, sample_inputs)

# Lower to Vulkan
etvk_program = to_edge_transform_and_lower(
    exported_program,
    partitioner=[VulkanPartitioner()],
).to_executorch()
```

---

# Troubleshooting

This page describes common issues that you may encounter when using the Vulkan backend and how to debug and resolve them.

## Vulkan Backend Not Found

If you try to execute a .pte file that has been lowered to the Vulkan backend and you see an error like:

```shell
E 00:00:00.366934 executorch:method.cpp:74] Backend VulkanBackend is not registered.
```

This error indicates the Vulkan backend is not registered with the runtime. This can happen because the backend was not compiled or linked, or because the registration code was optimized out.

First, make sure that when building ExecuTorch, CMake is configured with `-DEXECUTORCH_BUILD_VULKAN=ON`. Next, make sure that your application is linking the `vulkan_backend` target, or the `executorch_backends` target. Finally, ensure that `vulkan_backend` or `executorch_backends` is being linked with the equivalent of `--whole-archive`.

## Slow Performance

Performance issues can be caused by a variety of factors:

* A key compute shader (most often convolution or linear) is not performing well on your target GPU
* Unsupported operators are causing too many graph breaks
* An existing operator is lacking support for some memory layout or storage type, resulting in a high number of copies being inserted to ensure tensors are in the required representation for the next operator

If you experience poor on-device performance for a particular model, please obtain some profiling data while running your model. The [profiling tutorial](./tutorials/etvk-profiling-tutorial.md) can be a good reference for how to do this. Then, please file an issue on Github with the following details:

* The device(s) you have tested with, and which devices exhibit poor performance running the model
* The profiling data collected from executing the model
* The release version of ExecuTorch you are using, or the commit hash you built from if you built from source
* If available, an export script that can be used to export your model to aid in reproducing the issue
* If available, the `.pte` file you are testing with to aid in reproducing the issue

We will do our best to patch performance problems in the Vulkan backend and help you resolve your issue.

---

# Architecture and Internals

This is a high-level overview of the ExecuTorch XNNPACK backend delegate. This high-performance delegate aims to reduce CPU inference latency for ExecuTorch models. We will provide a brief introduction to the XNNPACK library and explore the delegate's overall architecture and intended use cases.

## What is XNNPACK?

XNNPACK is a library of highly-optimized neural network operators for ARM, x86, and WebAssembly architectures in Android, iOS, Windows, Linux, and macOS environments. It is an open-source project; you can find more information about it on [github](https://github.com/google/XNNPACK).

## What are ExecuTorch delegates?

A delegate is an entry point for backends to process and execute parts of the ExecuTorch program. Delegated portions of ExecuTorch models hand off execution to backends. The XNNPACK backend delegate is one of many available in ExecuTorch.
It leverages the XNNPACK third-party library to accelerate ExecuTorch programs efficiently across a variety of CPUs. More detailed information on delegates and on developing your own delegates is available [here](/compiler-delegate-and-partitioner.md). It is recommended that you get familiar with that content before continuing on to the Architecture section.

## Architecture

![High Level XNNPACK delegate Architecture](/backends/xnnpack/xnnpack-delegate-architecture.png)

### Ahead-of-time

In the ExecuTorch export flow, lowering to the XNNPACK delegate happens at the `to_backend()` stage. In this stage, the model is partitioned by the `XnnpackPartitioner`. Partitioned sections of the graph are converted to an XNNPACK-specific graph representation and then serialized via flatbuffer. The serialized flatbuffer is then ready to be deserialized and executed by the XNNPACK backend at runtime.

![ExecuTorch XNNPACK delegate Export Flow](/backends/xnnpack/xnnpack-et-flow-diagram.png)

#### Partitioner

The partitioner is implemented by backend delegates to mark nodes suitable for lowering. The `XnnpackPartitioner` lowers using node targets and module metadata. Some more references for partitioners can be found [here](/compiler-delegate-and-partitioner.md).

##### Module-based partitioning

`source_fn_stack` is embedded in the node's metadata and gives information on where these nodes come from. For example, modules like `torch.nn.Linear`, when captured and exported `to_edge`, generate groups of nodes for their computation. The group of nodes associated with computing the linear module then has a `source_fn_stack` of `torch.nn.Linear`. Partitioning based on `source_fn_stack` allows us to identify groups of nodes which are lowerable via XNNPACK.

For example, after capturing `torch.nn.Linear` you would find the following key in the metadata for the addmm node associated with linear:

```python
>>> print(linear_node.meta["source_fn_stack"])
'source_fn_stack': ('fn', )
```

##### Op-based partitioning

The `XnnpackPartitioner` also partitions using op targets. It traverses the graph and identifies individual nodes which are lowerable to XNNPACK. A drawback to module-based partitioning is that operators which come from [decompositions](https://github.com/pytorch/pytorch/blob/main/torch/_decomp/decompositions.py) may be skipped. For example, an operator like `torch.nn.Hardsigmoid` is decomposed into adds, muls, divs, and clamps. While hardsigmoid is not lowerable, we can lower the decomposed ops. Relying on `source_fn_stack` metadata would skip these lowerable operators because they belong to a non-lowerable module, so in order to improve model performance, we greedily lower operators based on the op targets as well as the `source_fn_stack`.

##### Passes

Before any serialization, we apply passes on the subgraphs to prepare the graph. These passes are essentially graph transformations that help improve the performance of the delegate. We give an overview of the most significant passes and their function below. For a description of all passes see [here](https://github.com/pytorch/executorch/tree/main/backends/xnnpack/_passes):

* Channels Last Reshape
  * ExecuTorch tensors tend to be contiguous before passing them into delegates, while XNNPACK only accepts channels-last memory layout. This pass minimizes the number of permutation operators inserted to pass in channels-last memory format.
* Conv1d to Conv2d
  * Allows us to delegate Conv1d nodes by transforming them to Conv2d
* Conv and BN Fusion
  * Fuses batch norm operations with the previous convolution node

#### Serialization

After partitioning the lowerable subgraphs from the model, the XNNPACK delegate pre-processes these subgraphs and serializes them via flatbuffer for the XNNPACK backend.

##### Serialization Schema

The XNNPACK delegate uses flatbuffer for serialization. In order to improve runtime performance, the XNNPACK delegate's flatbuffer [schema](https://github.com/pytorch/executorch/blob/main/backends/xnnpack/serialization/schema.fbs) mirrors the XNNPACK library's graph-level API calls. The serialized data are arguments to XNNPACK's APIs, so that at runtime the XNNPACK execution graph can efficiently be created with successive calls to XNNPACK's APIs.

### Runtime

The XNNPACK backend's runtime interfaces with the ExecuTorch runtime through the custom `init` and `execute` functions. Each delegated subgraph is contained in an individually serialized XNNPACK blob. When the model is initialized, ExecuTorch calls `init` on all XNNPACK blobs to load the subgraphs from serialized flatbuffer. After, when the model is executed, each subgraph is executed via the backend through the custom `execute` function. To read more about how delegate runtimes interface with ExecuTorch, refer to this [resource](/compiler-delegate-and-partitioner.md).

#### **XNNPACK Library**

The XNNPACK delegate supports CPUs on multiple platforms; more information on the supported hardware architectures can be found in the XNNPACK library's [README](https://github.com/google/XNNPACK).

#### **Init**

When calling the XNNPACK delegate's `init`, we deserialize the preprocessed blobs via flatbuffer. We define the nodes (operators) and edges (intermediate tensors) to build the XNNPACK execution graph using the information we serialized ahead-of-time. As we mentioned earlier, the majority of processing has been done ahead-of-time, so that at runtime we can just call the XNNPACK APIs with the serialized arguments in succession. As static data is defined into the execution graph, XNNPACK performs weight packing at runtime to prepare static data like weights and biases for efficient execution. After creating the execution graph, we create the runtime object and pass it on to `execute`.

Since weight packing creates an extra copy of the weights inside XNNPACK, we free the original copy of the weights inside the preprocessed XNNPACK blob; this allows us to remove some of the memory overhead.

#### **Execute**

When executing the XNNPACK subgraphs, we prepare the tensor inputs and outputs and feed them to the XNNPACK runtime graph. After executing the runtime graph, the output pointers are filled with the computed tensors.

#### **Profiling**

We have enabled basic profiling for the XNNPACK delegate that can be enabled with the compiler flag `-DEXECUTORCH_ENABLE_EVENT_TRACER` (add `-DENABLE_XNNPACK_PROFILING` for additional details). With ExecuTorch's Developer Tools integration, you can also use the Developer Tools to profile the model. You can follow the steps in [Using the ExecuTorch Developer Tools to Profile a Model](/tutorials/devtools-integration-tutorial) on how to profile ExecuTorch models and use the Developer Tools' Inspector API to view XNNPACK's internal profiling information. An example implementation is available in the `executor_runner` (see the [tutorial here](/tutorial-xnnpack-delegate-lowering.md#profiling)).
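
As a quick illustration of that workflow (a minimal sketch with placeholder file names; it assumes an ETDump was produced by a runner built with `-DEXECUTORCH_ENABLE_EVENT_TRACER` and that an ETRecord was generated at export time), the Inspector API can be used to view the collected timing data, including the events reported by the XNNPACK delegate:

```python
from executorch.devtools import Inspector

# Load the runtime profiling data (ETDump) together with the ahead-of-time
# debug artifact (ETRecord) saved when the model was exported.
inspector = Inspector(etdump_path="etdump.etdp", etrecord="etrecord.bin")

# Print a per-event table of timing data, including events emitted by the
# XNNPACK delegate for its delegated subgraphs.
inspector.print_data_tabular()
```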
[comment]: <> (TODO: Refactor quantizer to a more official quantization doc)

## Quantization

The XNNPACK delegate can also be used as a backend to execute symmetrically quantized models. For quantized model delegation, we quantize models using the `XNNPACKQuantizer`. `Quantizers` are backend specific, which means the `XNNPACKQuantizer` is configured to quantize models to leverage the quantized operators offered by the XNNPACK library. We will not go over the details of how to implement your custom quantizer; you can follow the docs [here](https://pytorch.org/tutorials/prototype/pt2e_quantizer.html) to do so. However, we will provide a brief overview of how to quantize the model to leverage quantized execution of the XNNPACK delegate.

### Configuring the XNNPACKQuantizer

```python
from executorch.backends.xnnpack.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)

quantizer = XNNPACKQuantizer()
quantizer.set_global(get_symmetric_quantization_config())
```

Here we initialize the `XNNPACKQuantizer` and set the quantization config to be symmetrically quantized. Symmetric quantization is when weights are symmetrically quantized with `qmin = -127` and `qmax = 127`, which forces the quantization zero points to be zero. `get_symmetric_quantization_config()` can be configured with the following arguments:

* `is_per_channel`
  * Weights are quantized across channels
* `is_qat`
  * Quantization-aware training
* `is_dynamic`
  * Dynamic quantization

We can then configure the `XNNPACKQuantizer` as we wish. We set the following configs below as an example:

```python
(
    quantizer.set_global(quantization_config)
    .set_object_type(torch.nn.Conv2d, quantization_config)  # can configure by module type
    .set_object_type(torch.nn.functional.linear, quantization_config)  # or torch functional op type
    .set_module_name("foo.bar", quantization_config)  # or by module fully qualified name
)
```

### Quantizing your model with the XNNPACKQuantizer

After configuring our quantizer, we are now ready to quantize our model:

```python
from torch.export import export
from torchao.quantization.pt2e.quantize_pt2e import prepare_pt2e

exported_model = export(model_to_quantize, example_inputs).module()
prepared_model = prepare_pt2e(exported_model, quantizer)
print(prepared_model.graph)
```

Prepare performs some Conv2d-BN fusion and inserts quantization observers in the appropriate places. For post-training quantization, we generally calibrate our model after this step. We run sample examples through the `prepared_model` to observe the statistics of the tensors, which are used to calculate the quantization parameters.

Finally, we convert our model here:

```python
from torchao.quantization.pt2e.quantize_pt2e import convert_pt2e

quantized_model = convert_pt2e(prepared_model)
print(quantized_model)
```

You will now see the Q/DQ representation of the model, which means `torch.ops.quantized_decomposed.dequantize_per_tensor` are inserted at quantized operator inputs and `torch.ops.quantized_decomposed.quantize_per_tensor` are inserted at operator outputs.
Example: ```python def _qdq_quantized_linear( x_i8, x_scale, x_zero_point, x_quant_min, x_quant_max, weight_i8, weight_scale, weight_zero_point, weight_quant_min, weight_quant_max, bias_fp32, out_scale, out_zero_point, out_quant_min, out_quant_max ): x_fp32 = torch.ops.quantized_decomposed.dequantize_per_tensor( x_i8, x_scale, x_zero_point, x_quant_min, x_quant_max, torch.int8) weight_fp32 = torch.ops.quantized_decomposed.dequantize_per_tensor( weight_i8, weight_scale, weight_zero_point, weight_quant_min, weight_quant_max, torch.int8) out_fp32 = torch.ops.aten.linear.default(x_fp32, weight_fp32, bias_fp32) out_i8 = torch.ops.quantized_decomposed.quantize_per_tensor( out_fp32, out_scale, out_zero_point, out_quant_min, out_quant_max, torch.int8) return out_i8 ``` You can read more indepth explanations on PyTorch 2 quantization [here](https://pytorch.org/tutorials/prototype/pt2e_quant_ptq.html). ## See Also - [Integrating XNNPACK Delegate in Android AAR](/using-executorch-android.md) - [Complete the Lowering to XNNPACK Tutorial](/tutorial-xnnpack-delegate-lowering.md) --- # XNNPACK Backend The XNNPACK delegate is the ExecuTorch solution for CPU execution on mobile CPUs. [XNNPACK](https://github.com/google/XNNPACK/tree/master) is a library that provides optimized kernels for machine learning operators on Arm and x86 CPUs. ## Features - Wide operator support on Arm and x86 CPUs, available on any modern mobile phone. - Support for a wide variety of quantization schemes and quantized operators. - Supports fp32 and fp16 activations. - Supports 8-bit quantization. ## Target Requirements - ARM64 on Android, iOS, macOS, Linux, and Windows. - ARMv7 (with NEON) on Android. - ARMv6 (with VFPv2) on Linux. - x86 and x86-64 (up to AVX512) on Windows, Linux, Android. ## Development Requirements The XNNPACK delegate does not introduce any development system requirements beyond those required by the core ExecuTorch runtime. ---- ## Using the XNNPACK Backend To target the XNNPACK backend during the export and lowering process, pass an instance of the `XnnpackPartitioner` to `to_edge_transform_and_lower`. The example below demonstrates this process using the MobileNet V2 model from torchvision. ```python import torch import torchvision.models as models from torchvision.models.mobilenetv2 import MobileNet_V2_Weights from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner from executorch.exir import to_edge_transform_and_lower mobilenet_v2 = models.mobilenetv2.mobilenet_v2(weights=MobileNet_V2_Weights.DEFAULT).eval() sample_inputs = (torch.randn(1, 3, 224, 224), ) et_program = to_edge_transform_and_lower( torch.export.export(mobilenet_v2, sample_inputs), partitioner=[XnnpackPartitioner()], ).to_executorch() with open("mv2_xnnpack.pte", "wb") as file: et_program.write_to_file(file) ``` See [Partitioner API](/backends/xnnpack/xnnpack-partitioner) for a reference on available partitioner options. ---- ## Quantization The XNNPACK delegate can also be used as a backend to execute symmetrically quantized models. See [XNNPACK Quantization](/backends/xnnpack/xnnpack-quantization) for more information on available quantization schemes and APIs. ---- ## Runtime Integration To run the model on-device, use the standard ExecuTorch runtime APIs. The XNNPACK delegate is included by default in the published Android, iOS, and pip packages. When building from source, pass `-DEXECUTORCH_BUILD_XNNPACK=ON` when configuring the CMake build to compile the XNNPACK backend. 
See [Running on Device](/getting-started.md#running-on-device) for more information.

To link against the backend, add the `executorch_backends` CMake target as a build dependency, or link directly against `libxnnpack_backend`. Due to the use of static registration, it may be necessary to link with whole-archive. This can typically be done by passing `"$"` to `target_link_libraries`.

```
# CMakeLists.txt
add_subdirectory("executorch")
...
target_link_libraries(
    my_target
    PRIVATE executorch
    executorch_backends
    ...
)
```

No additional steps are necessary to use the backend beyond linking the target. Any XNNPACK-delegated .pte file will automatically run on the registered backend.

## Reference

**→{doc}`/backends/xnnpack/xnnpack-troubleshooting` — Debug common issues.**

**→{doc}`/backends/xnnpack/xnnpack-partitioner` — Partitioner options and supported operators.**

**→{doc}`/backends/xnnpack/xnnpack-quantization` — Supported quantization schemes.**

**→{doc}`/backends/xnnpack/xnnpack-arch-internals` — XNNPACK backend internals.**

```{toctree}
:maxdepth: 2
:hidden:
:caption: XNNPACK Backend

xnnpack-partitioner
xnnpack-quantization
xnnpack-troubleshooting
xnnpack-arch-internals
```

---

# Quantization

The XNNPACK delegate can also be used as a backend to execute symmetrically quantized models. To quantize a PyTorch model for the XNNPACK backend, use the `XNNPACKQuantizer`. `Quantizers` are backend specific, which means the `XNNPACKQuantizer` is configured to quantize models to leverage the quantized operators offered by the XNNPACK library.

### Supported Quantization Schemes

The XNNPACK delegate supports the following quantization schemes:

- 8-bit symmetric weights with 8-bit asymmetric activations (via the PT2E quantization flow).
  - Supports both static and dynamic activations.
  - Supports per-channel and per-tensor schemes.
  - Supports linear, convolution, add, mul, cat, and adaptive avg pool 2d operators.

Weight-only quantization is not currently supported on XNNPACK.

### 8-bit Quantization using the PT2E Flow

To perform 8-bit quantization with the PT2E flow, perform the following steps prior to exporting the model:

1) Create an instance of the `XNNPACKQuantizer` class and set the quantization parameters.
2) Use `torch.export.export` to prepare for quantization.
3) Call `prepare_pt2e` to prepare the model for quantization.
4) For static quantization, run the prepared model with representative samples to calibrate the quantized tensor activation ranges.
5) Call `convert_pt2e` to quantize the model.
6) Export and lower the model using the standard flow.

The output of `convert_pt2e` is a PyTorch model which can be exported and lowered using the normal flow. As it is a regular PyTorch model, it can also be used to evaluate the accuracy of the quantized model using standard PyTorch techniques.
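
As a rough illustration of that last point (a minimal sketch, not part of the official flow; `model`, `sample_inputs`, and `quantized_model` refer to the names used in the example that follows), the fp32 and quantized modules can be compared directly in eager mode before lowering:

```python
import torch

# Compare the fp32 reference model against the quantized (Q/DQ) model produced by
# convert_pt2e on a sample batch. A real evaluation would iterate over a
# task-specific dataset and metric instead of a single random input.
with torch.no_grad():
    ref_out = model(*sample_inputs)
    quant_out = quantized_model(*sample_inputs)

# Signal-to-quantization-noise ratio as a quick sanity check (higher is better).
sqnr = 20 * torch.log10(ref_out.norm() / (ref_out - quant_out).norm())
print(f"SQNR between fp32 and quantized outputs: {sqnr.item():.2f} dB")
```

The complete PT2E flow, from quantization through lowering to XNNPACK, is shown in the example below.
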
```python
import torch
import torchvision.models as models
from torchvision.models.mobilenetv2 import MobileNet_V2_Weights
from executorch.backends.xnnpack.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
from executorch.exir import to_edge_transform_and_lower
from torchao.quantization.pt2e.quantize_pt2e import convert_pt2e, prepare_pt2e

model = models.mobilenetv2.mobilenet_v2(weights=MobileNet_V2_Weights.DEFAULT).eval()
sample_inputs = (torch.randn(1, 3, 224, 224), )

qparams = get_symmetric_quantization_config(is_per_channel=True) # (1)
quantizer = XNNPACKQuantizer()
quantizer.set_global(qparams)

training_ep = torch.export.export(model, sample_inputs).module() # (2)
prepared_model = prepare_pt2e(training_ep, quantizer) # (3)

for cal_sample in [torch.randn(1, 3, 224, 224)]: # Replace with representative model inputs
    prepared_model(cal_sample) # (4) Calibrate

quantized_model = convert_pt2e(prepared_model) # (5)

et_program = to_edge_transform_and_lower( # (6)
    torch.export.export(quantized_model, sample_inputs),
    partitioner=[XnnpackPartitioner()],
).to_executorch()
```

See [PyTorch 2 Export Post Training Quantization](https://docs.pytorch.org/ao/main/tutorials_source/pt2e_quant_ptq.html) for more information.

### LLM quantization with quantize_

The XNNPACK backend also supports quantizing models with the [torchao](https://github.com/pytorch/ao) `quantize_` API. This is most commonly used for LLMs, which require more advanced quantization. Since `quantize_` is not backend aware, it is important to use a config that is compatible with CPU/XNNPACK:

* Quantize embeddings with `IntxWeightOnlyConfig` (with weight_dtype torch.int2, torch.int4, or torch.int8, using PerGroup or PerAxis granularity)
* To quantize linear layers with 4-bit weights and 8-bit dynamic activations, use `Int8DynamicActivationIntxWeightConfig` (with weight_dtype=torch.int4, using PerGroup or PerAxis granularity)

Below is a simple example, but a more detailed tutorial including accuracy evaluation on popular LLM benchmarks can be found in the [torchao documentation](https://docs.pytorch.org/ao/main/serving.html#mobile-deployment-with-executorch).

```python
import torch

from torchao.quantization.granularity import PerGroup, PerAxis
from torchao.quantization.quant_api import (
    IntxWeightOnlyConfig,
    Int8DynamicActivationIntxWeightConfig,
    quantize_,
)

# Quantize embeddings with 8 bits, per channel
embedding_config = IntxWeightOnlyConfig(
    weight_dtype=torch.int8,
    granularity=PerAxis(0),
)
quantize_(
    eager_model,
    embedding_config,
    lambda m, fqn: isinstance(m, torch.nn.Embedding),
)

# Quantize linear layers with 8-bit dynamic activations and 4-bit weights
linear_config = Int8DynamicActivationIntxWeightConfig(
    weight_dtype=torch.int4,
    weight_granularity=PerGroup(32),
)
quantize_(eager_model, linear_config)
```

---

# Troubleshooting

This page describes common issues that you may encounter when using the XNNPACK backend and how to debug and resolve them.

## XNNPACK Backend Not Found

This error indicates the XNNPACK backend is not registered with the runtime. This can happen because the backend was not compiled or linked, or because the registration code was optimized out. The XNNPACK backend is built by default for Python, Android, iOS, and in most CMake presets.

* Set the `EXECUTORCH_BUILD_XNNPACK=ON` CMake option when building from source.
* Either by passing the option during CMake configuration or setting it inside the user CMake logic before including ExecuTorch. * See [Building from Source](/using-executorch-building-from-source). * On iOS, link the `backend_xnnpack` [framework](/using-executorch-ios). * If the backend is still not found, link with `WHOLE_ARCHIVE`. * Pass `"LINK_LIBRARY:WHOLE_ARCHIVE,xnnpack_backend>"` to `target_link_libraries` in CMake. ## Slow Performance * Try reducing the thread count using [_unsafe_reset_threadpool](/using-executorch-faqs.md#inference-is-slow-performance-troubleshooting). * Small models may benefit from using fewer threads than default. * Try values between 1 and 4 threads and measure performance on your model. * Use [op-level profiling](/tutorials/devtools-integration-tutorial) to understand which operators are taking the most time. * The XNNPACK backend provides operator-level timing for delegated operators. * See general performance troubleshooting tips in [Performance Troubleshooting](/using-executorch-faqs.md#inference-is-slow-performance-troubleshooting). ## Debugging Why Nodes Are Not Partitioned * To debug cases where operators are not delegated to XNNPACK, you can enable internal debug logs before **to_edge_transform_and_lower** from the partitioner. This will print diagnostic messages explaining why specific nodes fail to partition. ``` python # caption: Enable internal partition debug logging import logging logger = logging.getLogger("executorch.backends.xnnpack.partition") logger.setLevel(logging.DEBUG) if not logger.handlers: ch = logging.StreamHandler() ch.setLevel(logging.DEBUG) formatter = logging.Formatter("[%(levelname)s] %(name)s: %(message)s") ch.setFormatter(formatter) logger.addHandler(ch) ``` --- # Cadence Xtensa Backend In this tutorial we will walk you through the process of getting setup to build ExecuTorch for Cadence Xtensa DSPs and running models on them. [Cadence](https://www.cadence.com/en_US/home.html) is both a hardware and software vendor, providing solutions for many computational workloads, including to run on power-limited embedded devices. The Cadence backend supports multiple DSP families optimized for different workloads: - **HiFi Audio DSPs** (HiFi4/HiFi5): Optimized for audio processing, speech recognition, and wake word detection - **Fusion G3 DSPs**: General-purpose AI acceleration - **Vision P-Series DSPs**: Specialized for computer vision and CNN workloads In addition to the chip, the HiFi4 Neural Network Library ([nnlib](https://github.com/foss-xtensa/nnlib-hifi4)) offers an optimized set of library functions commonly used in NN processing that we utilize in this example to demonstrate how common operations can be accelerated. For an overview of the Cadence ExecuTorch integration with performance benchmarks, see the blog post: [Running Optimized PyTorch Models on Cadence DSPs with ExecuTorch](https://community.cadence.com/cadence_blogs_8/b/ip/posts/running-optimized-pytorch-models-on-cadence-dsps-with-executorch). On top of being able to run on the Xtensa HiFi4 DSP, another goal of this tutorial is to demonstrate how portable ExecuTorch is and its ability to run on a low-power embedded device such as the Xtensa HiFi4 DSP. This workflow does not require any delegates, it uses custom operators and compiler passes to enhance the model and make it more suitable to running on Xtensa HiFi4 DSPs. 
A custom [quantizer](https://pytorch.org/tutorials/prototype/quantization_in_pytorch_2_0_export_tutorial.html) is used to represent activations and weights as `uint8` instead of `float`, and call appropriate operators. Finally, custom kernels optimized with Xtensa intrinsics provide runtime acceleration. ::::{grid} 2 :::{grid-item-card} What you will learn in this tutorial: :class-card: card-prerequisites * In this tutorial you will learn how to export a quantized model with a linear operation targeted for the Xtensa HiFi4 DSP. * You will also learn how to compile and deploy the ExecuTorch runtime with the kernels required for running the quantized model generated in the previous step on the Xtensa HiFi4 DSP. ::: :::{grid-item-card} Tutorials we recommend you complete before this: :class-card: card-prerequisites * [Introduction to ExecuTorch](intro-how-it-works.md) * [Getting Started](getting-started.md) * [Building ExecuTorch with CMake](using-executorch-building-from-source.md) ::: :::: ```{note} The linux part of this tutorial has been designed and tested on Ubuntu 22.04 LTS, and requires glibc 2.34. Workarounds are available for other distributions, but will not be covered in this tutorial. ``` ## Prerequisites (Hardware and Software) In order to be able to succesfully build and run ExecuTorch on a Xtensa HiFi4 DSP you'll need the following hardware and software components. ### Hardware - [i.MX RT600 Evaluation Kit](https://www.nxp.com/design/development-boards/i-mx-evaluation-and-development-boards/i-mx-rt600-evaluation-kit:MIMXRT685-EVK) ### Software - x86-64 Linux system (For compiling the DSP binaries) - [MCUXpresso IDE](https://www.nxp.com/design/software/development-software/mcuxpresso-software-and-tools-/mcuxpresso-integrated-development-environment-ide:MCUXpresso-IDE) - This IDE is supported on multiple platforms including MacOS. You can use it on any of the supported platforms as you'll only be using this to flash the board with the DSP images that you'll be building later on in this tutorial. - [J-Link](https://www.segger.com/downloads/jlink/) - Needed to flash the board with the firmware images. You can install this on the same platform that you installed the MCUXpresso IDE on. - Note: depending on the version of the NXP board, another probe than JLink might be installed. In any case, flashing is done using the MCUXpresso IDE in a similar way. - [MCUXpresso SDK](https://mcuxpresso.nxp.com/en/select?device=EVK-MIMXRT685) - Download this SDK to your Linux machine, extract it and take a note of the path where you store it. You'll need this later. - [Xtensa compiler](https://tensilicatools.com/platform/i-mx-rt600/) - Download this to your Linux machine. This is needed to build ExecuTorch for the HiFi4 DSP. - For cases with optimized kernels, the [nnlib repo](https://github.com/foss-xtensa/nnlib-hifi4). ## Setting up Developer Environment Step 1. In order to be able to successfully install all the software components specified above users will need to go through the NXP tutorial linked below. Although the tutorial itself walks through a Windows setup, most of the steps translate over to a Linux installation too. 
[NXP tutorial on setting up the board and dev environment](https://www.nxp.com/document/guide/getting-started-with-i-mx-rt600-evaluation-kit:GS-MIMXRT685-EVK?section=plug-it-in) ```{note} Before proceeding forward to the next section users should be able to succesfullly flash the **dsp_mu_polling_cm33** sample application from the tutorial above and notice output on the UART console indicating that the Cortex-M33 and HiFi4 DSP are talking to each other. ``` Step 2. Make sure you have completed the ExecuTorch setup tutorials linked to at the top of this page. ## Working Tree Description The working tree is: ``` executorch ├── backends │ └── cadence │ ├── aot # Ahead-of-Time compilation tools │ │ ├── compiler.py # Main compilation API │ │ ├── export_example.py # Export workflow example │ │ ├── quantizer/ # Quantization infrastructure │ │ │ ├── quantizer.py # Multiple quantizer implementations │ │ │ ├── patterns.py # Quantization patterns │ │ │ └── fusion_pass.py # Op fusion pass │ │ ├── passes.py # Graph optimization passes │ │ ├── functions.yaml # Generic operator definitions │ │ ├── functions_hifi.yaml # HiFi-specific definitions │ │ ├── functions_fusion_g3.yaml # Fusion G3 definitions │ │ └── functions_vision.yaml # Vision-specific definitions │ ├── runtime/ # Runtime execution infrastructure │ ├── utils/ # Build utilities (FACTO, header gen) │ ├── hifi/ # HiFi Audio DSP family (70+ ops) │ │ ├── kernels # Optimized HiFi4/HiFi5 kernels │ │ ├── operators # HiFi operator implementations │ │ └── third-party │ │ └── nnlib # Cadence NNLIB integration │ ├── fusion_g3/ # Fusion G3 DSP family (25+ ops) │ │ ├── kernels │ │ ├── operators │ │ └── third-party │ │ └── nnlib │ ├── vision/ # Vision P-Series DSP family (17+ ops) │ │ ├── kernels │ │ ├── operators │ │ └── third-party # Vision-specific library │ └── generic/ # Generic fallback implementations (15+ ops) │ └── operators └── examples └── cadence ├── models # 9 example models │ ├── rnnt_encoder.py # ASR encoder (ConvEmformer) │ ├── rnnt_predictor.py # ASR predictor │ ├── rnnt_joiner.py # ASR joiner │ ├── wav2vec2.py # Self-supervised speech │ ├── mobilenet_v2.py # Image classification │ ├── resnet18.py # Image classification │ ├── resnet50.py # Image classification │ ├── vision_transformer.py # ViT │ └── babyllama.py # Small LLM └── operators # Operator test examples ├── test_add_op.py # Add operation tests ├── test_quantized_linear_op.py ├── test_quantized_conv1d_op.py ├── test_requantize_op.py └── test_g3_ops.py # FACTO-based G3 tests ``` ***AoT (Ahead-of-Time) Components***: The AoT folder contains all of the python scripts and functions needed to export the model to an ExecuTorch `.pte` file. The main components include: - **Compiler API** ([compiler.py](https://github.com/pytorch/executorch/blob/main/backends/cadence/aot/compiler.py)): High-level APIs for model compilation including `trace()`, `quantize_pt2()`, `export_to_edge()`, and `export_to_cadence()`. 
- **Quantizer** ([quantizer/quantizer.py](https://github.com/pytorch/executorch/blob/main/backends/cadence/aot/quantizer/quantizer.py)): Multiple quantization strategies: - `CadenceDefaultQuantizer`: Standard A8W8 (8-bit asymmetric activations, 8-bit weights) - `CadenceWithLayerNormQuantizer`: Adds layer normalization support - `CadenceWakeWordQuantizer`: Optimized for audio wake word models - `CadenceW8A32MixedQuantizer`: Experimental mixed precision (8-bit weights, 32-bit activations) - `CadenceWithSoftmaxQuantizer`: Includes A16 (16-bit activation) softmax - **Compiler Passes** ([passes.py](https://github.com/pytorch/executorch/blob/main/backends/cadence/aot/passes.py)): Graph optimization passes including operator fusion, replacement, simplification, and reordering. - **Operator Registrations** ([ops_registrations.py](https://github.com/pytorch/executorch/blob/main/backends/cadence/aot/ops_registrations.py)): Registers 100+ custom Cadence operators with meta kernels for shape inference. Supports quantized operations for conv1d/2d, linear, matmul, layer norm, and more. - **Export Example** ([export_example.py](https://github.com/pytorch/executorch/blob/main/backends/cadence/aot/export_example.py)): Reference implementation demonstrating the complete export workflow from model to `.pte` file. ***DSP Family-Specific Implementations***: Each DSP family has its own optimized operator and kernel implementations: - **HiFi**: Extensive support for quantized convolutions (1D/2D, depthwise, dilated), linear, matmul, layer norm, ReLU, add, and more. Uses Cadence NNLIB for optimized primitives. - **Fusion G3**: General-purpose operations including arithmetic (add, sub, mul, div), activations (sigmoid, tanh, softmax), layer normalization, and tensor manipulation. - **Vision**: Vision-focused operations including quantized conv, linear, matmul, im2row transformation, and softmax with custom vision library. - **Generic**: Reference implementations used as fallback when DSP-specific optimizations aren't available. ***Kernels***: The kernels folders contain optimized implementations that use Xtensa intrinsics to deliver high performance at low power. Each DSP family has its own kernel implementations tuned for the specific architecture characteristics. ## Build In this step, you will generate the ExecuTorch program from different models. You'll then use this Program (the `.pte` file) during the runtime build step to bake this Program into the DSP image. ### Model Export Examples The Cadence backend provides multiple example models covering different use cases: ***Simple Model***: The first, simple model is meant to test that all components of this tutorial are working properly, and simply does an add operation. The generated file is called `add.pte`. ```bash cd executorch python3 -m examples.portable.scripts.export --model_name="add" ``` ***Quantized Operators***: Test individual quantized operations: - **Quantized Linear**: [Linear](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html) operation (32→16 features). Linear is the backbone of most ASR models. ```bash python3 -m examples.cadence.operators.test_quantized_linear_op ``` - **Quantized Conv1D**: [Conv1d](https://pytorch.org/docs/stable/generated/torch.nn.Conv1d.html) operation (8→16 channels). Important for wake word and denoising models. ```bash python3 -m examples.cadence.operators.test_quantized_conv1d_op ``` - **Requantize Operation**: Tests dtype conversion between different quantized types. 
```bash python3 -m examples.cadence.operators.test_requantize_op ``` In all cases the generated file is called `CadenceDemoModel.pte`. ***Speech/Audio Models***: The torchaudio [RNNT-emformer](https://pytorch.org/audio/stable/tutorials/online_asr_tutorial.html) model is an Automatic Speech Recognition (ASR) model, comprised of three different submodels: - **RNNT Predictor**: Sequence of basic ops (embedding, ReLU, linear, layer norm) ```bash python3 -m examples.cadence.models.rnnt_predictor ``` - **RNNT Encoder**: ConvEmformer-based encoder with time reduction and transformer layers ```bash python3 -m examples.cadence.models.rnnt_encoder ``` - **RNNT Joiner**: Joint network combining encoder and predictor outputs ```bash python3 -m examples.cadence.models.rnnt_joiner ``` - **Wav2Vec 2.0**: Self-supervised speech representation model ```bash python3 -m examples.cadence.models.wav2vec2 ``` ***Computer Vision Models***: - **MobileNet V2**: Efficient image classification ```bash python3 -m examples.cadence.models.mobilenet_v2 ``` - **ResNet-18**: Image classification ```bash python3 -m examples.cadence.models.resnet18 ``` - **ResNet-50**: Deeper image classification ```bash python3 -m examples.cadence.models.resnet50 ``` - **Vision Transformer (ViT)**: Transformer-based vision model ```bash python3 -m examples.cadence.models.vision_transformer ``` ***Language Model***: - **Baby LLaMA**: Small LLM for testing transformer operations on DSP ```bash python3 -m examples.cadence.models.babyllama ``` All model exports generate `CadenceDemoModel.pte` files ready for deployment. ### Runtime **Building the DSP firmware image** In this step, you'll be building the DSP firmware image that consists of the sample ExecuTorch runner along with the Program generated from the previous step. This image when loaded onto the DSP will run through the model that this Program consists of. ***Step 1***. Configure the environment variables needed to point to the Xtensa toolchain that you have installed in the previous step. The three environment variables that need to be set include: ```bash # Directory in which the Xtensa toolchain was installed export XTENSA_TOOLCHAIN=/home/user_name/cadence/XtDevTools/install/tools # The version of the toolchain that was installed. This is essentially the name of the directory # that is present in the XTENSA_TOOLCHAIN directory from above. export TOOLCHAIN_VER=RI-2023.11-linux # The Xtensa core that you're targeting. # For HiFi4 (NXP RT600): export XTENSA_CORE=VANILLA_HIFI # For Fusion G3: # export XTENSA_CORE=VANILLA_G3 # For Vision P6: # export XTENSA_CORE=VANILLA_VISION ``` ```{note} The Cadence backend supports multiple DSP families: - **HiFi Audio DSPs** (HiFi4/HiFi5): Core `VANILLA_HIFI`, enable with `-DEXECUTORCH_NNLIB_OPT=ON` - **Fusion G3 DSPs**: Core `VANILLA_G3`, enable with `-DEXECUTORCH_FUSION_G3_OPT=ON` - **Vision P-Series DSPs**: Core `VANILLA_VISION`, enable with `-DEXECUTORCH_VISION_OPT=ON` ``` ***Step 2***. Clone the [nnlib repo](https://github.com/foss-xtensa/nnlib-hifi4), which contains optimized kernels and primitives for HiFi4 DSPs, with `git clone git@github.com:foss-xtensa/nnlib-hifi4.git`. ***Step 3***. Run the CMake build. In order to run the CMake build, you need the path to the following: - The Program generated in the previous step - Path to the NXP SDK root. This should have been installed already in the [Setting up Developer Environment](#setting-up-developer-environment) section. 
This is the directory that contains the folders such as boards, components, devices, and other. ```bash cd executorch ./install_executorch.sh --clean mkdir cmake-out # prebuild and install executorch library cmake -DCMAKE_TOOLCHAIN_FILE=/backends/cadence/cadence.cmake \ -DCMAKE_INSTALL_PREFIX=cmake-out \ -DCMAKE_BUILD_TYPE=Debug \ -DPYTHON_EXECUTABLE=python3 \ -DEXECUTORCH_BUILD_EXTENSION_RUNNER_UTIL=ON \ -DEXECUTORCH_BUILD_EXECUTOR_RUNNER=OFF \ -DEXECUTORCH_BUILD_PTHREADPOOL=OFF \ -DEXECUTORCH_BUILD_CPUINFO=OFF \ -Bcmake-out . cmake --build cmake-out -j --target install --config Debug # build cadence runner cmake -DCMAKE_BUILD_TYPE=Debug \ -DCMAKE_TOOLCHAIN_FILE=/examples/backends/cadence.cmake \ -DCMAKE_PREFIX_PATH=/cmake-out \ -DMODEL_PATH= \ -DNXP_SDK_ROOT_DIR= \ -DNN_LIB_BASE_DIR= \ -Bcmake-out/examples/cadence \ examples/cadence cmake --build cmake-out/examples/cadence -j8 -t cadence_executorch_example ``` After having succesfully run the above step you should see two binary files in their CMake output directory. ```bash > ls cmake-xt/*.bin cmake-xt/dsp_data_release.bin cmake-xt/dsp_text_release.bin ``` ## Deploying and Running on Device ***Step 1***. You now take the DSP binary images generated from the previous step and copy them over into your NXP workspace created in the [Setting up Developer Environment](#setting-up-developer-environment) section. Copy the DSP images into the `dsp_binary` section highlighted in the image below. ![MCUXpresso IDE](_static/img/dsp_binary.png) ```{note} As long as binaries have been built using the Xtensa toolchain on Linux, flashing the board and running on the chip can be done only with the MCUXpresso IDE, which is available on all platforms (Linux, MacOS, Windows). ``` ***Step 2***. Clean your work space ***Step 3***. Click **Debug your Project** which will flash the board with your binaries. On the UART console connected to your board (at a default baud rate of 115200), you should see an output similar to this: ```bash > screen /dev/tty.usbmodem0007288234991 115200 Executed model Model executed successfully. First 20 elements of output 0 0.165528 0.331055 ... ``` ## Conclusion and Future Work In this tutorial, you have learned how to export a quantized operation, build the ExecuTorch runtime and run this model on the Xtensa HiFi4 DSP chip. The (quantized linear) model in this tutorial is a typical operation appearing in ASR models, and can be extended to a complete ASR model by creating the model as a new test and adding the needed operators/kernels to [operators](https://github.com/pytorch/executorch/blob/main/backends/cadence/hifi/operators) and [kernels](https://github.com/pytorch/executorch/blob/main/backends/cadence/hifi/kernels). Other models can be created following the same structure, always assuming that operators and kernels are available. --- # MediaTek Backend The MediaTek backend enables acceleration of PyTorch models on edge devices with MediaTek Neuron Processing Units (NPUs). This backend provides tools for exporting, building, and deploying models to leverage MediaTek hardware. 
## Features

- Acceleration of PyTorch models on MediaTek NPUs
- Tools for model export and lowering
- Example scripts for model deployment and execution

## Target Requirements

- **Hardware:** MediaTek Dimensity 9300 (D9300), Dimensity 9400 (D9400)
- **Host OS:** Linux
- **SDK:** [NeuroPilot Express SDK](https://neuropilot.mediatek.com/resources/public/npexpress/en/docs/npexpress)

## Development Requirements

- Linux operating system
- Python dependencies:

  ```bash
  pip3 install -r requirements.txt
  ```

- NeuroPilot SDK Python wheels (download from [NeuroPilot Express SDK](https://neuropilot.mediatek.com/resources/public/npexpress/en/docs/npexpress)):

  ```bash
  pip3 install mtk_neuron-8.2.23-py3-none-linux_x86_64.whl
  pip3 install mtk_converter-8.13.0+public-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
  ```

## Using the MediaTek Backend

### Exporting and Lowering a Model

To export and lower a model for the MediaTek backend, use the provided shell script:

```bash
cd executorch
./examples/mediatek/shell_scripts/export_oss.sh mobilenetv3
```

The exported `.pte` file is saved in a directory named after the model.

### Quantizer API

The quantizer can be configured with different precisions. We currently support A16W16, A16W8, A16W4, A8W8, and A8W4. The example code is:

```python
precision = "A16W16"
quantizer = NeuropilotQuantizer()
quantizer.setup_precision(getattr(Precision, precision))
```

### Partitioner API

The MediaTek backend supports the following CompileSpec options:

- `platform-config`: Specifies the targeted MediaTek platform name to compile for.

## Runtime Integration

This section presents an example of exporting and deploying a model. Please refer to `executorch/examples/mediatek/` for export and execution examples of various models.

### Building Example Runners

Build the example runners:

```bash
./mtk_build_examples.sh
```

Runners are located in `cmake-android-out/examples/mediatek/`.

### Deploying to Device

1. Push `libneuron_backend.so`, `libneuronusdk_adapter.mtk.so` and `libneuron_buffer_allocator.so` to the device.
2. Set the library path before running ExecuTorch:

   ```bash
   export LD_LIBRARY_PATH=:::$LD_LIBRARY_PATH
   ```

### Building the Backend from Source

1. Copy `NeuronAdapter.h` to `backends/mediatek/runtime/include/api/`
2. Set the NDK path: ensure that the `$ANDROID_NDK` environment variable is set to the path where the NDK is located.

   ```bash
   export ANDROID_NDK=
   ```

3. Build the backend library `libneuron_backend.so`:

   ```bash
   cd backends/mediatek/scripts/
   ./mtk_build.sh
   ```

The output is `libneuron_backend.so` in `cmake-android-out/backends/mediatek/`.

---

# Backends

## Backend Overview

ExecuTorch backends provide hardware acceleration for specific hardware targets, enabling models to run efficiently on devices ranging from mobile phones to embedded systems and DSPs. During the export and lowering process, ExecuTorch optimizes your model for the chosen backend, resulting in a `.pte` file specialized for that hardware. To support multiple platforms (e.g., Core ML on iOS, Arm CPU on Android), you typically generate a dedicated `.pte` file for each backend.

The choice of backend is informed by the hardware your model will run on. Each backend has its own hardware requirements and level of model/operator support. See the documentation for each backend for details.

As part of `.pte` file creation, ExecuTorch identifies model partitions supported by the backend. These are processed ahead of time for efficient execution.
Operators not supported by the delegate are executed using the portable CPU fallback (e.g., XNNPACK), allowing for partial acceleration. You can also specify multiple partitioners in order of priority, so unsupported GPU ops can fall back to CPU, for example. --- ## Why Backends Matter Backends are the bridge between your exported model and the hardware it runs on. Choosing the right backend ensures your model takes full advantage of device-specific acceleration, balancing performance, compatibility, and resource usage. --- ## Choosing a Backend | Backend | Platform(s) | Hardware Type | Typical Use Case | |--------------------------------------------------------------|-------------|---------------|---------------------------------| | [XNNPACK](backends/xnnpack/xnnpack-overview.md) | All | CPU | General-purpose, fallback | | [Core ML](/backends/coreml/coreml-overview.md) | iOS, macOS | NPU/GPU/CPU | Apple devices, high performance | | [Metal Performance Shaders](/backends/mps/mps-overview.md) | iOS, macOS | GPU | Apple GPU acceleration | | [Vulkan ](/backends/vulkan/vulkan-overview.md) | Android | GPU | Android GPU acceleration | | [Qualcomm](backends-qualcomm) | Android | NPU | Qualcomm SoCs | | [MediaTek](backends-mediatek) | Android | NPU | MediaTek SoCs | | [Arm Ethos-U](/backends/arm-ethos-u/arm-ethos-u-overview.md) | Embedded | NPU | Arm MCUs | | [Arm VGF](/backends/arm-vgf/arm-vgf-overview.md) | Android | GPU | Arm platforms | | [OpenVINO](build-run-openvino) | Embedded | CPU/GPU/NPU | Intel SoCs | | [NXP](backends/nxp/nxp-overview.md) | Embedded | NPU | NXP SoCs | | [Cadence](backends-cadence) | Embedded | DSP | DSP-optimized workloads | | [Samsung Exynos](/backends/samsung/samsung-overview.md) | Android | NPU | Samsung SoCs | **Tip:** For best performance, export a `.pte` file for each backend you plan to support. --- ## Best Practices - **Test on all target devices:** Operator support may vary by backend. - **Use fallback wisely:** If a backend doesn't support an operator, ExecuTorch will run it on CPU. - **Consult backend docs:** Each backend has unique setup and tuning options. --- ```{toctree} :maxdepth: 3 :hidden: :caption: Backend Overview backends-xnnpack backends/coreml/coreml-overview backends-mps backends-vulkan backends-qualcomm backends-mediatek backends-arm-ethos-u backends-arm-vgf build-run-openvino backends-nxp backends-cadence backends-samsung-exynos --- # Qualcomm AI Engine Backend In this tutorial we will walk you through the process of getting started to build ExecuTorch for Qualcomm AI Engine Direct and running a model on it. Qualcomm AI Engine Direct is also referred to as QNN in the source and documentation. ::::{grid} 2 :::{grid-item-card} What you will learn in this tutorial: :class-card: card-prerequisites * In this tutorial you will learn how to lower and deploy a model for Qualcomm AI Engine Direct. ::: :::{grid-item-card} Tutorials we recommend you complete before this: :class-card: card-prerequisites * [Introduction to ExecuTorch](intro-how-it-works.md) * [Getting Started](getting-started.md) * [Building ExecuTorch with CMake](using-executorch-building-from-source.md) ::: :::: ## What's Qualcomm AI Engine Direct? [Qualcomm AI Engine Direct](https://developer.qualcomm.com/software/qualcomm-ai-engine-direct-sdk) is designed to provide unified, low-level APIs for AI development. Developers can interact with various accelerators on Qualcomm SoCs with these set of APIs, including Kryo CPU, Adreno GPU, and Hexagon processors. 
More details can be found [here](https://docs.qualcomm.com/bundle/publicresource/topics/80-63442-10/overview.html). Currently, this ExecuTorch Backend can delegate AI computations to Hexagon processors through Qualcomm AI Engine Direct APIs. ## Prerequisites (Hardware and Software) ### Host OS The QNN Backend is currently verified on the following Linux host operating systems: - **Ubuntu 22.04 LTS (x64)** - **CentOS Stream 9** - **Windows Subsystem for Linux (WSL)** with Ubuntu 22.04 In general, we verify the backend on the same OS versions that the QNN SDK is officially validated against. The exact supported versions are documented in the QNN SDK. #### Windows (WSL) Setup To install Ubuntu 22.04 on WSL, run the following command in PowerShell or Windows Terminal: ```bash wsl --install -d ubuntu 22.04 ``` This command will install WSL and set up Ubuntu 22.04 as the default Linux distribution. For more details and troubleshooting, refer to the official Microsoft WSL installation guide: 👉 [Install WSL | Microsoft Learn](https://learn.microsoft.com/en-us/windows/wsl/install) ### Hardware: You will need an Android / Linux device with adb-connected running on one of below Qualcomm SoCs: - SA8295 - SM8450 (Snapdragon 8 Gen 1) - SM8475 (Snapdragon 8 Gen 1+) - SM8550 (Snapdragon 8 Gen 2) - SM8650 (Snapdragon 8 Gen 3) - SM8750 (Snapdragon 8 Elite) - SSG2115P - SSG2125P - SXR1230P (Linux Embedded) - SXR2230P - SXR2330P - QCM6490 - QCS9100 This example is verified with SM8550 and SM8450. ### Software: - Follow ExecuTorch recommended Python version. - A compiler to compile AOT parts, e.g., the GCC compiler comes with Ubuntu LTS. g++ version need to be 13 or higher. - [Android NDK](https://developer.android.com/ndk). This example is verified with NDK 26c. - (Optional) Target toolchain for linux embedded platform. - [Qualcomm AI Engine Direct SDK](https://developer.qualcomm.com/software/qualcomm-ai-engine-direct-sdk) - Click the "Get Software" button to download the latest version of the QNN SDK. - Although newer versions are available, we have verified and recommend using QNN 2.37.0 for stability. - You can download it directly from the following link: [QNN 2.37.0](https://softwarecenter.qualcomm.com/api/download/software/sdks/Qualcomm_AI_Runtime_Community/All/2.37.0.250724/v2.37.0.250724.zip) The directory with installed Qualcomm AI Engine Direct SDK looks like: ``` ├── benchmarks ├── bin ├── docs ├── examples ├── include ├── lib ├── LICENSE.pdf ├── NOTICE.txt ├── NOTICE_WINDOWS.txt ├── QNN_NOTICE.txt ├── QNN_README.txt ├── QNN_ReleaseNotes.txt ├── ReleaseNotes.txt ├── ReleaseNotesWindows.txt ├── sdk.yaml └── share ``` ## Setting up your developer environment ### Conventions `$QNN_SDK_ROOT` refers to the root of Qualcomm AI Engine Direct SDK, i.e., the directory containing `QNN_README.txt`. `$ANDROID_NDK_ROOT` refers to the root of Android NDK. `$EXECUTORCH_ROOT` refers to the root of executorch git repository. ### Setup environment variables We set `LD_LIBRARY_PATH` to make sure the dynamic linker can find QNN libraries. Further, we set `PYTHONPATH` because it's easier to develop and import ExecuTorch Python APIs. ```bash export LD_LIBRARY_PATH=$QNN_SDK_ROOT/lib/x86_64-linux-clang/:$LD_LIBRARY_PATH export PYTHONPATH=$EXECUTORCH_ROOT/.. ``` ## Build An example script for the below building instructions is [here](https://github.com/pytorch/executorch/blob/main/backends/qualcomm/scripts/build.sh). We recommend to use the script because the ExecuTorch build-command can change from time to time. 
The above script is actively used. It is updated more frequently than this tutorial. An example usage is ```bash cd $EXECUTORCH_ROOT # android target ./backends/qualcomm/scripts/build.sh # (optional) linux embedded target ./backends/qualcomm/scripts/build.sh --enable_linux_embedded # for release build ./backends/qualcomm/scripts/build.sh --release ``` ## Deploying and running on device ### AOT compile a model Refer to [this script](https://github.com/pytorch/executorch/blob/main/examples/qualcomm/scripts/deeplab_v3.py) for the exact flow. We use deeplab-v3-resnet101 as an example in this tutorial. Run below commands to compile: ```bash cd $EXECUTORCH_ROOT python -m examples.qualcomm.scripts.deeplab_v3 -b build-android -m SM8550 --compile_only --download ``` You might see something like below: ``` [INFO][Qnn ExecuTorch] Destroy Qnn context [INFO][Qnn ExecuTorch] Destroy Qnn device [INFO][Qnn ExecuTorch] Destroy Qnn backend opcode name target args kwargs ------------- ------------------------ --------------------------- ----------------------------- -------- placeholder arg684_1 arg684_1 () {} get_attr lowered_module_0 lowered_module_0 () {} call_function executorch_call_delegate executorch_call_delegate (lowered_module_0, arg684_1) {} call_function getitem (executorch_call_delegate, 0) {} call_function getitem_1 (executorch_call_delegate, 1) {} output output output ([getitem_1, getitem],) {} ``` The compiled model is `./deeplab_v3/dlv3_qnn.pte`. ### Test model inference on QNN HTP emulator We can test model inferences before deploying it to a device by HTP emulator. Let's build `qnn_executor_runner` for a x64 host: ```bash # assuming the AOT component is built. cd $EXECUTORCH_ROOT/build-x86 cmake ../examples/qualcomm \ -DCMAKE_PREFIX_PATH="$PWD/lib/cmake/ExecuTorch;$PWD/third-party/gflags;" \ -DCMAKE_FIND_ROOT_PATH_MODE_PACKAGE=BOTH \ -DPYTHON_EXECUTABLE=python3 \ -Bexamples/qualcomm cmake --build examples/qualcomm -j$(nproc) # qnn_executor_runner can be found under examples/qualcomm/executor_runner # The full path is $EXECUTORCH_ROOT/build-x86/examples/qualcomm/executor_runner/qnn_executor_runner ls examples/qualcomm/executor_runner ``` To run the HTP emulator, the dynamic linker needs to access QNN libraries and `libqnn_executorch_backend.so`. We set the below two paths to `LD_LIBRARY_PATH` environment variable: 1. `$QNN_SDK_ROOT/lib/x86_64-linux-clang/` 2. `$EXECUTORCH_ROOT/build-x86/lib/` The first path is for QNN libraries including HTP emulator. It has been configured in the AOT compilation section. The second path is for `libqnn_executorch_backend.so`. So, we can run `./deeplab_v3/dlv3_qnn.pte` by: ```bash cd $EXECUTORCH_ROOT/build-x86 export LD_LIBRARY_PATH=$EXECUTORCH_ROOT/build-x86/lib/:$LD_LIBRARY_PATH examples/qualcomm/executor_runner/qnn_executor_runner --model_path ../deeplab_v3/dlv3_qnn.pte ``` We should see some outputs like the below. Note that the emulator can take some time to finish. ```bash I 00:00:00.354662 executorch:qnn_executor_runner.cpp:213] Method loaded. I 00:00:00.356460 executorch:qnn_executor_runner.cpp:261] ignoring error from set_output_data_ptr(): 0x2 I 00:00:00.357991 executorch:qnn_executor_runner.cpp:261] ignoring error from set_output_data_ptr(): 0x2 I 00:00:00.357996 executorch:qnn_executor_runner.cpp:265] Inputs prepared. I 00:01:09.328144 executorch:qnn_executor_runner.cpp:414] Model executed successfully. 
I 00:01:09.328159 executorch:qnn_executor_runner.cpp:421] Write etdump to etdump.etdp, Size = 424
[INFO] [Qnn ExecuTorch]: Destroy Qnn backend parameters
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn device
[INFO] [Qnn ExecuTorch]: Destroy Qnn backend
```

### Run model inference on an Android smartphone with Qualcomm SoCs

***Step 1***. We need to push the required QNN libraries to the device.

```bash
# make sure you have write-permission on below path.
DEVICE_DIR=/data/local/tmp/executorch_qualcomm_tutorial/
adb shell "mkdir -p ${DEVICE_DIR}"
adb push ${QNN_SDK_ROOT}/lib/aarch64-android/libQnnHtp.so ${DEVICE_DIR}
adb push ${QNN_SDK_ROOT}/lib/aarch64-android/libQnnSystem.so ${DEVICE_DIR}
adb push ${QNN_SDK_ROOT}/lib/aarch64-android/libQnnHtpV69Stub.so ${DEVICE_DIR}
adb push ${QNN_SDK_ROOT}/lib/aarch64-android/libQnnHtpV73Stub.so ${DEVICE_DIR}
adb push ${QNN_SDK_ROOT}/lib/aarch64-android/libQnnHtpV75Stub.so ${DEVICE_DIR}
adb push ${QNN_SDK_ROOT}/lib/aarch64-android/libQnnHtpV79Stub.so ${DEVICE_DIR}
adb push ${QNN_SDK_ROOT}/lib/hexagon-v69/unsigned/libQnnHtpV69Skel.so ${DEVICE_DIR}
adb push ${QNN_SDK_ROOT}/lib/hexagon-v73/unsigned/libQnnHtpV73Skel.so ${DEVICE_DIR}
adb push ${QNN_SDK_ROOT}/lib/hexagon-v75/unsigned/libQnnHtpV75Skel.so ${DEVICE_DIR}
adb push ${QNN_SDK_ROOT}/lib/hexagon-v79/unsigned/libQnnHtpV79Skel.so ${DEVICE_DIR}
```

***Step 2***. We also need to tell the dynamic linkers on Android and Hexagon where to find these libraries by setting `ADSP_LIBRARY_PATH` and `LD_LIBRARY_PATH`. Then we can run `qnn_executor_runner` like

```bash
adb push ./deeplab_v3/dlv3_qnn.pte ${DEVICE_DIR}
adb push ${EXECUTORCH_ROOT}/build-android/examples/qualcomm/executor_runner/qnn_executor_runner ${DEVICE_DIR}
adb push ${EXECUTORCH_ROOT}/build-android/lib/libqnn_executorch_backend.so ${DEVICE_DIR}
adb shell "cd ${DEVICE_DIR} \
           && export LD_LIBRARY_PATH=${DEVICE_DIR} \
           && export ADSP_LIBRARY_PATH=${DEVICE_DIR} \
           && ./qnn_executor_runner --model_path ./dlv3_qnn.pte"
```

You should see something like the following:

```
I 00:00:00.257354 executorch:qnn_executor_runner.cpp:213] Method loaded.
I 00:00:00.323502 executorch:qnn_executor_runner.cpp:262] ignoring error from set_output_data_ptr(): 0x2
I 00:00:00.357496 executorch:qnn_executor_runner.cpp:262] ignoring error from set_output_data_ptr(): 0x2
I 00:00:00.357555 executorch:qnn_executor_runner.cpp:265] Inputs prepared.
I 00:00:00.364824 executorch:qnn_executor_runner.cpp:414] Model executed successfully.
I 00:00:00.364875 executorch:qnn_executor_runner.cpp:425] Write etdump to etdump.etdp, Size = 424
[INFO] [Qnn ExecuTorch]: Destroy Qnn backend parameters
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn backend
```

The command above merely executes the model. If we want to feed real inputs and get model outputs, we can use

```bash
cd $EXECUTORCH_ROOT
# android
python -m examples.qualcomm.scripts.deeplab_v3 -b build-android -m SM8550 --download -s <device_serial>
# (optional) linux embedded
python -m examples.qualcomm.scripts.deeplab_v3 -b build-oe-linux -m SXR1230P --download -s <device_serial> -t aarch64-oe-linux-gcc-9.3
```

The `<device_serial>` can be found with the `adb devices` command.

After the above command, pre-processed inputs and outputs are placed in the `$EXECUTORCH_ROOT/deeplab_v3` and `$EXECUTORCH_ROOT/deeplab_v3/outputs` folders. The command-line arguments are defined in [utils.py](https://github.com/pytorch/executorch/blob/main/examples/qualcomm/utils.py#L139).
The model, inputs, and output location are passed to `qnn_executor_runner` by `--model_path`, `--input_list_path`, and `--output_folder_path`.

### Run [Android LlamaDemo](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/android/LlamaDemo) with QNN backend

`$DEMO_APP` refers to the root of the executorch android demo, i.e., the directory containing `build.gradle.kts`.

***Step 1***: Rebuild the ExecuTorch AAR

```bash
# Build the AAR
cd $EXECUTORCH_ROOT
export BUILD_AAR_DIR=$EXECUTORCH_ROOT/aar-out
./scripts/build_android_library.sh
```

***Step 2***: Copy the AAR to the Android Project

```bash
cp $EXECUTORCH_ROOT/aar-out/executorch.aar \
   $DEMO_APP/app/libs/executorch.aar
```

***Step 3***: Build the Android APK

```bash
cd $DEMO_APP
./gradlew clean assembleDebug -PuseLocalAar=true
```

***Step 4***: Install on Device

```bash
adb install -r app/build/outputs/apk/debug/app-debug.apk
```

***Step 5***: Push the model

```bash
adb shell mkdir -p /data/local/tmp/llama
adb push model.pte /data/local/tmp/llama
adb push tokenizer.bin /data/local/tmp/llama
```

***Step 6***: Run the Llama Demo

- Open the app on Android
- Select the `QUALCOMM` backend
- Select the `model.pte` model
- Select the `tokenizer.bin` tokenizer
- Select the model type
- Click LOAD MODEL
- It should show `Successfully loaded model.`

#### Verification Steps

***Step 1***. Verify the AAR Contains Your Changes

```bash
# Check for debug strings in the AAR
unzip -p $DEMO_APP/app/libs/executorch.aar jni/arm64-v8a/libexecutorch.so | \
  strings | grep "QNN"
# Replace "QNN" with your actual debug string if needed
```

If found, your changes are in the AAR.

***Step 2***. Verify the APK Contains the Correct Libraries

```bash
# Check QNN library version in APK
cd $DEMO_APP
unzip -l app/build/outputs/apk/debug/app-debug.apk | grep "libQnnHtp.so"
```

Expected size for QNN 2.37.0: ~2,465,440 bytes

***Step 3***. Monitor Logs During Model Loading

```bash
adb logcat -c
adb logcat | grep -E "ExecuTorch"
```

#### Common Issues and Solutions

##### Issue 1: Error 18 (InvalidArgument)

- **Cause**: Wrong parameter order in the Runner constructor or missing QNN config
- **Solution**: Check `$EXECUTORCH_ROOT/examples/qualcomm/oss_scripts/llama/runner/runner.h` for the correct constructor signature.

##### Issue 2: Error 1 (Internal) with QNN API Version Mismatch

- **Symptoms**:
  ```
  W [Qnn ExecuTorch]: Qnn API version 2.33.0 is mismatched
  E [Qnn ExecuTorch]: Using newer context binary on old SDK
  E [Qnn ExecuTorch]: Can't create context from binary. Error 5000
  ```
- **Cause**: The model was compiled with QNN SDK version X but the APK uses QNN runtime version Y
- **Solution**:
  - Update `build.gradle.kts` with a matching QNN runtime version

> **Note:** The version numbers below (`2.33.0` and `2.37.0`) are examples only. Please check for the latest compatible QNN runtime version or match your QNN SDK version to avoid API mismatches.
**Before**:
```kotlin
implementation("com.qualcomm.qti:qnn-runtime:2.33.0")
```
**After**:
```kotlin
implementation("com.qualcomm.qti:qnn-runtime:2.37.0")
```
  - Or recompile the model with a matching QNN SDK version

##### Issue 3: Native Code Changes Not Applied

- **Symptoms**:
  - Debug logs don't appear
  - Behavior doesn't change
- **Cause**:
  - Gradle is using the Maven dependency instead of the local AAR
- **Solution**:
  - Always build with the `-PuseLocalAar=true` flag

##### Issue 4: Logs Not Appearing

- **Cause**: Wrong logging tag filter
- **Solution**: QNN uses the "ExecuTorch" tag:

```bash
adb logcat | grep "ExecuTorch"
```

## Supported model list

Please refer to `$EXECUTORCH_ROOT/examples/qualcomm/scripts/` and `$EXECUTORCH_ROOT/examples/qualcomm/oss_scripts/` for the list of supported models.

Each script demonstrates:
- Model export (torch.export)
- Quantization (PTQ/QAT)
- Lowering and compilation to the QNN delegate
- Deployment on device or the HTP emulator

## How to Support a Custom Model in HTP Backend

### Step-by-Step Implementation Guide

Please refer to [the simple example](https://github.com/pytorch/executorch/blob/main/examples/qualcomm/scripts/export_example.py) and the [more complicated examples](https://github.com/pytorch/executorch/tree/main/examples/qualcomm/scripts) for reference.

#### Step 1: Prepare Your Model

```python
import torch

# Initialize your custom model
model = YourModelClass().eval()  # Your custom PyTorch model

# Create example inputs (adjust shape as needed)
example_inputs = (torch.randn(1, 3, 224, 224),)  # Example input tensor
```

#### Step 2: [Optional] Quantize Your Model

Choose between two quantization approaches: post-training quantization (PTQ) or quantization-aware training (QAT):

```python
from executorch.backends.qualcomm.quantizer.quantizer import QnnQuantizer
from torchao.quantization.pt2e.quantize_pt2e import prepare_pt2e, prepare_qat_pt2e, convert_pt2e

quantizer = QnnQuantizer()
m = torch.export.export(model, example_inputs, strict=True).module()

# PTQ (Post-Training Quantization)
if quantization_type == "ptq":
    prepared_model = prepare_pt2e(m, quantizer)
    # Calibration loop would go here
    prepared_model(*example_inputs)

# QAT (Quantization-Aware Training)
elif quantization_type == "qat":
    prepared_model = prepare_qat_pt2e(m, quantizer)
    # Training loop would go here
    for _ in range(training_steps):
        prepared_model(*example_inputs)

# Convert to quantized model
quantized_model = convert_pt2e(prepared_model)
```

The `QnnQuantizer` is configurable, with the default setting being **8a8w**. For advanced users, refer to the [`QnnQuantizer`](https://github.com/pytorch/executorch/blob/main/backends/qualcomm/quantizer/quantizer.py) documentation for details.

##### Supported Quantization Schemes
- **8a8w** (default)
- **16a16w**
- **16a8w**
- **16a4w**
- **16a4w_block**

##### Customization Options
- **Per-node annotation**: Use `custom_quant_annotations`.
- **Per-module (`nn.Module`) annotation**: Use `submodule_qconfig_list`.

##### Additional Features
- **Node exclusion**: Discard specific nodes via `discard_nodes`.
- **Blockwise quantization**: Configure block sizes with `block_size_map`.

For practical examples, see [`test_qnn_delegate.py`](https://github.com/pytorch/executorch/blob/main/backends/qualcomm/tests/test_qnn_delegate.py).

#### Step 3: Configure Compile Specs

During this step, you will need to specify the target SoC, data type, and other QNN compiler specs.
```python
from executorch.backends.qualcomm.utils.utils import (
    generate_qnn_executorch_compiler_spec,
    generate_htp_compiler_spec,
    QcomChipset,
    to_edge_transform_and_lower_to_qnn,
)

# HTP Compiler Configuration
backend_options = generate_htp_compiler_spec(
    use_fp16=not quantized,  # False for quantized models
)

# QNN Compiler Spec
compile_spec = generate_qnn_executorch_compiler_spec(
    soc_model=QcomChipset.SM8650,  # Your target SoC
    backend_options=backend_options,
)
```

#### Step 4: Lower and Export the Model

```python
# Lower to the QNN backend
delegated_program = to_edge_transform_and_lower_to_qnn(
    quantized_model if quantized else model,
    example_inputs,
    compile_spec
)

# Export to ExecuTorch format
executorch_program = delegated_program.to_executorch()

# Save the compiled model
model_name = "custom_model_qnn.pte"
with open(model_name, "wb") as f:
    f.write(executorch_program.buffer)

print(f"Model successfully exported to {model_name}")
```

## Deep Dive

### Partitioner API

The **QnnPartitioner** identifies and groups supported subgraphs for execution on the QNN backend. It uses `QnnOperatorSupport` to check node-level compatibility with the Qualcomm backend via QNN SDK APIs. The partitioner tags supported nodes with a `delegation_tag` and handles constants, buffers, and mutable states appropriately.

Please check out [QnnPartitioner](https://github.com/pytorch/executorch/blob/main/backends/qualcomm/partition/qnn_partitioner.py#L125) for the latest changes. It takes the following four constructor arguments, of which only the compile spec is required:

```python
class QnnPartitioner(Partitioner):
    """
    QnnPartitioner identifies subgraphs that can be lowered to the QNN backend,
    by tagging nodes for delegation, and manages special cases such as mutable
    buffers and consumed constants.
    """
    def __init__(
        self,
        compiler_specs: List[CompileSpec],
        skip_node_id_set: set = None,
        skip_node_op_set: set = None,
        skip_mutable_buffer: bool = False,
    ):
        ...
```

### Quantization

Quantization in the QNN backend supports multiple data bit-widths and training modes (PTQ/QAT). The `QnnQuantizer` defines quantization configurations and annotations compatible with Qualcomm hardware.

Supported schemes include:
- 8a8w (default)
- 16a16w
- 16a8w
- 16a4w
- 16a4w_block

Highlights:
- `QuantDtype` enumerates bit-width combinations for activations and weights.
- `ModuleQConfig` manages per-layer quantization behavior and observers.
- `QnnQuantizer` integrates with the PT2E prepare/convert flow to annotate and quantize models.

Supports:
- Per-channel and per-block quantization
- Custom quant annotation via `custom_quant_annotations`
- Skipping specific nodes or ops
- Per-module customization via `submodule_qconfig_list`

For details, see: `backends/qualcomm/quantizer/quantizer.py`

### Operator Support

[The full operator support matrix](https://github.com/pytorch/executorch/tree/f32cdc3de6f7176d70a80228f1a60bcd45d93437/backends/qualcomm/builders#operator-support-status) is tracked and frequently updated in the ExecuTorch repository. It lists:
- Supported PyTorch ops (aten.*, custom ops)
- Planned ops
- Deprecated ops

This matrix directly corresponds to the implementations in:
[executorch/backends/qualcomm/builders/node_visitors/*.py](https://github.com/pytorch/executorch/tree/main/backends/qualcomm/builders)
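Tying the Deep Dive pieces together, a manual lowering that constructs `QnnPartitioner` directly (instead of using the `to_edge_transform_and_lower_to_qnn` helper shown earlier) might look roughly like the sketch below. The variables carried over from the earlier steps (`model`, `example_inputs`, `compile_spec`) and the skipped op name are illustrative assumptions, not a verified recipe; consult the partitioner source for the exact op-name format it expects.

```python
import torch

from executorch.backends.qualcomm.partition.qnn_partitioner import QnnPartitioner
from executorch.exir import to_edge_transform_and_lower

# `model`, `example_inputs`, and `compile_spec` are assumed to come from the
# earlier steps above (Step 1 and Step 3). The op name below is illustrative.
partitioner = QnnPartitioner(
    compiler_specs=compile_spec,
    skip_node_op_set={"aten.linear.default"},  # e.g. keep this op on CPU instead of delegating it
)

delegated_program = to_edge_transform_and_lower(
    torch.export.export(model, example_inputs),
    partitioner=[partitioner],
)
executorch_program = delegated_program.to_executorch()
```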
### Custom Ops Support

You can extend the QNN backend to support your own operators. Follow the [tutorial](https://github.com/pytorch/executorch/tree/f32cdc3de6f7176d70a80228f1a60bcd45d93437/examples/qualcomm/custom_op#custom-operator-support). It covers:

- Writing a new NodeVisitor for your op
- Registering it via @register_node_visitor
- Creating and linking libQnnOp*.so for the delegate
- Testing and verifying custom kernels on HTP

## FAQ

If you encounter any issues while reproducing the tutorial, please file a GitHub [issue](https://github.com/pytorch/executorch/issues) on the ExecuTorch repo and use the `#qcom_aisw` tag.

### Debugging tips

- Before trying any complicated models, try out [a simple model example](https://github.com/pytorch/executorch/tree/f32cdc3de6f7176d70a80228f1a60bcd45d93437/examples/qualcomm#simple-examples-to-verify-the-backend-is-working) and see if it works on device.

---

# Building and Running ExecuTorch with OpenVINO Backend

In this tutorial we will walk you through the process of setting up the prerequisites, building the OpenVINO backend library, exporting `.pte` models with OpenVINO optimizations, and executing the exported models on Intel hardware.

::::{grid} 2
:::{grid-item-card} What you will learn in this tutorial:
:class-card: card-prerequisites
* In this tutorial you will learn how to lower and deploy a model with OpenVINO.
:::
:::{grid-item-card} Tutorials we recommend you complete before this:
:class-card: card-prerequisites
* [Introduction to ExecuTorch](intro-how-it-works.md)
* [Setting up ExecuTorch](getting-started.md)
* [Building ExecuTorch with CMake](using-executorch-building-from-source.md)
:::
::::

## Introduction to OpenVINO

[OpenVINO](https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/overview.html) is an open-source toolkit designed to enhance AI inference on Intel hardware by reducing latency and increasing throughput while preserving accuracy. It optimizes hardware utilization and simplifies AI development and deep learning integration across domains such as computer vision, large language models (LLMs), and generative AI.

OpenVINO is integrated as an ExecuTorch delegate to accelerate AI applications deployed with ExecuTorch APIs.

## Supported Hardware

The OpenVINO backend supports the following hardware:

- Intel CPUs
- Intel integrated GPUs
- Intel discrete GPUs
- Intel NPUs

For more information on the supported hardware, please refer to the [OpenVINO System Requirements](https://docs.openvino.ai/2025/about-openvino/release-notes-openvino/system-requirements.html) page.

## Instructions for Building OpenVINO Backend

### Prerequisites

Before you begin, ensure you have OpenVINO installed and configured on your system. In the commands below, `<openvino_install_dir>` is a placeholder for your preferred install location:

```bash
git clone https://github.com/openvinotoolkit/openvino.git
cd openvino && git checkout releases/2025/1
git submodule update --init --recursive
sudo ./install_build_dependencies.sh
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release -DENABLE_PYTHON=ON
make -j

cd ..
cmake --install build --prefix <openvino_install_dir>
cd <openvino_install_dir>
source setupvars.sh
```

Note: The OpenVINO backend is not yet supported with the current OpenVINO release packages. It is recommended to build from source. The instructions for using OpenVINO release packages will be added soon.

For more information about the OpenVINO build, refer to the [OpenVINO Build Instructions](https://github.com/openvinotoolkit/openvino/blob/master/docs/dev/build_linux.md).
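Before moving on, you can optionally sanity-check that the OpenVINO build you just sourced is the one your Python environment picks up. This is a minimal check; the exact version string and device list depend on your build and machine:

```python
import openvino as ov

# Print the OpenVINO version that Python resolves after sourcing setupvars.sh.
print(ov.__version__)

# List the devices OpenVINO can see on this machine (e.g. CPU, GPU, NPU).
print(ov.Core().available_devices)
```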
### Setup

Follow the steps below to set up your build environment:

1. **Setup ExecuTorch Environment**: Refer to the [Environment Setup](using-executorch-building-from-source.md#environment-setup) guide for detailed instructions on setting up the ExecuTorch environment.

2. **Setup OpenVINO Backend Environment**
   - Install the dependent libs. Ensure that you are inside the `executorch/backends/openvino/` directory:
   ```bash
   pip install -r requirements.txt
   ```

3. Navigate to the `scripts/` directory.

4. **Build OpenVINO Backend**: Once the prerequisites are in place, run the `openvino_build.sh` script to start the build process. The OpenVINO backend will be built under `cmake-out/backends/openvino/` as `libopenvino_backend.a`:
   ```bash
   ./openvino_build.sh
   ```

## Build Instructions for Examples

### AOT step:

Refer to the [README.md](../../examples/openvino/README.md) in the `executorch/examples/openvino` folder for detailed instructions on exporting deep learning models from various model suites (TIMM, Torchvision, Hugging Face) to the OpenVINO backend using ExecuTorch. Users can dynamically specify the model, input shape, and target device.

Below is an example of exporting a ResNet50 model from the Torchvision model suite for the CPU device with an input shape of `[1, 3, 256, 256]`:

```bash
cd executorch/examples/openvino
python aot_optimize_and_infer.py --export --suite torchvision --model resnet50 --input_shape "(1, 3, 256, 256)" --device CPU
```

The exported model will be saved as `resnet50.pte` in the current directory.

### Build C++ OpenVINO Examples

After building the OpenVINO backend following the [instructions](#setup) above, the executable will be saved in `cmake-out/`.

The executable requires a model file (the `.pte` file generated in the AOT step) and the number of inference executions.

#### Example Usage

Run inference with a given model for 10 executions:

```
./executor_runner \
    --model_path=model.pte \
    --num_executions=10
```

## Support

If you encounter any issues while reproducing the tutorial, please file a GitHub issue on the ExecuTorch repo and use the `#openvino` tag.

---

# Bundled Program -- a Tool for ExecuTorch Model Validation

## Introduction

`BundledProgram` is a wrapper around the core ExecuTorch program, designed to help users wrap test cases with the model they deploy. `BundledProgram` is not necessarily a core part of the program and is not needed for its execution, but it is particularly important for various other use-cases, such as model correctness evaluation, including e2e testing during the model bring-up process.

Overall, the procedure can be broken into two stages, and in each stage we support:

* **Emit stage**: Bundling the test I/O cases along with the ExecuTorch program, serializing into flatbuffer.
* **Runtime stage**: Accessing, executing, and verifying the bundled test cases during runtime.

## Emit stage

This stage mainly focuses on the creation of a `BundledProgram` and dumping it out to the disk as a flatbuffer file. The main procedure is as follows:

1. Create a model and emit its ExecuTorch program.
2. Construct a `List[MethodTestSuite]` to record all test cases that need to be bundled.
3. Generate `BundledProgram` by using the emitted model and `List[MethodTestSuite]`.
4. Serialize the `BundledProgram` and dump it out to the disk.

### Step 1: Create a Model and Emit its ExecuTorch Program.

An ExecuTorch program can be emitted from the user's model by using ExecuTorch APIs. Follow the [Generate and emit sample ExecuTorch program](getting-started.md#exporting) guide or the [Exporting to ExecuTorch tutorial](tutorials/export-to-executorch-tutorial).
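As a minimal sketch (the toy module here is purely illustrative), emitting the program typically looks like this:

```python
import torch
from torch.export import export
from executorch.exir import to_edge


class TinyModel(torch.nn.Module):
    def forward(self, x):
        return x + 1


model = TinyModel()
example_inputs = (torch.randn(2, 2),)

# Capture the model's FX graph and emit an ExecuTorch program
# (an ExecutorchProgramManager) that BundledProgram can wrap.
et_program = to_edge(export(model, example_inputs)).to_executorch()
```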
### Step 2: Construct `List[MethodTestSuite]` to hold test info

In `BundledProgram`, we create two new classes, `MethodTestCase` and `MethodTestSuite`, to hold the essential info for ExecuTorch program verification.

`MethodTestCase` represents a single test case. Each `MethodTestCase` contains inputs and expected outputs for a single execution.

:::{dropdown} `MethodTestCase`

```{eval-rst}
.. autofunction:: executorch.devtools.bundled_program.config.MethodTestCase.__init__
    :noindex:
```
:::

`MethodTestSuite` contains all the testing info for a single method, including a string representing the method name and a `List[MethodTestCase]` of all test cases:

:::{dropdown} `MethodTestSuite`

```{eval-rst}
.. autofunction:: executorch.devtools.bundled_program.config.MethodTestSuite
    :noindex:
```
:::

Since each model may have multiple inference methods, we need to generate a `List[MethodTestSuite]` to hold all the essential info.

### Step 3: Generate `BundledProgram`

We provide the `BundledProgram` class under `executorch/devtools/bundled_program/core.py` to bundle the `ExecutorchProgram`-like variable, including `ExecutorchProgram`, `MultiMethodExecutorchProgram` or `ExecutorchProgramManager`, with the `List[MethodTestSuite]`:

:::{dropdown} `BundledProgram`

```{eval-rst}
.. autofunction:: executorch.devtools.bundled_program.core.BundledProgram.__init__
    :noindex:
```
:::

The constructor of `BundledProgram` performs a sanity check internally to see if the given `List[MethodTestSuite]` matches the given program's requirements. Specifically:

1. The method name of each `MethodTestSuite` in `List[MethodTestSuite]` should also be in the program. Note that there is no need to set test cases for every method in the program.
2. The metadata of each test case should meet the requirements of the corresponding inference method's inputs.

### Step 4: Serialize `BundledProgram` to Flatbuffer.

To serialize `BundledProgram` so that the runtime APIs can use it, we provide two APIs, both under `executorch/devtools/bundled_program/serialize/__init__.py`.

:::{dropdown} Serialize and Deserialize

```{eval-rst}
.. currentmodule:: executorch.devtools.bundled_program.serialize
.. autofunction:: serialize_from_bundled_program_to_flatbuffer
    :noindex:
```

```{eval-rst}
.. currentmodule:: executorch.devtools.bundled_program.serialize
.. autofunction:: deserialize_from_flatbuffer_to_bundled_program
    :noindex:
```
:::

### Emit Example

Here is a flow highlighting how to generate a `BundledProgram` given a PyTorch model and the representative inputs we want to test it with.

```python
import torch

from executorch.exir import to_edge_transform_and_lower
from executorch.devtools import BundledProgram

from executorch.devtools.bundled_program.config import MethodTestCase, MethodTestSuite
from executorch.devtools.bundled_program.serialize import (
    serialize_from_bundled_program_to_flatbuffer,
)

from torch.export import export


# Step 1: ExecuTorch Program Export
class SampleModel(torch.nn.Module):
    """An example model with multi-methods. Each method has multiple input and single output"""

    def __init__(self) -> None:
        super().__init__()
        self.register_buffer('a', 3 * torch.ones(2, 2, dtype=torch.int32))
        self.register_buffer('b', 2 * torch.ones(2, 2, dtype=torch.int32))

    def forward(self, x: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
        z = x.clone()
        torch.mul(self.a, x, out=z)
        y = x.clone()
        torch.add(z, self.b, out=y)
        torch.add(y, q, out=y)
        return y


# Inference method name of SampleModel we want to bundle testcases to.
# Notices that we do not need to bundle testcases for every inference methods. method_name = "forward" model = SampleModel() # Inputs for graph capture. capture_input = ( (torch.rand(2, 2) - 0.5).to(dtype=torch.int32), (torch.rand(2, 2) - 0.5).to(dtype=torch.int32), ) # Export method's FX Graph. method_graph = export( export(model, capture_input).module(), capture_input, ) # Emit the traced method into ET Program. et_program = to_edge_transform_and_lower(method_graph).to_executorch() # Step 2: Construct MethodTestSuite for Each Method # Prepare the Test Inputs. # Number of input sets to be verified n_input = 10 # Input sets to be verified. inputs = [ # Each list below is a individual input set. # The number of inputs, dtype and size of each input follow Program's spec. [ (torch.rand(2, 2) - 0.5).to(dtype=torch.int32), (torch.rand(2, 2) - 0.5).to(dtype=torch.int32), ] for _ in range(n_input) ] # Generate Test Suites method_test_suites = [ MethodTestSuite( method_name=method_name, test_cases=[ MethodTestCase( inputs=input, expected_outputs=(getattr(model, method_name)(*input), ), ) for input in inputs ], ), ] # Step 3: Generate BundledProgram bundled_program = BundledProgram(et_program, method_test_suites) # Step 4: Serialize BundledProgram to flatbuffer. serialized_bundled_program = serialize_from_bundled_program_to_flatbuffer( bundled_program ) save_path = "bundled_program.bpte" with open(save_path, "wb") as f: f.write(serialized_bundled_program) ``` We can also regenerate `BundledProgram` from flatbuffer file if needed: ```python from executorch.devtools.bundled_program.serialize import deserialize_from_flatbuffer_to_bundled_program save_path = "bundled_program.bpte" with open(save_path, "rb") as f: serialized_bundled_program = f.read() regenerate_bundled_program = deserialize_from_flatbuffer_to_bundled_program(serialized_bundled_program) ``` ## Runtime Stage This stage mainly focuses on executing the model with the bundled inputs and comparing the model's output with the bundled expected output. We provide multiple APIs to handle the key parts of it. ### Get ExecuTorch Program Pointer from `BundledProgram` Buffer We need the pointer to ExecuTorch program to do the execution. To unify the process of loading and executing `BundledProgram` and Program flatbuffer, we create an API for this `executorch::bundled_program::get_program_data`. Check out an [example usage](https://github.com/pytorch/executorch/blob/release/1.0/examples/devtools/example_runner/example_runner.cpp#L128-L137) of this API. ### Load Bundled Input to Method To execute the program on the bundled input, we need to load the bundled input into the method. Here we provided an API called `executorch::bundled_program::load_bundled_input`. Check out an [example usage](https://github.com/pytorch/executorch/blob/release/1.0/examples/devtools/example_runner/example_runner.cpp#L253-L259) of this API. ### Verify the Method's Output. We call `executorch::bundled_program::verify_method_outputs` to verify the method's output with bundled expected outputs. Check out an [example usage](https://github.com/pytorch/executorch/blob/release/1.0/examples/devtools/example_runner/example_runner.cpp#L301-L307) of this API. ### Runtime Example Please checkout our [example runner](https://github.com/pytorch/executorch/blob/release/0.6/examples/devtools/README.md#bundledprogram) for a bundled program. 
You could run these commands to test with the `BundledProgram` binary (`.bpte`) file you generated in the previous step:

```bash
cd executorch
./examples/devtools/build_example_runner.sh
./cmake-out/examples/devtools/example_runner --bundled_program_path {your-bpte-file} --output_verification
```

It is expected to see no output from running the above snippet. For a detailed example of what the runner should look like, please refer to our [example runner](https://github.com/pytorch/executorch/blob/release/1.0/examples/devtools/example_runner/example_runner.cpp).

### Try the Complete Workflow

To test the entire end-to-end workflow, including building the example runner, exporting a model, and verifying the bundled program execution, you can use the test script:

```bash
cd executorch
./examples/devtools/test_example_runner.sh
```

This script will:
1. Build the example runner using `build_example_runner.sh`
2. Export a MobileNetV2 model as a bundled program
3. Run the example runner with the bundled program to verify correctness

This is a great way to ensure your environment is set up correctly and to see the complete BundledProgram workflow in action.

## Common Errors

Errors will be raised if the `List[MethodTestSuite]` doesn't match the `Program`. Here are two common situations:

### Test input doesn't match model's requirement.

Each inference method of a PyTorch model has its own requirements for the inputs, such as the number of inputs and the dtype of each input. `BundledProgram` will raise an error if the test inputs do not meet those requirements.

Here is an example where the dtype of a test input does not meet the model's requirement:

```python
import torch

from executorch.exir import to_edge
from executorch.devtools import BundledProgram

from executorch.devtools.bundled_program.config import MethodTestCase, MethodTestSuite
from torch.export import export


class Module(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.a = 3 * torch.ones(2, 2, dtype=torch.float)
        self.b = 2 * torch.ones(2, 2, dtype=torch.float)

    def forward(self, x):
        out_1 = torch.ones(2, 2, dtype=torch.float)
        out_2 = torch.ones(2, 2, dtype=torch.float)
        torch.mul(self.a, x, out=out_1)
        torch.add(out_1, self.b, out=out_2)
        return out_2


model = Module()
method_names = ["forward"]

inputs = (torch.ones(2, 2, dtype=torch.float), )

# Find each method of the model that needs to be traced by its name, and export its FX Graph.
method_graph = export(
    export(model, inputs).module(),
    inputs,
)

# Emit the traced methods into ET Program.
et_program = to_edge(method_graph).to_executorch()

# number of input sets to be verified
n_input = 10

# Input sets to be verified for each inference method.
# To simplify, here we create the same inputs for all methods.
inputs = {
    # Inference method name corresponding to its test cases.
m_name: [ # NOTE: executorch program needs torch.float, but here is torch.int [ torch.randint(-5, 5, (2, 2), dtype=torch.int), ] for _ in range(n_input) ] for m_name in method_names } # Generate Test Suites method_test_suites = [ MethodTestSuite( method_name=m_name, test_cases=[ MethodTestCase( inputs=input, expected_outputs=(getattr(model, m_name)(*input),), ) for input in inputs[m_name] ], ) for m_name in method_names ] # Generate BundledProgram bundled_program = BundledProgram(et_program, method_test_suites) ``` :::{dropdown} Raised Error ``` The input tensor tensor([[-2, 0], [-2, -1]], dtype=torch.int32) dtype shall be torch.float32, but now is torch.int32 --------------------------------------------------------------------------- AssertionError Traceback (most recent call last) Cell In[1], line 72 56 method_test_suites = [ 57 MethodTestSuite( 58 method_name=m_name, (...) 67 for m_name in method_names 68 ] 70 # Step 3: Generate BundledProgram ---> 72 bundled_program = create_bundled_program(program, method_test_suites) File /executorch/devtools/bundled_program/core.py:276, in create_bundled_program(program, method_test_suites) 264 """Create bp_schema.BundledProgram by bundling the given program and method_test_suites together. 265 266 Args: (...) 271 The `BundledProgram` variable contains given ExecuTorch program and test cases. 272 """ 274 method_test_suites = sorted(method_test_suites, key=lambda x: x.method_name) --> 276 assert_valid_bundle(program, method_test_suites) 278 bundled_method_test_suites: List[bp_schema.BundledMethodTestSuite] = [] 280 # Emit data and metadata of bundled tensor File /executorch/devtools/bundled_program/core.py:219, in assert_valid_bundle(program, method_test_suites) 215 # type of tensor input should match execution plan 216 if type(cur_plan_test_inputs[j]) == torch.Tensor: 217 # pyre-fixme[16]: Undefined attribute [16]: Item `bool` of `typing.Union[bool, float, int, torch._tensor.Tensor]` 218 # has no attribute `dtype`. --> 219 assert cur_plan_test_inputs[j].dtype == get_input_dtype( 220 program, program_plan_id, j 221 ), "The input tensor {} dtype shall be {}, but now is {}".format( 222 cur_plan_test_inputs[j], 223 get_input_dtype(program, program_plan_id, j), 224 cur_plan_test_inputs[j].dtype, 225 ) 226 elif type(cur_plan_test_inputs[j]) in ( 227 int, 228 bool, 229 float, 230 ): 231 assert type(cur_plan_test_inputs[j]) == get_input_type( 232 program, program_plan_id, j 233 ), "The input primitive dtype shall be {}, but now is {}".format( 234 get_input_type(program, program_plan_id, j), 235 type(cur_plan_test_inputs[j]), 236 ) AssertionError: The input tensor tensor([[-2, 0], [-2, -1]], dtype=torch.int32) dtype shall be torch.float32, but now is torch.int32 ``` ::: ### Method name in `BundleConfig` does not exist. Another common error would be the method name in any `MethodTestSuite` does not exist in Model. 
`BundledProgram` will raise error and show the non-exist method name: ```python import torch from executorch.exir import to_edge from executorch.devtools import BundledProgram from executorch.devtools.bundled_program.config import MethodTestCase, MethodTestSuite from torch.export import export class Module(torch.nn.Module): def __init__(self): super().__init__() self.a = 3 * torch.ones(2, 2, dtype=torch.float) self.b = 2 * torch.ones(2, 2, dtype=torch.float) def forward(self, x): out_1 = torch.ones(2, 2, dtype=torch.float) out_2 = torch.ones(2, 2, dtype=torch.float) torch.mul(self.a, x, out=out_1) torch.add(out_1, self.b, out=out_2) return out_2 model = Module() method_names = ["forward"] inputs = (torch.ones(2, 2, dtype=torch.float),) # Find each method of model needs to be traced my its name, export its FX Graph. method_graph = export( export(model, inputs).module(), inputs, ) # Emit the traced methods into ET Program. et_program = to_edge(method_graph).to_executorch() # number of input sets to be verified n_input = 10 # Input sets to be verified for each inference methods. # To simplify, here we create same inputs for all methods. inputs = { # Inference method name corresponding to its test cases. m_name: [ [ torch.randint(-5, 5, (2, 2), dtype=torch.float), ] for _ in range(n_input) ] for m_name in method_names } # Generate Test Suites method_test_suites = [ MethodTestSuite( method_name=m_name, test_cases=[ MethodTestCase( inputs=input, expected_outputs=(getattr(model, m_name)(*input),), ) for input in inputs[m_name] ], ) for m_name in method_names ] # NOTE: MISSING_METHOD_NAME is not an inference method in the above model. method_test_suites[0].method_name = "MISSING_METHOD_NAME" # Generate BundledProgram bundled_program = BundledProgram(et_program, method_test_suites) ``` :::{dropdown} Raised Error ``` All method names in bundled config should be found in program.execution_plan, but {'MISSING_METHOD_NAME'} does not include. --------------------------------------------------------------------------- AssertionError Traceback (most recent call last) Cell In[3], line 73 70 method_test_suites[0].method_name = "MISSING_METHOD_NAME" 72 # Generate BundledProgram ---> 73 bundled_program = create_bundled_program(program, method_test_suites) File /executorch/devtools/bundled_program/core.py:276, in create_bundled_program(program, method_test_suites) 264 """Create bp_schema.BundledProgram by bundling the given program and method_test_suites together. 265 266 Args: (...) 271 The `BundledProgram` variable contains given ExecuTorch program and test cases. 272 """ 274 method_test_suites = sorted(method_test_suites, key=lambda x: x.method_name) --> 276 assert_valid_bundle(program, method_test_suites) 278 bundled_method_test_suites: List[bp_schema.BundledMethodTestSuite] = [] 280 # Emit data and metadata of bundled tensor File /executorch/devtools/bundled_program/core.py:141, in assert_valid_bundle(program, method_test_suites) 138 method_name_of_program = {e.name for e in program.execution_plan} 139 method_name_of_test_suites = {t.method_name for t in method_test_suites} --> 141 assert method_name_of_test_suites.issubset( 142 method_name_of_program 143 ), f"All method names in bundled config should be found in program.execution_plan, \ 144 but {str(method_name_of_test_suites - method_name_of_program)} does not include." 146 # check if method_tesdt_suites has been sorted in ascending alphabetical order of method name. 
    147 for test_suite_id in range(1, len(method_test_suites)):

AssertionError: All method names in bundled config should be found in program.execution_plan, but {'MISSING_METHOD_NAME'} does not include.
```
:::

---

# Backend Dialect

## Overview

_Backend dialect_ is a special variant of [edge dialect](ir-exir.md): it contains backend-specific nodes and metadata introduced by backend-specific graph transformations. Backend dialect is an optional stage, only needed if we want to introduce backend-awareness into the graph. More specifically, a graph in backend dialect may contain operators or delegated lowered modules (see the [delegate doc](compiler-delegate-and-partitioner.md)) that are only meaningful to the target backend. One use case is operator fusion: for example, fusing consecutive addmm + relu into a single addmm_relu operator can be done here. This document describes how to introduce backend-specific operators.

The difference between custom ops and backend-specific ops: while custom ops show up in eager mode, ATen dialect, and edge dialect, backend-specific ops are only introduced by passes happening after edge dialect.

## When to use

This dialect allows the introduction of operators that do not conform to the schema defined in the canonical ATen operator set, and which do not appear in any of the dialects above (ATen dialect and edge dialect). Consider using backend operators if your use case satisfies one or more of the following criteria:

* Your backend provides a library that optimizes a certain operator that is equivalent to a subgraph. For example, `linear_relu` (equivalent to linear + relu) can be executed faster on a certain backend.
* There's a need to retrace the graph module after it is already lowered to a backend. When we retrace, backend operators can transform back to the original subgraph (in ATen dialect), whereas a normal custom op doesn't take care of that.
* Your backend-specific operator doesn't have a generic CPU kernel, only a kernel for a certain backend. Using a backend operator can work around this issue by using the original subgraph as the default kernel, keeping the graph module runnable.
* Alternatively, you can use a delegate if you are concerned this might be overkill and you just want something more lightweight that only requires Python code at the compiler stage.

## APIs

For an operator/subgraph replacement, the common flow is:

1. Register an operator that has the same input and output as the subgraph. This operator won't have the target-specific implementations (and doesn't need them at the compilation stage), but it needs to give the same result as the subgraph.
2. Create a pattern that allows the compiler to find the subgraph and substitute it with the replacement.
3. Write a pass to replace the subgraph with the new operator.

In order to facilitate the process, we provide an API to help reduce the effort for ExecuTorch users to do these steps.

### Pass Infra Entry Point

To lower edge ops to backend ops, a pass will perform pattern matching to identify the edge ops of interest in the graph, and then replace them with equivalent backend operators. There are two APIs to register such passes:

* `transform()`. An API on `ExportProgram` that allows users to provide custom passes. Note that this is not guarded by any validator, so the soundness of the program is not guaranteed.
* [ExecutorchBackendConfig.passes](https://github.com/pytorch/executorch/blob/main/exir/capture/_config.py#L40).
If added here, the pass will be part of the lowering process from backend dialect to ExecutorchProgram. Example: one such pass is QuantFusion. This pass takes a "canonical quantization pattern", ie. "dequant - some_op - quant" and fuses this pattern into a single operator that is backend specific, i.e. `quantized_decomposed::some_op`. Another simpler example is [here](https://github.com/pytorch/executorch/blob/main/exir/passes/replace_edge_with_backend_pass.py#L20) where we replace `sym_size` operators to the ones that are understood by ExecuTorch ### Pattern Binding Decorator We provide a decorator `bind_pattern_to_op` to help users easily register their backend operators into EXIR. This decorator takes: * a `torch.Library` object, it indicates which library or namespace this backend operator belongs to. * a name or schema. If we already defined the schema of the backend operator in the `torch.Library` object, only a name is needed. Otherwise we can register the schema if a schema string is being passed in. This decorator should be added to the pattern we are trying to match (and then lower to this backend op) on edge dialect. This way we are registering this pattern as a `CompositeImplicitAutograd` kernel for this backend operator. Then the operator can be accessed/used from the passes. The `CompositeImplicitAutograd` kernel makes sure: 1. No need for the user to write a (CPU) runnable kernel. 2. Ensures the retrace-ability of `ExportProgram`. Once retraced, the backend operator will be decomposed into the ATen ops used in the pattern. ## Example Let’s assume a simple program that contains both add and relu operators: ```python def f(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor: z = x + y return torch.ops.aten.relu.default(z) ``` After lowering to edge dialect it becomes: ``` graph(): %arg0_1 : [num_users=1] = placeholder[target=arg0_1] %arg1_1 : [num_users=1] = placeholder[target=arg1_1] %aten_add_tensor : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.add.Tensor](args = (%arg0_1, %arg1_1), kwargs = {}) %aten_relu_default : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.relu.default](args = (%aten_add_tensor,), kwargs = {}) return (aten_relu_default,) ``` Now I want to write a pass to merge `add` and `relu` into `add_relu`, the first step is to write a pattern: ```python # In the pattern, we can use edge ops and ATen ops interchangably def pattern(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor: z = torch.ops.aten.add.Tensor(x, y) out = torch.ops.aten.relu.default(z) return out ``` Then we need to create an operator library from the fused operator namespace, then use the decorator on our pattern: ```python lib = Library("foo_namespace", "DEF") @bind_pattern_to_op(lib, "add_relu(Tensor self, Tensor other) -> Tensor") def pattern(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor: z = torch.ops.aten.add.Tensor(x, y) out = torch.ops.aten.relu.default(z) return out ``` This way we are registering the pattern as a kernel to `add_relu` and it is ready to be used in a pass. A simple pass looks like this: ```python class AddReluFusionPass(ExportPass): def call(self, graph_module: GraphModule) -> PassResult: # decorator registers this pattern as a CompositeExplicitAutograd kernel, since there's no kernel registered before. 
@bind_pattern_to_op(lib, "add_relu") def pattern(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor: z = torch.ops.aten.add.Tensor(x, y) out = torch.ops.aten.relu.default(z) return out def replacement(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor: return torch.ops.foo_namespace.add_relu.default(x, y) subgraph_rewriter.replace_pattern( graph_module, _trace_and_lower_to_edge_ops(pattern), _trace_and_lower_to_edge_ops(replacement), ) return PassResult(graph_module, True) ``` The result graph looks like this: ``` graph(): %arg0_1 : [num_users=1] = placeholder[target=arg0_1] %arg1_1 : [num_users=1] = placeholder[target=arg1_1] %foo_namespace_add_relu_default : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.foo_namespace.add_relu.default](args = (%arg0_1, %arg1_1), kwargs = {}) return (foo_namespace_add_relu_default,) ``` ### Op Set There are the backend operators currently using `bind_pattern_to_op` API. * `executorch_prims::add.int(SymInt a, SymInt b) -> SymInt` * pattern: builtin.add * backend: executor * `executorch_prims::mul.int(SymInt a, SymInt b) -> SymInt` * pattern: builtin.mul * backend: executor * `executorch_prims::sub.int(SymInt a, SymInt b) -> SymInt` * pattern: builtin.sub * backend: executor * `executorch_prims::floordiv.int(SymInt a, SymInt b) -> SymInt` * pattern: builtin.floordiv * backend: executor * `executorch_prims::truediv.int(Scalar a, Scalar b) -> Scalar` * pattern: builtin.div * backend: executor * `executorch_prims::sym_float.Scalar(Scalar a) -> Scalar` * pattern: builtin.float * backend: executor * `executorch_prims::gt.int(SymInt a, SymInt b) -> bool` * pattern: builtin.gt * backend: executor * `executorch_prims::lt.int(SymInt a, SymInt b) -> bool` * pattern: builtin.lt * backend: executor * `executorch_prims::ge.int(SymInt a, SymInt b) -> bool` * pattern: builtin.ge * backend: executor * `executorch_prims::le.int(SymInt a, SymInt b) -> bool` * pattern: builtin.le * backend: executor * `executorch_prims::eq.int(SymInt a, SymInt b) -> bool` * pattern: builtin.eq * backend: executor * `executorch_prims::mod.Scalar(SymInt a, SymInt b) -> SymInt` * pattern: builtin.divmod * backend: executor * `executorch_prims::neg.Scalar(Scalar a) -> Scalar` * pattern: operator.ne * backend: executor * `quantized_decomposed::embedding_byte(Tensor weight, Tensor weight_scales, Tensor weight_zero_points, int weight_quant_min, int weight_quant_max, Tensor indices) -> Tensor` * pattern: [source](https://github.com/pytorch/executorch/blob/main/exir/passes/_quant_patterns_and_replacements.py) * backend: quantization * `quantized_decomposed::add(Tensor a, float a_scale, int a_zero_point, int a_quant_min, int a_quant_max, Tensor b, float b_scale, int b_zero_point, int b_quant_min, int b_quant_max, float out_scale, int out_zero_point, int out_quant_min, int out_quant_max) -> Tensor qc` * pattern: [source](https://github.com/pytorch/executorch/blob/main/exir/passes/_quant_patterns_and_replacements.py) * backend: quantization * `quantized_decomposed::add.scalar(Tensor qa, float a_scale, int a_zero_point, int a_quant_min, int a_quant_max, ScalarType a_dtype, Scalar b, float out_scale, int out_zero_point, int out_quant_min, int out_quant_max, ScalarType out_dtype) -> Tensor` * pattern: [source](https://github.com/pytorch/executorch/blob/main/exir/passes/_quant_patterns_and_replacements.py) * backend: quantization * `quantized_decomposed::add_relu(Tensor a, float a_scale, int a_zero_point, int a_quant_min, int a_quant_max, Tensor b, float b_scale, int b_zero_point, int 
b_quant_min, int b_quant_max, float out_scale, int out_zero_point, int out_quant_min, int out_quant_max) -> Tensor qc` * pattern: [source](https://github.com/pytorch/executorch/blob/main/exir/passes/_quant_patterns_and_replacements.py) * backend: quantization --- # Custom Compiler Passes and Partitioners ## Passes Passes can be roughly categorized into a couple of axes: Axis A: 1. Creating one-to-X mapping (for example, decomposition) 2. Creating many-to-one mapping (for example, fusion) Axis B: 1. Performing forwards iteration (for example, shape propagation) 2. Performing backwards iteration (for example, dead code elimination) Axis C: 1. Dependent on local node information (eg. out-variant conversion) 2. Dependent on global graph information (eg. memory planning) Our projection on the frequency of these use cases are: 1. A.1, B.1, C.1 2. A.2 3. B.2, C.2 ### Level 1 For level 1 uses cases (creating one-to-X mappings, performing forwards iterations, and looking at local node information), we can utilize a helper class called [`ExportPass`](https://github.com/pytorch/executorch/blob/d9eef24bb720804aa7b400b05241487510ae0dc2/exir/pass_base.py#L44). This is an [interpreter-based](https://pytorch.org/docs/stable/fx.html#the-interpreter-pattern) way where we execute each node and recreate the graph except with transformations specified. This allows us to preserve the IR Spec by ensuring that all nodes created while in the pass meet the IR Spec including ensuring that metadata such as stack trace, FakeTensor values, and torch.nn.Module hierarchy are preserved and updated depending on the transformations made. To implement this pass, we can create a subclass of [`ExportPass`](https://github.com/pytorch/executorch/blob/d9eef24bb720804aa7b400b05241487510ae0dc2/exir/pass_base.py#L44) and implement the exposed functions. When called with a graph module, it will run the graph module and create a new graph containing the changes specified by the pass. This means that the graph module passed in must be runnable on CPU, and this invariant will be maintained after the pass is run. #### One-to-One Pass An example for one-to-one mappings, if we wanted to replace an op A with another op B, we can run the given `fx.GraphModule`, and every time we see op A, return op B. Consider the following example: ```python class ReplaceInPlaceReluWithOutOfPlaceReluPass(ExportPass): """ relu_ is the in-place version. Replace it with relu, which is the out-of-place version """ def call_operator(self, op, args, kwargs, meta): if op != torch.ops.aten.relu_.default: return super().call_operator(op, args, kwargs, meta) return super().call_operator(Op(torch.ops.aten.relu.default), args, kwargs, meta) # To create a pass replace_pass = ReplaceInPlaceReluWithOutOfPlaceReluPass() # To run a pass new_graph_module = replace_pass(graph_module).graph_module ``` The `super().call_operator(op, args, kwargs, meta)` call creates a `call_function` FX node, and returns the result of running the operator with the given arguments. #### One-to-X Pass If we wanted to do one-to-X mappings, like replacing op A with 2 other ops B and C, we would then make 2 calls to `super().call_operator` to create 2 FX nodes, one with op B and another with op C, and return the result of running op C. 
For example:

```python
class ReplaceAddWithMulSub(ExportPass):
    """
    Original:
        def f(x, y):
            return x + y

    After pass:
        def f(x, y):
            z = x * y
            return z - y
    """
    def call_operator(self, op, args, kwargs, meta):
        if op != torch.ops.aten.add.Tensor:
            return super().call_operator(op, args, kwargs, meta)

        x, y = args

        mul_res = super().call_operator(
            torch.ops.aten.mul.Tensor,
            args,
            {},
            meta
        )
        return super().call_operator(
            torch.ops.aten.sub.Tensor,
            (mul_res, y),
            {},
            meta
        )
```

#### One-to-None Pass

If we wanted to remove an op, we can just return the value passed into the function:

```python
class RemoveDetachPass(ExportPass):
    def call_operator(self, op, args, kwargs, meta):
        if op not in (
            torch.ops.aten.detach.default,
            torch.ops.aten.detach_copy.default,
        ):
            return super().call_operator(op, args, kwargs, meta)

        assert len(args) == 1
        return args[0]
```

#### Utilizing Local Information

An example of utilizing local node information: if we wanted to convert all the scalars within the graph to tensors, we can run the given `fx.GraphModule`, and for every argument that contains a scalar, we convert it to a tensor. It might look something like:

```python
def args_map(op, fn, args, kwargs):
    assert isinstance(args, tuple)
    assert isinstance(kwargs, dict)
    args = list(args)
    kwargs = kwargs.copy()

    # Update the argument based on the function passed
    def update(key, args, schema):
        args[key] = fn(args[key], schema)

    # Update each argument in the schema
    for i, schema in enumerate(op._schema.arguments):
        if schema.name in kwargs:
            update(schema.name, kwargs, schema)
        elif not schema.kwarg_only and i < len(args):
            update(i, args, schema)

    return tuple(args), kwargs

class ScalarToTensorPass(ExportPass):
    def call_operator(self, op, args, kwargs, meta):
        def try_coerce(value, arg):
            return (
                torch.tensor(value)
                if isinstance(value, (float, int, bool))
                and type(arg.type) == torch.TensorType
                else value
            )

        args, kwargs = args_map(op, try_coerce, args, kwargs)
        return super().call_operator(op, args, kwargs, meta)
```

### Level 2

For creating many-to-one mappings, we can utilize FX's [subgraph rewriter](https://github.com/pytorch/pytorch/blob/8597d37536ef11bdf6b0a539ab79af876e1c92f6/torch/fx/subgraph_rewriter.py#L77). Given a `pattern`, it creates a subgraph of operators matching the pattern, and then replaces each matched subgraph with the `replacement`.

```{note}
This is an inplace operation.
```

The `pattern` and `replacement` inputs must be callable functions written with the same ops that are used in the EXIR graph you are matching with (ATen ops) so that the subgraph rewriter can find the correct pattern in the graph. Inputs to the pattern/replacement callables will be treated as wildcards.

Consider the following example:

```python
from torch.fx import subgraph_rewriter

def replace_patterns(graph_module):
    def pattern(x, y):
        x = torch.ops.aten.add.Tensor(x, y)
        x = torch.ops.aten.mul.Tensor(x, y)
        return x

    def replacement(x, y):
        return torch.ops.aten.sub.Tensor(x, y)

    replaced_patterns = subgraph_rewriter.replace_pattern_with_filters(
        graph_module, pattern, replacement
    )
```

The subgraph rewriter returns a list of `ReplacedPatterns`:

```python
@dataclass
class ReplacedPatterns:
    # Node from which the match was found
    anchor: Node
    # Maps nodes in the pattern subgraph to nodes in the larger graph
    nodes_map: Dict[Node, Node]
    # List of nodes that were added into the graph
    replacements: List[Node]
```

```{note}
The nodes created by the subgraph rewriter will not have the metadata that is normally in EXIR nodes (`stack_trace`, `val`, `nn_module_stack`).
```

### Level 3

For the third way of creating a pass, we can utilize the most basic [`PassBase`](https://github.com/pytorch/pytorch/blob/8597d37536ef11bdf6b0a539ab79af876e1c92f6/torch/fx/passes/infra/pass_base.py#L22). To create a pass, we can subclass this and implement the function `call` with the pass contents. Additionally, we can implement the functions `requires` and `ensures` which will be called before and after the function `call`. Note that these functions can also be overridden in `ExportPass`. To run a pass on a graph module, we can pass the graph module directly to an instance of the class.

Consider the following example:

```python
class ReplaceAddPass(PassBase):
    def __init__(self, replace_op):
        self.replace_op = replace_op

    def call(self, graph_module):
        for node in graph_module.graph.nodes:
            if node.op == "call_function" and node.target == torch.add:
                node.target = self.replace_op

    # Optional to implement, will be called before call()
    def requires(self, graph_module) -> None:
        for node in graph_module.graph.nodes:
            if node.op == "call_function" and node.target == torch.add:
                return
        raise ValueError("No torch.add ops!")

    # Optional to implement, will be called after call()
    def ensures(self, graph_module: torch.fx.GraphModule) -> None:
        pass

# To create a pass
replace_add_with_div = ReplaceAddPass(torch.div)
# To run a pass
replace_add_with_div(graph_module)
```

## Pass Manager

The `PassManager` is a class used to run multiple passes on a given graph module. When initializing a `PassManager` instance, we pass in a list of passes that we want to run and set a couple of flags. To run the collection of passes on a graph module, we can pass the graph module directly to the `PassManager` instance.

An example:

```python
from executorch.exir.pass_manager import PassManager

pm = PassManager(
    passes=[replace_add_with_div, replace_div_with_mul],
    run_checks_after_each_pass=True,
    suppress_check_failures=False,
)
graph_module_out = pm(graph_module)
```

To add a common set of checks that are run after each pass, we can call the function `add_checks(check: Callable)` which takes a callable function as input. If the `run_checks_after_each_pass` flag is set, the `check` will be called after each pass is run on the graph module.

An example:

```python
pm = PassManager(passes=[replace_add_with_div, replace_div_with_mul])

def check_div_target(graph_module):
    for node in graph_module.graph.nodes:
        if node.op == "call_function" and node.target != torch.div:
            raise ValueError("Target should be div!")

pm.add_checks(check_div_target)

pm(graph_module)    # raises ValueError after replace_div_with_mul pass
```

## Partitioner

There are a couple of common FX-graph based partitioners we can use to partition the graph. However, these do not necessarily produce a graph that is compliant with the IR Spec, so be careful when using them.

### Subgraph Matcher

For finding subgraphs within a graph that match a specific pattern, we can utilize FX's [`SubgraphMatcher`](https://github.com/pytorch/pytorch/blob/8597d37536ef11bdf6b0a539ab79af876e1c92f6/torch/fx/passes/utils/matcher_utils.py#L51).

Class Attributes:

* `pattern (Graph)`: The targeted matching pattern. Placeholder nodes in the graph will be treated as wildcards when matching.
* `match_output (bool)`: If True, the output node in the pattern graph will be treated as a part of the targeted pattern. If False, the output node is ignored during the match.
* `match_placeholder (bool)`: If True, the placeholder node in the pattern graph will be treated as a part of the targeted pattern.
If False, placeholder nodes will be used as wildcards.
* `remove_overlapping_matches (bool)`: If True, in the case of overlapping matches, only the first match will be returned.
* `ignore_literals (bool)`: If True, will not check if literals are equal and will instead treat them as wildcards.

Consider the following example:

```python
from torch.fx.passes.utils.matcher_utils import SubgraphMatcher

class LargeModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self._weight = torch.nn.Parameter(torch.ones(3, 3))
        self._bias = torch.nn.Parameter(torch.ones(3, 3))

    def forward(self, x):
        return torch.ops.aten.addmm.default(self._bias, x, self._weight)

large_model_graph = to_edge(export(LargeModel(), large_inputs)).exported_program().graph_module.graph

class PatternModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self._weight_1 = torch.nn.Parameter(torch.ones(5, 5))
        self._bias_1 = torch.nn.Parameter(torch.ones(5, 5))

    def forward(self, x):
        return torch.ops.aten.addmm.default(self._bias_1, x, self._weight_1)

pattern_graph = to_edge(export(PatternModel(), pattern_inputs)).exported_program().graph_module.graph

subgraph_matcher = SubgraphMatcher(pattern_graph)
match_result = subgraph_matcher.match(large_model_graph)
```

The `match` function returns a list of `InternalMatch`:

```python
@dataclass
class InternalMatch():
    # Nodes from which the match was found
    anchors: List[Node]
    # Maps nodes in the pattern subgraph to nodes in the larger graph
    nodes_map: Dict[Node, Node] = field(default_factory=dict)
    # Nodes in target graph that are matched placeholder in pattern
    placeholder_nodes: List[Node] = field(default_factory=list)
    # Nodes in matched subgraph returned by output
    returning_nodes: List[Node] = field(default_factory=list)
```

### Capability Based Partitioner

To find the largest subgraphs of nodes that support a specific invariant, we can utilize FX's [`CapabilityBasedPartitioner`](https://github.com/pytorch/pytorch/blob/8597d37536ef11bdf6b0a539ab79af876e1c92f6/torch/fx/passes/infra/partitioner.py#L34C1-L34C1).

Class Attributes

* `graph_module (torch.fx.GraphModule)`: The graph module we are partitioning on.
* `operator_support (OperatorSupportBase)`: The object used to determine if a node in the graph is supported in the partition.
* `allows_single_node_partition (bool)`: If True, allows single node partitions to be formed.
* `non_compute_ops (Optional[Sequence[str]])`: A set of ops that are considered to be "non-compute" (e.g. `torch.ops.aten.view` and `_operator.getitem`), so that the partitioner will not create graphs that only contain these non-compute ops.
* `allowed_single_node_partition_ops (Optional[Sequence[str]])`: A set of ops that are allowed to be in a single node partition.

The [`OperatorSupportBase`](https://github.com/pytorch/pytorch/blob/8597d37536ef11bdf6b0a539ab79af876e1c92f6/torch/fx/passes/operator_support.py#L28) class is used by the partitioner to determine if a specific node in the graph belongs in the partition. This is done by overriding the `is_node_supported` function. You can chain multiple `OperatorSupportBase` objects by using [`chain`](https://github.com/pytorch/pytorch/blob/8597d37536ef11bdf6b0a539ab79af876e1c92f6/torch/fx/passes/operator_support.py#L150) (which returns False if any of the OperatorSupportBase return False) and [`any_chain`](https://github.com/pytorch/pytorch/blob/8597d37536ef11bdf6b0a539ab79af876e1c92f6/torch/fx/passes/operator_support.py#L164) (which returns True if any of the OperatorSupportBase returns True).
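As a small illustration of composing these helpers (the two toy `OperatorSupportBase` subclasses below are invented for the sketch):

```python
import torch
from torch.fx.passes.operator_support import OperatorSupportBase, chain, any_chain

class SupportsAdd(OperatorSupportBase):
    def is_node_supported(self, submodules, node: torch.fx.Node) -> bool:
        return node.op == "call_function" and node.target == torch.ops.aten.add.Tensor

class SupportsMul(OperatorSupportBase):
    def is_node_supported(self, submodules, node: torch.fx.Node) -> bool:
        return node.op == "call_function" and node.target == torch.ops.aten.mul.Tensor

# chain(): a node is supported only if every check accepts it.
add_and_mul_support = chain(SupportsAdd(), SupportsMul())

# any_chain(): a node is supported if at least one check accepts it.
add_or_mul_support = any_chain(SupportsAdd(), SupportsMul())
```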
Consider the following example:

```python
from torch.fx.passes.infra.partitioner import CapabilityBasedPartitioner
from torch.fx.passes.operator_support import any_chain, OperatorSupportBase

class AddMulOperatorSupport(OperatorSupportBase):
    def is_node_supported(self, submodules, node: torch.fx.Node) -> bool:
        return node.op == "call_function" and node.target in [
            torch.ops.aten.add.Tensor,
            torch.ops.aten.mul.Tensor,
        ]

op_support = AddMulOperatorSupport()

capability_partitioner = CapabilityBasedPartitioner(
    graph_module,
    op_support,
)

# Returns a list of partitions (list of nodes that belong in each partition)
partition_list = capability_partitioner.propose_partitions()
```

If you look at the capability based partitioner, you may also find a `fuse_partitions` function, which will return a modified graph with the partitions as submodules and calls to these submodules in the top-level graph through `call_module` nodes. However, this is not compliant with the IR Spec because we do not allow `call_module` nodes.

### Combined

We also provide a combined helper function: [`generate_pattern_op_partitions`](https://github.com/pytorch/executorch/blob/d9eef24bb720804aa7b400b05241487510ae0dc2/exir/backend/canonical_partitioners/pattern_op_partitioner.py#L59)

Args:

* `graph_module (fx.GraphModule)`: Module that we want to partition
* `patterns (List[torch.fx.Graph])`: A list of patterns in the form of torch.fx.Graph. These graphs can be obtained through the `graph` field from a GraphModule obtained by exir.capture (recommended) or symbolic tracing (which might not result in an accurate edge dialect graph), or by manually crafting a graph module.
* `op_support (OperatorSupportBase)`: An OperatorSupportBase that can be created in the following ways:
    * Subclassing it directly and implementing `is_node_supported()`
    * Getting the result of `create_op_support()`
    * Getting the result of `create_pattern_support()`
    * Multiple OperatorSupportBase classes chained together with `chain()` or `any_chain()`

Returns:

* A list of partitions (largest possible subgraphs) containing nodes that are supported by the union of the given OperatorSupportBase object and the given pattern graphs.

### Source Partitioner

For more complicated use cases in which users want to partition based on higher level modules (`torch.nn.Linear` or `torch.nn.functional.linear`) which are now decomposed into their operators (`aten.permute`, `aten.addmm`), we have the following [helper function](https://github.com/pytorch/pytorch/blob/main/torch/fx/passes/utils/source_matcher_utils.py#L51):

`get_source_partitions(graph: torch.fx.Graph, wanted_sources: List[Any]) -> Dict[Any, SourcePartition]`

Args:

* `graph`: The graph we want to partition
* `wanted_sources`: List of sources of nodes that were decomposed from this source. This can be a function (ex. `torch.nn.functional.linear`) or a leaf module type (ex. `torch.nn.Linear`)

Returns:

* Dictionary mapping sources (ex. `torch.nn.modules.linear.Linear`) to a list of `SourcePartitions` that correspond to the list of nodes that were flattened from a module of that type.
```python
@dataclass
class SourcePartition():
    # Nodes in a particular partition
    nodes: List[Node]
    # Module type
    module_type: Type
    # Nodes in the graph that are needed as inputs to the partition
    input_nodes: List[Node] = field(default_factory=list)
    # Nodes in the partition that are being used by nodes outside of the partition
    output_nodes: List[Node] = field(default_factory=list)
    # Parameters that are being used
    params: List[str] = field(default_factory=list)
```

An example:

```python
class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear1 = torch.nn.Linear(3, 3)
        self.relu = torch.nn.ReLU()
        self.linear2 = torch.nn.Linear(3, 5)

    def forward(self, x):
        x = self.linear1(x)
        x = self.linear1(x)
        x = self.relu(x)
        x = self.linear2(x)
        return x

inputs = (torch.randn(3, 3),)
edge_graph = to_edge(export(M(), inputs)).exported_program().graph_module.graph
print(edge_graph)
"""
graph():
    %arg0 : [#users=1] = placeholder[target=arg0]
    %_param_constant0 : [#users=1] = get_attr[target=_param_constant0]
    %permute_default : [#users=1] = call_function[target=torch.ops.aten.permute_copy.default](args = (%_param_constant0,), kwargs = {})
    %_param_constant1 : [#users=1] = get_attr[target=_param_constant1]
    %addmm_default : [#users=1] = call_function[target=torch.ops.aten.addmm.default](args = (%_param_constant1, %arg0, %permute_default), kwargs = {})
    %_param_constant0_1 : [#users=1] = get_attr[target=_param_constant0]
    %permute_default_1 : [#users=1] = call_function[target=torch.ops.aten.permute_copy.default](args = (%_param_constant0_1,), kwargs = {})
    %_param_constant1_1 : [#users=1] = get_attr[target=_param_constant1]
    %addmm_default_1 : [#users=1] = call_function[target=torch.ops.aten.addmm.default](args = (%_param_constant1_1, %addmm_default, %permute_default_1), kwargs = {})
    %relu_default : [#users=1] = call_function[target=torch.ops.aten.relu.default](args = (%addmm_default_1,), kwargs = {})
    %_param_constant2 : [#users=1] = get_attr[target=_param_constant2]
    %permute_default_2 : [#users=1] = call_function[target=torch.ops.aten.permute_copy.default](args = (%_param_constant2,), kwargs = {})
    %_param_constant3 : [#users=1] = get_attr[target=_param_constant3]
    %addmm_default_2 : [#users=1] = call_function[target=torch.ops.aten.addmm.default](args = (%_param_constant3, %relu_default, %permute_default_2), kwargs = {})
    return [addmm_default_2]
"""

module_partitions = get_source_partitions(edge_graph, [torch.nn.Linear, torch.nn.ReLU])
print(module_partitions)
"""
{<class 'torch.nn.modules.linear.Linear'>: [
    ModulePartition(nodes=[_param_constant0, permute_default, _param_constant1, addmm_default], module_type=<class 'torch.nn.modules.linear.Linear'>, input_nodes=[arg0], output_nodes=[addmm_default], params=["_param_constant0", "_param_constant1"]),
    ModulePartition(nodes=[_param_constant0_1, permute_default_1, _param_constant1_1, addmm_default_1], module_type=<class 'torch.nn.modules.linear.Linear'>, input_nodes=[addmm_default], output_nodes=[addmm_default_1], params=["_param_constant0_1", "_param_constant1_1"]),
    ModulePartition(nodes=[_param_constant2, permute_default_2, _param_constant3, addmm_default_2], module_type=<class 'torch.nn.modules.linear.Linear'>, input_nodes=[relu_default], output_nodes=[addmm_default_2], params=["_param_constant2", "_param_constant3"])],
 <class 'torch.nn.modules.activation.ReLU'>: [
    ModulePartition(nodes=[relu_default], module_type=<class 'torch.nn.modules.activation.ReLU'>, input_nodes=[addmm_default_1], output_nodes=[relu_default], params=[])]}
"""
```

---

# Understanding Backends and Delegates

Audience: Vendors and backend delegate developers who are interested in integrating their own compilers and hardware as part of ExecuTorch

Backend delegation is an entry point for backends to process and execute PyTorch programs to leverage the performance and
efficiency benefits of specialized backends and hardware, while still providing PyTorch users with an experience close to that of the PyTorch runtime.

## Backend Interfaces: Overview

At a high level, the entry point for backends is defined by two components:

- An IR to represent the program: **Edge Dialect** (which is produced through the `to_edge` API)
- A couple of interfaces for backends to implement:
    - Ahead-of-Time (AOT)
        - Program preprocessing (e.g. ahead-of-time compilation, transformation, optimization...).
    - Runtime
        - Program initialization (e.g. runtime compilation).
        - Program execution.
        - (optional) Program destroy (e.g. release backend-owned resources).

A delegate backend implementation is composed of:

1) An ahead-of-time preprocessing interface
2) A runtime initialization and execution interface

**Figure 1.** A high-level view of the entry points for backend interfaces, including both ahead-of-time and runtime.

## Backend Interfaces: Ahead-of-Time Preprocessing

There are mainly two ahead-of-time entry points for a backend to implement: `partition` and `preprocess`.

`partitioner` is an algorithm implemented by the backend to tag the nodes to be lowered to the backend. The `to_backend` API will apply the partition algorithm and lower each subgraph, which consists of connected tagged nodes, to the targeted backend. Every subgraph will be sent to the `preprocess` part provided by the backend to be compiled as a binary blob.

During partition, the partitioner is not allowed to mutate the `exported_program`; it is supposed to apply a tag to each node. The `PartitionResult` includes both the tagged exported program and the partition tags dictionary, which `to_backend` uses to look up the tag and link it to the `backend_id` and `compile_spec`.

```python
def partition(
    exported_program: ExportedProgram,
) -> PartitionResult:
```

During preprocessing, backends are given an edge dialect program and a list of compile specs specifying the values needed for compilation, and are expected to return a compiled blob, or binary, containing the desired program to be run in the backend. During serialization, the compiled blob will be serialized as part of the `.pte` file and directly loaded to the device. The API for this process is:

```python
def preprocess(
    edge_program: ExportedProgram,
    compile_specs: List[CompileSpec],
) -> PreprocessResult:
```

A demo of the preprocess function is implemented [here](https://github.com/pytorch/executorch/blob/main/exir/backend/test/backend_with_compiler_demo.py). The demo loops through the nodes in the graph module of the `edge_program` and serializes the `add`, `mul`, and `sin` instructions into a string, which is later parsed and executed at runtime.

**Figure 2.** The graph goes through partition and each subgraph will be sent to the preprocess part.
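For illustration, here is a minimal sketch of a `preprocess` implementation in the spirit of the demo backend linked above. The `MyToyBackend` class and its "serialize op names into a text blob" strategy are hypothetical, and the import paths are assumptions that should be checked against the ExecuTorch source; a real backend would emit its own binary format.

```python
from typing import List

from executorch.exir.backend.backend_details import BackendDetails, PreprocessResult
from executorch.exir.backend.compile_spec_schema import CompileSpec
from torch.export.exported_program import ExportedProgram


class MyToyBackend(BackendDetails):
    """Illustrative only: records the names of call_function nodes as the 'compiled' blob."""

    @staticmethod
    def preprocess(
        edge_program: ExportedProgram,
        compile_specs: List[CompileSpec],
    ) -> PreprocessResult:
        instructions = []
        for node in edge_program.graph_module.graph.nodes:
            if node.op == "call_function":
                # A real backend would compile the node here; we just record its target name.
                instructions.append(str(node.target))
        blob = "\n".join(instructions).encode("utf-8")
        # processed_bytes is what gets serialized into the .pte file for this delegate.
        return PreprocessResult(processed_bytes=blob)
```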
## Backend Interfaces: Runtime Initialization and Execution

During the runtime, the compiled blob from the `preprocess` function will be loaded and passed directly to the backend's custom `init` function. This function is responsible for further processing the compiled unit, as well as performing any backend initialization. The backend's custom `execute` function will then be called to execute the handle produced by `init`. Finally, if destruction is required for some backend, the backend can implement a `destroy` function, which will be called when the program is out of its lifespan.

```cpp
// Runtime check
ET_NODISCARD bool is_available();

// Runtime initialization
ET_NODISCARD virtual Result<DelegateHandle*> init(
    BackendInitContext& context,
    FreeableBuffer* processed,
    ArrayRef<CompileSpec> compile_specs);

// Runtime execution
ET_NODISCARD virtual Error execute(
    BackendExecutionContext& context,
    DelegateHandle* handle,
    Span<EValue*> args);

// [optional] Runtime destroy. Destroy the resource held by the backend
virtual void destroy(ET_UNUSED DelegateHandle* handle);
```

**Figure 3.** The relationship between the standard ExecuTorch runtime and the backend entry point.

In order to make a backend available to the ExecuTorch runtime, it must be registered via the `register_backend` API:

```cpp
ET_NODISCARD Error register_backend(const Backend& backend);
```

Static registration, i.e., at library init or load time, of a backend can be achieved as follows:

```cpp
namespace {
auto cls = BackendWithCompiler();
Backend backend{"BackendWithCompilerDemo", &cls};
static auto success_with_compiler = register_backend(backend);
} // namespace
```

## Developer Tools Integration: Debuggability

Providing a consistent debugging experience, be it for runtime failures or performance profiling, is important. ExecuTorch employs native Developer Tools for this purpose, which enable correlating program instructions to the original PyTorch code via debug handles. You can read more about it [here](etrecord.rst).

Delegated programs or subgraphs are opaque to the ExecuTorch runtime and appear as a special `call_delegate` instruction, which asks the corresponding backend to handle the execution of the subgraph or program. Due to the opaque nature of backend delegates, the native Developer Tools do not have visibility into the delegated program. Thus the debugging experience of delegated execution, functional or performance, suffers significantly compared to its non-delegated counterpart.

In order to provide a consistent debugging experience to users, regardless of the use of delegation for a model, the Developer Tools provide an interface to correlate a delegated (sub)graph to the original (sub)graph. The Developer Tools do so via a debug handles map, which allows delegates to generate internal handles that can be associated with the original (sub)graph consumed by the delegate. Then at runtime, the backend developer can report errors or profiling information using the internal handle, which will be mapped back to the original (sub)graph using the debug handle map. For more information, please refer to [Delegate Debugging](delegate-debugging.md).

By leveraging the debug identifier, the backend developer can embed the debug handle as part of the delegated blob. In this way, during the execute stage, the backend developer can use the debug identifier to associate a failed instruction inside the delegate back to the exact line of Python code.

## Common Questions

**1. How can we get data in backend.preprocess?**

The graph module being preprocessed is a lifted graph, which means that static data like weights and biases are supplied as inputs to the graph. However, we can access the weights and biases ahead of time through the exported program. To access these parameters from a given node, we can use the function `get_params` provided in `torch/_export/utils.py`.

**2. How can we embed the data (like weight/bias) to the backend?**

It's common that backends have some ways to optimize the const data.
In this case, we'd need to tag the placeholder nodes, which are also the state, in the partitioner, and during backend.preprocess we can follow the description in the first question to get the weight.

**3. How can we run the lowered module in Python with the specific backend?**

We haven't added the support yet but that's the plan!

**4. Should we expect to see `get_attr` nodes in the edge dialect program?**

`get_attr` nodes will only show up for submodules used for control flow or delegation. They won't hold any data.

**5. Can we delegate to multiple backends?**

Yes! There are two ways to do this:

*Option 1: Run to_backend multiple times for different backends*

If we have two backends, backend_1 and backend_2, and they have their own partitioners, backend_1_partitioner and backend_2_partitioner, we can run it like:

```python
# Will first lower nodes to backend_1 depending on backend_1_partitioner's partitioning algorithm
exported_program_backend_1 = to_backend(exported_program, backend_1_partitioner())
# The rest of the nodes will be lowered to backend_2 depending on backend_2_partitioner
exported_program_backend_1_and_2 = to_backend(exported_program_backend_1, backend_2_partitioner())
```

A more concrete example can be found [here](https://github.com/pytorch/executorch/blob/main/exir/backend/test/demos/test_xnnpack_qnnpack.py). In this example, qnnpack is one backend and xnnpack is another backend. We haven't open-sourced these two backend delegates yet, so this example won't run out of the box. It can be used as a reference to see how it can be done.

This option is easy to try because usually all backends will implement their own partitioner. However, this option may get different results if we change the order of the to_backend calls. If we want better control over which nodes go to which backend, option 2 is better.

*Option 2: Have a partitioner which partitions for different backends*

Another option is to create a customized partitioner, say `backend_1_2_partitioner`, and inside the partitioner logic,

```python
class Backend_1_2_Partitioner(Partitioner):
    """
    Partitions nodes between Backend1 and Backend2.
    """

    def __init__(self) -> None:
        self.delegation_spec_1 = DelegationSpec("Backend1", [])
        self.delegation_spec_2 = DelegationSpec("Backend2", [])
        self.partition_tags = {}

    def partition(
        self, exported_program: ExportedProgram
    ) -> PartitionResult:
        # Tag all nodes in the first partition for backend 1
        nodes_to_backend_1 = ...  # some logic to select the nodes from the graph
        for node in nodes_to_backend_1:
            delegation_tag = "backend_1_tag"
            node.meta["delegation_tag"] = delegation_tag
            self.partition_tags[delegation_tag] = self.delegation_spec_1

        # Tag all nodes in the second partition for backend 2
        nodes_to_backend_2 = ...  # some logic to select the nodes from the graph
        for node in nodes_to_backend_2:
            delegation_tag = "backend_2_tag"
            node.meta["delegation_tag"] = delegation_tag
            self.partition_tags[delegation_tag] = self.delegation_spec_2

        return PartitionResult(
            tagged_exported_program=exported_program,
            partition_tags=self.partition_tags,
        )
```

**6. Is there an easy way to write a partitioner?**

We provide some helper partitioners [here](compiler-custom-compiler-passes.md) to make it easy to find nodes from decomposed operators.

**7.
How do we link the node back to the source code?**

We provide a helper function:

```python
from executorch.exir.print_program import inspect_node

print(inspect_node(graph, node))
```

It will highlight the node in the graph as well as point to the source code. Example output looks like the following:

```
_param_constant1
error_msg:  Here is the node in the graph module:
graph():
    %arg0_1 : [num_users=1] = placeholder[target=arg0_1]
    %_param_constant0 : [num_users=1] = get_attr[target=_param_constant0]
--> %_param_constant1 : [num_users=1] = get_attr[target=_param_constant1]
    %aten_convolution_default : [num_users=2] = call_function[target=executorch.exir.dialects.edge._ops.aten.convolution.default](args = (%arg0_1, %_param_constant0, %_param_constant1, [1, 1], [0, 0], [1, 1], False, [0, 0], 1), kwargs = {})
    %_param_constant2 : [num_users=1] = get_attr[target=_param_constant2]
    %_param_constant3 : [num_users=1] = get_attr[target=_param_constant3]
    %aten_convolution_default_1 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.convolution.default](args = (%aten_convolution_default, %_param_constant2, %_param_constant3, [1, 1], [0, 0], [1, 1], False, [0, 0], 1), kwargs = {})
    %aten_add_tensor : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.add.Tensor](args = (%aten_convolution_default, %aten_convolution_default_1), kwargs = {})
    %_param_constant4 : [num_users=1] = get_attr[target=_param_constant4]
    %_param_constant5 : [num_users=1] = get_attr[target=_param_constant5]
    %aten_convolution_default_2 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.convolution.default](args = (%aten_add_tensor, %_param_constant4, %_param_constant5, [1, 1], [0, 0], [1, 1], False, [0, 0], 1), kwargs = {})
    %aten_gelu_default : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.gelu.default](args = (%aten_convolution_default_2,), kwargs = {})
    return [aten_gelu_default]
This node _param_constant1 has metadata of:
The node stacktrace:
Traceback (most recent call last):
    File "/tmp/ipykernel_1204253/3382880687.py", line 7, in forward
        return self.test_model(x)
    File "/mnt/xarfuse/uid-25337/7b86ad0c-seed-nspid4026532987_cgpid2707357-ns-4026532984/torch/nn/modules/module.py", line 1528, in _call_impl
        return forward_call(*args, **kwargs)
    File "/tmp/ipykernel_1204253/712280972.py", line 10, in forward
        a = self.conv1(x)
```

---

# Compiler Entry Points

```{toctree}
:maxdepth: 1

compiler-backend-dialect
compiler-custom-compiler-passes
compiler-memory-planning
```

---

(compiler-ir-advanced)=
# Compiler & IR

Advanced compiler features and intermediate representation specifications.

## Compiler Passes

- {doc}`compiler-custom-compiler-passes` — Custom compiler passes and optimization

## Memory Management

- {doc}`compiler-memory-planning` — Advanced memory planning strategies

## Intermediate Representation

- {doc}`ir-exir` — EXIR (Export Intermediate Representation) specification
- {doc}`ir-ops-set-definition` — Ops set definition and operator standardization

## Backend dialect

- {doc}`compiler-backend-dialect` — Backend dialect and compiler integration

```{toctree}
:hidden:
:maxdepth: 1

compiler-custom-compiler-passes
compiler-memory-planning
ir-exir
ir-ops-set-definition
compiler-backend-dialect
```

---

# Memory Planning

Audience: Backend integrators and embedded developers who are interested in customizing the regions of memory ExecuTorch programs operate in.
## Overview

Memory planning is the very last action taken before an `ExportedProgram` is emitted as an ExecuTorch program. During this process, ExecuTorch takes the size and lifespan of each mutable tensor and plans out their locations in fixed-size memory arenas.

Concretely, there are three passes related to memory planning:

* `SpecPropPass` computes a TensorSpec for each tensor in the graph (inputs, intermediates or outputs). The most important field of the tensor spec is a symbolic expression of the shapes of the tensor, where the initial set of symbols comes from the dimensions of input tensors, and the symbolic expressions of intermediate tensor shapes are propagated via tensor operations. The dimensions can be marked as either dynamic or static by users, and when the dims are dynamic, users are required to annotate the dim with a ValueRange.

* `SymShapeEvalPass` evaluates the symbolic expressions to concrete integers with their upper bounds. There are two ways of doing the upper bound specialization: HintBasedSymShapeEval (to be deprecated) is the old way of evaluating the upper bound. It doesn't look at the ValueRange of the symbols but uses the shapes of example inputs to replace all the symbols. We call it "hint based" because the example inputs' shapes are just hints of what the input shapes might be at run time and are used for tracing only. ValueRangeBasedSymShapeEval is the recommended way of doing upper-bound memory planning. It will actually look at the ValueRange of the symbols and do an inference over the ranges to get a real upper bound.

* `MemoryPlanningPass` does the actual memory planning, given that all tensors have a TensorSpec with concrete integer shapes.

## Algorithms

ExecuTorch provides two options for memory planning algorithms out of the box, but users can define their own if the provided options are inappropriate or insufficient for their use case.

* The naive algorithm simply concatenates all the tensors together in a linear memory block without considering memory re-use. It serves as an upper bound for total memory consumption and serves as a baseline.

* The greedy algorithm tries to re-use already allocated memory based on the best-fit criterion. Specifically: when there isn't an already allocated buffer whose lifetime doesn't overlap with the current tensor we are planning for, we allocate a new memory buffer with the same size and lifetime as the current tensor. When there are one or more allocated buffers whose lifetimes do not overlap with the current tensor, we pick the buffer whose size is closest to that of the current tensor, so as to reduce memory fragmentation. Finally, we allocate these memory buffers linearly in memory. The sketch below illustrates the best-fit selection step.
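The following is a simplified, illustrative sketch of the best-fit selection just described; it is not the actual `memory_planning.py` implementation (see the link later in this section), and the `Buffer` data structure is hypothetical. For simplicity it only reuses buffers at least as large as the request, whereas a real implementation may also grow an existing buffer.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Buffer:
    size: int  # bytes reserved for this buffer
    lifetimes: List[Tuple[int, int]] = field(default_factory=list)  # (start, end) node ranges already assigned


def overlaps(a: Tuple[int, int], b: Tuple[int, int]) -> bool:
    return not (a[1] < b[0] or b[1] < a[0])


def pick_best_fit(buffers: List[Buffer], size: int, lifetime: Tuple[int, int]) -> Optional[Buffer]:
    """Return the free buffer whose size is closest to the request, or None if a new buffer is needed."""
    candidates = [
        buf for buf in buffers
        if buf.size >= size and all(not overlaps(lifetime, lt) for lt in buf.lifetimes)
    ]
    if not candidates:
        return None  # caller allocates a fresh buffer with this size and lifetime
    return min(candidates, key=lambda buf: buf.size - size)
```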
## Method Inputs and Outputs

The `MemoryPlanningPass` exposes the option to not memory plan program inputs and outputs. If the IO is not planned, then users will be expected to provide data buffers to back these values at runtime.

Example:

```python
program = edge_program.to_executorch(
    exir.ExecutorchBackendConfig(
        memory_planning_pass=MemoryPlanningPass(
            alloc_graph_input=False,  # Inputs will not be memory planned; the data_ptr for input tensors after model load will be nullptr
            alloc_graph_output=True,  # Outputs will be memory planned; the data_ptr for output tensors after model load will be in the planned memory
        )
    )
)
```

One common set-up is models whose outputs are provided as inputs to subsequent inferences. In that situation, it would generally be better to not memory plan the IO, and instead provide the same buffer to both the input and output at runtime to avoid a copy.

## Custom Memory Plans

Users can write custom memory plans to take advantage of multiple memory locations (like SRAM and DRAM), place the outputs of specific nodes in specific locations, or even change the planning algorithm itself. The following example shows how you could reuse the provided planning algorithms, but with multiple hierarchies, placing the outputs of specific ops in specific memory arenas.

```python
class CustomPoolMemoryPlanningPass(MemoryPlanningPass):
    def run(self, graph_module: GraphModule, graph_signature: Optional[ExportGraphSignature]) -> PassResult:
        for subgm in graph_module.modules():
            if not isinstance(subgm, GraphModule):
                continue
            for node in subgm.graph.nodes:
                # mem_id = 1 for placeholders and outputs of mul
                # mem_id = 2 for outputs of add
                # The parent class will copy the spec over to the alloc nodes
                if node.op == "placeholder":
                    node.meta["spec"].mem_id = 1
                    continue

                if node.op != "call_function":
                    continue

                if node.target == torch.ops.aten.add.out:
                    node.meta["spec"].mem_id = 2
                elif node.target == torch.ops.aten.mul.out:
                    node.meta["spec"].mem_id = 1

        return super().run(graph_module, graph_signature)
```

Then later when lowering to ExecuTorch you can use your custom plan in the following way:

```python
program = edge_program.to_executorch(
    exir.ExecutorchBackendConfig(
        memory_planning_pass=CustomPoolMemoryPlanningPass(
            memory_planning_algo=greedy,
        )
    )
)
```

Users attempting to write a custom memory planning algorithm should start by looking at [the greedy algorithm's implementation](https://github.com/pytorch/executorch/blob/d62c41ca86435e5316e7ed292b6d68aff27a2fb7/exir/memory_planning.py#L459C1-L459C12).

## Debugging Tool

Please refer to [Memory Planning Inspection](memory-planning-inspection.md) for a tool to inspect the result of memory planning.

---

# Concepts

This page provides an overview of key concepts and terms used throughout the ExecuTorch documentation. It is intended to help readers understand the terminology and concepts used in PyTorch Edge and ExecuTorch.

## Concepts Map

![](_static/img/concepts-map-overview.png)

## [AOT (Ahead of Time)](getting-started-architecture.md#program-preparation)

AOT generally refers to the program preparation that occurs before execution. On a high level, the ExecuTorch workflow is split into an AOT compilation and a runtime. The AOT steps involve compilation into an Intermediate Representation (IR), along with optional transformations and optimizations.

## [ATen](https://pytorch.org/cppdocs/#aten)

Fundamentally, it is a tensor library on top of which almost all other Python and C++ interfaces in PyTorch are built. It provides a core Tensor class, on which many hundreds of operations are defined.

## [ATen Dialect](ir-exir.md#aten-dialect)

ATen dialect is the immediate result of exporting an eager module to a graph representation. It is the entry point of the ExecuTorch compilation pipeline; after exporting to ATen dialect, subsequent passes can lower to [Core ATen dialect](concepts.md#core-aten-dialect) and [Edge dialect](concepts.md#edge-dialect). ATen dialect is a valid [EXIR](concepts.md#exir) with additional properties. It consists of functional ATen operators, higher order operators (like control flow operators) and registered custom operators.
The goal of ATen dialect is to capture users’ programs as faithfully as possible. ## ATen mode ATen mode uses the ATen implementation of Tensor (`at::Tensor`) and related types, such as `ScalarType`, from the PyTorch core. This is in contrast to ETensor mode, which uses ExecuTorch’s smaller implementation of tensor (`executorch::runtime::etensor::Tensor`) and related types, such as `executorch::runtime::etensor::ScalarType`. - ATen kernels that rely on the full `at::Tensor` API are usable in this configuration. - ATen kernels tend to do dynamic memory allocation and often have extra flexibility (and thus overhead) to handle cases not needed by mobile/embedded clients. e.g., CUDA support, sparse tensor support, and dtype promotion. - Note: ATen mode is currently a WIP. ## Autograd safe ATen Dialect Autograd safe ATen dialect includes only differentiable ATen operators, along with higher order operators (control flow ops) and registered custom operators. ## Backend A specific hardware (like GPU, NPU) or a software stack (like XNNPACK) that consumes a graph or part of it, with performance and efficiency benefits. ## [Backend Dialect](ir-exir.md#backend-dialect) Backend dialect is the immediate result of exporting Edge dialect to specific backend. It’s target-aware, and may contain operators or submodules that are only meaningful to the target backend. This dialect allows the introduction of target-specific operators that do not conform to the schema defined in the Core ATen Operator Set and are not shown in ATen or Edge Dialect. ## Backend registry A table mapping backend names to backend interfaces. This allows backends to be called via name during runtime. ## Backend Specific Operator These are operators that are not part of ATen dialect or Edge dialect. Backend specific operators are only introduced by passes that happen after Edge dialect (see Backend dialect). These operators are specific to the target backend and will generally execute faster. ## [Buck2](https://buck2.build/) An open-source, large scale build system. Used to build ExecuTorch. ## [CMake](https://cmake.org/) An open-source, cross-platform family of tools designed to build, test and package software. Used to build ExecuTorch. ## Codegen At a high level, codegen performs two tasks; generating the [kernel registration](kernel-library-custom-aten-kernel.md) library, and optionally running [selective build](#selective-build). The kernel registration library connects operator names (referenced in the model) with the corresponding kernel implementation (from the kernel library). The selective build API collects operator information from models and/or other sources and only includes the operators required by them. This can reduce the binary size. The output of codegen is a set of C++ bindings (various `.h`, `.cpp` files) that glue together the kernel library and the ExecuTorch runtime. ## [Core ATen Dialect](https://pytorch.org/docs/stable/torch.compiler_ir.html#irs) Core ATen dialect contains the core ATen operators along with higher order operators (control flow) and registered custom operators. ## [Core ATen operators / Canonical ATen operator set](ir-ops-set-definition.md) A select subset of the PyTorch ATen operator library. Core ATen operators will not be decomposed when exported with the core ATen decomposition table. They serve as a reference for the baseline ATen ops that a backend or compiler should expect from upstream. 
## Core ATen Decomposition Table Decomposing an operator means expressing it as a combination of other operators. During the AOT process, a default list of decompositions is employed, breaking down ATen operators into core ATen operators. This is referred to as the Core ATen Decomposition Table. ## [Custom operator](https://docs.google.com/document/d/1_W62p8WJOQQUzPsJYa7s701JXt0qf2OfLub2sbkHOaU/edit?fbclid=IwAR1qLTrChO4wRokhh_wHgdbX1SZwsU-DUv1XE2xFq0tIKsZSdDLAe6prTxg#heading=h.ahugy69p2jmz) These are operators that aren't part of the ATen library, but which appear in [eager mode](concepts.md#eager-mode). Registered custom operators are registered into the current PyTorch eager mode runtime, usually with a `TORCH_LIBRARY` call. They are most likely associated with a specific target model or hardware platform. For example, [torchvision::roi_align](https://pytorch.org/vision/main/generated/torchvision.ops.roi_align.html) is a custom operator widely used by torchvision (doesn't target a specific hardware). ## DataLoader An interface that enables the ExecuTorch runtime to read from a file or other data source without directly depending on operating system concepts like files or memory allocation. ## [Delegation](compiler-delegate-and-partitioner.md) To run parts (or all) of a program on a specific backend (eg. XNNPACK) while the rest of the program (if any) runs on the basic ExecuTorch runtime. Delegation enables us to leverage the performance and efficiency benefits of specialized backends and hardware. ## Dim Order ExecuTorch introduces `Dim Order` to describe tensor's memory format by returning a permutation of the dimensions, from the outermost to the innermost one. For example, for a tensor with memory format [N, C, H, W], or [contiguous](https://pytorch.org/blog/tensor-memory-format-matters/) memory format, [0, 1, 2, 3] will be its dim order. Also, for a tensor with memory format [N, H, W, C], or [channels_last memory format](https://pytorch.org/tutorials/intermediate/memory_format_tutorial.html), we return [0, 2, 3, 1] for its dim order. Currently ExecuTorch only supports dim order representation for [contiguous](https://pytorch.org/blog/tensor-memory-format-matters/) and [channels_last](https://pytorch.org/tutorials/intermediate/memory_format_tutorial.html) memory format. ## DSP (Digital Signal Processor) Specialized microprocessor chip with architecture optimized for digital signal processing. ## dtype Data type, the type of data (eg. float, integer, etc.) in a tensor. ## [Dynamic Quantization](https://pytorch.org/docs/main/quantization.html#general-quantization-flow) A method of quantizing wherein tensors are quantized on the fly during inference. This is in contrast to [static quantization](concepts.md#static-quantization), where tensors are quantized before inference. ## Dynamic shapes Refers to the ability of a model to accept inputs with varying shapes during inference. For example, the ATen op [unique_consecutive](https://pytorch.org/docs/stable/generated/torch.unique_consecutive.html) and the custom op [MaskRCNN](https://pytorch.org/vision/main/models/mask_rcnn.html) have data dependent output shapes. Such operators are difficult to do memory planning on, as each invocation may produce a different output shape even for the same input shape. To support dynamic shapes in ExecuTorch, kernels can allocate tensor data using the `MemoryAllocator` provided by the client. 
## Eager mode Python execution environment where operators in a model are immediately executed as they are encountered. e.g. Jupyter / Colab notebooks are run in eager mode. This is in contrast to graph mode, where operators are first synthesized into a graph which is then compiled and executed. ## [Edge Dialect](ir-exir.md#edge-dialect) A dialect of EXIR with the following properties: - All operators are from a predefined operator set, called 'Edge Operators' or are registered custom operators. - Input and output of the graph, and of each node, must be Tensor. All Scalar types are converted to Tensor. Edge dialect introduces specializations that are useful for Edge devices, but not necessarily for general (server) export. However, Edge dialect does not contain specializations for specific hardware besides those already present in the original Python program. ## Edge operator An ATen operator with a dtype specialization. ## [ExecuTorch](https://github.com/pytorch/executorch) A unified ML software stack within the PyTorch Edge platform designed for efficient on-device inference. ExecuTorch defines a workflow to prepare (export and transform) and execute a PyTorch program on Edge devices such as mobile, wearables, and embedded devices. ## ExecuTorch Method The executable equivalent of an `nn.Module` Python method. For example, the `forward()` Python method would compile into an ExecuTorch `Method`. ## ExecuTorch Program An ExecuTorch `Program` maps string names like `forward` to specific ExecuTorch `Method` entries. ## executor_runner A sample wrapper around the ExecuTorch runtime which includes all the operators and backends. ## [EXIR](ir-exir.md) The **EX**port **I**ntermediate **R**epresentation (IR) from `torch.export`. Contains the computational graph of the model. All EXIR graphs are valid [FX graphs](https://pytorch.org/docs/stable/fx.html#torch.fx.Graph). ## `ExportedProgram` The output of `torch.export` that bundles the computational graph of a PyTorch model (usually an `nn.Module`) with the parameters or weights that the model consumes. ## [flatbuffer](https://github.com/google/flatbuffers) Memory efficient, cross platform serialization library. In the context of ExecuTorch, eager mode Pytorch models are exported to flatbuffer, which is the format consumed by the ExecuTorch runtime. ## Framework tax The cost of various loading and initialization tasks (not inference). For example; loading a program, initializing executor, kernel and backend-delegate dispatch, and runtime memory utilization. ## Functional ATen operators ATen operators that do not have any side effects. ## [Graph](ir-exir.md) An EXIR Graph is a PyTorch program represented in the form of a DAG (directed acyclic graph). Each node in the graph represents a particular computation or operation, and edges of this graph consist of references between nodes. Note: all EXIR graphs are valid [FX graphs](https://pytorch.org/docs/stable/fx.html#torch.fx.Graph). ## Graph mode In graph mode, operators are first synthesized into a graph, which will then be compiled and executed as a whole. This is in contrast to eager mode, where operators are executed as they are encountered. Graph mode typically delivers higher performance as it allows optimizations such as operator fusion. ## Higher Order Operators A higher order operator (HOP) is an operator that: - either accepts a Python function as input, returns a Python function as output, or both. 
- like all PyTorch operators, higher-order operators also have an optional implementation for backends and functionalities. This lets us e.g. register an autograd formula for the higher-order operator or define how the higher-order operator behaves under ProxyTensor tracing. ## Hybrid Quantization A quantization technique where different parts of the model are quantized with different techniques based on computational complexity and sensitivity to accuracy loss. Some parts of the model may not be quantized to retain accuracy. ## Intermediate Representation (IR) A representation of a program between the source and target languages. Generally, it is a data structure used internally by a compiler or virtual machine to represent source code. ## Kernel An implementation of an operator. There can be multiple implementations of an operator for different backends/inputs/etc. ## Kernel registry / Operator registry A table with mappings between kernel names and their implementations. This allows the ExecuTorch runtime to resolve references to kernels during execution. ## Lowering The process of transforming a model to run on various backends. It is called 'lowering' as it is moving code closer to the hardware. In ExecuTorch, lowering is performed as part of backend delegation. ## [Memory planning](compiler-memory-planning.md) The process of allocating and managing memory for a model. In ExecuTorch, a memory planning pass is run before the graph is saved to flatbuffer. This assigns a memory ID to each tensor and an offset in the buffer, marking where storage for the tensor starts. ## [Node](ir-exir.md) A node in an EXIR graph represents a particular computation or operation, and is represented in Python using [torch.fx.Node](https://pytorch.org/docs/stable/fx.html#torch.fx.Node) class. ## Operator Function on tensors. This is the abstraction; kernels are the implementation. There can be varying implementations for different backends/inputs/etc. ## Operator fusion Operator fusion is the process of combining multiple operators into a single compound operator, resulting in faster computation due to fewer kernel launches and fewer memory read/writes. This is a performance advantage of graph mode vs eager mode. ## Out variant Instead of allocating returned tensors in kernel implementations, an operator's out variant will take in a pre-allocated tensor to its out kwarg, and store the result there. This makes it easier for memory planners to perform tensor lifetime analysis. In ExecuTorch, an out variant pass is performed before memory planning. ## [PAL (Platform Abstraction Layer)](runtime-platform-abstraction-layer.md) Provides a way for execution environments to override operations such as; - Getting the current time. - Printing a log statement. - Panicking the process/system. The default PAL implementation can be overridden if it doesn’t work for a particular client system. ## Partial kernels Kernels that support a subset of tensor dtypes and/or dim orders. ## [Partitioner](compiler-custom-compiler-passes.md#Partitioner) Parts of a model may be delegated to run on an optimized backend. The partitioner splits the graph into the appropriate sub-networks and tags them for delegation. ## ETensor mode ETensor mode uses ExecuTorch’s smaller implementation of tensor (`executorch::runtime::etensor::Tensor`) along with related types (`executorch::runtime::etensor::ScalarType`, etc.). 
This is in contrast to ATen mode, which uses the ATen implementation of Tensor (`at::Tensor`) and related types (`ScalarType`, etc.) - `executorch::runtime::etensor::Tensor`, also known as ETensor, is a source-compatible subset of `at::Tensor`. Code written against ETensor can build against `at::Tensor`. - ETensor does not own or allocate memory on its own. To support dynamic shapes, kernels can allocate Tensor data using the MemoryAllocator provided by the client. ## Portable kernels Portable kernels are operator implementations that are written to be compatible with ETensor. As ETensor is compatible with `at::Tensor`, portable kernels can be built against `at::Tensor` and used in the same model as ATen kernels. Portable kernels are: - Compatible with ATen operator signatures - Written in portable C++ so they can build for any target - Written as reference implementations, prioritizing clarity and simplicity over optimization - Generally smaller in size than ATen kernels - Written to avoid dynamically allocating memory using new/malloc. ## Program The set of codes and data to describe an ML model. ## Program source code The Python source code to describe the program. It can be a Python function, or a method in PyTorch’s eager mode `nn.Module`. ## [PTQ (Post Training Quantization)](https://pytorch.org/tutorials/prototype/pt2e_quant_ptq.html) A quantization technique where the model is quantized after it has been trained (usually for performance benefits). PTQ applies the quantization flow after training, in contrast to QAT which applies it during training. ## [QAT (Quantization Aware Training)](https://pytorch.org/tutorials/prototype/pt2e_quant_qat.html) Models may lose accuracy after quantization. QAT enables higher accuracy compared to eg. PTQ, by modeling the effects of quantization while training. During training, all weights and activations are ‘fake quantized’; float values are rounded to mimic int8 values, but all computations are still done with floating point numbers. Thus, all weight adjustments during training are made ‘aware’ that the model will ultimately be quantized. QAT applies the quantization flow during training, in contrast to PTQ which applies it afterwards. ## [Quantization](quantization-overview.md) Techniques for performing computations and memory accesses on tensors with lower precision data, usually `int8`. Quantization improves model performance by lowering the memory usage and (usually) decreasing computational latency; depending on the hardware, computation done in lower precision will typically be faster, e.g. `int8` matmul vs `fp32` matmul. Often, quantization comes at the cost of model accuracy. ## [Runtime](runtime-overview.md) The ExecuTorch runtime executes models on edge devices. It is responsible for program initialization, program execution and, optionally, destruction (releasing backend owned resources). ## [Developer Tools](devtools-overview.md) A collection of tools users need to profile, debug and visualize programs that are running with ExecuTorch. ## [Selective build](kernel-library-selective-build.md) An API used to build a leaner runtime by linking only to kernels used by the program. This provides significant binary size savings. ## [Static Quantization](https://pytorch.org/docs/main/quantization.html#general-quantization-flow) A method of quantizing wherein tensors are statically quantized. That is, floats are converted to a reduced-precision data type before inference. 
## [XNNPACK](https://github.com/google/XNNPACK)

An optimized library of neural network inference operators for ARM, x86, WebAssembly, and RISC-V platforms. It is an open-source project and used by PyTorch and ExecuTorch. It is a successor to the QNNPACK library. The operators support both floating point and quantized values.

---

# Contributing to ExecuTorch

Thank you for your interest in contributing to ExecuTorch! We want to make it easy to contribute to this project.

The source is hosted on GitHub at [pytorch/executorch](https://github.com/pytorch/executorch). See [CONTRIBUTING.md](https://github.com/pytorch/executorch/blob/main/CONTRIBUTING.md) for detailed information about contributing to ExecuTorch.

Join us on [Discord](https://discord.com/invite/Dh43CKSAdc)!

---

# Debugging Delegation

We provide a list of utility functions to give users insight into what happened to the graph modules during the `to_backend()` stage.

## Get delegation summary

The `get_delegation_info()` method provides a summary of what happened to the model after the `to_backend()` call:

```python
import torch
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
from executorch.exir import to_edge_transform_and_lower
from torch.export import Dim, export
from torchvision.models.mobilenetv2 import MobileNet_V2_Weights
import torchvision.models as models

# Dependency needed for debugging delegates
from executorch.devtools.backend_debug import get_delegation_info
from tabulate import tabulate

model = models.mobilenetv2.mobilenet_v2(weights=MobileNet_V2_Weights.DEFAULT).eval()
sample_inputs = (torch.randn(1, 3, 224, 224), )

et_program = to_edge_transform_and_lower(
    torch.export.export(model, sample_inputs),
    partitioner=[XnnpackPartitioner()]
)

graph_module = et_program.exported_program().graph_module
delegation_info = get_delegation_info(graph_module)
# print the summary like the number of delegated nodes, non-delegated nodes, etc
print(delegation_info.get_summary())
df = delegation_info.get_operator_delegation_dataframe()
# print the table including op_type, occurrences_in_delegated_graphs, occurrences_in_non_delegated_graphs
print(tabulate(df, headers="keys", tablefmt="fancy_grid"))
```

Example printout:

```
Total delegated subgraphs: 2
Number of delegated nodes: 203
Number of non-delegated nodes: 4
```

|    | op_type                                            | occurrences_in_delegated_graphs | occurrences_in_non_delegated_graphs |
|----|----------------------------------------------------|---------------------------------|-------------------------------------|
| 0  | aten__native_batch_norm_legit_no_training_default | 52                              | 0                                   |
| 1  | aten_add_tensor                                    | 10                              | 0                                   |
| 2  | aten_convolution_default                           | 52                              | 0                                   |
| 3  | aten_hardtanh_default                              | 35                              | 0                                   |
| 4  | aten_linear_default                                | 1                               | 0                                   |
| 5  | aten_mean_dim                                      | 1                               | 0                                   |
| 6  | aten_view_copy_default                             | 0                               | 1                                   |
| 7  | dim_order_ops__clone_dim_order_default             | 0                               | 1                                   |
| 8  | getitem                                            | 52                              | 2                                   |
| 9  | **Total**                                          | **203**                         | **4**                               |

From the table, the operator `aten_view_copy_default` appears 0 times in delegated graphs and 1 time in non-delegated graphs. Users can use information like this to debug. The `getitem` node is a special case: it retrieves an output from the delegated subgraph.
## Visualize delegated graph

To see a more detailed view, use the `format_delegated_graph()` method to get a string representation of the entire graph, or use `print_delegated_graph()` to print it directly:

```python
from executorch.exir.backend.utils import format_delegated_graph

graph_module = et_program.exported_program().graph_module
print(format_delegated_graph(graph_module))  # or call print_delegated_graph(graph_module)
```

It will print the whole model as well as the subgraph consumed by the backend. The generic debug functions provided by FX, like `print_tabular()` or `print_readable()`, will only show `call_delegate` and hide the subgraph consumed by the backend, while this function exposes the contents inside the subgraph.

In the example printout below, observe that there are two subgraphs: `aten_view_copy_default` is not delegated, while most of the other ops are delegated.
``` graph(): %b_features_0_1_num_batches_tracked : [num_users=0] = placeholder[target=b_features_0_1_num_batches_tracked] %b_features_1_conv_0_1_num_batches_tracked : [num_users=0] = placeholder[target=b_features_1_conv_0_1_num_batches_tracked] %b_features_1_conv_2_num_batches_tracked : [num_users=0] = placeholder[target=b_features_1_conv_2_num_batches_tracked] %b_features_2_conv_0_1_num_batches_tracked : [num_users=0] = placeholder[target=b_features_2_conv_0_1_num_batches_tracked] %b_features_2_conv_1_1_num_batches_tracked : [num_users=0] = placeholder[target=b_features_2_conv_1_1_num_batches_tracked] %b_features_2_conv_3_num_batches_tracked : [num_users=0] = placeholder[target=b_features_2_conv_3_num_batches_tracked] %b_features_3_conv_0_1_num_batches_tracked : [num_users=0] = placeholder[target=b_features_3_conv_0_1_num_batches_tracked] %b_features_3_conv_1_1_num_batches_tracked : [num_users=0] = placeholder[target=b_features_3_conv_1_1_num_batches_tracked] %b_features_3_conv_3_num_batches_tracked : [num_users=0] = placeholder[target=b_features_3_conv_3_num_batches_tracked] %b_features_4_conv_0_1_num_batches_tracked : [num_users=0] = placeholder[target=b_features_4_conv_0_1_num_batches_tracked] %b_features_4_conv_1_1_num_batches_tracked : [num_users=0] = placeholder[target=b_features_4_conv_1_1_num_batches_tracked] %b_features_4_conv_3_num_batches_tracked : [num_users=0] = placeholder[target=b_features_4_conv_3_num_batches_tracked] %b_features_5_conv_0_1_num_batches_tracked : [num_users=0] = placeholder[target=b_features_5_conv_0_1_num_batches_tracked] %b_features_5_conv_1_1_num_batches_tracked : [num_users=0] = placeholder[target=b_features_5_conv_1_1_num_batches_tracked] %b_features_5_conv_3_num_batches_tracked : [num_users=0] = placeholder[target=b_features_5_conv_3_num_batches_tracked] %b_features_6_conv_0_1_num_batches_tracked : [num_users=0] = placeholder[target=b_features_6_conv_0_1_num_batches_tracked] %b_features_6_conv_1_1_num_batches_tracked : [num_users=0] = placeholder[target=b_features_6_conv_1_1_num_batches_tracked] %b_features_6_conv_3_num_batches_tracked : [num_users=0] = placeholder[target=b_features_6_conv_3_num_batches_tracked] %b_features_7_conv_0_1_num_batches_tracked : [num_users=0] = placeholder[target=b_features_7_conv_0_1_num_batches_tracked] %b_features_7_conv_1_1_num_batches_tracked : [num_users=0] = placeholder[target=b_features_7_conv_1_1_num_batches_tracked] %b_features_7_conv_3_num_batches_tracked : [num_users=0] = placeholder[target=b_features_7_conv_3_num_batches_tracked] %b_features_8_conv_0_1_num_batches_tracked : [num_users=0] = placeholder[target=b_features_8_conv_0_1_num_batches_tracked] %b_features_8_conv_1_1_num_batches_tracked : [num_users=0] = placeholder[target=b_features_8_conv_1_1_num_batches_tracked] %b_features_8_conv_3_num_batches_tracked : [num_users=0] = placeholder[target=b_features_8_conv_3_num_batches_tracked] %b_features_9_conv_0_1_num_batches_tracked : [num_users=0] = placeholder[target=b_features_9_conv_0_1_num_batches_tracked] %b_features_9_conv_1_1_num_batches_tracked : [num_users=0] = placeholder[target=b_features_9_conv_1_1_num_batches_tracked] %b_features_9_conv_3_num_batches_tracked : [num_users=0] = placeholder[target=b_features_9_conv_3_num_batches_tracked] %b_features_10_conv_0_1_num_batches_tracked : [num_users=0] = placeholder[target=b_features_10_conv_0_1_num_batches_tracked] %b_features_10_conv_1_1_num_batches_tracked : [num_users=0] = placeholder[target=b_features_10_conv_1_1_num_batches_tracked] 
%b_features_10_conv_3_num_batches_tracked : [num_users=0] = placeholder[target=b_features_10_conv_3_num_batches_tracked] %b_features_11_conv_0_1_num_batches_tracked : [num_users=0] = placeholder[target=b_features_11_conv_0_1_num_batches_tracked] %b_features_11_conv_1_1_num_batches_tracked : [num_users=0] = placeholder[target=b_features_11_conv_1_1_num_batches_tracked] %b_features_11_conv_3_num_batches_tracked : [num_users=0] = placeholder[target=b_features_11_conv_3_num_batches_tracked] %b_features_12_conv_0_1_num_batches_tracked : [num_users=0] = placeholder[target=b_features_12_conv_0_1_num_batches_tracked] %b_features_12_conv_1_1_num_batches_tracked : [num_users=0] = placeholder[target=b_features_12_conv_1_1_num_batches_tracked] %b_features_12_conv_3_num_batches_tracked : [num_users=0] = placeholder[target=b_features_12_conv_3_num_batches_tracked] %b_features_13_conv_0_1_num_batches_tracked : [num_users=0] = placeholder[target=b_features_13_conv_0_1_num_batches_tracked] %b_features_13_conv_1_1_num_batches_tracked : [num_users=0] = placeholder[target=b_features_13_conv_1_1_num_batches_tracked] %b_features_13_conv_3_num_batches_tracked : [num_users=0] = placeholder[target=b_features_13_conv_3_num_batches_tracked] %b_features_14_conv_0_1_num_batches_tracked : [num_users=0] = placeholder[target=b_features_14_conv_0_1_num_batches_tracked] %b_features_14_conv_1_1_num_batches_tracked : [num_users=0] = placeholder[target=b_features_14_conv_1_1_num_batches_tracked] %b_features_14_conv_3_num_batches_tracked : [num_users=0] = placeholder[target=b_features_14_conv_3_num_batches_tracked] %b_features_15_conv_0_1_num_batches_tracked : [num_users=0] = placeholder[target=b_features_15_conv_0_1_num_batches_tracked] %b_features_15_conv_1_1_num_batches_tracked : [num_users=0] = placeholder[target=b_features_15_conv_1_1_num_batches_tracked] %b_features_15_conv_3_num_batches_tracked : [num_users=0] = placeholder[target=b_features_15_conv_3_num_batches_tracked] %b_features_16_conv_0_1_num_batches_tracked : [num_users=0] = placeholder[target=b_features_16_conv_0_1_num_batches_tracked] %b_features_16_conv_1_1_num_batches_tracked : [num_users=0] = placeholder[target=b_features_16_conv_1_1_num_batches_tracked] %b_features_16_conv_3_num_batches_tracked : [num_users=0] = placeholder[target=b_features_16_conv_3_num_batches_tracked] %b_features_17_conv_0_1_num_batches_tracked : [num_users=0] = placeholder[target=b_features_17_conv_0_1_num_batches_tracked] %b_features_17_conv_1_1_num_batches_tracked : [num_users=0] = placeholder[target=b_features_17_conv_1_1_num_batches_tracked] %b_features_17_conv_3_num_batches_tracked : [num_users=0] = placeholder[target=b_features_17_conv_3_num_batches_tracked] %b_features_18_1_num_batches_tracked : [num_users=0] = placeholder[target=b_features_18_1_num_batches_tracked] %x : [num_users=1] = placeholder[target=x] %lowered_module_0 : [num_users=1] = get_attr[target=lowered_module_0] backend_id: XnnpackBackend lowered graph(): %p_features_0_0_weight : [num_users=1] = placeholder[target=p_features_0_0_weight] %p_features_0_1_weight : [num_users=1] = placeholder[target=p_features_0_1_weight] %p_features_0_1_bias : [num_users=1] = placeholder[target=p_features_0_1_bias] %p_features_1_conv_0_0_weight : [num_users=1] = placeholder[target=p_features_1_conv_0_0_weight] %p_features_1_conv_0_1_weight : [num_users=1] = placeholder[target=p_features_1_conv_0_1_weight] %p_features_1_conv_0_1_bias : [num_users=1] = placeholder[target=p_features_1_conv_0_1_bias] %p_features_1_conv_1_weight : 
[num_users=1] = placeholder[target=p_features_1_conv_1_weight] %p_features_1_conv_2_weight : [num_users=1] = placeholder[target=p_features_1_conv_2_weight] %p_features_1_conv_2_bias : [num_users=1] = placeholder[target=p_features_1_conv_2_bias] %p_features_2_conv_0_0_weight : [num_users=1] = placeholder[target=p_features_2_conv_0_0_weight] %p_features_2_conv_0_1_weight : [num_users=1] = placeholder[target=p_features_2_conv_0_1_weight] %p_features_2_conv_0_1_bias : [num_users=1] = placeholder[target=p_features_2_conv_0_1_bias] %p_features_2_conv_1_0_weight : [num_users=1] = placeholder[target=p_features_2_conv_1_0_weight] %p_features_2_conv_1_1_weight : [num_users=1] = placeholder[target=p_features_2_conv_1_1_weight] %p_features_2_conv_1_1_bias : [num_users=1] = placeholder[target=p_features_2_conv_1_1_bias] %p_features_2_conv_2_weight : [num_users=1] = placeholder[target=p_features_2_conv_2_weight] %p_features_2_conv_3_weight : [num_users=1] = placeholder[target=p_features_2_conv_3_weight] %p_features_2_conv_3_bias : [num_users=1] = placeholder[target=p_features_2_conv_3_bias] %p_features_3_conv_0_0_weight : [num_users=1] = placeholder[target=p_features_3_conv_0_0_weight] %p_features_3_conv_0_1_weight : [num_users=1] = placeholder[target=p_features_3_conv_0_1_weight] %p_features_3_conv_0_1_bias : [num_users=1] = placeholder[target=p_features_3_conv_0_1_bias] %p_features_3_conv_1_0_weight : [num_users=1] = placeholder[target=p_features_3_conv_1_0_weight] %p_features_3_conv_1_1_weight : [num_users=1] = placeholder[target=p_features_3_conv_1_1_weight] %p_features_3_conv_1_1_bias : [num_users=1] = placeholder[target=p_features_3_conv_1_1_bias] %p_features_3_conv_2_weight : [num_users=1] = placeholder[target=p_features_3_conv_2_weight] %p_features_3_conv_3_weight : [num_users=1] = placeholder[target=p_features_3_conv_3_weight] %p_features_3_conv_3_bias : [num_users=1] = placeholder[target=p_features_3_conv_3_bias] %p_features_4_conv_0_0_weight : [num_users=1] = placeholder[target=p_features_4_conv_0_0_weight] %p_features_4_conv_0_1_weight : [num_users=1] = placeholder[target=p_features_4_conv_0_1_weight] %p_features_4_conv_0_1_bias : [num_users=1] = placeholder[target=p_features_4_conv_0_1_bias] %p_features_4_conv_1_0_weight : [num_users=1] = placeholder[target=p_features_4_conv_1_0_weight] %p_features_4_conv_1_1_weight : [num_users=1] = placeholder[target=p_features_4_conv_1_1_weight] %p_features_4_conv_1_1_bias : [num_users=1] = placeholder[target=p_features_4_conv_1_1_bias] %p_features_4_conv_2_weight : [num_users=1] = placeholder[target=p_features_4_conv_2_weight] %p_features_4_conv_3_weight : [num_users=1] = placeholder[target=p_features_4_conv_3_weight] %p_features_4_conv_3_bias : [num_users=1] = placeholder[target=p_features_4_conv_3_bias] %p_features_5_conv_0_0_weight : [num_users=1] = placeholder[target=p_features_5_conv_0_0_weight] %p_features_5_conv_0_1_weight : [num_users=1] = placeholder[target=p_features_5_conv_0_1_weight] %p_features_5_conv_0_1_bias : [num_users=1] = placeholder[target=p_features_5_conv_0_1_bias] %p_features_5_conv_1_0_weight : [num_users=1] = placeholder[target=p_features_5_conv_1_0_weight] %p_features_5_conv_1_1_weight : [num_users=1] = placeholder[target=p_features_5_conv_1_1_weight] %p_features_5_conv_1_1_bias : [num_users=1] = placeholder[target=p_features_5_conv_1_1_bias] %p_features_5_conv_2_weight : [num_users=1] = placeholder[target=p_features_5_conv_2_weight] %p_features_5_conv_3_weight : [num_users=1] = placeholder[target=p_features_5_conv_3_weight] 
%p_features_5_conv_3_bias : [num_users=1] = placeholder[target=p_features_5_conv_3_bias] %p_features_6_conv_0_0_weight : [num_users=1] = placeholder[target=p_features_6_conv_0_0_weight] %p_features_6_conv_0_1_weight : [num_users=1] = placeholder[target=p_features_6_conv_0_1_weight] %p_features_6_conv_0_1_bias : [num_users=1] = placeholder[target=p_features_6_conv_0_1_bias] %p_features_6_conv_1_0_weight : [num_users=1] = placeholder[target=p_features_6_conv_1_0_weight] %p_features_6_conv_1_1_weight : [num_users=1] = placeholder[target=p_features_6_conv_1_1_weight] %p_features_6_conv_1_1_bias : [num_users=1] = placeholder[target=p_features_6_conv_1_1_bias] %p_features_6_conv_2_weight : [num_users=1] = placeholder[target=p_features_6_conv_2_weight] %p_features_6_conv_3_weight : [num_users=1] = placeholder[target=p_features_6_conv_3_weight] %p_features_6_conv_3_bias : [num_users=1] = placeholder[target=p_features_6_conv_3_bias] %p_features_7_conv_0_0_weight : [num_users=1] = placeholder[target=p_features_7_conv_0_0_weight] %p_features_7_conv_0_1_weight : [num_users=1] = placeholder[target=p_features_7_conv_0_1_weight] %p_features_7_conv_0_1_bias : [num_users=1] = placeholder[target=p_features_7_conv_0_1_bias] %p_features_7_conv_1_0_weight : [num_users=1] = placeholder[target=p_features_7_conv_1_0_weight] %p_features_7_conv_1_1_weight : [num_users=1] = placeholder[target=p_features_7_conv_1_1_weight] %p_features_7_conv_1_1_bias : [num_users=1] = placeholder[target=p_features_7_conv_1_1_bias] %p_features_7_conv_2_weight : [num_users=1] = placeholder[target=p_features_7_conv_2_weight] %p_features_7_conv_3_weight : [num_users=1] = placeholder[target=p_features_7_conv_3_weight] %p_features_7_conv_3_bias : [num_users=1] = placeholder[target=p_features_7_conv_3_bias] %p_features_8_conv_0_0_weight : [num_users=1] = placeholder[target=p_features_8_conv_0_0_weight] %p_features_8_conv_0_1_weight : [num_users=1] = placeholder[target=p_features_8_conv_0_1_weight] %p_features_8_conv_0_1_bias : [num_users=1] = placeholder[target=p_features_8_conv_0_1_bias] %p_features_8_conv_1_0_weight : [num_users=1] = placeholder[target=p_features_8_conv_1_0_weight] %p_features_8_conv_1_1_weight : [num_users=1] = placeholder[target=p_features_8_conv_1_1_weight] %p_features_8_conv_1_1_bias : [num_users=1] = placeholder[target=p_features_8_conv_1_1_bias] %p_features_8_conv_2_weight : [num_users=1] = placeholder[target=p_features_8_conv_2_weight] %p_features_8_conv_3_weight : [num_users=1] = placeholder[target=p_features_8_conv_3_weight] %p_features_8_conv_3_bias : [num_users=1] = placeholder[target=p_features_8_conv_3_bias] %p_features_9_conv_0_0_weight : [num_users=1] = placeholder[target=p_features_9_conv_0_0_weight] %p_features_9_conv_0_1_weight : [num_users=1] = placeholder[target=p_features_9_conv_0_1_weight] %p_features_9_conv_0_1_bias : [num_users=1] = placeholder[target=p_features_9_conv_0_1_bias] %p_features_9_conv_1_0_weight : [num_users=1] = placeholder[target=p_features_9_conv_1_0_weight] %p_features_9_conv_1_1_weight : [num_users=1] = placeholder[target=p_features_9_conv_1_1_weight] %p_features_9_conv_1_1_bias : [num_users=1] = placeholder[target=p_features_9_conv_1_1_bias] %p_features_9_conv_2_weight : [num_users=1] = placeholder[target=p_features_9_conv_2_weight] %p_features_9_conv_3_weight : [num_users=1] = placeholder[target=p_features_9_conv_3_weight] %p_features_9_conv_3_bias : [num_users=1] = placeholder[target=p_features_9_conv_3_bias] %p_features_10_conv_0_0_weight : [num_users=1] = 
placeholder[target=p_features_10_conv_0_0_weight] %p_features_10_conv_0_1_weight : [num_users=1] = placeholder[target=p_features_10_conv_0_1_weight] %p_features_10_conv_0_1_bias : [num_users=1] = placeholder[target=p_features_10_conv_0_1_bias] %p_features_10_conv_1_0_weight : [num_users=1] = placeholder[target=p_features_10_conv_1_0_weight] %p_features_10_conv_1_1_weight : [num_users=1] = placeholder[target=p_features_10_conv_1_1_weight] %p_features_10_conv_1_1_bias : [num_users=1] = placeholder[target=p_features_10_conv_1_1_bias] %p_features_10_conv_2_weight : [num_users=1] = placeholder[target=p_features_10_conv_2_weight] %p_features_10_conv_3_weight : [num_users=1] = placeholder[target=p_features_10_conv_3_weight] %p_features_10_conv_3_bias : [num_users=1] = placeholder[target=p_features_10_conv_3_bias] %p_features_11_conv_0_0_weight : [num_users=1] = placeholder[target=p_features_11_conv_0_0_weight] %p_features_11_conv_0_1_weight : [num_users=1] = placeholder[target=p_features_11_conv_0_1_weight] %p_features_11_conv_0_1_bias : [num_users=1] = placeholder[target=p_features_11_conv_0_1_bias] %p_features_11_conv_1_0_weight : [num_users=1] = placeholder[target=p_features_11_conv_1_0_weight] %p_features_11_conv_1_1_weight : [num_users=1] = placeholder[target=p_features_11_conv_1_1_weight] %p_features_11_conv_1_1_bias : [num_users=1] = placeholder[target=p_features_11_conv_1_1_bias] %p_features_11_conv_2_weight : [num_users=1] = placeholder[target=p_features_11_conv_2_weight] %p_features_11_conv_3_weight : [num_users=1] = placeholder[target=p_features_11_conv_3_weight] %p_features_11_conv_3_bias : [num_users=1] = placeholder[target=p_features_11_conv_3_bias] %p_features_12_conv_0_0_weight : [num_users=1] = placeholder[target=p_features_12_conv_0_0_weight] %p_features_12_conv_0_1_weight : [num_users=1] = placeholder[target=p_features_12_conv_0_1_weight] %p_features_12_conv_0_1_bias : [num_users=1] = placeholder[target=p_features_12_conv_0_1_bias] %p_features_12_conv_1_0_weight : [num_users=1] = placeholder[target=p_features_12_conv_1_0_weight] %p_features_12_conv_1_1_weight : [num_users=1] = placeholder[target=p_features_12_conv_1_1_weight] %p_features_12_conv_1_1_bias : [num_users=1] = placeholder[target=p_features_12_conv_1_1_bias] %p_features_12_conv_2_weight : [num_users=1] = placeholder[target=p_features_12_conv_2_weight] %p_features_12_conv_3_weight : [num_users=1] = placeholder[target=p_features_12_conv_3_weight] %p_features_12_conv_3_bias : [num_users=1] = placeholder[target=p_features_12_conv_3_bias] %p_features_13_conv_0_0_weight : [num_users=1] = placeholder[target=p_features_13_conv_0_0_weight] %p_features_13_conv_0_1_weight : [num_users=1] = placeholder[target=p_features_13_conv_0_1_weight] %p_features_13_conv_0_1_bias : [num_users=1] = placeholder[target=p_features_13_conv_0_1_bias] %p_features_13_conv_1_0_weight : [num_users=1] = placeholder[target=p_features_13_conv_1_0_weight] %p_features_13_conv_1_1_weight : [num_users=1] = placeholder[target=p_features_13_conv_1_1_weight] %p_features_13_conv_1_1_bias : [num_users=1] = placeholder[target=p_features_13_conv_1_1_bias] %p_features_13_conv_2_weight : [num_users=1] = placeholder[target=p_features_13_conv_2_weight] %p_features_13_conv_3_weight : [num_users=1] = placeholder[target=p_features_13_conv_3_weight] %p_features_13_conv_3_bias : [num_users=1] = placeholder[target=p_features_13_conv_3_bias] %p_features_14_conv_0_0_weight : [num_users=1] = placeholder[target=p_features_14_conv_0_0_weight] %p_features_14_conv_0_1_weight : 
[num_users=1] = placeholder[target=p_features_14_conv_0_1_weight] %p_features_14_conv_0_1_bias : [num_users=1] = placeholder[target=p_features_14_conv_0_1_bias] %p_features_14_conv_1_0_weight : [num_users=1] = placeholder[target=p_features_14_conv_1_0_weight] %p_features_14_conv_1_1_weight : [num_users=1] = placeholder[target=p_features_14_conv_1_1_weight] %p_features_14_conv_1_1_bias : [num_users=1] = placeholder[target=p_features_14_conv_1_1_bias] %p_features_14_conv_2_weight : [num_users=1] = placeholder[target=p_features_14_conv_2_weight] %p_features_14_conv_3_weight : [num_users=1] = placeholder[target=p_features_14_conv_3_weight] %p_features_14_conv_3_bias : [num_users=1] = placeholder[target=p_features_14_conv_3_bias] %p_features_15_conv_0_0_weight : [num_users=1] = placeholder[target=p_features_15_conv_0_0_weight] %p_features_15_conv_0_1_weight : [num_users=1] = placeholder[target=p_features_15_conv_0_1_weight] %p_features_15_conv_0_1_bias : [num_users=1] = placeholder[target=p_features_15_conv_0_1_bias] %p_features_15_conv_1_0_weight : [num_users=1] = placeholder[target=p_features_15_conv_1_0_weight] %p_features_15_conv_1_1_weight : [num_users=1] = placeholder[target=p_features_15_conv_1_1_weight] %p_features_15_conv_1_1_bias : [num_users=1] = placeholder[target=p_features_15_conv_1_1_bias] %p_features_15_conv_2_weight : [num_users=1] = placeholder[target=p_features_15_conv_2_weight] %p_features_15_conv_3_weight : [num_users=1] = placeholder[target=p_features_15_conv_3_weight] %p_features_15_conv_3_bias : [num_users=1] = placeholder[target=p_features_15_conv_3_bias] %p_features_16_conv_0_0_weight : [num_users=1] = placeholder[target=p_features_16_conv_0_0_weight] %p_features_16_conv_0_1_weight : [num_users=1] = placeholder[target=p_features_16_conv_0_1_weight] %p_features_16_conv_0_1_bias : [num_users=1] = placeholder[target=p_features_16_conv_0_1_bias] %p_features_16_conv_1_0_weight : [num_users=1] = placeholder[target=p_features_16_conv_1_0_weight] %p_features_16_conv_1_1_weight : [num_users=1] = placeholder[target=p_features_16_conv_1_1_weight] %p_features_16_conv_1_1_bias : [num_users=1] = placeholder[target=p_features_16_conv_1_1_bias] %p_features_16_conv_2_weight : [num_users=1] = placeholder[target=p_features_16_conv_2_weight] %p_features_16_conv_3_weight : [num_users=1] = placeholder[target=p_features_16_conv_3_weight] %p_features_16_conv_3_bias : [num_users=1] = placeholder[target=p_features_16_conv_3_bias] %p_features_17_conv_0_0_weight : [num_users=1] = placeholder[target=p_features_17_conv_0_0_weight] %p_features_17_conv_0_1_weight : [num_users=1] = placeholder[target=p_features_17_conv_0_1_weight] %p_features_17_conv_0_1_bias : [num_users=1] = placeholder[target=p_features_17_conv_0_1_bias] %p_features_17_conv_1_0_weight : [num_users=1] = placeholder[target=p_features_17_conv_1_0_weight] %p_features_17_conv_1_1_weight : [num_users=1] = placeholder[target=p_features_17_conv_1_1_weight] %p_features_17_conv_1_1_bias : [num_users=1] = placeholder[target=p_features_17_conv_1_1_bias] %p_features_17_conv_2_weight : [num_users=1] = placeholder[target=p_features_17_conv_2_weight] %p_features_17_conv_3_weight : [num_users=1] = placeholder[target=p_features_17_conv_3_weight] %p_features_17_conv_3_bias : [num_users=1] = placeholder[target=p_features_17_conv_3_bias] %p_features_18_0_weight : [num_users=1] = placeholder[target=p_features_18_0_weight] %p_features_18_1_weight : [num_users=1] = placeholder[target=p_features_18_1_weight] %p_features_18_1_bias : [num_users=1] = 
placeholder[target=p_features_18_1_bias] %b_features_0_1_running_mean : [num_users=1] = placeholder[target=b_features_0_1_running_mean] %b_features_0_1_running_var : [num_users=1] = placeholder[target=b_features_0_1_running_var] %b_features_1_conv_0_1_running_mean : [num_users=1] = placeholder[target=b_features_1_conv_0_1_running_mean] %b_features_1_conv_0_1_running_var : [num_users=1] = placeholder[target=b_features_1_conv_0_1_running_var] %b_features_1_conv_2_running_mean : [num_users=1] = placeholder[target=b_features_1_conv_2_running_mean] %b_features_1_conv_2_running_var : [num_users=1] = placeholder[target=b_features_1_conv_2_running_var] %b_features_2_conv_0_1_running_mean : [num_users=1] = placeholder[target=b_features_2_conv_0_1_running_mean] %b_features_2_conv_0_1_running_var : [num_users=1] = placeholder[target=b_features_2_conv_0_1_running_var] %b_features_2_conv_1_1_running_mean : [num_users=1] = placeholder[target=b_features_2_conv_1_1_running_mean] %b_features_2_conv_1_1_running_var : [num_users=1] = placeholder[target=b_features_2_conv_1_1_running_var] %b_features_2_conv_3_running_mean : [num_users=1] = placeholder[target=b_features_2_conv_3_running_mean] %b_features_2_conv_3_running_var : [num_users=1] = placeholder[target=b_features_2_conv_3_running_var] %b_features_3_conv_0_1_running_mean : [num_users=1] = placeholder[target=b_features_3_conv_0_1_running_mean] %b_features_3_conv_0_1_running_var : [num_users=1] = placeholder[target=b_features_3_conv_0_1_running_var] %b_features_3_conv_1_1_running_mean : [num_users=1] = placeholder[target=b_features_3_conv_1_1_running_mean] %b_features_3_conv_1_1_running_var : [num_users=1] = placeholder[target=b_features_3_conv_1_1_running_var] %b_features_3_conv_3_running_mean : [num_users=1] = placeholder[target=b_features_3_conv_3_running_mean] %b_features_3_conv_3_running_var : [num_users=1] = placeholder[target=b_features_3_conv_3_running_var] %b_features_4_conv_0_1_running_mean : [num_users=1] = placeholder[target=b_features_4_conv_0_1_running_mean] %b_features_4_conv_0_1_running_var : [num_users=1] = placeholder[target=b_features_4_conv_0_1_running_var] %b_features_4_conv_1_1_running_mean : [num_users=1] = placeholder[target=b_features_4_conv_1_1_running_mean] %b_features_4_conv_1_1_running_var : [num_users=1] = placeholder[target=b_features_4_conv_1_1_running_var] %b_features_4_conv_3_running_mean : [num_users=1] = placeholder[target=b_features_4_conv_3_running_mean] %b_features_4_conv_3_running_var : [num_users=1] = placeholder[target=b_features_4_conv_3_running_var] %b_features_5_conv_0_1_running_mean : [num_users=1] = placeholder[target=b_features_5_conv_0_1_running_mean] %b_features_5_conv_0_1_running_var : [num_users=1] = placeholder[target=b_features_5_conv_0_1_running_var] %b_features_5_conv_1_1_running_mean : [num_users=1] = placeholder[target=b_features_5_conv_1_1_running_mean] %b_features_5_conv_1_1_running_var : [num_users=1] = placeholder[target=b_features_5_conv_1_1_running_var] %b_features_5_conv_3_running_mean : [num_users=1] = placeholder[target=b_features_5_conv_3_running_mean] %b_features_5_conv_3_running_var : [num_users=1] = placeholder[target=b_features_5_conv_3_running_var] %b_features_6_conv_0_1_running_mean : [num_users=1] = placeholder[target=b_features_6_conv_0_1_running_mean] %b_features_6_conv_0_1_running_var : [num_users=1] = placeholder[target=b_features_6_conv_0_1_running_var] %b_features_6_conv_1_1_running_mean : [num_users=1] = placeholder[target=b_features_6_conv_1_1_running_mean] 
%b_features_6_conv_1_1_running_var : [num_users=1] = placeholder[target=b_features_6_conv_1_1_running_var] %b_features_6_conv_3_running_mean : [num_users=1] = placeholder[target=b_features_6_conv_3_running_mean] %b_features_6_conv_3_running_var : [num_users=1] = placeholder[target=b_features_6_conv_3_running_var] %b_features_7_conv_0_1_running_mean : [num_users=1] = placeholder[target=b_features_7_conv_0_1_running_mean] %b_features_7_conv_0_1_running_var : [num_users=1] = placeholder[target=b_features_7_conv_0_1_running_var] %b_features_7_conv_1_1_running_mean : [num_users=1] = placeholder[target=b_features_7_conv_1_1_running_mean] %b_features_7_conv_1_1_running_var : [num_users=1] = placeholder[target=b_features_7_conv_1_1_running_var] %b_features_7_conv_3_running_mean : [num_users=1] = placeholder[target=b_features_7_conv_3_running_mean] %b_features_7_conv_3_running_var : [num_users=1] = placeholder[target=b_features_7_conv_3_running_var] %b_features_8_conv_0_1_running_mean : [num_users=1] = placeholder[target=b_features_8_conv_0_1_running_mean] %b_features_8_conv_0_1_running_var : [num_users=1] = placeholder[target=b_features_8_conv_0_1_running_var] %b_features_8_conv_1_1_running_mean : [num_users=1] = placeholder[target=b_features_8_conv_1_1_running_mean] %b_features_8_conv_1_1_running_var : [num_users=1] = placeholder[target=b_features_8_conv_1_1_running_var] %b_features_8_conv_3_running_mean : [num_users=1] = placeholder[target=b_features_8_conv_3_running_mean] %b_features_8_conv_3_running_var : [num_users=1] = placeholder[target=b_features_8_conv_3_running_var] %b_features_9_conv_0_1_running_mean : [num_users=1] = placeholder[target=b_features_9_conv_0_1_running_mean] %b_features_9_conv_0_1_running_var : [num_users=1] = placeholder[target=b_features_9_conv_0_1_running_var] %b_features_9_conv_1_1_running_mean : [num_users=1] = placeholder[target=b_features_9_conv_1_1_running_mean] %b_features_9_conv_1_1_running_var : [num_users=1] = placeholder[target=b_features_9_conv_1_1_running_var] %b_features_9_conv_3_running_mean : [num_users=1] = placeholder[target=b_features_9_conv_3_running_mean] %b_features_9_conv_3_running_var : [num_users=1] = placeholder[target=b_features_9_conv_3_running_var] %b_features_10_conv_0_1_running_mean : [num_users=1] = placeholder[target=b_features_10_conv_0_1_running_mean] %b_features_10_conv_0_1_running_var : [num_users=1] = placeholder[target=b_features_10_conv_0_1_running_var] %b_features_10_conv_1_1_running_mean : [num_users=1] = placeholder[target=b_features_10_conv_1_1_running_mean] %b_features_10_conv_1_1_running_var : [num_users=1] = placeholder[target=b_features_10_conv_1_1_running_var] %b_features_10_conv_3_running_mean : [num_users=1] = placeholder[target=b_features_10_conv_3_running_mean] %b_features_10_conv_3_running_var : [num_users=1] = placeholder[target=b_features_10_conv_3_running_var] %b_features_11_conv_0_1_running_mean : [num_users=1] = placeholder[target=b_features_11_conv_0_1_running_mean] %b_features_11_conv_0_1_running_var : [num_users=1] = placeholder[target=b_features_11_conv_0_1_running_var] %b_features_11_conv_1_1_running_mean : [num_users=1] = placeholder[target=b_features_11_conv_1_1_running_mean] %b_features_11_conv_1_1_running_var : [num_users=1] = placeholder[target=b_features_11_conv_1_1_running_var] %b_features_11_conv_3_running_mean : [num_users=1] = placeholder[target=b_features_11_conv_3_running_mean] %b_features_11_conv_3_running_var : [num_users=1] = placeholder[target=b_features_11_conv_3_running_var] 
%b_features_12_conv_0_1_running_mean : [num_users=1] = placeholder[target=b_features_12_conv_0_1_running_mean] %b_features_12_conv_0_1_running_var : [num_users=1] = placeholder[target=b_features_12_conv_0_1_running_var] %b_features_12_conv_1_1_running_mean : [num_users=1] = placeholder[target=b_features_12_conv_1_1_running_mean] %b_features_12_conv_1_1_running_var : [num_users=1] = placeholder[target=b_features_12_conv_1_1_running_var] %b_features_12_conv_3_running_mean : [num_users=1] = placeholder[target=b_features_12_conv_3_running_mean] %b_features_12_conv_3_running_var : [num_users=1] = placeholder[target=b_features_12_conv_3_running_var] %b_features_13_conv_0_1_running_mean : [num_users=1] = placeholder[target=b_features_13_conv_0_1_running_mean] %b_features_13_conv_0_1_running_var : [num_users=1] = placeholder[target=b_features_13_conv_0_1_running_var] %b_features_13_conv_1_1_running_mean : [num_users=1] = placeholder[target=b_features_13_conv_1_1_running_mean] %b_features_13_conv_1_1_running_var : [num_users=1] = placeholder[target=b_features_13_conv_1_1_running_var] %b_features_13_conv_3_running_mean : [num_users=1] = placeholder[target=b_features_13_conv_3_running_mean] %b_features_13_conv_3_running_var : [num_users=1] = placeholder[target=b_features_13_conv_3_running_var] %b_features_14_conv_0_1_running_mean : [num_users=1] = placeholder[target=b_features_14_conv_0_1_running_mean] %b_features_14_conv_0_1_running_var : [num_users=1] = placeholder[target=b_features_14_conv_0_1_running_var] %b_features_14_conv_1_1_running_mean : [num_users=1] = placeholder[target=b_features_14_conv_1_1_running_mean] %b_features_14_conv_1_1_running_var : [num_users=1] = placeholder[target=b_features_14_conv_1_1_running_var] %b_features_14_conv_3_running_mean : [num_users=1] = placeholder[target=b_features_14_conv_3_running_mean] %b_features_14_conv_3_running_var : [num_users=1] = placeholder[target=b_features_14_conv_3_running_var] %b_features_15_conv_0_1_running_mean : [num_users=1] = placeholder[target=b_features_15_conv_0_1_running_mean] %b_features_15_conv_0_1_running_var : [num_users=1] = placeholder[target=b_features_15_conv_0_1_running_var] %b_features_15_conv_1_1_running_mean : [num_users=1] = placeholder[target=b_features_15_conv_1_1_running_mean] %b_features_15_conv_1_1_running_var : [num_users=1] = placeholder[target=b_features_15_conv_1_1_running_var] %b_features_15_conv_3_running_mean : [num_users=1] = placeholder[target=b_features_15_conv_3_running_mean] %b_features_15_conv_3_running_var : [num_users=1] = placeholder[target=b_features_15_conv_3_running_var] %b_features_16_conv_0_1_running_mean : [num_users=1] = placeholder[target=b_features_16_conv_0_1_running_mean] %b_features_16_conv_0_1_running_var : [num_users=1] = placeholder[target=b_features_16_conv_0_1_running_var] %b_features_16_conv_1_1_running_mean : [num_users=1] = placeholder[target=b_features_16_conv_1_1_running_mean] %b_features_16_conv_1_1_running_var : [num_users=1] = placeholder[target=b_features_16_conv_1_1_running_var] %b_features_16_conv_3_running_mean : [num_users=1] = placeholder[target=b_features_16_conv_3_running_mean] %b_features_16_conv_3_running_var : [num_users=1] = placeholder[target=b_features_16_conv_3_running_var] %b_features_17_conv_0_1_running_mean : [num_users=1] = placeholder[target=b_features_17_conv_0_1_running_mean] %b_features_17_conv_0_1_running_var : [num_users=1] = placeholder[target=b_features_17_conv_0_1_running_var] %b_features_17_conv_1_1_running_mean : [num_users=1] = 
placeholder[target=b_features_17_conv_1_1_running_mean] %b_features_17_conv_1_1_running_var : [num_users=1] = placeholder[target=b_features_17_conv_1_1_running_var] %b_features_17_conv_3_running_mean : [num_users=1] = placeholder[target=b_features_17_conv_3_running_mean] %b_features_17_conv_3_running_var : [num_users=1] = placeholder[target=b_features_17_conv_3_running_var] %b_features_18_1_running_mean : [num_users=1] = placeholder[target=b_features_18_1_running_mean] %b_features_18_1_running_var : [num_users=1] = placeholder[target=b_features_18_1_running_var] %x : [num_users=1] = placeholder[target=x] %aten_convolution_default : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.convolution.default](args = (%x, %p_features_0_0_weight, None, [2, 2], [1, 1], [1, 1], False, [0, 0], 1), kwargs = {}) %aten__native_batch_norm_legit_no_training_default : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten._native_batch_norm_legit_no_training.default](args = (%aten_convolution_default, %p_features_0_1_weight, %p_features_0_1_bias, %b_features_0_1_running_mean, %b_features_0_1_running_var, 0.1, 1e-05), kwargs = {}) %getitem : [num_users=1] = call_function[target=operator.getitem](args = (%aten__native_batch_norm_legit_no_training_default, 0), kwargs = {}) %aten_hardtanh_default : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.hardtanh.default](args = (%getitem, 0.0, 6.0), kwargs = {}) %aten_convolution_default_1 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.convolution.default](args = (%aten_hardtanh_default, %p_features_1_conv_0_0_weight, None, [1, 1], [1, 1], [1, 1], False, [0, 0], 32), kwargs = {}) %aten__native_batch_norm_legit_no_training_default_1 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten._native_batch_norm_legit_no_training.default](args = (%aten_convolution_default_1, %p_features_1_conv_0_1_weight, %p_features_1_conv_0_1_bias, %b_features_1_conv_0_1_running_mean, %b_features_1_conv_0_1_running_var, 0.1, 1e-05), kwargs = {}) %getitem_1 : [num_users=1] = call_function[target=operator.getitem](args = (%aten__native_batch_norm_legit_no_training_default_1, 0), kwargs = {}) %aten_hardtanh_default_1 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.hardtanh.default](args = (%getitem_1, 0.0, 6.0), kwargs = {}) %aten_convolution_default_2 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.convolution.default](args = (%aten_hardtanh_default_1, %p_features_1_conv_1_weight, None, [1, 1], [0, 0], [1, 1], False, [0, 0], 1), kwargs = {}) %aten__native_batch_norm_legit_no_training_default_2 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten._native_batch_norm_legit_no_training.default](args = (%aten_convolution_default_2, %p_features_1_conv_2_weight, %p_features_1_conv_2_bias, %b_features_1_conv_2_running_mean, %b_features_1_conv_2_running_var, 0.1, 1e-05), kwargs = {}) %getitem_2 : [num_users=1] = call_function[target=operator.getitem](args = (%aten__native_batch_norm_legit_no_training_default_2, 0), kwargs = {}) %aten_convolution_default_3 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.convolution.default](args = (%getitem_2, %p_features_2_conv_0_0_weight, None, [1, 1], [0, 0], [1, 1], False, [0, 0], 1), kwargs = {}) %aten__native_batch_norm_legit_no_training_default_3 : [num_users=1] = 
call_function[target=executorch.exir.dialects.edge._ops.aten._native_batch_norm_legit_no_training.default](args = (%aten_convolution_default_3, %p_features_2_conv_0_1_weight, %p_features_2_conv_0_1_bias, %b_features_2_conv_0_1_running_mean, %b_features_2_conv_0_1_running_var, 0.1, 1e-05), kwargs = {}) %getitem_3 : [num_users=1] = call_function[target=operator.getitem](args = (%aten__native_batch_norm_legit_no_training_default_3, 0), kwargs = {}) %aten_hardtanh_default_2 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.hardtanh.default](args = (%getitem_3, 0.0, 6.0), kwargs = {}) %aten_convolution_default_4 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.convolution.default](args = (%aten_hardtanh_default_2, %p_features_2_conv_1_0_weight, None, [2, 2], [1, 1], [1, 1], False, [0, 0], 96), kwargs = {}) %aten__native_batch_norm_legit_no_training_default_4 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten._native_batch_norm_legit_no_training.default](args = (%aten_convolution_default_4, %p_features_2_conv_1_1_weight, %p_features_2_conv_1_1_bias, %b_features_2_conv_1_1_running_mean, %b_features_2_conv_1_1_running_var, 0.1, 1e-05), kwargs = {}) %getitem_4 : [num_users=1] = call_function[target=operator.getitem](args = (%aten__native_batch_norm_legit_no_training_default_4, 0), kwargs = {}) %aten_hardtanh_default_3 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.hardtanh.default](args = (%getitem_4, 0.0, 6.0), kwargs = {}) %aten_convolution_default_5 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.convolution.default](args = (%aten_hardtanh_default_3, %p_features_2_conv_2_weight, None, [1, 1], [0, 0], [1, 1], False, [0, 0], 1), kwargs = {}) %aten__native_batch_norm_legit_no_training_default_5 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten._native_batch_norm_legit_no_training.default](args = (%aten_convolution_default_5, %p_features_2_conv_3_weight, %p_features_2_conv_3_bias, %b_features_2_conv_3_running_mean, %b_features_2_conv_3_running_var, 0.1, 1e-05), kwargs = {}) %getitem_5 : [num_users=2] = call_function[target=operator.getitem](args = (%aten__native_batch_norm_legit_no_training_default_5, 0), kwargs = {}) %aten_convolution_default_6 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.convolution.default](args = (%getitem_5, %p_features_3_conv_0_0_weight, None, [1, 1], [0, 0], [1, 1], False, [0, 0], 1), kwargs = {}) %aten__native_batch_norm_legit_no_training_default_6 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten._native_batch_norm_legit_no_training.default](args = (%aten_convolution_default_6, %p_features_3_conv_0_1_weight, %p_features_3_conv_0_1_bias, %b_features_3_conv_0_1_running_mean, %b_features_3_conv_0_1_running_var, 0.1, 1e-05), kwargs = {}) %getitem_6 : [num_users=1] = call_function[target=operator.getitem](args = (%aten__native_batch_norm_legit_no_training_default_6, 0), kwargs = {}) %aten_hardtanh_default_4 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.hardtanh.default](args = (%getitem_6, 0.0, 6.0), kwargs = {}) %aten_convolution_default_7 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.convolution.default](args = (%aten_hardtanh_default_4, %p_features_3_conv_1_0_weight, None, [1, 1], [1, 1], [1, 1], False, [0, 0], 144), kwargs = {}) 
%aten__native_batch_norm_legit_no_training_default_7 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten._native_batch_norm_legit_no_training.default](args = (%aten_convolution_default_7, %p_features_3_conv_1_1_weight, %p_features_3_conv_1_1_bias, %b_features_3_conv_1_1_running_mean, %b_features_3_conv_1_1_running_var, 0.1, 1e-05), kwargs = {}) %getitem_7 : [num_users=1] = call_function[target=operator.getitem](args = (%aten__native_batch_norm_legit_no_training_default_7, 0), kwargs = {}) %aten_hardtanh_default_5 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.hardtanh.default](args = (%getitem_7, 0.0, 6.0), kwargs = {}) %aten_convolution_default_8 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.convolution.default](args = (%aten_hardtanh_default_5, %p_features_3_conv_2_weight, None, [1, 1], [0, 0], [1, 1], False, [0, 0], 1), kwargs = {}) %aten__native_batch_norm_legit_no_training_default_8 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten._native_batch_norm_legit_no_training.default](args = (%aten_convolution_default_8, %p_features_3_conv_3_weight, %p_features_3_conv_3_bias, %b_features_3_conv_3_running_mean, %b_features_3_conv_3_running_var, 0.1, 1e-05), kwargs = {}) %getitem_8 : [num_users=1] = call_function[target=operator.getitem](args = (%aten__native_batch_norm_legit_no_training_default_8, 0), kwargs = {}) %aten_add_tensor : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.add.Tensor](args = (%getitem_5, %getitem_8), kwargs = {}) %aten_convolution_default_9 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.convolution.default](args = (%aten_add_tensor, %p_features_4_conv_0_0_weight, None, [1, 1], [0, 0], [1, 1], False, [0, 0], 1), kwargs = {}) %aten__native_batch_norm_legit_no_training_default_9 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten._native_batch_norm_legit_no_training.default](args = (%aten_convolution_default_9, %p_features_4_conv_0_1_weight, %p_features_4_conv_0_1_bias, %b_features_4_conv_0_1_running_mean, %b_features_4_conv_0_1_running_var, 0.1, 1e-05), kwargs = {}) %getitem_9 : [num_users=1] = call_function[target=operator.getitem](args = (%aten__native_batch_norm_legit_no_training_default_9, 0), kwargs = {}) %aten_hardtanh_default_6 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.hardtanh.default](args = (%getitem_9, 0.0, 6.0), kwargs = {}) %aten_convolution_default_10 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.convolution.default](args = (%aten_hardtanh_default_6, %p_features_4_conv_1_0_weight, None, [2, 2], [1, 1], [1, 1], False, [0, 0], 144), kwargs = {}) %aten__native_batch_norm_legit_no_training_default_10 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten._native_batch_norm_legit_no_training.default](args = (%aten_convolution_default_10, %p_features_4_conv_1_1_weight, %p_features_4_conv_1_1_bias, %b_features_4_conv_1_1_running_mean, %b_features_4_conv_1_1_running_var, 0.1, 1e-05), kwargs = {}) %getitem_10 : [num_users=1] = call_function[target=operator.getitem](args = (%aten__native_batch_norm_legit_no_training_default_10, 0), kwargs = {}) %aten_hardtanh_default_7 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.hardtanh.default](args = (%getitem_10, 0.0, 6.0), kwargs = {}) %aten_convolution_default_11 : [num_users=1] = 
call_function[target=executorch.exir.dialects.edge._ops.aten.convolution.default](args = (%aten_hardtanh_default_7, %p_features_4_conv_2_weight, None, [1, 1], [0, 0], [1, 1], False, [0, 0], 1), kwargs = {}) %aten__native_batch_norm_legit_no_training_default_11 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten._native_batch_norm_legit_no_training.default](args = (%aten_convolution_default_11, %p_features_4_conv_3_weight, %p_features_4_conv_3_bias, %b_features_4_conv_3_running_mean, %b_features_4_conv_3_running_var, 0.1, 1e-05), kwargs = {}) %getitem_11 : [num_users=2] = call_function[target=operator.getitem](args = (%aten__native_batch_norm_legit_no_training_default_11, 0), kwargs = {}) %aten_convolution_default_12 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.convolution.default](args = (%getitem_11, %p_features_5_conv_0_0_weight, None, [1, 1], [0, 0], [1, 1], False, [0, 0], 1), kwargs = {}) %aten__native_batch_norm_legit_no_training_default_12 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten._native_batch_norm_legit_no_training.default](args = (%aten_convolution_default_12, %p_features_5_conv_0_1_weight, %p_features_5_conv_0_1_bias, %b_features_5_conv_0_1_running_mean, %b_features_5_conv_0_1_running_var, 0.1, 1e-05), kwargs = {}) %getitem_12 : [num_users=1] = call_function[target=operator.getitem](args = (%aten__native_batch_norm_legit_no_training_default_12, 0), kwargs = {}) %aten_hardtanh_default_8 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.hardtanh.default](args = (%getitem_12, 0.0, 6.0), kwargs = {}) %aten_convolution_default_13 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.convolution.default](args = (%aten_hardtanh_default_8, %p_features_5_conv_1_0_weight, None, [1, 1], [1, 1], [1, 1], False, [0, 0], 192), kwargs = {}) %aten__native_batch_norm_legit_no_training_default_13 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten._native_batch_norm_legit_no_training.default](args = (%aten_convolution_default_13, %p_features_5_conv_1_1_weight, %p_features_5_conv_1_1_bias, %b_features_5_conv_1_1_running_mean, %b_features_5_conv_1_1_running_var, 0.1, 1e-05), kwargs = {}) %getitem_13 : [num_users=1] = call_function[target=operator.getitem](args = (%aten__native_batch_norm_legit_no_training_default_13, 0), kwargs = {}) %aten_hardtanh_default_9 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.hardtanh.default](args = (%getitem_13, 0.0, 6.0), kwargs = {}) %aten_convolution_default_14 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.convolution.default](args = (%aten_hardtanh_default_9, %p_features_5_conv_2_weight, None, [1, 1], [0, 0], [1, 1], False, [0, 0], 1), kwargs = {}) %aten__native_batch_norm_legit_no_training_default_14 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten._native_batch_norm_legit_no_training.default](args = (%aten_convolution_default_14, %p_features_5_conv_3_weight, %p_features_5_conv_3_bias, %b_features_5_conv_3_running_mean, %b_features_5_conv_3_running_var, 0.1, 1e-05), kwargs = {}) %getitem_14 : [num_users=1] = call_function[target=operator.getitem](args = (%aten__native_batch_norm_legit_no_training_default_14, 0), kwargs = {}) %aten_add_tensor_1 : [num_users=2] = call_function[target=executorch.exir.dialects.edge._ops.aten.add.Tensor](args = (%getitem_11, %getitem_14), kwargs = {}) 
%aten_convolution_default_15 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.convolution.default](args = (%aten_add_tensor_1, %p_features_6_conv_0_0_weight, None, [1, 1], [0, 0], [1, 1], False, [0, 0], 1), kwargs = {}) %aten__native_batch_norm_legit_no_training_default_15 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten._native_batch_norm_legit_no_training.default](args = (%aten_convolution_default_15, %p_features_6_conv_0_1_weight, %p_features_6_conv_0_1_bias, %b_features_6_conv_0_1_running_mean, %b_features_6_conv_0_1_running_var, 0.1, 1e-05), kwargs = {}) %getitem_15 : [num_users=1] = call_function[target=operator.getitem](args = (%aten__native_batch_norm_legit_no_training_default_15, 0), kwargs = {}) %aten_hardtanh_default_10 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.hardtanh.default](args = (%getitem_15, 0.0, 6.0), kwargs = {}) %aten_convolution_default_16 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.convolution.default](args = (%aten_hardtanh_default_10, %p_features_6_conv_1_0_weight, None, [1, 1], [1, 1], [1, 1], False, [0, 0], 192), kwargs = {}) %aten__native_batch_norm_legit_no_training_default_16 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten._native_batch_norm_legit_no_training.default](args = (%aten_convolution_default_16, %p_features_6_conv_1_1_weight, %p_features_6_conv_1_1_bias, %b_features_6_conv_1_1_running_mean, %b_features_6_conv_1_1_running_var, 0.1, 1e-05), kwargs = {}) %getitem_16 : [num_users=1] = call_function[target=operator.getitem](args = (%aten__native_batch_norm_legit_no_training_default_16, 0), kwargs = {}) %aten_hardtanh_default_11 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.hardtanh.default](args = (%getitem_16, 0.0, 6.0), kwargs = {}) %aten_convolution_default_17 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.convolution.default](args = (%aten_hardtanh_default_11, %p_features_6_conv_2_weight, None, [1, 1], [0, 0], [1, 1], False, [0, 0], 1), kwargs = {}) %aten__native_batch_norm_legit_no_training_default_17 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten._native_batch_norm_legit_no_training.default](args = (%aten_convolution_default_17, %p_features_6_conv_3_weight, %p_features_6_conv_3_bias, %b_features_6_conv_3_running_mean, %b_features_6_conv_3_running_var, 0.1, 1e-05), kwargs = {}) %getitem_17 : [num_users=1] = call_function[target=operator.getitem](args = (%aten__native_batch_norm_legit_no_training_default_17, 0), kwargs = {}) %aten_add_tensor_2 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.add.Tensor](args = (%aten_add_tensor_1, %getitem_17), kwargs = {}) %aten_convolution_default_18 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.convolution.default](args = (%aten_add_tensor_2, %p_features_7_conv_0_0_weight, None, [1, 1], [0, 0], [1, 1], False, [0, 0], 1), kwargs = {}) %aten__native_batch_norm_legit_no_training_default_18 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten._native_batch_norm_legit_no_training.default](args = (%aten_convolution_default_18, %p_features_7_conv_0_1_weight, %p_features_7_conv_0_1_bias, %b_features_7_conv_0_1_running_mean, %b_features_7_conv_0_1_running_var, 0.1, 1e-05), kwargs = {}) %getitem_18 : [num_users=1] = call_function[target=operator.getitem](args = 
(%aten__native_batch_norm_legit_no_training_default_18, 0), kwargs = {}) %aten_hardtanh_default_12 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.hardtanh.default](args = (%getitem_18, 0.0, 6.0), kwargs = {}) %aten_convolution_default_19 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.convolution.default](args = (%aten_hardtanh_default_12, %p_features_7_conv_1_0_weight, None, [2, 2], [1, 1], [1, 1], False, [0, 0], 192), kwargs = {}) %aten__native_batch_norm_legit_no_training_default_19 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten._native_batch_norm_legit_no_training.default](args = (%aten_convolution_default_19, %p_features_7_conv_1_1_weight, %p_features_7_conv_1_1_bias, %b_features_7_conv_1_1_running_mean, %b_features_7_conv_1_1_running_var, 0.1, 1e-05), kwargs = {}) %getitem_19 : [num_users=1] = call_function[target=operator.getitem](args = (%aten__native_batch_norm_legit_no_training_default_19, 0), kwargs = {}) %aten_hardtanh_default_13 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.hardtanh.default](args = (%getitem_19, 0.0, 6.0), kwargs = {}) %aten_convolution_default_20 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.convolution.default](args = (%aten_hardtanh_default_13, %p_features_7_conv_2_weight, None, [1, 1], [0, 0], [1, 1], False, [0, 0], 1), kwargs = {}) %aten__native_batch_norm_legit_no_training_default_20 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten._native_batch_norm_legit_no_training.default](args = (%aten_convolution_default_20, %p_features_7_conv_3_weight, %p_features_7_conv_3_bias, %b_features_7_conv_3_running_mean, %b_features_7_conv_3_running_var, 0.1, 1e-05), kwargs = {}) %getitem_20 : [num_users=2] = call_function[target=operator.getitem](args = (%aten__native_batch_norm_legit_no_training_default_20, 0), kwargs = {}) %aten_convolution_default_21 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.convolution.default](args = (%getitem_20, %p_features_8_conv_0_0_weight, None, [1, 1], [0, 0], [1, 1], False, [0, 0], 1), kwargs = {}) %aten__native_batch_norm_legit_no_training_default_21 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten._native_batch_norm_legit_no_training.default](args = (%aten_convolution_default_21, %p_features_8_conv_0_1_weight, %p_features_8_conv_0_1_bias, %b_features_8_conv_0_1_running_mean, %b_features_8_conv_0_1_running_var, 0.1, 1e-05), kwargs = {}) %getitem_21 : [num_users=1] = call_function[target=operator.getitem](args = (%aten__native_batch_norm_legit_no_training_default_21, 0), kwargs = {}) %aten_hardtanh_default_14 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.hardtanh.default](args = (%getitem_21, 0.0, 6.0), kwargs = {}) %aten_convolution_default_22 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.convolution.default](args = (%aten_hardtanh_default_14, %p_features_8_conv_1_0_weight, None, [1, 1], [1, 1], [1, 1], False, [0, 0], 384), kwargs = {}) %aten__native_batch_norm_legit_no_training_default_22 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten._native_batch_norm_legit_no_training.default](args = (%aten_convolution_default_22, %p_features_8_conv_1_1_weight, %p_features_8_conv_1_1_bias, %b_features_8_conv_1_1_running_mean, %b_features_8_conv_1_1_running_var, 0.1, 1e-05), kwargs = {}) %getitem_22 : 
[num_users=1] = call_function[target=operator.getitem](args = (%aten__native_batch_norm_legit_no_training_default_22, 0), kwargs = {}) %aten_hardtanh_default_15 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.hardtanh.default](args = (%getitem_22, 0.0, 6.0), kwargs = {}) %aten_convolution_default_23 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.convolution.default](args = (%aten_hardtanh_default_15, %p_features_8_conv_2_weight, None, [1, 1], [0, 0], [1, 1], False, [0, 0], 1), kwargs = {}) %aten__native_batch_norm_legit_no_training_default_23 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten._native_batch_norm_legit_no_training.default](args = (%aten_convolution_default_23, %p_features_8_conv_3_weight, %p_features_8_conv_3_bias, %b_features_8_conv_3_running_mean, %b_features_8_conv_3_running_var, 0.1, 1e-05), kwargs = {}) %getitem_23 : [num_users=1] = call_function[target=operator.getitem](args = (%aten__native_batch_norm_legit_no_training_default_23, 0), kwargs = {}) %aten_add_tensor_3 : [num_users=2] = call_function[target=executorch.exir.dialects.edge._ops.aten.add.Tensor](args = (%getitem_20, %getitem_23), kwargs = {}) %aten_convolution_default_24 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.convolution.default](args = (%aten_add_tensor_3, %p_features_9_conv_0_0_weight, None, [1, 1], [0, 0], [1, 1], False, [0, 0], 1), kwargs = {}) %aten__native_batch_norm_legit_no_training_default_24 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten._native_batch_norm_legit_no_training.default](args = (%aten_convolution_default_24, %p_features_9_conv_0_1_weight, %p_features_9_conv_0_1_bias, %b_features_9_conv_0_1_running_mean, %b_features_9_conv_0_1_running_var, 0.1, 1e-05), kwargs = {}) %getitem_24 : [num_users=1] = call_function[target=operator.getitem](args = (%aten__native_batch_norm_legit_no_training_default_24, 0), kwargs = {}) %aten_hardtanh_default_16 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.hardtanh.default](args = (%getitem_24, 0.0, 6.0), kwargs = {}) %aten_convolution_default_25 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.convolution.default](args = (%aten_hardtanh_default_16, %p_features_9_conv_1_0_weight, None, [1, 1], [1, 1], [1, 1], False, [0, 0], 384), kwargs = {}) %aten__native_batch_norm_legit_no_training_default_25 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten._native_batch_norm_legit_no_training.default](args = (%aten_convolution_default_25, %p_features_9_conv_1_1_weight, %p_features_9_conv_1_1_bias, %b_features_9_conv_1_1_running_mean, %b_features_9_conv_1_1_running_var, 0.1, 1e-05), kwargs = {}) %getitem_25 : [num_users=1] = call_function[target=operator.getitem](args = (%aten__native_batch_norm_legit_no_training_default_25, 0), kwargs = {}) %aten_hardtanh_default_17 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.hardtanh.default](args = (%getitem_25, 0.0, 6.0), kwargs = {}) %aten_convolution_default_26 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.convolution.default](args = (%aten_hardtanh_default_17, %p_features_9_conv_2_weight, None, [1, 1], [0, 0], [1, 1], False, [0, 0], 1), kwargs = {}) %aten__native_batch_norm_legit_no_training_default_26 : [num_users=1] = 
call_function[target=executorch.exir.dialects.edge._ops.aten._native_batch_norm_legit_no_training.default](args = (%aten_convolution_default_26, %p_features_9_conv_3_weight, %p_features_9_conv_3_bias, %b_features_9_conv_3_running_mean, %b_features_9_conv_3_running_var, 0.1, 1e-05), kwargs = {}) %getitem_26 : [num_users=1] = call_function[target=operator.getitem](args = (%aten__native_batch_norm_legit_no_training_default_26, 0), kwargs = {}) %aten_add_tensor_4 : [num_users=2] = call_function[target=executorch.exir.dialects.edge._ops.aten.add.Tensor](args = (%aten_add_tensor_3, %getitem_26), kwargs = {}) %aten_convolution_default_27 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.convolution.default](args = (%aten_add_tensor_4, %p_features_10_conv_0_0_weight, None, [1, 1], [0, 0], [1, 1], False, [0, 0], 1), kwargs = {}) %aten__native_batch_norm_legit_no_training_default_27 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten._native_batch_norm_legit_no_training.default](args = (%aten_convolution_default_27, %p_features_10_conv_0_1_weight, %p_features_10_conv_0_1_bias, %b_features_10_conv_0_1_running_mean, %b_features_10_conv_0_1_running_var, 0.1, 1e-05), kwargs = {}) %getitem_27 : [num_users=1] = call_function[target=operator.getitem](args = (%aten__native_batch_norm_legit_no_training_default_27, 0), kwargs = {}) %aten_hardtanh_default_18 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.hardtanh.default](args = (%getitem_27, 0.0, 6.0), kwargs = {}) %aten_convolution_default_28 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.convolution.default](args = (%aten_hardtanh_default_18, %p_features_10_conv_1_0_weight, None, [1, 1], [1, 1], [1, 1], False, [0, 0], 384), kwargs = {}) %aten__native_batch_norm_legit_no_training_default_28 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten._native_batch_norm_legit_no_training.default](args = (%aten_convolution_default_28, %p_features_10_conv_1_1_weight, %p_features_10_conv_1_1_bias, %b_features_10_conv_1_1_running_mean, %b_features_10_conv_1_1_running_var, 0.1, 1e-05), kwargs = {}) %getitem_28 : [num_users=1] = call_function[target=operator.getitem](args = (%aten__native_batch_norm_legit_no_training_default_28, 0), kwargs = {}) %aten_hardtanh_default_19 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.hardtanh.default](args = (%getitem_28, 0.0, 6.0), kwargs = {}) %aten_convolution_default_29 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.convolution.default](args = (%aten_hardtanh_default_19, %p_features_10_conv_2_weight, None, [1, 1], [0, 0], [1, 1], False, [0, 0], 1), kwargs = {}) %aten__native_batch_norm_legit_no_training_default_29 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten._native_batch_norm_legit_no_training.default](args = (%aten_convolution_default_29, %p_features_10_conv_3_weight, %p_features_10_conv_3_bias, %b_features_10_conv_3_running_mean, %b_features_10_conv_3_running_var, 0.1, 1e-05), kwargs = {}) %getitem_29 : [num_users=1] = call_function[target=operator.getitem](args = (%aten__native_batch_norm_legit_no_training_default_29, 0), kwargs = {}) %aten_add_tensor_5 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.add.Tensor](args = (%aten_add_tensor_4, %getitem_29), kwargs = {}) %aten_convolution_default_30 : [num_users=1] = 
call_function[target=executorch.exir.dialects.edge._ops.aten.convolution.default](args = (%aten_add_tensor_5, %p_features_11_conv_0_0_weight, None, [1, 1], [0, 0], [1, 1], False, [0, 0], 1), kwargs = {}) %aten__native_batch_norm_legit_no_training_default_30 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten._native_batch_norm_legit_no_training.default](args = (%aten_convolution_default_30, %p_features_11_conv_0_1_weight, %p_features_11_conv_0_1_bias, %b_features_11_conv_0_1_running_mean, %b_features_11_conv_0_1_running_var, 0.1, 1e-05), kwargs = {}) %getitem_30 : [num_users=1] = call_function[target=operator.getitem](args = (%aten__native_batch_norm_legit_no_training_default_30, 0), kwargs = {}) %aten_hardtanh_default_20 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.hardtanh.default](args = (%getitem_30, 0.0, 6.0), kwargs = {}) %aten_convolution_default_31 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.convolution.default](args = (%aten_hardtanh_default_20, %p_features_11_conv_1_0_weight, None, [1, 1], [1, 1], [1, 1], False, [0, 0], 384), kwargs = {}) %aten__native_batch_norm_legit_no_training_default_31 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten._native_batch_norm_legit_no_training.default](args = (%aten_convolution_default_31, %p_features_11_conv_1_1_weight, %p_features_11_conv_1_1_bias, %b_features_11_conv_1_1_running_mean, %b_features_11_conv_1_1_running_var, 0.1, 1e-05), kwargs = {}) %getitem_31 : [num_users=1] = call_function[target=operator.getitem](args = (%aten__native_batch_norm_legit_no_training_default_31, 0), kwargs = {}) %aten_hardtanh_default_21 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.hardtanh.default](args = (%getitem_31, 0.0, 6.0), kwargs = {}) %aten_convolution_default_32 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.convolution.default](args = (%aten_hardtanh_default_21, %p_features_11_conv_2_weight, None, [1, 1], [0, 0], [1, 1], False, [0, 0], 1), kwargs = {}) %aten__native_batch_norm_legit_no_training_default_32 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten._native_batch_norm_legit_no_training.default](args = (%aten_convolution_default_32, %p_features_11_conv_3_weight, %p_features_11_conv_3_bias, %b_features_11_conv_3_running_mean, %b_features_11_conv_3_running_var, 0.1, 1e-05), kwargs = {}) %getitem_32 : [num_users=2] = call_function[target=operator.getitem](args = (%aten__native_batch_norm_legit_no_training_default_32, 0), kwargs = {}) %aten_convolution_default_33 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.convolution.default](args = (%getitem_32, %p_features_12_conv_0_0_weight, None, [1, 1], [0, 0], [1, 1], False, [0, 0], 1), kwargs = {}) %aten__native_batch_norm_legit_no_training_default_33 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten._native_batch_norm_legit_no_training.default](args = (%aten_convolution_default_33, %p_features_12_conv_0_1_weight, %p_features_12_conv_0_1_bias, %b_features_12_conv_0_1_running_mean, %b_features_12_conv_0_1_running_var, 0.1, 1e-05), kwargs = {}) %getitem_33 : [num_users=1] = call_function[target=operator.getitem](args = (%aten__native_batch_norm_legit_no_training_default_33, 0), kwargs = {}) %aten_hardtanh_default_22 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.hardtanh.default](args = (%getitem_33, 
0.0, 6.0), kwargs = {}) %aten_convolution_default_34 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.convolution.default](args = (%aten_hardtanh_default_22, %p_features_12_conv_1_0_weight, None, [1, 1], [1, 1], [1, 1], False, [0, 0], 576), kwargs = {}) %aten__native_batch_norm_legit_no_training_default_34 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten._native_batch_norm_legit_no_training.default](args = (%aten_convolution_default_34, %p_features_12_conv_1_1_weight, %p_features_12_conv_1_1_bias, %b_features_12_conv_1_1_running_mean, %b_features_12_conv_1_1_running_var, 0.1, 1e-05), kwargs = {}) %getitem_34 : [num_users=1] = call_function[target=operator.getitem](args = (%aten__native_batch_norm_legit_no_training_default_34, 0), kwargs = {}) %aten_hardtanh_default_23 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.hardtanh.default](args = (%getitem_34, 0.0, 6.0), kwargs = {}) %aten_convolution_default_35 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.convolution.default](args = (%aten_hardtanh_default_23, %p_features_12_conv_2_weight, None, [1, 1], [0, 0], [1, 1], False, [0, 0], 1), kwargs = {}) %aten__native_batch_norm_legit_no_training_default_35 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten._native_batch_norm_legit_no_training.default](args = (%aten_convolution_default_35, %p_features_12_conv_3_weight, %p_features_12_conv_3_bias, %b_features_12_conv_3_running_mean, %b_features_12_conv_3_running_var, 0.1, 1e-05), kwargs = {}) %getitem_35 : [num_users=1] = call_function[target=operator.getitem](args = (%aten__native_batch_norm_legit_no_training_default_35, 0), kwargs = {}) %aten_add_tensor_6 : [num_users=2] = call_function[target=executorch.exir.dialects.edge._ops.aten.add.Tensor](args = (%getitem_32, %getitem_35), kwargs = {}) %aten_convolution_default_36 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.convolution.default](args = (%aten_add_tensor_6, %p_features_13_conv_0_0_weight, None, [1, 1], [0, 0], [1, 1], False, [0, 0], 1), kwargs = {}) %aten__native_batch_norm_legit_no_training_default_36 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten._native_batch_norm_legit_no_training.default](args = (%aten_convolution_default_36, %p_features_13_conv_0_1_weight, %p_features_13_conv_0_1_bias, %b_features_13_conv_0_1_running_mean, %b_features_13_conv_0_1_running_var, 0.1, 1e-05), kwargs = {}) %getitem_36 : [num_users=1] = call_function[target=operator.getitem](args = (%aten__native_batch_norm_legit_no_training_default_36, 0), kwargs = {}) %aten_hardtanh_default_24 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.hardtanh.default](args = (%getitem_36, 0.0, 6.0), kwargs = {}) %aten_convolution_default_37 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.convolution.default](args = (%aten_hardtanh_default_24, %p_features_13_conv_1_0_weight, None, [1, 1], [1, 1], [1, 1], False, [0, 0], 576), kwargs = {}) %aten__native_batch_norm_legit_no_training_default_37 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten._native_batch_norm_legit_no_training.default](args = (%aten_convolution_default_37, %p_features_13_conv_1_1_weight, %p_features_13_conv_1_1_bias, %b_features_13_conv_1_1_running_mean, %b_features_13_conv_1_1_running_var, 0.1, 1e-05), kwargs = {}) %getitem_37 : [num_users=1] = 
call_function[target=operator.getitem](args = (%aten__native_batch_norm_legit_no_training_default_37, 0), kwargs = {}) %aten_hardtanh_default_25 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.hardtanh.default](args = (%getitem_37, 0.0, 6.0), kwargs = {}) %aten_convolution_default_38 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.convolution.default](args = (%aten_hardtanh_default_25, %p_features_13_conv_2_weight, None, [1, 1], [0, 0], [1, 1], False, [0, 0], 1), kwargs = {}) %aten__native_batch_norm_legit_no_training_default_38 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten._native_batch_norm_legit_no_training.default](args = (%aten_convolution_default_38, %p_features_13_conv_3_weight, %p_features_13_conv_3_bias, %b_features_13_conv_3_running_mean, %b_features_13_conv_3_running_var, 0.1, 1e-05), kwargs = {}) %getitem_38 : [num_users=1] = call_function[target=operator.getitem](args = (%aten__native_batch_norm_legit_no_training_default_38, 0), kwargs = {}) %aten_add_tensor_7 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.add.Tensor](args = (%aten_add_tensor_6, %getitem_38), kwargs = {}) %aten_convolution_default_39 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.convolution.default](args = (%aten_add_tensor_7, %p_features_14_conv_0_0_weight, None, [1, 1], [0, 0], [1, 1], False, [0, 0], 1), kwargs = {}) %aten__native_batch_norm_legit_no_training_default_39 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten._native_batch_norm_legit_no_training.default](args = (%aten_convolution_default_39, %p_features_14_conv_0_1_weight, %p_features_14_conv_0_1_bias, %b_features_14_conv_0_1_running_mean, %b_features_14_conv_0_1_running_var, 0.1, 1e-05), kwargs = {}) %getitem_39 : [num_users=1] = call_function[target=operator.getitem](args = (%aten__native_batch_norm_legit_no_training_default_39, 0), kwargs = {}) %aten_hardtanh_default_26 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.hardtanh.default](args = (%getitem_39, 0.0, 6.0), kwargs = {}) %aten_convolution_default_40 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.convolution.default](args = (%aten_hardtanh_default_26, %p_features_14_conv_1_0_weight, None, [2, 2], [1, 1], [1, 1], False, [0, 0], 576), kwargs = {}) %aten__native_batch_norm_legit_no_training_default_40 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten._native_batch_norm_legit_no_training.default](args = (%aten_convolution_default_40, %p_features_14_conv_1_1_weight, %p_features_14_conv_1_1_bias, %b_features_14_conv_1_1_running_mean, %b_features_14_conv_1_1_running_var, 0.1, 1e-05), kwargs = {}) %getitem_40 : [num_users=1] = call_function[target=operator.getitem](args = (%aten__native_batch_norm_legit_no_training_default_40, 0), kwargs = {}) %aten_hardtanh_default_27 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.hardtanh.default](args = (%getitem_40, 0.0, 6.0), kwargs = {}) %aten_convolution_default_41 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.convolution.default](args = (%aten_hardtanh_default_27, %p_features_14_conv_2_weight, None, [1, 1], [0, 0], [1, 1], False, [0, 0], 1), kwargs = {}) %aten__native_batch_norm_legit_no_training_default_41 : [num_users=1] = 
call_function[target=executorch.exir.dialects.edge._ops.aten._native_batch_norm_legit_no_training.default](args = (%aten_convolution_default_41, %p_features_14_conv_3_weight, %p_features_14_conv_3_bias, %b_features_14_conv_3_running_mean, %b_features_14_conv_3_running_var, 0.1, 1e-05), kwargs = {}) %getitem_41 : [num_users=2] = call_function[target=operator.getitem](args = (%aten__native_batch_norm_legit_no_training_default_41, 0), kwargs = {}) %aten_convolution_default_42 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.convolution.default](args = (%getitem_41, %p_features_15_conv_0_0_weight, None, [1, 1], [0, 0], [1, 1], False, [0, 0], 1), kwargs = {}) %aten__native_batch_norm_legit_no_training_default_42 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten._native_batch_norm_legit_no_training.default](args = (%aten_convolution_default_42, %p_features_15_conv_0_1_weight, %p_features_15_conv_0_1_bias, %b_features_15_conv_0_1_running_mean, %b_features_15_conv_0_1_running_var, 0.1, 1e-05), kwargs = {}) %getitem_42 : [num_users=1] = call_function[target=operator.getitem](args = (%aten__native_batch_norm_legit_no_training_default_42, 0), kwargs = {}) %aten_hardtanh_default_28 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.hardtanh.default](args = (%getitem_42, 0.0, 6.0), kwargs = {}) %aten_convolution_default_43 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.convolution.default](args = (%aten_hardtanh_default_28, %p_features_15_conv_1_0_weight, None, [1, 1], [1, 1], [1, 1], False, [0, 0], 960), kwargs = {}) %aten__native_batch_norm_legit_no_training_default_43 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten._native_batch_norm_legit_no_training.default](args = (%aten_convolution_default_43, %p_features_15_conv_1_1_weight, %p_features_15_conv_1_1_bias, %b_features_15_conv_1_1_running_mean, %b_features_15_conv_1_1_running_var, 0.1, 1e-05), kwargs = {}) %getitem_43 : [num_users=1] = call_function[target=operator.getitem](args = (%aten__native_batch_norm_legit_no_training_default_43, 0), kwargs = {}) %aten_hardtanh_default_29 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.hardtanh.default](args = (%getitem_43, 0.0, 6.0), kwargs = {}) %aten_convolution_default_44 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.convolution.default](args = (%aten_hardtanh_default_29, %p_features_15_conv_2_weight, None, [1, 1], [0, 0], [1, 1], False, [0, 0], 1), kwargs = {}) %aten__native_batch_norm_legit_no_training_default_44 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten._native_batch_norm_legit_no_training.default](args = (%aten_convolution_default_44, %p_features_15_conv_3_weight, %p_features_15_conv_3_bias, %b_features_15_conv_3_running_mean, %b_features_15_conv_3_running_var, 0.1, 1e-05), kwargs = {}) %getitem_44 : [num_users=1] = call_function[target=operator.getitem](args = (%aten__native_batch_norm_legit_no_training_default_44, 0), kwargs = {}) %aten_add_tensor_8 : [num_users=2] = call_function[target=executorch.exir.dialects.edge._ops.aten.add.Tensor](args = (%getitem_41, %getitem_44), kwargs = {}) %aten_convolution_default_45 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.convolution.default](args = (%aten_add_tensor_8, %p_features_16_conv_0_0_weight, None, [1, 1], [0, 0], [1, 1], False, [0, 0], 1), kwargs = {}) 
%aten__native_batch_norm_legit_no_training_default_45 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten._native_batch_norm_legit_no_training.default](args = (%aten_convolution_default_45, %p_features_16_conv_0_1_weight, %p_features_16_conv_0_1_bias, %b_features_16_conv_0_1_running_mean, %b_features_16_conv_0_1_running_var, 0.1, 1e-05), kwargs = {}) %getitem_45 : [num_users=1] = call_function[target=operator.getitem](args = (%aten__native_batch_norm_legit_no_training_default_45, 0), kwargs = {}) %aten_hardtanh_default_30 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.hardtanh.default](args = (%getitem_45, 0.0, 6.0), kwargs = {}) %aten_convolution_default_46 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.convolution.default](args = (%aten_hardtanh_default_30, %p_features_16_conv_1_0_weight, None, [1, 1], [1, 1], [1, 1], False, [0, 0], 960), kwargs = {}) %aten__native_batch_norm_legit_no_training_default_46 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten._native_batch_norm_legit_no_training.default](args = (%aten_convolution_default_46, %p_features_16_conv_1_1_weight, %p_features_16_conv_1_1_bias, %b_features_16_conv_1_1_running_mean, %b_features_16_conv_1_1_running_var, 0.1, 1e-05), kwargs = {}) %getitem_46 : [num_users=1] = call_function[target=operator.getitem](args = (%aten__native_batch_norm_legit_no_training_default_46, 0), kwargs = {}) %aten_hardtanh_default_31 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.hardtanh.default](args = (%getitem_46, 0.0, 6.0), kwargs = {}) %aten_convolution_default_47 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.convolution.default](args = (%aten_hardtanh_default_31, %p_features_16_conv_2_weight, None, [1, 1], [0, 0], [1, 1], False, [0, 0], 1), kwargs = {}) %aten__native_batch_norm_legit_no_training_default_47 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten._native_batch_norm_legit_no_training.default](args = (%aten_convolution_default_47, %p_features_16_conv_3_weight, %p_features_16_conv_3_bias, %b_features_16_conv_3_running_mean, %b_features_16_conv_3_running_var, 0.1, 1e-05), kwargs = {}) %getitem_47 : [num_users=1] = call_function[target=operator.getitem](args = (%aten__native_batch_norm_legit_no_training_default_47, 0), kwargs = {}) %aten_add_tensor_9 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.add.Tensor](args = (%aten_add_tensor_8, %getitem_47), kwargs = {}) %aten_convolution_default_48 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.convolution.default](args = (%aten_add_tensor_9, %p_features_17_conv_0_0_weight, None, [1, 1], [0, 0], [1, 1], False, [0, 0], 1), kwargs = {}) %aten__native_batch_norm_legit_no_training_default_48 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten._native_batch_norm_legit_no_training.default](args = (%aten_convolution_default_48, %p_features_17_conv_0_1_weight, %p_features_17_conv_0_1_bias, %b_features_17_conv_0_1_running_mean, %b_features_17_conv_0_1_running_var, 0.1, 1e-05), kwargs = {}) %getitem_48 : [num_users=1] = call_function[target=operator.getitem](args = (%aten__native_batch_norm_legit_no_training_default_48, 0), kwargs = {}) %aten_hardtanh_default_32 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.hardtanh.default](args = (%getitem_48, 0.0, 6.0), kwargs = {}) 
%aten_convolution_default_49 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.convolution.default](args = (%aten_hardtanh_default_32, %p_features_17_conv_1_0_weight, None, [1, 1], [1, 1], [1, 1], False, [0, 0], 960), kwargs = {}) %aten__native_batch_norm_legit_no_training_default_49 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten._native_batch_norm_legit_no_training.default](args = (%aten_convolution_default_49, %p_features_17_conv_1_1_weight, %p_features_17_conv_1_1_bias, %b_features_17_conv_1_1_running_mean, %b_features_17_conv_1_1_running_var, 0.1, 1e-05), kwargs = {}) %getitem_49 : [num_users=1] = call_function[target=operator.getitem](args = (%aten__native_batch_norm_legit_no_training_default_49, 0), kwargs = {}) %aten_hardtanh_default_33 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.hardtanh.default](args = (%getitem_49, 0.0, 6.0), kwargs = {}) %aten_convolution_default_50 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.convolution.default](args = (%aten_hardtanh_default_33, %p_features_17_conv_2_weight, None, [1, 1], [0, 0], [1, 1], False, [0, 0], 1), kwargs = {}) %aten__native_batch_norm_legit_no_training_default_50 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten._native_batch_norm_legit_no_training.default](args = (%aten_convolution_default_50, %p_features_17_conv_3_weight, %p_features_17_conv_3_bias, %b_features_17_conv_3_running_mean, %b_features_17_conv_3_running_var, 0.1, 1e-05), kwargs = {}) %getitem_50 : [num_users=1] = call_function[target=operator.getitem](args = (%aten__native_batch_norm_legit_no_training_default_50, 0), kwargs = {}) %aten_convolution_default_51 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.convolution.default](args = (%getitem_50, %p_features_18_0_weight, None, [1, 1], [0, 0], [1, 1], False, [0, 0], 1), kwargs = {}) %aten__native_batch_norm_legit_no_training_default_51 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten._native_batch_norm_legit_no_training.default](args = (%aten_convolution_default_51, %p_features_18_1_weight, %p_features_18_1_bias, %b_features_18_1_running_mean, %b_features_18_1_running_var, 0.1, 1e-05), kwargs = {}) %getitem_51 : [num_users=1] = call_function[target=operator.getitem](args = (%aten__native_batch_norm_legit_no_training_default_51, 0), kwargs = {}) %aten_hardtanh_default_34 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.hardtanh.default](args = (%getitem_51, 0.0, 6.0), kwargs = {}) %aten_mean_dim : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.mean.dim](args = (%aten_hardtanh_default_34, [-1, -2], True), kwargs = {}) return (aten_mean_dim,) %executorch_call_delegate : [num_users=1] = call_function[target=torch.ops.higher_order.executorch_call_delegate](args = (%lowered_module_0, %x), kwargs = {}) %getitem : [num_users=1] = call_function[target=operator.getitem](args = (%executorch_call_delegate, 0), kwargs = {}) %aten_view_copy_default : [num_users=1] = call_function[target=executorch.exir.memory.view](args = (%getitem, [1, 1280]), kwargs = {}) %alloc : [num_users=1] = call_function[target=executorch.exir.memory.alloc](args = (((1, 1280), torch.float32),), kwargs = {}) %dim_order_ops__clone_dim_order_default : [num_users=1] = call_function[target=torch.ops.dim_order_ops._clone_dim_order.out](args = (%aten_view_copy_default,), kwargs = {dim_order: 
[0, 1], out: %alloc}) %lowered_module_1 : [num_users=1] = get_attr[target=lowered_module_1] backend_id: XnnpackBackend lowered graph(): %p_classifier_1_weight : [num_users=1] = placeholder[target=p_classifier_1_weight] %p_classifier_1_bias : [num_users=1] = placeholder[target=p_classifier_1_bias] %dim_order_ops__clone_dim_order_default : [num_users=1] = placeholder[target=dim_order_ops__clone_dim_order_default] %aten_linear_default : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.linear.default](args = (%dim_order_ops__clone_dim_order_default, %p_classifier_1_weight, %p_classifier_1_bias), kwargs = {}) return (aten_linear_default,) %executorch_call_delegate_1 : [num_users=1] = call_function[target=torch.ops.higher_order.executorch_call_delegate](args = (%lowered_module_1, %dim_order_ops__clone_dim_order_default), kwargs = {}) %getitem_1 : [num_users=1] = call_function[target=operator.getitem](args = (%executorch_call_delegate_1, 0), kwargs = {}) return (getitem_1,) ```
--- # Delegate Debugging [Delegate backends](compiler-delegate-and-partitioner.md) are a prominent component of on-device models due to their flexibility in defining behavior. A side effect of this flexibility is that it operates as an opaque transformation. This obfuscates rich associations and mutations that are valuable in post-processing. - For example, if two different operator fusions were to occur within a delegate, post processing wouldn’t be able to separate the two transformations. Specifically, it makes associating runtime information (such as profiling results) through delegated graphs difficult. Delegate Debug Identifiers provides a framework through which delegate authors can propagate this information and utilize it for post run analysis. The preparation is broken down into three stages: - **Ahead-of-time (AOT)**: Delegate authors generate a __Debug Handle Map__. - **Runtime**: Delegate authors log using the __Delegate Debug Identifiers__ registered AOT in the __Debug Handle Map__. - **Deserialization**: Delegate authors provide a parser for custom metadata in delegate events. ## Ahead-of-Time Integration Delegate authors propagate what transformations occur in a lowered backend by returning a **Debug Handle Map** from the backend implementation. ### Generating a Debug Handle Map **Debug Handle Maps** communicate what transformations occurred in a backend by mapping **Delegate Debug Identifiers** to debug handles. **Delegate Debug Identifiers** are generated or user-provided identifiers for representing points of interest during runtime. Recall that debug handles are unique identifiers to operator instances in the model graph. For example: - **{ 0: (10, 11), 1: (11, 12) }:** Identifiers 0 and 1 in the runtime correspond to operators with the debug handles (10, 11) and (11, 12) respectively. - **{ “fused_op_1_2_3”: (11, 12, 15) }**: Identifier “fused_op_1_2_3” in the runtime corresponds to operators with debug handles (11, 12, 15), and 11, 12, 15 corresponds to the op 1, op 2 and op 3. ```{Note} Identifiers are a means of connecting runtime results to the model graph; the interpretation of the identifiers is defined by the delegate author. ``` **Debug Handle Maps** are constructed through the use of **DelegateMappingBuilder** and returned as a part of `PreprocessResult`. ```python class PreprocessResult: processed_bytes: bytes = bytes() debug_handle_map: Optional[ Union[Dict[int, Tuple[int]], Dict[str, Tuple[int]]] ] = None ``` PreprocessResult is defined [here](https://github.com/pytorch/executorch/blob/main/exir/backend/backend_details.py). #### DelegateMappingBuilder `DelegateMappingBuilder` is a helper class for managing and constructing Debug Handle Maps. The result of the builder should be passed in when constructing PreprocessResult. `DelegateMappingBuilder` is defined [here](https://github.com/pytorch/executorch/blob/main/exir/backend/utils.py) A `DelegateMappingBuilder` instance can be constructed in one of 2 modes: manual identifiers or generated identifiers. ```python # Manual Identifiers, Default builder = DelegateMappingBuilder(generated_identifiers=False) # Generated Identifiers builder = DelegateMappingBuilder(generated_identifiers=True) ``` With **manual identifiers**, users pass in a **Delegate Debug Identifier** when creating entries. With **generated identifiers**, the builder will auto-assign a **Delegate Debug Identifier**. To add an entry to the **Debug Handle Map**, use `insert_delegate_mapping_entry`. 
It associates one of `fx.Node(s)` or debug handles(s) (sourced from node.meta["debug_handle"]) to an optional **Delegate Debug Identifier** (used for the manual identifiers). The identifier recorded is returned from the call. ```python def insert_delegate_mapping_entry( self, nodes: Optional[Union[Node, List[Node]]] = None, handles: Optional[Union[int, List[int]]] = None, identifier: Optional[Union[int, str]] = None, ) -> Union[int, str]: ``` To retrieve the **Debug Handle Map**, use `get_delegate_mapping`. ```python def get_delegate_mapping( self, ) -> Union[Dict[int, Tuple[int]], Dict[str, Tuple[int]]] ``` A demo of the AOT mapping can be found [here](https://github.com/pytorch/executorch/blob/main/exir/backend/test/backend_with_delegate_mapping_demo.py) ## Runtime Logging Corresponding to the AOT map, the runtime then defines the functionality through which these events are logged. ### Real-Time Logging ExecuTorch allows you to log in real time. **Real time Logging** is useful when timestamps are available as the execution occurs. It provides minimal overhead and is intuitive for authors to call. To log events in real-time (for example, explicitly denoting the profiling start and stop), `event_tracer_start_profiling_delegate` is used to create an `EventEntry` and `event_tracer_end_profiling_delegate` is used to conclude the `EventEntry` for the provided `EventTracer`. To start an `EventTracerEntry` using `event_tracer_start_profiling_delegate`, the **Delegate Debug Identifier** (provided AOT to the `debug_handle_map`) is passed as either the name or `delegate_debug_id` argument depending on the **Delegate Debug Identifier** type (str and int respectively) ```c++ EventTracerEntry event_tracer_start_profiling_delegate( EventTracer* event_tracer, const char* name, DebugHandle delegate_debug_id) ``` To conclude an `EventTracerEntry`, `event_tracer_end_profiling_delegate` is simply provided the original `EventTracerEntry`. Optionally, additional runtime `metadata` can also be logged at this point. ```c++ void event_tracer_end_profiling_delegate( EventTracer* event_tracer, EventTracerEntry event_tracer_entry, const void* metadata = nullptr, size_t metadata_len = 0) ``` ### Post-Time Logging ExecuTorch also allows you to log in post time. Some runtime settings don't have access to timestamps while it is executing. **Post-Time Logging** enables authors to still be able to log these events. To log events in post (for example, logging start and end time simultaneously) `event_tracer_log_profiling_delegate` is called with a combination of the arguments used in the real-time logging API’s and timestamps. ```c++ void event_tracer_log_profiling_delegate( EventTracer* event_tracer, const char* name, DebugHandle delegate_debug_id, et_timestamp_t start_time, et_timestamp_t end_time, const void* metadata = nullptr, size_t metadata_len = 0) ``` A demo of the runtime code can be found [here](https://github.com/pytorch/executorch/blob/main/runtime/executor/test/test_backend_with_delegate_mapping.cpp). ## Surfacing custom metadata from delegate events As seen in the runtime logging API's above, users can log an array of bytes along with their delegate profiling event. We make this data available for users in post processing via the [Inspector API](model-inspector.rst). Users can pass a metadata parser when creating an instance of the Inspector. The parser is a callable that deserializes the data and returns a list of strings or a dictionary containing key-value pairs. 
The deserialized data is then added back to the corresponding event in the event block for user consumption. Here's an example of how to write this parser: NOTE: The input to the deserializer is a list where each entry is a series of bytes (essentially each entry is an immutable bytearray). Users are expected to iterate over this list, deserialize each entry and then return it in the expected format which is either a list of strings, or a dict. ```python Inspector( etdump_path=etdump_path, # Optional etrecord=etrecord_path, # Optional, only needed if debugging was enabled. buffer_path=buffer_path, delegate_metadata_parser=parse_delegate_metadata ) def parse_delegate_metadata(delegate_metadatas: List[bytes]) -> Union[List[str], Dict[str, Any]]: metadata_str = [] for metadata_bytes in delegate_metadatas: metadata_str += str(metadata_bytes) return metadata_str ``` --- (desktop-backends)= # Backends Available hardware acceleration backends for desktop platforms. ## Linux Backends - {doc}`desktop-xnnpack` — XNNPACK (CPU acceleration) - {doc}`desktop-openvino` — OpenVINO (Intel hardware optimization) ## macOS Backends - {doc}`desktop-coreml` — CoreML (recommended for Apple Silicon) - {doc}`desktop-mps` — Metal Performance Shaders (Apple Silicon GPU) - {doc}`desktop-xnnpack` — XNNPACK (CPU acceleration) ## Windows Backends - {doc}`desktop-xnnpack` — XNNPACK (CPU acceleration) - {doc}`desktop-openvino` — OpenVINO (Intel hardware optimization) ```{toctree} :hidden: desktop-xnnpack desktop-openvino desktop-coreml desktop-mps --- (desktop-section)= # Desktop & Laptop Platforms Deploy ExecuTorch on Linux, macOS, and Windows with optimized backends for each platform. ## Platform Overview & Runtime - {doc}`using-executorch-cpp` — C++ runtime integration guide - {doc}`using-executorch-building-from-source` — Building ExecuTorch from source ## Backends - {doc}`desktop-backends` — Available desktop backends and platform-specific optimization ## Tutorials - {doc}`raspberry_pi_llama_tutorial` — Cross compiling ExecuTorch for the Raspberry Pi on Linux Host ```{toctree} :hidden: using-executorch-cpp using-executorch-building-from-source desktop-backends raspberry_pi_llama_tutorial --- # Tools ```{toctree} :maxdepth: 1 devtools-overview bundled-io etrecord etdump runtime-profiling model-debugging model-inspector memory-planning-inspection delegate-debugging devtools-tutorial ``` --- # Introduction to the ExecuTorch Developer Tools ExecuTorch has been designed with [productivity](intro-overview.md) as one of its core objectives and the ExecuTorch Developer Tools enable this through the comprehensive suite of tools it provides users to help them profile, debug, and visualize models that they have onboarded onto ExecuTorch. All the components of the Developer Tools have been designed from the ground up with deep integration in both the export process and the runtime. This enables us to provide unique features such as linking back operator execution in the runtime to the line of code in the original eager model that this operator originated from. ## Developer Tools Features The ExecuTorch Developer Tools support the following features: - **BundledProgram** is a utility tool for exporting the model bundled with a sample set of (representative) inputs and expected outputs, so that during runtime users can validate that the actual output is in fact the same as the expected output. 
- **Profiling** models with operator level breakdown of performance stats - Linking back operator performance stats to source code and module hierarchy - Model loading and execution time - **Delegate Integration** - Surfacing performance details from delegate backends - Link back delegate operator execution to the nodes they represent in the edge dialect graph (and subsequently linking back to source code and module hierarchy) - **Debugging** - Intermediate outputs and output quality analysis - **Numerical Discrepancy Detection** - Operator-level numerical discrepancy detection between AOT and runtime intermediate outputs to streamline numerical debugging and validation. - **Memory Allocation Insights** - Visualize how memory is planned, where all the live tensors are at any point in time - **Visualization** - Visualize the model as a computational graph (see more [here](visualize.md)) ## Fundamental components of the Developer Tools In order to fully understand and leverage the power of the Developer Tools in this section, the fundamental components that power the Developer Tools will be detailed. ### ETRecord ETRecord (ExecuTorch Record) is an artifact generated during the export process that stores the graphs and other metadata that is critical for the Developer Tools to be able to link back the performance/debug data sourced from the runtime to the source code of the eager model. To draw a rough equivalence to conventional software development ETRecord can be considered as the binary built with debug symbols that is used for debugging in GNU Project debugger (gdb). More details are available in the [ETRecord documentation](etrecord.rst) on how to generate and store an ETRecord. ### ETDump ETDump (ExecuTorch Dump) is the binary blob that is generated by the runtime after running a model. Similarly as above, to draw a rough equivalence to conventional software development, ETDump can be considered as the coredump of ExecuTorch, but in this case within ETDump we store all the performance and debug data that was generated by the runtime during model execution. ```{note} If you only care about looking at the raw performance data without linking back to source code and other extensive features, an ETDump alone will be enough to leverage the basic features of the Developer Tools. For the full experience, it is recommended that the users also generate an ETRecord. ``` More details are available in the [ETDump documentation](etdump.md) on how to generate and store an ETDump from the runtime. ### Inspector APIs The Inspector Python APIs are the main user entry point into the Developer Tools. They join the data sourced from ETDump and ETRecord to give users access to all the performance and debug data sourced from the runtime along with linkage back to eager model source code and module hierarchy in an easy to use API. More details are available in the [Inspector API documentation](model-inspector.rst) on how to use the Inspector APIs. --- ## Developer Tools Usage Tutorial Please refer to the [Developer Tools tutorial](tutorials/devtools-integration-tutorial) for a walkthrough on how to profile a model in ExecuTorch using the Developer Tools. --- (edge-platforms-section)= # Edge Deploy ExecuTorch on mobile, desktop, and embedded platforms with optimized backends for each. ExecuTorch supports deployment across a wide variety of edge computing platforms, from high-end mobile devices to constrained embedded systems and microcontrollers. 
## Android Deploy ExecuTorch on Android devices with hardware acceleration support. **→ {doc}`android-section` — Complete Android deployment guide** Key features: - Hardware acceleration support (CPU, GPU, NPU) - Multiple backend options (XNNPACK, Vulkan, Qualcomm, MediaTek, ARM, Samsung) - Comprehensive examples and demos ## iOS Deploy ExecuTorch on iOS devices with Apple hardware acceleration. **→ {doc}`ios-section` — Complete iOS deployment guide** Key features: - Apple hardware optimization (CoreML, MPS, XNNPACK) - Swift and Objective-C integration - LLM and computer vision examples ## Desktop & Laptop Platforms Deploy ExecuTorch on Linux, macOS, and Windows with optimized backends. **→ {doc}`desktop-section` — Complete desktop deployment guide** Key features: - Cross-platform C++ runtime - Platform-specific optimization (OpenVINO, CoreML, MPS) - CPU and GPU acceleration options ## Embedded Systems Deploy ExecuTorch on constrained embedded systems and microcontrollers. **→ {doc}`embedded-section` — Complete embedded deployment guide** Key features: - Resource-constrained deployment - DSP and NPU acceleration (Cadence, ARM Ethos-U, NXP) - Custom backend development support - LLM and computer vision examples ## Troubleshooting & Support - **{doc}`using-executorch-troubleshooting`** - Common issues and solutions across all platforms ## Next Steps After choosing your platform: - **{doc}`backends-section`** - Deep dive into backend selection and optimization - **{doc}`llm/working-with-llms`** - Working with Large Language Models on edge devices ```{toctree} :hidden: :maxdepth: 3 :caption: Edge Platforms android-section ios-section desktop-section embedded-section using-executorch-troubleshooting --- (embedded-backends)= # Backends Available hardware acceleration backends for embedded systems. ## DSP Acceleration - {doc}`embedded-cadence` — Cadence Xtensa DSP processors ## NPU Acceleration - {doc}`embedded-arm-ethos-u` — ARM Ethos-U NPU acceleration - {doc}`embedded-nxp` — NXP eIQ Neutron Backend ```{toctree} :hidden: embedded-cadence embedded-arm-ethos-u embedded-nxp --- (embedded-section)= # Embedded Systems Deploy ExecuTorch on constrained embedded systems and microcontrollers. ## API Reference & Development Start here for C++ development with ExecuTorch runtime APIs and essential tutorials. 
- {doc}`executorch-runtime-api-reference` — **Start here**: Complete runtime API reference for embedded development - {doc}`running-a-model-cpp-tutorial` — Step-by-step C++ API tutorial with practical examples - {doc}`extension-module` — Custom module extensions for specialized functionality - {doc}`extension-tensor` — Tensor operations and memory management extensions ## Build & Integration Guide - {doc}`using-executorch-cpp` — Complete setup guide for C++ runtime integration - {doc}`using-executorch-building-from-source` — Building from Source ## Choose Backend for acceleration - {doc}`embedded-backends` — Available embedded backends and acceleration options ## Tutorials - {doc}`tutorial-arm-ethos-u` — Export a simple PyTorch model for the ExecuTorch Ethos-U backend - {doc}`raspberry_pi_llama_tutorial` — Deploy a LLaMA model on a Raspberry Pi - {doc}`pico2_tutorial` — Deploy a demo MNIST model on the Raspberry Pi Pico 2 ```{toctree} :hidden: executorch-runtime-api-reference running-a-model-cpp-tutorial extension-module extension-tensor using-executorch-cpp using-executorch-building-from-source embedded-backends tutorial-arm-ethos-u raspberry_pi_llama_tutorial pico2_tutorial --- # Prerequisite | ETDump - ExecuTorch Dump ETDump (ExecuTorch Dump) is one of the core components of the ExecuTorch Developer Tools. It is the mechanism through which all forms of profiling and debugging data is extracted from the runtime. Users can't parse ETDump directly; instead, they should pass it into the Inspector API, which deserializes the data, offering interfaces for flexible analysis and debugging. ## Generating an ETDump Generating an ETDump is a relatively straightforward process. Users can follow the steps detailed below to integrate it into their application that uses ExecuTorch. 1. ***Include*** the ETDump header in your code. ```C++ #include ``` 2. ***Create*** an Instance of the ETDumpGen class and pass it into the `load_method` call that is invoked in the runtime. ```C++ executorch::etdump::ETDumpGen etdump_gen; Result method = program->load_method(method_name, &memory_manager, &etdump_gen); ``` 3. ***Dump Out the ETDump Buffer*** - after the inference iterations have been completed, users can dump out the ETDump buffer. If users are on a device which has a filesystem, they could just write it out to the filesystem. For more constrained embedded devices, users will have to extract the ETDump buffer from the device through a mechanism that best suits them (e.g. UART, JTAG etc.) ```C++ etdump_result result = etdump_gen.get_etdump_data(); if (result.buf != nullptr && result.size > 0) { // On a device with a file system users can just write it out // to the file-system. FILE* f = fopen(FLAGS_etdump_path.c_str(), "w+"); fwrite((uint8_t*)result.buf, 1, result.size, f); fclose(f); free(result.buf); } ``` 4. ***Compile*** your binary using CMake with the `ET_EVENT_TRACER_ENABLED` pre-processor flag to enable events to be traced and logged into ETDump inside the ExecuTorch runtime. This flag needs to be added to the ExecuTorch library and any operator library that you are compiling into your binary. For reference, you can take a look at `examples/sdk/CMakeLists.txt`. The lines of interest are: ``` target_compile_options(executorch INTERFACE -DET_EVENT_TRACER_ENABLED) target_compile_options(portable_ops_lib INTERFACE -DET_EVENT_TRACER_ENABLED) ``` ### Make sure ET_EVENT_TRACER_ENABLED flag is enabled or ETDump will be empty. 
If the binary is not compiled with the `ET_EVENT_TRACER_ENABLED` preprocessor flag, no trace events will be recorded and the ETDump will be empty. When this flag is missing, the following code: ```C++ ETDumpResult result = etdump_gen.get_etdump_data(); if (result.buf != nullptr && result.size > 0) { ... } ``` will always return: ```C++ result.buf == nullptr result.size == 0 ``` This indicates that no ETDump data was collected, and therefore nothing can be analyzed through the Inspector API. ## Using an ETDump Pass this ETDump into the [Inspector API](model-inspector.rst) to access this data and do post-run analysis. --- # Lowering a Model as a Delegate Audience: ML Engineers, who are interested in applying delegates to accelerate their program in runtime. Backend delegation is an entry point for backends to process and execute PyTorch programs to leverage the performance and efficiency benefits of specialized backends and hardware, while still providing PyTorch users with an experience close to that of the PyTorch runtime. The backend delegate is usually either provided by ExecuTorch or vendors. The way to leverage delegation in your program is via a standard entry point `to_backend`. ## Frontend Interfaces There are three flows for delegating a program to a backend: 1. Lower the whole module to a backend. This is good for testing backends and the preprocessing stage. 1. Lower the whole module to a backend and compose it with another module. This is good for reusing lowered modules exported from other flows. 1. Lower parts of a module according to a partitioner. This is good for lowering models that include both lowerable and non-lowerable nodes, and is the most streamlined process. ### Flow 1: Lowering the whole module This flow starts from a traced graph module with Edge Dialect representation. To lower it, we call the following function which returns a `LoweredBackendModule` (more documentation on this function can be found in the [Export API reference](export-to-executorch-api-reference.rst)) ```python # defined in backend_api.py def to_backend( backend_id: str, edge_program: ExportedProgram, compile_spec: List[CompileSpec], ) -> LoweredBackendModule: ``` Within this function, the backend's `preprocess()` function is called which produces a compiled blob which will be emitted to the flatbuffer binary. The lowered module can be directly captured, or be put back in a parent module to be captured. Eventually the captured module is serialized in the flatbuffer's model that can be loaded by the runtime. The following is an example of this flow: ```python from executorch.exir.backend.backend_api import to_backend import executorch.exir as exir import torch from torch.export import export from executorch.exir import to_edge # The submodule runs in a specific backend. 
In this example, `BackendWithCompilerDemo` backend class LowerableSubModel(torch.nn.Module): def __init__(self): super().__init__() def forward(self, x): return torch.sin(x) # Convert the lowerable module to Edge IR Representation to_be_lowered = LowerableSubModel() example_input = (torch.ones(1), ) to_be_lowered_exir_submodule = to_edge(export(to_be_lowered, example_input)) # Import the backend implementation from executorch.exir.backend.test.backend_with_compiler_demo import ( BackendWithCompilerDemo, ) lowered_module = to_backend('BackendWithCompilerDemo', to_be_lowered_exir_submodule.exported_program(), []) ``` We can serialize the program to a flatbuffer format by directly running: ```python # Save the flatbuffer to a local file save_path = "delegate.pte" with open(save_path, "wb") as f: f.write(lowered_module.buffer()) ``` ### Flow 2: Lowering the whole module and composite Alternatively, after flow 1, we can compose this lowered module with another module: ```python # This submodule runs in executor runtime class NonLowerableSubModel(torch.nn.Module): def __init__(self, bias): super().__init__() self.bias = bias def forward(self, a, b): return torch.add(torch.add(a, b), self.bias) # The composite module, including lower part and non-lowerpart class CompositeModel(torch.nn.Module): def __init__(self): super().__init__() self.non_lowerable = NonLowerableSubModel(torch.ones(1) * 0.3) self.lowerable = lowered_module def forward(self, x): a = self.lowerable(x) b = self.lowerable(a) ret = self.non_lowerable(a, b) return a, b, ret composite_model = CompositeModel() model_inputs = (torch.ones(1), ) exec_prog = to_edge(export(composite_model, model_inputs)).to_executorch() # Save the flatbuffer to a local file save_path = "delegate.pte" with open(save_path, "wb") as f: f.write(exec_prog.buffer) ``` ### Flow 3: Partitioning The third flow also starts from a traced graph module with Edge Dialect representation. To lower certain nodes in this graph module, we can use the overloaded [`to_backend` function](https://github.com/pytorch/executorch/blob/d9eef24bb720804aa7b400b05241487510ae0dc2/exir/backend/backend_api.py#L39). ```python def to_backend( edge_program: ExportedProgram, partitioner: Partitioner, ) -> ExportedProgram: ``` This function takes in a `Partitioner` which adds a tag to all the nodes that are meant to be lowered. It will return a `partition_tags` dictionary mapping tags to backend names and module compile specs. The tagged nodes will then be partitioned and lowered to their mapped backends using Flow 1's process. Available helper partitioners are documented [here](compiler-custom-compiler-passes.md). These lowered modules will be inserted into the top-level module and serialized. 
The following is an example of the flow: ```python import executorch.exir as exir from executorch.exir.backend.backend_api import to_backend from executorch.exir.backend.test.op_partitioner_demo import AddMulPartitionerDemo from executorch.exir.program import ( EdgeProgramManager, to_edge, ) from torch.export import export import torch class Model(torch.nn.Module): def __init__(self): super().__init__() def forward(self, x, y): x = x + y x = x * y x = x - y x = x / y x = x * y x = x + y return x model = Model() model_inputs = (torch.randn(1, 3), torch.randn(1, 3)) core_aten_ep = export(model, model_inputs) edge: EdgeProgramManager = to_edge(core_aten_ep) edge = edge.to_backend(AddMulPartitionerDemo()) exec_prog = edge.to_executorch() # Save the flatbuffer to a local file save_path = "delegate.pte" with open(save_path, "wb") as f: f.write(exec_prog.buffer) ``` ## Runtime After having the program with delegates, to run the model with the backend, we'd need to register the backend. Depending on the delegate implementation, the backend can be registered either as part of global variables or explicitly registered inside the main function. - If it's registered during global variables initialization, the backend will be registered as long as it's statically linked. Users only need to include the library as part of the dependency. - If the vendor provides an API to register the backend, users need to include the library as part of the dependency, and call the API provided by vendors to explicitly register the backend as part of the main function. --- # Examples ```{toctree} :maxdepth: 1 Building an ExecuTorch Android Demo App Building an ExecuTorch iOS Demo App tutorial-arm ``` --- # Exporting to ExecuTorch One of the important steps in getting your PyTorch programs ready for execution on an edge device is exporting them. This is achieved through the use of a PyTorch API called `torch.export`. The `torch.export` documentation, which is part of the PyTorch core library, can be found in the Core PyTorch documentation set. Additionally, we provide a step-by-step tutorial that takes you through the process of exporting a PyTorch program, making it easier for you to understand and implement the process. To learn more about exporting your model: * Complete the [Exporting to ExecuTorch tutorial](tutorials/export-to-executorch-tutorial) . * Read the [torch.export documentation](https://pytorch.org/docs/2.1/export.html). --- # Running an ExecuTorch Model Using the Module Extension in C++ **Author:** [Anthony Shoumikhin](https://github.com/shoumikhin) In the [Detailed C++ Runtime APIs Tutorial](running-a-model-cpp-tutorial.md), we explored the lower-level ExecuTorch APIs for running an exported model. While these APIs offer zero overhead, great flexibility, and control, they can be verbose and complex for regular use. To simplify this and resemble PyTorch's eager mode in Python, we introduce the `Module` facade APIs over the regular ExecuTorch runtime APIs. The `Module` APIs provide the same flexibility but default to commonly used components like `DataLoader` and `MemoryAllocator`, hiding most intricate details. ## Example Let's see how we can run the `SimpleConv` model generated from the [Exporting to ExecuTorch tutorial](tutorials/export-to-executorch-tutorial) using the `Module` and [`TensorPtr`](extension-tensor.md) APIs: ```cpp #include #include using namespace ::executorch::extension; // Create a Module. Module module("/path/to/model.pte"); // Wrap the input data with a Tensor. 
float input[1 * 3 * 256 * 256]; auto tensor = from_blob(input, {1, 3, 256, 256}); // Perform an inference. const auto result = module.forward(tensor); // Check for success or failure. if (result.ok()) { // Retrieve the output data. const auto output = result->at(0).toTensor().const_data_ptr(); } ``` The code now boils down to creating a `Module` and calling `forward()` on it, with no additional setup. Let's take a closer look at these and other `Module` APIs to better understand the internal workings. ## APIs ### Creating a Module Creating a `Module` object is a fast operation that does not involve significant processing time or memory allocation. The actual loading of a `Program` and a `Method` happens lazily on the first inference unless explicitly requested with a dedicated API. ```cpp Module module("/path/to/model.pte"); ``` For a model with data separated into a PTD file, load them together: ```cpp Module module("/path/to/model.pte", "/path/to/model.ptd"); ``` ### Force-Loading a Method To force-load the `Module` (and thus the underlying ExecuTorch `Program`) at any time, use the `load()` function: ```cpp const auto error = module.load(); assert(module.is_loaded()); ``` To force-load a particular `Method`, call the `load_method()` function: ```cpp const auto error = module.load_method("forward"); assert(module.is_method_loaded("forward")); ``` You can also use the convenience function to load the `forward` method: ```cpp const auto error = module.load_forward(); assert(module.is_method_loaded("forward")); ``` **Note:** The `Program` is loaded automatically before any `Method` is loaded. Subsequent attempts to load them have no effect if a previous attempt was successful. ### Querying for Metadata Get a set of method names that a `Module` contains using the `method_names()` function: ```cpp const auto method_names = module.method_names(); if (method_names.ok()) { assert(method_names->count("forward")); } ``` **Note:** `method_names()` will force-load the `Program` when called for the first time. To introspect miscellaneous metadata about a particular method, use the `method_meta()` function, which returns a `MethodMeta` struct: ```cpp const auto method_meta = module.method_meta("forward"); if (method_meta.ok()) { assert(method_meta->name() == "forward"); assert(method_meta->num_inputs() > 1); const auto input_meta = method_meta->input_tensor_meta(0); if (input_meta.ok()) { assert(input_meta->scalar_type() == ScalarType::Float); } const auto output_meta = method_meta->output_tensor_meta(0); if (output_meta.ok()) { assert(output_meta->sizes().size() == 1); } } ``` **Note:** `method_meta()` will also force-load the `Method` the first time it is called. ### Performing an Inference Assuming the `Program`'s method names and their input format are known ahead of time, you can run methods directly by name using the `execute()` function: ```cpp const auto result = module.execute("forward", tensor); ``` For the standard `forward()` method, the above can be simplified: ```cpp const auto result = module.forward(tensor); ``` **Note:** `execute()` or `forward()` will load the `Program` and the `Method` the first time they are called. Therefore, the first inference will take longer, as the model is loaded lazily and prepared for execution unless it was explicitly loaded earlier. ### Setting Input and Output You can set individual input and output values for methods with the following APIs. 
#### Setting Inputs Inputs can be any `EValue`, which includes tensors, scalars, lists, and other supported types. To set a specific input value for a method: ```cpp module.set_input("forward", input_value, input_index); ``` - `input_value` is an `EValue` representing the input you want to set. - `input_index` is the zero-based index of the input to set. For example, to set the first input tensor: ```cpp module.set_input("forward", tensor_value, 0); ``` You can also set multiple inputs at once: ```cpp std::vector inputs = {input1, input2, input3}; module.set_inputs("forward", inputs); ``` **Note:** You can skip the method name argument for the `forward()` method. By pre-setting all inputs, you can perform an inference without passing any arguments: ```cpp const auto result = module.forward(); ``` Or just setting and then passing the inputs partially: ```cpp // Set the second input ahead of time. module.set_input(input_value_1, 1); // Execute the method, providing the first input at call time. const auto result = module.forward(input_value_0); ``` **Note:** The pre-set inputs are stored in the `Module` and can be reused multiple times for the next executions. Don't forget to clear or reset the inputs if you don't need them anymore by setting them to default-constructed `EValue`: ```cpp module.set_input(runtime::EValue(), 1); ``` #### Setting Outputs Only outputs of type Tensor can be set at runtime, and they must not be memory-planned at model export time. Memory-planned tensors are preallocated during model export and cannot be replaced. To set the output tensor for a specific method: ```cpp module.set_output("forward", output_tensor, output_index); ``` - `output_tensor` is an `EValue` containing the tensor you want to set as the output. - `output_index` is the zero-based index of the output to set. **Note:** Ensure that the output tensor you're setting matches the expected shape and data type of the method's output. You can skip the method name for `forward()` and the index for the first output: ```cpp module.set_output(output_tensor); ``` **Note:** The pre-set outputs are stored in the `Module` and can be reused multiple times for the next executions, just like inputs. ### Result and Error Types Most of the ExecuTorch APIs return either `Result` or `Error` types: - [`Error`](https://github.com/pytorch/executorch/blob/main/runtime/core/error.h) is a C++ enum containing valid error codes. The default is `Error::Ok`, denoting success. - [`Result`](https://github.com/pytorch/executorch/blob/main/runtime/core/result.h) can hold either an `Error` if the operation fails, or a payload such as an `EValue` wrapping a `Tensor` if successful. To check if a `Result` is valid, call `ok()`. To retrieve the `Error`, use `error()`, and to get the data, use `get()` or dereference operators like `*` and `->`. ### Profiling the Module Use [ExecuTorch Dump](etdump.md) to trace model execution. Create an `ETDumpGen` instance and pass it to the `Module` constructor. 
After executing a method, save the `ETDump` data to a file for further analysis: ```cpp #include #include #include #include using namespace ::executorch::extension; Module module("/path/to/model.pte", Module::LoadMode::MmapUseMlock, std::make_unique()); // Execute a method, e.g., module.forward(...); or module.execute("my_method", ...); if (auto* etdump = dynamic_cast(module.event_tracer())) { const auto trace = etdump->get_etdump_data(); if (trace.buf && trace.size > 0) { std::unique_ptr guard(trace.buf, free); std::ofstream file("/path/to/trace.etdump", std::ios::binary); if (file) { file.write(static_cast(trace.buf), trace.size); } } } ``` ## Conclusion The `Module` APIs provide a simplified interface for running ExecuTorch models in C++, closely resembling the experience of PyTorch's eager mode. By abstracting away the complexities of the lower-level runtime APIs, developers can focus on model execution without worrying about the underlying details. --- # Managing Tensor Memory in C++ **Author:** [Anthony Shoumikhin](https://github.com/shoumikhin) Tensors are fundamental data structures in ExecuTorch, representing multi-dimensional arrays used in computations for neural networks and other numerical algorithms. In ExecuTorch, the [Tensor](https://github.com/pytorch/executorch/blob/main/runtime/core/portable_type/tensor.h) class doesn’t own its metadata (sizes, strides, dim_order) or data, keeping the runtime lightweight. Users are responsible for supplying all these memory buffers and ensuring that the metadata and data outlive the `Tensor` instance. While this design is lightweight and flexible, especially for tiny embedded systems, it places a significant burden on the user. If your environment requires minimal dynamic allocations, a small binary footprint, or limited C++ standard library support, you’ll need to accept that trade-off and stick with the regular `Tensor` type. Imagine you’re working with a [`Module`](extension-module.md) interface, and you need to pass a `Tensor` to the `forward()` method. You would need to declare and maintain at least the sizes array and data separately, sometimes the strides too, often leading to the following pattern: ```cpp #include using namespace executorch::aten; using namespace executorch::extension; SizesType sizes[] = {2, 3}; DimOrderType dim_order[] = {0, 1}; StridesType strides[] = {3, 1}; float data[] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f}; TensorImpl tensor_impl( ScalarType::Float, std::size(sizes), sizes, data, dim_order, strides); // ... module.forward(Tensor(&tensor_impl)); ``` You must ensure `sizes`, `dim_order`, `strides`, and `data` stay valid. This makes code maintenance difficult and error-prone. Users have struggled to manage lifetimes, and many have created their own ad-hoc managed tensor abstractions to hold all the pieces together, leading to a fragmented and inconsistent ecosystem. ## Introducing TensorPtr To alleviate these issues, ExecuTorch provides `TensorPtr`, a smart pointer that manages the lifecycle of both the tensor's data and its dynamic metadata. With `TensorPtr`, users no longer need to worry about metadata lifetimes separately. Data ownership is determined based on whether it is passed by pointer or moved into the `TensorPtr` as an `std::vector`. Everything is bundled in one place and managed automatically, enabling you to focus on actual computations. 
Here’s how you can use it: ```cpp #include #include using namespace executorch::extension; auto tensor = make_tensor_ptr( {2, 3}, // sizes {1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f}); // data // ... module.forward(tensor); ``` The data is now owned by the tensor instance because it's provided as a vector. To create a non-owning `TensorPtr`, just pass the data by pointer. The `type` is deduced automatically based on the data vector (`float`). `strides` and `dim_order` are computed automatically to default values based on the `sizes` if not specified explicitly as additional arguments. `EValue` in `Module::forward()` accepts `TensorPtr` directly, ensuring seamless integration. `EValue` can now be constructed implicitly with a smart pointer to any type that it can hold. This allows `TensorPtr` to be dereferenced implicitly when passed to `forward()`, and `EValue` will hold the `Tensor` that the `TensorPtr` points to. ## API Overview `TensorPtr` is literally an alias for `std::shared_ptr`, so you can work with it easily without duplicating the data and metadata. Each `Tensor` instance may either own its data or reference external data. ### Creating Tensors There are several ways to create a `TensorPtr`. #### Creating Scalar Tensors You can create a scalar tensor, i.e. a tensor with zero dimensions or with one of the sizes being zero. *Providing A Single Data Value* ```cpp auto tensor = make_tensor_ptr(3.14); ``` The resulting tensor will contain a single value `3.14` of type double, which is deduced automatically. *Providing A Single Data Value with a Type* ```cpp auto tensor = make_tensor_ptr(42, ScalarType::Float); ``` Now the integer `42` will be cast to float and the tensor will contain a single value `42` of type float. #### Owning Data from a Vector When you provide sizes and data vectors, `TensorPtr` takes ownership of both the data and the sizes. *Providing Data Vector* ```cpp auto tensor = make_tensor_ptr( {2, 3}, // sizes {1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f}); // data (float) ``` The type is deduced automatically as `ScalarType::Float` from the data vector. *Providing Data Vector with a Type* If you provide data of one type but specify a different scalar type, the data will be cast to the given type. ```cpp auto tensor = make_tensor_ptr( {1, 2, 3, 4, 5, 6}, // data (int) ScalarType::Double); // double scalar type ``` In this example, even though the data vector contains integers, we specify the scalar type as `Double`. The integers are cast to double, and the new data vector is owned by the `TensorPtr`. Since the `sizes` argument is skipped in this example, the tensor is one-dimensional with a size equal to the length of the data vector. Note that the reverse cast, from a floating-point type to an integral type, is not allowed because that loses precision. Similarly, casting other types to `Bool` is disallowed. *Providing Data Vector as `std::vector`* You can also provide raw data in the form of a `std::vector`, specifying the sizes and scalar type. The data will be reinterpreted according to the provided type. ```cpp std::vector data = /* raw data */; auto tensor = make_tensor_ptr( {2, 3}, // sizes std::move(data), // data as uint8_t vector ScalarType::Int); // int scalar type ``` The `data` vector must be large enough to accommodate all the elements according to the provided sizes and scalar type. #### Non-Owning Data from Raw Pointer You can create a `TensorPtr` that references existing data without taking ownership. 
*Providing Raw Data* ```cpp float data[] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f}; auto tensor = make_tensor_ptr( {2, 3}, // sizes data, // raw data pointer ScalarType::Float); // float scalar type ``` The `TensorPtr` does not own the data, so you must ensure the `data` remains valid. *Providing Raw Data with Custom Deleter* If you want the `TensorPtr` to manage the lifetime of the data, you can provide a custom deleter. ```cpp auto* data = new double[6]{1.0, 2.0, 3.0, 4.0, 5.0, 6.0}; auto tensor = make_tensor_ptr( {2, 3}, // sizes data, // data pointer ScalarType::Double, // double scalar type TensorShapeDynamism::DYNAMIC_BOUND, // default dynamism [](void* ptr) { delete[] static_cast(ptr); }); ``` The `TensorPtr` will call the custom deleter when it is destroyed, i.e., when the smart pointer is reset and no more references to the underlying `Tensor` exist. #### Sharing Existing Tensor Since `TensorPtr` is a `std::shared_ptr`, you can easily create a `TensorPtr` that shares an existing `Tensor`. Any changes made to the shared data are reflected across all instances sharing the same data. *Sharing Existing TensorPtr* ```cpp auto tensor = make_tensor_ptr({2, 3}, {1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f}); auto tensor_copy = tensor; ``` Now `tensor` and `tensor_copy` point to the same data and metadata. #### Viewing Existing Tensor You can create a `TensorPtr` from an existing `Tensor`, copying its properties and referencing the same data. *Viewing Existing Tensor* ```cpp Tensor original_tensor = /* some existing tensor */; auto tensor = make_tensor_ptr(original_tensor); ``` Now the newly created `TensorPtr` references the same data as the original tensor, but has its own metadata copy, so it can interpret or "view" the data differently, but any modifications to the data will be reflected in the original `Tensor` as well. ### Cloning Tensors To create a new `TensorPtr` that owns a copy of the data from an existing tensor: ```cpp Tensor original_tensor = /* some existing tensor */; auto tensor = clone_tensor_ptr(original_tensor); ``` The newly created `TensorPtr` has its own copy of the data, so it can modify and manage it independently. Likewise, you can create a clone of an existing `TensorPtr`. ```cpp auto original_tensor = make_tensor_ptr(/* ... */); auto tensor = clone_tensor_ptr(original_tensor); ``` Note that, regardless of whether the original `TensorPtr` owns the data or not, the newly created `TensorPtr` will own a copy of the data. ### Resizing Tensors The `TensorShapeDynamism` enum specifies the mutability of a tensor's shape: - `STATIC`: The tensor's shape cannot be changed. - `DYNAMIC_BOUND`: The tensor's shape can be changed but cannot contain more elements than it originally had at creation based on the initial sizes. - `DYNAMIC`: The tensor's shape can be changed arbitrarily. Currently, `DYNAMIC` is an alias for `DYNAMIC_BOUND`. When resizing a tensor, you must respect its dynamism setting. Resizing is only allowed for tensors with `DYNAMIC` or `DYNAMIC_BOUND` shapes, and you cannot resize `DYNAMIC_BOUND` tensors to contain more elements than they had initially. 
```cpp
auto tensor = make_tensor_ptr(
    {2, 3},                               // sizes
    {1, 2, 3, 4, 5, 6},                   // data
    ScalarType::Int,
    TensorShapeDynamism::DYNAMIC_BOUND);
// Initial sizes: {2, 3}
// Number of elements: 6

resize_tensor_ptr(tensor, {2, 2});
// The tensor sizes are now {2, 2}
// Number of elements is 4 < initial 6

resize_tensor_ptr(tensor, {1, 3});
// The tensor sizes are now {1, 3}
// Number of elements is 3 < initial 6

resize_tensor_ptr(tensor, {3, 2});
// The tensor sizes are now {3, 2}
// Number of elements is 6 == initial 6

resize_tensor_ptr(tensor, {6, 1});
// The tensor sizes are now {6, 1}
// Number of elements is 6 == initial 6
```

## Convenience Helpers

ExecuTorch provides several helper functions to create tensors conveniently.

### Creating Non-Owning Tensors with `for_blob` and `from_blob`

These helpers allow you to create tensors that do not own the data.

*Using `from_blob()`*

```cpp
float data[] = {1.0f, 2.0f, 3.0f};

auto tensor = from_blob(
    data,                // data pointer
    {3},                 // sizes
    ScalarType::Float);  // float scalar type
```

*Using `for_blob()` with Fluent Syntax*

```cpp
double data[] = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0};

auto tensor = for_blob(data, {2, 3}, ScalarType::Double)
                  .strides({3, 1})
                  .dynamism(TensorShapeDynamism::STATIC)
                  .make_tensor_ptr();
```

*Using Custom Deleter with `from_blob()`*

```cpp
int* data = new int[3]{1, 2, 3};

auto tensor = from_blob(
    data,             // data pointer
    {3},              // sizes
    ScalarType::Int,  // int scalar type
    [](void* ptr) { delete[] static_cast<int*>(ptr); });
```

The `TensorPtr` will call the custom deleter when it is destroyed.

### Creating Empty Tensors

`empty()` creates an uninitialized tensor with the specified sizes.

```cpp
auto tensor = empty({2, 3});
```

`empty_like()` creates an uninitialized tensor with the same sizes as an existing `TensorPtr`.

```cpp
TensorPtr original_tensor = /* some existing tensor */;
auto tensor = empty_like(original_tensor);
```

And `empty_strided()` creates an uninitialized tensor with the specified sizes and strides.

```cpp
auto tensor = empty_strided({2, 3}, {3, 1});
```

### Creating Tensors Filled with Specific Values

`full()`, `zeros()` and `ones()` create a tensor filled with a provided value, zeros, or ones respectively.

```cpp
auto tensor_full = full({2, 3}, 42.0f);
auto tensor_zeros = zeros({2, 3});
auto tensor_ones = ones({3, 4});
```

Similarly to `empty()`, there are extra helper functions `full_like()`, `full_strided()`, `zeros_like()`, `zeros_strided()`, `ones_like()` and `ones_strided()` to create filled tensors with the same properties as an existing `TensorPtr` or with custom strides.

### Creating Random Tensors

`rand()` creates a tensor filled with random values between 0 and 1.

```cpp
auto tensor_rand = rand({2, 3});
```

`randn()` creates a tensor filled with random values from a normal distribution.

```cpp
auto tensor_randn = randn({2, 3});
```

`randint()` creates a tensor filled with random integers between the specified min (inclusive) and max (exclusive) values.

```cpp
auto tensor_randint = randint(0, 10, {2, 3});
```

### Creating Scalar Tensors

In addition to `make_tensor_ptr()` with a single data value, you can create a scalar tensor with `scalar_tensor()`.

```cpp
auto tensor = scalar_tensor(3.14f);
```

Note that the `scalar_tensor()` function expects a value of type `Scalar`. In ExecuTorch, `Scalar` can represent `bool`, `int`, or floating-point types, but not types like `Half` or `BFloat16`, for which you'd need to use `make_tensor_ptr()` instead, bypassing the `Scalar` type.
## Notes on EValue and Lifetime Management

The [`Module`](extension-module.md) interface expects data in the form of `EValue`, a variant type that can hold a `Tensor` or other scalar types. When you pass a `TensorPtr` to a function expecting an `EValue`, you can dereference the `TensorPtr` to get the underlying `Tensor`.

```cpp
TensorPtr tensor = /* create a TensorPtr */;
//...
module.forward(tensor);
```

Or even a vector of `EValue`s for multiple parameters.

```cpp
TensorPtr tensor = /* create a TensorPtr */;
TensorPtr tensor2 = /* create another TensorPtr */;
//...
module.forward({tensor, tensor2});
```

However, be cautious: `EValue` will not hold onto the dynamic data and metadata from the `TensorPtr`. It merely holds a regular `Tensor`, which does not own the data or metadata but refers to them using raw pointers. You need to ensure that the `TensorPtr` remains valid for as long as the `EValue` is in use. This also applies when using functions like `set_input()` or `set_output()` that expect `EValue`.

## Interoperability with ATen

If your code is compiled with the preprocessor flag `USE_ATEN_LIB` enabled, all the `TensorPtr` APIs use `at::` APIs under the hood, e.g. `TensorPtr` becomes a `std::shared_ptr<at::Tensor>`. This allows for seamless integration with the [PyTorch ATen](https://pytorch.org/cppdocs) library.

### API Equivalence Table

Here's a table matching `TensorPtr` creation functions with their corresponding ATen APIs:

| ATen | ExecuTorch |
|---------------------------------------------|---------------------------------------------|
| `at::tensor(data, type)` | `make_tensor_ptr(data, type)` |
| `at::tensor(data, type).reshape(sizes)` | `make_tensor_ptr(sizes, data, type)` |
| `tensor.clone()` | `clone_tensor_ptr(tensor)` |
| `tensor.resize_(new_sizes)` | `resize_tensor_ptr(tensor, new_sizes)` |
| `at::scalar_tensor(value)` | `scalar_tensor(value)` |
| `at::from_blob(data, sizes, type)` | `from_blob(data, sizes, type)` |
| `at::empty(sizes)` | `empty(sizes)` |
| `at::empty_like(tensor)` | `empty_like(tensor)` |
| `at::empty_strided(sizes, strides)` | `empty_strided(sizes, strides)` |
| `at::full(sizes, value)` | `full(sizes, value)` |
| `at::full_like(tensor, value)` | `full_like(tensor, value)` |
| `at::full_strided(sizes, strides, value)` | `full_strided(sizes, strides, value)` |
| `at::zeros(sizes)` | `zeros(sizes)` |
| `at::zeros_like(tensor)` | `zeros_like(tensor)` |
| `at::zeros_strided(sizes, strides)` | `zeros_strided(sizes, strides)` |
| `at::ones(sizes)` | `ones(sizes)` |
| `at::ones_like(tensor)` | `ones_like(tensor)` |
| `at::ones_strided(sizes, strides)` | `ones_strided(sizes, strides)` |
| `at::rand(sizes)` | `rand(sizes)` |
| `at::rand_like(tensor)` | `rand_like(tensor)` |
| `at::randn(sizes)` | `randn(sizes)` |
| `at::randn_like(tensor)` | `randn_like(tensor)` |
| `at::randint(low, high, sizes)` | `randint(low, high, sizes)` |
| `at::randint_like(tensor, low, high)` | `randint_like(tensor, low, high)` |

## Best Practices

- *Manage Lifetimes Carefully*: Even though `TensorPtr` handles memory management, ensure that any non-owned data (e.g., when using `from_blob()`) remains valid while the tensor is in use.
- *Use Convenience Functions*: Utilize helper functions for common tensor creation patterns to write cleaner and more readable code.
- *Be Aware of Data Ownership*: Know whether your tensor owns its data or references external data to avoid unintended side effects or memory leaks.
- *Ensure `TensorPtr` Outlives `EValue`*: When passing tensors to modules that expect `EValue`, ensure that the `TensorPtr` remains valid as long as the `EValue` is in use. ## Conclusion The `TensorPtr` in ExecuTorch simplifies tensor memory management by bundling the data and dynamic metadata into a smart pointer. This design eliminates the need for users to manage multiple pieces of data and ensures safer and more maintainable code. By providing interfaces similar to PyTorch's ATen library, ExecuTorch simplifies the adoption of the new API, allowing developers to transition without a steep learning curve. --- (file-formats-advanced)= # File Formats ExecuTorch file format specifications and internal structure. ## Program File Formats - {doc}`pte-file-format` — PTE (PyTorch ExecuTorch) file format specification - {doc}`ptd-file-format` — PTD file format specification ```{toctree} :hidden: :maxdepth: 1 pte-file-format ptd-file-format --- # Architecture and Components This page describes the technical architecture of ExecuTorch and its individual components. This document is targeted towards engineers who are deploying PyTorch model onto edge devices. **Context** In order to target on-device AI with diverse hardware, critical power requirements, and real-time processing needs, a single monolithic solution is not practical. Instead, a modular, layered, and extensible architecture is desired. ExecuTorch defines a streamlined workflow to prepare (export, transformation, and compilation) and execute a PyTorch program, with opinionated out-of-the-box default components and well-defined entry points for customizations. This architecture greatly improves portability, allowing engineers to use a performant lightweight, cross-platform runtime that easily integrates into different devices and platforms. ## Overview There are three phases to deploy a PyTorch model to on-device: program preparation, runtime preparation, and program execution, as shown in the diagram below, with a number of user entry points. We’ll discuss each step separately in this documentation. ![](executorch_stack.png) **Figure 1.** The figure illustrates the three phases - program preparation, runtime preparation and program execution. ## Program Preparation ExecuTorch extends the flexibility and usability of PyTorch to edge devices. It leverages PyTorch 2 compiler and export functionality ([TorchDynamo](https://pytorch.org/docs/stable/torch.compiler_dynamo_overview.html), [AOTAutograd](https://pytorch.org/functorch/stable/notebooks/aot_autograd_optimizations.html), [Quantization](https://pytorch.org/docs/main/quantization.html), [dynamic shapes](https://pytorch.org/get-started/pytorch-2.0/#pytorch-2x-faster-more-pythonic-and-as-dynamic-as-ever), [control flow](https://pytorch.org/docs/main/export.html#data-shape-dependent-control-flow), etc.) to prepare a PyTorch program for execution on devices. Program preparation is often simply called AOT (ahead-of-time) because export, transformations and compilations to the program are performed before it is eventually run with the ExecuTorch runtime, written in C++. To have a lightweight runtime and small overhead in execution, we push work as much as possible to AOT. Starting from the program source code, below are the steps you would go through to accomplish the program preparation. ### Program Source Code * Like all PyTorch use cases, ExecuTorch starts from model authoring, where standard `nn.Module` eager mode PyTorch programs are created. 
* Export-specific helpers are used to represent advanced features like [control flow](https://pytorch.org/docs/main/export.html#data-shape-dependent-control-flow) (for example, helper functions to trace both branches of if-else) and [dynamic shapes](https://pytorch.org/get-started/pytorch-2.0/#pytorch-2x-faster-more-pythonic-and-as-dynamic-as-ever) (for example, data dependent dynamic shape constraint). ### Export To deploy the program to the device, engineers need to have a graph representation for compiling a model to run on various backends. With [`torch.export()`](https://pytorch.org/docs/main/export.html), an [EXIR](ir-exir.md) (export intermediate representation) is generated with ATen dialect. All AOT compilations are based on this EXIR, but can have multiple dialects along the lowering path as detailed below. * _[ATen Dialect](ir-exir.md#aten-dialect)_. PyTorch Edge is based on PyTorch’s Tensor library ATen, which has clear contracts for efficient execution. ATen Dialect is a graph represented by ATen nodes which are fully ATen compliant. Custom operators are allowed, but must be registered with the dispatcher. It’s flatten with no module hierarchy (submodules in a bigger module), but the source code and module hierarchy are preserved in the metadata. This representation is also autograd safe. * Optionally, _quantization_, either QAT (quantization-aware training) or PTQ (post training quantization) can be applied to the whole ATen graph before converting to Core ATen. Quantization helps with reducing the model size, which is important for edge devices. * _[Core ATen Dialect](ir-ops-set-definition.md)_. ATen has thousands of operators. It’s not ideal for some fundamental transforms and kernel library implementation. The operators from the ATen Dialect graph are decomposed into fundamental operators so that the operator set (op set) is smaller and more fundamental transforms can be applied. The Core ATen dialect is also serializable and convertible to Edge Dialect as detailed below. ### Edge Compilation The Export process discussed above operates on a graph that is agnostic to the edge device where the code is ultimately executed. During the edge compilation step, we work on representations that are Edge specific. * _[Edge Dialect](ir-exir.md#edge-dialect)_. All operators are either compliant with ATen operators with dtype plus memory layout information (represented as `dim_order`) or registered custom operators. Scalars are converted to Tensors. Those specifications allow following steps focusing on a smaller Edge domain. In addition, it enables the selective build which is based on specific dtypes and memory layouts. With the Edge dialect, there are two target-aware ways to further lower the graph to the _[Backend Dialect](compiler-backend-dialect.md)_. At this point, delegates for specific hardware can perform many operations. For example, Core ML on iOS, QNN on Qualcomm, or TOSA on Arm can rewrite the graph. The options at this level are: * _[Backend Delegate](compiler-delegate-and-partitioner.md)_. The entry point to compile the graph (either full or partial) to a specific backend. The compiled graph is swapped with the semantically equivalent graph during this transformation. The compiled graph will be offloaded to the backend (aka `delegated`) later during the runtime for improved performance. * _User-defined passes_. Target-specific transforms can also be performed by the user. Good examples of this are kernel fusion, async behavior, memory layout conversion, and others. 
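As a rough illustration of the backend-delegate entry point described above, the sketch below lowers a toy module to the Edge dialect and hands it to a partitioner, which swaps the claimed subgraphs for delegate calls. The `AddMul` module is a placeholder, and XNNPACK is used here only as an example target; other backend partitioners follow the same pattern.

```python
import torch

from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
from executorch.exir import to_edge


class AddMul(torch.nn.Module):
    def forward(self, x, y):
        return x * y + y


# Capture the program into ATen dialect, then convert to Edge dialect.
exported = torch.export.export(AddMul(), (torch.randn(2, 2), torch.randn(2, 2)))
edge = to_edge(exported)

# Backend Delegate entry point: the partitioner tags the subgraphs the backend
# claims, and those subgraphs are replaced by delegate call nodes.
delegated = edge.to_backend(XnnpackPartitioner())
print(delegated.exported_program().graph)
```

Nodes the partitioner does not claim remain in the Edge dialect graph and are executed by regular operator kernels at runtime.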
### Compile to ExecuTorch Program The Edge program above is good for compilation, but not suitable for the runtime environment. On-device deployment engineers can lower the graph that can be efficiently loaded and executed by the runtime. On most Edge environments, dynamic memory allocation/freeing has significant performance and power overhead. It can be avoided using AOT memory planning, and a static execution graph. * The ExecuTorch runtime is static (in the sense of graph representation, but control flow and dynamic shapes are still supported). To avoid output creation and return, all functional operator representations are converted to out variants (outputs passed as arguments). * Optionally, users can apply their own memory planning algorithms. For example, there can be specific layers of memory hierarchy for an embedded system. Users can have their customized memory planning to that memory hierarchy. * The program is emitted to the format that our ExecuTorch runtime can recognize. Finally, the emitted program can be serialized to [flatbuffer](https://github.com/pytorch/executorch/blob/main/schema/program.fbs) format. ## Runtime Preparation With the serialized program, and provided kernel libraries (for operator calls) or backend libraries (for delegate calls), model deployment engineers can now prepare the program for the runtime. ExecuTorch has the _[selective build](kernel-library-selective-build.md)_ APIs, to build the runtime that links to only kernels used by the program, which can provide significant binary size savings in the resulting application. ## Program Execution The ExecuTorch runtime is written in C++ with minimal dependencies for portability and execution efficiency. Because the program is well prepared AOT, the core runtime components are minimal and include: * Platform abstraction layer * Logging and optionally profiling * Execution data types * Kernel and backend registry * Memory management _Executor_ is the entry point to load the program and execute it. The execution triggers corresponding operator kernels or backend execution from this very minimal runtime. ## Developer Tools It should be efficient for users to go from research to production using the flow above. Productivity is especially important, for users to author, optimize and deploy their models. We provide [ExecuTorch Developer Tools](devtools-overview.md) to improve productivity. The Developer Tools are not in the diagram. Instead it's a tool set that covers the developer workflow in all three phases. During the program preparation and execution, users can use the ExecuTorch Developer Tools to profile, debug, or visualize the program. Since the end-to-end flow is within the PyTorch ecosystem, users can correlate and display performance data along with graph visualization as well as direct references to the program source code and model hierarchy. We consider this to be a critical component for quickly iterating and lowering PyTorch programs to edge devices and environments. --- # Getting Started with ExecuTorch This section is intended to describe the necessary steps to take a PyTorch model and run it using ExecuTorch. To use the framework, you will typically need to take the following steps: - Install the ExecuTorch python package and runtime libraries. - Export the PyTorch model for the target hardware configuration. - Run the model using the ExecuTorch runtime APIs on your development platform. - Deploy the model to the target platform using the ExecuTorch runtime. 
## System Requirements The following are required to install the ExecuTorch host libraries, needed to export models and run from Python. Requirements for target end-user devices are backend dependent. See the appropriate backend documentation for more information. - Python 3.10 - 3.13 - g++ version 7 or higher, clang++ version 5 or higher, or another C++17-compatible toolchain. - Linux (x86_64 or ARM64), macOS (ARM64), or Windows (x86_64). - Intel-based macOS systems require building PyTorch from source (see [Building From Source](using-executorch-building-from-source.md) for instructions). - On Windows, Visual Studio 2022 or later. ## Installation To use ExecuTorch, you will need to install both the Python package and the appropriate platform-specific runtime libraries. Pip is the recommended way to install the ExecuTorch python package. This package includes the dependencies needed to export a PyTorch model, as well as Python runtime bindings for model testing and evaluation. Consider installing ExecuTorch within a virtual environment, such as one provided by [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html#creating-environments) or [venv](https://packaging.python.org/en/latest/guides/installing-using-pip-and-virtual-environments/#create-and-use-virtual-environments). ``` pip install executorch ``` To build the framework from source, see [Building From Source](using-executorch-building-from-source.md). Backend delegates may require additional dependencies. See the appropriate backend documentation for more information. > **_NOTE:_** On Windows, ExecuTorch requires a [Visual Studio Developer Powershell](https://learn.microsoft.com/en-us/visualstudio/ide/reference/command-prompt-powershell?view=vs-2022). Running from outside of a developer prompt will manifest as errors related to CL.exe.
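As a quick post-install sanity check (a minimal sketch; any of the export or runtime modules shown later in this guide would work equally well), you can confirm the Python bindings import correctly:

```python
import torch
from executorch.runtime import Runtime

# If the import and runtime lookup succeed, the package is installed correctly.
runtime = Runtime.get()
print("ExecuTorch is installed; using torch", torch.__version__)
```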
## Preparing the Model Exporting is the process of taking a PyTorch model and converting it to the .pte file format used by the ExecuTorch runtime. This is done using Python APIs. PTE files for common models, such as Llama 3.2, can be found on HuggingFace under [ExecuTorch Community](https://huggingface.co/executorch-community). These models have been exported and lowered for ExecuTorch, and can be directly deployed without needing to go through the lowering process. A complete example of exporting, lowering, and verifying MobileNet V2 is available as a [Colab notebook](https://colab.research.google.com/drive/1qpxrXC3YdJQzly3mRg-4ayYiOjC6rue3?usp=sharing). ### Requirements - A PyTorch model. - Example model inputs, typically as PyTorch tensors. You should be able to successfully run the PyTorch model with these inputs. - One or more target hardware backends. ### Selecting a Backend ExecuTorch provides hardware acceleration for a wide variety of hardware. The most commonly used backends are XNNPACK, for Arm and x86 CPU, Core ML (for iOS), Vulkan (for Android GPUs), and Qualcomm (for Qualcomm-powered Android phones). For mobile use cases, consider using XNNPACK for Android and Core ML or XNNPACK for iOS as a first step. See [Hardware Backends](backends-overview.md) for more information. ### Exporting Exporting is done using Python APIs. ExecuTorch provides a high degree of customization during the export process, but the typical flow is as follows. This example uses the MobileNet V2 image classification model implementation in torchvision, but the process supports any [export-compliant](https://pytorch.org/docs/stable/export.html) PyTorch model. For Hugging Face models, you can find a list of supported models in the [*huggingface/optimum-executorch*](https://github.com/huggingface/optimum-executorch) repo. ```python import torch import torchvision.models as models from torchvision.models.mobilenetv2 import MobileNet_V2_Weights from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner from executorch.exir import to_edge_transform_and_lower model = models.mobilenetv2.mobilenet_v2(weights=MobileNet_V2_Weights.DEFAULT).eval() sample_inputs = (torch.randn(1, 3, 224, 224), ) et_program = to_edge_transform_and_lower( torch.export.export(model, sample_inputs), partitioner=[XnnpackPartitioner()] ).to_executorch() with open("model.pte", "wb") as f: f.write(et_program.buffer) ``` If the model requires varying input sizes, you will need to specify the varying dimensions and bounds as part of the `export` call. See [Model Export and Lowering](using-executorch-export.md) for more information. The hardware backend to target is controlled by the partitioner parameter to `to_edge_transform_and_lower`. In this example, the XnnpackPartitioner is used to target mobile CPUs. See the [backend-specific documentation](backends-overview.md) for information on how to use each backend. Quantization can also be done at this stage to reduce model size and runtime. Quantization is backend-specific. See the documentation for the target backend for a full description of supported quantization schemes. ### Testing the Model After successfully generating a .pte file, it is common to use the Python runtime APIs to validate the model on the development platform. This can be used to evaluate model accuracy before running on-device. 
For the MobileNet V2 model from torchvision used in this example, image inputs are expected as a normalized, float32 tensor with dimensions of (batch, channels, height, width). The output is a tensor containing class logits. See [torchvision.models.mobilenet_v2](https://pytorch.org/vision/main/models/generated/torchvision.models.mobilenet_v2.html) for more information on the input and output tensor format for this model.

```python
import torch
from executorch.runtime import Runtime
from typing import List

runtime = Runtime.get()

input_tensor: torch.Tensor = torch.randn(1, 3, 224, 224)
program = runtime.load_program("model.pte")
method = program.load_method("forward")
output: List[torch.Tensor] = method.execute([input_tensor])
print("Run successfully via executorch")

from torchvision.models.mobilenetv2 import MobileNet_V2_Weights
import torchvision.models as models

eager_reference_model = models.mobilenetv2.mobilenet_v2(weights=MobileNet_V2_Weights.DEFAULT).eval()
eager_reference_output = eager_reference_model(input_tensor)

print("Comparing against original PyTorch module")
print(torch.allclose(output[0], eager_reference_output, rtol=1e-3, atol=1e-5))
```

For complete examples of exporting and running the model, please refer to our [examples GitHub repository](https://github.com/meta-pytorch/executorch-examples/tree/main/mv2/python).

Additionally, for Hugging Face models, the [*huggingface/optimum-executorch*](https://github.com/huggingface/optimum-executorch) library simplifies running these models end-to-end with ExecuTorch using familiar Hugging Face APIs. Visit the repository for specific examples and supported models.
## Running on Device ExecuTorch provides runtime APIs in Java, Objective-C, and C++. Quick Links: - [Android](#android) - [iOS](#ios) - [C++](#c) ### Android #### Installation ExecuTorch provides Java bindings for Android usage, which can be consumed from both Java and Kotlin. To add the library to your app, add the following dependency to gradle build rule. ``` # app/build.gradle.kts dependencies { implementation("org.pytorch:executorch-android:${executorch_version}") } # See latest available versions in https://mvnrepository.com/artifact/org.pytorch/executorch-android ``` #### Runtime APIs Models can be loaded and run from Java or Kotlin using the `Module` class. ```java import org.pytorch.executorch.EValue; import org.pytorch.executorch.Module; import org.pytorch.executorch.Tensor; // … Module model = Module.load("/path/to/model.pte"); Tensor input_tensor = Tensor.fromBlob(float_data, new long[] { 1, 3, height, width }); EValue input_evalue = EValue.from(input_tensor); EValue[] output = model.forward(input_evalue); float[] scores = output[0].toTensor().getDataAsFloatArray(); ``` Note that the [C++](#c) APIs can be used when targeting Android native. For a full example of running a model on Android, see the [DeepLabV3AndroidDemo](https://github.com/meta-pytorch/executorch-examples/tree/main/dl3/android/DeepLabV3Demo). For more information on Android development, including building from source, a full description of the Java APIs, and information on using ExecuTorch from Android native code, see [Using ExecuTorch on Android](using-executorch-android.md). ### iOS #### Installation ExecuTorch supports both iOS and MacOS via C++, as well as hardware backends for CoreML, MPS, and CPU. The iOS runtime library is provided as a collection of .xcframework targets and are made available as a Swift PM package. To get started with Xcode, go to File > Add Package Dependencies. Paste the URL of the ExecuTorch repo into the search bar and select it. Make sure to change the branch name to the desired ExecuTorch version in format “swiftpm-”, (e.g. “swiftpm-0.6.0”). The ExecuTorch dependency can also be added to the package file manually. See [Using ExecuTorch on iOS](using-executorch-ios.md) for more information. #### Runtime APIs Models can be loaded and run from Objective-C using the C++ APIs. For more information on iOS integration, including an API reference, logging setup, and building from source, see [Using ExecuTorch on iOS](using-executorch-ios.md). ### C++ ExecuTorch provides C++ APIs, which can be used to target embedded or mobile devices. The C++ APIs provide a greater level of control compared to other language bindings, allowing for advanced memory management, data loading, and platform integration. #### Installation CMake is the preferred build system for the ExecuTorch C++ runtime. To use with CMake, clone the ExecuTorch repository as a subdirectory of your project, and use CMake's `add_subdirectory("executorch")` to include the dependency. The `executorch` target, as well as kernel and backend targets will be made available to link against. The runtime can also be built standalone to support diverse toolchains. See [Using ExecuTorch with C++](using-executorch-cpp.md) and [Building from Source](using-executorch-building-from-source.md) for a detailed description of build integration, targets, and cross compilation. ``` git clone -b viable/strict https://github.com/pytorch/executorch.git ``` ```cmake # Set CMAKE_CXX_STANDARD to 17 or above. 
set(CMAKE_CXX_STANDARD 17)

# CMakeLists.txt
set(EXECUTORCH_BUILD_PRESET_FILE ${CMAKE_SOURCE_DIR}/executorch/tools/cmake/preset/llm.cmake)

# Set other ExecuTorch options here.
add_subdirectory("executorch")

...

target_link_libraries(
    my_target
    PRIVATE executorch
            executorch::backends
            executorch::extensions
            executorch::kernels)
```

#### Runtime APIs

Both high-level and low-level C++ APIs are provided. The low-level APIs are platform independent, do not dynamically allocate memory, and are most suitable for resource-constrained embedded systems. The high-level APIs are provided as a convenience wrapper around the lower-level APIs, and make use of dynamic memory allocation and standard library constructs to reduce verbosity.

ExecuTorch uses CMake for native builds. Integration is typically done by cloning the ExecuTorch repository and using CMake's `add_subdirectory` to add the dependency.

Loading and running a model using the high-level API can be done as follows:

```cpp
#include <executorch/extension/module/module.h>
#include <executorch/extension/tensor/tensor.h>

using namespace ::executorch::extension;

// Load the model.
Module module("/path/to/model.pte");

// Create an input tensor.
float input[1 * 3 * 224 * 224];
auto tensor = from_blob(input, {1, 3, 224, 224});

// Perform an inference.
const auto result = module.forward(tensor);

if (result.ok()) {
  // Retrieve the output data.
  const auto output = result->at(0).toTensor().const_data_ptr<float>();
}
```

For more information on the C++ APIs, see [Running an ExecuTorch Model Using the Module Extension in C++](extension-module.md) and [Managing Tensor Memory in C++](extension-tensor.md).

For complete examples of building and running a C++ application, please refer to our [examples GitHub repository](https://github.com/meta-pytorch/executorch-examples/tree/main/mv2/cpp).
## Next Steps ExecuTorch provides a high-degree of customizability to support diverse hardware targets. Depending on your use cases, consider exploring one or more of the following pages: - [Export and Lowering](using-executorch-export.md) for advanced model conversion options. - [Backend Overview](backends-overview.md) for available backends and configuration options. - [Using ExecuTorch on Android](using-executorch-android.md) and [Using ExecuTorch on iOS](using-executorch-ios.md) for mobile runtime integration. - [Using ExecuTorch with C++](using-executorch-cpp.md) for embedded and mobile native development. - [Profiling and Debugging](using-executorch-troubleshooting.md) for developer tooling and debugging. - [API Reference](export-to-executorch-api-reference.rst) for a full description of available APIs. - [Examples](https://github.com/pytorch/executorch/tree/main/examples) for demo apps and example code. --- (home)= # Welcome to the ExecuTorch Documentation **ExecuTorch** is PyTorch's solution for efficient AI inference on edge devices — from mobile phones to embedded systems. ## Key Value Propositions - **Portability:** Run on diverse platforms, from high-end mobile to constrained microcontrollers - **Performance:** Lightweight runtime with full hardware acceleration (CPU, GPU, NPU, DSP) - **Productivity:** Use familiar PyTorch tools from authoring to deployment --- ## 🎯 Wins & Success Stories ::::{grid} 1 :class-container: success-showcase :::{grid-item-card} :class-header: bg-primary text-white :class-body: text-center [View All Success Stories →](success-stories) ::: :::: --- ## Quick Navigation ::::{grid} 2 :::{grid-item-card} **Get Started** :link: quick-start-section :link-type: doc New to ExecuTorch? Start here for installation and your first model deployment. ::: :::{grid-item-card} **Deploy on Edge Platforms** :link: edge-platforms-section :link-type: doc Deploy on Android, iOS, Laptops / Desktops and embedded platforms with optimized backends. ::: :::{grid-item-card} **Work with LLMs** :link: llm/working-with-llms :link-type: doc Export, optimize, and deploy Large Language Models on edge devices. ::: :::{grid-item-card} 🔧 **Developer Tools** :link: tools-section :link-type: doc Profile, debug, and inspect your models with comprehensive tooling. 
::: :::: --- ## Explore Documentation ::::{grid} 1 :::{grid-item-card} **Intro** :link: intro-section :link-type: doc **Overview, architecture, and core concepts** — Understand how ExecuTorch works and its benefits ::: :::: ::::{grid} 1 :::{grid-item-card} **Quick Start** :link: quick-start-section :link-type: doc **Get started with ExecuTorch** — Install, export your first model, and run inference ::: :::: ::::{grid} 1 :::{grid-item-card} **Edge** :link: edge-platforms-section :link-type: doc **Android, iOS, Desktop, Embedded** — Platform-specific deployment guides and examples ::: :::: ::::{grid} 1 :::{grid-item-card} **Backends** :link: backends-section :link-type: doc **CPU, GPU, NPU/Accelerator backends** — Hardware acceleration and backend selection ::: :::: ::::{grid} 1 :::{grid-item-card} **LLMs** :link: llm/working-with-llms :link-type: doc **LLM export, optimization, and deployment** — Complete LLM workflow for edge devices ::: :::: ::::{grid} 1 :::{grid-item-card} **Advanced** :link: advanced-topics-section :link-type: doc **Quantization, memory planning, custom passes** — Deep customization and optimization ::: :::: ::::{grid} 1 :::{grid-item-card} **Tools** :link: tools-section :link-type: doc **Developer tools, profiling, debugging** — Comprehensive development and debugging suite ::: :::: ::::{grid} 1 :::{grid-item-card} **API** :link: api-section :link-type: doc **API Reference Usages & Examples** — Detailed Python, C++, and Java API references ::: :::: ::::{grid} 1 :::{grid-item-card} **💬 Support** :link: support-section :link-type: doc **FAQ, troubleshooting, contributing** — Get help and contribute to the project ::: :::: --- ## What's Supported ::::{grid} 3 :::{grid-item} **Model Types** - Large Language Models (LLMs) - Computer Vision (CV) - Speech Recognition (ASR) - Text-to-Speech (TTS) - More ... ::: :::{grid-item} **Platforms** - Android & iOS - Linux, macOS, Windows - Embedded & MCUs - Go **→ {doc}`edge-platforms-section`** ::: :::{grid-item} **Rich Acceleration** - CPU - GPU - NPU - DSP - Go **→ {doc}`backends-section`** ::: :::: ```{toctree} :hidden: :maxdepth: 1 intro-section quick-start-section edge-platforms-section backends-section llm/working-with-llms advanced-topics-section tools-section api-section support-section --- This page describes how ExecuTorch works and its key benefits. # How ExecuTorch Works At a high-level, there are three steps for running a PyTorch model with ExecuTorch across edge devices, such as laptops, mobile phones, wearables, and IoT devices. 1. **Export the model.** The first step is to capture the PyTorch program as a graph, which is a new representation of the model that can be expressed in terms of a series of operators such as addition, multiplication, or convolution. This process safely preserves the semantics of the original PyTorch program. This representation is the first step to enable running the model on edge use cases that have low memory and/or low compute. 1. **Compile the exported model to an ExecuTorch program.** Given an exported model from step 1, convert it to an executable format called an ExecuTorch program that the runtime can use for inference. This step provides entry points for various optimizations such as compressing the model (e.g., quantization) to reduce size and further compiling subgraphs down to on-device specialized hardware accelerators to improve latency. It also provides an entry point for memory planning, i.e. 
to efficiently plan the location of intermediate tensors to reduce the runtime memory footprint. 1. **Run the ExecuTorch program on a target device.** Given an input--such as an image represented as an input activation tensor--the ExecuTorch runtime loads the ExecuTorch program, executes the instructions represented by the program, and computes an output. This step is efficient because (1) the runtime is lightweight and (2) an efficient execution plan has already been calculated in steps 1 and 2, making it possible to do performant inference. Furthermore, portability of the core runtime enables performant execution even on highly-constrained devices. This figure illustrates the three-step process of exporting a PyTorch program, compiling it into an ExecuTorch program that targets a specific hardware device, and finally executing the program on the device using the ExecuTorch runtime. ![name](_static/img/how-executorch-works-high-level.png) ## Key Benefits ExecuTorch provides the following benefits to engineers who need to deploy machine learning models to an edge device: * **Export that is robust and powerful.** Export uses [`torch.export()`](https://pytorch.org/docs/main/export.html), which uses the same technology used in PyTorch 2.x to capture PyTorch programs for fast execution. While eager mode is flexible and allows experimentation in Python, it may not work well if Python isn't available or cannot deliver efficient execution. The _Export Intermediate Representation (Export IR)_ that export flow generates can describe a wide range of dynamism in PyTorch models, including control flow and dynamic shapes, which makes it a powerful tool for fully capturing existing PyTorch models with little effort. * **Operator standardization.** During the graph export process, the nodes in the graph represent operators such as addition, multiplication, or convolution. These operators are part of a small standardized list called the [Core ATen Op set](https://pytorch.org/docs/main/torch.compiler_ir.html#core-aten-ir). Most PyTorch programs can be decomposed into a graph using this small set of operators during export. Small list of standardized operators reduces the surface, needed to be covered, by third-party operator libraries as well as accelerator backends, in order to run models exported for ExecuTorch. ExecuTorch runtime ships with one such library, called portable operator library, that implements core ATen opset. * **Standardization for compiler interfaces (aka delegates) and the OSS ecosystem.** In addition to the _Operator standardization_ above, ExecuTorch has a [standardized interface](compiler-delegate-and-partitioner.md) for delegation to compilers. This allows third-party vendors and compilers to implement interfaces and API entry points for compilation and execution of (either partial or full) graphs targeting their specialized hardware. This provides greater flexibility in terms of hardware support and performance optimization, as well as easier integration with the PyTorch open source ecosystem for on-device AI. * **First-party Developer Tools** Due to the above standardization efforts, it was possible to build unified first-party [developer tools](devtools-overview.md) for ExecuTorch, where developers can export, compile, and deploy to a wide range of target devices—such as iOS, Android, and microcontrollers—using the same APIs, streamlining the process and increasing productivity. 
Additionally, ExecuTorch provides profiling and debugging functionality to easily inspect intermediate states, which are core parts of most developer workflows. * **No intermediate conversions necessary.** ExecuTorch's main design principle is to allow developers to run their models on target devices without the need for converting to third-party intermediate representations. This eliminates a number of problems that on-device developers typically face when working with these conversion steps, such as lack of debuggability and profiling, the need to familiarize themselves with hardware-specific tools, and models not being able to run due to conversion steps failing. * **Ease of customization.** Developers can optimize their deployment for even better performance gains on the target architecture by applying custom techniques, such as [linking with high-performance operator implementations](kernel-library-custom-aten-kernel.md) or [customizing memory planning](compiler-memory-planning.md) based on storage and latency trade-offs. This level of customization is made possible through the standardization of the [compiler pass interface](compiler-custom-compiler-passes.md) and registration APIs on exported graphs. * **Low overhead runtime and execution.** The ExecuTorch runtime, written in C++, is highly efficient and can run on a wide range of architectures, including Linux, iOS, Android, embedded systems, and bare metal hardware, with little additional setup or configuration. It is capable of linking in only those operators needed for the model, resulting in a minimal runtime binary size. It is also able to run at low latency because of ahead-of-time compilation and memory planning stages, with the runtime responsible only for execution (e.g., call operator `conv` and save the result in memory location X). The above highlights the key advantages of ExecuTorch across three main categories: portability, productivity, and performance. We consider it to be an ideal choice for enabling on-device AI across mobile and edge computing platforms. --- # ExecuTorch Overview **ExecuTorch** is an end-to-end solution for enabling on-device inference capabilities across mobile and edge devices including wearables, embedded devices and microcontrollers. It is part of the PyTorch Edge ecosystem and enables efficient deployment of PyTorch models to edge devices. Key value propositions of ExecuTorch are: - **Portability:** Compatibility with a wide variety of computing platforms, from high-end mobile phones to highly constrained embedded systems and microcontrollers. - **Productivity:** Enabling developers to use the same toolchains and Developer Tools from PyTorch model authoring and conversion, to debugging and deployment to a wide variety of platforms. - **Performance:** Providing end users with a seamless and high-performance experience due to a lightweight runtime and utilizing full hardware capabilities such as CPUs, NPUs, and DSPs. ## Why ExecuTorch? Supporting on-device AI presents unique challenges with diverse hardware, critical power requirements, low/no internet connectivity, and real-time processing needs. These constraints have historically prevented or slowed down the creation of scalable and performant on-device AI solutions. We designed ExecuTorch, backed by our industry partners like Meta, Arm, Apple, and Qualcomm, to be highly portable and provide superior developer productivity without losing on performance. ## How is ExecuTorch Different from PyTorch Mobile (Lite Interpreter)? 
PyTorch Mobile uses [TorchScript](https://pytorch.org/docs/stable/jit.html) to allow PyTorch models to run on devices with limited resources. ExecuTorch has a significantly smaller memory size and a dynamic memory footprint resulting in superior performance and portability compared to PyTorch Mobile. Also, ExecuTorch does not rely on TorchScript, and instead leverages PyTorch 2 compiler and export functionality for on-device execution of PyTorch models. Read more in-depth technical overview topics about ExecuTorch: - [How ExecuTorch Works](intro-how-it-works.md) - [High-level Architecture and Components of ExecuTorch](getting-started-architecture.md) - [ExecuTorch Runtime Overview](runtime-overview.md) --- (intro-section)= # Intro Overview, architecture, and core concepts of ExecuTorch. ExecuTorch is PyTorch's solution for training and inference on the Edge, providing portability, productivity, and performance for edge computing platforms. ## Getting Started with ExecuTorch New to ExecuTorch? Start with these foundational topics: - **{doc}`intro-overview`** - High-level overview of ExecuTorch capabilities - **{doc}`intro-how-it-works`** - Technical overview of the ExecuTorch workflow - **{doc}`getting-started-architecture`** - System architecture and components - **{doc}`concepts`** - Core concepts and terminology ```{toctree} :hidden: :maxdepth: 2 :caption: Introduction Topics intro-overview intro-how-it-works getting-started-architecture concepts ``` --- (ios-backends)= # Backends Available hardware acceleration backends for iOS deployment. ## Apple Hardware Acceleration (Recommended) - {doc}`ios-coreml` — CoreML (NPU/GPU, recommended for iOS) - {doc}`ios-mps` — Metal Performance Shaders (GPU) ## CPU Acceleration - {doc}`ios-xnnpack` — XNNPACK (CPU acceleration) ```{toctree} :hidden: ios-coreml ios-mps ios-xnnpack --- # Examples & Demos - [iOS LLM Examples Repository](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/apple) - [MobileViT Demo App](https://github.com/meta-pytorch/executorch-examples/tree/main/mv3/apple/ExecuTorchDemo) --- (ios-section)= # iOS Deploy ExecuTorch on iOS devices with Apple hardware acceleration. ## Quick Start & Integration - {doc}`using-executorch-ios` — Complete iOS integration guide ## Backends - {doc}`ios-backends` — Available iOS backends and acceleration options ## Examples & Demos - {doc}`ios-examples` — Explore iOS Examples & Demos ```{toctree} :hidden: using-executorch-ios ios-backends ios-examples --- # Export IR Specification Export IR is an intermediate representation (IR) for the result of `torch.export`. To read more on the details of Export IR, please read this [document](https://pytorch.org/docs/main/export.ir_spec.html). The Exported IR is a specification that consists of the following parts: 1. A definition of computation graph model. 2. Set of operators allowed in the graph. A **dialect** is an Exported IR graph composed with the operations defined below, but with additional properties (such as restrictions on operator set or metadata) that are meant for a specific purpose. The EXIR dialects that currently exist are: * [ATen Dialect](#aten-dialect) * [Edge Dialect](#edge-dialect) * [Backend Dialect](#backend-dialect) These dialects represent stages that a captured program goes through from program capture to conversion into an executable format. 
For example, the ExecuTorch compilation process starts from a Python program capture into ATen Dialect, then ATen Dialect is converted to Edge Dialect, Edge to Backend, and finally to a binary format for execution. ## ATen Dialect ATen dialect will be used as the entry point of the ExecuTorch compilation pipeline. It is the first time an eager mode PyTorch program becomes an Exported IR graph. At this stage, functionalization is performed, removing any tensor aliases and mutations, and allowing for more flexible graph transformations to be made. Additionally, all tensors are converted to continuous format. The goal of this dialect is to capture users' programs as faithfully as possible (while remaining valid Exported IR). Registered custom operators that user has called in eager mode will preserve as-is in ATen dialect. However, we should refrain from adding custom ops in the graph via passes. For now, the function of ATen dialect is to further lower to Edge dialect. However, in the future we can see this one as the common integration point for other export use cases. ### ATen Dialect Properties An ATen dialect graph is a valid Export IR graph with the following additional properties: 1. All operators in `call_function` nodes are either ATen operators (in the `torch.ops.aten` namespace, higher order operators (like control flow operators), or a registered custom operator. A registered custom operator is an operator registered into the current PyTorch eager mode runtime, usually with `TORCH_LIBRARY` call (implies schema). Details for how to register a custom operator can be found [here](https://docs.google.com/document/d/1_W62p8WJOQQUzPsJYa7s701JXt0qf2OfLub2sbkHOaU/edit#heading=h.3rgxk3v387wl). 2. Every operator must also have a meta kernel. A meta kernel is a function that, given the shapes of the input tensors, can return the shape of output tensor. Details on how to write a meta kernel can be found [here](https://docs.google.com/document/d/1GgvOe7C8_NVOMLOCwDaYV1mXXyHMXY7ExoewHqooxrs/edit#heading=h.64r4npvq0w0). 3. Input value type must be “Pytree-able”. As a consequence, the output types are also Pytree-able because all operators output are pytree-able. 4. Ops of ATen dialect can choose to work Dynamic dtypes, implicit type promotions and implicit broadcasting of tensors. 5. All tensors memory formats are in `torch.contiguous_format`. ### ATen Operator Definition The operator set definition can be found [here](ir-ops-set-definition.md). ## Edge Dialect This dialect is meant to introduce specializations that are useful for Edge devices but not necessarily for general (server) export. However, we still withhold specializing further to each different hardware. In other words, we don’t want to introduce any new hardware dependent concepts or data; besides those already present in users’ original python program. ### Edge Dialect Properties An Edge dialect graph is a valid Export IR graph with the following additional properties: 1. All operators in OpCall nodes are either from a predefined operator set, called **“Edge Operators”**, or a registered custom operator. An Edge operator is a ATen operator with dtype specialization. This allows users to register kernels that only work for certain dtypes to reduce binary size. 2. Input and output of the graph, and as well as to every node, cannot be Scalar. I.e. All scalar types (such as float, int) are converted to Tensor. ### Using the Edge Dialect The Edge dialect is represented with `exir.EdgeProgramManager` Python class in memory. 
This contains one or multiple `torch.export.ExportedProgram`s which contain the graph representation of a method. ```python import torch from executorch import exir class MyModule(torch.nn.Module): ... a = MyModule() tracing_inputs = (torch.rand(2, 2),) aten_dialect_program = torch.export.export(a, tracing_inputs) edge_dialect_program: exir.EdgeProgramManager = exir.to_edge(aten_dialect_program) print(edge_dialect_program.exported_program) ``` At this point, user defined graph transformation can be run through `edge_dialect_program.transform(pass)`. Order matters. Note: If the custom pass is touching `node.target`, be aware that all of the `node.target` at this stage are "Edge ops" (more details below) and not torch ops like in the ATen dialect. A tutorial on pass writing can be found [here](compiler-custom-compiler-passes.md). After all these passes are executed, `to_edge()` will make sure the graph is still valid. ### Edge Operators As mentioned before, an edge operator is an ATen core operator with type specialization. This means an instance of the edge operator contains a set of dtype constraints, that describe all the tensor dtypes supported by both the ExecuTorch runtime and their ATen kernels. These dtype constraints are expressed in a DSL defined in [edge.yaml](https://github.com/pytorch/executorch/blob/main/exir/dialects/edge/edge.yaml). Here's an example of the dtype constraints: ``` - func: sigmoid namespace: edge inherits: aten::sigmoid type_alias: T0: [Bool, Byte, Char, Int, Long, Short] T1: [Double, Float] T2: [Float] type_constraint: - self: T0 __ret_0: T2 - self: T1 __ret_0: T1 ``` This is saying if `self` tensor is one of the type `Bool, Byte, Char, Int, Long, Short`, then the return tensor would be `Float`. If `self` is one of `Double, Float`, the return tensor will be the same dtype. After these dtype constraints are collected and documented in edge.yaml, EXIR consumes the file, and loads the constraints into EXIR Edge operators. This makes it convenient for developers to learn the supported dtypes of any argument in the Edge op schema. For example we can do: ```python from executorch.exir.dialects._ops import ops as exir_ops # import dialects ops sigmoid = exir_ops.edge.aten.sigmoid.default print(sigmoid._schema) # aten::sigmoid(Tensor self) -> Tensor self_arg = sigmoid._schema.arguments[0] _return = sigmoid._schema.returns[0] print(self_arg.allowed_types) # {torch.float32, torch.int8, torch.float64, torch.int16, torch.int32, torch.int64, torch.uint8, torch.bool} print(_return.allowed_types) # {torch.float32, torch.float64} ``` These constraints are helpful for someone who wants to write a custom kernel for this operator. Also inside EXIR, we offer a validator to check if the graph is still complying with these dtype constraints, after custom transformations. ### Op Set (WIP) Check out [edge.yaml](https://github.com/pytorch/executorch/blob/main/exir/dialects/edge/edge.yaml) for the complete list of operators having dtype constraints specified. We are gradually expanding this operator set and targeting to provide dtype constraints for all core ATen ops. ## Backend Dialect See this [doc](compiler-backend-dialect.md) --- # Definition of the Core ATen Operator Set This page provides the description and background of the Core ATen Operator Set (opset). This page is recommended reading for those developing a new kernel library or delegate for ExecuTorch. 
It is also recommended that one is familiar with [`torch.export`](https://pytorch.org/docs/main/export.html) as a prerequisite; in particular, the concepts of torch FX graphs, operator decomposition, and functionalization. The list of operators that have been identified as a Core ATen operator can be found on the [IRs page of the PyTorch documentation website](https://pytorch.org/docs/main/torch.compiler_ir.html). ## What is an Operator Set? `torch.export` performs a full graph capture on a given PyTorch program, producing a graph IR that describes the computation performed by the program. An operator (i.e. an operation performed on a Tensor) is the basic unit of computation in the graph, often corresponding to a unique node in the graph IR. The primary source of operators is the [ATen library](https://pytorch.org/cppdocs/#aten); outside of ATen operators, developers can also define their own operators (i.e. custom operators). An “ATen operator set” or “ATen opset” is the set of ATen operators that can be used to represent a PyTorch program once it has been captured into a graph IR. ## The Functional ATen Operator Set The program capture mechanism of `torch.export` produces a functionalized graph, which only allows functional operators (i.e. operators that do not mutate or alias inputs). Therefore, `torch.export` produces a graph that will contain the functional ATen opset, which contains only functional ATen operators. ## The Core ATen Operator Set An exported graph can be further transformed by applying operator decompositions. This process will replace specified ATen operators with equivalent sequences of other ATen operators. For instance, `aten.hardsigmoid` can be replaced with `aten.clamp(aten.clamp(self + 3, min=0), max=6) / 6`. If a PyTorch program is decomposed with the default decomposition settings, then the resulting graph IR will contain the “core ATen” opset. This opset will be a subset of the functional ATen opset, as some operators will be decomposed. ATen operators that are a part of the core ATen opset (i.e. core ATen operators) will not be decomposed under the default decomposition setting. Generally, core ATen operators cannot be easily re-expressed by other ATen operators through decomposition. The key motivation behind the core ATen opset is to reduce the number of operators that need to be handled by PyTorch backends and compilers once a model is exported. Not only are there a great number of operators defined in the ATen library, but new operators may be added, or the schema of existing operators may change. Without operator decomposition, backends built on top of the IR produced by `torch.export` would have to deal with both a large operator surface, as well as an opset that is constantly in flux. The core ATen opset addresses this by defining a much smaller, more manageable set of operators that was developed with stability in mind. ## Development of the Core ATen Operator Set Although ExecuTorch uses the core ATen opset, it is not specific to ExecuTorch. One of the primary design goals of the core ATen opset is that it should be as generic as possible; the vast majority of use-cases will not want to decompose the operators contained within it. By extension, the decompositions implied by the core ATen opset should be useful to the vast majority of use-cases. Another key consideration was to keep the opset as minimal as possible, but not at the expense of imposing decompositions that would have a profound negative impact on performance or developer experience. 
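To make the decomposition step concrete, the sketch below exports a small module that uses `hardsigmoid` (the module itself is purely illustrative) and then applies the default decompositions. The exact replacement sequence depends on the decomposition table, but `aten.hardsigmoid` is re-expressed in terms of more fundamental core ATen operators:

```python
import torch


class HardSigmoidModule(torch.nn.Module):
    def forward(self, x):
        return torch.nn.functional.hardsigmoid(x)


# Full-graph capture produces a functional ATen graph containing aten.hardsigmoid.
ep = torch.export.export(HardSigmoidModule(), (torch.randn(4),))
print(ep.graph)

# Applying the default decompositions yields a graph in the core ATen opset,
# where hardsigmoid has been decomposed into more fundamental operators.
core_ep = ep.run_decompositions()
print(core_ep.graph)
```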
The core ATen opset was developed by reviewing a list of ATen operators created by surveying models in public GitHub repositories in addition to well-known open source models. The purpose of the surveying process was to obtain a reduced list of ATen operators that is a proxy of which ATen operators are used the most. This way the most commonly used operators may be reviewed first. The decision of whether each operator should be a core operator or be decomposed by the Core ATen Decomposition Table was determined by: 1. Examining potential decompositions of the operators; the decomposition should be a relatively straightforward re-expression of the operator using other ATen operators. * The decomposition shouldn’t look like an outright implementation of the operator. * The decomposition shouldn't vary based on run-time characteristics of the input. * We also consider if decomposing the operator will impact the precision, numerical validity or memory layout of the output. 2. Thinking about whether developers would want to preserve the operator in the graph for performance or other reasons. * For instance, perhaps an operator can be decomposed but it can map to a single hardware instruction on most platforms, in which case it would be preferable to promote it to a core operator. ## Future Work Until every ATen operator has been reviewed and given a designation of “core” or “decomposed by default”, the core ATen opset cannot be considered fully complete. However, this is a monumental task, and there is a long tail of operators that are not often used. This is why an approach was taken where models were surveyed to determine which ops were the most commonly used which allowed “higher impact” operators to be prioritized. Nonetheless, there are still many operators which have not been evaluated. The plan is to continue evaluating additional operators as the need arises; the PyTorch community may propose additional core operators or additional core decompositions through opening a GitHub issue or by [commenting on this post on the PyTorch Forums](https://dev-discuss.pytorch.org/t/defining-the-core-aten-opset/1464). --- # IR Specification ```{toctree} :maxdepth: 1 ir-exir ir-ops-set-definition ``` --- (kernel-library-advanced)= # Kernel Library Deep Dive Advanced kernel implementation and customization for ExecuTorch. ## Kernel Library Overview - {doc}`kernel-library-overview` — Architecture and design of the kernel library - {doc}`kernel-library-custom-aten-kernel` — Kernel registration and customization ## Build Optimization - {doc}`kernel-library-selective-build` — Selective build for reduced binary footprint ```{toctree} :hidden: :maxdepth: 1 kernel-library-overview kernel-library-custom-aten-kernel kernel-library-selective-build --- # Kernel Registration ## Overview At the last stage of [ExecuTorch model exporting](export-overview.md), we lower the operators in the dialect to the _out variants_ of the [core ATen operators](ir-ops-set-definition.md). Then we serialize these operator names into the model artifact. During runtime execution, for each operator name we will need to find the actual _kernels_, i.e., the C++ functions that do the heavy-lifting calculations and return results. ## Kernel Libraries ### First-party kernel libraries: **[Portable kernel library](https://github.com/pytorch/executorch/tree/main/kernels/portable)** is the in-house default kernel library that covers most of the core ATen operators. It’s easy to use/read and is written in portable C++17. 
However it’s not optimized for performance, because it’s not specialized for any certain target. Therefore we provide kernel registration APIs for ExecuTorch users to easily register their own optimized kernels. **[Optimized kernel library](https://github.com/pytorch/executorch/tree/main/kernels/optimized)** specializes on performance for some of the operators, leveraging existing third party libraries such as [EigenBLAS](https://gitlab.com/libeigen/eigen). This works best along with the portable kernel library, with a good balance on portability and performance. One example of combining these two libraries can be found [here](https://github.com/pytorch/executorch/blob/main/configurations/CMakeLists.txt). **[Quantized kernel library](https://github.com/pytorch/executorch/tree/main/kernels/quantized)** implements operators for quantization and dequantization. These are out of core ATen operators but are vital to most of the production use cases. ### Custom kernel libraries: **Custom kernels implementing core ATen ops**. Even though we don't have an internal example for custom kernels for core ATen ops, the optimized kernel library can be viewed as a good example. We have optimized [`add.out`](https://github.com/pytorch/executorch/blob/main/kernels/optimized/cpu/op_add.cpp) and a portable [`add.out`](https://github.com/pytorch/executorch/blob/main/kernels/portable/cpu/op_add.cpp). When user is combining these two libraries, we provide APIs to choose which kernel to use for `add.out`. In order to author and use custom kernels implementing core ATen ops, using the [YAML based approach](#yaml-entry-for-core-aten-op-out-variant) is recommended, because it provides full fledged support on 1. combining kernel libraries and define fallback kernels; 2. using selective build to minimize the kernel size. A **[Custom operator](https://github.com/pytorch/executorch/tree/main/extension/llm/custom_ops)** is any operator that an ExecuTorch user defines outside of PyTorch's [`native_functions.yaml`](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/native_functions.yaml). ## Operator & Kernel Contract All the kernels mentioned above, whether they are in-house or customized, should comply with the following requirements: * Match the calling convention derived from operator schema. The kernel registration API will generate headers for the custom kernels as references. * Satisfy the dtype constraints defined in edge dialect. For tensors with certain dtypes as arguments, the result of a custom kernel needs to match the expected dtypes. The constraints are available in edge dialect ops. * Give correct result. We will provide a testing framework to automatically test the custom kernels. ## APIs These are the APIs available to register kernels/custom kernels/custom ops into ExecuTorch: * [YAML Entry API](#yaml-entry-api-high-level-architecture) - [for core ATen op with custom kernels](#yaml-entry-api-for-core-aten-op-out-variant) - [for custom ops](#yaml-entry-api-for-custom-ops) - [CMake Macros](#cmake-macros) * C++ API - [for custom ops](#c-api-for-custom-ops) - [CMake Example](#compile-and-link-the-custom-kernel) If it's not clear which API to use, please see [Best Practices](#custom-ops-api-best-practices). ### YAML Entry API High Level Architecture ![](_static/img/kernel-library-custom-aten-kernel.png) ExecuTorch users are asked to provide: 1. the custom kernel library with C++ implementations 2. 
a YAML file associated with the library that describes what operators are being implemented by this library. For partial kernels, the yaml file also contains information on the dtypes and dim orders supported by the kernel. More details in the API section. ### YAML Entry API Workflow At build time, the yaml files associated with kernel libraries will be passed to the _kernel resolver_ along with the model op info (see selective build doc) and the outcome is a mapping between a combination of operator names and tensor metadata, to kernel symbols. Then codegen tools will use this mapping to generate C++ bindings that connect the kernels to ExecuTorch runtime. ExecuTorch users need to link this generated library into their application to use these kernels. At static object initialization time, kernels will be registered into the ExecuTorch kernel registry. At runtime initialization stage, ExecuTorch will use the operator name and argument metadata as a key to lookup for the kernels. For example, with “aten::add.out” and inputs being float tensors with dim order (0, 1, 2, 3), ExecuTorch will go into the kernel registry and lookup for a kernel that matches the name and the input metadata. ### YAML Entry API for Core ATen Op Out Variant Top level attributes: * `op` (if the operator appears in `native_functions.yaml`) or `func` for custom operator. The value for this key needs to be the full operator name (including overload name) for `op` key, or a full operator schema (namespace, operator name, operator overload name and schema string), if we are describing a custom operator. For schema syntax please refer to this [instruction](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/README.md). * `kernels`: defines kernel information. It consists of `arg_meta` and `kernel_name`, which are bound together to describe "for input tensors with these metadata, use this kernel". * `type_alias`(optional): we are giving aliases to possible dtype options. `T0: [Double, Float]` means `T0` can be one of `Double` or `Float`. * `dim_order_alias`(optional): similar to `type_alias`, we are giving names to possible dim order options. Attributes under `kernels`: * `arg_meta`: a list of "tensor arg name" entries. The values for these keys are dtypes and dim orders aliases, that are implemented by the corresponding `kernel_name`. This being `null` means the kernel will be used for all types of input. * `kernel_name`: the expected name of the C++ function that will implement this operator. You can put whatever you want to here, but you should follow the convention of replacing the `.` in the overload name with an underscore, and lowercasing all characters. In this example, `add.out` uses the C++ function named `add_out`. `add.Scalar_out` would become `add_scalar_out`, with a lowercase `S`. We support namespace for kernels, but note that we will be inserting a `native::` to the last level of namespace. So `custom::add_out` in the `kernel_name` will point to `custom::native::add_out`. 
Some examples of operator entries:

An out variant of a core ATen operator with a default kernel:

```yaml
- op: add.out
  kernels:
    - arg_meta: null
      kernel_name: torch::executor::add_out
```

An ATen operator with a dtype/dim order specialized kernel (works for the `Double` dtype, with dim order (0, 1, 2, 3)):

```yaml
- op: add.out
  type_alias:
    T0: [Double]
  dim_order_alias:
    D0: [[0, 1, 2, 3]]
  kernels:
    - arg_meta:
        self: [T0, D0]
        other: [T0, D0]
        out: [T0, D0]
      kernel_name: torch::executor::add_out
```

### YAML Entry API for Custom Ops

As mentioned above, this option provides more support in terms of selective build and features such as merging operator libraries.

First we need to specify the operator schema as well as a `kernels` section. So instead of `op` we use `func` with the operator schema. As an example, here’s a yaml entry for a custom op:

```yaml
- func: allclose.out(Tensor self, Tensor other, float rtol=1e-05, float atol=1e-08, bool equal_nan=False, bool dummy_param=False, *, Tensor(a!) out) -> Tensor(a!)
  kernels:
    - arg_meta: null
      kernel_name: torch::executor::allclose_out
```

The `kernels` section is the same as the one defined for core ATen ops. For the operator schema, we are reusing the DSL defined in this [README.md](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/README.md), with a few differences:

#### Out variants only

ExecuTorch only supports out-style operators, where:

* The caller provides the output Tensor or Tensor list in the final position with the name `out`.
* The C++ function modifies and returns the same `out` argument.
* If the return type in the YAML file is `()` (which maps to void), the C++ function should still modify `out` but does not need to return anything.
* The `out` argument must be keyword-only, which means it needs to follow an argument named `*`, as in the `add.out` example above.
* Conventionally, these out operators are named using the pattern `<name>.out` or `<name>.<overload>_out`.

Since all output values are returned via an `out` parameter, ExecuTorch ignores the actual C++ function return value. But, to be consistent, functions should always return `out` when the return type is non-`void`.

#### Can only return `Tensor` or `()`

ExecuTorch only supports operators that return a single `Tensor`, or the unit type `()` (which maps to `void`). It does not support returning any other types, including lists, optionals, tuples, or scalars like `bool`.

#### Supported argument types

ExecuTorch does not support all of the argument types that core PyTorch supports. Here's a list of the argument types we currently support:

* Tensor
* int
* bool
* float
* str
* Scalar
* ScalarType
* MemoryFormat
* Device
* `Optional<Type>`
* `List<Type>`
* `List<Optional<Tensor>>`
* `Optional<List<Tensor>>`

#### CMake Macros

We provide build-time macros to help users build their kernel registration library. The macros take the yaml file describing the kernel library as well as model operator metadata, and package the generated C++ bindings into a C++ library. The macros are available in CMake.

`generate_bindings_for_kernels(FUNCTIONS_YAML functions_yaml CUSTOM_OPS_YAML custom_ops_yaml)` takes a yaml file for core ATen op out variants as well as a yaml file for custom ops, and generates C++ bindings for kernel registration. It also depends on the selective build artifact generated by `gen_selected_ops()`; see the selective build doc for more information. Then `gen_operators_lib` will package those bindings into a C++ library.
As an example:

```cmake
# SELECT_OPS_LIST: aten::add.out,aten::mm.out
gen_selected_ops("" "${SELECT_OPS_LIST}" "")

# Look for functions.yaml associated with the portable lib and generate C++ bindings
generate_bindings_for_kernels(FUNCTIONS_YAML ${EXECUTORCH_ROOT}/kernels/portable/functions.yaml)

# Prepare a C++ library called "generated_lib", with _kernel_lib being the portable
# kernel library and executorch as a dependency.
gen_operators_lib("generated_lib" KERNEL_LIBS ${_kernel_lib} DEPS executorch)

# Link "generated_lib" into the application:
target_link_libraries(executorch_binary generated_lib)
```

We also provide the ability to merge two yaml files, given a precedence. `merge_yaml(FUNCTIONS_YAML functions_yaml FALLBACK_YAML fallback_yaml OUTPUT_DIR out_dir)` merges `functions_yaml` and `fallback_yaml` into a single yaml file; if there are duplicate entries, this macro always takes the one from `functions_yaml`.

Example:

```yaml
# functions.yaml
- op: add.out
  kernels:
    - arg_meta: null
      kernel_name: torch::executor::opt_add_out
```

And the fallback:

```yaml
# fallback.yaml
- op: add.out
  kernels:
    - arg_meta: null
      kernel_name: torch::executor::add_out
```

The merged yaml will contain the entry from functions.yaml.

### C++ API for Custom Ops

Unlike the YAML entry API, the C++ API only uses the C++ macros `EXECUTORCH_LIBRARY` and `WRAP_TO_ATEN` for kernel registration, and it has no selective build support. This makes the API faster in terms of development speed, since users don't have to do YAML authoring and build system tweaking. Please refer to [Custom Ops Best Practices](#custom-ops-api-best-practices) on which API to use.

Similar to [`TORCH_LIBRARY`](https://pytorch.org/cppdocs/library.html#library_8h_1a0bd5fb09d25dfb58e750d712fc5afb84) in PyTorch, `EXECUTORCH_LIBRARY` takes the operator name and the C++ function name and registers them into the ExecuTorch runtime.

#### Prepare custom kernel implementation

Define your custom operator schema for both the functional variant (used in AOT compilation) and the out variant (used in the ExecuTorch runtime). The schema needs to follow the PyTorch ATen convention (see `native_functions.yaml`). For example:

```yaml
custom_linear(Tensor weight, Tensor input, Tensor? bias) -> Tensor
custom_linear.out(Tensor weight, Tensor input, Tensor? bias, *, Tensor(a!) out) -> Tensor(a!)
```

Then write your custom kernel according to the schema using ExecuTorch types, along with APIs to register it into the ExecuTorch runtime:

```c++
// custom_linear.h / custom_linear.cpp
// Pulls in the ExecuTorch Tensor and optional types used below.
#include <executorch/runtime/kernel/kernel_includes.h>

Tensor& custom_linear_out(const Tensor& weight, const Tensor& input,
                          optional<Tensor> bias, Tensor& out) {
  // calculation
  return out;
}
```

#### Use a C++ macro to register it into ExecuTorch

Append the following line to the example above:

```c++
// custom_linear.h / custom_linear.cpp
// opset namespace myop
EXECUTORCH_LIBRARY(myop, "custom_linear.out", custom_linear_out);
```

Now we need to write a wrapper for this op so it shows up in PyTorch, but don’t worry, we don’t need to rewrite the kernel.
Create a separate .cpp for this purpose:

```c++
// custom_linear_pytorch.cpp
#include "custom_linear.h"

#include <ATen/ATen.h>
#include <torch/library.h>

at::Tensor custom_linear(const at::Tensor& weight, const at::Tensor& input,
                         std::optional<at::Tensor> bias) {
  // initialize out
  at::Tensor out = at::empty({weight.size(1), input.size(1)});

  // wrap the kernel in custom_linear.cpp into an ATen kernel
  WRAP_TO_ATEN(custom_linear_out, 3)(weight, input, bias, out);

  return out;
}

// standard API to register ops into PyTorch
TORCH_LIBRARY(myop, m) {
  m.def("custom_linear(Tensor weight, Tensor input, Tensor? bias) -> Tensor", custom_linear);
  m.def("custom_linear.out(Tensor weight, Tensor input, Tensor? bias, *, Tensor(a!) out) -> Tensor(a!)",
        WRAP_TO_ATEN(custom_linear_out, 3));
}
```

#### Compile and link the custom kernel

Link it into the ExecuTorch runtime: in the `CMakeLists.txt` that builds the binary/application, we need to add custom_linear.h/cpp into the binary target. We can also build a dynamically loaded library (.so or .dylib) and link it. Here's an example of how to do it:

```cmake
# For executorch_target_link_options_shared_lib
include(${EXECUTORCH_ROOT}/tools/cmake/Utils.cmake)

# Add a custom op library
add_library(custom_op_lib SHARED ${CMAKE_CURRENT_SOURCE_DIR}/custom_op.cpp)

# Include the header
target_include_directories(custom_op_lib PUBLIC ${CMAKE_CURRENT_SOURCE_DIR}/include)

# Link ExecuTorch library
target_link_libraries(custom_op_lib PUBLIC executorch)

# Define a binary target
add_executable(custom_op_runner main.cpp)

# Link this library with --whole-archive !! IMPORTANT !!
# This is to avoid the operators being stripped by the linker.
executorch_target_link_options_shared_lib(custom_op_lib)

# Link custom op lib
target_link_libraries(custom_op_runner PUBLIC custom_op_lib)
```

Link it into the PyTorch runtime: we need to package custom_linear.h, custom_linear.cpp and custom_linear_pytorch.cpp into a dynamically loaded library (.so or .dylib) and load it into our python environment. One way of doing this is:

```python
import torch

# Load the shared library (use the .dylib suffix on macOS).
torch.ops.load_library("libcustom_linear.so")

# Now we have access to the custom op, backed by the kernel implemented in custom_linear.cpp.
op = torch.ops.myop.custom_linear.default
```

#### Using a Custom Operator in a Model

The custom operator can be used explicitly in the PyTorch model, or you can write a transformation to replace instances of a core operator with the custom variant. For this example, you could find all instances of `torch.nn.Linear` and replace them with `CustomLinear`.

```python
def replace_linear_with_custom_linear(module):
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            setattr(
                module,
                name,
                CustomLinear(child.in_features, child.out_features, child.bias),
            )
        else:
            replace_linear_with_custom_linear(child)
```

The remaining steps are the same as the normal flow. Now you can run this module in eager mode as well as export it to ExecuTorch.

### Custom Ops API Best Practices

Given that there are two kernel registration APIs for custom ops, which API should you use?
Here are some pros and cons for each API: * C++ API: - Pros: * Only C++ code changes are needed * Resembles PyTorch custom ops C++ API * Low maintenance cost - Cons: * No selective build support * No centralized bookkeepping * Yaml entry API: - Pros: * Has selective build support * Provides a centralized place for custom ops - It shows what ops are being registered and what kernels are bound to these ops, for an application - Cons: * User needs to create and maintain yaml files * Relatively inflexible to change the op definition Overall if we are building an application and it uses custom ops, during the development phase it's recommended to use the C++ API since it's low-cost to use and flexible to change. Once the application moves to production phase where the custom ops definitions and the build systems are quite stable and binary size is to be considered, it is recommended to use the Yaml entry API. --- # Overview of ExecuTorch’s Kernel Libraries This page provides a description of the Portable Kernel Library and the Optimized Kernel Library, which are the default kernel libraries shipped with ExecuTorch. It is recommended reading for those who are interested in executing ExecuTorch programs with these kernel libraries, or for those who want to implement their own kernels and kernel libraries. An ExecuTorch program encodes instructions that describe the computation that should be performed by the program. Many of these instructions will correspond to calling a specific ATen operator, for example `aten.convolution`. However, one of the core design principles of ExecuTorch is that the signature of an operator should be separate from the implementation of the operator. This means that the ExecuTorch runtime does not ship with any standard implementation for ATen operators; users must make sure to link against kernel libraries that contain implementations of the operators required by their ExecuTorch program, and configure [operator registration](kernel-library-custom-aten-kernel.md) to map an operator signature to the desired implementation. This makes it easy to adjust the implementation of operators such as `aten.convolution` that will be called when executing an ExecuTorch program; it allows users to select the exact operator implementations that will meet the unique performance, memory usage, battery usage, etc. constraints of their use-case. **In essence, a kernel library is simply a collection of ATen operator implementations that follow a common theme or design principle**. Note that due to ExecuTorch’s selective build process (discussed in the following section), operator implementations are linked individually. This means that users can easily mix different kernel libraries in their build without sacrificing build size. ExecuTorch ships with two kernel libraries by default: the **Portable Kernel Library** and the **Optimized Kernel Library**, both of which provide CPU operator implementations. ## Portable Kernel Library The Portable Kernel Library is in a sense the “reference” kernel library that is used by ExecuTorch. The Portable Kernel Library was developed with the following goals in mind: * Correctness * Provide straightforward implementations of ATen operators that are strictly consistent with the original implementation of the operator in PyTorch’s ATen library * Readability / Simplicity * Provide clear, readable source code so that those who want to develop custom implementations of an operator can easily understand the desired behavior of the operator. 
* Portability * Portable Kernels should be just as portable as the ExecuTorch runtime; operator implementations should not use any external dependencies, or use any unsanctioned features of C++. * Operator Coverage * As the “reference” kernel library for ExecuTorch, the Portable Kernel Library aims to have a high degree of operator coverage. The goal is for the Portable Kernel library to provide an implementation for every operator listed as a Core ATen operator. However, note that operator coverage for the Portable Kernel Library is still a work in progress. The Portable Kernel Library primarily aims to provide easily accessible operator implementations that will “just work” on most platforms, and are guaranteed to provide correct output. Performance is a non-goal for the Portable Kernel Library. In fact, many bottleneck operators such as convolution and matrix multiplication are implemented in the most straightforward way possible in the interest of prioritizing simplicity and readability. Therefore, one should not expect to observe fast inference times if exclusively using the Portable Kernel library. However, outside of specific bottleneck operators, most operators are simple enough where the straightforward implementation of the Portable Kernel Library should still provide adequate performance. Binary size is also a non-goal for the Portable Kernel Library. ## Optimized Kernel Library The Optimized Kernel Library is a supplemental kernel library shipped with ExecuTorch that, in contrast to the Portable Kernel Library, aims to provide performance focused implementations of operators at the cost of portability and readability. Many operator implementations in the Optimized Kernel Library are inspired or based off of the corresponding implementation in PyTorch’s ATen library, so in many cases one can expect the same degree of performance. Generally speaking, operators in the Optimized Kernel Library are optimized in one of two ways: 1. Using CPU vector intrinsics 2. Using optimized math libraries, such as `sleef` and `OpenBLAS` Although portability is not a design goal of the Optimized Kernel Library, implementations are not meant to be fine-tuned for a specific CPU architecture. Instead, the Optimized Kernel library seeks to provide performant implementations that can be applied across a variety of platforms, rather than using optimizations that are specific to a single platform. Another important note is that operator coverage is also a non-goal for the Optimized Kernel Library. There are no plans to add optimized kernels for every Core ATen operator; rather, optimized kernels are added on an as-needed basis to improve performance on specific models. Thus, the operator coverage in the Optimized Kernel Library will be much more limited compared to the Portable Kernel Library. --- # Kernel Library Selective Build _Selective build_ is a build mode on ExecuTorch that uses model metadata to guide ExecuTorch build. This build mode contains build tool APIs available on CMake. ExecuTorch users can use selective build APIs to build an ExecuTorch runtime binary with minimal binary size by only including operators required by models. This document aims to help ExecuTorch users better use selective build, by listing out available APIs, providing an overview of high level architecture and showcasing examples. 
Preread: [Overview of the ExecuTorch runtime](runtime-overview.md), [High-level architecture and components of ExecuTorch](getting-started-architecture.md) ## Design Principles **Why selective build?** Many ExecuTorch use cases are constrained by binary size. Selective build can reduce the binary size of the ExecuTorch runtime without compromising support for a target model. **What are we selecting?** Our core ExecuTorch library is around 50kB with no operators/kernels or delegates. If we link in kernel libraries such as the ExecuTorch in-house portable kernel library, the binary size of the whole application surges, due to unused kernels being registered into the ExecuTorch runtime. Selective build is able to apply a filter on the kernel libraries, so that only the kernels actually being used are linked, thus reducing the binary size of the application. **How do we select?** Selective build provides APIs to allow users to pass in _op info_, operator metadata derived from target models. Selective build tools will gather these op info and build a filter for all kernel libraries being linked in. ## High Level Architecture ![](_static/img/kernel-library-selective-build.png) Note that all of the selective build tools are running at build-time (to be distinguished from compile-time or runtime). Therefore selective build tools only have access to static data from user input or models. The basic flow looks like this: 1. For each of the models we plan to run, we extract op info from it, either manually or via a Python tool. Op info will be written into yaml files and generated at build time. 2. An _op info aggregator _will collect these model op info and merge them into a single op info yaml file. 3. A _kernel resolver _takes in the linked kernel libraries as well as the merged op info yaml file, then makes a decision on which kernels to be registered into ExecuTorch runtime. ## Selective Build CMake Options To enable selective build when building the executorch kernel libraries as part of a CMake build, the following CMake options are exposed. These options affect the `executorch_kernels` CMake target. Make sure to link this target when using selective build. * `EXECUTORCH_SELECT_OPS_YAML`: A path to a YAML file specifying the operators to include. * `EXECUTORCH_SELECT_OPS_LIST`: A string containing the operators to include. * `EXECUTORCH_SELECT_OPS_MODEL`: A path to a PTE file. Only operators used in this model will be included. * `EXECUTORCH_ENABLE_DTYPE_SELECTIVE_BUILD`: If enabled, operators will be further specialized to only operator on the data types specified in the operator selection. Note that `EXECUTORCH_SELECT_OPS_YAML`, `EXECUTORCH_SELECT_OPS_LIST`, and `EXECUTORCH_SELECT_OPS_MODEL` are mutually exclusive. Only one operator specifier directive is allowed. As an example, to build with only operators used in mv2_xnnpack_fp32.pte, the CMake build can be configured as follows. ``` cmake .. 
-DEXECUTORCH_SELECT_OPS_MODEL=mv2_xnnpack_fp32.pte ``` ## APIs For fine-grained control, we expose a CMake macro [gen_selected_ops](https://github.com/pytorch/executorch/blob/main/tools/cmake/Codegen.cmake#L12) to allow users to specify op info: ``` gen_selected_ops( LIB_NAME # the name of the selective build operator library to be generated OPS_SCHEMA_YAML # path to a yaml file containing operators to be selected ROOT_OPS # comma separated operator names to be selected INCLUDE_ALL_OPS # boolean flag to include all operators OPS_FROM_MODEL # path to a pte file of model to select operators from DTYPE_SELECTIVE_BUILD # boolean flag to enable dtype selection ) ``` The macro makes a call to gen_oplist.py, which requires a [distinct selection](https://github.com/pytorch/executorch/blob/main/codegen/tools/gen_oplist.py#L222-L228) of API choice. `OPS_SCHEMA_YAML`, `ROOT_OPS`, `INCLUDE_ALL_OPS`, and `OPS_FROM_MODEL` are mutually exclusive options, and should not be used in conjunction. ### Select all ops If this input is set to true, it means we are registering all the kernels from all the kernel libraries linked into the application. If set to true it is effectively turning off selective build mode. ### Select ops from schema yaml Context: each kernel library is designed to have a yaml file associated with it. For more information on this yaml file, see [Kernel Library Overview](kernel-library-overview.md). This API allows users to pass in the schema yaml for a kernel library directly, effectively allowlisting all kernels in the library to be registered. ### Select root ops from operator list This API lets users pass in a list of operator names. Note that this API can be combined with the API above and we will create a allowlist from the union of both API inputs. ### Select ops from model This API lets users pass in a pte file of an exported model. When used, the pte file will be parsed to generate a yaml file that enumerates the operators and dtypes used in the model. ### Dtype Selective Build Beyond pruning the binary to remove unused operators, the binary size can further reduced by removing unused dtypes. For example, if your model only uses floats for the `add` operator, then including variants of the `add` operators for `doubles` and `ints` is unnecessary. The flag `DTYPE_SELECTIVE_BUILD` can be set to `ON` to support this additional optimization. Currently, dtype selective build is only supported with the model API described above. Once enabled, a header file that specifies only the operators and dtypes used by the model is created and linked against a rebuild of the `portable_kernels` lib. This feature is only supported for the portable kernels library; it's not supported for optimized, quantized or custom kernel libraries. ## Example Walkthrough In [examples/selective_build/CMakeLists.txt](https://github.com/pytorch/executorch/blob/main/examples/selective_build/advanced/CMakeLists.txt), we have the following cmake config options: 1. `EXECUTORCH_SELECT_OPS_YAML` 2. `EXECUTORCH_SELECT_OPS_LIST` 3. `EXECUTORCH_SELECT_ALL_OPS` 4. `EXECUTORCH_SELECT_OPS_FROM_MODEL` 5. `EXECUTORCH_DTYPE_SELECTIVE_BUILD` These options allow a user to tailor the cmake build process to utilize the different APIs, and results in different invocations on the `gen_selected_ops` [function](https://github.com/pytorch/executorch/blob/main/examples/selective_build/advanced/CMakeLists.txt). 
The following table describes some examples of how the invocation changes when these configs are set:

| Example cmake Call | Resultant `gen_selected_ops` Invocation |
| :----: | :---: |
| `cmake -D… -DEXECUTORCH_SELECT_OPS_LIST="aten::add.out,aten::mm.out"` | `gen_selected_ops("" "${SELECT_OPS_LIST}" "" "" "")` |
| `cmake -D… -DEXECUTORCH_SELECT_OPS_YAML=ON` | `set(_custom_ops_yaml ${EXECUTORCH_ROOT}/examples/portable/custom_ops/custom_ops.yaml)` `gen_selected_ops("${_custom_ops_yaml}" "" "")` |
| `cmake -D… -DEXECUTORCH_SELECT_OPS_FROM_MODEL="model.pte.out"` | `gen_selected_ops("" "" "" "${_model_path}" "")` |
| `cmake -D… -DEXECUTORCH_SELECT_OPS_FROM_MODEL="model.pte.out" -DEXECUTORCH_DTYPE_SELECTIVE_BUILD=ON` | `gen_selected_ops("" "" "" "${_model_path}" "ON")`
| --- # Kernel Library ```{toctree} :maxdepth: 1 kernel-library-overview kernel-library-custom-aten-kernel kernel-library-selective-build ``` --- # Run Llama 3 3B Instruct on Android (with Qualcomm AI Engine Direct Backend) This tutorial demonstrates how to export and run the Llama 3 3B Instruct model on a Qualcomm device using the Qualcomm AI Engine Direct Backend via ExecuTorch. We use a static Llama [implementation](https://github.com/pytorch/executorch/blob/main/examples/qualcomm/oss_scripts/llama/model/static_llama.py) to optimize performance and memory usage during on-device inference. ## Prerequisites - Set up your ExecuTorch repo and environment if you haven’t done so by following [the Setting up ExecuTorch](../getting-started-setup.rst) to set up the repo and dev environment. - Read [the Building and Running ExecuTorch with Qualcomm AI Engine Direct Backend page](../backends-qualcomm.md) to understand how to export and run a model with Qualcomm AI Engine Direct Backend on Qualcomm device. - Follow [the README for executorch llama](https://github.com/pytorch/executorch/tree/main/examples/models/llama) to know how to run a llama model on mobile via ExecuTorch. - A Qualcomm device with 16GB RAM - We are continuing to optimize our memory usage to ensure compatibility with lower memory devices. - The version of [Qualcomm AI Engine Direct SDK](https://developer.qualcomm.com/software/qualcomm-ai-engine-direct-sdk) is 2.28.0 or above. ## Instructions ### Step 1: Prepare the checkpoint and tokenizer of the model. 1. For Llama 3 tokenizer and checkpoint, please refer to [instructions](https://www.llama.com/models/llama-3) for further instructions on how to download `tokenizer.model`, `consolidated.00.pth` and `params.json`. ### Step 2: Export to ExecuTorch with Qualcomm AI Engine Direct Backend Deploying large language models like Llama 3 on-device presents the following challenges: 1. The model size is too large to fit in device memory for inference. 2. High model loading and inference time. 3. Difficulty in quantization. To address these, we apply the following optimizations: 1. Quantization: Apply the `quant_recipe` when setting the quantization config to reduce model size and memory usage. 2. Mixed Precision Quantization: compresses KV cache tensors to 8-bit and applies `QuantDtype.use_16a8w` to the LM head. 3. Model Sharding: Set `num_sharding` = 4 to shard the model into sub-parts. This helps reduce memory pressure and improve performance during on-device inference. The number of shards might be different depending on the model size. 4. Graph Transformations: Convert operations into accelerator-friendly formats for better runtime performance. You can find the full optimization configuration in this [file](https://github.com/pytorch/executorch/blob/main/examples/qualcomm/oss_scripts/llama/__init__.py), as shown below: ``` python @register_llm_model("llama3_2-3b_instruct") @dataclass(init=False, frozen=True) class Llama3_2_3B_Instruct(LLMModelConfig): repo_id = None params_path = None convert_weights = None transform_weight = True # The Llama3_2 enabled should be instruct, however, Llama's tokenizer does not provide utility to apply chat template. instruct_model = False num_sharding = 4 masked_softmax = False # SeqMSE Quantization: optimizes the parameter encodings of each layer of a model individually to minimize the difference between the layer’s original and quantized outputs. 
(Implementation details: ./backends/qualcomm/_passes/seq_mse.py) In this configuration, we set `seq_mse_candidates` = 0, which means SeqMSE quantization is not applied. seq_mse_candidates = 0 r1 = False r2 = False r3 = False # quant recipe quant_recipe = Llama3_3BQuantRecipe ``` To export with the Qualcomm AI Engine Direct Backend, ensure the following: 1. The host machine has more than 64GB of memory (RAM + swap space). 2. The entire process takes a few hours. ```bash # export llama python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --checkpoint consolidated.00.pth --params params.json --tokenizer_model tokenizer.model --decoder_model llama3_2-3b_instruct --model_mode kv --max_seq_len 1024 --prompt "I would like to learn python, could you teach me with a simple example?" --tasks wikitext --limit 1 --compile_only ``` Note: end-to-end [instructions](https://github.com/pytorch/executorch/blob/main/examples/qualcomm/oss_scripts/llama/README.md) ### Step 3: Invoke the Runtime on an Android smartphone with Qualcomm SoCs **3.1 Connect your android phone** **3.2 Make sure the following artifact is present before running the model.** -- artifact/ └── llama_qnn.pte **3.3 Run model** ```bash # Run llama python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --checkpoint consolidated.00.pth --params params.json --tokenizer_model tokenizer.model --decoder_model llama3_2-3b_instruct --model_mode kv --max_seq_len 1024 --prompt "I would like to learn python, could you teach me with a simple example?" --tasks wikitext --limit 1 --pre_gen_pte ${PATH_TO_ARTIFACT} ``` ## What is coming? - Performance improvements - Reduce the memory pressure during inference to support 12GB Qualcomm devices - Broader LLM Support via [Optimum ExecuTorch](https://github.com/huggingface/optimum-executorch?tab=readme-ov-file#llms-large-language-models) - Already supported models (e.g.): Llama2, Llama3, Gemma, Qwen, Phi-4, SmolLM. For usage examples, please refer to [README](https://github.com/pytorch/executorch/blob/main/examples/qualcomm/oss_scripts/llama/README.md) ## FAQ If you encounter any issues while reproducing the tutorial, please file a github [issue](https://github.com/pytorch/executorch/issues) on ExecuTorch repo and tag use `#qcom_aisw` tag --- # Exporting custom LLMs If you have your own PyTorch model that is an LLM, this guide will show you how to manually export and lower to ExecuTorch, with many of the same optimizations as covered in the previous `export_llm` guide. This example uses Karpathy’s [nanoGPT](https://github.com/karpathy/nanoGPT), which is a minimal implementation of GPT-2 124M. This guide is applicable to other language models, as ExecuTorch is model-invariant. ## Exporting to ExecuTorch (basic) Exporting takes a PyTorch model and converts it into a format that can run efficiently on consumer devices. For this example, you will need the nanoGPT model and the corresponding tokenizer vocabulary. ::::{tab-set} :::{tab-item} curl ``` curl https://raw.githubusercontent.com/karpathy/nanoGPT/master/model.py -O curl https://huggingface.co/openai-community/gpt2/resolve/main/vocab.json -O ``` ::: :::{tab-item} wget ``` wget https://raw.githubusercontent.com/karpathy/nanoGPT/master/model.py wget https://huggingface.co/openai-community/gpt2/resolve/main/vocab.json ``` ::: :::: To convert the model into a format optimized for standalone execution, there are two steps. 
First, use the PyTorch `export` function to convert the PyTorch model into an intermediate, platform-independent intermediate representation. Then use the ExecuTorch `to_edge` and `to_executorch` methods to prepare the model for on-device execution. This creates a .pte file which can be loaded by a desktop or mobile application at runtime. Create a file called export_nanogpt.py with the following contents: ```python # export_nanogpt.py import torch from executorch.exir import EdgeCompileConfig, to_edge from torch.nn.attention import sdpa_kernel, SDPBackend from torch.export import export from model import GPT # Load the model. model = GPT.from_pretrained('gpt2') # Create example inputs. This is used in the export process to provide # hints on the expected shape of the model input. example_inputs = (torch.randint(0, 100, (1, model.config.block_size), dtype=torch.long), ) # Set up dynamic shape configuration. This allows the sizes of the input tensors # to differ from the sizes of the tensors in `example_inputs` during runtime, as # long as they adhere to the rules specified in the dynamic shape configuration. # Here we set the range of 0th model input's 1st dimension as # [0, model.config.block_size]. # See https://pytorch.org/executorch/main/concepts#dynamic-shapes # for details about creating dynamic shapes. dynamic_shape = ( {1: torch.export.Dim("token_dim", max=model.config.block_size)}, ) # Trace the model, converting it to a portable intermediate representation. # The torch.no_grad() call tells PyTorch to exclude training-specific logic. with torch.nn.attention.sdpa_kernel([SDPBackend.MATH]), torch.no_grad(): m = export(model, example_inputs, dynamic_shapes=dynamic_shape).module() traced_model = export(m, example_inputs, dynamic_shapes=dynamic_shape) # Convert the model into a runnable ExecuTorch program. edge_config = EdgeCompileConfig(_check_ir_validity=False) edge_manager = to_edge(traced_model, compile_config=edge_config) et_program = edge_manager.to_executorch() # Save the ExecuTorch program to a file. with open("nanogpt.pte", "wb") as file: file.write(et_program.buffer) ``` To export, run the script with `python export_nanogpt.py` (or python3, as appropriate for your environment). It will generate a `nanogpt.pte` file in the current directory. For more information, see [Exporting to ExecuTorch](../tutorials/export-to-executorch-tutorial) and [torch.export](https://pytorch.org/docs/stable/export.html). ## Backend delegation While ExecuTorch provides a portable, cross-platform implementation for all operators, it also provides specialized backends for a number of different targets. These include, but are not limited to, x86 and ARM CPU acceleration via the XNNPACK backend, Apple acceleration via the Core ML backend and Metal Performance Shader (MPS) backend, and GPU acceleration via the Vulkan backend. Because optimizations are specific to a given backend, each pte file is specific to the backend(s) targeted at export. To support multiple devices, such as XNNPACK acceleration for Android and Core ML for iOS, export a separate PTE file for each backend. To delegate a model to a specific backend during export, ExecuTorch uses the `to_edge_transform_and_lower()` function. This function takes the exported program from `torch.export` and a backend-specific partitioner object. The partitioner identifies parts of the computation graph that can be optimized by the target backend. Within `to_edge_transform_and_lower()`, the exported program is converted to an edge dialect program. 
The partitioner then delegates compatible graph sections to the backend for acceleration and optimization. Any graph parts not delegated are executed by ExecuTorch's default operator implementations. To delegate the exported model to a specific backend, we need to import its partitioner as well as edge compile config from ExecuTorch codebase first, then call `to_edge_transform_and_lower`. Here's an example of how to delegate nanoGPT to XNNPACK (if you're deploying to an Android phone for instance): ```python # export_nanogpt.py # Load partitioner for Xnnpack backend from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner # Model to be delegated to specific backend should use specific edge compile config from executorch.backends.xnnpack.utils.configs import get_xnnpack_edge_compile_config from executorch.exir import EdgeCompileConfig, to_edge_transform_and_lower import torch from torch.export import export from torch.nn.attention import sdpa_kernel, SDPBackend from torch.export import export from model import GPT # Load the nanoGPT model. model = GPT.from_pretrained('gpt2') # Create example inputs. This is used in the export process to provide # hints on the expected shape of the model input. example_inputs = ( torch.randint(0, 100, (1, model.config.block_size - 1), dtype=torch.long), ) # Set up dynamic shape configuration. This allows the sizes of the input tensors # to differ from the sizes of the tensors in `example_inputs` during runtime, as # long as they adhere to the rules specified in the dynamic shape configuration. # Here we set the range of 0th model input's 1st dimension as # [0, model.config.block_size]. # See ../concepts.html#dynamic-shapes # for details about creating dynamic shapes. dynamic_shape = ( {1: torch.export.Dim("token_dim", max=model.config.block_size - 1)}, ) # Trace the model, converting it to a portable intermediate representation. # The torch.no_grad() call tells PyTorch to exclude training-specific logic. with torch.nn.attention.sdpa_kernel([SDPBackend.MATH]), torch.no_grad(): m = export(model, example_inputs, dynamic_shapes=dynamic_shape).module() traced_model = export(m, example_inputs, dynamic_shapes=dynamic_shape) # Convert the model into a runnable ExecuTorch program. # To be further lowered to Xnnpack backend, `traced_model` needs xnnpack-specific edge compile config edge_config = get_xnnpack_edge_compile_config() # Converted to edge program and then delegate exported model to Xnnpack backend # by invoking `to` function with Xnnpack partitioner. edge_manager = to_edge_transform_and_lower(traced_model, partitioner = [XnnpackPartitioner()], compile_config = edge_config) et_program = edge_manager.to_executorch() # Save the Xnnpack-delegated ExecuTorch program to a file. with open("nanogpt.pte", "wb") as file: file.write(et_program.buffer) ``` ## Quantization Quantization refers to a set of techniques for running calculations and storing tensors using lower precision types. Compared to 32-bit floating point, using 8-bit integers can provide both a significant speedup and reduction in memory usage. There are many approaches to quantizing a model, varying in amount of pre-processing required, data types used, and impact on model accuracy and performance. Because compute and memory are highly constrained on mobile devices, some form of quantization is necessary to ship large models on consumer electronics. In particular, large language models, such as Llama2, may require quantizing model weights to 4 bits or less. 
Leveraging quantization requires transforming the model before export. PyTorch provides the pt2e (PyTorch 2 Export) API for this purpose. This example targets CPU acceleration using the XNNPACK delegate. As such, it needs to use the XNNPACK-specific quantizer. Targeting a different backend will require use of the corresponding quantizer. To use 8-bit integer dynamic quantization with the XNNPACK delegate, call `prepare_pt2e`, calibrate the model by running with a representative input, and then call `convert_pt2e`. This updates the computational graph to use quantized operators where available. ```python # export_nanogpt.py from executorch.backends.transforms.duplicate_dynamic_quant_chain import ( DuplicateDynamicQuantChainPass, ) from executorch.backends.xnnpack.quantizer.xnnpack_quantizer import ( get_symmetric_quantization_config, XNNPACKQuantizer, ) from torchao.quantization.pt2e.quantize_pt2e import convert_pt2e, prepare_pt2e ``` ```python # Use dynamic, per-channel quantization. xnnpack_quant_config = get_symmetric_quantization_config( is_per_channel=True, is_dynamic=True ) xnnpack_quantizer = XNNPACKQuantizer() xnnpack_quantizer.set_global(xnnpack_quant_config) m = export(model, example_inputs).module() # Annotate the model for quantization. This prepares the model for calibration. m = prepare_pt2e(m, xnnpack_quantizer) # Calibrate the model using representative inputs. This allows the quantization # logic to determine the expected range of values in each tensor. m(*example_inputs) # Perform the actual quantization. m = convert_pt2e(m, fold_quantize=False) DuplicateDynamicQuantChainPass()(m) traced_model = export(m, example_inputs) ``` Additionally, add or update the `to_edge_transform_and_lower()` call to use `XnnpackPartitioner`. This instructs ExecuTorch to optimize the model for CPU execution via the XNNPACK backend. ```python from executorch.backends.xnnpack.partition.xnnpack_partitioner import ( XnnpackPartitioner, ) ``` ```python edge_config = get_xnnpack_edge_compile_config() # Convert to edge dialect and lower to XNNPack. edge_manager = to_edge_transform_and_lower(traced_model, partitioner = [XnnpackPartitioner()], compile_config = edge_config) et_program = edge_manager.to_executorch() with open("nanogpt.pte", "wb") as file: file.write(et_program.buffer) ``` For more information, see [Quantization in ExecuTorch](../quantization-overview.md). ## Profiling and Debugging After lowering a model by calling `to_edge_transform_and_lower()`, you may want to see what got delegated and what didn’t. ExecuTorch provides utility methods to give insight on the delegation. You can use this information to gain visibility into the underlying computation and diagnose potential performance issues. Model authors can use this information to structure the model in a way that is compatible with the target backend. ### Visualizing the Delegation The `get_delegation_info()` method provides a summary of what happened to the model after the `to_edge_transform_and_lower()` call: ```python from executorch.devtools.backend_debug import get_delegation_info from tabulate import tabulate # ... 
After call to to_edge_transform_and_lower(), but before to_executorch() graph_module = edge_manager.exported_program().graph_module delegation_info = get_delegation_info(graph_module) print(delegation_info.get_summary()) df = delegation_info.get_operator_delegation_dataframe() print(tabulate(df, headers="keys", tablefmt="fancy_grid")) ``` For nanoGPT targeting the XNNPACK backend, you might see the following (note that the numbers below are for illustration purposes only and actual values may vary): ``` Total delegated subgraphs: 145 Number of delegated nodes: 350 Number of non-delegated nodes: 760 ``` | | op_type | # in_delegated_graphs | # in_non_delegated_graphs | |----|---------------------------------|------- |-----| | 0 | aten__softmax_default | 12 | 0 | | 1 | aten_add_tensor | 37 | 0 | | 2 | aten_addmm_default | 48 | 0 | | 3 | aten_any_dim | 0 | 12 | | | ... | | | | 25 | aten_view_copy_default | 96 | 122 | | | ... | | | | 30 | Total | 350 | 760 | From the table, the operator `aten_view_copy_default` appears 96 times in delegate graphs and 122 times in non-delegated graphs. To see a more detailed view, use the `format_delegated_graph()` method to get a formatted str of printout of the whole graph or use `print_delegated_graph()` to print directly: ```python from executorch.exir.backend.utils import format_delegated_graph graph_module = edge_manager.exported_program().graph_module print(format_delegated_graph(graph_module)) ``` This may generate a large amount of output for large models. Consider using "Control+F" or "Command+F" to locate the operator you’re interested in (e.g. “aten_view_copy_default”). Observe which instances are not under lowered graphs. In the fragment of the output for nanoGPT below, observe that a transformer module has been delegated to XNNPACK while the where operator is not. 
``` %aten_where_self_22 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.where.self](args = (%aten_logical_not_default_33, %scalar_tensor_23, %scalar_tensor_22), kwargs = {}) %lowered_module_144 : [num_users=1] = get_attr[target=lowered_module_144] backend_id: XnnpackBackend lowered graph(): %p_transformer_h_0_attn_c_attn_weight : [num_users=1] = placeholder[target=p_transformer_h_0_attn_c_attn_weight] %p_transformer_h_0_attn_c_attn_bias : [num_users=1] = placeholder[target=p_transformer_h_0_attn_c_attn_bias] %getitem : [num_users=1] = placeholder[target=getitem] %sym_size : [num_users=2] = placeholder[target=sym_size] %aten_view_copy_default : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.view_copy.default](args = (%getitem, [%sym_size, 768]), kwargs = {}) %aten_permute_copy_default : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.permute_copy.default](args = (%p_transformer_h_0_attn_c_attn_weight, [1, 0]), kwargs = {}) %aten_addmm_default : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.addmm.default](args = (%p_transformer_h_0_attn_c_attn_bias, %aten_view_copy_default, %aten_permute_copy_default), kwargs = {}) %aten_view_copy_default_1 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.view_copy.default](args = (%aten_addmm_default, [1, %sym_size, 2304]), kwargs = {}) return [aten_view_copy_default_1] ``` ### Further Model Analysis and Debugging Through the [ExecuTorch's Developer Tools](getting-started.md#performance-analysis), users are able to profile model execution, giving timing information for each operator in the model, doing model numeric debugging, etc. An ETRecord is an artifact generated at the time of export that contains model graphs and source-level metadata linking the ExecuTorch program to the original PyTorch model. You can view all profiling events without an ETRecord, though with an ETRecord, you will also be able to link each event to the types of operators being executed, module hierarchy, and stack traces of the original PyTorch source code. For more information, see [the ETRecord docs](../etrecord.rst). In your export script, after calling `to_edge()` and `to_executorch()`, call `generate_etrecord()` with the `EdgeProgramManager` from `to_edge()` and the `ExecuTorchProgramManager` from `to_executorch()`. Make sure to copy the `EdgeProgramManager`, as the call to `to_edge_transform_and_lower()` mutates the graph in-place. ``` # export_nanogpt.py import copy from executorch.devtools import generate_etrecord # Make the deep copy immediately after to to_edge() edge_manager_copy = copy.deepcopy(edge_manager) # ... # Generate ETRecord right after to_executorch() etrecord_path = "etrecord.bin" generate_etrecord(etrecord_path, edge_manager_copy, et_program) ``` Run the export script and the ETRecord will be generated as `etrecord.bin`. To learn more about ExecuTorch's Developer Tools, see the [Introduction to the ExecuTorch Developer Tools](../devtools-overview.md). --- # Exporting LLMs with HuggingFace's Optimum ExecuTorch [Optimum ExecuTorch](https://github.com/huggingface/optimum-executorch) provides a streamlined way to export Hugging Face transformer models to ExecuTorch format. It offers seamless integration with the Hugging Face ecosystem, making it easy to export models directly from the Hugging Face Hub. 
## Overview Optimum ExecuTorch supports a much wider variety of model architectures compared to ExecuTorch's native `export_llm` API. While `export_llm` focuses on a limited set of highly optimized models (Llama, Qwen, Phi, and SmolLM) with advanced features like SpinQuant and attention sink, Optimum ExecuTorch can export diverse architectures including Gemma, Mistral, GPT-2, BERT, T5, Whisper, Voxtral, and many others. ### Use Optimum ExecuTorch when: - You need to export models beyond the limited set supported by `export_llm` - Exporting directly from Hugging Face Hub model IDs, including model variants such as finetunes - You want a simpler interface with Hugging Face ecosystem integration ### Use export_llm when: - Working with one of the highly optimized supported models (Llama, Qwen, Phi, SmolLM) - You need advanced optimizations like SpinQuant or attention sink - You need pt2e quantization for QNN/CoreML/Vulkan backends - Working with Llama models requiring custom checkpoints See [Exporting LLMs](export-llm.md) for details on using the native `export_llm` API. ## Prerequisites ### Installation First, clone and install Optimum ExecuTorch from source: ```bash git clone https://github.com/huggingface/optimum-executorch.git cd optimum-executorch pip install '.[dev]' ``` For access to the latest features and optimizations, install dependencies in dev mode: ```bash python install_dev.py ``` This installs `executorch`, `torch`, `torchao`, `transformers`, and other dependencies from nightly builds or source. ## Supported Models Optimum ExecuTorch supports a wide range of model architectures including decoder-only LLMs (Llama, Qwen, Gemma, Mistral, etc.), multimodal models, vision models, audio models (Whisper), encoder models (BERT, RoBERTa), and seq2seq models (T5). For the complete list of supported models, see the [Optimum ExecuTorch documentation](https://github.com/huggingface/optimum-executorch#-supported-models). ## Export Methods Optimum ExecuTorch offers two ways to export models: ### Method 1: CLI Export The CLI is the simplest way to export models. It provides a single command to convert models from Hugging Face Hub to ExecuTorch format. #### Basic Export ```bash optimum-cli export executorch \ --model "HuggingFaceTB/SmolLM2-135M-Instruct" \ --task "text-generation" \ --recipe "xnnpack" \ --output_dir="./smollm2_exported" ``` #### With Optimizations Add custom SDPA, KV cache optimization, and quantization: ```bash optimum-cli export executorch \ --model "HuggingFaceTB/SmolLM2-135M-Instruct" \ --task "text-generation" \ --recipe "xnnpack" \ --use_custom_sdpa \ --use_custom_kv_cache \ --qlinear 8da4w \ --qembedding 8w \ --output_dir="./smollm2_exported" ``` #### Available CLI Arguments Key arguments for LLM export include `--model`, `--task`, `--recipe` (backend), `--use_custom_sdpa`, `--use_custom_kv_cache`, `--qlinear` (linear quantization), `--qembedding` (embedding quantization), and `--max_seq_len`. For the complete list of arguments, run: ```bash optimum-cli export executorch --help ``` ## Optimization Options ### Custom Operators Optimum ExecuTorch includes custom SDPA (~3x speedup) and custom KV cache (~2.5x speedup) operators. Enable with `--use_custom_sdpa` and `--use_custom_kv_cache`. ### Quantization Optimum ExecuTorch uses [TorchAO](https://github.com/pytorch/ao) for quantization. 
Common options: - `--qlinear 8da4w`: int8 dynamic activation + int4 weight (recommended) - `--qembedding 4w` or `--qembedding 8w`: int4/int8 embedding quantization Example: ```bash optimum-cli export executorch \ --model "meta-llama/Llama-3.2-1B" \ --task "text-generation" \ --recipe "xnnpack" \ --use_custom_sdpa \ --use_custom_kv_cache \ --qlinear 8da4w \ --qembedding 4w \ --output_dir="./llama32_1b" ``` ### Backend Support Supported backends: `xnnpack` (CPU), `coreml` (Apple GPU), `portable` (baseline), `cuda` (NVIDIA GPU). Specify with `--recipe`. ## Exporting Different Model Types Optimum ExecuTorch supports various model architectures with different tasks: - **Decoder-only LLMs**: Use `--task text-generation` - **Multimodal LLMs**: Use `--task multimodal-text-to-text` - **Seq2Seq models** (T5): Use `--task text2text-generation` - **ASR models** (Whisper): Use `--task automatic-speech-recognition` For detailed examples of exporting each model type, see the [Optimum ExecuTorch export guide](https://github.com/huggingface/optimum-executorch/blob/main/optimum/exporters/executorch/README.md). ## Running Exported Models ### Verifying Output with Python After exporting, you can verify the model output in Python before deploying to device using classes from `modeling.py`, such as the `ExecuTorchModelForCausalLM` class for LLMs: ```python from optimum.executorch import ExecuTorchModelForCausalLM from transformers import AutoTokenizer # Load the exported model model = ExecuTorchModelForCausalLM.from_pretrained("./smollm2_exported") tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M-Instruct") # Generate text generated_text = model.text_generation( tokenizer=tokenizer, prompt="Once upon a time", max_seq_len=128, ) print(generated_text) ``` ### Running on Device After verifying your model works correctly, deploy it to device: - [Running with C++](run-with-c-plus-plus.md) - Run exported models using ExecuTorch's C++ runtime - [Running on Android](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/android) - Deploy to Android devices - [Running on iOS](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/apple) - Deploy to iOS devices ## Performance For performance benchmarks and on-device metrics, see the [Optimum ExecuTorch benchmarks](https://github.com/huggingface/optimum-executorch#-benchmarks-on-mobile-devices) and the [ExecuTorch Benchmark Dashboard](https://hud.pytorch.org/benchmark/llms?repoName=pytorch%2Fexecutorch). ## Additional Resources - [Optimum ExecuTorch GitHub](https://github.com/huggingface/optimum-executorch) - Full documentation and examples - [Supported Models](https://github.com/huggingface/optimum-executorch#-supported-models) - Complete model list - [Export Guide](https://github.com/huggingface/optimum-executorch/blob/main/optimum/exporters/executorch/README.md) - Detailed export examples - [TorchAO Quantization](https://github.com/pytorch/ao) - Quantization library documentation --- # Exporting LLMs Instead of needing to manually write code to call torch.export(), use ExecuTorch's assortment of lowering APIs, or even interact with TorchAO quantize_ APIs for quantization, we have provided an out of box experience which performantly exports a selection of supported models to ExecuTorch. ## Prerequisites The LLM export functionality requires the `pytorch_tokenizers` package. 
If you encounter a `ModuleNotFoundError: No module named 'pytorch_tokenizers'` error, install it from the ExecuTorch source code: ```bash pip install -e ./extension/llm/tokenizers/ ``` ## Supported Models As of this writing, the list of supported LLMs includes the following: - Llama 2/3/3.1/3.2 - Qwen 2.5/3 - Phi 3.5/4-mini - SmolLM2 The up-to-date list of supported LLMs can be found in the code [here](https://github.com/pytorch/executorch/blob/main/extension/llm/export/config/llm_config.py#L32). **Note:** If you need to export models that are not on this list or other model architectures (such as Gemma, Mistral, BERT, T5, Whisper, etc.), see [Exporting LLMs with Optimum](export-llm-optimum.md), which supports a much wider variety of models from Hugging Face Hub. ## The export_llm API `export_llm` is ExecuTorch's high-level export API for LLMs. In this tutorial, we will focus on exporting Llama 3.2 1B using this API. `export_llm`'s arguments are specified either through CLI args or through a yaml configuration whose fields are defined in [`LlmConfig`](https://github.com/pytorch/executorch/blob/main/extension/llm/export/config/llm_config.py). To call `export_llm`: ``` python -m executorch.extension.llm.export.export_llm --config <path/to/config.yaml> +base.<field>=<value> ``` where `+base.<field>=<value>` optionally overrides individual config fields from the command line. ## Basic export To perform a basic export of Llama 3.2, we will first need to download the checkpoint file (`consolidated.00.pth`) and params file (`params.json`). You can find these on the [Llama website](https://www.llama.com/llama-downloads/) or on [Hugging Face](https://huggingface.co/meta-llama/Llama-3.2-1B/tree/main/original). Then, we specify the `model_class`, `checkpoint` (path to checkpoint file), and `params` (path to params file) as arguments. Additionally, when we later run the exported .pte with our runner APIs, the runner will need to know the bos and eos ids for this model so it knows when to terminate. These are exposed through bos and eos getter methods in the .pte, which we can add by specifying bos and eos ids in a `metadata` argument. The values for these tokens can usually be found in the model's `tokenizer_config.json` on HuggingFace. ``` # path/to/config.yaml base: model_class: llama3_2 checkpoint: path/to/consolidated.00.pth params: path/to/params.json metadata: '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' # export_llm python -m extension.llm.export.export_llm \ --config path/to/config.yaml ``` We only require manually specifying a checkpoint path for the Llama model family, since it is our most optimized model and we have more advanced optimizations such as [SpinQuant](https://github.com/pytorch/executorch/blob/main/examples/models/llama/README.md#spinquant) that require custom checkpoints. For the other supported LLMs, the checkpoint will be downloaded from HuggingFace automatically, and the param files can be found in their respective directories under `executorch/examples/models`, for instance `executorch/examples/models/qwen3/config/0_6b_config.json`. ## Export settings [ExportConfig](https://github.com/pytorch/executorch/blob/main/extension/llm/export/config/llm_config.py) contains settings for the exported `.pte`, such as `max_seq_length` (max length of the prompt) and `max_context_length` (max length of the model's memory/cache). ## Adding optimizations `export_llm` applies a variety of optimizations to the model before export, during export, and during lowering. Quantization and delegation to accelerator backends are the main ones and will be covered in the next two sections. 
All other optimizations can be found under [`ModelConfig`](https://github.com/pytorch/executorch/blob/main/extension/llm/export/config/llm_config.py#L120). We will go ahead and add a few optimizations. ``` # path/to/config.yaml base: model_class: llama3_2 checkpoint: path/to/consolidated.00.pth params: path/to/params.json metadata: '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' model: use_kv_cache: True use_sdpa_with_kv_cache: True # export_llm python -m extension.llm.export.export_llm \ --config path/to/config.yaml ``` `use_kv_cache` and `use_sdpa_with_kv_cache` are recommended when exporting any LLM, while other options are useful situationally. For example: - `use_shared_embedding` can help for models with tied input/output embedding layers, provided that you quantize using the TorchAO low-bit ops (`quantization.qmode: torchao:8da(\d+)w` or `quantization.qmode: torchao:fpa(\d+)w`); see more [here](https://github.com/pytorch/executorch/blob/main/extension/llm/export/config/llm_config.py#L307). - `use_attention_sink` extends generation by evicting tokens from the beginning of the KV cache when the max context length is reached. - `quantize_kv_cache` quantizes the KV cache to int8. - `local_global_attention` implements [Local-Global Attention](https://arxiv.org/abs/2411.09604), making specific attention layers use a much smaller localized sliding window KV cache. ## Quantization Quantization options are defined by [`QuantizationConfig`](https://github.com/pytorch/executorch/blob/main/extension/llm/export/config/llm_config.py#L283). ExecuTorch does quantization in two ways: 1. TorchAO [`quantize_`](https://docs.pytorch.org/ao/stable/generated/torchao.quantization.quantize_.html) API 2. [pt2e quantization](https://docs.pytorch.org/ao/main/tutorials_source/pt2e_quant_ptq.html) ### TorchAO (XNNPACK) TorchAO quantizes at the source code level, swapping out Linear modules for QuantizedLinear modules. **To quantize for the XNNPACK backend, this is the quantization path to follow.** The quantization modes are defined [here](https://github.com/pytorch/executorch/blob/main/extension/llm/export/config/llm_config.py#L306). Common ones to use are: - `8da4w`: short for int8 dynamic activation + int4 weight quantization. - `int8`: int8 weight-only quantization. Group size is specified with: - `group_size`: 8, 32, 64, etc. For Arm CPUs, there are also [low-bit kernels](https://pytorch.org/blog/hi-po-low-bit-operators/) for int8 dynamic activation + int[1-8] weight quantization. Note that these should not be used alongside XNNPACK, and experimentally we have found that their performance can sometimes even be better than the equivalent `8da4w`. To use these, set `qmode` to either: - `torchao:8da(\d+)w`: int8 dynamic activation + int[1-8] weights, for example `torchao:8da5w` - `torchao:fpa(\d+)w`: int[1-8] weight only, for example `torchao:fpa4w` To quantize embeddings, specify either `embedding_quantize: <bitwidth>,<group_size>` (`bitwidth` here must be 2, 4, or 8), or for low-bit kernels use `embedding_quantize: torchao:<bitwidth>,<group_size>` (`bitwidth` can be from 1 to 8). 
``` # path/to/config.yaml base: model_class: llama3_2 checkpoint: path/to/consolidated.00.pth params: path/to/params.json metadata: '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' model: use_kv_cache: True use_sdpa_with_kv_cache: True quantization: embedding_quantize: 4,32 qmode: 8da4w # export_llm python -m extension.llm.export.export_llm \ --config path/to/config.yaml ``` ### pt2e (QNN, CoreML, and Vulkan) pt2e quantizes at the post-export graph level, swapping nodes and injecting quant/dequant nodes. **To quantize on non-CPU backends (QNN, CoreML, Vulkan), this is the quantization path to follow.** Read more about pt2e [here](https://docs.pytorch.org/ao/main/tutorials_source/pt2e_quant_ptq.html), and how ExecuTorch uses pt2e [here](https://github.com/pytorch/executorch/blob/main/docs/source/quantization-overview.md). *CoreML and Vulkan support for export_llm is currently experimental and limited. To read more about QNN export, please read [Running on Android (Qualcomm)](build-run-llama3-qualcomm-ai-engine-direct-backend.md).* ## Backend support Backend options are defined by [`BackendConfig`](https://github.com/pytorch/executorch/blob/main/extension/llm/export/config/llm_config.py#L434). Each backend has their own backend configuration options. Here is an example of lowering the LLM to XNNPACK for CPU acceleration: ``` # path/to/config.yaml base: model_class: llama3_2 checkpoint: path/to/consolidated.00.pth params: path/to/params.json metadata: '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' model: use_kv_cache: True use_sdpa_with_kv_cache: True quantization: embedding_quantize: 4,32 qmode: 8da4w backend: xnnpack: enabled: True extended_ops: True # Expand the selection of ops delegated to XNNPACK. # export_llm python -m extension.llm.export.export_llm \ --config path/to/config.yaml ``` ## Profiling and Debugging To see which ops got delegated to the backend and which didn't, specify `verbose: True`: ``` # path/to/config.yaml ... debug: verbose: True ... # export_llm python -m extension.llm.export.export_llm \ --config path/to/config.yaml ``` In the logs, there will be a table of all ops in the graph, and which ones were and were not delegated. Here is an example:
Total delegated subgraphs: 368
Number of delegated nodes: 2588
Number of non-delegated nodes: 2513
|    | op_type                         | # in_delegated_graphs | # in_non_delegated_graphs |
|----|---------------------------------|-----------------------|---------------------------|
| 0  | _assert_scalar                  | 0                     | 167                       |
| 1  | _local_scalar_dense             | 0                     | 123                       |
| 2  | add                             | 0                     | 31                        |
| 3  | aten__to_copy_default           | 0                     | 44                        |
| 4  | aten_add_tensor                 | 418                   | 44                        |
| 5  | aten_alias_copy_default         | 0                     | 52                        |
|    | ...                             |                       |                           |
| 15 | aten_linear_default             | 183                   | 0                         |
| 18 | aten_mul_tensor                 | 445                   | 0                         |
| 20 | aten_pow_tensor_scalar          | 157                   | 0                         |
| 22 | aten_rsqrt_default              | 157                   | 0                         |
| 27 | aten_view_copy_default          | 0                     | 126                       |
| 31 | getitem                         | 366                   | 628                       |
|    | ...                             |                       |                           |
| 41 | torchao_quantize_affine_default | 183                   | 0                         |
| 42 | Total                           | 2588                  | 2513                      |
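The table above is what `export_llm` prints when verbose logging is enabled. If you are lowering a model with your own export script instead, a similar per-operator summary can be produced with the devtools delegation helper; the sketch below is illustrative and assumes an `edge_manager` returned by your own lowering call, the `executorch.devtools.backend_debug.get_delegation_info` helper, and the third-party `tabulate` package.

```python
# Sketch: print a delegation summary for a program lowered in your own script.
# Assumes `edge_manager` is the EdgeProgramManager returned by your lowering call.
from executorch.devtools.backend_debug import get_delegation_info
from tabulate import tabulate

def print_delegation_summary(edge_manager) -> None:
    graph_module = edge_manager.exported_program().graph_module
    delegation_info = get_delegation_info(graph_module)
    print(delegation_info.get_summary())  # totals for delegated subgraphs/nodes
    df = delegation_info.get_operator_delegation_dataframe()  # per-op breakdown
    print(tabulate(df, headers="keys", tablefmt="fancy_grid"))
```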

To do further performance analysis, you may opt to use [ExecuTorch's Developer Tools](getting-started.md#performance-analysis) to do things such as trace individual operator performance back to source code, view memory planning, and debug intermediate activations. To generate an ETRecord that links the `.pte` program back to source code, you can use: ``` # path/to/config.yaml ... debug: generate_etrecord: True ... # export_llm python -m extension.llm.export.export_llm \ --config path/to/config.yaml ``` Other debug and profiling options can be found in [DebugConfig](https://github.com/pytorch/executorch/blob/main/extension/llm/export/config/llm_config.py#L228). A few examples: - `profile_memory`: Generates an activation memory profile in Chrome trace format. It allows you to visualize the lifetimes of different intermediate tensors of a model, how their lifetimes overlap, where these tensors come from, and how they impact the memory footprint of the model during its execution. Click [here](https://github.com/pytorch/executorch/blob/dd4488d720d676a1227450e8ea0c0c97beed900c/docs/source/memory-planning-inspection.md?plain=1#L19) for more details on memory profiling. - `profile_path`: Generates a time profile of the various components of export_llm, including `torch.export`, quantization, `to_edge`, and delegation via the to_backend APIs. This option generates an .html file that shows the time profile in flamegraph/icicle format, which is helpful for understanding which parts of `export_llm` take the most time. It is largely useful for developers and contributors of ExecuTorch. For more details on flamegraphs, see https://www.parca.dev/docs/icicle-graph-anatomy/ To learn more about ExecuTorch's Developer Tools, see the [Introduction to the ExecuTorch Developer Tools](../devtools-overview.md). --- # Deploying LLMs to ExecuTorch ExecuTorch is designed to support all types of machine learning models, and LLMs are no exception. In this section we demonstrate how to leverage ExecuTorch to performantly run state-of-the-art LLMs on-device out of the box with our provided export LLM APIs, acceleration backends, quantization libraries, tokenizers, and more. We encourage users to use this project as a starting point and adapt it to their specific needs, which includes creating your own versions of the tokenizer, sampler, acceleration backends, and other components. We hope this project serves as a useful guide in your journey with LLMs and ExecuTorch. ## Prerequisites To follow this guide, you'll need to install ExecuTorch. Please see [Setting Up ExecuTorch](../getting-started.md#installation). ## Next steps Deploying LLMs to ExecuTorch can be boiled down to a two-step process: (1) exporting the LLM to a `.pte` file and (2) running the `.pte` file using our C++ APIs or Swift/Java bindings. 
### Exporting - [Exporting LLMs](export-llm.md) - Export using ExecuTorch's native `export_llm` API with advanced optimizations - [Exporting LLMs with Optimum](export-llm-optimum.md) - Export Hugging Face models with broader architecture support - [Exporting custom LLMs](export-custom-llm.md) ### Running - [Running with C++](run-with-c-plus-plus.md) - [Running on Android (XNNPack)](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/android) - [Running on Android (Qualcomm)](build-run-llama3-qualcomm-ai-engine-direct-backend.md) - [Running on iOS](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/apple) --- # Llama on ExecuTorch See [Llama readme](https://github.com/pytorch/executorch/blob/main/examples/models/llama/README.md) for detailed information about running Llama on ExecuTorch. --- # Running LLMs on iOS ExecuTorch’s LLM-specific runtime components provide experimental Objective-C and Swift components around the core C++ LLM runtime. ## Prerequisites Make sure you have model and tokenizer files ready, as described in the prerequisites section of the [Running LLMs with C++](run-with-c-plus-plus.md) guide. ## Runtime API Once linked against the [`executorch_llm`](../using-executorch-ios.md) framework, you can import the necessary components. ### Importing Objective-C: ```objectivec #import <ExecuTorchLLM/ExecuTorchLLM.h> ``` Swift: ```swift import ExecuTorchLLM ``` ### TextLLMRunner The `ExecuTorchLLMTextRunner` class (bridged to Swift as `TextLLMRunner`) provides a simple Objective-C/Swift interface for loading a text-generation model, configuring its tokenizer with custom special tokens, generating token streams, and stopping execution. This API is experimental and subject to change. #### Initialization Create a runner by specifying paths to your serialized model (`.pte`) and tokenizer data, plus an array of special tokens to use during tokenization. Initialization itself is lightweight and doesn’t load the program data immediately. Objective-C: ```objectivec NSString *modelPath = [[NSBundle mainBundle] pathForResource:@"llama-3.2-instruct" ofType:@"pte"]; NSString *tokenizerPath = [[NSBundle mainBundle] pathForResource:@"tokenizer" ofType:@"model"]; NSArray<NSString *> *specialTokens = @[ @"<|bos|>", @"<|eos|>" ]; ExecuTorchLLMTextRunner *runner = [[ExecuTorchLLMTextRunner alloc] initWithModelPath:modelPath tokenizerPath:tokenizerPath specialTokens:specialTokens]; ``` Swift: ```swift let modelPath = Bundle.main.path(forResource: "llama-3.2-instruct", ofType: "pte")! let tokenizerPath = Bundle.main.path(forResource: "tokenizer", ofType: "model")! let specialTokens = ["<|bos|>", "<|eos|>"] let runner = TextLLMRunner( modelPath: modelPath, tokenizerPath: tokenizerPath, specialTokens: specialTokens ) ``` #### Loading Explicitly load the model before generation to avoid paying the load cost during your first `generate` call. Objective-C: ```objectivec NSError *error = nil; BOOL success = [runner loadWithError:&error]; if (!success) { NSLog(@"Failed to load: %@", error); } ``` Swift: ```swift do { try runner.load() } catch { print("Failed to load: \(error)") } ``` #### Generating Generate tokens from an initial prompt, configured with an `ExecuTorchLLMConfig` object. The callback block is invoked once per token as it’s produced. 
Objective-C: ```objectivec ExecuTorchLLMConfig *config = [[ExecuTorchLLMConfig alloc] initWithBlock:^(ExecuTorchLLMConfig *c) { c.temperature = 0.8; c.sequenceLength = 2048; }]; NSError *error = nil; BOOL success = [runner generateWithPrompt:@"Once upon a time" config:config tokenCallback:^(NSString *token) { NSLog(@"Generated token: %@", token); } error:&error]; if (!success) { NSLog(@"Generation failed: %@", error); } ``` Swift: ```swift do { try runner.generate("Once upon a time", Config { $0.temperature = 0.8 $0.sequenceLength = 2048 }) { token in print("Generated token:", token) } } catch { print("Generation failed:", error) } ``` #### Stopping Generation If you need to interrupt a long‐running generation, call: Objective-C: ```objectivec [runner stop]; ``` Swift: ```swift runner.stop() ``` #### Resetting To clear the prefilled tokens from the KV cache and reset generation stats, call: Objective-C: ```objectivec [runner reset]; ``` Swift: ```swift runner.reset() ``` ### MultimodalRunner The `ExecuTorchLLMMultimodalRunner` class (bridged to Swift as `MultimodalRunner`) provides an interface for loading and running multimodal models that can accept a sequence of text, image, and audio inputs. #### Multimodal Inputs Inputs are provided as an array of `ExecuTorchLLMMultimodalInput` (or `MultimodalInput` in Swift). You can create inputs from String for text, `ExecuTorchLLMImage` for images (`Image` in Swift), and `ExecuTorchLLMAudio` for audio features (`Audio`) in Swift. Objective-C: ```objectivec ExecuTorchLLMMultimodalInput *textInput = [ExecuTorchLLMMultimodalInput inputWithText:@"What's in this image?"]; NSData *imageData = ...; // Your raw image bytes ExecuTorchLLMImage *image = [[ExecuTorchLLMImage alloc] initWithData:imageData width:336 height:336 channels:3]; ExecuTorchLLMMultimodalInput *imageInput = [ExecuTorchLLMMultimodalInput inputWithImage:image]; ``` Swift: ```swift let textInput = MultimodalInput("What's in this image?") let imageData: Data = ... // Your raw image bytes let image = Image(data: imageData, width: 336, height: 336, channels: 3) let imageInput = MultimodalInput(image) let audioFeatureData: Data = ... // Your raw audio feature bytes let audio = Audio(float: audioFeatureData, batchSize: 1, bins: 128, frames: 3000) let audioInput = MultimodalInput(audio) ``` #### Initialization Create a runner by specifying the paths to your multimodal model and its tokenizer. Objective-C: ```objectivec NSString *modelPath = [[NSBundle mainBundle] pathForResource:@"llava" ofType:@"pte"]; NSString *tokenizerPath = [[NSBundle mainBundle] pathForResource:@"llava_tokenizer" ofType:@"bin"]; ExecuTorchLLMMultimodalRunner *runner = [[ExecuTorchLLMMultimodalRunner alloc] initWithModelPath:modelPath tokenizerPath:tokenizerPath]; ``` Swift: ```swift let modelPath = Bundle.main.path(forResource: "llava", ofType: "pte")! let tokenizerPath = Bundle.main.path(forResource: "llava_tokenizer", ofType: "bin")! let runner = MultimodalRunner(modelPath: modelPath, tokenizerPath: tokenizerPath) ``` #### Loading Explicitly load the model before generation. Objective-C: ```objectivec NSError *error = nil; BOOL success = [runner loadWithError:&error]; if (!success) { NSLog(@"Failed to load: %@", error); } ``` Swift: ```swift do { try runner.load() } catch { print("Failed to load: \(error)") } ``` #### Generating Generate tokens from an ordered array of multimodal inputs. 
Objective-C: ```objectivec NSArray *inputs = @[textInput, imageInput]; ExecuTorchLLMConfig *config = [[ExecuTorchLLMConfig alloc] initWithBlock:^(ExecuTorchLLMConfig *c) { c.sequenceLength = 768; }]; NSError *error = nil; BOOL success = [runner generateWithInputs:inputs config:config tokenCallback:^(NSString *token) { NSLog(@"Generated token: %@", token); } error:&error]; if (!success) { NSLog(@"Generation failed: %@", error); } ``` Swift: ```swift let inputs = [textInput, imageInput] do { try runner.generate(inputs, Config { $0.sequenceLength = 768 }) { token in print("Generated token:", token) } } catch { print("Generation failed:", error) } ``` #### Stopping and Resetting The stop and reset methods for `MultimodalRunner` behave identically to those on `TextRunner`. ## Demo Get hands-on with our [etLLM iOS Demo App](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/apple) to see the LLM runtime APIs in action. --- # Running LLMs with C++ This guide explains how to use ExecuTorch's C++ runner library to run LLM models that have been exported to the `.pte` format. The runner library provides a high-level API for text generation with LLMs, handling tokenization, inference, and token generation. ## Prerequisites Before you begin, make sure you have: 1. A model exported to `.pte` format using the `export_llm` API as described in [Exporting popular LLMs out of the box](export-llm.md) or [Exporting custom LLMs](export-custom-llm.md). - Please also see [Model Metadata](#model-metadata) section for important metadata to be serialized into `.pte`. 2. A tokenizer file compatible with your model - For HuggingFace tokenizers, this is a JSON file `tokenizer.json` - For SentencePiece tokenizers, this is a `tokenizer.model` file and normally lives alongside the weights file 3. CMake and a C++ compiler installed - CMake version 3.29 or higher - g++ or clang compiler ## Model Metadata The metadata includes several important configuration parameters to be included during export step, which will be used by the runner library: 1. **`enable_dynamic_shape`**: Whether the model supports dynamic input shapes 2. **`max_seq_len`**: Maximum sequence length the model can handle 3. **`max_context_len`**: Maximum context length for KV cache 4. **`use_kv_cache`**: Whether the model uses KV cache for efficient generation 6. **`get_bos_id`**: Beginning-of-sequence token ID 7. **`get_eos_ids`**: End-of-sequence token IDs ### Adding Metadata During Export To ensure your model has the necessary metadata, you can specify it during export using the `metadata` parameter in the export configuration: ```python # export_llm python -m extension.llm.export.export_llm \ --config path/to/config.yaml \ +base.metadata='{"get_bos_id":128000, "get_eos_ids":[128009, 128001], "get_max_context_len":4096}' ``` ## Building the Runner Library The ExecuTorch LLM runner library can be built using CMake. To integrate it into your project: 1. Add ExecuTorch as a dependency in your CMake project 2. Enable the required components (extension_module, extension_tensor, etc.) 3. 
Link your application against the `extension_llm_runner` library Here's a simplified example of the CMake configuration: ```cmake # Enable required components set_overridable_option(EXECUTORCH_BUILD_EXTENSION_MODULE ON) set_overridable_option(EXECUTORCH_BUILD_EXTENSION_TENSOR ON) set_overridable_option(EXECUTORCH_BUILD_EXTENSION_LLM_RUNNER ON) # Add ExecuTorch as a dependency add_subdirectory(executorch) # Link against the LLM runner library target_link_libraries(your_app PRIVATE extension_llm_runner) ``` ## Building the Llama Runner ExecuTorch provides a complete example of a C++ runner for Llama models in the [`examples/models/llama`](https://github.com/pytorch/executorch/blob/main/examples/models/llama/README.md#step-3-run-on-your-computer-to-validate) directory. This runner demonstrates how to use the LLM runner library to run Llama models exported to the `.pte` format. Please note that this runner library is not limited to Llama models and can be used with any text-only decoder-only LLM model that has been exported to the `.pte` format. ## Basic Usage Example Here's a simplified example of using the runner: ```cpp #include <iostream> #include <executorch/extension/llm/runner/text_llm_runner.h> using namespace executorch::extension::llm; int main() { // Load tokenizer and create runner auto tokenizer = load_tokenizer("path/to/tokenizer.json", nullptr, std::nullopt, 0, 0); auto runner = create_text_llm_runner("path/to/model.pte", std::move(tokenizer)); // Load the model runner->load(); // Configure generation GenerationConfig config; config.max_new_tokens = 100; config.temperature = 0.8f; // Generate text with streaming output runner->generate("Hello, world!", config, [](const std::string& token) { std::cout << token << std::flush; }, nullptr); return 0; } ``` ## The Runner API Architecture The ExecuTorch LLM runner library is designed with a modular architecture that separates concerns between different components of the text generation pipeline. ### IRunner Interface The `IRunner` interface (`irunner.h`) defines the core functionality for LLM text generation. This interface serves as the primary abstraction for interacting with LLM models: ```cpp class IRunner { public: virtual ~IRunner() = default; virtual bool is_loaded() const = 0; virtual runtime::Error load() = 0; virtual runtime::Error generate(...) = 0; virtual runtime::Error generate_from_pos(...) = 0; virtual void stop() = 0; }; ``` Let's examine each method in detail: ```c++ bool is_loaded() const ``` Checks if the model and all necessary resources have been loaded into memory and are ready for inference. This method is useful for verifying the runner's state before attempting to generate text. ```c++ runtime::Error load() ``` Loads the model and prepares it for inference. This includes: - Loading the model weights from the `.pte` file - Initializing any necessary buffers or caches - Preparing the execution environment This method should be called before any generation attempts. It returns an `Error` object indicating success or failure. ```c++ runtime::Error generate( const std::string& prompt, const GenerationConfig& config, std::function<void(const std::string&)> token_callback, std::function<void(const Stats&)> stats_callback) ``` The primary method for text generation. 
It takes: - `prompt`: The input text to generate from - `config`: Configuration parameters controlling the generation process - `token_callback`: A callback function that receives each generated token as a string - `stats_callback`: A callback function that receives performance statistics after generation completes The token callback is called for each token as it's generated, allowing for streaming output. The stats callback provides detailed performance metrics after generation completes. ```c++ runtime::Error generate_from_pos( const std::string& prompt, int64_t start_pos, const GenerationConfig& config, std::function<void(const std::string&)> token_callback, std::function<void(const Stats&)> stats_callback) ``` An advanced version of `generate()` that allows starting generation from a specific position in the KV cache. This is useful for continuing generation from a previous state. ```c++ void stop() ``` Immediately stops the generation loop. This is typically called from another thread to interrupt a long-running generation. ### GenerationConfig Structure The `GenerationConfig` struct controls various aspects of the generation process: ```cpp struct GenerationConfig { bool echo = true; // Whether to echo the input prompt in the output int32_t max_new_tokens = -1; // Maximum number of new tokens to generate bool warming = false; // Whether this is a warmup run int32_t seq_len = -1; // Maximum number of total tokens float temperature = 0.8f; // Temperature for sampling int32_t num_bos = 0; // Number of BOS tokens to add int32_t num_eos = 0; // Number of EOS tokens to add // Helper method to resolve the actual max_new_tokens based on constraints int32_t resolve_max_new_tokens(int32_t max_context_len, int32_t num_prompt_tokens) const; }; ``` The `resolve_max_new_tokens` method handles the logic of determining how many tokens can be generated based on: - The model's maximum context length - The number of tokens in the prompt - The user-specified maximum sequence length and maximum new tokens ### Implementation Components The runner library consists of several specialized components that work together: #### TextLLMRunner The main implementation of the `IRunner` interface that orchestrates the text generation process. It manages: 1. Tokenization of input text 2. Prefilling the KV cache with prompt tokens 3. Generating new tokens one by one 4. Collecting performance statistics #### TextPrefiller Responsible for processing the initial prompt tokens and filling the KV cache. Key features: - Efficiently processes large prompts - Handles dynamic sequence lengths - Supports parallel prefilling for performance optimization #### TextTokenGenerator Generates new tokens one by one in an autoregressive manner. It: - Manages the token generation loop - Applies temperature-based sampling - Detects end-of-sequence conditions - Streams tokens as they're generated #### TextDecoderRunner Interfaces with the ExecuTorch Module to run the model forward pass. It: - Manages inputs and outputs to the model - Handles KV cache updates - Converts logits to tokens via sampling ## Tokenizer Support The runner library supports multiple tokenizer formats through a unified interface: ```cpp std::unique_ptr<tokenizers::Tokenizer> tokenizer = load_tokenizer( tokenizer_path, // Path to tokenizer file nullptr, // Optional special tokens std::nullopt, // Optional regex pattern (for TikToken) 0, // BOS token index 0 // EOS token index ); ``` Supported tokenizer formats include: 1. **HuggingFace Tokenizers**: JSON format tokenizers 2. **SentencePiece**: `.model` format tokenizers 3. 
**TikToken**: BPE tokenizers 4. **Llama2c**: BPE tokenizers in the Llama2.c format For custom tokenizers, you can find implementations in the [meta-pytorch/tokenizers](https://github.com/meta-pytorch/tokenizers) repository. ## Other APIs ### Model Warmup For more accurate timing and optimal performance, you should perform a warmup run before actual inference: ```cpp runner->warmup("Hello world", 10); // Generate 10 tokens as warmup ``` During warmup: 1. A special `GenerationConfig` is created with: - `echo = false`: The prompt is not included in the output - `warming = true`: Indicates this is a warmup run - `max_new_tokens`: Set to the specified number of tokens to generate 2. The model runs through the entire generation pipeline: - Loading the model (if not already loaded) - Tokenizing the prompt - Prefilling the KV cache - Generating the specified number of tokens 3. Special behavior during warmup: - Tokens are not displayed to the console - The runner logs "Doing a warmup run..." and "Warmup run finished!" messages 4. After warmup: - The `Stats` object is reset to clear performance metrics - The model remains loaded and ready for actual inference Warmup is particularly important for accurate benchmarking as the first inference often includes one-time initialization costs that would skew performance measurements. ### Memory Usage Monitoring You can monitor memory usage with the `Stats` object: ```cpp std::cout << "RSS after loading: " << get_rss_bytes() / 1024.0 / 1024.0 << " MiB" << std::endl; ``` --- (working-with-llms)= # LLMs Learn how to export LLM models and deploy them across different platforms and runtime environments. This section covers the complete workflow from model export to running inference on mobile devices and edge hardware. ```{toctree} :maxdepth: 1 :caption: Working with LLMs getting-started export-llm export-llm-optimum export-custom-llm run-with-c-plus-plus build-run-llama3-qualcomm-ai-engine-direct-backend run-on-ios ``` --- --- orphan: true --- # Markdown in Sphinx Tips and Tricks In this repository, you can use both markdown and reSTructuredText to author your content. This section lists most common examples of how you can use Sphinx directives in your markdown files to expand your contributions. For more information, see [MyST Parser Documentation](https://myst-parser.readthedocs.io/en/v0.17.1/sphinx/intro.html) and [reSTructuredText to Markdown mapping](https://myst-parser.readthedocs.io/en/v0.17.1/syntax/syntax.html#syntax-directives). ## Admonitions Here is an example of how you can add a note. Similarly, you can add `{tip}` and `{warning}`. ::::{tab-set} :::{tab-item} Example ```{image} /_static/img/s_demo_note_render.png :alt: note :class: bg-primary :width: 210px :align: center ``` ::: :::{tab-item} Source ```{image} /_static/img/s_demo_note_source.png :alt: note :class: bg-primary :width: 170px :align: center ``` ::: :::: ## Images [This page](https://myst-parser.readthedocs.io/en/latest/syntax/images_and_figures.html) has extensive reference on how to add an image. You can use the standard markdown syntax as well as an extended one that allows you to modify width, alignment, and other parameters of an image. 
::::{tab-set} :::{tab-item} Standard syntax ```{code-block} ![image example][/_static/img/example-image.png] ``` ::: :::{tab-item} Extended Syntax ````{code-block} ```{image} img/s_demo_note_source.png :alt: example :class: bg-primary :width: 150px :align: center ``` ```` ::: :::: ## Code Block You can use standard code blocks as well as the extended syntax and include the code from other files as. More information can be found on [this page](https://myst-parser.readthedocs.io/en/latest/syntax/code_and_apis.html). Examples: ::::{tab-set} :::{tab-item} Standard syntax ````{code-block} ```python a = 1 b = 2 c = a + b print(c) ``` ```` ::: :::{tab-item} Output ```python a = 1 b = 2 c = a + b print(c) ``` ::: :::: ::::{tab-set} :::{tab-item} Extended Syntax ````{code-block} ```{code-block} python :caption: My example code :emphasize-lines: 4 :lineno-start: 1 a = 1 b = 2 c = a + b print(c) ``` ```` ::: :::{tab-item} Output ```{code-block} python :caption: My example code :emphasize-lines: 4 :lineno-start: 1 a = 1 b = 2 c = a + b print(c) ``` ::: :::: ::::{tab-set} :::{tab-item} Include from other files Here is how you can include the code from another file. In this example, we will only include the code between the `start-after` and `end-before` markers. ````{code-block} ```{literalinclude} _static/example.py :start-after: start :end-before: end ``` ```` The `example.py` file looks like this: ```{code-block} python :emphasize-lines: 10, 16 """ A sample python file """ class Person: def __init__(self, name, age): self.name = name self.age = age # start def introduce(self): print("Hello, my name is", self.name) print("I am", self.age, "years old") # end person = Person("Alice", 25) person.introduce() ::: :::{tab-item} Output ```{literalinclude} _static/example.py :start-after: start :end-before: end ``` ::: :::: --- # Memory Planning Inspection in ExecuTorch After the [Memory Planning](concepts.md#memory-planning) pass of ExecuTorch, memory allocation information is stored on the nodes of the [`ExportedProgram`](concepts.md#exportedprogram). Here, we present a tool designed to inspect memory allocation and visualize all active tensor objects. ## Usage User should add this code after they call [to_executorch()](export-to-executorch-api-reference.rst#executorch.exir.EdgeProgramManager.to_executorch), and it will write memory allocation information stored on the nodes to the file path "memory_profile.json". The file is compatible with the Chrome trace viewer; see below for more information about interpreting the results. ```python from executorch.util.activation_memory_profiler import generate_memory_trace generate_memory_trace( executorch_program_manager=prog, chrome_trace_filename="memory_profile.json", enable_memory_offsets=True, ) ``` * `prog` is an instance of [`ExecuTorchProgramManager`](export-to-executorch-api-reference.rst#executorch.exir.ExecutorchProgramManager), returned by [to_executorch()](export-to-executorch-api-reference.rst#executorch.exir.EdgeProgramManager.to_executorch). * Set `enable_memory_offsets` to `True` to show the location of each tensor on the memory space. ## Chrome Trace Open a Chrome browser tab and navigate to . Upload the generated `.json` to view. 
Example of a [MobileNet V2](https://pytorch.org/vision/main/models/mobilenetv2.html) model: ![Memory planning Chrome trace visualization](_static/img/memory_planning_inspection.png) Note that, since we are repurposing the Chrome trace tool, the axes in this context may have different meanings compared to other Chrome trace graphs you may have encountered previously: * The horizontal axis, despite being labeled in seconds (s), actually represents megabytes (MBs). * The vertical axis has a 2-level hierarchy. The first level, "pid", represents memory space. For CPU, everything is allocated on one "space"; other backends may have multiple. In the second level, each row represents one time step. Since nodes will be executed sequentially, each node represents one time step, thus you will have as many nodes as there are rows. ## Further Reading * [Memory Planning](compiler-memory-planning.md) --- # Debugging Models in ExecuTorch With the ExecuTorch Developer Tools, users can debug their models for numerical inaccuracies and extract model outputs from their device to do quality analysis (such as signal-to-noise ratio, mean squared error, etc.). Currently, ExecuTorch supports the following debugging flows: - Extraction of model level outputs via ETDump. - Extraction of intermediate outputs (outside of delegates) via ETDump: - Linking of these intermediate outputs back to the eager model python code. ## Steps to debug a model in ExecuTorch ### Runtime For a real example reflecting the steps below, please refer to [example_runner.cpp](https://github.com/pytorch/executorch/blob/main/examples/devtools/example_runner/example_runner.cpp). 1. [Optional] Generate an [ETRecord](etrecord.rst) while exporting your model. When provided, this enables users to link profiling information back to the eager model source code (with stack traces and module hierarchy). 2. Integrate [ETDump generation](etdump.md) into the runtime and set the debugging level by configuring the `ETDumpGen` object. Then, provide an additional buffer to which intermediate outputs and program outputs will be written. Currently we support two levels of debugging: - Program level outputs ```C++ Span<uint8_t> buffer((uint8_t*)debug_buffer, debug_buffer_size); etdump_gen.set_debug_buffer(buffer); etdump_gen.set_event_tracer_debug_level( EventTracerDebugLogLevel::kProgramOutputs); ``` - Intermediate outputs of executed (non-delegated) operations (will include the program level outputs too) ```C++ Span<uint8_t> buffer((uint8_t*)debug_buffer, debug_buffer_size); etdump_gen.set_debug_buffer(buffer); etdump_gen.set_event_tracer_debug_level( EventTracerDebugLogLevel::kIntermediateOutputs); ``` 3. Build the runtime with the pre-processor flag that enables tracking of debug events. Instructions are in the [ETDump documentation](etdump.md). 4. Run your model and dump out the ETDump buffer as described [here](etdump.md). (Do so similarly for the debug buffer if configured above) ### Accessing the debug outputs post-run using the Inspector APIs Once a model has been run, using the generated ETDump and debug buffers, users can leverage the [Inspector APIs](model-inspector.rst) to inspect these debug outputs. ```python from executorch.devtools import Inspector # Create an Inspector instance with etdump and the debug buffer. inspector = Inspector(etdump_path=etdump_path, buffer_path=buffer_path, # etrecord is optional, if provided it'll link back # the runtime events to the eager model python source code. 
etrecord=etrecord_path) # Accessing program outputs is as simple as this: for event_block in inspector.event_blocks: if event_block.name == "Execute": print(event_block.run_output) # Accessing intermediate outputs from each event (an event here is essentially an instruction that executed in the runtime). for event_block in inspector.event_blocks: if event_block.name == "Execute": for event in event_block.events: print(event.debug_data) # If an ETRecord was provided by the user during Inspector initialization, users # can print the stacktraces and module hierarchy of these events. print(event.stack_traces) print(event.module_hierarchy) ``` We've also provided a simple set of utilities that let users perform quality analysis of their model outputs with respect to a set of reference outputs (possibly from the eager mode model). ```python from executorch.devtools.inspector import compare_results # Run a simple quality analysis between the model outputs sourced from the # runtime and a set of reference outputs. # # Setting plot to True will result in the quality metrics being graphed # and displayed (when run from a notebook) and will be written out to the # filesystem. A dictionary will always be returned which will contain the # results. for event_block in inspector.event_blocks: if event_block.name == "Execute": compare_results(event_block.run_output, ref_outputs, plot=True) ``` --- # New Contributor Guide Welcome to **ExecuTorch** — a runtime for efficient deployment of PyTorch AI models to edge devices, including mobile phones, wearables, and embedded systems. ExecuTorch is proudly open-source and welcomes contributions from developers of all backgrounds. If you're new to ExecuTorch, open-source projects, or GitHub, this guide is for you. We're excited to have you on board! If you have any questions, issues, comments, or just want to say hello to our community, please feel free to introduce yourself on our **[Discord Server](https://discord.com/invite/Dh43CKSAdc)**. We'd love to speak with you. --- ## 🔑 Prerequisites ### Git This guide assumes a basic knowledge of Git, and how to run Git commands in your terminal. If you've never used Git before, you can read [this quick guide](https://www.freecodecamp.org/news/learn-the-basics-of-git-in-under-10-minutes-da548267cc91/), [git guide](https://rogerdudler.github.io/git-guide/), [cheat sheet](https://towardsdatascience.com/git-commands-cheat-sheet-software-developer-54f6aedc1c46/), the [Setup Git](https://docs.github.com/en/get-started/git-basics/set-up-git) page from GitHub’s documentation, or watch one of the many tutorials on YouTube. Git is a powerful version control system for coding projects — it enables you to collaborate, record code changes, and avoid losing hours of work when you make a mistake. It is essential for projects like ExecuTorch with large codebases and many collaborators. Without it, the complexity of tracking everyone's changes, reviewing their code, and identifying bugs quickly becomes unmanageable. Git is an industry standard in the coding world, and particularly in open-source. It can take a while to get used to at first, but we promise you it's well worth the effort! We believe that learning Git can make you a significantly stronger and more effective developer. ### A GitHub Account We also assume that you have a GitHub account. 
If you don't, please [register here](https://github.com/signup), [verify your email address](https://docs.github.com/en/account-and-profile/setting-up-and-managing-your-personal-account-on-github/managing-email-preferences/verifying-your-email-address#verifying-your-email-address) (required for the steps below to work!), then [login](https://github.com/login) to your new account before proceeding further. --- ## 🧑‍💻 Your First Contribution The first step towards making a contribution is finding something you want to work on. If you're new to ExecuTorch or the wider world of open-source, it might seem hard to know where to start. To help you out with this, we've gathered together some beginner-friendly suggestions. These are self-contained pieces of work — "issues" in GitHub parlance — specifically designed to help people new to ExecuTorch get started contributing code. We call these "good first issues", and you can view all of them here: [New Contributors Projects and Issues](https://github.com/orgs/pytorch/projects/102/views/1). Here's what the list looks like at the time of writing — you can see that they all have a purple `good first issue` label in the right-hand column: ![](_static/img/new-contributor-guide/good_first_issues.png) Please check it out and see if anything interests you! New issues are added to this list all the time. Once you've found an issue you like the look of, read our [Contribution Guide](https://github.com/pytorch/executorch/blob/main/CONTRIBUTING.md). This comprehensive manual will help you: * build ExecuTorch on your machine. * understand the structure of the ExecuTorch codebase. * format, test, and document your code according to ExecuTorch best practices. * and finally, submit your code for review, so it can be polished, approved, and merged into the main codebase. If that seems like a lot of information, please read on — we'll walk you through your first contribution right now. --- ## 📤 Contributing Code, Step-By-Step ### Prepare Your Workspace Before you can start writing any code, you need to get a copy of ExecuTorch codebase onto your GitHub account, and download it onto your dev machine. You'll want to build it, too — otherwise, you won't be able to test your solution. 1. Fork the main ExecuTorch repository into your GitHub account. This creates a clone of the repository in your own space, so you can modify it freely. To do this, visit the [main repository page](https://github.com/pytorch/executorch) and click `Fork`: ![](_static/img/new-contributor-guide/how_to_fork1.png) This will take you to another page. Click `Create fork`: ![](_static/img/new-contributor-guide/how_to_fork2.png) 2. Clone your fork locally. This downloads a copy of your fork onto your dev machine, ready for you to make your changes. In the example below, we clone using HTTP, but any of the provided methods on the `Local` tab are fine. For HTTP, copy the URL given here: ![](_static/img/new-contributor-guide/how_to_clone.png) Then go to your terminal, enter the directory you want to clone the fork to, and run: ```bash git clone https://github.com/pytorch/executorch.git ``` This will create an `executorch` folder in your directory containing your forked codebase. 3. Set the `upstream` pointing to the main ExecuTorch repository. This will allow you to easily synchronize with the latest development. 
Assuming you're in the same directory you cloned into, run: ```bash cd executorch # enter the cloned project git remote add upstream https://github.com/pytorch/executorch.git ``` To see if it worked, run: ```bash git remote -v ``` Depending on how you cloned your repo (HTTP, SSH, etc.), this should print something like: ```bash origin https://github.com/{YOUR_GITHUB_USERNAME}/executorch.git (fetch) origin https://github.com/{YOUR_GITHUB_USERNAME}/executorch.git (push) upstream https://github.com/pytorch/executorch.git (fetch) upstream https://github.com/pytorch/executorch.git (push) ``` What does this mean? Well: * The `origin` entries show your forked GitHub repository. They tell you that when you run `git pull` or `git push`, your changes will go from/to your GitHub fork. * The `upstream` entries show the main ExecuTorch repository. If you want to sync the latest changes from there, you can run `git fetch upstream`. 4. If you just cloned your fork, your GitHub repository will tell you your branch is up-to-date: ![](_static/img/new-contributor-guide/synced_fork.png) However, ExecuTorch updates frequently — if it's been a while you visited your fork, you might not have the latest version anymore. It's important to keep your fork as up-to-date as possible. Otherwise, the code changes you're making might fix your issue for an old version of the codebase, but _not_ fix it for the current version. GitHub will tell you if your fork is out-of-date. To synchronise the necessary changes, click `Sync fork`, then `Update branch` as shown: ![](_static/img/new-contributor-guide/unsynced_fork.png) 5. Now you have the latest fork on your GitHub account, it's time to download it onto your dev machine. For this, you can run the following commands in your terminal: ```bash git fetch --all --prune # pull all branches from GitHub git checkout main # enter your local main branch git merge upstream/main # merge latest state from GitHub parent repo git push # push updated local main to your GitHub fork ``` 6. [Build the project](using-executorch-building-from-source.md) and [run the tests](https://github.com/pytorch/executorch/blob/main/CONTRIBUTING.md#testing). Unfortunately, this step is too long to detail here. If you get stuck at any point, please feel free to ask for help on our [Discord server](https://discord.com/invite/Dh43CKSAdc) — we're always eager to help newcomers get onboarded. One final note before we finish this section. It's very important to get your tests running at this stage, for two reasons: * If they work, it's a great sign that you've got things set up correctly. * As we'll discuss later, you'll want to run the tests _after_ making your changes to ensure you haven't broken existing functionality. Running them _before_ making your changes gives you a baseline you can compare with later test results. ### Implement your changes Great job — you're all set up. Now you can actually start coding! 1. Before making any changes, we recommend creating a new branch. To do this, just run: ```bash git checkout -b YOUR_NEW_BRANCH_NAME ``` You can follow this naming convention: `type/`, where the types are: `bugfix`, `feature`, `docs`, `tests`, etc. — or use something similarly descriptive. 
By way of example, here are a few branch names that were actually merged to ExecuTorch: * [bugfix/op_eq](https://github.com/pytorch/executorch/pull/9794) * [error-handling-log-intermediate-output-delegate](https://github.com/pytorch/executorch/pull/9759) * [add-datasink-try-before-set-tests](https://github.com/pytorch/executorch/pull/9762) Creating a new branch means that any changes you make will be isolated to your branch, allowing you to work on multiple issues in parallel. It also means that, if your fork gets behind the main repository and you have to synchronise, you won't need to deal with any merge conflicts — accidentally blocking your `main` branch can be very time-consuming. 2. Make your changes. For bugfixes, we recommend a test-driven workflow: - Find a test case that demonstrates your bug. - Verify that your new test case fails on the `main` branch. - Add that example as an automated test, and assert the expected failing results. If you can, try to make this test as minimal as possible to reduce interference with some other issue. Once you have a failing test, you can keep working on the issue and running the test until it passes. **Note:** Even if you do not find the solution, sending a PR with a test covering the issue is a valid contribution. From this point, we can help you find the solution, or even finish it with you. 3. After every set of edits, checkpoint and commit your code changes with a "commit" message that describes the changes you made. For example, in terminal: ```bash git add my_changed_file1 my_new_test_case # Pick the files you changed git commit -m "Fixed bug X and added a passing test case" # Describe your change ``` Try to make your commit messages as descriptive as possible. This helps to maintain a clear project history. Not only will this help your own development, but it will make your code vastly easier for other developers to review and maintain. Here are some example commit messages that were merged to ExecuTorch: * [Delete examples/demo-apps/apple_ios/ExecuTorchDemo directory](https://github.com/pytorch/executorch/pull/9991/commits/df2f451e5e8fc217231975d7a0065a8cc36709cb) * [[ET-VK][ez] Allow logit linear layer to be lowered to Vulkan](https://github.com/pytorch/executorch/pull/9951/commits/3fdd8cab8c58db0be666f3454c41f73ad5964743) * [Allow emitting mutable buffer names in schema](https://github.com/pytorch/executorch/pull/9935/commits/773a34725afea6c0bf1b99d02a9cefb91c4960e1) 4. When you are done making changes and the test case you added is passing, [run the same tests](https://github.com/pytorch/executorch/blob/main/CONTRIBUTING.md#testing) you ran earlier (at the end of the [Prepare Your Workspace](#prepare-your-workspace) section). If any tests fail now which were working before, it means your changes have broken some existing functionality. You'll need to dig back into your code to figure out what's gone wrong. 5. Once your new test _and_ the old tests are all working as intended, upload/push these changes to your fork: ```bash # Make sure you've committed all your changes first, then run: git push ``` ### Submit a PR Once you've successfully finished local development, it's time to send out your pull request. This is the final phase — here, we'll help you finetune your changes to get merged into the main repository. 1. 
After pushing your last edit to remote, your GitHub fork will show your new changed branch — click `Compare & pull request`: ![](_static/img/new-contributor-guide/how_to_pr1.png) Alternatively, you can click the same `Compare & pull request` button on the main ExecuTorch repo: ![](_static/img/new-contributor-guide/how_to_pr2.png) Another way still is via the `Pull request` tab on the main repo — we won't go into that here though, as it takes a few more steps. 2. This will take you to a page where you can format your PR and explain your changes. You'll see all the required details in our PR template. You should choose a title describing the proposed fix and fill in all the required details. ![](_static/img/new-contributor-guide/how_to_pr3.png) In the description, you’ll describe all the changes you’ve made. 3. If you want to submit your PR right away, you can go ahead and click the Green `Create pull request` button. However, please note that this will immediately notify all reviewers. We strongly recommend creating a Draft PR first. This will allow you to perform some extra checks first: * You can get some early feedback on your PR without notifying everybody. * It prevents anyone from accidentally merging your unfinished PR. * Creating it will start CI (["Continuous Integration"](https://en.wikipedia.org/wiki/Continuous_integration)) checks to verify that all tests pass under various configurations. If some tests fail, you can fix them before creating the final PR. To do submit a draft, click the arrow next to the `Create Pull Request` button, then click `Create draft pull request` in the dropdown menu: ![](_static/img/new-contributor-guide/how_to_draft_pr1.png) This will change the green button's text to `Draft pull request`: ![](_static/img/new-contributor-guide/how_to_draft_pr2.png) Click it to create your draft PR. 4. This will take you to your Draft PR page. It might look something like this: ![](_static/img/new-contributor-guide/how_to_draft_pr3.png) As you scroll down, you might see a number of comments and automated checks, some of which may come with alarming red warning signs and the word "Failure"! There's no need to panic, though — they are here to help. Let's go through some common checks one-by-one. * The `pytorch-bot` will probably be the first comment. It runs regular CI checks. When your PR is passing, this comment will automatically update to let you know. ![](_static/img/new-contributor-guide/ci1.png) * If this is your very first contribution to a Meta Open Source project, and you've not signed Meta's contributor license agreement (CLA), you may have a comment like this from `facebook-github-bot`: ![](_static/img/new-contributor-guide/cla1.png) You will need to sign the linked CLA to contribute your code. Once your signature has been processed, the bot will let you know in another comment: ![](_static/img/new-contributor-guide/cla2.png) * You may see a comment from `github-actions` requesting a "release notes" label: ![](_static/img/new-contributor-guide/release_notes.png) As the comment says, you can add a label by commenting on the PR with an instruction to pytorchbot. You can see a list of all our labels [here](https://github.com/pytorch/executorch/labels/). Pick the one which fits your PR best, then add it as a comment using the syntax `@pytorchbot label "YOUR LABEL HERE"`. 
For example: ![](./_static/img/new-contributor-guide/how_to_label1.png) After you've submitted your comment, `pytorchbot` will add your chosen label to the PR: ![](./_static/img/new-contributor-guide/how_to_label2.png) and the `github-actions` comment requesting a label will disappear. * At the end of your Draft PR, you'll see something like this: ![](_static/img/new-contributor-guide/end_of_draft_pr1.png) This is a summary of all the CI checks and requirements which need to be satisfied before your PR can be merged. Ensure that all tests are passing. If not, click on a failing test to see what went wrong and make the required changes. Once you're happy with your draft, you can click the `Ready for review` button to create your PR: ![](_static/img/new-contributor-guide/end_of_draft_pr2.png) 5. Now you've created your PR, it's time for your changes to be reviewed by the ExecuTorch community and maintainers. You'll need approval from one of our core contributors for your request to be merged. They may have questions or suggestions for you to address or respond to. Be aware that the review process may take a couple of iterations... Nevertheless, we hope that you'll find this feedback helpful. Code reviews can be a fantastic way to learn more about ExecuTorch and coding best practices from other contributors. Those reviewers/maintainers are here to finetune your contribution and eventually catch some issues before we merge the PR. We aim for this process to be pleasing on both sides: we try to give and get the best. Once the reviewers are happy, they'll approve your PR, indicating that they're happy for it to be merged. This will send you a notification and display as follows on your PR page: ![](_static/img/new-contributor-guide/pr_approval1.png) And in the PR comments: ![](_static/img/new-contributor-guide/pr_approval2.png) 6. Once you've received the required approval from a core contributor, you're very nearly done. We just need to make sure all the CI checks have passed, some of which need approval from a maintainer to start: ![](_static/img/new-contributor-guide/how_to_merge1.png) Once all checks these have all been approved, ran, and passed, you can go ahead and merge your PR. If there's a grey `Update branch` button instead of a green `Merge pull request` button, click that first: ![](_static/img/new-contributor-guide/how_to_merge2.png) After a moment, the branch should update with the latest changes, and you'll see the final green `Merge pull request` button: ![](_static/img/new-contributor-guide/how_to_merge3.png) Click it to merge your changes into the main codebase. Congratulations — you're now an official ExecuTorch contributor! Great job making it to the end of our guide — we hope you enjoy contributing. Once again, please check out our **[Discord Server](https://discord.com/invite/Dh43CKSAdc)** if you want to say hello, ask any questions, or talk about any and all things ExecuTorch. We look forward to receiving your contributions! --- # Pico2: A simple MNIST Tutorial Deploy your PyTorch models directly to Raspberry Pi Pico2 microcontroller with ExecuTorch. 
## What You'll Build A 28×28 MNIST digit classifier running on memory-constrained, low-power microcontrollers: - Input: ASCII art digits (0, 1, 4, 7) - Output: Real-time predictions via USB serial - Memory: <400KB total footprint ## Prerequisites - [Environment Setup section](https://docs.pytorch.org/executorch/1.0/using-executorch-building-from-source.html) - Refer to this [link](https://docs.pytorch.org/executorch/1.0/backends-arm-ethos-u.html#development-requirements) on how to accept the EULA agreement and set up the toolchain - Verify the ARM toolchain:
```bash
which arm-none-eabi-gcc # --> arm/arm-scratch/arm-gnu-toolchain-13.3.rel1-x86_64-arm-none-eabi/bin/
```
## Step 1: Generate a .pte from the provided example model - Use the [provided example model](https://github.com/pytorch/executorch/blob/main/examples/raspberry_pi/pico2/export_mlp_mnist.py)
```bash
python export_mlp_mnist.py # Creates balanced_tiny_mlp_mnist.pte
```
- **Note:** This is a hand-crafted MNIST classifier (proof of concept), not a production-trained model. This tiny MLP recognizes digits 0, 1, 4, and 7 using manually designed feature detectors. ## Step 2: Build Firmware for Pico2
```bash
# Generate model (Creates balanced_tiny_mlp_mnist.pte)
cd ./examples/raspberry_pi/pico2
python export_mlp_mnist.py
cd -

# Build Pico2 firmware (one command!)
./examples/raspberry_pi/pico2/build_firmware_pico.sh --model=balanced_tiny_mlp_mnist.pte
# This creates executorch_pico.uf2, a firmware image for Pico2
```
Output: **executorch_pico.uf2** firmware file (examples/raspberry_pi/pico2/build/) **Note:** The '[build_firmware_pico.sh](https://github.com/pytorch/executorch/blob/main/examples/raspberry_pi/pico2/build_firmware_pico.sh)' script converts the given model `.pte` to a hex array and generates C code for it via this helper [script](https://github.com/pytorch/executorch/blob/main/examples/raspberry_pi/pico2/pte_to_array.py). That C code is then compiled to produce the final `.uf2` binary, which is flashed to the Pico2. ## Step 3: Flash to Pico2 1. Hold the BOOTSEL button on the Pico2 2. Connect USB → it mounts as the `RPI-RP2` drive 3. Drag & drop the `executorch_pico.uf2` file 4. Release BOOTSEL → the Pico2 reboots with your model ## Step 4: Verify Deployment **Success indicators:** - LED blinks 10× at 500ms → Model running ✅ - LED blinks 10× at 100ms → Error, check serial ❌ **View predictions:**
```bash
# Connect serial terminal
screen /dev/tty.usbmodem1101 115200

# Expected output: something like:
=== Digit 7 ===
############################ ############################ #### #### #### #### #### #### #### #### #### #### #### #### #### #### #### #### #### #### #### #### #### #### #### #### #### ###
Input stats: 159 white pixels out of 784 total
Running neural network inference...
✅ Neural network results:
Digit 0: 370.000
Digit 1: 0.000
Digit 2: -3.000
Digit 3: -3.000
Digit 4: 860.000
Digit 5: -3.000
Digit 6: -3.000
Digit 7: 1640.000 ← PREDICTED
Digit 8: -3.000
Digit 9: -3.000
PREDICTED: 7 (Expected: 7) ✅ CORRECT!
```
## Memory Optimization Tips ### Pico2 Constraints - 520KB SRAM (runtime memory) - 4MB Flash (model storage) - Keep models small to fit within these limits ### Common Issues - "Memory allocation failed" → Reduce model size and use quantization - "Operator missing" → Use selective build: `--operators=add,mul,relu` - "Import error" → Check the `arm-none-eabi-gcc` toolchain setup.
In order to resolve some of the issues above, refer to the following guides: - [ExecuTorch Quantization Optimization Guide](https://docs.pytorch.org/executorch/1.0/quantization-optimization.html) - [Model Export & Lowering](https://docs.pytorch.org/executorch/1.0/using-executorch-export.html) and - [Selective Build support](https://docs.pytorch.org/executorch/1.0/kernel-library-selective-build.html) ### Firmware Size Analysis ```bash cd ls -al examples/raspberry_pi/pico2/build/executorch_pico.elf ``` - **Overall section sizes** ```bash arm-none-eabi-size -A examples/raspberry_pi/pico2/build/executorch_pico.elf ``` - **Detailed section breakdown** ```bash arm-none-eabi-objdump -h examples/raspberry_pi/pico2/build/executorch_pico.elf ``` - **Symbol sizes (largest consumers)** ```bash arm-none-eabi-nm --print-size --size-sort --radix=d examples/raspberry_pi/pico2/build/executorch_pico.elf | tail -20 ``` ### Model Memory Footprint - **Model data specifically** ```bash arm-none-eabi-nm --print-size --size-sort --radix=d examples/raspberry_pi/pico2/build/executorch_pico.elf | grep -i model ``` - **Check what's in .bss (uninitialized data)** ```bash arm-none-eabi-objdump -t examples/raspberry_pi/pico2/build/executorch_pico.elf | grep ".bss" | head -10 ``` - **Memory map overview** ```bash arm-none-eabi-readelf -l examples/raspberry_pi/pico2/build/executorch_pico.elf ``` ## Next Steps ### Scale up your deployment - Use real production trained model - Optimize further → INT8 quantization, pruning ### Happy Inference! **Result:** PyTorch model → Pico2 deployment in 4 simple steps 🚀 Total tutorial time: ~15 minutes **Conclusion:** Real-time inference on memory constrained, low power microcontrollers, a complete PyTorch → ExecuTorch → Pico2 demo MNIST deployment --- # Desktop & Laptop ExecuTorch supports desktop and laptop deployment across Linux, macOS, and Windows. ## Platform-Specific Guides - [C++ Runtime Integration](using-executorch-cpp) - Complete setup guide - [Building from Source](using-executorch-building-from-source) ## Available Backends by Platform ### Linux - [XNNPACK (CPU)](backends/xnnpack/xnnpack-overview.md) - [OpenVINO (Intel)](build-run-openvino) - [ARM Ethos-U (ARM64)](backends-arm-ethos-u) ### macOS - [CoreML (recommended)](backends-coreml) - [MPS (Apple Silicon)](backends-mps) - [XNNPACK (CPU)](backends/xnnpack/xnnpack-overview.md) ### Windows - [XNNPACK (CPU)](backends/xnnpack/xnnpack-overview.md) - [OpenVINO (Intel)](build-run-openvino) --- # Embedded Platforms ExecuTorch supports embedded devices from microcontrollers to edge devices. ## Platform-Specific Guides - [C++ Runtime Integration](using-executorch-cpp) - Complete setup guide - [Building from Source](using-executorch-building-from-source) ## Available Backends by Device Type ### Microcontrollers - [Cadence Xtensa Backend](backends-cadence) - [ARM Ethos-U NPU Backend](backends-arm-ethos-u) - [Custom Backend Development](backend-delegates-integration) ### Edge Devices - [ARM Ethos-U NPU Backend](backends-arm-ethos-u) - [NXP eIQ Neutron Backend](backend-nxp) - [Custom Hardware Integration](backend-delegates-integration) --- # Portable C++ Programming NOTE: This document covers the code that needs to build for and execute in target hardware environments. This applies to the core execution runtime, as well as kernel and backend implementations in this repo. These rules do not necessarily apply to code that only runs on the development host, like authoring or build tools. 
The ExecuTorch runtime code is intended to be portable, and should build for a wide variety of systems, from servers to mobile phones to DSPs, from POSIX to Windows to bare-metal environments. This means that it can't assume the existence of: - Files - Threads - Exceptions - `stdout`, `stderr` - `printf()`, `fprintf()` - POSIX APIs and concepts in general It also can't assume: - 64-bit pointers - The size of a given integer type - The signedness of `char` To keep the binary size to a minimum, and to keep tight control over memory allocation, the code may not use: - `malloc()`, `free()` - `new`, `delete` - Most `stdlibc++` types; especially container types that manage their own memory like `string` and `vector`, or memory-management wrapper types like `unique_ptr` and `shared_ptr`. And to help reduce complexity, the code may not depend on any external dependencies except: - `flatbuffers` (for `.pte` file deserialization) - `flatcc` (for event trace serialization) - Core PyTorch (only for ATen mode) ## Platform Abstraction Layer (PAL) To avoid assuming the capabilities of the target system, the ExecuTorch runtime lets clients override low-level functions in its Platform Abstraction Layer (PAL), defined in `//executorch/runtime/platform/platform.h`, to perform operations like: - Getting the current timestamp - Printing a log message - Panicking the system ## Memory Allocation Instead of using `malloc()` or `new`, the runtime code should allocate memory using the `MemoryManager` (`//executorch/runtime/executor/memory_manager.h`) provided by the client. ## File Loading Instead of loading files directly, clients should provide buffers with the data already loaded, or wrapped in types like `DataLoader`. ## Integer Types ExecuTorch runtime code should not assume anything about the sizes of primitive types like `int`, `short`, or `char`. For example, the C++ standard only guarantees that `int` will be at least 16 bits wide. And ARM toolchains treat `char` as unsigned, while other toolchains often treat it as signed. Instead, the runtime APIs use a set of more predictable, but still standard, integer types: - `<cstdint>` types like `uint64_t`, `int32_t`; these types guarantee the bit width and signedness, regardless of the architecture. Use these types when you need a very specific integer width. - `size_t` for counts of things, or memory offsets. `size_t` is guaranteed to be big enough to represent any memory byte offset; i.e., it will be as wide as the native pointer type for the target system. Prefer using this instead of `uint64_t` for counts/offsets so that 32-bit systems don't need to pay for the unnecessary overhead of a 64-bit value. - `ssize_t` for some ATen-compatibility situations where `Tensor` returns a signed count. Prefer `size_t` when possible. ## Floating Point Arithmetic Not every system has support for floating point arithmetic: some don't even enable floating point emulation in their toolchains. Therefore, the core runtime code must not perform any floating point arithmetic at runtime, although it is OK to simply create or manage `float` or `double` values (e.g., in an `EValue`). Kernels, being outside of the core runtime, are allowed to perform floating point arithmetic, though some kernels may choose not to, so that they can run on systems without floating point support.
## Logging Instead of using `printf()`, `fprintf()`, `cout`, `cerr`, or a library like `folly::logging` or `glog`, the ExecuTorch runtime provides the `ET_LOG` interface in `//executorch/runtime/platform/log.h` and the `ET_CHECK` interface in `//executorch/runtime/platform/assert.h`. The messages are printed using a hook in the PAL, which means that clients can redirect them to any underlying logging system, or just print them to `stderr` if available. ### Logging Format Portability #### Fixed-Width Integers When you have a log statement like
```
int64_t value;
ET_LOG(Error, "Value %??? is bad", value);
```
what should you put for the `%???` part, to match the `int64_t`? On different systems, the `int64_t` typedef might be `int`, `long int`, or `long long int`. Picking a format like `%d`, `%ld`, or `%lld` might work on one target, but break on the others. To be portable, the runtime code uses the standard (but admittedly awkward) helper macros from `<cinttypes>`. Each portable integer type has a corresponding `PRI*` macro, like - `int32_t` -> `PRId32` - `uint32_t` -> `PRIu32` - `int64_t` -> `PRId64` - `uint64_t` -> `PRIu64` - See https://en.cppreference.com/w/cpp/header/cinttypes for more These macros are literal strings that can concatenate with other parts of the format string, like
```
int64_t value;
ET_LOG(Error, "Value %" PRId64 " is bad", value);
```
Note that this requires chopping up the literal format string (the extra double quotes). It also requires the leading `%` before the macro. But, by using these macros, you're guaranteed that the toolchain will use the appropriate format pattern for the type. #### `size_t`, `ssize_t` Unlike the fixed-width integer types, format strings already have a portable way to handle `size_t` and `ssize_t`: - `size_t` -> `%zu` - `ssize_t` -> `%zd` #### Casting Sometimes, especially in code that straddles ATen and lean mode, the type of the value itself might be different across build modes. In those cases, cast the value to the lean mode type, like:
```
ET_CHECK_MSG(
    input.dim() == output.dim(),
    "input.dim() %zd not equal to output.dim() %zd",
    (ssize_t)input.dim(),
    (ssize_t)output.dim());
```
In this case, `Tensor::dim()` returns `ssize_t` in lean mode, while `at::Tensor::dim()` returns `int64_t` in ATen mode. Since they both conceptually return (signed) counts, `ssize_t` is the most appropriate integer type. `int64_t` would work, but it would unnecessarily require 32-bit systems to deal with a 64-bit value in lean mode. This is the only situation where casting should be necessary: when lean and ATen modes disagree. Otherwise, use the format pattern that matches the type. --- # `.ptd` file format ExecuTorch `.ptd` files are serialized as modified binary flatbuffer files with data segments appended. They provide a way to store named data using the FlatTensor format. Named data can be tensors or opaque blob data (usually for backends that do not expose their data format). Code related to the PTD file format is in the `//executorch/extension/flat_tensor/` directory.
``` ┌───────────────────────────────────┐ │Standard flatbuffer header │ ├───────────────────────────────────┤ │ExecuTorch extended header │ ├───────────────────────────────────┤ │Flatbuffer-serialized metadata │ │(FlatTensor) │ │ │ ┌─ ├───────────────────────────────────┤ │ │Padding │ │ ├───────────────────────────────────┤ │ │Data segment │ │ │ │ │ │ │ │ ├───────────────────────────────────┤ │ │Padding │ Blobs ─┤ ├───────────────────────────────────┤ │ │Data segment │ │ │ │ │ │ │ │ ├───────────────────────────────────┤ │ │Padding │ │ ├───────────────────────────────────┤ │ │... │ └─ └───────────────────────────────────┘ ``` ## Compatibility PTD files are designed for storing named data that can be loaded by ExecuTorch models. ## Headers PTD files can be recognized by the magic string at byte offset 4, beginning with `FT` and followed by two ASCII decimal digits (file identifier from the FlatBuffers schema). PTD files have an extended header at byte offset 8, recognized by the magic string `FH01`. This header includes the size and offset information for both the flatbuffer-serialized metadata and the data segments that follow. Note that this header is ExecuTorch-specific, but even when present it does not upset most flatbuffer-parsing code (apart from the rarely-used `GetBufferStartFromRootPointer()`). All numbers are little-endian, regardless of the host system. Header layout: ``` [0..3] uint32_t byte offset to the beginning of the flatbuffer root table. [4..7] File magic bytes: "FT" followed by two ASCII decimal digits. The digits correspond to the FlatBuffers file identifier. Extended header (always present): | [8..11] Extended header magic bytes: "FH01" - FlatTensor Header version 01. | [12..15] uint32_t size of this extended header in bytes, including the magic | header and this size field. Currently fixed at 40 bytes. | [16..23] uint64_t offset (from byte offset zero) to the start of the | flatbuffer data. | [24..31] uint64_t size of the flatbuffer-encoded tensor metadata in bytes. | [32..39] uint64_t offset (from byte offset zero) to the start of the first | data segment. | [40..47] uint64_t total size of all data segments in bytes. End of extended header. ``` Example: ``` Offset to flatbuffer root (0x44) | File magic ("FT01") | | Extended header magic ("FH01") | | | Extended header size (0x28) vvvvvvvvvvv vvvvvvvvvvv vvvvvvvvvvv vvvvvvvvvvv 0x0000 44 00 00 00 46 54 30 31 46 48 30 31 28 00 00 00 0x0010 30 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 0x0020 30 01 00 00 00 00 00 00 20 00 00 00 00 00 00 00 ^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^ | | Flatbuffer size (0x100) | | Segment data size (0x20) Segment base offset (0x130) ``` Note: this example comes from inspecting the ModuleAddMul.ptd file. ``` python -m test.models.export_program --modules "ModuleAddMul" --external-constants --outdir . xxd -l 64 ModuleAddMulProgram.ptd ``` ## FlatTensor See `//executorch/extension/flat_tensor/serialize/flat_tensor.fbs` for the FlatTensor flatbuffer schema. The flatbuffer-encoded metadata follows the headers and contains: - **Schema version**: Version information for compatibility. - **Data segments**: List of segment descriptors with offset and size information. - **Named data**: List of named data entries, each containing: - **Key**: String identifier for the data blob. - **Segment index**: Reference to the data segment containing the blob. - **Tensor layout**: Optional metadata including scalar type, sizes and dim order, if the data segment contains a tensor. 
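Because the headers described above are fixed-size and little-endian, they can be inspected without a flatbuffer parser. The following is a small illustrative sketch (not an official API; the file name is just the example generated above) that decodes the magic and extended-header fields:
```python
import struct

def read_ptd_headers(path):
    """Decode the fixed-size .ptd headers described in the Headers section."""
    with open(path, "rb") as f:
        header = f.read(48)  # both headers fit in the first 48 bytes

    # [0..3] offset to the flatbuffer root table, [4..7] "FT" + two ASCII digits.
    root_offset, magic = struct.unpack_from("<I4s", header, 0)
    if magic[:2] != b"FT":
        raise ValueError(f"not a .ptd file, got magic {magic!r}")

    # [8..11] "FH01", [12..15] extended header size (currently 40 bytes).
    eh_magic, eh_size = struct.unpack_from("<4sI", header, 8)
    if eh_magic != b"FH01":
        raise ValueError(f"unexpected extended header magic {eh_magic!r}")

    # [16..23] flatbuffer offset, [24..31] flatbuffer size,
    # [32..39] segment base offset, [40..47] total segment data size.
    fb_offset, fb_size, seg_base, seg_size = struct.unpack_from("<4Q", header, 16)
    return {
        "flatbuffer_offset": fb_offset,
        "flatbuffer_size": fb_size,
        "segment_base_offset": seg_base,
        "segment_data_size": seg_size,
    }

print(read_ptd_headers("ModuleAddMulProgram.ptd"))
```
Running this against the ModuleAddMulProgram.ptd example above should report values matching the annotated hexdump (for example, a segment base offset of 0x130 and a segment data size of 0x20).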
### Tensor Layout If a data segment contains a canonical tensor, it may have associated layout information: - **Scalar type**: Data type (float32, int32, etc.) using ExecuTorch scalar types. - **Sizes**: Dimensions of the tensor. - **Dim order**: Memory layout order specifying how dimensions are arranged in memory. ## Data segments The `FlatTensor.segments` list in the metadata contains offset and size information about each data segment. Offsets in this list are relative to the segment base offset specified in the extended header. Each segment contains: - **Offset**: Relative offset from the segment base offset. - **Size**: Size of the valid data in bytes (may be followed by padding). ## Named data access Tensors are accessed by string keys through the `named_data` list. Each entry maps a string key to: 1. A segment index pointing to the raw data. 2. Optional tensor layout metadata, if the data segment contains a tensor. This design allows: - Multiple named data blobs to reference the same data segment. - Access to tensor layout data without loading the entire blob. ## Usage PTD files are used to store data outside of the PTE file. Some use-cases: - On-device training: checkpointing for model weights. - Deduplication: sharing model weights between multiple executable PTE files. - Flexible deployment: allow async updates between program and data. --- # `.pte` file format ExecuTorch `.pte` program files are serialized as modified binary flatbuffer files with optional data segments appended. ``` ┌───────────────────────────────────┐ │Standard flatbuffer header │ ├───────────────────────────────────┤ Optional ──> │ExecuTorch extended header │ ├───────────────────────────────────┤ │Flatbuffer-serialized program data │ │ │ │ │ ┌─ ├───────────────────────────────────┤ │ │Padding │ │ ├───────────────────────────────────┤ │ │Segment data │ │ │ │ │ │ │ │ ├───────────────────────────────────┤ │ │Padding │ Optional ─┤ ├───────────────────────────────────┤ │ │Segment data │ │ │ │ │ │ │ │ ├───────────────────────────────────┤ │ │Padding │ │ ├───────────────────────────────────┤ │ │... │ └─ └───────────────────────────────────┘ ``` ## Compatibility See the [Runtime Compatibility Policy]( https://github.com/pytorch/executorch/tree/main/runtime/COMPATIBILITY.md) for details about the compatibility guarantees between the `.pte` format and the ExecuTorch runtime. ## Headers Program files can be recognized by the magic string at byte offset 4, beginning with `ET` and followed by two ASCII decimal digits. Program files may have an optional extended header at byte offset 8, recognized by the magic string beginning with `eh` and followed by two ASCII decimal digits. This header includes the size of the flatbuffer-encoded core program data, and the starting offset of the segments that may follow the program data. Note that this header is ExecuTorch-specific, but even when present it does not upset most flatbuffer-parsing code (apart from the rarely-used `GetBufferStartFromRootPointer()`). All numbers are little-endian, regardless of the host system. Header layout: ``` [0..3] uint32_t byte offset to the beginning of the flatbuffer root table. [4..7] File magic bytes: "ET" followed by two ASCII decimal digits. The digits will change if the binary format of this file is changed in a non-backwards-compatible way. Optional extended header: | [8..11] Extended header magic bytes: "eh" followed by two ASCII decimal | digits. 
The digits will change if the binary format of this header is | changed in a non-backwards-compatible way. | [12..15] uint32_t size of this extended header in bytes, including the magic | header and this size field. Fields can be added to this header in | the future by increasing this size. This size does not include any | padding that may follow the header. | [16..23] uint64_t size of the flatbuffer-encoded program data, starting from | byte offset zero above. I.e., it includes these headers. | [24..31] uint64_t offset (from byte offset zero above) to the start of the | first segment, or zero if there are no segments. | [32..39] uint64_t size of the segment data, i.e. the size from the segment_base_offset | to the end of the segments. Note, the last segment should not have any | trailing padding. | [40..?] Any zero-padding necessary to preserve the alignment of the data | that follows. End of optional extended header. ``` Example: ``` Offset to flatbuffer root (0x38) | File magic ("ET??") | | Extended header magic ("eh??") | | | Extended header size (0x20) vvvvvvvvvvv vvvvvvvvvvv vvvvvvvvvvv vvvvvvvvvvv 0x0000 38 00 00 00 45 54 3F 3F 65 68 3F 3F 20 00 00 00 0x0010 F0 02 00 00 00 00 00 00 00 10 00 00 00 00 00 00 0x0020 20 ^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^ | Offset to segments (0x1000) Size of program flatbuffer data (0x2f0) | Segment data size (0x20) ``` ## Program data See `//executorch/schema/program.fbs` for the Program flatbuffer schema. The flatbuffer-encoded program data follows the headers. By embedding the size of this region in the extended header, clients can read only the program data without reading in segment data. This is useful because program data typically sticks around for the lifetime of a model, while the large segment data is often freeable after model initialization. ## Segment data The first segment starts at the offset embedded in the extended header. Segments are typically aligned to 4096 or some other power of 2 that matches the target system's memory page size. This makes it easier to use `mmap()` if desired. The `Program.segments` array in the program data contains size/offset information about the segments that optionally follow. Offsets in this array are relative to the segment base offset in the extended header. --- (quantization-optimization)= # Quantization & Optimization Advanced techniques for model compression and performance optimization. ## Quantization Strategies - {doc}`quantization-overview` — Comprehensive quantization strategies and techniques ## Performance Optimization - {doc}`runtime-profiling` — Performance profiling and optimization techniques
```{toctree}
:hidden:
:maxdepth: 1

quantization-overview
runtime-profiling
```
--- # Quantization Overview Quantization is a technique that reduces the precision of numbers used in a model's computations and stored weights—typically from 32-bit floats to 8-bit integers. This reduces the model's memory footprint, speeds up inference, and lowers power consumption, often with minimal loss in accuracy. Quantization is especially important for deploying models on edge devices such as wearables, embedded systems, and microcontrollers, which often have limited compute, memory, and battery capacity. By quantizing models, we can make them significantly more efficient and suitable for these resource-constrained environments. # Quantization in ExecuTorch ExecuTorch uses [torchao](https://github.com/pytorch/ao/tree/main/torchao) as its quantization library.
This integration allows ExecuTorch to leverage PyTorch-native tools for preparing, calibrating, and converting quantized models. Quantization in ExecuTorch is backend-specific. Each backend defines how models should be quantized based on its hardware capabilities. Most ExecuTorch backends use the torchao [PT2E quantization](https://docs.pytorch.org/ao/main/tutorials_source/pt2e_quant_ptq.html) flow, which works on models exported with torch.export and enables quantization that is tailored for each backend. The PT2E quantization workflow has three main steps: 1. Configure a backend-specific quantizer. 2. Prepare, calibrate, convert, and evaluate the quantized model in PyTorch. 3. Lower the model to the target backend. ## 1. Configure a Backend-Specific Quantizer Each backend provides its own quantizer (e.g., XNNPACKQuantizer, CoreMLQuantizer) that defines how quantization should be applied to a model in a way that is compatible with the target hardware. These quantizers usually support configs that allow users to specify quantization options such as: * Precision (e.g., 8-bit or 4-bit) * Quantization type (e.g., dynamic, static, or weight-only quantization) * Granularity (e.g., per-tensor, per-channel) Not all quantization options are supported by all backends. Consult backend-specific guides for supported quantization modes and configuration, and how to initialize the backend-specific PT2E quantizer: * [XNNPACK quantization](backends/xnnpack/xnnpack-quantization.md) * [CoreML quantization](backends/coreml/coreml-quantization.md) * [QNN quantization](backends-qualcomm.md#step-2-optional-quantize-your-model) ## 2. Quantize and evaluate the model After the backend-specific quantizer is defined, the PT2E quantization flow is the same for all backends. A generic example is provided below, but specific examples are given in backend documentation:
```python
import torch
from torchao.quantization.pt2e.quantize_pt2e import convert_pt2e, prepare_pt2e

training_gm = torch.export.export(model, sample_inputs).module()

# Prepare the model for quantization using the backend-specific quantizer instance
prepared_model = prepare_pt2e(training_gm, quantizer)

# Calibrate the model on representative data
for sample in calibration_data:
    prepared_model(sample)

# Convert the calibrated model to a quantized model
quantized_model = convert_pt2e(prepared_model)
```
The `quantized_model` is a PyTorch model like any other, and can be evaluated on different tasks for accuracy. Task-specific benchmarks are the recommended way to evaluate your quantized model, but as a crude alternative you can compare its outputs with those of the original model using generic error metrics like SQNR:
```python
from torchao.quantization.utils import compute_error

out_reference = model(sample)
out_quantized = quantized_model(sample)
sqnr = compute_error(out_reference, out_quantized)  # SQNR error
```
Note that numerics on device can differ from those in PyTorch even for unquantized models, and accuracy evaluation can also be done with pybindings or on device. ## 3. Lower the model The final step is to lower the `quantized_model` to the desired backend, as you would an unquantized one. See [backend-specific pages](backends-overview.md) for lowering information. --- # Quantization
```{toctree}
:maxdepth: 1

quantization-overview
```
--- (quick-start-section)= # Quick Start Get started with ExecuTorch in just a few steps. This section walks you through the essential steps to get ExecuTorch up and running, from initial setup to exporting your first model for edge deployment.
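As a preview of where these steps lead, here is a minimal sketch of exporting a model to a `.pte` file (the tiny module and file name are placeholder examples; the export guide below covers the full flow, including backend lowering and delegation):
```python
import torch
from executorch.exir import to_edge

class TinyModel(torch.nn.Module):
    def forward(self, x):
        return torch.nn.functional.relu(x) + 1.0

model = TinyModel().eval()
example_inputs = (torch.randn(1, 8),)

# Capture the model with torch.export, convert it to the Edge dialect,
# then serialize the resulting ExecuTorch program to a .pte file.
exported_program = torch.export.export(model, example_inputs)
executorch_program = to_edge(exported_program).to_executorch()

with open("tiny_model.pte", "wb") as f:
    f.write(executorch_program.buffer)
```
The resulting `tiny_model.pte` is what the ExecuTorch runtime loads on device.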
## What You'll Learn Follow these guides in order to get started with ExecuTorch: - **{doc}`getting-started`** - Initial Setup: Set up your development environment and run your first ExecuTorch example. - **{doc}`using-executorch-export`** - Exporting your model: Export for Edge deployment. - **{doc}`using-executorch-building-from-source`** - Building from Source: Build ExecuTorch from source for custom configurations and development. ## Prerequisites - Python 3.10-3.13 - PyTorch 2.9+ - Basic familiarity with PyTorch model development ## Next Steps After completing the quick start, explore: - **{doc}`edge-platforms-section`** - Deploy to specific platforms (Android, iOS, Desktop, Embedded) - **{doc}`backends-section`** - Choose the right acceleration backend for your hardware
```{toctree}
:hidden:
:maxdepth: 2
:caption: Quick Start Guide

getting-started
using-executorch-export
using-executorch-building-from-source
```
--- # ExecuTorch on Raspberry Pi ## TLDR This tutorial demonstrates how to deploy **Llama models on Raspberry Pi 4/5 devices** using ExecuTorch: - **Prerequisites**: Linux host machine, Python 3.10-3.13, conda environment, Raspberry Pi 4/5 - **Setup**: Automated cross-compilation using `setup.sh` script for ARM toolchain installation - **Export**: Convert Llama models to optimized `.pte` format with quantization options - **Deploy**: Transfer binaries to Raspberry Pi and configure runtime libraries - **Optimize**: Build optimization and performance tuning techniques - **Result**: Efficient on-device Llama inference ## Prerequisites and Hardware Requirements ### Host Machine Requirements **Operating System**: Linux x86_64 (Ubuntu 20.04+ or CentOS Stream 9+) **Software Dependencies**: - **Python 3.10-3.13** (ExecuTorch requirement) - **conda** or **venv** for environment management - **CMake 3.29.6+** - **Git** for repository cloning ### Target Device Requirements **Supported Devices**: **Raspberry Pi 4** and **Raspberry Pi 5** with **64-bit OS** **Memory Requirements**: - **RAM & Storage**: Varies by model size and optimization level - **64-bit Raspberry Pi OS** (Bullseye or newer) ### Verification Commands Verify your host machine compatibility:
```bash
# Check OS and architecture
uname -s # Should output: Linux
uname -m # Should output: x86_64

# Check Python version
python3 --version # Should be 3.10-3.13

# Check required tools
hash cmake git md5sum 2>/dev/null || echo "Missing required tools"
cmake --version # Should be 3.29.6+ at minimum
```
## Development Environment Setup ### Clone ExecuTorch Repository First, clone the ExecuTorch repository with Raspberry Pi support:
```bash
# Create project directory
mkdir ~/executorch-rpi && cd ~/executorch-rpi && git clone -b release/1.0 https://github.com/pytorch/executorch.git && cd executorch
```
### Create Conda Environment
```bash
# Create conda environment
conda create -yn executorch python=3.10.0
conda activate executorch

# Upgrade pip
pip install --upgrade pip
```
Alternative: Virtual Environment. If you prefer Python's built-in virtual environment:
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
```
Refer to → {doc}`getting-started` for more details.
## Cross-Compilation Toolchain Step Run the following script on your Linux host machine: ```bash # Run the Raspberry Pi setup script for Pi 5 examples/raspberry_pi/setup.sh pi5 ``` On successful completion, you should see the following output: ```bash [100%] Linking CXX executable llama_main [100%] Built target llama_main [SUCCESS] LLaMA runner built successfully ==== Verifying Build Outputs ==== [SUCCESS] ✓ llama_main (6.1M) [SUCCESS] ✓ libllama_runner.so (4.0M) [SUCCESS] ✓ libextension_module.a (89K) - static library ✓ ExecuTorch cross-compilation setup completed successfully! ``` ## Model Preparation and Export ### Download Llama Models Download the Llama model from Hugging Face or any other source, and make sure that following files exist. - consolidated.00.pth (model weights) - params.json (model config) - tokenizer.model (tokenizer) ### Export Llama to ExecuTorch Format After downloading the Llama model, export it to ExecuTorch format using the provided script: ```bash #### Set these paths to point to the exported files. Following is an example instruction to export a llama model LLAMA_QUANTIZED_CHECKPOINT=path/to/consolidated.00.pth LLAMA_PARAMS=path/to/params.json python -m extension.llm.export.export_llm \ --config examples/models/llama/config/llama_xnnpack_spinquant.yaml \ +base.model_class="llama3_2" \ +base.checkpoint="${LLAMA_QUANTIZED_CHECKPOINT:?}" \ +base.params="${LLAMA_PARAMS:?}" ``` The file llama3_2.pte will be generated at the place where you run the command - For more details see [Option A: Download and Export Llama3.2 1B/3B Model](https://github.com/pytorch/executorch/blob/main/examples/models/llama/README.md#option-a-download-and-export-llama32-1b3b-model) - Also refer to → {doc}`llm/export-llm` for more details. ## Raspberry Pi Deployment ### Transfer Binaries to Raspberry Pi After successful cross-compilation, transfer the required files: ```bash ##### Set Raspberry Pi details export RPI_UN="pi" # Your Raspberry Pi username export RPI_IP="your-rpi-ip-address" ##### Create deployment directory on Raspberry Pi ssh $RPI_UN@$RPI_IP 'mkdir -p ~/executorch-deployment' ##### Copy main executable scp cmake-out/examples/models/llama/llama_main $RPI_UN@$RPI_IP:~/executorch-deployment/ ##### Copy runtime library scp cmake-out/examples/models/llama/runner/libllama_runner.so $RPI_UN@$RPI_IP:~/executorch-deployment/ ##### Copy model file scp llama3_2.pte $RPI_UN@$RPI_IP:~/executorch-deployment/ scp ./tokenizer.model $RPI_UN@$RPI_IP:~/executorch-deployment/ ``` ### Configure Runtime Libraries on Raspberry Pi SSH into your Raspberry Pi and configure the runtime: #### Set up library environment ```bash cd ~/executorch-deployment echo 'export LD_LIBRARY_PATH=$(pwd):$LD_LIBRARY_PATH' > setup_env.sh chmod +x setup_env.sh #### Make executable chmod +x llama_main ``` ## Dry Run ```bash source setup_env.sh ./llama_main --help ``` Make sure that the output does not have any GLIBC / other library mismatch errors in the output. If you see any, follow the troubleshooting steps below. ## Troubleshooting ### Issue 1: GLIBC Version Mismatch **Problem:** The binary was compiled with a newer GLIBC version (2.38) than what's available on your Raspberry Pi (2.36). 
**Error Symptoms:** ```bash ./llama_main: /lib/aarch64-linux-gnu/libm.so.6: version `GLIBC_2.38' not found (required by ./llama_main) ./llama_main: /lib/aarch64-linux-gnu/libc.so.6: version `GLIBC_2.38' not found (required by ./llama_main) ./llama_main: /lib/aarch64-linux-gnu/libstdc++.so.6: version `CXXABI_1.3.15' not found (required by ./llama_main) ./llama_main: /lib/aarch64-linux-gnu/libc.so.6: version `GLIBC_2.38' not found (required by /lib/libllama_runner.so) ``` **There are two potential solutions:** - **Solution A**: Modify the Pi to match the binary (run on Pi) - **Solution B**: Modify the binary to match the Pi (run on host) #### Solution A: Upgrade GLIBC on Raspberry Pi (Recommended) 1. **Check your current GLIBC version:** ```bash ldd --version # Output: ldd (Debian GLIBC 2.36-9+rpt2+deb12u12) 2.36 ``` 2. **⚠️ Compatibility Warning and Safety Check:** ```bash # Just check and warn - don't do the upgrade current_glibc=$(ldd --version | head -n1 | grep -o '[0-9]\+\.[0-9]\+') required_glibc="2.38" echo "Current GLIBC: $current_glibc" echo "Required GLIBC: $required_glibc" if [[ $(echo "$current_glibc < $required_glibc" | bc -l) -eq 1 ]]; then echo "" echo "⚠️ WARNING: Your GLIBC version is too old" echo " You need to upgrade to continue with the next steps" echo " Consider using Solution B (rebuild binary) for better safety" echo "" else echo "✅ Your GLIBC version is already compatible" fi ``` **NOTE:** If the output shows "⚠️ WARNING: Your GLIBC version is too old", proceed with either Upgrade / Step #3 below (or) Solution B. Otherwise skip the next step as your device is __already compatible__ and directly go to Step#4. 3. **Upgrade to newer GLIBC:** ```bash # Add Debian unstable repository echo "deb http://deb.debian.org/debian sid main contrib non-free" | sudo tee -a /etc/apt/sources.list # Update package lists sudo apt update # Install newer GLIBC packages sudo apt-get -t sid install libc6 libstdc++6 # Reboot system sudo reboot ``` 4. **Verify compatibility after reboot:** ```bash cd ~/executorch-deployment source setup_env.sh # Test that the binary works if ./llama_main --help &>/dev/null; then echo "✅ GLIBC upgrade successful - binary is compatible" else echo "❌ GLIBC upgrade failed - binary still incompatible" echo "Consider rolling back or refer to documentation for troubleshooting" fi ``` 5. **Test the fix:** ```bash cd ~/executorch-deployment source setup_env.sh ./llama_main --model_path ./llama3_2.pte --tokenizer_path ./tokenizer.model --seq_len 128 --prompt "Hello" ``` **Important Notes:** - Select "Yes" when prompted to restart services - Press Enter to keep current version for configuration files - Backup important data before upgrading #### Solution B: Rebuild with Raspberry Pi's GLIBC (Advanced) If you prefer not to upgrade your Raspberry Pi system: 1. **Copy Pi's filesystem to host machine:** ```bash # On Raspberry Pi - install rsync ssh pi@ sudo apt update && sudo apt install rsync exit # On host machine - copy Pi's filesystem mkdir -p ~/rpi5-sysroot rsync -aAXv --exclude={"/proc","/sys","/dev","/run","/tmp","/mnt","/media","/lost+found"} \ pi@:/ ~/rpi5-sysroot ``` 2. **Update CMake toolchain file:** ```bash # Edit arm-toolchain-pi5.cmake # Replace this line: # set(CMAKE_SYSROOT "${TOOLCHAIN_PATH}/aarch64-none-linux-gnu/libc") # With this: set(CMAKE_SYSROOT "/home/yourusername/rpi5-sysroot") set(CMAKE_FIND_ROOT_PATH "${CMAKE_SYSROOT}") ``` 3. 
**Rebuild binaries:** ```bash # Clean and rebuild rm -rf cmake-out ./examples/raspberry_pi/rpi_setup.sh pi5 --force-rebuild # Verify GLIBC version strings ./cmake-out/examples/models/llama/llama_main | grep GLIBC_ # Should show max GLIBC_2.36 (matching your Pi) ``` --- ### Issue 2: Library Not Found **Problem:** Required libraries are not found at runtime. **Error Symptoms:** ```bash ./llama_main: error while loading shared libraries: libllama_runner.so: cannot open shared object file ``` **Solution:** ```bash # Ensure you're in the correct directory and environment is set cd ~/executorch-deployment source setup_env.sh ./llama_main --help ``` **Root Cause:** Either `LD_LIBRARY_PATH` is not set or you're not in the deployment directory. --- ### Issue 3: Tokenizer JSON Parsing Warnings **Problem:** Warning messages about JSON parsing errors after running the llama_main binary. **Error Symptoms:** ```bash E tokenizers:hf_tokenizer.cpp:60] Error parsing json file: [json.exception.parse_error.101] ``` **Solution:** These warnings can be safely ignored. They don't affect model inference. --- ## Quick Test Command After resolving issues, test with: ```bash cd ~/executorch-deployment source setup_env.sh ./llama_main --model_path ./llama3_2.pte --tokenizer_path ./tokenizer.model --seq_len 128 --prompt "What is the meaning of life?" ``` ## Debugging Tools Enable ExecuTorch logging: ```bash # Set log level for debugging export ET_LOG_LEVEL=Info ./llama_main --model_path ./model.pte --verbose ``` ## Final Run command ```bash cd ~/executorch-deployment source setup_env.sh ./llama_main --model_path ./llama3_2.pte --tokenizer_path ./tokenizer.model --seq_len 128 --prompt "What is the meaning of life?" ``` Happy Inferencing! --- # Detailed C++ Runtime APIs Tutorial **Author:** [Jacob Szwejbka](https://github.com/JacobSzwejbka) In this tutorial, we will cover how to run an ExecuTorch model in C++ using the more detailed, lower-level APIs: prepare the `MemoryManager`, set inputs, execute the model, and retrieve outputs. However, if you’re looking for a simpler interface that works out of the box, consider trying the [Module Extension Tutorial](extension-module.md) and [Using ExecuTorch with C++](using-executorch-cpp.md). For a high level overview of the ExecuTorch Runtime please see [Runtime Overview](runtime-overview.md), and for more in-depth documentation on each API please see the [Runtime API Reference](executorch-runtime-api-reference.rst). [Here](https://github.com/pytorch/executorch/blob/main/examples/portable/executor_runner/executor_runner.cpp) is a fully functional version C++ model runner, and the [Setting up ExecuTorch](getting-started-setup.rst) doc shows how to build and run it. ## Prerequisites You will need an ExecuTorch model to follow along. We will be using the model `SimpleConv` generated from the [Exporting to ExecuTorch tutorial](tutorials/export-to-executorch-tutorial) . ## Model Loading The first step towards running your model is to load it. ExecuTorch uses an abstraction called a `DataLoader` to handle the specifics of retrieving the `.pte` file data, and then `Program` represents the loaded state. Users can define their own `DataLoader`s to fit the needs of their particular system. In this tutorial we will be using the `FileDataLoader`, but you can look under [Example Data Loader Implementations](https://github.com/pytorch/executorch/tree/main/extension/data_loader) to see other options provided by the ExecuTorch project. 
For the `FileDataLoader` all we need to do is provide a file path to the constructor.
``` cpp
using executorch::aten::Tensor;
using executorch::aten::TensorImpl;
using executorch::extension::FileDataLoader;
using executorch::extension::MallocMemoryAllocator;
using executorch::runtime::Error;
using executorch::runtime::EValue;
using executorch::runtime::HierarchicalAllocator;
using executorch::runtime::MemoryManager;
using executorch::runtime::Method;
using executorch::runtime::MethodMeta;
using executorch::runtime::Program;
using executorch::runtime::Result;
using executorch::runtime::Span;

Result<FileDataLoader> loader = FileDataLoader::from("/tmp/model.pte");
assert(loader.ok());

Result<Program> program = Program::load(&loader.get());
assert(program.ok());
```
## Setting Up the MemoryManager Next we will set up the `MemoryManager`. One of the principles of ExecuTorch is giving users control over where the memory used by the runtime comes from. Today (late 2023) users need to provide two different allocators: * Method Allocator: A `MemoryAllocator` used to allocate runtime structures at `Method` load time. Things like Tensor metadata, the internal chain of instructions, and other runtime state come from this. * Planned Memory: A `HierarchicalAllocator` containing 1 or more memory arenas where internal mutable tensor data buffers are placed. At `Method` load time internal tensors have their data pointers assigned to various offsets within those arenas. The positions of those offsets and the sizes of the arenas are determined by memory planning ahead of time. For this example we will retrieve the size of the planned memory arenas dynamically from the `Program`, but for heapless environments users could retrieve this information from the `Program` ahead of time and allocate the arena statically. We will also be using a malloc-based allocator for the method allocator.
``` cpp
// Method names map back to Python nn.Module method names. Most users will only
// have the singular method "forward".
const char* method_name = "forward";

// MethodMeta is a lightweight structure that lets us gather metadata
// information about a specific method. In this case we are looking to get the
// required size of the memory planned buffers for the method "forward".
Result<MethodMeta> method_meta = program->method_meta(method_name);
assert(method_meta.ok());

std::vector<std::unique_ptr<uint8_t[]>> planned_buffers; // Owns the Memory
std::vector<Span<uint8_t>> planned_arenas; // Passed to the allocator

size_t num_memory_planned_buffers = method_meta->num_memory_planned_buffers();

// It is possible to have multiple layers in our memory hierarchy; for example,
// SRAM and DRAM.
for (size_t id = 0; id < num_memory_planned_buffers; ++id) {
  // .get() will always succeed because id < num_memory_planned_buffers.
  size_t buffer_size = static_cast<size_t>(
      method_meta->memory_planned_buffer_size(id).get());
  planned_buffers.push_back(std::make_unique<uint8_t[]>(buffer_size));
  planned_arenas.push_back({planned_buffers.back().get(), buffer_size});
}
HierarchicalAllocator planned_memory(
    {planned_arenas.data(), planned_arenas.size()});

// Version of MemoryAllocator that uses malloc to handle allocations rather than
// a fixed buffer.
MallocMemoryAllocator method_allocator;

// Assemble all of the allocators into the MemoryManager that the Executor will
// use.
MemoryManager memory_manager(&method_allocator, &planned_memory);
```
## Loading a Method In ExecuTorch we load and initialize from the `Program` at a method granularity. Many programs will only have one method 'forward'.
`load_method` is where initialization is done, from setting up tensor metadata, to initializing delegates, etc.
``` cpp
Result<Method> method = program->load_method(method_name);
assert(method.ok());
```
## Setting Inputs Now that we have our method we need to set up its inputs before we can perform an inference. In this case we know our model takes a single (1, 3, 256, 256) sized float tensor. Depending on how your model was memory planned, the planned memory may or may not contain buffer space for your inputs and outputs. If the outputs were not memory planned then users will need to set up the output data pointer with `set_output_data_ptr`. In this case we will just assume our model was exported with inputs and outputs handled by the memory plan.
``` cpp
// Create our input tensor.
float data[1 * 3 * 256 * 256];
Tensor::SizesType sizes[] = {1, 3, 256, 256};
Tensor::DimOrderType dim_order[] = {0, 1, 2, 3};
TensorImpl impl(
    ScalarType::Float, // dtype
    4, // number of dimensions
    sizes,
    data,
    dim_order);
Tensor t(&impl);

// Implicitly casts t to EValue
Error set_input_error = method->set_input(t, 0);
assert(set_input_error == Error::Ok);
```
## Perform an Inference Now that our method is loaded and our inputs are set we can perform an inference. We do this by calling `execute`.
``` cpp
Error execute_error = method->execute();
assert(execute_error == Error::Ok);
```
## Retrieve Outputs Once our inference completes we can retrieve our output. We know that our model only returns a single output tensor. One potential pitfall here is that the output we get back is owned by the `Method`. Users should take care to clone their output before performing any mutations on it, or if they need it to have a lifespan separate from the `Method`.
``` cpp
EValue output = method->get_output(0);
assert(output.isTensor());
```
## Conclusion This tutorial demonstrated how to run an ExecuTorch model using low-level runtime APIs, which offer granular control over memory management and execution. However, for most use cases, we recommend using the Module APIs, which provide a more streamlined experience without sacrificing flexibility. For more details, check out the [Module Extension Tutorial](extension-module.md). --- # Backend Delegate Implementation and Linking Please refer to: - The "Runtime Initialization and Execution" section of [Compiler Backend and Delegate](compiler-delegate-and-partitioner.md). - [Integrating a Backend Delegate into ExecuTorch](backend-delegates-integration.md). - [Third-Party Dependency Management for Backend Delegates](backend-delegates-dependencies.md). --- (runtime-integration-advanced)= # Runtime & Integration Advanced runtime integration topics ## Platform Integration - {doc}`runtime-platform-abstraction-layer` — Platform abstraction layer for cross-platform deployment ## Portable C++ Programming - {doc}`portable-cpp-programming` — Portable C++ programming for cross-platform deployment
```{toctree}
:hidden:
:maxdepth: 1

runtime-platform-abstraction-layer
portable-cpp-programming
```
--- # ExecuTorch Runtime Overview This document discusses the design of the ExecuTorch runtime, which executes ExecuTorch program files on edge devices like smartphones, wearables, and embedded devices. The code for the main execution API is under [`executorch/runtime/executor/`](https://github.com/pytorch/executorch/tree/main/runtime/executor). Before reading this document we recommend that you read [How ExecuTorch Works](intro-how-it-works.md).
At the highest level, the ExecuTorch runtime is responsible for: * Loading binary `.pte` program files that were generated by the [`to_executorch()`](tutorials/export-to-executorch-tutorial) step of the model-lowering process. * Executing the series of instructions that implement a lowered model. Note that as of late 2023, the ExecuTorch runtime only supports model inference, and does not yet support training. This diagram shows the high-level flow of, and components involved with, exporting and executing an ExecuTorch program: ![High-level diagram of the ExecuTorch Runtime](_static/img/runtime-overview-high-level.png) The runtime is also responsible for: * Managing the memory used during load and execution, potentially across multiple memory banks like SRAM and DRAM. * Mapping symbolic operator names like `"aten::add.out"` to concrete C++ functions or [_kernels_](kernel-library-overview.md) that implement the semantics of those operators. * Dispatching predetermined sections of the model to [backend delegates](compiler-delegate-and-partitioner.md) for acceleration. * Optionally gathering [profiling data](runtime-profiling.md) during load and execution. ## Design Goals The ExecuTorch runtime was designed to run on a wide variety of edge devices, from modern smartphone CPUs to resource-constrained microcontrollers and DSPs. It has first-class support for [delegating](compiler-delegate-and-partitioner.md) execution to one or more backends to take advantage of architecture-specific optimizations and modern heterogeneous architectures. It is small and portable enough to run directly in bare-metal embedded environments with no operating systems, dynamic memory, or threads. ### Low Execution Overhead #### Memory * The core runtime library is less than 50kB when built without kernels or backends. * Constant tensors point directly into the `.pte` file data, avoiding copies of that data. The alignment of these data chunks can be adjusted at `.pte` creation time. * Backend delegates can choose to unload their precompiled data after model initialization, reducing peak memory usage. * Mutable tensor memory layout is planned ahead of time and packed into a small set of user-allocated buffers, providing fine-grained control over memory location. This is especially useful on systems with heterogeneous memory hierarchies, allowing placement onto (e.g.) SRAM or DRAM close to the core that will operate on the data. #### CPU * Model execution is a simple loop over an array of instructions, most of which are function pointers to kernels and backend delegates. This keeps the execution overhead small, on the order of microseconds to nanoseconds per operation. * The implementation of an operation (like "add" or "conv3d") can be fully customized for a particular target system without needing to modify the original model or generated `.pte` file. ### Familiar PyTorch Semantics ExecuTorch is a first-class component of the PyTorch stack, and reuses APIs and semantics whenever possible. * The C++ types used by ExecuTorch are source-compatible with the corresponding types from core PyTorch's `c10::` and `at::` libraries, and ExecuTorch provides [`aten_bridge`](https://github.com/pytorch/executorch/blob/main/extension/aten_util/aten_bridge.h) to convert between the two. This can be helpful for projects that already use PyTorch C++ types. * The semantics of operators like `aten::add` and `aten::sigmoid` are identical between ExecuTorch and core PyTorch. 
ExecuTorch provides a testing framework to ensure this, and to help test future implementations of these operators. ### Portable Code and Architecture The ExecuTorch runtime is implemented with portability in mind, so that users can build it for a wide variety of target systems. #### C++ Language Considerations * The code is C++17-compatible to work with older toolchains. * The runtime does not use exceptions or RTTI, although it is not antagonistic to them. * The code is compatible with GCC and Clang, and has also been built with several proprietary embedded toolchains. * The repo provides CMake build system to make integration easier. #### Operating System Considerations The runtime makes no direct system calls. All access to memory, files, logging, and clocks are abstracted through the [_Runtime Platform Abstraction Layer (PAL)_](runtime-platform-abstraction-layer.md) and injected interfaces like `DataLoader` and `MemoryAllocator`. See the [runtime api reference](executorch-runtime-api-reference.rst) to learn more. Applications can control all memory allocation through the `MemoryManager`, `MemoryAllocator`, `HierarchicalAllocator`, and `DataLoader` classes. The core runtime makes no direct calls to `malloc()` or `new`, or to types like `std::vector` that allocate under the hood. This makes it possible to: * Run in environments without a heap, but still use the heap if desired. * Avoid synchronization on the heap during model load and execution. * Control which memory region to use for different types of data. For example, one set of mutable tensors could live in SRAM while another set lived in DRAM. * Easily monitor how much memory the runtime uses. However, please note that specific kernel or backend implementations may use arbitrary runtime or operating system features. Users should double-check the docs for the kernel and backend libraries that they use. #### Threading Considerations The core runtime does no threading or locking, and does not use thread local variables. But, it plays well with higher-level synchronization. * Each `Program` instance is immutable and therefore _[fully thread-safe](https://faithlife.codes/blog/2008/03/degrees_of_thread_safety/#thread-safe)_. Multiple threads may concurrently access a single `Program` instance. * Each `Method` instance is mutable but self-contained, and therefore _[conditionally thread-safe](https://faithlife.codes/blog/2008/03/degrees_of_thread_safety/#conditionally-thread-safe)_. Multiple threads can concurrently access and execute independent `Method` instances, but access and execution of a single instance must be serialized. However, please note: * There are two global tables that may be read during `Program::load_method()`: the kernel registration table and the backend registration table. * In practice, these tables are only modified at process/system load time, and are effectively frozen before the first `Program` is loaded. But some applications may need to be aware of these tables, especially if they manually mutate them after process/system load time. * Specific kernel or backend implementations may have their own threading restrictions. Users should double-check the docs for the kernel and backend libraries that they use. 
## Further Reading For more details about the ExecuTorch runtime, please see: * [Using ExecuTorch with C++](using-executorch-cpp.md) * [Detailed Runtime APIs Tutorial](running-a-model-cpp-tutorial.md) * [Simplified Runtime APIs Tutorial](extension-module.md) * [Building from Source](using-executorch-building-from-source.md) * [Runtime Platform Abstraction Layer](runtime-platform-abstraction-layer.md) * [Runtime Profiling](runtime-profiling.md) * [Backends and Delegates](compiler-delegate-and-partitioner.md) * [Backend Delegate Implementation](runtime-backend-delegate-implementation-and-linking.md) * [Kernel Library Overview](kernel-library-overview.md) --- # Runtime Platform Abstraction Layer (PAL) The ExecuTorch _Platform Abstraction Layer_ (PAL) provides a way for execution environments to override operations like: - Getting the current time. - Printing a log statement. - Panicking the process/system. The PAL function declarations are in [`executorch/runtime/platform/platform.h`](https://github.com/pytorch/executorch/blob/main/runtime/platform/platform.h). ## Overriding the default PAL The default PAL implementation is in [`executorch/runtime/platform/default/posix.cpp`](https://github.com/pytorch/executorch/blob/main/runtime/platform/default/posix.cpp). It uses `std::chrono::steady_clock` for the time, prints log messages to `stderr`, and makes other default assumptions. But, if these defaults don't work for your system, you can override the default PAL by: - Including [`executorch/runtime/platform/platform.h`](https://github.com/pytorch/executorch/blob/main/runtime/platform/platform.h) in one of your application's `.c` or `.cpp` files. - Defining an implementation of one or more of the `et_pal_*()` functions. The default PAL functions are weak symbols, so providing your own strong-symbol definition can override them at link time. To ensure that your definitions take precedence, you may need to ensure that the strong definitions precede the weak definitions in the link order. ## Minimal PAL If you run into build problems because your system doesn't support the functions called by `posix.cpp`, you can instead use the no-op minimal PAL at [`executorch/runtime/platform/default/minimal.cpp`](https://github.com/pytorch/executorch/blob/main/runtime/platform/default/minimal.cpp) by passing `-DEXECUTORCH_PAL_DEFAULT=minimal` to `cmake`. This will avoid calling `fprintf()`, `std::chrono::steady_clock`, and anything else that `posix.cpp` uses. But since the `minimal.cpp` `et_pal_*()` functions are no-ops, you will need to override all of them. --- # Profiling Models in ExecuTorch Profiling in ExecuTorch gives users access to these runtime metrics: - Model Load Time. - Operator Level Execution Time. - Delegate Execution Time. - If the delegate that the user is calling into has been integrated with the [Developer Tools](delegate-debugging.md), then users will also be able to access delegated operator execution time. - End-to-end Inference Execution Time. One unique aspect of ExecuTorch profiling is the ability to link every operator executed at runtime back to the exact line of Python code from which it originated. This capability enables users to easily identify hotspots in their model, trace them back to the exact line of Python code, and optimize where needed. We provide access to all the profiling data via the Python [Inspector API](model-inspector.rst). The data mentioned above can be accessed through these interfaces, allowing users to perform any post-run analysis of their choice.
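For example, once you have an ETDump (and, optionally, an ETRecord) produced by the steps below, a typical post-run analysis with the Inspector looks roughly like this sketch (the file paths are placeholders):
```python
from executorch.devtools import Inspector

# Load the profiling data emitted by the runtime (ETDump), optionally linking
# events back to the exported model via an ETRecord.
inspector = Inspector(etdump_path="./etdump.etdp", etrecord="./etrecord.bin")

# Print a table of the timing events collected during the profiled run.
inspector.print_data_tabular()
```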
## Steps to Profile a Model in ExecuTorch 1. [Optional] Generate an [ETRecord](etrecord.rst) while you're exporting your model. If provided this will enable users to link back profiling details to eager model source code (with stack traces and module hierarchy). 2. Build the runtime with the pre-processor flags that enable profiling. Detailed in the [ETDump documentation](etdump.md). 3. Run your Program on the ExecuTorch runtime and generate an [ETDump](etdump.md). 4. Create an instance of the [Inspector API](model-inspector.rst) by passing in the ETDump you have sourced from the runtime along with the optionally generated ETRecord from step 1. - Through the Inspector API, users can do a wide range of analysis varying from printing out performance details to doing more finer granular calculation on module level. Please refer to the [Developer Tools tutorial](tutorials/devtools-integration-tutorial) for a step-by-step walkthrough of the above process on a sample model. --- # Runtime ```{toctree} :maxdepth: 1 runtime-overview extension-module extension-tensor running-a-model-cpp-tutorial runtime-backend-delegate-implementation-and-linking runtime-platform-abstraction-layer portable-cpp-programming pte-file-format ptd-file-format ``` --- (success-stories)= # Success Stories Discover how organizations are leveraging ExecuTorch to deploy AI models at scale on edge devices. --- ## Featured Success Stories ::::{grid} 1 :gutter: 3 :::{grid-item-card} **Meta's Family of Apps** :class-header: bg-primary text-white **Industry:** Social Media & Messaging **Hardware:** Android & iOS Devices **Impact:** Billions of users, latency reduction Powers Instagram, WhatsApp, Facebook, and Messenger with real-time on-device AI for content ranking, recommendations, and privacy-preserving features at scale. [Read Blog →](https://engineering.fb.com/2025/07/28/android/executorch-on-device-ml-meta-family-of-apps/) ::: :::{grid-item-card} **Meta Quest & Ray-Ban Smart Glasses** :class-header: bg-success text-white **Industry:** AR/VR & Wearables **Hardware:** Quest 3, Ray-Ban Meta Smart Glasses, Meta Ray-Ban Display Enables real-time computer vision, hand tracking, voice commands, and translation on power-constrained wearable devices. [Read Blog →](https://ai.meta.com/blog/executorch-reality-labs-on-device-ai/) ::: :::{grid-item-card} **Liquid AI: Efficient, Flexible On-Device Intelligence** :class-header: bg-info text-white **Industry:** Artificial Intelligence / Edge Computing **Hardware:** CPU via PyTorch ExecuTorch **Impact:** 2× faster inference, lower latency, seamless multimodal deployment Liquid AI builds foundation models that make AI work where the cloud can't. In its LFM2 series, the team uses PyTorch ExecuTorch within the LEAP Edge SDK to deploy high-performance multimodal models efficiently across devices. ExecuTorch provides the flexibility to support custom architectures and processing pipelines while reducing inference latency through graph optimization and caching. Together, they enable faster, more efficient, privacy-preserving AI that runs entirely on the edge. [Read Blog →](https://www.liquid.ai/blog/how-liquid-ai-uses-executorch-to-power-efficient-flexible-on-device-intelligence) ::: :::{grid-item-card} **PrivateMind: Complete Privacy with On-Device AI** :class-header: bg-warning text-white **Industry:** Privacy & Personal Computing **Hardware:** iOS & Android Devices **Impact:** 100% on-device processing PrivateMind delivers a fully private AI assistant using ExecuTorch's .pte format. 
Built with React Native ExecuTorch, it supports LLaMA, Qwen, Phi-4, and custom models with offline speech-to-text and PDF chat capabilities. [Visit →](https://privatemind.swmansion.com) ::: :::{grid-item-card} **NimbleEdge: On-Device Agentic AI Platform** :class-header: bg-danger text-white **Industry:** AI Infrastructure **Hardware:** iOS & Android Devices **Impact:** 30% higher TPS on iOS, faster time-to-market with Qwen/Gemma models NimbleEdge successfully integrated ExecuTorch with its open-source DeliteAI platform to enable agentic workflows orchestrated in Python on mobile devices. The extensible ExecuTorch ecosystem allowed implementation of on-device optimization techniques leveraging contextual sparsity. ExecuTorch significantly accelerated the release of "NimbleEdge AI" for iOS, enabling models like Qwen 2.5 with tool calling support and achieving up to 30% higher transactions per second. [Visit →](https://nimbleedge.com) • [Blog →](https://www.nimbleedge.com/blog/meet-nimbleedge-ai-the-first-truly-private-on-device-assistant) • [iOS App →](https://apps.apple.com/in/app/nimbleedge-ai/id6746237456) ::: :::: --- ## Featured Ecosystem Integrations and Interoperability ::::{grid} 2 2 3 3 :gutter: 2 :::{grid-item-card} **Hugging Face Transformers** :class-header: bg-secondary text-white Popular models from Hugging Face easily export to ExecuTorch format for on-device deployment. [Learn More →](https://github.com/huggingface/optimum-executorch/) ::: :::{grid-item-card} **React Native ExecuTorch** :class-header: bg-secondary text-white Declarative toolkit for running AI models and LLMs in React Native apps with privacy-first, on-device execution. [Explore →](https://docs.swmansion.com/react-native-executorch/) • [Blog →](https://expo.dev/blog/how-to-run-ai-models-with-react-native-executorch) ::: :::{grid-item-card} **torchao** :class-header: bg-secondary text-white PyTorch-native quantization and optimization library for preparing efficient models for ExecuTorch deployment. [Blog →](https://pytorch.org/blog/torchao-quantized-models-and-quantization-recipes-now-available-on-huggingface-hub/) • [Qwen Example →](https://huggingface.co/pytorch/Qwen3-4B-INT8-INT4) • [Phi Example →](https://huggingface.co/pytorch/Phi-4-mini-instruct-INT8-INT4) ::: :::{grid-item-card} **Unsloth** :class-header: bg-secondary text-white Optimize LLM fine-tuning with faster training and reduced VRAM usage, then deploy efficiently with ExecuTorch. [Example Model →](https://huggingface.co/metascroy/Qwen3-4B-int8-int4-unsloth) • [Blog →](https://docs.unsloth.ai/new/quantization-aware-training-qat) • [Doc →](https://docs.unsloth.ai/new/deploy-llms-phone) ::: :::{grid-item-card} **Ultralytics** :class-header: bg-secondary text-white Deploy on-device inference for Ultralytics YOLO models using ExecuTorch. [Explore →](https://docs.ultralytics.com/integrations/executorch/) • [Blog →](https://www.ultralytics.com/blog/deploy-ultralytics-yolo-models-using-the-executorch-integration) ::: :::{grid-item-card} **Arm ML Embedded Evaluation Kit** :class-header: bg-secondary text-white Build and deploy ML applications on Arm Cortex-M (M55, M85) and Ethos-U NPUs (U55, U65, U85) using ExecuTorch. [Explore →](https://gitlab.arm.com/artificial-intelligence/ethos-u/ml-embedded-evaluation-kit) ::: :::{grid-item-card} **Alif Semiconductor Ensemble** :class-header: bg-secondary text-white Run generative AI on Ensemble E4/E6/E8 MCUs with Arm Ethos-U85 NPU acceleration. 
[Learn More →](https://alifsemi.com/press-release/alif-semiconductor-elevates-generative-ai-with-support-for-executorch-runtime/) ::: :::{grid-item-card} **Digica AI SDK** :class-header: bg-secondary text-white Automate PyTorch model deployment to iOS, Android, and edge devices with ExecuTorch-powered SDK. [Blog →](https://www.digica.com/blog/effortless-edge-deployment-of-ai-models-with-digicas-ai-sdk-feat-executorch.html) ::: :::: --- ## Featured Demos - **Text and Multimodal LLM demo mobile apps** - Text (Llama, Qwen3, Phi-4) and multimodal (Gemma3, Voxtral) mobile demo apps. [Try →](https://github.com/meta-pytorch/executorch-examples/tree/main/llm) - **Voxtral** - Deploy audio-text-input LLM on CPU (via XNNPACK) and on CUDA. [Try →](https://github.com/pytorch/executorch/blob/main/examples/models/voxtral/README.md) - **Whisper** - Deploy OpenAI's Whisper speech recognition model on CUDA and Metal backends. [Try →](https://github.com/pytorch/executorch/blob/main/examples/models/whisper/README.md) - **LoRA adapter** - Export two LoRA adapters that share a single foundation weight file, saving memory and disk space. [Try →](https://github.com/meta-pytorch/executorch-examples/tree/main/program-data-separation/cpp/lora_example) - **OpenVINO from Intel** - Deploy [Yolo12](https://github.com/pytorch/executorch/tree/main/examples/models/yolo12), [Llama](https://github.com/pytorch/executorch/tree/main/examples/openvino/llama), and [Stable Diffusion](https://github.com/pytorch/executorch/tree/main/examples/openvino/stable_diffusion) on [OpenVINO from Intel](https://www.intel.com/content/www/us/en/developer/articles/community/optimizing-executorch-on-ai-pcs.html). - **Audio Generation** - Generate audio from text prompts using Stable Audio Open Small on Arm CPUs with XNNPACK and KleidiAI. [Try →](https://github.com/Arm-Examples/ML-examples/tree/main/kleidiai-examples/audiogen-et) • [Video →](https://www.youtube.com/watch?v=q2P0ESVxhAY) *Want to showcase your demo? [Submit here →](https://github.com/pytorch/executorch/issues)* --- (support-section)= # Support In this section, find answers to common questions, troubleshooting guides, and information on how to contribute to the ExecuTorch project. Get help with issues and learn how to participate in the community. - {doc}`using-executorch-faqs` — FAQ - {doc}`using-executorch-troubleshooting` — Common Issues - {doc}`contributing` — Contributing ```{toctree} :hidden: :maxdepth: 1 :caption: Support using-executorch-faqs using-executorch-troubleshooting contributing --- (tools-sdk-section)= # Tools In this section, explore ExecuTorch's comprehensive developer tools for profiling, debugging, and model inspection. These tools help optimize performance and troubleshoot issues during development and deployment. 
- {doc}`devtools-overview` — Developer Tools Overview - {doc}`bundled-io` — Bundled I/O - {doc}`etrecord` — ETRecord - {doc}`etdump` — ETDump - {doc}`runtime-profiling` — Profiling Suite - {doc}`model-debugging` — Debugging Tools - {doc}`model-inspector` — Model Inspector - {doc}`memory-planning-inspection` — Memory Planning Inspection - {doc}`devtools-tutorial` — Development Utilities - {doc}`visualization` — Model Visualization ```{toctree} :hidden: :maxdepth: 1 :caption: Tools devtools-overview bundled-io etrecord etdump runtime-profiling model-debugging model-inspector memory-planning-inspection devtools-tutorial visualization --- # TITLE ::::{grid} 2 :::{grid-item-card} What you will learn in this tutorial: :class-card: card-prerequisites * In this tutorial you will learn how to lower and deploy a model for Backend X. ::: :::{grid-item-card} Tutorials we recommend you complete before this: :class-card: card-prerequisites * [Introduction to ExecuTorch](intro-how-it-works.md) * [Setting up ExecuTorch](getting-started-setup.rst) * [Building ExecuTorch with CMake](using-executorch-building-from-source.md) ::: :::: ## Prerequisites (Hardware and Software) Provide instructions on what kind of hardware and software are pre-requisite for the tutorial. ### Hardware: - Hardware requirements to go through this tutorial ### Software: - Software requirements to go through this tutorial ## Setting up your developer environment Steps that the users need to go through to setup their developer environment for this tutorial. ## Build ### AOT (Ahead-of-time) components: Describe what set of steps user should go through, as part of this tutorial, in order to export a model and ready it for execution on the target platform. Such steps, illustrated via example API invocations, may include quantization, delegation, custom passes, custom memory planning etc. ### Runtime: Steps that the users need to go through to build the runtime/supporting app that they can then run on their target device. The tutorial may target a) building separate executable via which a exported model can be run, or b) building libraries that should be linked into native apps available inside examples folder. You are free to add both options. For the latter, you can link to demo-app tutorial page, that should contain instructions on linking custom libraries. ## Deploying and running on device Steps that the users need to go through to deploy and run the runtime and model generated from the previous steps. ## Frequently encountered errors and resolution. Describe what common errors uses may see and how to resolve them. If you encountered any bugs or issues following this tutorial please file a bug/issue here at XYZ, with hashtag #ExecuTorch #MyHashTag.. --- # Building and Running ExecuTorch with XNNPACK Backend The following tutorial will familiarize you with leveraging the ExecuTorch XNNPACK Delegate for accelerating your ML Models using CPU hardware. It will go over exporting and serializing a model to a binary file, targeting the XNNPACK Delegate Backend and running the model on a supported target platform. To get started quickly, use the script in the ExecuTorch repository with instructions on exporting and generating a binary file for a few sample models demonstrating the flow. 
::::{grid} 2 :::{grid-item-card} What you will learn in this tutorial: :class-card: card-prerequisites In this tutorial, you will learn how to export an XNNPACK lowered Model and run it on a target platform ::: :::{grid-item-card} Before you begin it is recommended you go through the following: :class-card: card-prerequisites * [Setting up ExecuTorch](getting-started-setup.rst) * [Model Lowering Tutorial](tutorials/export-to-executorch-tutorial) * [ExecuTorch XNNPACK Delegate](backends/xnnpack/xnnpack-overview.md) ::: :::: ## Lowering a model to XNNPACK ```python import torch import torchvision.models as models from torch.export import export, ExportedProgram from torchvision.models.mobilenetv2 import MobileNet_V2_Weights from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner from executorch.exir import EdgeProgramManager, ExecutorchProgramManager, to_edge_transform_and_lower from executorch.exir.backend.backend_api import to_backend mobilenet_v2 = models.mobilenetv2.mobilenet_v2(weights=MobileNet_V2_Weights.DEFAULT).eval() sample_inputs = (torch.randn(1, 3, 224, 224), ) exported_program: ExportedProgram = export(mobilenet_v2, sample_inputs) edge: EdgeProgramManager = to_edge_transform_and_lower( exported_program, partitioner=[XnnpackPartitioner()], ) ``` We will go through this example with the [MobileNetV2](https://pytorch.org/hub/pytorch_vision_mobilenet_v2/) pretrained model downloaded from the TorchVision library. The flow of lowering a model starts after exporting the model `to_edge`. We call the `to_backend` api with the `XnnpackPartitioner`. The partitioner identifies the subgraphs suitable for XNNPACK backend delegate to consume. Afterwards, the identified subgraphs will be serialized with the XNNPACK Delegate flatbuffer schema and each subgraph will be replaced with a call to the XNNPACK Delegate. ```python >>> print(edge.exported_program().graph_module) GraphModule( (lowered_module_0): LoweredBackendModule() (lowered_module_1): LoweredBackendModule() ) def forward(self, b_features_0_1_num_batches_tracked, ..., x): lowered_module_0 = self.lowered_module_0 lowered_module_1 = self.lowered_module_1 executorch_call_delegate_1 = torch.ops.higher_order.executorch_call_delegate(lowered_module_1, x); lowered_module_1 = x = None getitem_53 = executorch_call_delegate_1[0]; executorch_call_delegate_1 = None aten_view_copy_default = executorch_exir_dialects_edge__ops_aten_view_copy_default(getitem_53, [1, 1280]); getitem_53 = None aten_clone_default = executorch_exir_dialects_edge__ops_aten_clone_default(aten_view_copy_default); aten_view_copy_default = None executorch_call_delegate = torch.ops.higher_order.executorch_call_delegate(lowered_module_0, aten_clone_default); lowered_module_0 = aten_clone_default = None getitem_52 = executorch_call_delegate[0]; executorch_call_delegate = None return (getitem_52,) ``` We print the graph after lowering above to show the new nodes that were inserted to call the XNNPACK Delegate. The subgraphs which are being delegated to XNNPACK are the first argument at each call site. It can be observed that the majority of `convolution-relu-add` blocks and `linear` blocks were able to be delegated to XNNPACK. We can also see the operators which were not able to be lowered to the XNNPACK delegate, such as `clone` and `view_copy`. 
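To see how much of the graph was actually delegated without reading through the printed graph module, a summary can be generated from the lowered program. The sketch below assumes the `get_delegation_info` helper from ExecuTorch's devtools is available in your installation:

```python
from executorch.devtools.backend_debug import get_delegation_info

# Summarize which operators were delegated to XNNPACK and which remain on CPU.
graph_module = edge.exported_program().graph_module
delegation_info = get_delegation_info(graph_module)

print(delegation_info.get_summary())

# Per-operator counts (delegated vs. non-delegated occurrences) as a DataFrame.
print(delegation_info.get_operator_delegation_dataframe())
```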
```python
exec_prog = edge.to_executorch()

with open("xnnpack_mobilenetv2.pte", "wb") as file:
    exec_prog.write_to_file(file)
```

After lowering to the XNNPACK Program, we can then prepare it for ExecuTorch and save the model as a `.pte` file. `.pte` is a binary format that stores the serialized ExecuTorch graph.

## Lowering a Quantized Model to XNNPACK

The XNNPACK delegate can also execute symmetrically quantized models. To understand the quantization flow and learn how to quantize models, refer to [Quantization Overview](quantization-overview.md). For the sake of this tutorial, we will leverage the `quantize()` Python helper function conveniently added to the `executorch/executorch/examples` folder.

```python
from torch.export import export
from executorch.exir import EdgeCompileConfig, to_edge_transform_and_lower

mobilenet_v2 = models.mobilenetv2.mobilenet_v2(weights=MobileNet_V2_Weights.DEFAULT).eval()
sample_inputs = (torch.randn(1, 3, 224, 224), )

mobilenet_v2 = export(mobilenet_v2, sample_inputs).module() # 2-stage export for quantization path

from torchao.quantization.pt2e.quantize_pt2e import convert_pt2e, prepare_pt2e
from executorch.backends.xnnpack.quantizer.xnnpack_quantizer import (
    get_symmetric_quantization_config,
    XNNPACKQuantizer,
)


def quantize(model, example_inputs):
    """This is the official recommended flow for quantization in pytorch 2.0 export"""
    print(f"Original model: {model}")
    quantizer = XNNPACKQuantizer()
    # if we set is_per_channel to True, we also need to add out_variant of quantize_per_channel/dequantize_per_channel
    operator_config = get_symmetric_quantization_config(is_per_channel=False)
    quantizer.set_global(operator_config)
    m = prepare_pt2e(model, quantizer)
    # calibration
    m(*example_inputs)
    m = convert_pt2e(m)
    print(f"Quantized model: {m}")
    # make sure we can export to flat buffer
    return m

quantized_mobilenetv2 = quantize(mobilenet_v2, sample_inputs)
```

Quantization requires a two-stage export. First we use the `export` API to capture the model before giving it to the `quantize` utility function. After performing the quantization step, we can leverage the XNNPACK delegate to lower the quantized exported model graph. From here, the procedure is the same as for the non-quantized model lowering to XNNPACK.

```python
# Continued from earlier...
edge = to_edge_transform_and_lower(
    export(quantized_mobilenetv2, sample_inputs),
    compile_config=EdgeCompileConfig(_check_ir_validity=False),
    partitioner=[XnnpackPartitioner()]
)

exec_prog = edge.to_executorch()

with open("qs8_xnnpack_mobilenetv2.pte", "wb") as file:
    exec_prog.write_to_file(file)
```

## Lowering with `aot_compiler.py` script

We have also provided a script to quickly lower and export a few example models. You can run the script to generate lowered fp32 and quantized models. This script is provided for convenience and performs the same steps as those listed in the previous two sections.

```
python -m examples.xnnpack.aot_compiler --model_name="mv2" --quantize --delegate
```

Note that in the example above:

* the `--model_name` flag specifies the model to use
* the `--quantize` flag controls whether the model should be quantized
* the `--delegate` flag controls whether we attempt to lower parts of the graph to the XNNPACK delegate

The generated model file will be named `[model_name]_xnnpack_[qs8/fp32].pte` depending on the arguments supplied.

## Running the XNNPACK Model with CMake

After exporting the XNNPACK-delegated model, we can try running it with example inputs using CMake.
We can build and use the `executor_runner`, which is a sample wrapper for the ExecuTorch runtime. The XNNPACK backend is enabled via the compilation flag `-DEXECUTORCH_BUILD_XNNPACK=ON`. We first configure the CMake build:

```bash
# cd to the root of executorch repo
cd executorch

# Get a clean cmake-out directory
./install_executorch.sh --clean
mkdir cmake-out

# Configure cmake
cmake \
    -DCMAKE_INSTALL_PREFIX=cmake-out \
    -DCMAKE_BUILD_TYPE=Release \
    -DEXECUTORCH_BUILD_EXECUTOR_RUNNER=ON \
    -DEXECUTORCH_BUILD_EXTENSION_DATA_LOADER=ON \
    -DEXECUTORCH_BUILD_EXTENSION_MODULE=ON \
    -DEXECUTORCH_BUILD_EXTENSION_TENSOR=ON \
    -DEXECUTORCH_BUILD_XNNPACK=ON \
    -DEXECUTORCH_ENABLE_LOGGING=ON \
    -DPYTHON_EXECUTABLE=python \
    -Bcmake-out .
```

Then build the runtime components with:

```bash
cmake --build cmake-out -j9 --target install --config Release
```

You should now find the executable at `./cmake-out/executor_runner`. Run it with the model you generated:

```bash
./cmake-out/executor_runner --model_path=./mv2_xnnpack_fp32.pte
# or to run the quantized variant
./cmake-out/executor_runner --model_path=./mv2_xnnpack_q8.pte
```

## Building and Linking with the XNNPACK Backend

You can build the XNNPACK backend [CMake target](https://github.com/pytorch/executorch/blob/main/backends/xnnpack/CMakeLists.txt#L83) and link it with your application binary, such as an Android or iOS application. For more information, see [Using ExecuTorch on Android](using-executorch-android.md).

## Profiling

To enable profiling in the `executor_runner`, pass the flags `-DEXECUTORCH_ENABLE_EVENT_TRACER=ON` and `-DEXECUTORCH_BUILD_DEVTOOLS=ON` to the build command (add `-DENABLE_XNNPACK_PROFILING=ON` for additional details). This enables ETDump generation when running inference and enables command line flags for profiling (see `executor_runner --help` for details).

---

# Usage

This section describes how to use ExecuTorch. It covers everything from getting started to platform-specific implementations, runtime integration, troubleshooting, and frequently asked questions.

```{toctree}
:maxdepth: 1

getting-started
using-executorch-export
using-executorch-android
using-executorch-ios
using-executorch-cpp
using-executorch-runtime-integration
using-executorch-troubleshooting
using-executorch-building-from-source
using-executorch-faqs
```

---

# Using ExecuTorch on Android

🚀 Quick Start: __New to ExecuTorch?__ Jump to [Using AAR from Maven Central](#using-aar-from-maven-central) for the fastest setup, then see the [Runtime Integration](#runtime-integration) example.

To use ExecuTorch from Android, ExecuTorch provides Java/Kotlin API bindings and Android platform integration, packaged as an AAR file.

Note: This page covers Android app integration through the AAR library. The ExecuTorch C++ APIs can also be used from Android native code; the cross-compilation documentation can be found on the Building from Source page.

## Installation

__Choose your installation method:__

- __[Maven Central](#using-aar-from-maven-central)__ (recommended): Easiest for most developers
- __[Direct AAR file](#using-aar-file-directly)__: For specific versions or offline development
- __[Build from source](#building-from-source)__: For custom backends or contributions

All ExecuTorch Android libraries are packaged into an Android library (AAR), `executorch.aar`, for both generic (image/audio processing) and LLM (LLaMA) use cases. In each release, prebuilt AAR artifacts are uploaded to Maven and S3.
Users can also build the AAR from source. ### Contents of library The AAR artifact contains the Java library for users to integrate with their Java/Kotlin application code, as well as the corresponding JNI library (.so file), which is loaded by the Java code during initialization. - [Java library](https://github.com/pytorch/executorch/tree/main/extension/android/executorch_android/src/main/java/org/pytorch/executorch) - JNI contains the JNI binding for the corresponding Java code, and ExecuTorch native library, including - Core ExecuTorch runtime libraries - XNNPACK backend - Portable kernels - Optimized kernels - Quantized kernels - LLaMa-specific Custom ops library. - Comes with two ABI variants, arm64-v8a and x86_64. The AAR library can be used for generic Android device with arm64-v8a or x86_64 architecture. It can be used across form factors, including phones, tablets, tv boxes, etc, as it does not contain any UI components. ## Using AAR from Maven Central ✅ Recommended for most developers ExecuTorch is available on Maven Central. Simply add the target org.pytorch:executorch-android:${executorch_version} to your Android app dependency (build.gradle), and build your app. For example: ```kotlin app/build.gradle.kts dependencies { implementation("org.pytorch:executorch-android:${executorch_version}") } ``` Note: If you want to use release v1.0.0, please use dependency org.pytorch:executorch-android:1.0.0. Click the screenshot below to watch the demo video on how to add the package and run a simple ExecuTorch model with Android Studio. Integrating and Running ExecuTorch on Android ## Using AAR file directly You can also directly specify an AAR file in the app. We upload pre-built AAR to S3 during each release, or as a snapshot. ### Latest Released versions (Recommended) Starting from [v1.0.0](https://github.com/pytorch/executorch/releases/tag/v1.0.0), there are respective executorch.aar library available by backends | AAR | SHASUMS | Backend | | ------- | --- | ------- | | [executorch.aar](https://ossci-android.s3.amazonaws.com/executorch/release/1.0.0-xnnpack/executorch.aar) | [executorch.aar.sha256sums](https://ossci-android.s3.amazonaws.com/executorch/release/1.0.0-xnnpack/executorch.aar.sha256sums) | [XNNPACK](backends/xnnpack/xnnpack-overview.md) | | [executorch.aar](https://ossci-android.s3.amazonaws.com/executorch/release/1.0.0-qnn/executorch.aar) | [executorch.aar.sha256sums](https://ossci-android.s3.amazonaws.com/executorch/release/1.0.0-qnn/executorch.aar.sha256sums) | [Qualcomm AI Engine](backends-qualcomm.md) | | [executorch.aar](https://ossci-android.s3.amazonaws.com/executorch/release/1.0.0-vulkan/executorch.aar) | [executorch.aar.sha256sums](https://ossci-android.s3.amazonaws.com/executorch/release/1.0.0-vulkan/executorch.aar.sha256sums) | [Vulkan](backends/vulkan/vulkan-overview.md) | ### Older Released versions Download the older released version | Version | AAR | SHASUMS | | ------- | --- | ------- | | [v0.7.0](https://github.com/pytorch/executorch/releases/tag/v0.7.0) | [executorch.aar](https://ossci-android.s3.amazonaws.com/executorch/release/v0.7.0/executorch.aar) | [executorch.aar.sha256sums](https://ossci-android.s3.amazonaws.com/executorch/release/v0.7.0/executorch.aar.sha256sums) | | [v0.6.0](https://github.com/pytorch/executorch/releases/tag/v0.6.0) | [executorch.aar](https://ossci-android.s3.amazonaws.com/executorch/release/v0.6.0/executorch.aar) | 
[executorch.aar.sha256sums](https://ossci-android.s3.amazonaws.com/executorch/release/v0.6.0/executorch.aar.sha256sums) | | [v0.5.0](https://github.com/pytorch/executorch/releases/tag/v0.5.0) | [executorch.aar](https://ossci-android.s3.amazonaws.com/executorch/release/v0.5.0-rc3/executorch.aar) | [executorch.aar.sha256sums](https://ossci-android.s3.amazonaws.com/executorch/release/v0.5.0-rc3/executorch.aar.sha256sums) | ### Snapshots from main branch Starting from 2025-04-12, you can download nightly `main` branch snapshots: * `executorch.aar`: `https://ossci-android.s3.amazonaws.com/executorch/release/snapshot-{YYYYMMDD}/executorch.aar` * `executorch.aar.sha256sums`: `https://ossci-android.s3.amazonaws.com/executorch/release/snapshot-{YYYYMMDD}/executorch.aar.sha256sums` * Replace `YYYYMMDD` with the actual date you want to use. * AAR file is generated by [this workflow](https://github.com/pytorch/executorch/blob/c66b37d010c88a113560693b14dc6bd112593c11/.github/workflows/android-release-artifacts.yml#L14-L15). For example: ```sh curl -O https://ossci-android.s3.amazonaws.com/executorch/release/snapshot-20250412/executorch.aar curl -O https://ossci-android.s3.amazonaws.com/executorch/release/snapshot-20250412/executorch.aar.sha256sums ``` We aim to make every daily snapshot available and usable. However, for best stability, please use releases, not snapshots. ## Using AAR file To add the AAR file to your app: Download the AAR. Add it to your gradle build rule as a file path. An AAR file itself does not contain dependency info, unlike the Maven one which bundled with pom.xml. The Java package requires fbjni and soloader, and currently requires users to explicitly declare the dependency. Therefore, two more dependencies in gradle rule is required: ```kotlin implementation("com.facebook.soloader:soloader:0.10.5") implementation("com.facebook.fbjni:fbjni:0.7.0") ``` ### Example usage In your app working directory, such as executorch-examples/llm/android/LlamaDemo, ```sh mkdir -p app/libs curl https://ossci-android.s3.amazonaws.com/executorch/release/${executorch_version}/executorch.aar -o app/libs/executorch.aar ``` And include it in gradle: ```kotlin app/build.gradle.kts dependencies { implementation(files("libs/executorch.aar")) implementation("com.facebook.soloader:soloader:0.10.5") implementation("com.facebook.fbjni:fbjni:0.7.0") } ``` Now you can compile your app with the ExecuTorch Android library. ## Building from Source ```text scripts/build_android_library.sh ``` is a helper script to build the Java library (into .jar), native library (into .so), and the packaged AAR file. You need Android SDK and NDK to use it. Current NDK version used in ExecuTorch CI: r28c. You need to set ANDROID_HOME to Android SDK home and ANDROID_NDK to the correct NDK root (containing NOTICE file). ```sh export ANDROID_HOME=/path/to/sdk export ANDROID_NDK=/path/to/ndk sh scripts/build_android_library.sh ``` NOTE: Currently, XNNPACK backend is always built with the script. ### Optional environment variables Optionally, set these environment variables before running build_android_library.sh. - __ANDROID_ABIS__ Set environment variable ANDROID_ABIS to either arm64-v8a or x86_64 if you only need to build the native library for one ABI only. ```sh export ANDROID_ABIS=arm64-v8a ``` (Or) ```sh export ANDROID_ABIS=x86_64 ``` And then run the script. 
```sh sh scripts/build_android_library.sh ``` - __EXECUTORCH_CMAKE_BUILD_TYPE__ Set environment variable EXECUTORCH_CMAKE_BUILD_TYPE to Release or Debug based on your needs. - __Using MediaTek backend__ To use MediaTek backend, after installing and setting up the SDK, set NEURON_BUFFER_ALLOCATOR_LIB and NEURON_USDK_ADAPTER_LIB to the corresponding path. - __Using Qualcomm AI Engine Backend__ To use Qualcomm AI Engine Backend, after installing and setting up the SDK, set QNN_SDK_ROOT to the corresponding path. - __Using Vulkan Backend__ To use Vulkan Backend, set EXECUTORCH_BUILD_VULKAN to ON. ## Android Backends The following backends are available for Android: | Backend | Type | Doc | | ------- | -------- | --- | | [XNNPACK](https://github.com/google/XNNPACK) | CPU | [Doc](backends/xnnpack/xnnpack-overview.md) | | [MediaTek NeuroPilot](https://neuropilot.mediatek.com/) | NPU | [Doc](backends-mediatek.md) | | [Qualcomm AI Engine](https://www.qualcomm.com/developer/software/qualcomm-ai-engine-direct-sdk) | NPU | [Doc](backends-qualcomm.md) | | [Vulkan](https://www.vulkan.org/) | GPU | [Doc](backends/vulkan/vulkan-overview.md) | Start with XNNPACK (CPU backend) for maximum compatibility, then add hardware-specific backends for optimization. ## Runtime Integration Here is an example code sample in Java that demonstrates how to integrate ExecuTorch into an Android app: ```java import org.pytorch.executorch.EValue; import org.pytorch.executorch.Module; import org.pytorch.executorch.Tensor; public class MainActivity extends Activity { private Module module; @Override protected void onCreate(Bundle savedInstanceState) { super.onCreate(savedInstanceState); // Load the ExecuTorch module Module module = Module.load("/data/local/tmp/add.pte"); Tensor tensor1 = Tensor.fromBlob(new float[] {1.0f}, new long[] {1}); Tensor tensor2 = Tensor.fromBlob(new float[] {20.0f}, new long[] {1}); EValue eValue1 = EValue.from(tensor1); EValue eValue2 = EValue.from(tensor2); float result = module.forward(eValue1, eValue2)[0].toTensor().getDataAsFloatArray()[0]; } ``` Push the corresponding pte file to your Android device: ```sh adb push extension/module/test/resources/add.pte /data/local/tmp/ ``` This example loads an ExecuTorch module, prepares input data, runs inference, and processes the output data. Please use [DeepLabV3AndroidDemo](https://github.com/meta-pytorch/executorch-examples/tree/main/dl3/android/DeepLabV3Demo) and [LlamaDemo](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/android/LlamaDemo) for the code examples using ExecuTorch AAR package. ## Java API reference Please see [Java API reference](https://pytorch.org/executorch/main/javadoc/). --- # Building from Source ExecuTorch uses [CMake](https://cmake.org/) as the primary build system. Even if you don't use CMake directly, CMake can emit scripts for other format like Make, Ninja or Xcode. For information, see [cmake-generators(7)](https://cmake.org/cmake/help/latest/manual/cmake-generators.7.html). ## System Requirements ### Operating System ExecuTorch is tested on the following systems, although it should also work in similar environments. 
* Linux (x86_64) * CentOS 8+ * Ubuntu 20.04.6 LTS+ * RHEL 8+ * macOS (x86_64/ARM64) * Big Sur (11.0)+ * Windows (x86_64) * Windows 10+ with Visual Studio 2022+ and [Clang-CL](https://learn.microsoft.com/en-us/cpp/build/clang-support-msbuild?view=msvc-170) * Windows Subsystem for Linux (WSL) with any of the Linux options ### Software Requirements * `conda` or another virtual environment manager - `conda` is recommended as it provides cross-language support and integrates smoothly with `pip` (Python's built-in package manager) - Otherwise, Python's built-in virtual environment manager `python venv` is a good alternative. * `g++` version 7 or higher, `clang++` version 5 or higher, or another C++17-compatible toolchain. * `python` version 3.10-3.13 * `ccache` (optional) - A compiler cache that speeds up recompilation * **macOS** - `Xcode Command Line Tools` * **Windows** - `Visual Studio Clang Tools` - See [Clang/LLVM support in Visual Studio](https://learn.microsoft.com/en-us/cpp/build/clang-support-msbuild?view=msvc-170). Additional dependencies will be automatically installed when running the [Python installation](#building-the-python-package). Note that the cross-compilable core runtime code supports a wider range of toolchains, down to C++17. See [Runtime Overview](runtime-overview.md) for portability details. ## Environment Setup Clone the ExecuTorch repository from GitHub and create a conda environment. Venv can be used in place of conda. ```bash git clone -b viable/strict https://github.com/pytorch/executorch.git cd executorch conda create -yn executorch python=3.10.0 conda activate executorch ``` > **_NOTE:_** Addition Windows Setup > > ExecuTorch requires symlinks to be enabled to build the Python components. To enable symlinks, run the following command before cloning the repository. Missing symlinks will manifest as an error related to `version.py` when running `pip install .`. See [src/README.md](https://github.com/pytorch/executorch/blob/main/src/README.md) for more information. > ```bash > git config --system core.symlinks true > ```
## Building the Python package To build and install the ExecuTorch Python components, used for PTE creation and Python runtime bindings, run the following command. This will install the ExecuTorch python package and its dependencies into the active Python environment. ```bash # Install ExecuTorch pip package and its dependencies. ./install_executorch.sh ``` The `install_executorch.sh` script supports the following flags: * `--clean`: Removes build artifacts. * `--editable`: Install the ExecuTorch python package in editable mode (see [Editable Install](#editable-install)). * `--minimal`: Install only the minimal set of dependencies required to run ExecuTorch. Do not install dependencies for examples. * `--use-pt-pinned-commit`: Install the pinned PyTorch commit or release version. When not specified, the latest PyTorch nightly build is installed. For Intel-based macOS systems, use `--use-pt-pinned-commit --minimal`. As PyTorch does not provide pre-built binaries for Intel Mac, installation requires building PyTorch from source. Instructions can be found in [PyTorch Installation](https://github.com/pytorch/pytorch#installation). Note that only the XNNPACK and CoreML backends are built by default. Additional backends can be enabled or disabled by setting the corresponding CMake flags: ```bash # Enable the MPS backend CMAKE_ARGS="-DEXECUTORCH_BUILD_MPS=ON" ./install_executorch.sh ``` ### Verify the Build To verify that the Python components are installed correctly, run the following command. This will create a file named mv2_xnnpack_fp32.pte in the current directory for the MobileNet V2 model with the XNNPACK backend. If it completes without error, the ExecuTorch Python components are installed successfully. ```bash python -m executorch.examples.xnnpack.aot_compiler --model_name="mv2" --delegate ``` ### Editable Install For development, include the `--editable` flag, which allows for local changes to ExecuTorch Python code to be reflected without a re-install. Note that when C++ files are modified, you will need to re-run the full installation to reflect the changes. ```bash ./install_executorch.sh --editable # Or you can directly do the following if dependencies are already installed # either via a previous invocation of `./install_executorch.sh` or by explicitly installing requirements via `./install_requirements.sh` first. pip install -e . --no-build-isolation ``` > **_WARNING:_** > Some modules can't be imported directly in editable mode. This is a known [issue](https://github.com/pytorch/executorch/issues/9558) and we are actively working on a fix for this. To work around this: > ```bash > # This will fail > python -c "from executorch.exir import CaptureConfig" > # But this will succeed > python -c "from executorch.exir.capture import CaptureConfig" > ``` > **_NOTE:_** Cleaning the build system > > When fetching a new version of the upstream repo (via `git fetch` or `git > pull`) it is a good idea to clean the old build artifacts. The build system > does not currently adapt well to changes in build dependencies. > > You should also update and pull the submodules again, in case their versions > have changed. > > ```bash > # From the root of the executorch repo: > ./install_executorch.sh --clean > git submodule sync > git submodule update --init --recursive > ``` > > The `--clean` command removes build artifacts, pip outputs, and also clears the ccache if it's installed, ensuring a completely fresh build environment.
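Beyond the `aot_compiler` check above, you can also sanity-check the installed Python components by loading and running the generated `.pte` directly from Python. This is a minimal sketch that assumes the Python runtime bindings (`executorch.runtime.Runtime`) are included in your install:

```python
import torch
from executorch.runtime import Runtime

# Load the PTE produced by the verification step above.
runtime = Runtime.get()
program = runtime.load_program("mv2_xnnpack_fp32.pte")
method = program.load_method("forward")

# MobileNet V2 expects a single [1, 3, 224, 224] image tensor.
outputs = method.execute([torch.randn(1, 3, 224, 224)])
print(outputs[0].shape)
```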
## Building the C++ Runtime

The ExecuTorch runtime uses CMake as the build system. When using ExecuTorch from C++ user code with CMake, adding ExecuTorch as a submodule and referencing it via CMake `add_subdirectory` will build the runtime as part of the user build. When user code is not using CMake, the runtime can be built standalone and linked. The CMake options described below apply in both cases. Scripts are also provided for [Android AAR](#cross-compiling-for-android) and [iOS framework](#cross-compiling-for-ios) builds.

| Use Case                   | How to Build                                                                       |
| :------------------------- | :--------------------------------------------------------------------------------- |
| C++ with user CMake        | Use CMake `add_subdirectory`.                                                       |
| C++ without user CMake     | Build ExecuTorch standalone with CMake, then link the libraries with the user build. |
| Android with Java/Kotlin   | Use [scripts/build_android_library.sh](#cross-compiling-for-android).               |
| Android with C++           | Follow the C++ build steps and [cross-compile for Android](#cross-compiling-for-android). |
| iOS                        | Use [scripts/build_apple_frameworks.sh](#cross-compiling-for-ios).                  |

### Configuring

Configuration should be done after cloning, pulling the upstream repo, or changing build options. Once this is done, you won't need to do it again until you pull from the upstream repo or modify any CMake-related files.

When building as a submodule as part of a user CMake build, ExecuTorch CMake options can be specified either as part of the user CMake configuration or in user CMake code.

CMake configuration for a standalone runtime build:

```bash
mkdir cmake-out
cmake -B cmake-out --preset [preset] [options]
cmake --build cmake-out -j10
```

#### Build Presets

ExecuTorch provides fine-grained control over what is built, as described in [Build Options](#build-options). These options are grouped into CMake presets to cover common scenarios while preserving the ability to override individual options. Presets can be selected by passing `--preset [name]` when configuring. Preset values for common scenarios are listed below. Using a platform preset is recommended to avoid needing to specify many fine-grained build options.

* `android-arm64-v8a` - Build features and backends common for arm64-v8a Android targets.
* `android-x86_64` - Build features and backends common for x86_64 Android targets.
* `arm-baremetal` - Build for bare-metal ARM targets.
* `ios` - Build features and backends common for iOS targets.
* `macos` - Build features and backends common for Mac targets.
* `linux` - Build features and backends for Linux targets.
* `llm` - Build Large Language Model-specific features.
* `profiling` - Build the ExecuTorch runtime with profiling enabled.
* `zephyr` - Build for Zephyr RTOS.

User CMake:

```cmake
set(EXECUTORCH_BUILD_PRESET_FILE ${CMAKE_SOURCE_DIR}/executorch/tools/cmake/preset/llm.cmake)
```

Standalone build:

```bash
# Configure the build with the ios preset.
cmake .. --preset ios
```

#### Build Options

CMake options can be used for fine-grained control of the build type, which features are built, and functionality such as logging. Options are typically specified during CMake configuration. Default values of each option are set by the active preset, but can be overridden by specifying the option when configuring. Note that many build options require other options to be enabled, so enabling a given feature may require enabling multiple options.
The CMake build output will provide an error message when a required option is not enabled. User CMake: ```cmake set(EXECUTORCH_BUILD_XNNPACK ON) ``` Standalone build: ```bash cmake -DEXECUTORCH_BUILD_XNNPACK=ON ``` ##### Build Type The CMake build is typically set to `Debug` or `Release`. For production use or profiling, release mode should be used to improve performance and reduce binary size. It disables program verification and executorch logging and adds optimizations flags. The `EXECUTORCH_OPTIMIZE_SIZE` flag can be used to further optimize for size with a small performance tradeoff. ```bash # Specify build type during CMake configuration cmake .. -DCMAKE_BUILD_TYPE=Release ``` ##### Backends Typically, each hardware backend exposes a CMake option to control whether the backend is built. See backend-specific documentation for more details. * `EXECUTORCH_BUILD_CADENCE` - Build the Cadence DSP backend. * `EXECUTORCH_BUILD_COREML` - Build the Apple CoreML backend. * `EXECUTORCH_BUILD_CORTEX_M` - Build the ARM Cortex-M backend. * `EXECUTORCH_BUILD_MPS` - Build the Apple Metal Performance Shader backend. * `EXECUTORCH_BUILD_NEURON` - Build the MediaTek Neuron backend. * `EXECUTORCH_BUILD_OPENVINO` - Build the Intel OpenVINO backend. * `EXECUTORCH_BUILD_QNN` - Build the Qualcomm AI Engine backend. * `EXECUTORCH_BUILD_VGF` - Build the ARM VGF backend. * `EXECUTORCH_BUILD_VULKAN` - Build the Vulkan GPU backend. * `EXECUTORCH_BUILD_XNNPACK` - Build the XNNPACK CPU backend. ```bash # Build the XNNPACK and Vulkan backends. cmake .. -DEXECUTORCH_BUILD_XNNPACK=ON -DEXECUTORCH_BUILD_VULKAN=ON ``` ##### Extensions ExecuTorch extensions provide optional functionality outside of the core runtime. As the core runtime is designed to run in constrained environments, these features are typically disabled by default. Extensions include higher-level APIs (Module and Tensor), multi-threading support (Threadpool), training, and more. * `EXECUTORCH_BUILD_EXTENSION_APPLE` - Build the Apple extension. This provides Swift and Objective-C bindings, log routing, and platform integration with Mac and iOS. See [Using ExecuTorch on iOS](using-executorch-ios.md). * `EXECUTORCH_BUILD_EXTENSION_DATA_LOADER` - Build the data loader extension. Provides classes to load PTEs from files or buffers. * `EXECUTORCH_BUILD_EXTENSION_FLAT_TENSOR` - Build the flat tensor extension. Provides functionality to load and save tensor data in .ptd format. * `EXECUTORCH_BUILD_EXTENSION_LLM` - Build the Large Language Model extension. Provides LLM-specific functionality, such as tokenizer APIs. See [Working with LLMs](llm/getting-started.md). * `EXECUTORCH_BUILD_EXTENSION_LLM_APPLE` - Build the Large Language Model Apple extensions. * `EXECUTORCH_BUILD_EXTENSION_LLM_RUNNER` - Build the Large Language Model runner extension. * `EXECUTORCH_BUILD_EXTENSION_MODULE` - Build the Module API extension. See [High-Level APIs](using-executorch-cpp.md#high-level-apis). * `EXECUTORCH_BUILD_EXTENSION_TENSOR` - Build the Tensor API extension. Provides convenience APIs for creating and managing tensors. See [High-Level APIs](using-executorch-cpp.md#high-level-apis) and [extension/tensor](https://github.com/pytorch/executorch/tree/main/extension/tensor). * `EXECUTORCH_BUILD_EXTENSION_TRAINING` - Build the training extension. This is experimental. * `EXECUTORCH_BUILD_EXTENSION_EVALUE_UTIL` - Build the EValue utility extension. Provides a method to print EValue objects. 
See [print_evalue.h](https://github.com/pytorch/executorch/blob/main/extension/evalue_util/print_evalue.h). * `EXECUTORCH_BUILD_EXTENSION_RUNNER_UTIL` - Build the runner utility extension. Provides utility methods for running models, such as allocating input and output tensor memory and generating inputs. See [executor_runner.cpp](https://github.com/pytorch/executorch/blob/main/examples/portable/executor_runner/executor_runner.cpp) for example usage. ``` # Enable the data loader extension. cmake .. -DEXECUTORCH_BUILD_EXTENSION_DATA_LOADER=ON ``` ##### Logging Logging is enabled by default in debug builds and disabled in release. When enabled, the default log level is Info. Both log enable and level can be overriden with options. See [Logging](using-executorch-runtime-integration.md#logging). Disabling logging and decreasing log verbosity will reduce binary size by stripping unused strings from the build. * `EXECUTORCH_ENABLE_LOGGING` - Enable or disable framework log messages. * `EXECUTORCH_LOG_LEVEL` - The minimum log level to emit. One of `debug`, `info`, `error`, or `fatal`. ``` # Enable logging at debug cmake .. -DEXECUTORCH_ENABLE_LOGGING=ON -DEXECUTORCH_LOG_LEVEL=debug ``` ### Building Build all targets with `cmake --build`. ```bash # cd to the root of the executorch repo cd executorch # Build using the configuration that you previously generated under the # `cmake-out` directory. # # NOTE: The `-j` argument specifies how many jobs/processes to use when # building, and tends to speed up the build significantly. It's typical to use # "core count + 1" as the `-j` value. cmake --build cmake-out -j9 ``` > **_TIP:_** For faster rebuilds, consider installing ccache (see [Compiler Cache section](#compiler-cache-ccache) above). On first builds, ccache populates its cache. Subsequent builds with the same compiler flags can be significantly faster.
## CMake Targets and Output Libraries

To link against the ExecuTorch framework from CMake, the following top-level targets are exposed:

* `executorch::backends`: Contains all configured backends.
* `executorch::extensions`: Contains all configured extensions.
* `executorch::kernels`: Contains all configured kernel libraries.

The backends, extensions, and kernels included in these targets are controlled by the various `EXECUTORCH_` CMake options specified by the build. Using these targets will automatically pull in the required dependencies to use the configured features.

### Linking Without CMake

To link against the runtime from outside of the CMake ecosystem, the runtime can first be built with CMake and then linked directly. A few of the relevant top-level targets are described below. Note that this is a more involved process than using CMake and is only recommended when using CMake is not viable.

- `libexecutorch.a`: The core of the ExecuTorch runtime. Does not contain any operator/kernel definitions or backend definitions.
- `libportable_kernels.a`: The implementations of ATen-compatible operators, following the signatures in `//kernels/portable/functions.yaml`.
- `libportable_kernels_bindings.a`: Generated code that registers the contents of `libportable_kernels.a` with the runtime.
  - NOTE: This must be linked into your application with a flag like `-Wl,-force_load` or `-Wl,--whole-archive`. It contains load-time functions that automatically register the kernels, but linkers will often prune those functions by default because there are no direct calls to them. Link it together with `libportable_kernels.a` so the program may use any of the operators it implements.

Backends typically introduce additional targets. See backend-specific documentation for more details.

### Verify the Build

To verify the build, ExecuTorch optionally compiles a simple, stand-alone model runner to run PTE files with all-one input tensors. It is not enabled by default in most presets, but can be enabled by configuring with `-DEXECUTORCH_BUILD_EXECUTOR_RUNNER=ON -DEXECUTORCH_BUILD_EXTENSION_EVALUE_UTIL=ON`.

Once compiled, invoke the runner with a sample PTE (such as the one generated by [verifying the Python build](#verify-the-build)).

```bash
cmake-out/executor_runner --model_path=mv2_xnnpack_fp32.pte
```

If the runner runs successfully, you should see output similar to the following:

```
I 00:00:00.043703 executorch:executor_runner.cpp:379] Model executed successfully 1 time(s) in 15.013292 ms.
I 00:00:00.043720 executorch:executor_runner.cpp:383] 1 outputs:
Output 0: tensor(sizes=[1, 1000], [
  -0.509859, 0.300644, 0.0953884, 0.147724, 0.231202, 0.338554, 0.206888, -0.0575762, -0.389273, -0.0606864,
  ...,
  0.421219, 0.100447, -0.506771, -0.115824, -0.693017, -0.183262, 0.154781, -0.410684, 0.0119296, 0.449713,
])
```
## Cross-Compiling for Android ### Pre-requisites - Set up a Python environment and clone the ExecuTorch repository, as described in [Environment Setup](#environment-setup). - Install the [Android SDK](https://developer.android.com/studio). Android Studio is recommended. - Install the [Android NDK](https://developer.android.com/ndk). - Option 1: Install via [Android Studio](https://developer.android.com/studio/projects/install-ndk). - Option 2: Download from [NDK Downloads](https://developer.android.com/ndk/downloads). ### Building the AAR With the NDK installed, the `build_android_library.sh` script will build the ExecuTorch Java AAR, which contains ExecuTorch Java bindings. See [Using the AAR File](using-executorch-android.md#using-aar-file) for usage. ```bash export ANDROID_ABIS=arm64-v8a export BUILD_AAR_DIR=aar-out mkdir -p $BUILD_AAR_DIR sh scripts/build_android_library.sh ``` ### Android Native To use the ExecuTorch runtime from native Android C++ code, the runtime can be cross-compiled for Android. The recommended approach is to add ExecuTorch as a submodule of the user project and use [CMake](https://developer.android.com/ndk/guides/cmake) for the native build. The above steps for C++ with CMake can be followed. For direct cross-compilation, the ExecuTorch runtime can be configured to build with the NDK toolchain: ```bash # point -DCMAKE_TOOLCHAIN_FILE to the location where ndk is installed cmake -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake -DANDROID_ABI=arm64-v8a .. ```
## Cross-Compiling for iOS

iOS binaries are built as [frameworks](https://developer.apple.com/documentation/xcode/creating-a-multi-platform-binary-framework-bundle) instead of static libraries. The frameworks contain the compiled ExecuTorch runtime and public headers.

### Pre-requisites

* Install Xcode from the [Mac App Store](https://apps.apple.com/app/xcode/id497799835) and install the Command Line Tools using the terminal.

```bash
xcode-select --install
```

### Building

1. Build the frameworks:

```bash
./scripts/build_apple_frameworks.sh
```

Run the above command with the `--help` flag to learn more about how to build additional backends (like [Core ML](backends/coreml/coreml-overview.md), [MPS](backends/mps/mps-overview.md), or XNNPACK). Note that some backends may require additional dependencies and certain versions of Xcode and iOS. See backend-specific documentation for more details.

2. Copy the generated `.xcframework` bundles to your Xcode project, link them against your targets, and don't forget to add an extra linker flag `-all_load`.

See the [iOS Demo App](https://github.com/meta-pytorch/executorch-examples/tree/main/mv3/apple/ExecuTorchDemo) tutorial for example usage of the ExecuTorch frameworks.

## Compiler Cache (ccache)

ExecuTorch automatically detects and enables [ccache](https://ccache.dev/) if it's installed. This significantly speeds up recompilation by caching previously compiled objects:

- If ccache is detected, you'll see: `ccache found and enabled for faster builds`
- If ccache is not installed, you'll see: `ccache not found, builds will not be cached`

To install ccache:

```bash
# Ubuntu/Debian
sudo apt install ccache

# macOS
brew install ccache

# CentOS/RHEL
sudo yum install ccache
# or
sudo dnf install ccache
```

No additional configuration is needed; the build system will automatically use ccache when available. See [CMakeLists.txt](https://github.com/pytorch/executorch/blob/main/CMakeLists.txt) for details.

---

# Using ExecuTorch with C++

In order to support a wide variety of devices, from high-end mobile phones down to tiny embedded systems, ExecuTorch provides an API surface with a high degree of customizability. The C++ APIs expose advanced configuration options, such as controlling memory allocation, placement, and data loading. To meet the needs of both application and embedded programming, ExecuTorch provides a low-level, highly customizable core set of APIs, and a set of high-level extensions that abstract away many of the low-level details that are not relevant for mobile application programming.

## High-Level APIs

The C++ `Module` class provides the high-level interface to load and execute a model from C++. It is responsible for loading the .pte file, configuring memory allocation and placement, and running the model. The Module constructor takes a file path and provides a simplified `forward()` method to run the model.

In addition to the Module class, the tensor extension provides an encapsulated interface to define and manage tensor memory. It provides the `TensorPtr` class, which is a "fat" smart pointer. It provides ownership over the tensor data and metadata, such as size and strides. The `make_tensor_ptr` and `from_blob` methods, defined in `tensor.h`, provide owning and non-owning tensor creation APIs, respectively.

```cpp
#include <executorch/extension/module/module.h>
#include <executorch/extension/tensor/tensor.h>

using namespace ::executorch::extension;

// Load the model.
Module module("/path/to/model.pte");

// Create an input tensor.
float input[1 * 3 * 256 * 256];
auto tensor = from_blob(input, {1, 3, 256, 256});

// Perform an inference.
const auto result = module.forward(tensor);

if (result.ok()) {
  // Retrieve the output data.
  const auto output = result->at(0).toTensor().const_data_ptr<float>();
}
```

For more information on the Module class, see [Running an ExecuTorch Model Using the Module Extension in C++](extension-module.md). For information on high-level tensor APIs, see [Managing Tensor Memory in C++](extension-tensor.md).

For complete examples of building and running a C++ application using the Module API, refer to our [examples GitHub repository](https://github.com/meta-pytorch/executorch-examples/tree/main/mv2/cpp).

## Low-Level APIs

Running a model using the low-level runtime APIs allows for a high degree of control over memory allocation, placement, and loading. This allows for advanced use cases, such as placing allocations in specific memory banks or loading a model without a file system. For an end-to-end example using the low-level runtime APIs, see [Detailed C++ Runtime APIs Tutorial](running-a-model-cpp-tutorial.md).

## Building with CMake

ExecuTorch uses CMake as the primary build system. Inclusion of the Module and Tensor APIs is controlled by the `EXECUTORCH_BUILD_EXTENSION_MODULE` and `EXECUTORCH_BUILD_EXTENSION_TENSOR` CMake options. As these APIs may not be supported on embedded systems, they are disabled by default when building from source. The low-level API surface is always included.

To link, add the `executorch` target as a CMake dependency, along with `executorch::backends`, `executorch::extensions`, and `executorch::kernels`, to link all configured backends, extensions, and kernels.

```
# CMakeLists.txt
add_subdirectory("executorch")
...
target_link_libraries(
    my_target
    PRIVATE executorch
            executorch::backends
            executorch::extensions
            executorch::kernels)
```

See [Building from Source](using-executorch-building-from-source.md) for more information on the CMake build process.

## Reference Runners

The ExecuTorch repository includes several reference runners, which are simple programs that load and execute a .pte file, typically with random inputs. These can be used to sanity check model execution on a development platform and as a code reference for runtime integration.

The `executor_runner` target is built by default when building with CMake. It can be invoked as follows:

```
./cmake-out/executor_runner --model_path path/to/model.pte
```

The runner source code can be found in the ExecuTorch repo under [examples/portable/executor_runner.cpp](https://github.com/pytorch/executorch/blob/main/examples/portable/executor_runner/executor_runner.cpp). Some backends, such as CoreML, have dedicated runners to showcase backend- and platform-specific functionality. See [examples/apple/coreml](https://github.com/pytorch/executorch/tree/main/examples/apple/coreml) and the [examples](https://github.com/pytorch/executorch/tree/main/examples) directory for more information.

## Next Steps

- [Runtime API Reference](executorch-runtime-api-reference.rst) for documentation on the available C++ runtime APIs.
- [Running an ExecuTorch Model Using the Module Extension in C++](extension-module.md) for information on the high-level Module API.
- [Managing Tensor Memory in C++](extension-tensor.md) for information on high-level tensor APIs.
- [Running an ExecuTorch Model in C++ Tutorial](running-a-model-cpp-tutorial.md) for information on the low-level runtime APIs.
- [Building from Source](using-executorch-building-from-source.md) for information on CMake build integration. --- # Model Export and Lowering The section describes the process of taking a PyTorch model and converting to the runtime format used by ExecuTorch. This process is commonly known as "exporting", as it uses the PyTorch export functionality to convert a PyTorch model into a format suitable for on-device execution. This process yields a .pte file which is optimized for on-device execution using a particular backend. If using program-data separation, it also yields a corresponding .ptd file containing only the weights/constants from the model. ## Prerequisites Exporting requires the ExecuTorch python libraries to be installed, typically by running `pip install executorch`. See [Installation](getting-started.md#Installation) for more information. This process assumes you have a PyTorch model, can instantiate it from Python, and can provide example input tensors to run the model. ## The Export and Lowering Process The process to export and lower a model to the .pte format typically involves the following steps: 1) Select a backend to target. 2) Prepare the PyTorch model, including inputs and shape specification. 3) Export the model using torch.export.export. 4) Optimize the model for the target backend using to_edge_transform_and_lower. 5) Create the .pte file by calling to_executorch and serializing the output.
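
Taken together, these steps can be expressed as a short Python sketch. The snippet below assumes the XNNPACK backend as the target and a `model` instance with example `inputs` (both are defined in the sections that follow); a complete, runnable version appears under Export and Lowering below.

```python
import torch
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
from executorch.exir import to_edge_transform_and_lower

# Steps 2-3: prepare the model and example inputs, then export.
exported_program = torch.export.export(model, inputs)

# Step 4: optimize and partition the graph for the chosen backend (step 1).
executorch_program = to_edge_transform_and_lower(
    exported_program,
    partitioner=[XnnpackPartitioner()],
).to_executorch()  # Step 5: convert to an ExecuTorch program...

# ...and serialize it to a .pte file.
with open("model.pte", "wb") as f:
    f.write(executorch_program.buffer)
```
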
Quantization - the process of using reduced precision to reduce inference time and memory footprint - is also commonly done at this stage. See [Quantization Overview](quantization-overview.md) for more information. ## Hardware Backends ExecuTorch backends provide hardware acceleration for a specific hardware target. In order to achieve maximum performance on target hardware, ExecuTorch optimizes the model for a specific backend during the export and lowering process. This means that the resulting .pte file is specialized for the specific hardware. In order to deploy to multiple backends, such as Core ML on iOS and Arm CPU on Android, it is common to generate a dedicated .pte file for each. The choice of hardware backend is informed by the hardware that the model is intended to be deployed on. Each backend has specific hardware requirements and level of model support. See the documentation for each hardware backend for more details. As part of the .pte file creation process, ExecuTorch identifies portions of the model (partitions) that are supported for the given backend. These sections are processed by the backend ahead of time to support efficient execution. Portions of the model that are not supported on the delegate, if any, are executed using the portable fallback implementation on CPU. This allows for partial model acceleration when not all model operators are supported on the backend, but may have negative performance implications. In addition, multiple partitioners can be specified in order of priority. This allows for operators not supported on GPU to run on CPU via XNNPACK, for example. ### Available Backends Commonly used hardware backends are listed below. For mobile, consider using XNNPACK for Android and XNNPACK or Core ML for iOS. To create a .pte file for a specific backend, pass the appropriate partitioner class to `to_edge_transform_and_lower`. See the appropriate backend documentation and the [Export and Lowering](#export-and-lowering) section below for more information. - [XNNPACK (CPU)](backends/xnnpack/xnnpack-overview.md) - [Core ML (iOS)](backends/coreml/coreml-overview.md) - [Metal Performance Shaders (iOS GPU)](backends/mps/mps-overview.md) - [Vulkan (Android GPU)](backends/vulkan/vulkan-overview.md) - [Qualcomm NPU](backends-qualcomm.md) - [MediaTek NPU](backends-mediatek.md) - [Arm Ethos-U NPU](backends-arm-ethos-u.md) - [Cadence DSP](backends-cadence.md) ## Model Preparation The export process takes in a standard PyTorch model, typically a `torch.nn.Module`. This can be an custom model definition, or a model from an existing source, such as TorchVision or HuggingFace. See [Getting Started with ExecuTorch](getting-started.md) for an example of lowering a TorchVision model. Model export is done from Python. This is commonly done through a Python script or from an interactive Python notebook, such as Jupyter or Colab. The example below shows instantiation and inputs for a simple PyTorch model. The inputs are prepared as a tuple of torch.Tensors, and the model can run with these inputs. 
```python import torch class Model(torch.nn.Module): def __init__(self): super().__init__() self.seq = torch.nn.Sequential( torch.nn.Conv2d(1, 8, 3), torch.nn.ReLU(), torch.nn.Conv2d(8, 16, 3), torch.nn.ReLU(), torch.nn.AdaptiveAvgPool2d((1,1)) ) self.linear = torch.nn.Linear(16, 10) def forward(self, x): y = self.seq(x) y = torch.flatten(y, 1) y = self.linear(y) return y model = Model().eval() inputs = (torch.randn(1,1,16,16),) outputs = model(*inputs) print(f"Model output: {outputs}") ``` Note that the model is set to evaluation mode using `.eval()`. Models should always be exported in evaluation mode unless performing on-device training. This mode configures certain operations with training-specific behavior, such as batch norm or dropout, to use the inference-mode configuration. ## Export and Lowering To actually export and lower the model, call `export`, `to_edge_transform_and_lower`, and `to_executorch` in sequence. This yields an ExecuTorch program which can be serialized to a file. Putting it all together, lowering the example model above using the XNNPACK delegate for mobile CPU performance can be done as follows: ```python import torch from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner from executorch.exir import to_edge_transform_and_lower from torch.export import Dim, export class Model(torch.nn.Module): def __init__(self): super().__init__() self.seq = torch.nn.Sequential( torch.nn.Conv2d(1, 8, 3), torch.nn.ReLU(), torch.nn.Conv2d(8, 16, 3), torch.nn.ReLU(), torch.nn.AdaptiveAvgPool2d((1,1)) ) self.linear = torch.nn.Linear(16, 10) def forward(self, x): y = self.seq(x) y = torch.flatten(y, 1) y = self.linear(y) return y model = Model() inputs = (torch.randn(1,1,16,16),) dynamic_shapes = { "x": { 2: Dim("h", min=16, max=1024), 3: Dim("w", min=16, max=1024), } } exported_program = export(model, inputs, dynamic_shapes=dynamic_shapes) executorch_program = to_edge_transform_and_lower( exported_program, partitioner = [XnnpackPartitioner()] ).to_executorch() with open("model.pte", "wb") as file: file.write(executorch_program.buffer) ``` This yields a `model.pte` file which can be run on mobile devices. To generate a `model.pte`, `model.ptd` pair with the weights inside `model.ptd`, add the following transform function to tag constants as external: ```python from executorch.exir.passes.external_constants_pass import ( delegate_external_constants_pass_unlifted, ) # Tag the unlifted ep.module(). tagged_module = exported_program.module() delegate_external_constants_pass_unlifted( module=tagged_module, gen_tag_fn=lambda x: "model", # This is the filename the weights will be saved to. In this case, weights will be saved as "model.ptd" ) # Re-export to get the EP. exported_program = export(tagged_module, inputs, dynamic_shapes=dynamic_shapes) executorch_program = to_edge_transform_and_lower( exported_program, partitioner = [XnnpackPartitioner()] ).to_executorch() ``` To save the PTD file: ``` executorch_program.write_tensor_data_to_file(output_directory) ``` It will be saved to the file `model.ptd`, with the file name coming from `gen_tag_fn` in the transform pass. ### Supporting Varying Input Sizes (Dynamic Shapes) The PyTorch export process uses the example inputs provided to trace through the model and reason about the size and type of tensors at each step. Unless told otherwise, export will assume a fixed input size equal to the example inputs and will use this information to optimize the model. Many models require support for varying input sizes. 
To support this, export takes a `dynamic_shapes` parameter, which informs the compiler of which dimensions can vary and their bounds. This takes the form of a nested dictionary, where keys correspond to input names and values specify the bounds for each input.

In the example model, inputs are provided as 4-dimensional tensors following the standard convention of batch, channels, height, and width (NCHW). An input with the shape `[1, 3, 16, 16]` indicates 1 batch, 3 channels, and a height and width of 16.

Suppose your model supports images with sizes between 16x16 and 1024x1024. The shape bounds can be specified as follows:

```
dynamic_shapes = {
    "x": {
        2: Dim("h", min=16, max=1024),
        3: Dim("w", min=16, max=1024),
    }
}

ep = torch.export.export(model, inputs, dynamic_shapes=dynamic_shapes)
```

In the above example, `"x"` corresponds to the parameter name in `Model.forward`. The 2 and 3 keys correspond to dimensions 2 and 3, which are height and width. As there are no specifications for the batch and channel dimensions, these values are fixed according to the example inputs.

ExecuTorch uses the shape bounds both to optimize the model and to plan memory for model execution. For this reason, it is advised to set the dimension upper bounds to no higher than needed, as higher bounds increase memory consumption.

For more complex use cases, dynamic shape specification allows for mathematical relationships between dimensions. For more information on dynamic shape specification, see [Expressing Dynamism](https://pytorch.org/docs/stable/export.html#expressing-dynamism).

## Testing the Model

Before integrating the runtime code, it is common to test the exported model from Python. This can be used to evaluate model accuracy and sanity check behavior before moving to the target device. Note that not all hardware backends are available from Python, as they may require specialized hardware to function. See the specific backend documentation for more information on hardware requirements and the availability of simulators. The XNNPACK delegate used in this example is always available on host machines.

```python
import torch
from executorch.runtime import Runtime

runtime = Runtime.get()

input_tensor = torch.randn(1, 1, 32, 32)
program = runtime.load_program("model.pte")
method = program.load_method("forward")
outputs = method.execute([input_tensor])
```

To run a model with program and data separated, please use the [ExecuTorch Module pybindings](https://github.com/pytorch/executorch/blob/main/extension/pybindings/README.md).

```python
import torch
from executorch.extension.pybindings import portable_lib

input_tensor = torch.randn(1, 1, 32, 32)
module = portable_lib._load_for_executorch("model.pte", "model.ptd")
outputs = module.forward([input_tensor])
```

There is also an E2E demo in [executorch-examples](https://github.com/meta-pytorch/executorch-examples/tree/main/program-data-separation).

For more information, see [Runtime API Reference](executorch-runtime-api-reference.rst).

## Advanced Topics

While many models will "just work" following the steps above, some more complex models may require additional work to export. These include models with state and models with complex control flow or auto-regressive generation. See the [Llama model](https://github.com/pytorch/executorch/tree/main/examples/models/llama) for example use of these techniques.

### State Management

Some types of models maintain internal state, such as KV caches in transformers. There are two ways to manage state within ExecuTorch.
The first is to bring the state out as model inputs and outputs, effectively making the core model stateless. This is sometimes referred to as managing the state as IO.

The second approach is to leverage mutable buffers within the model directly. A mutable buffer can be registered using the PyTorch [register_buffer](https://docs.pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.register_buffer) API on `nn.Module`. Storage for the buffer is managed by the framework, and any mutations to the buffer within the model are written back at the end of method execution.

Mutable buffers have several limitations:

- Export of mutability can be fragile.
- Consider explicitly calling `detach()` on tensors before assigning to a buffer if you encounter export-time errors related to gradients.
- Ensure that any operations done on a mutable buffer are done with in-place operations (typically ending in `_`).
- Do not reassign the buffer variable. Instead, use `copy_` to update the entire buffer content.
- Mutable buffers are not shared between multiple methods within a .pte.
- In-place operations are replaced with non-in-place variants, and the resulting tensor is written back at the end of the method execution. This can be a performance bottleneck when using `index_put_`.
- Buffer mutations are not supported on all backends and may cause graph breaks and memory transfers back to CPU.

Support for mutation is experimental and may change in the future.

### Dynamic Control Flow

Control flow is considered dynamic if the path taken is not fixed at export-time. This is commonly the case when if or loop conditions depend on the value of a Tensor, such as a generator loop that terminates when an end-of-sequence token is generated. Shape-dependent control flow can also be dynamic if the tensor shape depends on the input.

To make dynamic if statements exportable, they can be written using [torch.cond](https://docs.pytorch.org/docs/stable/generated/torch.cond.html). Dynamic loops are not currently supported on ExecuTorch. The general approach to enable this type of model is to export the body of the loop as a method, and then handle loop logic from the application code. This is common for handling generator loops in auto-regressive models, such as transformer incremental decoding.

### Multi-method Models

ExecuTorch allows for bundling of multiple methods within a single .pte file. This can be useful for more complex model architectures, such as encoder-decoder models.

To include multiple methods in a .pte, each method must be exported individually with `torch.export.export`, yielding one `ExportedProgram` per method. These can be passed as a dictionary into `to_edge_transform_and_lower`:

```python
encode_ep = torch.export.export(...)
decode_ep = torch.export.export(...)
lowered = to_edge_transform_and_lower({
    "encode": encode_ep,
    "decode": decode_ep,
}).to_executorch()
```

At runtime, the method name can be passed to `load_method` and `execute` on the `Module` class.

Multi-method .ptes have several caveats:

- Methods are individually memory-planned. Activation memory is not currently re-used between methods. For advanced use cases, a [custom memory plan](compiler-memory-planning.md) or [custom memory allocators](https://docs.pytorch.org/executorch/stable/runtime-overview.html#operating-system-considerations) can be used to overlap the allocations.
- Mutable buffers are not shared between methods.
- PyTorch export does not currently allow for exporting methods on a module other than `forward`.
To work around this, it is common to create wrapper `nn.Modules` for each method. ```python class EncodeWrapper(torch.nn.Module): def __init__(self, model): super().__init__() self.model = model def forward(self, *args, **kwargs): return self.model.encode(*args, **kwargs) class DecodeWrapper(torch.nn.Module): # ... encode_ep = torch.export.export(EncodeWrapper(model), ...) decode_ep = torch.export.export(DecodeWrapper(model), ...) # ... ``` ## Next Steps The PyTorch and ExecuTorch export and lowering APIs provide a high level of customizability to meet the needs of diverse hardware and models. See [torch.export](https://pytorch.org/docs/main/export.html) and [Export API Reference](export-to-executorch-api-reference.rst) for more information. For advanced use cases, see the following: - [Quantization Overview](quantization-overview.md) for information on quantizing models to reduce inference time and memory footprint. - [Memory Planning](compiler-memory-planning.md) for information on controlling memory placement and planning. - [Custom Compiler Passes](compiler-custom-compiler-passes.md) for information on writing custom compiler passes. - [Export IR Specification](ir-exir.md) for information on the intermediate representation generated by export. --- # Frequently Asked Questions This page summarizes frequently asked questions and provides guidance on issues that commonly occur when adopting ExecuTorch. If a specific issue is not covered here, consider searching for or creating an issue on GitHub under [Issues](https://github.com/pytorch/executorch/issues) or [Discussions](https://github.com/pytorch/executorch/discussions). ## Installation ### Missing /usr/include/python3.x Most likely `python-dev` library needs to be installed. Please run ``` sudo apt install python-dev ``` if you are using Ubuntu, or use an equivalent install command. ### ModuleNotFoundError: No module named 'pytorch_tokenizers' The `pytorch_tokenizers` package is required for LLM export functionality. Install it from the ExecuTorch source code: ``` pip install -e ./extension/llm/tokenizers/ ``` ## Export ### Missing out variants: { _ } The model likely contains torch custom operators. Custom ops need an Executorch implementation and need to be loaded at export time. See the [ExecuTorch Custom Ops Documentation](kernel-library-custom-aten-kernel.md#apis) for details on how to do this. ### RuntimeError: PyTorch convert function for op _ not implemented The model likely contains an operator that is not yet supported on ExecuTorch. In this case, consider searching for or creating an issue on [GitHub](https://github.com/pytorch/executorch/issues). ## Runtime ExecuTorch error codes are defined in [executorch/core/runtime/error.h](https://github.com/pytorch/executorch/blob/main/runtime/core/error.h). ### Inference is Slow / Performance Troubleshooting If building the runtime from source, ensure that the build is done in release mode. For CMake builds, this can be done by passing `-DCMAKE_BUILD_TYPE=Release`. Ensure the model is delegated. If not targeting a specific accelerator, use the XNNPACK delegate for CPU performance. Undelegated operators will typically fall back to the ExecuTorch portable library, which is designed as a fallback, and is not intended for performance sensitive operators. To target XNNPACK, pass an `XnnpackPartitioner` to `to_edge_transform_and_lower`. See [Building and Running ExecuTorch with XNNPACK Backend](tutorial-xnnpack-delegate-lowering.md) for more information. 
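
One rough way to confirm delegation from Python, before serializing the .pte, is to print the lowered graph and look for delegate calls. This is a hedged sketch rather than an official diagnostic; it assumes an `exported_program` produced by `torch.export.export`, as shown in the export documentation above.

```python
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
from executorch.exir import to_edge_transform_and_lower

# Lower with the XNNPACK partitioner, then inspect the resulting graph.
lowered = to_edge_transform_and_lower(
    exported_program,
    partitioner=[XnnpackPartitioner()],
)

# Delegated regions appear as calls into the delegate (e.g. executorch_call_delegate)
# instead of individual operators; if no such calls appear, nothing was delegated.
lowered.exported_program().graph_module.print_readable()
```
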
Thread count can have a significant impact on CPU performance. The optimal thread count may depend on the model and application. By default, ExecuTorch will currently use as many threads as there are cores. Consider setting the thread count to cores / 2, or just set to 4 on mobile CPUs. Thread count can be set with the following function. Ensure this is done prior to loading or running a model. ``` ::executorch::extension::threadpool::get_threadpool()->_unsafe_reset_threadpool(num_threads); ``` For a deeper investigation into model performance, ExecuTorch supports operator-level performance profiling. See [Using the ExecuTorch Developer Tools to Profile a Model](devtools-integration-tutorial.md) for more information. ### Missing Logs ExecuTorch provides hooks to route runtime logs. By default, logs are sent to stdout/stderr, but users can override `et_pal_emit_log_message` to route logs to a custom destination. The Android and iOS extensions also provide out-of-box log routing to the appropriate platform logs. See [Runtime Platform Abstraction Layer (PAL)](runtime-platform-abstraction-layer.md) for more information. ### Error setting input: 0x10 / Attempted to resize a bounded tensor... This usually means the inputs provided do not match the shape of the example inputs used during model export. If the model is expected to handle varying size inputs (dynamic shapes), make sure the model export specifies the appropriate bounds. See [Expressing Dynamism](https://pytorch.org/docs/stable/export.html#expressing-dynamism) for more information on specifying dynamic shapes. ### Error 0x14 (Operator Missing) This usually means that the selective build configuration is incorrect. Ensure that the operator library is generated from the current version of the model and the corresponding `et_operator_library` is a dependency of the app-level `executorch_generated_lib` and the generated lib is linked into the application. This can also occur if the ExecuTorch portable library does not yet have an implementation of the given ATen operator. In this case, consider search for or creating an issue on [GitHub](https://github.com/pytorch/executorch/issues). ### Error 0x20 (Not Found) This error can occur for a few reasons, but the most common is a missing backend target. Ensure the appropriate backend target is linked. For XNNPACK, this is `xnnpack_backend`. If the backend is linked but is still not available, try linking with --whole-archive: `-Wl,--whole-archive libxnnpack_backend.a -Wl,--no-whole-archive`. ### Duplicate Kernel Registration Abort This manifests as a crash call stack including ExecuTorch kernel registration and failing with an `et_pal_abort`. This typically means there are multiple `gen_operators_lib` targets linked into the applications. There must be only one generated operator library per target, though each model can have its own `gen_selected_ops/generate_bindings_for_kernels` call. --- # Using ExecuTorch on iOS ExecuTorch supports both iOS and macOS via Objective-C, Swift, and C++. ExecuTorch also provides backends to leverage Core ML and Metal Performance Shaders (MPS) for hardware-accelerated execution on Apple platforms. ## Integration The ExecuTorch Runtime for iOS and macOS (ARM64) is distributed as a collection of prebuilt [.xcframework](https://developer.apple.com/documentation/xcode/creating-a-multi-platform-binary-framework-bundle) binary targets. 
These targets are compatible with both iOS and macOS devices and simulators and are available in both release and debug modes: * `executorch` - Core runtime components * `executorch_llm` - LLM-specific runtime components * `backend_coreml` - Core ML backend * `backend_mps` - MPS backend * `backend_xnnpack` - XNNPACK backend * `kernels_llm` - Custom kernels for LLMs * `kernels_optimized` - Accelerated generic CPU kernels * `kernels_quantized` - Quantized kernels * `kernels_torchao` - Quantized CPU kernels from torchao Link your binary with the ExecuTorch runtime and any backends or kernels used by the exported ML model. It is recommended to link the core runtime to the components that use ExecuTorch directly, and link kernels and backends against the main app target. **Note:** You may need to add some extra linker flags for the build settings of the components that links against ExecuTorch backends or kernels to let them register properly at the app startup. See the [Linkage](#Linkage) section for more details. **Note:** To access logs, link against the Debug build of the ExecuTorch runtime, i.e., the `executorch_debug` framework. For optimal performance, always link against the Release version of the deliverables (those without the `_debug` suffix), which have all logging overhead removed. See the [Logging](#Logging) section for more details. ### Swift Package Manager The prebuilt ExecuTorch runtime, backend, and kernels are available as a [Swift PM](https://www.swift.org/documentation/package-manager/) package. #### Xcode In Xcode, go to `File > Add Package Dependencies`. Paste the URL of the [ExecuTorch repo](https://github.com/pytorch/executorch) into the search bar and select it. Make sure to change the branch name to the desired ExecuTorch version in format "swiftpm-", (e.g. "swiftpm-1.0.0"), or a branch name in format "swiftpm-." (e.g. "swiftpm-1.1.0-20251101") for a [nightly build](https://ossci-ios.s3.amazonaws.com/list.html) on a specific date. ![](_static/img/swiftpm_xcode1.png) Then select which ExecuTorch framework should link against which target. ![](_static/img/swiftpm_xcode2.png) Click the screenshot below to watch the *demo video* on how to add the package and run a simple ExecuTorch model on iOS. Integrating and Running ExecuTorch on Apple Platforms #### CLI Add a package and target dependencies on ExecuTorch to your package file like this: ```swift // swift-tools-version:5.9 import PackageDescription let package = Package( name: "YourPackageName", platforms: [ .iOS(.v17), .macOS(.v12), ], products: [ .library(name: "YourPackageName", targets: ["YourTargetName"]), ], dependencies: [ // Use "swiftpm-." branch name for a nightly build. .package(url: "https://github.com/pytorch/executorch.git", branch: "swiftpm-1.0.0") ], targets: [ .target( name: "YourTargetName", dependencies: [ .product(name: "executorch", package: "executorch"), .product(name: "backend_xnnpack", package: "executorch"), .product(name: "kernels_optimized", package: "executorch"), // Add other backends and kernels as needed. ]), linkerSettings: [ // Force load all symbols from static libraries to trigger backends and kernels registration .unsafeFlags(["-Wl,-all_load"]) ] ] ) ``` Then check if everything works correctly: ```bash cd path/to/your/package swift package resolve # or just build it swift build ``` ### Building from Source Another way to integrate the ExecuTorch runtime is to build the necessary components from sources locally and link against them. This is useful when customizing the runtime. 1. 
Install [Xcode](https://developer.apple.com/xcode/resources/) 15+ and Command Line Tools: ```bash xcode-select --install ``` 2. Clone ExecuTorch: ```bash git clone -b viable/strict https://github.com/pytorch/executorch.git --depth 1 --recurse-submodules --shallow-submodules && cd executorch ``` 3. Set up [Python](https://www.python.org/downloads/macos/) 3.10+ and activate a virtual environment: ```bash python3 -m venv .venv && source .venv/bin/activate && pip install --upgrade pip ``` 4. Install the required dependencies, including those needed for the backends like [Core ML](backends/coreml/coreml-overview.md) or [MPS](backends/mps/mps-overview.md), if you plan to build them later: ```bash ./install_requirements.sh # CoreML-only requirements: ./backends/apple/coreml/scripts/install_requirements.sh ``` 5. Install [CMake](https://cmake.org): Download the macOS binary distribution from the [CMake website](https://cmake.org/download), open the `.dmg` file, move `CMake.app` to the `/Applications` directory, and then run the following command to install the CMake command-line tools: ```bash sudo /Applications/CMake.app/Contents/bin/cmake-gui --install ``` 6. Use the provided script to build .xcframeworks: The following command will build the ExecuTorch runtime components along with all available kernels and backends for the Apple platform in both Release and Debug modes: ```bash ./scripts/build_apple_frameworks.sh ``` After the build finishes successfully, the resulting frameworks can be found in the `cmake-out` directory. Copy them to your project and link them against your targets. ## Linkage ExecuTorch initializes its backends and kernels (operators) during app startup by registering them in a static dictionary. If you encounter errors like "unregistered kernel" or "unregistered backend" at runtime, you may need to explicitly force-load certain components. Use the `-all_load` or `-force_load` linker flags in your Xcode build configuration to ensure components are registered early. Here's an example of a Xcode configuration file (`.xcconfig`): ``` ET_PLATFORM[sdk=iphonesimulator*] = simulator ET_PLATFORM[sdk=iphoneos*] = ios ET_PLATFORM[sdk=macos*] = macos OTHER_LDFLAGS = $(inherited) \ -force_load $(BUILT_PRODUCTS_DIR)/libexecutorch_debug_$(ET_PLATFORM).a \ -force_load $(BUILT_PRODUCTS_DIR)/libbackend_coreml_$(ET_PLATFORM).a \ -force_load $(BUILT_PRODUCTS_DIR)/libbackend_mps_$(ET_PLATFORM).a \ -force_load $(BUILT_PRODUCTS_DIR)/libbackend_xnnpack_$(ET_PLATFORM).a \ -force_load $(BUILT_PRODUCTS_DIR)/libkernels_optimized_$(ET_PLATFORM).a \ -force_load $(BUILT_PRODUCTS_DIR)/libkernels_quantized_$(ET_PLATFORM).a ``` **Note:** In the example above, we link against the Debug version of the ExecuTorch runtime (`libexecutorch_debug`) to preserve the logs. Normally, that does not impact the performance too much. Nevertheless, remember to link against the release version of the runtime (`libexecutorch`) for the best performance and no logs. You can assign such a config file to your target in Xcode: 1. Add the `.xcconfig` file to your project. 2. Navigate to the project’s Info tab. 3. Select the configuration file in the build configurations for Release (or Debug) mode. ## Runtime API ExecuTorch provides native Objective-C APIs, automatically bridged to Swift, for interacting with the runtime. 
These APIs act as wrappers around the core C++ components found in [extension/tensor](extension-tensor.md) and [extension/module](extension-module.md), offering a more idiomatic experience for Apple platform developers.

**Note:** These Objective-C/Swift APIs are currently experimental and subject to change.

### Importing

Once linked against the `executorch` framework, you can import the necessary components.

Objective-C (Objective-C++):

```objectivec
// Import the main umbrella header for Module/Tensor/Value wrappers.
#import <ExecuTorch/ExecuTorch.h>

// If using C++ directly alongside Objective-C++, you might still need C++ headers.
#import <executorch/extension/module/module.h>
#import <executorch/extension/tensor/tensor.h>
```

Swift:

```swift
import ExecuTorch
```

#### Example

Here's a concise example demonstrating how to load a model, prepare input, run inference, and process output using the Objective-C and Swift APIs. Imagine you have a MobileNet v3 model (`mv3.pte`) that takes a `[1, 3, 224, 224]` float tensor as input and outputs logits.

Objective-C:

```objectivec
NSString *modelPath = [[NSBundle mainBundle] pathForResource:@"mv3" ofType:@"pte"];

// Create a module with the model file path. Nothing gets loaded into memory just yet.
ExecuTorchModule *module = [[ExecuTorchModule alloc] initWithFilePath:modelPath];

NSError *error; // Optional error output argument to learn about failures.

// Force-load the program and 'forward' method. Otherwise, it's loaded at the first execution.
[module loadMethod:@"forward" error:&error];

float *imageBuffer = ...; // Existing image buffer.

// Create an input tensor referencing the buffer and assuming the given shape and data type.
ExecuTorchTensor *inputTensor = [[ExecuTorchTensor alloc] initWithBytesNoCopy:imageBuffer
                                                                        shape:@[@1, @3, @224, @224]
                                                                     dataType:ExecuTorchDataTypeFloat];

// Execute the 'forward' method with the given input tensor and get output values back.
NSArray *outputs = [module forwardWithTensor:inputTensor error:&error];

// Get the first output value assuming it's a tensor.
ExecuTorchTensor *outputTensor = outputs.firstObject.tensorValue;

// Access the output tensor data.
[outputTensor bytesWithHandler:^(const void *pointer, NSInteger count, ExecuTorchDataType dataType) {
  float *logits = (float *)pointer;
  // Use logits...
}];
```

Swift:

```swift
let modelPath = Bundle.main.path(forResource: "mv3", ofType: "pte")!

// Create a module with the model file path. Nothing gets loaded into memory just yet.
let module = Module(filePath: modelPath)

// Force-load the program and 'forward' method. Otherwise, it's loaded at the first execution.
try module.load("forward")

let imageBuffer: UnsafeMutableRawPointer = ... // Existing image buffer

// Create an input tensor referencing the buffer and assuming the given shape and data type.
let inputTensor = Tensor(&imageBuffer, shape: [1, 3, 224, 224])

// Execute the 'forward' method with the given input tensor and get an output tensor back.
let outputTensor = try Tensor(module.forward(inputTensor))

// Copy the tensor data into a logits array for easier access.
let logits = outputTensor.scalars()

// Use logits...
```

### Tensor

A tensor is a multi-dimensional array of elements (such as floats or integers) and includes metadata like shape (dimensions) and data type. Tensors are used to feed inputs to a model and retrieve outputs, or for any computation you need to do on raw data. You can create tensors from simple arrays of numbers, inspect their properties, read or modify their contents, and even reshape or copy them.
ExecuTorch offers `ExecuTorchTensor` class in Objective-C and two tensor types in Swift: - `AnyTensor`: A type-erased tensor, bridged from `ExecuTorchTensor` in Objective-C. You might use it when the tensor's data type is only known at runtime, for example, when converting from an untyped `Value` object before casting it to a generic `Tensor`. - `Tensor`: A generic, type-safe wrapper around AnyTensor. This is the recommended type for most use cases in Swift. It ensures the element type (e.g., `Float`, `Int`) is known at compile time, providing type-safe access to tensor data and catching type mismatches early. You can convert between them using `tensor.anyTensor` (to get the underlying `AnyTensor`) and `anyTensor.asTensor()` (to convert to a typed `Tensor` if the data types match). #### Key Properties: - `dataType`: The element type (e.g., `.float`, `.int`, `.byte`). In `Tensor`, this is determined by `T`. - `shape`: An array of `Int` describing the size of each dimension. - `count`: The total number of elements. - `strides`: The jump in memory needed to advance one element along each dimension. - `dimensionOrder`: The order of dimensions in memory. - `shapeDynamism`: Indicates if the tensor shape can change (`.static`, `.dynamicBound`, `.dynamicUnbound`). #### Initialization: You can create a new tensor from an existing one, either as a view (which shares the same underlying data) or as a copy (which gets its own unique data). - View: `init(_:)` creates a new tensor instance that points to the same memory as the original. Modifying the data through one tensor will affect the other. - Copy: `copy()` creates a completely independent duplicate of the tensor, including its own copy of the data. Objective-C: ```objectivec // Create a view. ExecuTorchTensor *tensorView = [[ExecuTorchTensor alloc] initWithTensor:originalTensor]; // Create a copy. ExecuTorchTensor *tensorCopy = [originalTensor copy]; ``` Swift: ```swift // Create a view. let tensorView = Tensor(originalTensor) // Create a copy. let tensorCopy = originalTensor.copy() ``` Tensors can be initialized directly from memory pointers or `Data` objects. - `init(bytesNoCopy:...)`: Creates a tensor that references an existing memory buffer without copying. The buffer's lifetime must be managed manually and must exceed the tensor's. - `init(bytes:...)`: Creates a tensor by copying data from a memory buffer. - `init(data:...)`: Creates a tensor using an `NSData` (Objective-C) or `Data` (Swift) object, referencing its bytes without copying. Objective-C: ```objectivec // Create by copying bytes. float data[] = {1.0f, 2.0f, 3.0f, 4.0f}; NSArray *shape = @[@2, @2]; ExecuTorchTensor *tensorFromBytes = [[ExecuTorchTensor alloc] initWithBytes:data shape:shape dataType:ExecuTorchDataTypeFloat]; // Create from NSData (no copy). NSData *nsData = [NSData dataWithBytes:data length:sizeof(data)]; ExecuTorchTensor *tensorFromNSData = [[ExecuTorchTensor alloc] initWithData:nsData shape:shape dataType:ExecuTorchDataTypeFloat]; ``` Swift: ```swift // Create from a buffer without copying (unsafe). var mutableData: [Float] = [1.0, 2.0, 3.0, 4.0] let tensorNoCopy = mutableData.withUnsafeMutableBytes { pointer in Tensor( bytesNoCopy: pointer.baseAddress!, shape: [2, 2] ) } // Create from Data (no copy). let data = Data(bytes: &mutableData, count: mutableData.count * MemoryLayout.size) let tensorFromData = Tensor(data: data, shape: [2, 2]) ``` The most convenient way to create tensors is from Swift arrays or single scalar values. 
The `Tensor` API uses type inference to determine the `dataType` automatically. objective-c: ```objectivec // Create from an array of scalars. NSArray *scalars = @[@(1), @(2), @(3)]; NSArray *shape = @[@3]; ExecuTorchTensor *tensorFromScalars = [[ExecuTorchTensor alloc] initWithScalars:scalars shape:shape dataType:ExecuTorchDataTypeInt]; // Create a float scalar tensor. ExecuTorchTensor *scalarTensor = [[ExecuTorchTensor alloc] initWithFloat:3.14f]; ``` Swift: ```swift // Create from an array of scalars (infers shape and copies data). let tensor = Tensor([1.0, 2.0, 3.0, 4.0]) // Creates a Tensor with shape [4] // Specify shape. let tensorWithShape = Tensor([1, 2, 3, 4, 5, 6], shape: [2, 3]) // Creates Tensor // Create without copying from an `inout` array. var liveData: [Int32] = [10, 20, 30] let tensorNoCopy = Tensor(&liveData) // Modifying `liveData` affects `tensorNoCopy` // Create an Int scalar tensor. let scalarTensor = Tensor(42) // Infers Tensor with shape [] ``` #### Factory Methods: ExecuTorch provides a rich set of factory methods to create tensors with pre-filled or random data. - `empty`: Creates a tensor with uninitialized data. - `full`: Creates a tensor filled with a specified scalar value. - `ones`: Creates a tensor filled with ones. - `zeros`: Creates a tensor filled with zeros. - `rand`: Creates a tensor with random values uniformly distributed in `[0, 1)`. - `randn`: Creates a tensor with random values from a normal distribution (mean 0, variance 1). - `randint`: Creates a tensor with random integers in a specified range `[low, high)`. Each method has a `like:` variant that creates a new tensor with the same shape and properties as an existing one. Objective-C: ```objectivec // Create a 2x2 tensor filled with zeros. ExecuTorchTensor *zeros = [ExecuTorchTensor zerosTensorWithShape:@[@2, @2] dataType:ExecuTorchDataTypeFloat]; // Create a tensor of ones with the same shape as `zeros`. ExecuTorchTensor *ones = [ExecuTorchTensor onesTensorLikeTensor:zeros]; ``` Swift: ```swift // Create a 2x2 tensor filled with the value 7. let fullTensor = Tensor.full(shape: [2, 2], scalar: 7) // Create a 3x3 tensor of ones. let onesTensor = Tensor.ones(shape: [3, 3]) // Create a tensor of zeros with the same shape as onesTensor. let zerosTensor = Tensor.zeros(like: onesTensor) // Create a tensor with random integers between 10 (inclusive) and 20 (exclusive). let randomInts = Tensor.randint(low: 10, high: 20, shape: [5]) // Create a 2x2 type-erased tensor filled with zeros and explicit data type. let anyZeros = AnyTensor.zeros(shape: [2, 2], dataType: .float) // Create a 2x3 type-erased tensor filled with random values and explicit data type. let anyRand = AnyTensor.rand(shape: [2, 3], dataType: .double) ``` #### Accessing Data: Reading data: - `scalars()`: Returns a copy of the tensor's elements as a new `[T]` array. - `withUnsafeBytes(_:)`: Provides a type-safe, immutable buffer pointer (`UnsafeBufferPointer`) for efficient, direct memory access without creating a new array. - `bytesWithHandler:`: The Objective-C and `AnyTensor` approach, which uses a callback with a raw `void *` pointer and requires manual type casting. 
Objective-C: ```objectivec [tensor bytesWithHandler:^(const void *pointer, NSInteger count, ExecuTorchDataType dataType) { if (dataType == ExecuTorchDataTypeFloat) { const float *floatPointer = (const float *)pointer; NSLog(@"First float element: %f", floatPointer[0]); } }]; ``` Swift: ```swift let tensor = Tensor([1.0, 2.0, 3.0, 4.0], shape: [2, 2]) // Get data copy as a Swift array. let scalars = tensor.scalars() print("All scalars: \(scalars)") // [1.0, 2.0, 3.0, 4.0] // Access data via a buffer pointer. tensor.withUnsafeBytes { buffer in print("First float element: \(buffer.first ?? 0.0)") } anyTensor.bytes { pointer, count, dataType in // Must check data type and manually cast the pointer for type-erased tensor. if dataType == .float { let buffer = UnsafeBufferPointer(start: pointer.assumingMemoryBound(to: Float.self), count: count) print("First float element from AnyTensor: \(buffer.first ?? 0.0)") } } ``` Modifying Data: - `withUnsafeMutableBytes(_:)`: The preferred Swift method. Provides a type-safe, mutable buffer pointer (`UnsafeMutableBufferPointer`) for in-place modification. - `mutableBytesWithHandler:`: The Objective-C and `AnyTensor` equivalent. Objective-C: ```objectivec [tensor mutableBytesWithHandler:^(void *pointer, NSInteger count, ExecuTorchDataType dataType) { if (dataType == ExecuTorchDataTypeFloat) { float *floatPointer = (float *)pointer; floatPointer[0] = 100.0f; // Modify the tensor's data. } }]; ``` Swift: ```swift let tensor = Tensor([1.0, 2.0, 3.0, 4.0], shape: [2, 2]) // Modify the tensor's data in place. tensor.withUnsafeMutableBytes { buffer in buffer[1] = 200.0 } // tensor's data is now [1.0, 200.0, 3.0, 4.0] anyTensor.mutableBytes { pointer, count, dataType in if dataType == .float { let buffer = UnsafeMutableBufferPointer(start: pointer.assumingMemoryBound(to: Float.self), count: count) buffer[0] = 100.0 // Modify the AnyTensor's data } } ``` #### Resizing: Tensors can be resized if their shape dynamism is not `.static`. Resizing only changes the tensor's metadata (shape and strides) and does not reallocate or change the underlying data, so the new shape must have the same total number of elements. Objective-C: ```objectivec NSError *error; BOOL success = [tensor resizeToShape:@[@4, @1] error:&error]; if (success) { NSLog(@"Resized shape: %@", tensor.shape); } else { NSLog(@"Resize failed: %@", error); } ``` Swift: ```swift do { try tensor.resize(to: [4, 1]) print("Resized shape: \(tensor.shape)") } catch { print("Resize failed: \(error)") } ``` #### Equality: You can check if two tensors are equal using the `==` operator. It compares their data type, shape, strides, dimension order, and all underlying element data. The `shapeDynamism` property is disregarded in this comparison. #### Printing: Tensors conform to `CustomStringConvertible` in Swift and implement `-description` in Objective-C, so you can print them directly to the console for easy debugging. ### Value The `Value` class (exposed as `ExecuTorchValue` in Objective-C) is a dynamic container that can hold different types of data, primarily used for model inputs and outputs. ExecuTorch methods accept and return arrays of `Value` objects. #### Key Properties: - `tag`: Indicates the type of data held (e.g., `.tensor`, `.integer`, `.string`, `.boolean`). - `isTensor`, `isInteger`, `isString`, etc.: Boolean checks for the type. - `tensor`, `integer`, `string`, `boolean`, `double`: Accessors for the underlying data (return `nil` or a default value if the tag doesn't match). 
#### Initialization: Create Value objects directly from the data they should hold. Objective-C: ```objectivec #import ExecuTorchTensor *tensor = [[ExecuTorchTensor alloc] initWithFloat:1.0f]; ExecuTorchValue *tensorValue = [[ExecuTorchValue alloc] valueWithTensor:tensor]; ExecuTorchValue *intValue = [[ExecuTorchValue alloc] valueWithInteger:100]; ExecuTorchValue *stringValue = [[ExecuTorchValue alloc] valueWithString:@"hello"]; ExecuTorchValue *boolValue = [[ExecuTorchValue alloc] valueWithBoolean:YES]; ExecuTorchValue *doubleValue = [[ExecuTorchValue alloc] valueWithDouble:3.14]; ``` Swift: ```swift import ExecuTorch let tensor = Tensor(2.0) let tensorValue = Value(tensor) let intValue = Value(200) let stringValue = Value("world") let boolValue = Value(false) let doubleValue = Value(2.718) ``` Also, in Swift, all the types that `Value` can hold conform to the `ValueConvertible` protocol, so you can create `Value` objects directly from them without explicitly wrapping them in `Value` constructors: ```swift func processValue(_ value: ValueConvertible) { // ... } processValue(1) // Value processValue(1.0) // Value processValue("hello") // Value processValue(true) // Value processValue(Tensor(1.0)) // Value ``` ### Module The `Module` class (exposed as `ExecuTorchModule` in Objective-C) represents a loaded ExecuTorch model (`.pte` file). It provides methods to load the model program, inspect its methods, and execute them for inference. Note: `Module` and its methods are not thread-safe. If you need to do concurrent inferences from multiple threads, create one `Module` per thread. #### Initialization: Create a `Module` instance by providing the file path to the `.pte` model. Initialization itself is lightweight and doesn't load the program data immediately. You can also specify a `ModuleLoadMode` to control how the file is loaded, such as using memory mapping for efficiency. Objective-C: ```objectivec #import NSString *modelPath = [[NSBundle mainBundle] pathForResource:@"model" ofType:@"pte"]; ExecuTorchModule *module = [[ExecuTorchModule alloc] initWithFilePath:modelPath]; // Optional: specify load mode, e.g., memory mapping. ExecuTorchModule *moduleMmap = [[ExecuTorchModule alloc] initWithFilePath:modelPath loadMode:ExecuTorchModuleLoadModeMmap]; ``` Swift: ```swift import ExecuTorch let modelPath = Bundle.main.path(forResource: "model", ofType: "pte")! let module = Module(filePath: modelPath) // Optional: specify load mode, e.g., memory mapping. let moduleMmap = Module(filePath: modelPath, loadMode: .mmap) ``` #### Loading: Model loading is deferred until explicitly requested or needed. You can load the entire program or individual methods. While execution calls can trigger loading automatically, it's often more efficient to load methods explicitly beforehand. - `load()`: Loads the basic program structure. You can specify a `ModuleVerification` level, though minimal verification is used by default. - `load(_:)`: Loads the program structure and prepares a specific method (e.g., "forward") for execution. This performs necessary setup like backend delegation and is recommended if you know which method you'll run. - `isLoaded()` / `isLoaded(_:)`: Check loading status. Objective-C: ```objectivec NSError *error; // Loads program and prepares 'forward' for execution. 
BOOL success = [module loadMethod:@"forward" error:&error]; if (success) { NSLog(@"Forward method loaded: %d", [module isMethodLoaded:@"forward"]); } else { NSLog(@"Failed to load method: %@", error); } ``` Swift: ```swift do { // Loads program and prepares 'forward' for execution. try module.load("forward") print("Forward method loaded: \(module.isLoaded("forward"))") } catch { print("Failed to load method: \(error)") } ``` #### Inspecting Method Metadata You can programmatically inspect a method's contract—its input/output types, tensor shapes, data types, and more—by retrieving its MethodMetadata. This is incredibly useful for building dynamic applications that can adapt to different models without hardcoding dimensions. Objective-c: ```objectivec NSError *error; ExecuTorchMethodMetadata *metadata = [module methodMetadata:@"forward" error:&error]; if (metadata) { // Check if the first input is a tensor. ExecuTorchValueTag firstInputTag = [metadata.inputValueTags[0] unsignedIntValue]; if (firstInputTag == ExecuTorchValueTagTensor) { // Get the metadata for the first input tensor. ExecuTorchTensorMetadata *tensorMeta = metadata.inputTensorMetadata[@0]; if (tensorMeta) { NSLog(@"Expected input shape: %@", tensorMeta.shape); NSLog(@"Expected input data type: %ld", (long)tensorMeta.dataType); // You can now dynamically create a matching input tensor. } } } ``` Swift: ```swift do { // Easily inspect the "forward" method at runtime. let metadata = try module.methodMetadata("forward") // Check if the first input is a tensor and get its metadata. if metadata.inputValueTags.first == .tensor, let tensorMeta = metadata.inputTensorMetadata[0] { print("Expected input shape: \(tensorMeta.shape)") print("Expected input data type: \(tensorMeta.dataType)") // Dynamically create a random tensor that matches the model's input specs. let input = AnyTensor.rand(shape: tensorMeta.shape, dataType: tensorMeta.dataType) // Use the dynamically created tensor for inference. let outputs = try module.forward(input) print("Successfully ran inference with dynamic input.") } } catch { print("Failed to get metadata or run inference: \(error)") } ``` #### Execution: The Module class offers flexible ways to execute methods. Inputs can be any type conforming to `ValueConvertible` (like `Tensor`, `Int`, `Float`, `Bool`, etc.). - `execute(_:_:)`: Execute any available method by name with one or more inputs. - `forward(_:)`: A convenient shortcut for executing the common "forward" method. The API provides overloads for single inputs, multiple inputs, or no inputs. Outputs are returned in two ways: - As an array of `Value`s, letting you inspect and cast results yourself. - As your expected type. The generic overloads decode the result directly into your desired Swift type (such as a single `Tensor`, an array, or any custom type conforming to the `ValueSequenceConstructible` protocol). If the output doesn’t match the expected type (e.g. multiple Values returned when a single object is expected, or a tensor data type mismatch), an invalid type error is thrown. Objective-C: ```objectivec ExecuTorchTensor *inputTensor1 = [[ExecuTorchTensor alloc] initWithScalars:@[@1.0f, @2.0f]]; ExecuTorchTensor *inputTensor2 = [[ExecuTorchTensor alloc] initWithScalars:@[@3.0f, @4.0f]]; ExecuTorchTensor *singleInputTensor = [[ExecuTorchTensor alloc] initWithFloat:5.0f]; NSError *error; // Execute "forward" using the shortcut with an array of Tensors. 
NSArray *outputs1 = [module forwardWithTensors:@[inputTensor1, inputTensor2] error:&error]; if (outputs1) { NSLog(@"Forward output count: %lu", (unsigned long)outputs1.count); } else { NSLog(@"Execution failed: %@", error); } // Execute "forward" with a single Tensor input. NSArray *outputs2 = [module forwardWithTensor:singleInputTensor error:&error]; if (outputs2) { NSLog(@"Forward single input output count: %lu", (unsigned long)outputs2.count); } else { NSLog(@"Execution failed: %@", error); } // Execute a potentially different method by name. NSArray *outputs3 = [module executeMethod:@"another_method" withInput:[[ExecuTorchValue alloc] valueWithTensor:inputTensor1] error:&error]; // Process outputs (assuming first output is a tensor). if (outputs1) { ExecuTorchValue *firstOutput = outputs1.firstObject; if (firstOutput.isTensor) { ExecuTorchTensor *resultTensor = firstOutput.tensorValue; // Process resultTensor. } } ``` Swift: ```swift let inputTensor1 = Tensor([1.0, 2.0]) let inputTensor2 = Tensor([3.0, 4.0]) let singleInputTensor = Tensor([5.0]) do { // Execute "forward" using the shortcut with an array of Tensors. let outputs1 = try module.forward([inputTensor1, inputTensor2]) print("Forward output count: \(outputs1.count)") // Execute "forward" with a single Tensor input. let outputs2 = try module.forward(singleInputTensor) print("Forward single input output count: \(outputs2.count)") // Execute a potentially different method by name. let outputs3 = try module.execute("another_method", [inputTensor1]) // Process outputs by converting the first output Value to a typed Tensor. if let outputTensor: Tensor = outputs1.first?.tensor() { // Now you have a type-safe tensor and can access its data easily. let logits = try outputTensor.scalars() print("First 5 logits: \(logits.prefix(5))") } // Try casting the outputs to a single typed object. let tensorOutput = try Tensor(module.forward(inputTensor1, inputTensor2)) let logits = tensorOutput.scalars() } catch { print("Execution failed: \(error)") } ``` #### Method Names: You can query the available method names in the model after the program is loaded. Objective-C: ```objectivec NSError *error; // Note: methodNames: will load the program if not already loaded. NSSet *names = [module methodNames:&error]; if (names) { NSLog(@"Available methods: %@", names); } else { NSLog(@"Could not get method names: %@", error); } ``` Swift: ```swift do { // Note: methodNames() will load the program if not already loaded. let names = try module.methodNames() print("Available methods: \(names)") // Output: e.g., {"forward"} } catch { print("Could not get method names: \(error)") } ``` ### Logging ExecuTorch provides APIs for logging in Objective-C and Swift via the `ExecuTorchLog` (`Log` in Swift) singleton. You can subscribe custom log sinks conforming to the `ExecuTorchLogSink` (`LogSink` in Swift) protocol to receive internal ExecuTorch log messages. **Note:** Logs are stripped in the Release builds of ExecuTorch frameworks. To capture logs, link against the Debug builds (e.g., `executorch_debug`) during development. 
Objective-C: ```objectivec #import #import @interface MyClass : NSObject @end @implementation MyClass - (instancetype)init { self = [super init]; if (self) { #if DEBUG [ExecuTorchLog.sharedLog addSink:self]; #endif } return self; } - (void)dealloc { #if DEBUG [ExecuTorchLog.sharedLog removeSink:self]; #endif } #if DEBUG - (void)logWithLevel:(ExecuTorchLogLevel)level timestamp:(NSTimeInterval)timestamp filename:(NSString *)filename line:(NSUInteger)line message:(NSString *)message { NSString *logMessage = [NSString stringWithFormat:@"%@:%lu %@", filename, (unsigned long)line, message]; switch (level) { case ExecuTorchLogLevelDebug: os_log_with_type(OS_LOG_DEFAULT, OS_LOG_TYPE_DEBUG, "%{public}@", logMessage); break; case ExecuTorchLogLevelInfo: os_log_with_type(OS_LOG_DEFAULT, OS_LOG_TYPE_INFO, "%{public}@", logMessage); break; case ExecuTorchLogLevelError: os_log_with_type(OS_LOG_DEFAULT, OS_LOG_TYPE_ERROR, "%{public}@", logMessage); break; case ExecuTorchLogLevelFatal: os_log_with_type(OS_LOG_DEFAULT, OS_LOG_TYPE_FAULT, "%{public}@", logMessage); break; default: os_log(OS_LOG_DEFAULT, "%{public}@", logMessage); break; } } #endif @end ``` Swift: ```swift import ExecuTorch import os.log public class MyClass { public init() { #if DEBUG Log.shared.add(sink: self) #endif } deinit { #if DEBUG Log.shared.remove(sink: self) #endif } } #if DEBUG extension MyClass: LogSink { public func log(level: LogLevel, timestamp: TimeInterval, filename: String, line: UInt, message: String) { let logMessage = "\(filename):\(line) \(message)" switch level { case .debug: os_log(.debug, "%{public}@", logMessage) case .info: os_log(.info, "%{public}@", logMessage) case .error: os_log(.error, "%{public}@", logMessage) case .fatal: os_log(.fault, "%{public}@", logMessage) default: os_log("%{public}@", logMessage) } } } #endif ``` **Note:** In the example, the logs are intentionally stripped out when the code is not built for Debug mode, i.e., the `DEBUG` macro is not defined or equals zero. ## Debugging If you are linking against a Debug build of the ExecuTorch frameworks, configure your debugger to map the source code correctly by using the following LLDB command in the debug session: ``` settings append target.source-map /executorch ``` ## Troubleshooting ### Slow execution Ensure the exported model is using an appropriate backend, such as XNNPACK, Core ML, or MPS. If the correct backend is invoked but performance issues persist, confirm that you are linking against the Release build of the backend runtime. For optimal performance, link the ExecuTorch runtime in Release mode too. If debugging is needed, you can keep the ExecuTorch runtime in Debug mode with minimal impact on performance, but preserve logging and debug symbols. ### Swift PM If you encounter a checksum mismatch error with Swift PM, clear the package cache using the Xcode menu (`File > Packages > Reset Package Caches`) or the following command: ```bash rm -rf .xcodeproj/project.xcworkspace/xcshareddata/swiftpm \ ~/Library/org.swift.swiftpm \ ~/Library/Caches/org.swift.swiftpm \ ~/Library/Caches/com.apple.dt.Xcode \ ~/Library/Developer/Xcode/DerivedData ``` **Note:** Ensure Xcode is fully quit before running the terminal command to avoid conflicts with active processes. --- # Runtime Integration This section describes options for configuring and customizing the ExecuTorch runtime. While the pre-built packages are designed to provide an "out-of-box" experience, it is common to require additional configuration when shipping into production. 
ExecuTorch provides the ability to gate features such as logging at compile time, customize system integration, and include only the operators needed to run specific models (selective build).

## Logging

ExecuTorch runtime code includes logging statements at various levels, to aid with integration and debugging. Logging inclusion is controlled at build time by the `EXECUTORCH_ENABLE_LOGGING` and `EXECUTORCH_LOG_LEVEL` CMake options. Having these exposed as compile-time configuration allows for all logging-related code to be excluded when not used, which is critical for resource-constrained systems.

Logging is sent to STDOUT and STDERR by default on host platforms, and is redirected to OS-specific logging on Android and iOS. See [Platform Abstraction Layer](#platform-abstraction-layer-pal) below for more information on log routing.

To configure the log level when building from source, specify `EXECUTORCH_ENABLE_LOGGING` as on or off and `EXECUTORCH_LOG_LEVEL` as one of debug, info, error, or fatal. Logging is enabled by default in debug builds and disabled in release. The log level defaults to info. See [Building from Source](using-executorch-building-from-source.md) for more information.

```
cmake -B cmake-out -DEXECUTORCH_ENABLE_LOGGING=ON -DEXECUTORCH_LOG_LEVEL=DEBUG ...
```

## Platform Abstraction Layer (PAL)

The ExecuTorch Platform Abstraction Layer, or PAL, is a glue layer responsible for providing integration with a particular host system. This includes log routing, timestamps, and abort handling. ExecuTorch provides a default implementation for POSIX-compliant targets, as well as Android- and iOS-specific implementations under the appropriate extensions.

For non-POSIX-compliant systems, a minimal no-op PAL implementation is provided. It is expected that users override the relevant PAL methods in order to enable logging, timestamps, and aborts. The minimal PAL can be selected by building with `-DEXECUTORCH_PAL_DEFAULT=minimal`.

### Overriding the PAL

Overriding the default PAL implementation is commonly done to route logs to a user-specified destination or to provide PAL functionality on embedded systems. The PAL can be overridden using runtime APIs or at link time. Prefer the runtime API unless you specifically need link-time overrides.

### Runtime PAL Registration

To register a custom PAL implementation, take the following steps:

- Include [`executorch/runtime/platform/platform.h`](https://github.com/pytorch/executorch/blob/main/runtime/platform/platform.h) in one of your application's `.c` or `.cpp` files.
- Create an instance of the [PalImpl](https://github.com/pytorch/executorch/blob/7b39a0ce63bfb5124d4d29cfb6c8af85a3c580ba/runtime/platform/platform.h#L163) struct.
- Set one or more fields to custom PAL function implementations. Leave fields as null to use the default platform implementation. The PalImpl struct provides a [create](https://github.com/pytorch/executorch/blob/7b39a0ce63bfb5124d4d29cfb6c8af85a3c580ba/runtime/platform/platform.h#L168) method for this purpose.
- Call `executorch::platform::register_pal(pal_impl)` to register the implementation. This can be done from a global constructor, as in the example below.

Here is a complete example from [pybindings.cpp](https://github.com/pytorch/executorch/blob/7b39a0ce63bfb5124d4d29cfb6c8af85a3c580ba/extension/pybindings/pybindings.cpp#L1178), where logs are redirected to show up properly in a Python notebook environment.
```cpp
#include <iostream>

#include <executorch/runtime/platform/platform.h>

namespace {
void emit_log_message(
    et_timestamp_t timestamp,
    et_pal_log_level_t level,
    const char* filename,
    ET_UNUSED const char* function,
    size_t line,
    const char* message,
    ET_UNUSED size_t length) {
  std::cerr << "[" << filename << ":" << line << "] " << message << std::endl;
}

runtime::PalImpl build_pal() {
  return runtime::PalImpl::create(emit_log_message, __FILE__);
}

// Update the PAL to redirect logs.
ET_UNUSED bool registration_result = runtime::register_pal(build_pal());
} // namespace
```

### Weak Symbol Override

ExecuTorch also provides a link-time method to override the PAL using weak symbols. This method is primarily maintained for backwards compatibility.

To override one or more PAL methods, take the following steps:

- Include [`executorch/runtime/platform/platform.h`](https://github.com/pytorch/executorch/blob/main/runtime/platform/platform.h) in one of your application's `.c` or `.cpp` files.
- Define an implementation of one or more of the `et_pal_*()` functions.

The default PAL functions are weak symbols, so providing your own strong-symbol definition can override them at link time. To ensure that your definitions take precedence, you may need to ensure that the strong definitions precede the weak definitions in the link order. See [runtime/platform/platform.h](https://github.com/pytorch/executorch/blob/main/runtime/platform/platform.h) for the PAL function signatures and [runtime/platform/default/posix.cpp](https://github.com/pytorch/executorch/blob/main/runtime/platform/default/posix.cpp) for the reference POSIX implementation.

## Kernel Libraries

During export, a model is broken down into a list of operators, each providing some fundamental computation. Adding two tensors is an operator, as is convolution. Each operator requires a corresponding operator kernel to perform the computation on the target hardware. ExecuTorch backends are the preferred way to run operators, but not all operators are supported on all backends.

To handle this, ExecuTorch provides two implementations - the *portable* and *optimized* kernel libraries. The portable kernel library provides full support for all operators in a platform-independent manner. The optimized library carries additional system requirements, but is able to leverage multithreading and vectorized code to achieve greater performance. Operators can be drawn from both for a single build, allowing the optimized library to be used where available with the portable library as a fallback.

The choice of kernel library is transparent to the user when using mobile pre-built packages. However, it is important when building from source, especially on embedded systems. On mobile, the optimized operators are preferred where available. See [Overview of ExecuTorch's Kernel Libraries](kernel-library-overview.md) for more information.

## Selective Build

By default, ExecuTorch ships with all supported operator kernels, allowing it to run any supported model at any precision. This comes with a binary size of several megabytes, which may be undesirable for production use cases or resource-constrained systems. To minimize binary size, ExecuTorch provides selective build functionality, which includes only the operators needed to run specific models (see the sketch below).

Note that selective build only applies to the portable and optimized kernel libraries. Delegates do not participate in selective build and can be included or excluded by linking them individually. See [Kernel Library Selective Build](kernel-library-selective-build.md) for more information.
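For orientation only, the sketch below shows what a selective build invocation can look like. The `EXECUTORCH_SELECT_OPS_LIST` option name and the operator list are assumptions here; consult [Kernel Library Selective Build](kernel-library-selective-build.md) for the options supported by your ExecuTorch version.

```bash
# Hypothetical selective build invocation (option name assumed; see the
# selective build documentation for the authoritative flags). Builds the
# runtime with kernels for only the listed operators.
cmake -B cmake-out \
  -DEXECUTORCH_SELECT_OPS_LIST="aten::add.out,aten::mm.out" \
  ...
```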
---

# Profiling and Debugging

To facilitate model and runtime integration, ExecuTorch provides tools to profile model resource utilization, numerics, and more. This section describes the available troubleshooting tools and steps to resolve issues when integrating ExecuTorch.

## General Troubleshooting Steps

- To troubleshoot failures of runtime API calls, such as loading or running a model, ensure that ExecuTorch framework logging is enabled. See [Logging](using-executorch-runtime-integration.md#logging) for more information.
- As a preliminary step to troubleshoot slow run times, ensure that performance testing is done in a release build and that the model is delegated. See [Inference is Slow](using-executorch-faqs.md#inference-is-slow--performance-troubleshooting) for more information.
- Check [Frequently Asked Questions](using-executorch-faqs.md) for common issues and questions encountered during install, model export, and runtime integration.

## Developer Tools

The ExecuTorch developer tools, or devtools, are a collection of tooling for troubleshooting model performance, numerics, and resource utilization. See [Introduction to the ExecuTorch Developer Tools](devtools-overview.md) for an overview of the available developer tools and their usage.

## Next Steps

- [Frequently Asked Questions](using-executorch-faqs.md) for solutions to commonly encountered questions and issues.
- [Introduction to the ExecuTorch Developer Tools](runtime-profiling.md) for a high-level introduction to available developer tooling.
- [Using the ExecuTorch Developer Tools to Profile a Model](tutorials/devtools-integration-tutorial) for information on runtime performance profiling.
- [Inspector APIs](runtime-profiling.md) for reference material on trace inspector APIs.

---

# Visualize a Model using ModelExplorer

The [visualization_utils.py](../../devtools/visualization/visualization_utils.py) module contains functions for visualizing ExecuTorch models as computational graphs using the `ModelExplorer` utility.

## Installation

To install `ModelExplorer` and its dependencies, run:

```
./devtools/install_requirements.sh
```

## Visualize a model

The function `visualize()` takes an `ExportedProgram` and launches a `ModelExplorer` server instance. A browser tab opens, containing the visualization (a short usage sketch follows the figures below). The operations in the graph are grouped into collapsible nodes, based on which `nn.Module` instances they originate from (see **Figure 1**). These nodes can be expanded by clicking the button in their top left corner, as shown in **Figure 2**. The model can contain an entire hierarchy of collapsible nodes, reflecting its original _PyTorch_ implementation (see **Figure 3**).
Figure 1: Model visualization collapsed into a single node representing the original module.
Figure 2: Button to expand a node.
Figure 3: Hierarchy of expandable nodes.
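For orientation, a minimal usage sketch is shown below. The import path is an assumption based on the location of `visualization_utils.py`; adjust it to match your installation.

```python
import torch
from torch.export import export

# Assumed import path, mirroring devtools/visualization/visualization_utils.py.
from executorch.devtools.visualization.visualization_utils import visualize


class SmallModel(torch.nn.Module):
    def forward(self, x):
        return torch.relu(x) + 1


# Export the model to an ExportedProgram, then visualize it.
exported_program = export(SmallModel(), (torch.randn(1, 8),))
visualize(exported_program)  # Launches a ModelExplorer server and opens a browser tab.
```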
The **Model Explorer GUI** provides a button in the top left corner of the screen (see **Figure 4**), which expands all the nested expandable nodes. The result displays all the low-level operations, surrounded by rectangles that indicate their membership in specific `nn.Module` instances.
Figure 4: Expand all nodes.
This view is not always ideal: focusing on the origin of the final nodes can make it harder to follow the flow of data through the graph. For this purpose, another button in the top left corner flattens all the layers (expandable nodes), effectively hiding the original `nn.Module` instances and displaying the model as a plain computational graph (see **Figure 5**).
Figure 5: Flatten the model to a simple computational graph.
---

# Visualize a Model with Highlighted QDQ Clusters and Partitions

The [visualization_utils.py](../../devtools/visualization/visualization_utils.py) module contains the function `visualize_with_clusters()`, which takes an `ExportedProgram` and visualizes it using the `ModelExplorer` utility. It groups QDQ clusters and individual partitions together to improve readability. Example usage is available in [examples/nxp/aot_neutron_compile.py](../../examples/nxp/aot_neutron_compile.py). An example of the visualization is shown in **Figure 6**.
Figure 6: Example of the QDQ cluster and partition highlighting visualization.
## Usage

There are two main use cases for the visualization:

### 1. Launching the `ModelExplorer` and Visualizing the Model Immediately

Call:

```python
visualize_with_clusters(exported_program)
```

This starts a `ModelExplorer` server and opens a browser tab with the visualization. By default, each call starts a new server instance and opens a new browser tab. To reuse an existing server, set the `reuse_server` parameter to `True`. Starting the server is **blocking**, so the rest of your script will not run.

### 2. Storing a Serialized Graph and Visualizing Later (Non-blocking)

To save the visualization to a JSON file, call:

```python
visualize_with_clusters(exported_program, "my_model.json")
```

This only saves the visualization to the file; it does **not** start the `ModelExplorer` server. You can then open the file in the `ModelExplorer` GUI at any point. To launch the server, run:

```bash
model-explorer [model-file-json]
```

If `model-file-json` is provided, `ModelExplorer` opens the model visualization. Otherwise, the `ModelExplorer` GUI home page appears. In that case, click **Select from your computer**, choose the JSON file, and then click **View selected models** to display the graph.

---

## Styling the Graph

`visualize_with_clusters()` supports custom grouping of nodes into QDQ clusters and partitions. You can pass the following optional parameters:

- `get_node_partition_name`
- `get_node_qdq_cluster_name`

These are functions that take a node and return a string identifying the partition or cluster it belongs to. Nodes with the same partition/cluster string are grouped together and labeled accordingly in the visualization (a short illustrative sketch appears at the end of this page).

### Load a predefined style for QDQ cluster and partition highlighting

A color style for the QDQ cluster and partition highlighting is already provided in [devtools/visualization/model_explorer_styles/cluster_highlight_style.json](../../devtools/visualization/model_explorer_styles/cluster_highlight_style.json). To load it, follow these steps:

1. Click the **palette icon** in the top-right corner of the `ModelExplorer` interface.
2. Click **Import rules**.
3. Select the [cluster_highlight_style.json](../../devtools/visualization/model_explorer_styles/cluster_highlight_style.json) file to apply predefined styles that highlight each partition in a different color.
Figure 7: Add custom color styling to the graph.
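As a minimal sketch of the custom grouping parameters described in the Styling the Graph section above, the example below passes two illustrative callbacks. The node attributes inspected here (`meta`, `target`) and the grouping logic are assumptions for illustration only; any function that maps a node to a group name string will work.

```python
# Illustrative grouping callbacks for visualize_with_clusters().
# Each callback receives a graph node and returns a string naming its group;
# nodes that return the same string are grouped together in the visualization.

def partition_name(node) -> str:
    # Hypothetical: group nodes by a delegation tag if one is present,
    # otherwise collect them under a common "not_partitioned" label.
    return node.meta.get("delegation_tag", "not_partitioned")


def qdq_cluster_name(node) -> str:
    # Hypothetical: name the QDQ cluster after the node's target operator.
    return f"qdq_{node.target}"


visualize_with_clusters(
    exported_program,
    get_node_partition_name=partition_name,
    get_node_qdq_cluster_name=qdq_cluster_name,
)
```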