Arm backend: Review documentation for 1.2

Erik-Lundell · Erik-Lundell · commit 5ddbab2dcba7 · 2026-03-04T15:56:29.000+01:00
- Make sure API is up to date
- Make sure paths are up to date
- Rerun docgen
- Typos

Signed-off-by: Erik Lundell <erik.lundell@arm.com>
Change-Id: I3ad78fdfed3d60c76badc3ba794236372f3590da
diff --git a/backends/arm/README.md b/backends/arm/README.md
@@ -51,7 +51,7 @@ backends/arm/
 │   └── quantization_annotator.py  # Defines how operators are annotated for quantization
 │
 ├── runtime/                       # Backends for running inference on target devices
-│   ├── ArmEthosUBackend.cpp
+│   ├── EthosUBackend.cpp
 │   └── VGFBackend.cpp
 │
 ├── scripts/                       # Auxiliary build, dependency installation and utility scripts
diff --git a/backends/arm/ethosu/partitioner.py b/backends/arm/ethosu/partitioner.py
@@ -17,7 +17,7 @@ class EthosUPartitioner(TOSAPartitioner):
     """Partitions subgraphs supported by the Arm Ethos-U backend.
 
     Args:
-        compile_spec: List of CompileSpec objects for Ethos-U backend.
+        compile_spec: EthosUCompileSpec object for configuring the lowering.
         additional_checks: Optional sequence of additional operator support checks.
 
     """
diff --git a/backends/arm/scripts/docgen/ethos-u/backends-arm-ethos-u-overview.md.in b/backends/arm/scripts/docgen/ethos-u/backends-arm-ethos-u-overview.md.in
@@ -20,7 +20,6 @@ The target system must include an Ethos-U NPU.
 
 ```{tip}
 All requirements can be downloaded using `examples/arm/setup.sh --i-agree-to-the-contained-eula` and added to the path using
-set(CMAKE_INSTALL_PREFIX "${CMAKE_BINARY_DIR}")
 `source examples/arm/arm-scratch/setup_path.sh`. Note that this means accepting the End-User License Agreements (EULA:s) required for using the downloaded software.
 ```
 
diff --git a/backends/arm/scripts/docgen/ethos-u/backends-arm-ethos-u-quantization.md.in b/backends/arm/scripts/docgen/ethos-u/backends-arm-ethos-u-quantization.md.in
@@ -9,9 +9,10 @@ Currently, the symmetric `int8` config defined by `executorch.backends.arm.quant
 The Arm Ethos-U delegate supports the following quantization schemes:
 
 - 8-bit symmetric weights with 8-bit asymmetric activations (via the PT2E quantization flow).
-- Limited support for 16-bit quantization with 16-bit activations and 8-bit weights (a.k.a 16x8 quantization). This is under development.
-- Partial quantization is *not* supported on the Ethos-U backend. The entire model must be quantized.
+- Limited support for 16-bit quantization with 16-bit activations and 8-bit weights (a.k.a. 16x8 quantization).
+- Limited support for 8-bit quantization with 8-bit activations and 4-bit weights (a.k.a. 8x4 quantization).
+- Partial quantization is supported by the quantizer, but non-quantized operators won't be delegated to the Ethos-U backend.
 
 ### Quantization API
 
-$QUANTIZER
+$QUANTIZER
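To make the weight bit-widths in the list above concrete, here is a minimal plain-Python sketch of symmetric quantization (zero-point 0) at 4 and 8 bits. It is an illustration only and not the Arm quantizer's implementation: the helper name `symmetric_quantize`, the per-tensor scale, and the clamped range `[-qmax, qmax]` are all assumptions made for this example.

```python
def symmetric_quantize(values, num_bits):
    """Quantize floats to signed integers of `num_bits` with zero-point 0.

    Hypothetical helper for illustration; the real quantizer may choose
    scales and ranges differently.
    """
    qmax = 2 ** (num_bits - 1) - 1            # 7 for 4-bit, 127 for 8-bit
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the symmetric range.
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


weights = [-3.5, -1.0, 0.0, 1.5, 3.5]
q4, scale4 = symmetric_quantize(weights, 4)   # 4-bit weights, as in 8x4
q8, scale8 = symmetric_quantize(weights, 8)   # 8-bit weights, as in 16x8
print(q4, scale4)
print(q8, scale8)
```

With fewer bits the scale grows, so each integer step covers a wider range of real values and more rounding error is introduced.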
diff --git a/backends/arm/scripts/docgen/vgf/backends-arm-vgf-overview.md.in b/backends/arm/scripts/docgen/vgf/backends-arm-vgf-overview.md.in
@@ -33,6 +33,7 @@ And for building and running your application using the generic executor_runner:
 The [VGF Minimal Example](https://github.com/pytorch/executorch/blob/main/examples/arm/vgf_minimal_example.ipynb) demonstrates how to lower a module using the VGF backend.
 
 The main configuration point for the lowering is the `VgfCompileSpec` consumed by the partitioner and quantizer.
+To extract the VGF file for integration into applications without the ExecuTorch runtime, use `VgfCompileSpec.dump_intermediate_artifacts_to()`.  
 The full user-facing API is documented below.
 
 $COMPILE_SPEC
diff --git a/backends/arm/scripts/docgen/vgf/vgf-getting-started-tutorial.md.in b/backends/arm/scripts/docgen/vgf/vgf-getting-started-tutorial.md.in
@@ -73,7 +73,7 @@ Make sure the executable is located where you expect, in the `examples/arm` tree
 
 The ExecuTorch Ahead-of-Time (AOT) pipeline takes a PyTorch Model (a `torch.nn.Module`) and produces a `.pte` binary file, which is then typically consumed by the ExecuTorch Runtime. This [document](https://github.com/pytorch/executorch/blob/main/docs/source/getting-started-architecture.md) goes in much more depth about the ExecuTorch software stack for both AoT as well as Runtime.
 
-The example below shows how to quantize a model consisting of a single addition, and export it it through the AOT flow using the VGF backend. For more details, se `examples/arm/vgf_minimal_example.ipynb`.
+The example below shows how to quantize a model consisting of a single addition, and export it through the AOT flow using the VGF backend. For more details, see `examples/arm/vgf_minimal_example.ipynb`.
 
 $MINIMAL_EXAMPLE
 
@@ -106,7 +106,7 @@ cmake \
   -DPYTHON_EXECUTABLE=python \
   -Bcmake-out .
 
-cmake --build cmake-out --target executor_runner`
+cmake --build cmake-out --target executor_runner
 ```
 
 
diff --git a/backends/arm/scripts/pre-push b/backends/arm/scripts/pre-push
@@ -34,7 +34,7 @@ VERBS="Add|Fix|Update|Refactor|Improve|Remove|Change|Implement|Create|Modify|"\
 "Profile|Recalculate|Reconstruct|Redefine|Redesign|Reevaluate|Relocate|Remap|"\
 "Render|Reposition|Request|Revert|Sanitize|Specify|Strengthen|Stub|Substitute|"\
 "Tag|Tweak|Unify|Unlock|Unset|Use|Validate|Verify|Rename|Relax|Format|Don't|"\
-"Consolidate"
+"Consolidate|Review"
 
 # Remote branch
 REMOTE=$(git rev-parse --abbrev-ref --symbolic-full-name @{u} 2>/dev/null)
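The hunk above adds `Review` to the hook's verb list, which is what lets this commit's own subject line pass. The check can be sketched roughly as follows in Python. This is a sketch under assumptions, not the hook's actual logic: the `Arm backend: <Verb> ...` subject format is inferred from this commit's title, and only a subset of the verbs is listed.

```python
import re

# Subset of the VERBS alternation from the hook, including the new "Review".
ALLOWED_VERBS = [
    "Add", "Fix", "Update", "Use", "Validate", "Verify",
    "Rename", "Relax", "Format", "Don't", "Consolidate", "Review",
]

# Assumed subject format: "Arm backend: <Verb> <rest of subject>".
SUBJECT_PATTERN = re.compile(
    r"^Arm backend: (?:" + "|".join(re.escape(v) for v in ALLOWED_VERBS) + r")\b"
)


def subject_is_valid(subject: str) -> bool:
    """Return True if the subject starts with the prefix and an allowed verb."""
    return SUBJECT_PATTERN.match(subject) is not None


print(subject_is_valid("Arm backend: Review documentation for 1.2"))
print(subject_is_valid("Arm backend: Reviewed documentation"))
```

The trailing `\b` makes the verb match as a whole word, so an inflected form such as `Reviewed` is rejected.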
diff --git a/backends/cortex_m/README.md b/backends/cortex_m/README.md
@@ -7,21 +7,21 @@
 
 The Cortex-M backend is implemented as an operator dialect/library based on [CMSIS-NN](https://github.com/ARM-software/CMSIS-NN), together with the `CortexMQuantizer` which targets supported ops, and the `CortexMPassManager` which modifies the exported program to use Cortex-M operators where possible. It is intended for use with **channels-last input**  since this is what the accelerated kernels are using.
 
-For a detailed example of the full lowering flow, see `examples/arm/cortex_m_minimal_example.ipynb`.
+For a detailed example of the full lowering flow, see `examples/arm/cortex_m_mv2_example.ipynb`.
 
 ## Testing
-Tests are available in `backends/cortex-m/test/` using the `backends/test` harness. The python implementations of the operators are tested in tests named `test_dialect_*`, while actual accelerated implementations are tested on simulated hardware in the tests named `test_implementation_*`.
+Tests are available in `backends/cortex_m/test/` using the `backends/test` harness. The python implementations of the operators are tested in tests named `test_dialect_*`, while actual accelerated implementations are tested on simulated hardware in the tests named `test_implementation_*`.
 
 To run tests:
 ```
 examples/arm/setup.sh --i-agree-to-the-contained-eula                     # Download needed toolchains and simulators
 examples/arm/arm-scratch/setup_path.sh                                    # Add dependencies to path
-backends/cortex-m/test/build_test_runner.sh                               # Build executor-runner with cortex-m oplib + kernels registred
-pytest --config-file=backends/arm/test/pytest.ini backends/cortex-m/test  # Run tests with correct configuration file
+backends/cortex_m/test/build_test_runner.sh                               # Build executor-runner with cortex-m oplib + kernels registered
+pytest --config-file=backends/arm/test/pytest.ini backends/cortex_m/test  # Run tests with correct configuration file
 ```
 
 ## Supported operators
-Refer to `backends/cortex-m/test/ops` for currently supported accelerated ops/dtypes. Additionally, the quantizer targets pure "data-movement ops" such as data copies, slicing and concatinations to use quantized dtypes using the portable-kernels operator lbrary.
+Refer to `backends/cortex_m/test/ops` for currently supported accelerated ops/dtypes. Additionally, the quantizer targets pure "data-movement ops" such as data copies, slicing and concatenations to use quantized dtypes using the portable-kernels operator library.
 In general however, operators not supported by Cortex-M are kept in `fp32` using non-accelerated portable-kernels. It is recommended to analyze the graph after lowering to understand how much of the graph has been accelerated.
 
 ## Notices
diff --git a/docs/source/backends/arm-ethos-u/arm-ethos-u-overview.md b/docs/source/backends/arm-ethos-u/arm-ethos-u-overview.md
@@ -20,7 +20,6 @@ The target system must include an Ethos-U NPU.
 
 ```{tip}
 All requirements can be downloaded using `examples/arm/setup.sh --i-agree-to-the-contained-eula` and added to the path using
-set(CMAKE_INSTALL_PREFIX "${CMAKE_BINARY_DIR}")
 `source examples/arm/arm-scratch/setup_path.sh`. Note that this means accepting the End-User License Agreements (EULA:s) required for using the downloaded software.
 ```
 
@@ -68,15 +67,17 @@ Args:
 ```python
 def EthosUCompileSpec.dump_intermediate_artifacts_to(self, output_path: str | None):
 ```
-Sets a path for dumping intermediate results during such as tosa and pte.
+Sets a path for dumping intermediate results such as tosa and
+pte.
 
 Args:
 - **output_path**: Path to dump intermediate results to.
 
 ```python
 def EthosUCompileSpec.get_intermediate_path(self) -> str | None:
 ```
-Gets the path used for dumping intermediate results such as tosa and pte.
+Gets the path used for dumping intermediate results such as tosa and
+pte.
 
 Returns:
     Path where intermediate results are saved.
@@ -94,7 +95,9 @@ Gets whether the output order workaround is being applied.
 ```python
 def EthosUCompileSpec.get_pass_pipeline_config(self) -> executorch.backends.arm.common.pipeline_config.ArmPassPipelineConfig:
 ```
-Returns configuration that controls how the Arm pass pipeline should behave.
+Returns configuration that controls how the Arm pass pipeline should
+behave.
+
 Subclasses may override to tweak defaults for specific targets.
 
 ```python
@@ -108,8 +111,8 @@ Args:
 ```python
 def EthosUCompileSpec.set_pass_pipeline_config(self, config: executorch.backends.arm.common.pipeline_config.ArmPassPipelineConfig) -> None:
 ```
-Sets the configuration that controls how the Arm pass pipeline should behave.
-Subclasses may override to tweak defaults for specific targets.
+Sets the configuration that controls how the Arm pass pipeline should
+behave. Subclasses may override to tweak defaults for specific targets.
 
 Args:
 - **config**: The custom ArmPassPipelineConfig to set.
diff --git a/docs/source/backends/arm-ethos-u/arm-ethos-u-partitioner.md b/docs/source/backends/arm-ethos-u/arm-ethos-u-partitioner.md
@@ -8,7 +8,7 @@ class EthosUPartitioner(compile_spec: executorch.backends.arm.ethosu.compile_spe
 Partitions subgraphs supported by the Arm Ethos-U backend.
 
 Args:
-- **compile_spec**: List of CompileSpec objects for Ethos-U backend.
+- **compile_spec**: EthosUCompileSpec object for configuring the lowering.
 - **additional_checks**: Optional sequence of additional operator support checks.
 
 ```python
diff --git a/docs/source/backends/arm-ethos-u/arm-ethos-u-quantization.md b/docs/source/backends/arm-ethos-u/arm-ethos-u-quantization.md
@@ -9,8 +9,9 @@ Currently, the symmetric `int8` config defined by `executorch.backends.arm.quant
 The Arm Ethos-U delegate supports the following quantization schemes:
 
 - 8-bit symmetric weights with 8-bit asymmetric activations (via the PT2E quantization flow).
-- Limited support for 16-bit quantization with 16-bit activations and 8-bit weights (a.k.a 16x8 quantization). This is under development.
-- Partial quantization is *not* supported on the Ethos-U backend. The entire model must be quantized.
+- Limited support for 16-bit quantization with 16-bit activations and 8-bit weights (a.k.a. 16x8 quantization).
+- Limited support for 8-bit quantization with 8-bit activations and 4-bit weights (a.k.a. 8x4 quantization).
+- Partial quantization is supported by the quantizer, but non-quantized operators won't be delegated to the Ethos-U backend.
 
 ### Quantization API
 
@@ -26,7 +27,12 @@ Args:
 ```python
 def EthosUQuantizer.quantize_with_submodules(self, model: 'GraphModule', calibration_samples: 'list[tuple]', is_qat: 'bool' = False):
 ```
-Quantizes a GraphModule in a way such that conditional submodules are handled properly.
+Quantizes a GraphModule in a way such that conditional submodules are
+handled properly.
+
+Note: torchao's prepare_pt2e and convert_pt2e natively handle
+while_loop body_fn submodules, so we only manually process cond
+branches and while_loop cond_fn here.
 
 Args:
 - **model (GraphModule)**: The model to quantize.
diff --git a/docs/source/backends/arm-ethos-u/arm-ethos-u-troubleshooting.md b/docs/source/backends/arm-ethos-u/arm-ethos-u-troubleshooting.md
@@ -26,7 +26,7 @@ You can see how  this coupling between the memory mode and runtime application i
 
 The arm_executor_runner supports [bundled-io](https://docs.pytorch.org/executorch/0.4/bundled-io.html) and [ETdump](https://docs.pytorch.org/executorch/stable/etdump.html) debugging tools.
 
-To enable bundled-io, set `EXECUTORCH_BUILD_DEVTOOLS` when building Executorch and `DET_BUNDLE_IO` when building the executor_runner. To enable ETdump, set `EXECUTORCH_BUILD_ARM_ETDUMP` when building Executorch and `DEXECUTORCH_ENABLE_EVENT_TRACER` when building the executor_runner.
+To enable bundled-io, set `-DEXECUTORCH_BUILD_DEVTOOLS=ON` when building ExecuTorch and `-DET_BUNDLE_IO=ON` when building the executor_runner. To enable ETdump, set `-DEXECUTORCH_BUILD_ARM_ETDUMP=ON` when building ExecuTorch and `-DEXECUTORCH_ENABLE_EVENT_TRACER=ON` when building the executor_runner.
 
 ## Issues with memory formats
 
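The troubleshooting hunk above pairs each debugging tool with one option for the ExecuTorch build and one for the executor_runner build. As a hedged sketch, the pairing can be written down like this. The option strings are taken from the changed line, while the helper `cmake_options` and its dictionary keys are hypothetical names used only for illustration.

```python
def cmake_options(bundled_io: bool, etdump: bool) -> dict[str, list[str]]:
    """Return the -D options per build step for the requested debugging tools."""
    executorch: list[str] = []
    runner: list[str] = []
    if bundled_io:
        executorch.append("-DEXECUTORCH_BUILD_DEVTOOLS=ON")
        runner.append("-DET_BUNDLE_IO=ON")
    if etdump:
        executorch.append("-DEXECUTORCH_BUILD_ARM_ETDUMP=ON")
        runner.append("-DEXECUTORCH_ENABLE_EVENT_TRACER=ON")
    return {"executorch": executorch, "executor_runner": runner}


options = cmake_options(bundled_io=True, etdump=True)
for step, flags in options.items():
    # Each list is meant to be appended to that step's cmake configure command.
    print(step, " ".join(flags))
```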
diff --git a/docs/source/backends/arm-ethos-u/tutorials/ethos-u-getting-started.md b/docs/source/backends/arm-ethos-u/tutorials/ethos-u-getting-started.md
@@ -202,14 +202,6 @@ To learn more, check out these learning paths:
 https://learn.arm.com/learning-paths/embedded-and-microcontrollers/rpi-llama3/
 https://learn.arm.com/learning-paths/embedded-and-microcontrollers/visualizing-ethos-u-performance/
 
-### Project Templates
-
-These project templates provide alternative starting points for ExecuTorch development:
-
-- [CMSIS-Executorch Project Template](https://github.com/Arm-Examples/cmsis-executorch) — Docker-based build environment with Keil Studio/VS Code integration for Arm Ethos-U applications, featuring automated CI/CD and AVH-SSE-300 simulation support.
-
-- [ExecuTorch on Zephyr RTOS with CMSIS](https://github.com/Arm-Examples/cmsis-zephyr-executorch) — Complete example of running ExecuTorch inference on Arm Cortex-M with Ethos-U NPU acceleration using Zephyr RTOS and CMSIS Toolbox.
-
 ## FAQs
 
 If you encountered any bugs or issues following this tutorial please file a bug/issue here on [Github](https://github.com/pytorch/executorch/issues/new).
diff --git a/docs/source/backends/arm-vgf/arm-vgf-overview.md b/docs/source/backends/arm-vgf/arm-vgf-overview.md
@@ -33,6 +33,7 @@ And for building and running your application using the generic executor_runner:
 The [VGF Minimal Example](https://github.com/pytorch/executorch/blob/main/examples/arm/vgf_minimal_example.ipynb) demonstrates how to lower a module using the VGF backend.
 
 The main configuration point for the lowering is the `VgfCompileSpec` consumed by the partitioner and quantizer.
+To extract the VGF file for integration into applications without the ExecuTorch runtime, use `VgfCompileSpec.dump_intermediate_artifacts_to()`.  
 The full user-facing API is documented below.
 
 ```python
@@ -43,7 +44,7 @@ Normalise inputs and populate the underlying Arm compile spec.
 Args:
 - **tosa_spec (TosaSpecification | str | None)**: TOSA specification to
         target. Strings are parsed via ``TosaSpecification.create_from_string``.
-        Defaults to ``"TOSA-1.0+FP+INT"``.
+        Defaults to ``"TOSA-1.0+FP+INT+int4+int16"``.
 - **compiler_flags (list[str] | None)**: Optional converter-backend flags.
 
 ```python
@@ -57,15 +58,17 @@ Args:
 ```python
 def VgfCompileSpec.dump_intermediate_artifacts_to(self, output_path: str | None):
 ```
-Sets a path for dumping intermediate results during such as tosa and pte.
+Sets a path for dumping intermediate results such as tosa and
+pte.
 
 Args:
 - **output_path**: Path to dump intermediate results to.
 
 ```python
 def VgfCompileSpec.get_intermediate_path(self) -> str | None:
 ```
-Gets the path used for dumping intermediate results such as tosa and pte.
+Gets the path used for dumping intermediate results such as tosa and
+pte.
 
 Returns:
     Path where intermediate results are saved.
@@ -83,7 +86,9 @@ Gets whether the output order workaround is being applied.
 ```python
 def VgfCompileSpec.get_pass_pipeline_config(self) -> executorch.backends.arm.common.pipeline_config.ArmPassPipelineConfig:
 ```
-Returns configuration that controls how the Arm pass pipeline should behave.
+Returns configuration that controls how the Arm pass pipeline should
+behave.
+
 Subclasses may override to tweak defaults for specific targets.
 
 ```python
@@ -97,8 +102,8 @@ Args:
 ```python
 def VgfCompileSpec.set_pass_pipeline_config(self, config: executorch.backends.arm.common.pipeline_config.ArmPassPipelineConfig) -> None:
 ```
-Sets the configuration that controls how the Arm pass pipeline should behave.
-Subclasses may override to tweak defaults for specific targets.
+Sets the configuration that controls how the Arm pass pipeline should
+behave. Subclasses may override to tweak defaults for specific targets.
 
 Args:
 - **config**: The custom ArmPassPipelineConfig to set.
diff --git a/docs/source/backends/arm-vgf/arm-vgf-quantization.md b/docs/source/backends/arm-vgf/arm-vgf-quantization.md
@@ -46,7 +46,12 @@ Args:
 ```python
 def VgfQuantizer.quantize_with_submodules(self, model: 'GraphModule', calibration_samples: 'list[tuple]', is_qat: 'bool' = False):
 ```
-Quantizes a GraphModule in a way such that conditional submodules are handled properly.
+Quantizes a GraphModule in a way such that conditional submodules are
+handled properly.
+
+Note: torchao's prepare_pt2e and convert_pt2e natively handle
+while_loop body_fn submodules, so we only manually process cond
+branches and while_loop cond_fn here.
 
 Args:
 - **model (GraphModule)**: The model to quantize.
diff --git a/docs/source/backends/arm-vgf/tutorials/vgf-getting-started.md b/docs/source/backends/arm-vgf/tutorials/vgf-getting-started.md
@@ -73,7 +73,7 @@ Make sure the executable is located where you expect, in the `examples/arm` tree
 
 The ExecuTorch Ahead-of-Time (AOT) pipeline takes a PyTorch Model (a `torch.nn.Module`) and produces a `.pte` binary file, which is then typically consumed by the ExecuTorch Runtime. This [document](https://github.com/pytorch/executorch/blob/main/docs/source/getting-started-architecture.md) goes in much more depth about the ExecuTorch software stack for both AoT as well as Runtime.
 
-The example below shows how to quantize a model consisting of a single addition, and export it it through the AOT flow using the VGF backend. For more details, se `examples/arm/vgf_minimal_example.ipynb`.
+The example below shows how to quantize a model consisting of a single addition, and export it through the AOT flow using the VGF backend. For more details, see `examples/arm/vgf_minimal_example.ipynb`.
 
 ```python
 import torch
@@ -191,7 +191,7 @@ cmake \
   -DPYTHON_EXECUTABLE=python \
   -Bcmake-out .
 
-cmake --build cmake-out --target executor_runner`
+cmake --build cmake-out --target executor_runner
 ```
 
 
diff --git a/examples/arm/README.md b/examples/arm/README.md
@@ -31,11 +31,11 @@ aware training using the ArmQuantizer.
 
 There is an easy to use example flow to compile your PyTorch model to a PTE file for the Arm backend called `aot_arm_compiler.py`
 that you can use to generate PTE files, it can generate PTE files for the supported targets `-t` or even non delegated (Cortex-M)
-using different memory modes and can both use a python file as input or just use the models from examples/models with `--model_input`.
+using different memory modes and can both use a python file as input or just use the models from examples/models with `--model_name`.
 It also supports generating Devtools artifacts like BundleIO BPTE files, and ETRecords. Run it with `--help` to check its capabilities.
 
 You point out the model to convert with `--model_name=<MODELNAME/FILE>` It supports running a model from examples/models or models
-from a python file if you just specify `ModelUnderTest` and `ModelInput` in it.
+from a python file if you just specify `ModelUnderTest` and `ModelInputs` in it.
 
 ```
 $ python3 -m examples.arm.aot_arm_compiler --help
@@ -67,7 +67,7 @@ The `aot_arm_compiler.py` is called from the scripts below so you don't need to,
 ## ExecuTorch on Arm Ethos-U55/U65 and U85
 
 This example code will help you get going with the Corstone&trade;-300/320 platforms and
-run on the FVP and can be used a a starting guide in your porting to your board/HW
+run on the FVP and can be used as a starting guide in your porting to your board/HW
 
 We will start from a PyTorch model in python, export it, convert it to a `.pte`
 file - A binary format adopted by ExecuTorch. Then we will take the `.pte`
diff --git a/examples/arm/cortex_m_mv2_example.ipynb b/examples/arm/cortex_m_mv2_example.ipynb
diff --git a/examples/arm/ethos-u-porting-guide.md b/examples/arm/ethos-u-porting-guide.md
diff --git a/examples/arm/vgf_minimal_example.ipynb b/examples/arm/vgf_minimal_example.ipynb