
Commit 841181e

psiddh authored

Update Pico2 docs with CMSIS-NN INT8 support and latency instrumentation (#18898)

Add documentation for the new --cmsis build flag, INT8 quantized model export via export_mlp_mnist_cmsis.py, and updated serial output showing per-inference latency timing and memory usage diagnostics.

Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

1 parent dbd5118 · commit 841181e

2 files changed · 124 additions & 8 deletions

**docs/source/pico2_tutorial.md** (69 additions & 3 deletions)
@@ -9,6 +9,7 @@ A 28×28 MNIST digit classifier running on memory constrained, low power microco
- Input: ASCII art digits (0, 1, 4, 7)
- Output: Real-time predictions via USB serial
- Memory: <400KB total footprint
- Two variants: FP32 (portable ops) and INT8 (CMSIS-NN accelerated)

## Prerequisites

@@ -24,29 +25,63 @@ which arm-none-eabi-gcc # --> arm/arm-scratch/arm-gnu-toolchain-13.3.rel1-x86_64

## Step 1: Generate a .pte from the Example Model

### FP32 model (default)

- Use the [provided example model](https://github.com/pytorch/executorch/blob/main/examples/raspberry_pi/pico2/export_mlp_mnist.py)

```bash
cd examples/raspberry_pi/pico2
python export_mlp_mnist.py  # Creates balanced_tiny_mlp_mnist.pte
```

- **Note:** This is a hand-crafted MNIST classifier (proof of concept), not a production-trained model. The tiny MLP recognizes digits 0, 1, 4, and 7 using manually designed feature detectors.

### INT8 quantized model (CMSIS-NN accelerated)

- Use the [CMSIS-NN export script](https://github.com/pytorch/executorch/blob/main/examples/raspberry_pi/pico2/export_mlp_mnist_cmsis.py)

```bash
cd examples/raspberry_pi/pico2
python export_mlp_mnist_cmsis.py  # Creates balanced_tiny_mlp_mnist_cmsis.pte
```

This uses the `CortexMQuantizer` to produce INT8 quantized ops that map to CMSIS-NN kernels on the Cortex-M33. The model I/O stays float: quantize and dequantize nodes are inserted inside the graph.
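For orientation, the following is a minimal sketch of what a PT2E-style INT8 export can look like. The import paths, the stand-in model, and the calibration step are assumptions for illustration; defer to `export_mlp_mnist_cmsis.py` for the actual flow.

```python
# Minimal sketch of a PT2E INT8 export flow. Import paths and the
# stand-in model are assumptions; see export_mlp_mnist_cmsis.py for
# the real thing.
import torch
from torch.export import export
from torchao.quantization.pt2e.quantize_pt2e import prepare_pt2e, convert_pt2e
from executorch.backends.cortex_m.quantizer import CortexMQuantizer  # assumed path
from executorch.exir import to_edge

model = torch.nn.Sequential(  # stand-in for the tutorial's tiny MLP
    torch.nn.Linear(784, 32), torch.nn.ReLU(), torch.nn.Linear(32, 10)
).eval()
example = (torch.randn(1, 784),)  # float input: model I/O stays float

graph = export(model, example).module()
prepared = prepare_pt2e(graph, CortexMQuantizer())
prepared(*example)                  # one calibration pass to record ranges
converted = convert_pt2e(prepared)  # inserts quantize/dequantize nodes

prog = to_edge(export(converted, example)).to_executorch()
with open("balanced_tiny_mlp_mnist_cmsis.pte", "wb") as f:
    f.write(prog.buffer)
```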
## Step 2: Build Firmware for Pico2

### FP32 build

```bash
# Generate model (Creates balanced_tiny_mlp_mnist.pte)
cd ./examples/raspberry_pi/pico2
python export_mlp_mnist.py
cd -

# Build Pico2 firmware (one command!)
./examples/raspberry_pi/pico2/build_firmware_pico.sh --model=balanced_tiny_mlp_mnist.pte
```
### INT8 CMSIS-NN build

```bash
# Generate INT8 quantized model
cd ./examples/raspberry_pi/pico2
python export_mlp_mnist_cmsis.py
cd -

# Build with CMSIS-NN backend
./examples/raspberry_pi/pico2/build_firmware_pico.sh --cmsis --model=balanced_tiny_mlp_mnist_cmsis.pte
```

Output: **executorch_pico.uf2** firmware file (examples/raspberry_pi/pico2/build/)

**Script options:**

| Flag | Description |
|------|-------------|
| `--model=FILE` | Specify model file to embed (relative to pico2/) |
| `--cmsis` | Build with CMSIS-NN INT8 kernels for Cortex-M33 acceleration |
| `--clean` | Clean build directories and exit; run separately before building if needed |
**Note:** The [build_firmware_pico.sh](https://github.com/pytorch/executorch/blob/main/examples/raspberry_pi/pico2/build_firmware_pico.sh) script converts the given model .pte into a hex array and generates C code for it via this helper [script](https://github.com/pytorch/executorch/blob/main/examples/raspberry_pi/pico2/pte_to_array.py). That C code is then compiled into the final .uf2 binary, which is flashed to the Pico2.
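As a rough illustration of that conversion, a .pte can be embedded as a C byte array along these lines (a sketch only; the real `pte_to_array.py` may differ in naming and formatting):

```python
# Sketch: embed a .pte file as a C array. The actual pte_to_array.py
# helper may use different symbol names and formatting.
from pathlib import Path

data = Path("balanced_tiny_mlp_mnist.pte").read_bytes()
# Render the bytes as 0x.. literals, 12 per line.
rows = [", ".join(f"0x{b:02x}" for b in data[i:i + 12]) for i in range(0, len(data), 12)]
Path("model_pte.c").write_text(
    "#include <stddef.h>\n\n"
    "const unsigned char model_pte[] = {\n    "
    + ",\n    ".join(rows)
    + "\n};\n"
    + f"const size_t model_pte_len = {len(data)};\n"
)
```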
## Step 3: Flash to Pico2
@@ -72,6 +107,10 @@ screen /dev/tty.usbmodem1101 115200

Something like:

```
📊 Memory usage after method load:
   Method allocator: 45632 / 204800 bytes used
   Activation pool: 204800 bytes allocated

=== Digit 7 ===
############################
############################
```
@@ -104,6 +143,7 @@ Something like:

```
Input stats: 159 white pixels out of 784 total
Running neural network inference...
⏱️ Inference time: 245 us
✅ Neural network results:
   Digit 0: 370.000
   Digit 1: 0.000
```
@@ -116,7 +156,16 @@ Running neural network inference...

```
   Digit 8: -3.000
   Digit 9: -3.000

🎯 PREDICTED: 7 (Expected: 7) ✅ CORRECT!

==================================================

📊 Inference latency summary:
   Digit 0: 312 us
   Digit 1: 198 us
   Digit 4: 267 us
   Digit 7: 245 us
   Average: 255 us
```
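For context on the `#` art and the white-pixel count shown above, here is a hypothetical host-side mirror of the firmware's rendering (the firmware's actual implementation lives in its C sources):

```python
# Hypothetical host-side mirror of the firmware's ASCII rendering:
# a 28x28 image prints '#' for white pixels and '.' otherwise.
def show_digit(pixels):
    """pixels: 784 floats in [0, 1], row-major 28x28."""
    white = sum(1 for p in pixels if p > 0.5)
    for row in range(28):
        print("".join("#" if pixels[row * 28 + col] > 0.5 else "."
                      for col in range(28)))
    print(f"Input stats: {white} white pixels out of 784 total")
```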
## Memory Optimization Tips
@@ -184,12 +233,29 @@ arm-none-eabi-objdump -t examples/raspberry_pi/pico2/build/executorch_pico.elf |
arm-none-eabi-readelf -l examples/raspberry_pi/pico2/build/executorch_pico.elf
```

## CMSIS-NN INT8 Acceleration

The Pico2 uses an RP2350 SoC with a Cortex-M33 core. The CMSIS-NN library provides optimized INT8 kernels that use the Cortex-M33's DSP instructions for faster inference than the FP32 portable ops.

### How it works

1. `export_mlp_mnist_cmsis.py` uses `CortexMQuantizer` to quantize the model to INT8
2. The model I/O remains float; quantize/dequantize nodes are inserted inside the graph
3. The `--cmsis` flag builds ExecuTorch with the Cortex-M backend and links the CMSIS-NN kernels
4. At runtime, quantized linear ops dispatch to CMSIS-NN instead of the portable kernels (a host-side sanity check follows this list)
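Before flashing, you can sanity-check the exported program on the host. A minimal sketch, assuming ExecuTorch's Python runtime bindings are installed (the exact API surface can vary between releases, and the input shape is assumed):

```python
# Sketch: run the exported .pte on the host with float I/O, confirming
# the quantized graph still takes and returns float tensors.
# Assumes executorch's Python runtime bindings; API may vary by release.
import torch
from executorch.runtime import Runtime

runtime = Runtime.get()
program = runtime.load_program("balanced_tiny_mlp_mnist_cmsis.pte")
method = program.load_method("forward")
outputs = method.execute([torch.randn(1, 784)])  # float in, float out
print("Predicted digit:", int(outputs[0].argmax()))
```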
### When to use CMSIS-NN

- Lower latency on supported ops (linear, conv2d)
- Smaller model size (INT8 weights vs FP32; see the quick size check below)
- Trade-off: slight accuracy loss from quantization
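Once both models are exported, a quick comparison makes the size point concrete (paths assume you are in examples/raspberry_pi/pico2 with both .pte files present):

```python
# Compare on-disk sizes of the FP32 and INT8 exports.
import os

fp32 = os.path.getsize("balanced_tiny_mlp_mnist.pte")
int8 = os.path.getsize("balanced_tiny_mlp_mnist_cmsis.pte")
print(f"FP32: {fp32} bytes, INT8: {int8} bytes -> {fp32 / int8:.2f}x ratio")
```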
## Next Steps

### Scale up your deployment

- Use a real production-trained model
- Optimize further → INT8 quantization with CMSIS-NN, pruning

### Happy Inference!
**examples/raspberry_pi/pico2/README.md** (55 additions & 5 deletions)
@@ -82,18 +82,39 @@ This involves two steps:

### Generate your model:

**FP32 model (default):**

```bash
cd examples/raspberry_pi/pico2
python export_mlp_mnist.py  # Creates balanced_tiny_mlp_mnist.pte
```

**INT8 quantized model (CMSIS-NN accelerated):**

```bash
cd examples/raspberry_pi/pico2
python export_mlp_mnist_cmsis.py  # Creates balanced_tiny_mlp_mnist_cmsis.pte
```

### Build firmware:

**FP32 build:**

```bash
# In the dir examples/raspberry_pi/pico2
./build_firmware_pico.sh --model=balanced_tiny_mlp_mnist.pte
```

**INT8 CMSIS-NN build:**

```bash
# In the dir examples/raspberry_pi/pico2
./build_firmware_pico.sh --cmsis --model=balanced_tiny_mlp_mnist_cmsis.pte
```

**Script options:**

| Flag | Description |
|------|-------------|
| `--model=FILE` | Specify model file to embed (relative to pico2/) |
| `--cmsis` | Build with CMSIS-NN INT8 kernels for Cortex-M33 acceleration |
| `--clean` | Clean build directories and exit (run separately before building) |
### Flash Firmware

Hold the BOOTSEL button on the Pico2 and connect it to your computer. It mounts as `RPI-RP2`. Copy `executorch_pico.uf2` to this drive.
@@ -105,10 +126,14 @@ The Pico2 LED blinks 10 times at 500ms intervals for successful execution. Via s

```bash
...
...
🎯 PREDICTED: 4 (Expected: 4) ✅ CORRECT!

==================================================

📊 Memory usage after method load:
   Method allocator: 45632 / 204800 bytes used
   Activation pool: 204800 bytes allocated

=== Digit 7 ===
############################
############################
```
@@ -141,6 +166,7 @@ PREDICTED: 4 (Expected: 4) ✅ CORRECT!

```bash
Input stats: 159 white pixels out of 784 total
Running neural network inference...
⏱️ Inference time: 245 us
✅ Neural network results:
   Digit 0: 370.000
   Digit 1: 0.000
```
@@ -153,11 +179,18 @@ Running neural network inference...

```bash
   Digit 8: -3.000
   Digit 9: -3.000

🎯 PREDICTED: 7 (Expected: 7) ✅ CORRECT!

==================================================

📊 Inference latency summary:
   Digit 0: 312 us
   Digit 1: 198 us
   Digit 4: 267 us
   Digit 7: 245 us
   Average: 255 us

🎉 All tests complete! ExecuTorch inference of neural network works on Pico2!
```
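To capture those latency lines programmatically instead of reading them in a terminal, a small host-side reader can help. A sketch, assuming `pyserial` is installed and your device path matches:

```python
# Sketch: collect per-digit latency lines from the Pico's serial output.
# Assumes pyserial (pip install pyserial) and your actual device path.
import re
import serial

latencies = {}
with serial.Serial("/dev/tty.usbmodem1101", 115200, timeout=10) as port:
    while True:
        raw = port.readline()
        if not raw:  # timeout with no data
            break
        line = raw.decode("utf-8", errors="replace").strip()
        m = re.match(r"Digit (\d+): (\d+) us", line)
        if m:
            latencies[int(m.group(1))] = int(m.group(2))
        if line.startswith("Average:"):
            break

print(latencies)  # e.g. {0: 312, 1: 198, 4: 267, 7: 245}
```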
### Debugging via Serial Terminal
@@ -170,4 +203,21 @@ screen /dev/tty.usbmodem1101 115200

Replace `/dev/tty.usbmodem1101` with your device path. If the LED blinks 10 times at 100ms intervals, check the logs for errors; if it blinks 10 times at 500ms intervals, the run was successful!

## CMSIS-NN INT8 Acceleration

The Pico2 uses an RP2350 SoC with a Cortex-M33 core. The CMSIS-NN library provides optimized INT8 kernels that use the Cortex-M33's DSP instructions for faster inference than the FP32 portable ops.

### How it works

1. `export_mlp_mnist_cmsis.py` uses `CortexMQuantizer` to quantize the model to INT8
2. The model I/O remains float; quantize/dequantize nodes are inserted inside the graph
3. The `--cmsis` flag builds ExecuTorch with the Cortex-M backend and links the CMSIS-NN kernels
4. At runtime, quantized linear ops dispatch to CMSIS-NN instead of the portable kernels

### When to use CMSIS-NN

- Lower latency on supported ops (linear, conv2d)
- Smaller model size (INT8 weights vs FP32)
- Trade-off: slight accuracy loss from quantization

Result: A complete PyTorch → ExecuTorch → Pico2 demo MNIST deployment!
