
Commit 841181e

psiddh authored

Update Pico2 docs with CMSIS-NN INT8 support and latency instrumentation (#18898)

Add documentation for the new --cmsis build flag, INT8 quantized model export via export_mlp_mnist_cmsis.py, and updated serial output showing per-inference latency timing and memory usage diagnostics.

Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

1 parent dbd5118 · commit 841181e

2 files changed · 124 additions & 8 deletions

**docs/source/pico2_tutorial.md** (69 additions & 3 deletions)
@@ -9,6 +9,7 @@ A 28×28 MNIST digit classifier running on memory constrained, low power microco
- Input: ASCII art digits (0, 1, 4, 7)
- Output: Real-time predictions via USB serial
- Memory: <400KB total footprint
- Two variants: FP32 (portable ops) and INT8 (CMSIS-NN accelerated)

## Prerequisites

@@ -24,29 +25,63 @@ which arm-none-eabi-gcc # --> arm/arm-scratch/arm-gnu-toolchain-13.3.rel1-x86_64

## Step 1: Generate a .pte from the Example Model

### FP32 model (default)

- Use the [provided example model](https://github.com/pytorch/executorch/blob/main/examples/raspberry_pi/pico2/export_mlp_mnist.py)

```bash
cd examples/raspberry_pi/pico2
python export_mlp_mnist.py  # Creates balanced_tiny_mlp_mnist.pte
```

- **Note:** This is a hand-crafted MNIST classifier (proof of concept), not a production-trained model. The tiny MLP recognizes digits 0, 1, 4, and 7 using manually designed feature detectors.

### INT8 quantized model (CMSIS-NN accelerated)

- Use the [CMSIS-NN export script](https://github.com/pytorch/executorch/blob/main/examples/raspberry_pi/pico2/export_mlp_mnist_cmsis.py)

```bash
cd examples/raspberry_pi/pico2
python export_mlp_mnist_cmsis.py  # Creates balanced_tiny_mlp_mnist_cmsis.pte
```

This uses the `CortexMQuantizer` to produce INT8 quantized ops that map to CMSIS-NN kernels on the Cortex-M33. The model I/O stays float: quantize and dequantize nodes are inserted inside the graph.
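For orientation, the following is a minimal sketch of what a PT2E-style INT8 export can look like. The import paths, the stand-in model, and the calibration step are assumptions for illustration; defer to `export_mlp_mnist_cmsis.py` for the actual flow.

```python
# Minimal sketch of a PT2E INT8 export flow. Import paths and the
# stand-in model are assumptions; see export_mlp_mnist_cmsis.py for
# the real thing.
import torch
from torch.export import export
from torchao.quantization.pt2e.quantize_pt2e import prepare_pt2e, convert_pt2e
from executorch.backends.cortex_m.quantizer import CortexMQuantizer  # assumed path
from executorch.exir import to_edge

model = torch.nn.Sequential(  # stand-in for the tutorial's tiny MLP
    torch.nn.Linear(784, 32), torch.nn.ReLU(), torch.nn.Linear(32, 10)
).eval()
example = (torch.randn(1, 784),)  # float input: model I/O stays float

graph = export(model, example).module()
prepared = prepare_pt2e(graph, CortexMQuantizer())
prepared(*example)                  # one calibration pass to record ranges
converted = convert_pt2e(prepared)  # inserts quantize/dequantize nodes

prog = to_edge(export(converted, example)).to_executorch()
with open("balanced_tiny_mlp_mnist_cmsis.pte", "wb") as f:
    f.write(prog.buffer)
```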
## Step 2: Build Firmware for Pico2

### FP32 build

```bash
# Generate model (Creates balanced_tiny_mlp_mnist.pte)
cd ./examples/raspberry_pi/pico2
python export_mlp_mnist.py
cd -

# Build Pico2 firmware (one command!)
./examples/raspberry_pi/pico2/build_firmware_pico.sh --model=balanced_tiny_mlp_mnist.pte
```
### INT8 CMSIS-NN build

```bash
# Generate INT8 quantized model
cd ./examples/raspberry_pi/pico2
python export_mlp_mnist_cmsis.py
cd -

# Build with CMSIS-NN backend
./examples/raspberry_pi/pico2/build_firmware_pico.sh --cmsis --model=balanced_tiny_mlp_mnist_cmsis.pte
```

Output: **executorch_pico.uf2** firmware file (examples/raspberry_pi/pico2/build/)

**Script options:**

| Flag | Description |
|------|-------------|
| `--model=FILE` | Specify model file to embed (relative to pico2/) |
| `--cmsis` | Build with CMSIS-NN INT8 kernels for Cortex-M33 acceleration |
| `--clean` | Clean build directories and exit; run separately before building if needed |
**Note:** The [build_firmware_pico.sh](https://github.com/pytorch/executorch/blob/main/examples/raspberry_pi/pico2/build_firmware_pico.sh) script converts the given model .pte into a hex array and generates C code for it via this helper [script](https://github.com/pytorch/executorch/blob/main/examples/raspberry_pi/pico2/pte_to_array.py). That C code is then compiled into the final .uf2 binary, which is flashed to the Pico2.
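As a rough illustration of that conversion, a .pte can be embedded as a C byte array along these lines (a sketch only; the real `pte_to_array.py` may differ in naming and formatting):

```python
# Sketch: embed a .pte file as a C array. The actual pte_to_array.py
# helper may use different symbol names and formatting.
from pathlib import Path

data = Path("balanced_tiny_mlp_mnist.pte").read_bytes()
# Render the bytes as 0x.. literals, 12 per line.
rows = [", ".join(f"0x{b:02x}" for b in data[i:i + 12]) for i in range(0, len(data), 12)]
Path("model_pte.c").write_text(
    "#include <stddef.h>\n\n"
    "const unsigned char model_pte[] = {\n    "
    + ",\n    ".join(rows)
    + "\n};\n"
    + f"const size_t model_pte_len = {len(data)};\n"
)
```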
## Step 3: Flash to Pico2
@@ -72,6 +107,10 @@ screen /dev/tty.usbmodem1101 115200

Something like:

```
📊 Memory usage after method load:
   Method allocator: 45632 / 204800 bytes used
   Activation pool: 204800 bytes allocated

=== Digit 7 ===
############################
############################
```
@@ -104,6 +143,7 @@ Something like:

```
Input stats: 159 white pixels out of 784 total
Running neural network inference...
⏱️ Inference time: 245 us
✅ Neural network results:
   Digit 0: 370.000
   Digit 1: 0.000
```
@@ -116,7 +156,16 @@ Running neural network inference...

```
   Digit 8: -3.000
   Digit 9: -3.000

🎯 PREDICTED: 7 (Expected: 7) ✅ CORRECT!

==================================================

📊 Inference latency summary:
   Digit 0: 312 us
   Digit 1: 198 us
   Digit 4: 267 us
   Digit 7: 245 us
   Average: 255 us
```
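For context on the `#` art and the white-pixel count shown above, here is a hypothetical host-side mirror of the firmware's rendering (the firmware's actual implementation lives in its C sources):

```python
# Hypothetical host-side mirror of the firmware's ASCII rendering:
# a 28x28 image prints '#' for white pixels and '.' otherwise.
def show_digit(pixels):
    """pixels: 784 floats in [0, 1], row-major 28x28."""
    white = sum(1 for p in pixels if p > 0.5)
    for row in range(28):
        print("".join("#" if pixels[row * 28 + col] > 0.5 else "."
                      for col in range(28)))
    print(f"Input stats: {white} white pixels out of 784 total")
```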
## Memory Optimization Tips
@@ -184,12 +233,29 @@ arm-none-eabi-objdump -t examples/raspberry_pi/pico2/build/executorch_pico.elf |
arm-none-eabi-readelf -l examples/raspberry_pi/pico2/build/executorch_pico.elf
```

## CMSIS-NN INT8 Acceleration

The Pico2 uses an RP2350 SoC with a Cortex-M33 core. The CMSIS-NN library provides optimized INT8 kernels that use the Cortex-M33's DSP instructions for faster inference than the FP32 portable ops.

### How it works

1. `export_mlp_mnist_cmsis.py` uses `CortexMQuantizer` to quantize the model to INT8
2. The model I/O remains float; quantize/dequantize nodes are inserted inside the graph
3. The `--cmsis` flag builds ExecuTorch with the Cortex-M backend and links the CMSIS-NN kernels
4. At runtime, quantized linear ops dispatch to CMSIS-NN instead of the portable kernels (a host-side sanity check follows this list)
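Before flashing, you can sanity-check the exported program on the host. A minimal sketch, assuming ExecuTorch's Python runtime bindings are installed (the exact API surface can vary between releases, and the input shape is assumed):

```python
# Sketch: run the exported .pte on the host with float I/O, confirming
# the quantized graph still takes and returns float tensors.
# Assumes executorch's Python runtime bindings; API may vary by release.
import torch
from executorch.runtime import Runtime

runtime = Runtime.get()
program = runtime.load_program("balanced_tiny_mlp_mnist_cmsis.pte")
method = program.load_method("forward")
outputs = method.execute([torch.randn(1, 784)])  # float in, float out
print("Predicted digit:", int(outputs[0].argmax()))
```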
### When to use CMSIS-NN

- Lower latency on supported ops (linear, conv2d)
- Smaller model size (INT8 weights vs FP32; see the quick size check below)
- Trade-off: slight accuracy loss from quantization
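Once both models are exported, a quick comparison makes the size point concrete (paths assume you are in examples/raspberry_pi/pico2 with both .pte files present):

```python
# Compare on-disk sizes of the FP32 and INT8 exports.
import os

fp32 = os.path.getsize("balanced_tiny_mlp_mnist.pte")
int8 = os.path.getsize("balanced_tiny_mlp_mnist_cmsis.pte")
print(f"FP32: {fp32} bytes, INT8: {int8} bytes -> {fp32 / int8:.2f}x ratio")
```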
## Next Steps

### Scale up your deployment

- Use a real production-trained model
- Optimize further → INT8 quantization with CMSIS-NN, pruning

### Happy Inference!
**examples/raspberry_pi/pico2/README.md** (55 additions & 5 deletions)
@@ -82,18 +82,39 @@ This involves two steps:

### Generate your model:

**FP32 model (default):**

```bash
cd examples/raspberry_pi/pico2
python export_mlp_mnist.py  # Creates balanced_tiny_mlp_mnist.pte
```

**INT8 quantized model (CMSIS-NN accelerated):**

```bash
cd examples/raspberry_pi/pico2
python export_mlp_mnist_cmsis.py  # Creates balanced_tiny_mlp_mnist_cmsis.pte
```

### Build firmware:

**FP32 build:**

```bash
# In the dir examples/raspberry_pi/pico2
./build_firmware_pico.sh --model=balanced_tiny_mlp_mnist.pte
```

**INT8 CMSIS-NN build:**

```bash
# In the dir examples/raspberry_pi/pico2
./build_firmware_pico.sh --cmsis --model=balanced_tiny_mlp_mnist_cmsis.pte
```

**Script options:**

| Flag | Description |
|------|-------------|
| `--model=FILE` | Specify model file to embed (relative to pico2/) |
| `--cmsis` | Build with CMSIS-NN INT8 kernels for Cortex-M33 acceleration |
| `--clean` | Clean build directories and exit (run separately before building) |
### Flash Firmware

Hold the BOOTSEL button on the Pico2 and connect it to your computer. It mounts as `RPI-RP2`. Copy `executorch_pico.uf2` to this drive.
@@ -105,10 +126,14 @@ The Pico2 LED blinks 10 times at 500ms intervals for successful execution. Via s

```bash
...
...
🎯 PREDICTED: 4 (Expected: 4) ✅ CORRECT!

==================================================

📊 Memory usage after method load:
   Method allocator: 45632 / 204800 bytes used
   Activation pool: 204800 bytes allocated

=== Digit 7 ===
############################
############################
```
@@ -141,6 +166,7 @@ PREDICTED: 4 (Expected: 4) ✅ CORRECT!

```bash
Input stats: 159 white pixels out of 784 total
Running neural network inference...
⏱️ Inference time: 245 us
✅ Neural network results:
   Digit 0: 370.000
   Digit 1: 0.000
```
@@ -153,11 +179,18 @@ Running neural network inference...

```bash
   Digit 8: -3.000
   Digit 9: -3.000

🎯 PREDICTED: 7 (Expected: 7) ✅ CORRECT!

==================================================

📊 Inference latency summary:
   Digit 0: 312 us
   Digit 1: 198 us
   Digit 4: 267 us
   Digit 7: 245 us
   Average: 255 us

🎉 All tests complete! ExecuTorch inference of neural network works on Pico2!
```
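To capture those latency lines programmatically instead of reading them in a terminal, a small host-side reader can help. A sketch, assuming `pyserial` is installed and your device path matches:

```python
# Sketch: collect per-digit latency lines from the Pico's serial output.
# Assumes pyserial (pip install pyserial) and your actual device path.
import re
import serial

latencies = {}
with serial.Serial("/dev/tty.usbmodem1101", 115200, timeout=10) as port:
    while True:
        raw = port.readline()
        if not raw:  # timeout with no data
            break
        line = raw.decode("utf-8", errors="replace").strip()
        m = re.match(r"Digit (\d+): (\d+) us", line)
        if m:
            latencies[int(m.group(1))] = int(m.group(2))
        if line.startswith("Average:"):
            break

print(latencies)  # e.g. {0: 312, 1: 198, 4: 267, 7: 245}
```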
### Debugging via Serial Terminal
@@ -170,4 +203,21 @@ screen /dev/tty.usbmodem1101 115200

Replace `/dev/tty.usbmodem1101` with your device path. If the LED blinks 10 times at 100ms intervals, check the logs for errors; if it blinks 10 times at 500ms intervals, the run was successful!

## CMSIS-NN INT8 Acceleration

The Pico2 uses an RP2350 SoC with a Cortex-M33 core. The CMSIS-NN library provides optimized INT8 kernels that use the Cortex-M33's DSP instructions for faster inference than the FP32 portable ops.

### How it works

1. `export_mlp_mnist_cmsis.py` uses `CortexMQuantizer` to quantize the model to INT8
2. The model I/O remains float; quantize/dequantize nodes are inserted inside the graph
3. The `--cmsis` flag builds ExecuTorch with the Cortex-M backend and links the CMSIS-NN kernels
4. At runtime, quantized linear ops dispatch to CMSIS-NN instead of the portable kernels

### When to use CMSIS-NN

- Lower latency on supported ops (linear, conv2d)
- Smaller model size (INT8 weights vs FP32)
- Trade-off: slight accuracy loss from quantization

Result: A complete PyTorch → ExecuTorch → Pico2 demo MNIST deployment!
