Skip to content

Add FP32 operators and tiling support for MicroLlama on Snitch #153

Open
lee2716 wants to merge 67 commits into
pulp-platform:develfrom
lee2716:devel
Open

Add FP32 operators and tiling support for MicroLlama on Snitch #153
lee2716 wants to merge 67 commits into
pulp-platform:develfrom
lee2716:devel

Conversation

@lee2716

@lee2716 lee2716 commented Jan 31, 2026

Copy link
Copy Markdown
Contributor

Add FP32 operator support to enable MicroLlama model deployment on Snitch
cluster, with both untiled and tiled execution modes.

Added

  • FP32 templates: FloatAddTemplate, FloatDivTemplate, FloatMulTemplate,
    FloatHardSwishTemplate, FloatRMSNormTemplate, FloatMatMulTemplate
  • FP32 parsers: SnitchDivParser, SnitchMulParser, SnitchRMSNormParser
  • FP32 bindings: BasicDivBindings, BasicMulBindings, BasicHardSwishBindings,
    BasicRMSNormBindings (untiled)
  • FP32 bindings: SnitchDivBindings, SnitchMulBindings,
    SnitchHardSwishBindings, SnitchRMSNormBindings (tiled)
  • C kernels: Add_fp32.c, Div_fp32.c, Mul_fp32.c, HardSwish.c, RMSNrom_fp32.c
    with multi-core parallelization
  • TileConstraints: FloatDivTileConstraint, FloatMulTileConstraint,
    ReshapeTileConstraint
  • SnitchTiledPlatform and SnitchTiledMapping for tiled execution
  • testRunner_tiled_snitch.py for tiled model testing
  • FP32 kernel tests with ONNX models, inputs, and expected outputs
  • CI workflows for both Snitch and Snitch_tiled platforms

Changed

  • Platform.py: Added separate SnitchPlatform (untiled) and SnitchTiledPlatform
    (tiled) with distinct mappings
  • platformMapping.py: Added Snitch_tiled platform support
  • Tiler.py: Registered new FP32 tile constraints
  • SnitchClusterTiling.py: Added tiled code transformation passes

Fixed

  • N/A

PR Merge Checklist

  1. The PR is rebased on the latest devel commit and pointing to devel.
  2. Your PR reviewed and approved.
  3. All checks are passing.
  4. The CHANGELOG.md file has been updated.
  5. If the docker was modified, change back its link after review.

@coderabbitai

coderabbitai Bot commented Jan 31, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 7a29791a-c815-4dd3-9c8f-b10e5527e3eb

📥 Commits

Reviewing files that changed from the base of the PR and between 7c8f2d8 and 4ec3fb3.

📒 Files selected for processing (111)
  • .github/workflows/ci-platform-snitch-tiled.yml
  • .github/workflows/ci-platform-snitch.yml
  • .yamllint
  • Deeploy/DeeployTypes.py
  • Deeploy/Targets/Generic/Bindings.py
  • Deeploy/Targets/Generic/Layers.py
  • Deeploy/Targets/Generic/Parsers.py
  • Deeploy/Targets/Generic/Templates/FloatDivTemplate.py
  • Deeploy/Targets/Generic/Templates/ReshapeTemplate.py
  • Deeploy/Targets/Generic/TileConstraints/BOPTileConstraint.py
  • Deeploy/Targets/Generic/TypeCheckers.py
  • Deeploy/Targets/Snitch/Bindings.py
  • Deeploy/Targets/Snitch/Parsers.py
  • Deeploy/Targets/Snitch/Platform.py
  • Deeploy/Targets/Snitch/Templates/AllocateTemplate.py
  • Deeploy/Targets/Snitch/Templates/FloatAddTemplate.py
  • Deeploy/Targets/Snitch/Templates/FloatDivTemplate.py
  • Deeploy/Targets/Snitch/Templates/FloatHardSwishTemplate.py
  • Deeploy/Targets/Snitch/Templates/FloatMatMulTemplate.py
  • Deeploy/Targets/Snitch/Templates/FloatMulTemplate.py
  • Deeploy/Targets/Snitch/Templates/FloatRMSNormTemplate.py
  • Deeploy/Targets/Snitch/Templates/FloatSoftmaxTemplate.py
  • Deeploy/Targets/Snitch/Templates/TransposeTemplate.py
  • Deeploy/Targets/Snitch/Tiler.py
  • DeeployTest/Platforms/Snitch/main.c
  • DeeployTest/Tests/Kernels/FP32/Add/Scalar/inputs.npz
  • DeeployTest/Tests/Kernels/FP32/Add/Scalar/network.onnx
  • DeeployTest/Tests/Kernels/FP32/Add/Scalar/outputs.npz
  • DeeployTest/Tests/Kernels/FP32/Div/Scalar/inputs.npz
  • DeeployTest/Tests/Kernels/FP32/Div/Scalar/network.onnx
  • DeeployTest/Tests/Kernels/FP32/Div/Scalar/outputs.npz
  • DeeployTest/Tests/Kernels/FP32/Hardswish/inputs.npz
  • DeeployTest/Tests/Kernels/FP32/Hardswish/network.onnx
  • DeeployTest/Tests/Kernels/FP32/Hardswish/outputs.npz
  • DeeployTest/Tests/Kernels/FP32/Mul/Scalar/inputs.npz
  • DeeployTest/Tests/Kernels/FP32/Mul/Scalar/network.onnx
  • DeeployTest/Tests/Kernels/FP32/Mul/Scalar/outputs.npz
  • DeeployTest/Tests/Kernels/FP32/RMSNorm/separate_ops/inputs.npz
  • DeeployTest/Tests/Kernels/FP32/RMSNorm/separate_ops/network.onnx
  • DeeployTest/Tests/Kernels/FP32/RMSNorm/separate_ops/outputs.npz
  • DeeployTest/Tests/Kernels/FP32/RMSNorm/single_fused_op/inputs.npz
  • DeeployTest/Tests/Kernels/FP32/RMSNorm/single_fused_op/network.onnx
  • DeeployTest/Tests/Kernels/FP32/RMSNorm/single_fused_op/outputs.npz
  • DeeployTest/Tests/Models/microLlama/FP32/microLlama1/activations.npz
  • DeeployTest/Tests/Models/microLlama/FP32/microLlama1/inputs.npz
  • DeeployTest/Tests/Models/microLlama/FP32/microLlama1/network.onnx
  • DeeployTest/Tests/Models/microLlama/FP32/microLlama1/outputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama1/activations.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama1/inputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama1/network.onnx
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama1/outputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama128/activations.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama128/inputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama128/network.onnx
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama128/outputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama16/activations.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama16/inputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama16/network.onnx
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama16/outputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama16_parallel/activations.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama16_parallel/inputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama16_parallel/network.onnx
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama16_parallel/outputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama1_parallel/activations.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama1_parallel/inputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama1_parallel/network.onnx
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama1_parallel/outputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama2/activations.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama2/inputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama2/network.onnx
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama2/outputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama256/activations.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama256/inputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama256/network.onnx
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama256/outputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama2_parallel/activations.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama2_parallel/inputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama2_parallel/network.onnx
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama2_parallel/outputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama32/activations.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama32/inputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama32/network.onnx
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama32/outputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama32_parallel/activations.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama32_parallel/inputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama32_parallel/network.onnx
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama32_parallel/outputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama4/activations.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama4/inputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama4/network.onnx
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama4/outputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama4_parallel/activations.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama4_parallel/inputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama4_parallel/network.onnx
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama4_parallel/outputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama64/activations.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama64/inputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama64/network.onnx
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama64/outputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama64_parallel/activations.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama64_parallel/inputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama64_parallel/network.onnx
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama64_parallel/outputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama8/activations.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama8/inputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama8/network.onnx
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama8/outputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama8_parallel/activations.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama8_parallel/inputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama8_parallel/network.onnx
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama8_parallel/outputs.npz
✅ Files skipped from review due to trivial changes (1)
  • .yamllint
🚧 Files skipped from review as they are similar to previous changes (2)
  • Deeploy/Targets/Snitch/Templates/FloatHardSwishTemplate.py
  • DeeployTest/Tests/Kernels/FP32/Hardswish/network.onnx

📝 Walkthrough

Summary by CodeRabbit

  • New Features

    • Added FP32 support for several operators, including add, divide, multiply, HardSwish, RMS normalization, matmul, transpose, reshape, concat, and gather.
    • Expanded tiling-aware execution paths for Snitch targets to cover more models and elementwise cases, including scalar broadcasting.
    • Added new CI coverage for model-focused and tiled Snitch test jobs.
  • Bug Fixes

    • Improved broadcasting handling for addition and scalar elementwise operations.
    • Fixed initialization and type-handling issues that could affect exports and constant tensors.
  • Tests

    • Added a new FP32 HardSwish ONNX test model.
    • Updated Snitch platform startup to be safer across multiple cores.

Walkthrough

Adds FP32 operator support (Add with broadcasting, Div, Mul, HardSwish, RMSNorm, MatMul, Transpose, Reshape, Concat, Gather) to the Snitch target. Includes C kernel implementations, Snitch-specific codegen templates, parsers, tiling-ready bindings with tile constraints, platform mapper wiring, a guarded allocate template, multi-core InitNetwork fix, and two new CI jobs.

Changes

Snitch FP32 Operator Pipeline

Layer / File(s) Summary
Generic parsers, checkers, and layer contracts
Deeploy/DeeployTypes.py, Deeploy/Targets/Generic/Parsers.py, Deeploy/Targets/Generic/TypeCheckers.py, Deeploy/Targets/Generic/Layers.py, Deeploy/Targets/Generic/Bindings.py, Deeploy/Targets/Generic/Templates/FloatDivTemplate.py, Deeploy/Targets/Generic/Templates/ReshapeTemplate.py
AddParser gains broadcast stride computation; iRMSNormParser gains stash_type validation and expanded fields; DivParser switches to A/B/C keys; HardswishChecker handles FloatImmediate; RMSNormChecker and RMSNormLayer are added; float32 concat binding appended; export_dtype guard and reshape alias fix applied.
BOPTileConstraint scalar broadcasting
Deeploy/Targets/Generic/TileConstraints/BOPTileConstraint.py
addGeometricalConstraint branches on scalar second input to fix input2 dimension variables; serializeTilingSolution uses a fixed HyperRectangle load region for scalar input2.
C kernel headers and implementations
TargetLibraries/Snitch/inc/kernel/Add.h, Div.h, HardSwish.h, Mul.h, RMSNrom.h, Softmax.h, TargetLibraries/Generic/inc/macros.h, TargetLibraries/Snitch/inc/DeeploySnitchMath.h, TargetLibraries/Snitch/src/Add_fp32.c, Div_fp32.c, HardSwish.c, Mul_fp32.c, RMSNrom_fp32.c
Declares and implements multi-core FP32 kernels for Add (three variants), Div, Mul, HardSwish, and RMSNorm; renames softmax_fp32 to Softmax_fp32; adds #ifndef guards to MAX/MIN/CLAMP.
Snitch codegen templates
Deeploy/Targets/Snitch/Templates/FloatAddTemplate.py, FloatDivTemplate.py, FloatMulTemplate.py, FloatHardSwishTemplate.py, FloatRMSNormTemplate.py, FloatMatMulTemplate.py, TransposeTemplate.py, FloatSoftmaxTemplate.py, AllocateTemplate.py
Adds broadcast-aware _FloatAddTemplate, scalar-flag Div/Mul templates, FloatHardSwishTemplate, FloatRMSNormTemplate with SSR variant, FloatMatMulTemplate with ssrFrepTemplate, SnitchTransposeTemplate with per-dimension parallel loop selection, simplified Softmax template, and snitchGenericGuardedAllocate.
Snitch-specific parsers
Deeploy/Targets/Snitch/Parsers.py
Adds SnitchRMSNormParser, SnitchHardSwishParser, SnitchAddParser, and _ScalarElementwiseMixin composing SnitchDivParser/SnitchMulParser with is_scalar detection.
Snitch bindings and tiling-ready bindings
Deeploy/Targets/Snitch/Bindings.py, Deeploy/Targets/Snitch/Tiler.py
Adds SkipTransformer, extends TiledTransformer with L2 memory management, wires SnitchRMSNorm/HardSwish/Div/Mul/MatMul/Concat/Transpose/Reshape/Gather bindings, and creates matching TilingReadyNodeBindings with appropriate tile constraints.
Platform wiring and buffer allocation
Deeploy/Targets/Snitch/Platform.py
Rewires all Snitch mappers to tiling-ready bindings, extends SnitchMapping with new operators, switches SnitchTransientBuffer to guarded allocate template, adds SnitchConstantBuffer.__init__.
Tests, multi-core init fix, and CI
DeeployTest/Tests/Kernels/FP32/Hardswish/network.onnx, DeeployTest/Platforms/Snitch/main.c, DeeployTest/testUtils/typeMapping.py, .github/workflows/ci-platform-snitch*.yml, .yamllint
Adds HardSwish ONNX test model, fixes InitNetwork to run on all cores, adds float dtype priority in inferMinimalType, adds two CI model-test jobs, ignores .venv in yamllint.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Suggested reviewers

  • Victor-Jung
  • Xeratec
  • lukamac
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 10.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main change: FP32 operator support and tiling support for MicroLlama on Snitch.
Description check ✅ Passed The description matches the changeset and covers the added FP32 operators, parsers, bindings, kernels, and tiled execution support.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 13

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
Deeploy/Targets/Snitch/Platform.py (1)

197-277: ⚠️ Potential issue | 🟠 Major

Replace mutable default arguments in SnitchConstantBuffer, SnitchPlatform, and SnitchTiledPlatform.

Using list defaults and constructing engines in function signatures risks shared state across instances. Additionally, includeList is accepted but unused in both platform classes; wiring it into the default engine construction fixes that.

Proposed fix
-    def __init__(self, name: str = '', shape = [1], values = [0]):
-        super().__init__(name, shape, values)
+    def __init__(self, name: str = '', shape = None, values = None):
+        if shape is None:
+            shape = [1]
+        if values is None:
+            values = [0]
+        super().__init__(name, shape, values)
@@
 class SnitchPlatform(DeploymentPlatform):
 
     def __init__(self,
-                 engines = [SnitchClusterEngine("SnitchCluster")],
+                 engines = None,
                  variableBuffer = SnitchVariableBuffer,
                  constantBuffer = SnitchConstantBuffer,
                  structBuffer = SnitchStructBuffer,
                  transientBuffer = SnitchTransientBuffer,
                  includeList: List[str] = _includeList):
+        if engines is None:
+            engines = [SnitchClusterEngine("SnitchCluster", includeList = includeList)]
         super().__init__(engines, variableBuffer, constantBuffer, structBuffer, transientBuffer)
@@
 class SnitchTiledPlatform(DeploymentPlatform):
 
     def __init__(self,
-                 engines = [SnitchTiledClusterEngine("SnitchCluster")],
+                 engines = None,
                  variableBuffer = SnitchVariableBuffer,
                  constantBuffer = SnitchConstantBuffer,
                  structBuffer = SnitchStructBuffer,
                  transientBuffer = SnitchTransientBuffer,
                  includeList: List[str] = _includeList):
+        if engines is None:
+            engines = [SnitchTiledClusterEngine("SnitchCluster", includeList = includeList)]
         super().__init__(engines, variableBuffer, constantBuffer, structBuffer, transientBuffer)
🤖 Fix all issues with AI agents
In `@Deeploy/Targets/Generic/Parsers.py`:
- Around line 494-525: AddParser.parseNodeCtxt computes broadcasting strides
incorrectly when input ranks differ because it checks i < len(shape1/shape2)
instead of left-padding shapes with 1s per ONNX rules; update the code to
left-pad shape1 and shape2 to length ndim (length of out_shape) with leading 1s
and then compute strides1 and strides2 exactly as in
SnitchAddParser._compute_broadcast_strides (treat dimensions equal to out_shape
as non-broadcast and set stride 0 for broadcast dims, compute cumulative strides
from the right otherwise), and apply the same padding+stride logic to both
strides1 and strides2 so examples like shape1=[4], out_shape=[2,3,4] produce
strides1=[0,0,1].

In `@Deeploy/Targets/Snitch/Parsers.py`:
- Around line 239-285: In parseNodeCtxt of class SnitchDivParser, after
computing shape1, shape2, out_shape and their sizes
(operatorRepresentation['size1'], ['size2'], ['size']), add a guard that rejects
non-scalar broadcasting by returning (ctxt, False) whenever input2 is not a
scalar and its total size does not equal the output size (i.e.,
operatorRepresentation['size2'] != 1 and operatorRepresentation['size2'] !=
operatorRepresentation['size']). This ensures parseNodeCtxt fails for non-scalar
broadcasts (preventing the elementwise path from reading past input2) until a
general broadcast kernel is implemented; use the existing operatorRepresentation
keys and node.inputs/node.outputs to locate where to place the check.
- Around line 288-334: In SnitchMulParser.parseNodeCtxt, detect non-scalar
broadcasting and reject the node: after computing shape1, shape2, and out_shape
and setting size1/size2/size, add a guard that if shape2 (or shape1) does not
exactly match out_shape and operatorRepresentation['size2'] != 1 (i.e., input2
is not a scalar) then treat this as unsupported broadcasting — set an
error/unsupported marker and return ctxt, False so the generic elementwise
kernel is not used; keep the existing operatorRepresentation fields (shape1,
shape2, out_shape, size*, is_scalar) and reference parseNodeCtxt,
operatorRepresentation, size2, and is_scalar when implementing the check.

In `@Deeploy/Targets/Snitch/Platform.py`:
- Around line 25-55: The file imports RQAddMapper from
Deeploy.Targets.PULPOpen.Platform but then defines a local RQAddMapper =
NodeMapper(...) which shadows the import and triggers F811; fix it by removing
RQAddMapper from the import list on the first line (or alternatively rename the
local NodeMapper variable, e.g., LocalRQAddMapper) so there is no name collision
between the imported symbol and the local definition; ensure any other code
referencing the external RQAddMapper is updated if you choose to rename the
local variable.

In `@Deeploy/Targets/Snitch/Templates/MatMulTemplate.py`:
- Around line 42-56: The loop always advances input pointers
ref_${data_out}_${A} and ref_${data_out}_${B} for each batch, which breaks
broadcasting; update the batched MatMul loop to only increment those pointers
when the corresponding A_batched or B_batched flag is true (leave
ref_${data_out}_${data_out} always incremented), i.e. after calling
MatMul_s${A_type.referencedType.typeWidth}_s${B_type.referencedType.typeWidth}_s${data_out_type.referencedType.typeWidth}(...)
wrap the pointer advances in conditional checks using the parser-provided
A_batched and B_batched flags so broadcasted inputs (batch dim == 1) are not
advanced.

In `@Deeploy/Targets/Snitch/Templates/ReshapeTemplate.py`:
- Around line 18-44: The code incorrectly assumes
operatorRepresentation['shape'] and operatorRepresentation['indices'] live in
ctxt.globalObjects; replace these unsafe direct lookups by checking
ctxt.is_global(name) before accessing ctxt.globalObjects and avoid the
unnecessary 'indices' handling (remove the 'indices' block since Reshape never
sets it). Concretely, in the Reshape template handler (look for
operatorRepresentation, ctxt.lookup, bufferIn/bufferOut logic), guard any access
to ctxt.globalObjects[operatorRepresentation["shape"]] with
ctxt.is_global(operatorRepresentation["shape"]) and only set _deploy/_live when
that returns True; drop the indices branch entirely and rely on ctxt.lookup for
non-global buffers.

In `@Deeploy/Targets/Snitch/TileConstraints/FloatDivTileConstraint.py`:
- Around line 59-68: Add an explicit upfront shape validation before calling
tilerModel.getTensorDimVar: if the inputs are non-scalar (len(input1Shape) > 0)
compare input1Shape and input2Shape and raise a clear exception (or return an
error) when they differ, referencing the variables inputBuffer1Name,
inputBuffer2Name, input1Shape and input2Shape; keep the existing call to
tilerModel.addTensorDimToModel and the per-dimension constraints
(getTensorDimVar / addConstraint) only after this explicit equality check so
getTensorDimVar is never invoked on mismatched shapes.

In `@Deeploy/Targets/Snitch/TileConstraints/FloatMulTileConstraint.py`:
- Around line 59-68: Before adding per-dimension constraints, validate that
input2’s shape is compatible with input1: if len(input2Shape) !=
len(input1Shape) and input2 is not scalar (i.e., len(input2Shape) != 0), raise a
clear error (e.g., ValueError) explaining the mismatched shapes; this prevents
obscure failures from tilerModel.getTensorDimVar when you later call
addTensorDimToModel and iterate over input1Shape. Reference the symbols
input1Shape, input2Shape, inputBuffer2Name, tilerModel.addTensorDimToModel, and
tilerModel.getTensorDimVar to locate where to insert the check.

In `@Deeploy/Targets/Snitch/TileConstraints/ReshapeTileConstraint.py`:
- Around line 23-47: In addGeometricalConstraint remove the two unused local
variables by deleting the assignments "inputBuffer =
ctxt.lookup(inputBufferName)" and "outputBuffer = ctxt.lookup(outputBufferName)"
(or alternatively use them if needed); the current ctxt.lookup calls create
unused values causing Ruff F841—ensure any required buffer lookups are either
consumed (e.g., used in subsequent logic) or removed from the function body.
- Around line 83-134: The mapping from output tiles to input HyperRectangle(s)
in the loop over outputCubes (using outputShape, inputShape, outOffset, outSize,
inStrides, inCubeOffset, inCubeDims and creating HyperRectangle) is incorrect
for tiles that cross input dimension boundaries; fix by detecting when the
linear range [outOffset, outOffset+outSize) spans multiple input strides and
emit one or more input HyperRectangle pieces that together cover exactly outSize
elements instead of assuming a single axis-aligned block. Concretely, replace
the simplistic inCubeDims/inCubeOffset construction with logic that: 1) computes
the remainingCount = outSize and a cursor = outOffset, 2) in a loop maps cursor
to multi-dimensional coords using inStrides to get the start coord, determines
the maximal contiguous run length along the last (fastest) dimension without
crossing that dimension limit (using inputShape and start coord), 3) create a
HyperRectangle for that run, subtract its size from remainingCount, advance
cursor by run length, and repeat until remainingCount == 0; ensure the created
HyperRectangle tuples (offset, dims) sum exactly to outSize and add all pieces
instead of a single inputCube.

In `@TargetLibraries/Snitch/inc/kernel/HardSwish.h`:
- Around line 1-5: Update the SPDX header in HardSwish.h to correct the
copyright year from 2026 to the proper year (2024 or 2025); locate the
top-of-file comment block in TargetLibraries/Snitch/inc/kernel/HardSwish.h and
change the year in the SPDX-FileCopyrightText line so it matches the source
file.

In `@TargetLibraries/Snitch/src/Div_fp32.c`:
- Around line 10-49: The function Div_fp32 currently performs element-wise
division using input2[i] and does not handle scalar broadcasting; update the
top-of-file comment for Div_fp32 to remove the "If input2 is scalar (size=1):
divides all elements..." claim and clearly state that Div_fp32 assumes input1
and input2 have the same size (element-wise) and that scalar denominators must
call Div_fp32_scalar (or another scalar-specific routine); additionally, add a
short runtime assert/check in Div_fp32 (referencing the variables size, input2,
and function name Div_fp32) or document that input2_size must equal size to
prevent out-of-bounds access if scalar inputs are routed incorrectly.

In `@TargetLibraries/Snitch/src/Mul_fp32.c`:
- Around line 1-5: The SPDX header in the file Mul_fp32.c has an incorrect
future year (2026); update the copyright year(s) in the top comment block (the
SPDX header lines and the SPDX-FileCopyrightText entry) to the correct current
year or an appropriate range (e.g., 2024 or "2023-2024") so the
SPDX-FileCopyrightText and SPDX-License-Identifier comment reflects accurate
dates.
🧹 Nitpick comments (7)
Deeploy/Targets/Snitch/Templates/FloatSoftmaxTemplate.py (1)

33-36: Pre-existing issue: Implementation marked as broken with memory leak.

The comment on line 33 explicitly states this implementation is broken and has a memory leak. Additionally, multi-core parallelization is disabled (compute_num hardcoded to 1).

Would you like help addressing this implementation issue, or should a tracking issue be opened to ensure it's resolved before production use?

TargetLibraries/Snitch/src/CycleCounter.c (1)

9-11: Confusing comment formatting for the disabled macro.

The #define ENABLE_INSTR_COUNTER is embedded at the end of the comment on line 10, which makes it unclear that this is a commented-out preprocessor directive meant to be enabled. Consider placing it on a separate line for clarity.

✏️ Suggested formatting improvement
-// Define ENABLE_INSTR_COUNTER to enable instruction counting (causes warnings
-// in gvsoc) `#define` ENABLE_INSTR_COUNTER
+// Define ENABLE_INSTR_COUNTER to enable instruction counting
+// (causes warnings in gvsoc)
+// `#define` ENABLE_INSTR_COUNTER
TargetLibraries/Snitch/src/Gemm_fp32.c (1)

36-60: Non-transposed GEMM is mathematically correct; note significant performance trade-off.

The indexing B[k * ldB + n] correctly accesses B[k,n] for the non-transposed case. The shift from SSR-based vectorized loops to simple triple-nested scalar loops is a significant performance regression on Snitch hardware. If this is intentional for correctness validation or as a fallback path, consider adding a brief comment clarifying the purpose.

Minor nit: Line 41 has an extra blank line after (void)setup_SSR; that isn't present in gemm_fp32_transB_opt (line 14).

,

Optional: remove extra blank line for consistency
 void gemm_fp32_opt(uint32_t M, uint32_t N, uint32_t K, float32_t *A,
                    uint32_t ldA, float32_t *B, uint32_t ldB, float32_t *C,
                    uint32_t ldC, float32_t *Y, uint32_t BETA,
                    uint32_t setup_SSR) {
   (void)setup_SSR;
-
   uint32_t compute_id = snrt_global_compute_core_idx();
Deeploy/Targets/Snitch/Templates/GatherTemplate.py (1)

8-17: Use integer division // instead of / for width calculation.

On line 11, int(data_in_type.referencedType.typeWidth/8) performs floating-point division then truncates. While this works correctly for typical type widths (8, 16, 32, 64), using // is more idiomatic and explicit for integer division in Python.

♻️ Suggested fix
 referenceTemplate = NodeTemplate("""
 // Gather (Name: ${nodeName}, Op: ${nodeOp})
 <%
-width = int(data_in_type.referencedType.typeWidth/8)
+width = data_in_type.referencedType.typeWidth // 8
 %>
 if (snrt_cluster_core_idx() == 0) {
Deeploy/Targets/Generic/TypeCheckers.py (1)

649-663: Consider renaming for clarity to distinguish quantized vs. floating-point variants.

Both HardswishChecker (line 419) and HardSwishChecker (line 649) exist with intentionally different implementations:

  • HardswishChecker (quantized): 2^(4 * typeWidth), used with int8_t input in PULPOpen bindings
  • HardSwishChecker (floating-point): 2^(typeWidth), used with float32_t input in Snitch bindings

The subtle casing difference makes these easy to confuse. To improve clarity, consider renaming one explicitly: either QuantizedHardswishChecker for the quantized variant or FloatHardSwishChecker for the floating-point variant.

TargetLibraries/Snitch/src/Mul_fp32.c (1)

9-26: Misleading documentation about broadcasting support.

The comment states "If input2 is scalar (size=1): multiplies all elements of input1 by input2[0]", but Mul_fp32 performs strict element-wise multiplication without any broadcasting logic. The scalar case is handled separately by Mul_fp32_scalar. Consider updating the comment to reflect the actual behavior.

Suggested documentation fix
 /*
  * Element-wise Multiplication (FP32)
  *
  * Computes: output[i] = input1[i] * input2[i]
  *
- * Supports ONNX broadcasting rules:
- * - If input2 is scalar (size=1): multiplies all elements of input1 by
- * input2[0]
- * - If both have same size: element-wise multiplication
+ * Performs element-wise multiplication. Both inputs must have the same size.
+ * For scalar multiplication, use Mul_fp32_scalar instead.
  *
Deeploy/Targets/Snitch/Templates/FloatRMSNormTemplate.py (1)

27-29: Consider using type-templated kernel name for consistency.

The kernel name RMSNorm_fp32 is hardcoded, while FloatMatMulTemplate uses type-templated names like MatMul_fp${A_type.referencedType.typeWidth}_.... If this template is intended exclusively for FP32, this is acceptable. Otherwise, consider templating the type width for future extensibility.

Based on learnings, Deeploy templates should use explicit bitwidth types for type consistency with templated kernel calls.

Optional: Type-templated kernel name
 FloatRMSNormTemplateStr = r"""
-RMSNorm_fp32(${data_in}, ${weight}, ${data_out}, ${size}, ${lastDimLength}, ${eps});
+RMSNorm_fp${data_in_type.referencedType.typeWidth}(${data_in}, ${weight}, ${data_out}, ${size}, ${lastDimLength}, ${eps});
 """

Comment thread Deeploy/Targets/Generic/Parsers.py
Comment thread Deeploy/Targets/Snitch/Parsers.py Outdated
Comment on lines +239 to +285
class SnitchDivParser(DivParser):
"""
Snitch-specific Div Parser.
Inherits from Generic DivParser and adds shape/broadcasting information.
"""

def __init__(self):
super().__init__()

def parseNodeCtxt(self,
ctxt: NetworkContext,
node: gs.Node,
channels_first: bool = True) -> Tuple[NetworkContext, bool]:
"""
Extend Generic parser to add shape and broadcasting information.
"""
# Call parent method first
ctxt, ret = super().parseNodeCtxt(ctxt, node, channels_first)

if not ret:
return ctxt, False

# Get shape information
data_in_1 = ctxt.lookup(node.inputs[0].name)
data_in_2 = ctxt.lookup(node.inputs[1].name)
data_out = ctxt.lookup(node.outputs[0].name)

shape1 = list(data_in_1.shape)
shape2 = list(data_in_2.shape)
out_shape = list(data_out.shape)

# Store shape information
self.operatorRepresentation['shape1'] = shape1
self.operatorRepresentation['shape2'] = shape2
self.operatorRepresentation['out_shape'] = out_shape

# Calculate sizes
self.operatorRepresentation['size1'] = int(np.prod(shape1))
self.operatorRepresentation['size2'] = int(np.prod(shape2))

# Update output size (may differ due to broadcasting)
self.operatorRepresentation['size'] = int(np.prod(out_shape))

# Check if scalar broadcasting (input2 is scalar)
self.operatorRepresentation['is_scalar'] = (self.operatorRepresentation['size2'] == 1)

return ctxt, True

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Reject non-scalar broadcasting for Div until a general broadcast kernel exists.

Right now, any non-scalar broadcast falls through to the elementwise path, which will read past input2.

✅ Suggested guard
         # Check if scalar broadcasting (input2 is scalar)
         self.operatorRepresentation['is_scalar'] = (self.operatorRepresentation['size2'] == 1)
+
+        # Non-scalar broadcasting isn't supported by Div_fp32
+        if shape1 != shape2 and not self.operatorRepresentation['is_scalar']:
+            return ctxt, False
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
class SnitchDivParser(DivParser):
"""
Snitch-specific Div Parser.
Inherits from Generic DivParser and adds shape/broadcasting information.
"""
def __init__(self):
super().__init__()
def parseNodeCtxt(self,
ctxt: NetworkContext,
node: gs.Node,
channels_first: bool = True) -> Tuple[NetworkContext, bool]:
"""
Extend Generic parser to add shape and broadcasting information.
"""
# Call parent method first
ctxt, ret = super().parseNodeCtxt(ctxt, node, channels_first)
if not ret:
return ctxt, False
# Get shape information
data_in_1 = ctxt.lookup(node.inputs[0].name)
data_in_2 = ctxt.lookup(node.inputs[1].name)
data_out = ctxt.lookup(node.outputs[0].name)
shape1 = list(data_in_1.shape)
shape2 = list(data_in_2.shape)
out_shape = list(data_out.shape)
# Store shape information
self.operatorRepresentation['shape1'] = shape1
self.operatorRepresentation['shape2'] = shape2
self.operatorRepresentation['out_shape'] = out_shape
# Calculate sizes
self.operatorRepresentation['size1'] = int(np.prod(shape1))
self.operatorRepresentation['size2'] = int(np.prod(shape2))
# Update output size (may differ due to broadcasting)
self.operatorRepresentation['size'] = int(np.prod(out_shape))
# Check if scalar broadcasting (input2 is scalar)
self.operatorRepresentation['is_scalar'] = (self.operatorRepresentation['size2'] == 1)
return ctxt, True
class SnitchDivParser(DivParser):
"""
Snitch-specific Div Parser.
Inherits from Generic DivParser and adds shape/broadcasting information.
"""
def __init__(self):
super().__init__()
def parseNodeCtxt(self,
ctxt: NetworkContext,
node: gs.Node,
channels_first: bool = True) -> Tuple[NetworkContext, bool]:
"""
Extend Generic parser to add shape and broadcasting information.
"""
# Call parent method first
ctxt, ret = super().parseNodeCtxt(ctxt, node, channels_first)
if not ret:
return ctxt, False
# Get shape information
data_in_1 = ctxt.lookup(node.inputs[0].name)
data_in_2 = ctxt.lookup(node.inputs[1].name)
data_out = ctxt.lookup(node.outputs[0].name)
shape1 = list(data_in_1.shape)
shape2 = list(data_in_2.shape)
out_shape = list(data_out.shape)
# Store shape information
self.operatorRepresentation['shape1'] = shape1
self.operatorRepresentation['shape2'] = shape2
self.operatorRepresentation['out_shape'] = out_shape
# Calculate sizes
self.operatorRepresentation['size1'] = int(np.prod(shape1))
self.operatorRepresentation['size2'] = int(np.prod(shape2))
# Update output size (may differ due to broadcasting)
self.operatorRepresentation['size'] = int(np.prod(out_shape))
# Check if scalar broadcasting (input2 is scalar)
self.operatorRepresentation['is_scalar'] = (self.operatorRepresentation['size2'] == 1)
# Non-scalar broadcasting isn't supported by Div_fp32
if shape1 != shape2 and not self.operatorRepresentation['is_scalar']:
return ctxt, False
return ctxt, True
🤖 Prompt for AI Agents
In `@Deeploy/Targets/Snitch/Parsers.py` around lines 239 - 285, In parseNodeCtxt
of class SnitchDivParser, after computing shape1, shape2, out_shape and their
sizes (operatorRepresentation['size1'], ['size2'], ['size']), add a guard that
rejects non-scalar broadcasting by returning (ctxt, False) whenever input2 is
not a scalar and its total size does not equal the output size (i.e.,
operatorRepresentation['size2'] != 1 and operatorRepresentation['size2'] !=
operatorRepresentation['size']). This ensures parseNodeCtxt fails for non-scalar
broadcasts (preventing the elementwise path from reading past input2) until a
general broadcast kernel is implemented; use the existing operatorRepresentation
keys and node.inputs/node.outputs to locate where to place the check.

Comment thread Deeploy/Targets/Snitch/Parsers.py Outdated
Comment on lines +288 to +334
class SnitchMulParser(MulParser):
"""
Snitch-specific Mul Parser.
Inherits from Generic MulParser and adds shape/broadcasting information.
"""

def __init__(self):
super().__init__()

def parseNodeCtxt(self,
ctxt: NetworkContext,
node: gs.Node,
channels_first: bool = True) -> Tuple[NetworkContext, bool]:
"""
Extend Generic parser to add shape and broadcasting information.
"""
# Call parent method first
ctxt, ret = super().parseNodeCtxt(ctxt, node, channels_first)

if not ret:
return ctxt, False

# Get shape information
data_in_1 = ctxt.lookup(node.inputs[0].name)
data_in_2 = ctxt.lookup(node.inputs[1].name)
data_out = ctxt.lookup(node.outputs[0].name)

shape1 = list(data_in_1.shape)
shape2 = list(data_in_2.shape)
out_shape = list(data_out.shape)

# Store shape information
self.operatorRepresentation['shape1'] = shape1
self.operatorRepresentation['shape2'] = shape2
self.operatorRepresentation['out_shape'] = out_shape

# Calculate sizes
self.operatorRepresentation['size1'] = int(np.prod(shape1))
self.operatorRepresentation['size2'] = int(np.prod(shape2))

# Update output size (may differ due to broadcasting)
self.operatorRepresentation['size'] = int(np.prod(out_shape))

# Check if scalar broadcasting (input2 is scalar)
self.operatorRepresentation['is_scalar'] = (self.operatorRepresentation['size2'] == 1)

return ctxt, True

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Reject non-scalar broadcasting for Mul until a general broadcast kernel exists.

Otherwise the elementwise kernel will index beyond input2 for broadcasted shapes.

✅ Suggested guard
         # Check if scalar broadcasting (input2 is scalar)
         self.operatorRepresentation['is_scalar'] = (self.operatorRepresentation['size2'] == 1)
+
+        # Non-scalar broadcasting isn't supported by Mul_fp32
+        if shape1 != shape2 and not self.operatorRepresentation['is_scalar']:
+            return ctxt, False
🤖 Prompt for AI Agents
In `@Deeploy/Targets/Snitch/Parsers.py` around lines 288 - 334, In
SnitchMulParser.parseNodeCtxt, detect non-scalar broadcasting and reject the
node: after computing shape1, shape2, and out_shape and setting
size1/size2/size, add a guard that if shape2 (or shape1) does not exactly match
out_shape and operatorRepresentation['size2'] != 1 (i.e., input2 is not a
scalar) then treat this as unsupported broadcasting — set an error/unsupported
marker and return ctxt, False so the generic elementwise kernel is not used;
keep the existing operatorRepresentation fields (shape1, shape2, out_shape,
size*, is_scalar) and reference parseNodeCtxt, operatorRepresentation, size2,
and is_scalar when implementing the check.

Comment thread Deeploy/Targets/Snitch/Platform.py Outdated
Comment on lines +25 to +55
from Deeploy.Targets.PULPOpen.Platform import RQAddMapper
from Deeploy.Targets.Snitch.Parsers import SnitchGEMMParser, SnitchRQGEMMParser
from Deeploy.Targets.Snitch.Bindings import BasicDivBindings, BasicHardSwishBindings, BasicMulBindings, \
BasicRMSNormBindings, SnitchAddBindings, SnitchGemmBindings, SnitchiNoNormBindings, SnitchiSoftmaxBindings, \
SnitchRQAddBindings, SnitchRqGemmBindings
from Deeploy.Targets.Snitch.Parsers import HardSwishParser, SnitchDivParser, SnitchGEMMParser, SnitchMulParser, \
SnitchRMSNormParser, SnitchRQGEMMParser
from Deeploy.Targets.Snitch.Templates import AllocateTemplate, FreeTemplate
from Deeploy.Targets.Snitch.Tiler import SnitchAddTileReadyBindings, SnitchGemmTilingReadyBindings, \
SnitchiNoNormTilingReadyBindings, SnitchiSoftmaxTilingReadyBindings, SnitchRQAddTilingReadyBindings, \
SnitchRqGemmTilingReadyBindings

# =============================================================================
# Mappers for UNTILED mode (using BasicBindings with BasicTransformer)
# These are used by generateNetwork.py (testRunner_snitch.py)
# =============================================================================
GatherMapper = NodeMapper(GatherParser(), BasicGatherBindings)
Pad1DMapper = NodeMapper(Pad1DParser(), BasicPad1DBindings)
Pad2DMapper = NodeMapper(Pad2DParser(), BasicPad2DBindings)
UnsqueezeMapper = NodeMapper(UnsqueezeParser(), BasicReshapeBindings)
ReshapeMapper = NodeMapper(ReshapeParser(), BasicReshapeBindings)
TransposeMapper = NodeMapper(TransposeParser(), BasicTransposeBindings)
ConcatMapper = NodeMapper(ConcatParser(), BasicConcatBindings)

RQIntegerDivMapper = NodeMapper(RQIntegerDivParser(), [BasicRQIntegerDivBinding])

MatMulMapper = NodeMapper(MatMulParser(), BasicMatMulBindings)
GemmMapper = NodeMapper(SnitchGEMMParser(), SnitchGemmTilingReadyBindings)
RqGemmMapper = NodeMapper(SnitchRQGEMMParser(), SnitchRqGemmTilingReadyBindings)
iSoftmaxMapper = NodeMapper(iSoftmaxParser(), SnitchiSoftmaxTilingReadyBindings)
SoftmaxMapper = NodeMapper(SoftmaxParser(), SnitchiSoftmaxTilingReadyBindings)
iNoNormMapper = NodeMapper(iNoNormParser(), SnitchiNoNormTilingReadyBindings)
# These use TiledTransformer but work in both modes (original upstream behavior)
GemmMapper = NodeMapper(SnitchGEMMParser(), SnitchGemmBindings)
RqGemmMapper = NodeMapper(SnitchRQGEMMParser(), SnitchRqGemmBindings)
iSoftmaxMapper = NodeMapper(iSoftmaxParser(), SnitchiSoftmaxBindings)
SoftmaxMapper = NodeMapper(SoftmaxParser(), SnitchiSoftmaxBindings)
iNoNormMapper = NodeMapper(iNoNormParser(), SnitchiNoNormBindings)
iLayerNormMapper = NodeMapper(iLayerNormParser(), BasicLayerNormBindings)
RQAddMapper = NodeMapper(RQAddParser(), SnitchRQAddTilingReadyBindings)
AddMapper = NodeMapper(AddParser(), SnitchAddTileReadyBindings)
RQAddMapper = NodeMapper(RQAddParser(), SnitchRQAddBindings)
AddMapper = NodeMapper(AddParser(), SnitchAddBindings)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Avoid shadowing the imported RQAddMapper.

The local RQAddMapper = NodeMapper(...) redefines the imported symbol, which is already unused and triggers F811. Removing the import avoids confusion.

Proposed fix
-from Deeploy.Targets.PULPOpen.Platform import RQAddMapper
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
from Deeploy.Targets.PULPOpen.Platform import RQAddMapper
from Deeploy.Targets.Snitch.Parsers import SnitchGEMMParser, SnitchRQGEMMParser
from Deeploy.Targets.Snitch.Bindings import BasicDivBindings, BasicHardSwishBindings, BasicMulBindings, \
BasicRMSNormBindings, SnitchAddBindings, SnitchGemmBindings, SnitchiNoNormBindings, SnitchiSoftmaxBindings, \
SnitchRQAddBindings, SnitchRqGemmBindings
from Deeploy.Targets.Snitch.Parsers import HardSwishParser, SnitchDivParser, SnitchGEMMParser, SnitchMulParser, \
SnitchRMSNormParser, SnitchRQGEMMParser
from Deeploy.Targets.Snitch.Templates import AllocateTemplate, FreeTemplate
from Deeploy.Targets.Snitch.Tiler import SnitchAddTileReadyBindings, SnitchGemmTilingReadyBindings, \
SnitchiNoNormTilingReadyBindings, SnitchiSoftmaxTilingReadyBindings, SnitchRQAddTilingReadyBindings, \
SnitchRqGemmTilingReadyBindings
# =============================================================================
# Mappers for UNTILED mode (using BasicBindings with BasicTransformer)
# These are used by generateNetwork.py (testRunner_snitch.py)
# =============================================================================
GatherMapper = NodeMapper(GatherParser(), BasicGatherBindings)
Pad1DMapper = NodeMapper(Pad1DParser(), BasicPad1DBindings)
Pad2DMapper = NodeMapper(Pad2DParser(), BasicPad2DBindings)
UnsqueezeMapper = NodeMapper(UnsqueezeParser(), BasicReshapeBindings)
ReshapeMapper = NodeMapper(ReshapeParser(), BasicReshapeBindings)
TransposeMapper = NodeMapper(TransposeParser(), BasicTransposeBindings)
ConcatMapper = NodeMapper(ConcatParser(), BasicConcatBindings)
RQIntegerDivMapper = NodeMapper(RQIntegerDivParser(), [BasicRQIntegerDivBinding])
MatMulMapper = NodeMapper(MatMulParser(), BasicMatMulBindings)
GemmMapper = NodeMapper(SnitchGEMMParser(), SnitchGemmTilingReadyBindings)
RqGemmMapper = NodeMapper(SnitchRQGEMMParser(), SnitchRqGemmTilingReadyBindings)
iSoftmaxMapper = NodeMapper(iSoftmaxParser(), SnitchiSoftmaxTilingReadyBindings)
SoftmaxMapper = NodeMapper(SoftmaxParser(), SnitchiSoftmaxTilingReadyBindings)
iNoNormMapper = NodeMapper(iNoNormParser(), SnitchiNoNormTilingReadyBindings)
# These use TiledTransformer but work in both modes (original upstream behavior)
GemmMapper = NodeMapper(SnitchGEMMParser(), SnitchGemmBindings)
RqGemmMapper = NodeMapper(SnitchRQGEMMParser(), SnitchRqGemmBindings)
iSoftmaxMapper = NodeMapper(iSoftmaxParser(), SnitchiSoftmaxBindings)
SoftmaxMapper = NodeMapper(SoftmaxParser(), SnitchiSoftmaxBindings)
iNoNormMapper = NodeMapper(iNoNormParser(), SnitchiNoNormBindings)
iLayerNormMapper = NodeMapper(iLayerNormParser(), BasicLayerNormBindings)
RQAddMapper = NodeMapper(RQAddParser(), SnitchRQAddTilingReadyBindings)
AddMapper = NodeMapper(AddParser(), SnitchAddTileReadyBindings)
RQAddMapper = NodeMapper(RQAddParser(), SnitchRQAddBindings)
AddMapper = NodeMapper(AddParser(), SnitchAddBindings)
from Deeploy.Targets.Snitch.Bindings import BasicDivBindings, BasicHardSwishBindings, BasicMulBindings, \
BasicRMSNormBindings, SnitchAddBindings, SnitchGemmBindings, SnitchiNoNormBindings, SnitchiSoftmaxBindings, \
SnitchRQAddBindings, SnitchRqGemmBindings
from Deeploy.Targets.Snitch.Parsers import HardSwishParser, SnitchDivParser, SnitchGEMMParser, SnitchMulParser, \
SnitchRMSNormParser, SnitchRQGEMMParser
from Deeploy.Targets.Snitch.Templates import AllocateTemplate, FreeTemplate
# =============================================================================
# Mappers for UNTILED mode (using BasicBindings with BasicTransformer)
# These are used by generateNetwork.py (testRunner_snitch.py)
# =============================================================================
GatherMapper = NodeMapper(GatherParser(), BasicGatherBindings)
Pad1DMapper = NodeMapper(Pad1DParser(), BasicPad1DBindings)
Pad2DMapper = NodeMapper(Pad2DParser(), BasicPad2DBindings)
UnsqueezeMapper = NodeMapper(UnsqueezeParser(), BasicReshapeBindings)
ReshapeMapper = NodeMapper(ReshapeParser(), BasicReshapeBindings)
TransposeMapper = NodeMapper(TransposeParser(), BasicTransposeBindings)
ConcatMapper = NodeMapper(ConcatParser(), BasicConcatBindings)
RQIntegerDivMapper = NodeMapper(RQIntegerDivParser(), [BasicRQIntegerDivBinding])
# These use TiledTransformer but work in both modes (original upstream behavior)
GemmMapper = NodeMapper(SnitchGEMMParser(), SnitchGemmBindings)
RqGemmMapper = NodeMapper(SnitchRQGEMMParser(), SnitchRqGemmBindings)
iSoftmaxMapper = NodeMapper(iSoftmaxParser(), SnitchiSoftmaxBindings)
SoftmaxMapper = NodeMapper(SoftmaxParser(), SnitchiSoftmaxBindings)
iNoNormMapper = NodeMapper(iNoNormParser(), SnitchiNoNormBindings)
iLayerNormMapper = NodeMapper(iLayerNormParser(), BasicLayerNormBindings)
RQAddMapper = NodeMapper(RQAddParser(), SnitchRQAddBindings)
AddMapper = NodeMapper(AddParser(), SnitchAddBindings)
🧰 Tools
🪛 Ruff (0.14.14)

[error] 54-54: Redefinition of unused RQAddMapper from line 25: RQAddMapper redefined here

(F811)

🤖 Prompt for AI Agents
In `@Deeploy/Targets/Snitch/Platform.py` around lines 25 - 55, The file imports
RQAddMapper from Deeploy.Targets.PULPOpen.Platform but then defines a local
RQAddMapper = NodeMapper(...) which shadows the import and triggers F811; fix it
by removing RQAddMapper from the import list on the first line (or alternatively
rename the local NodeMapper variable, e.g., LocalRQAddMapper) so there is no
name collision between the imported symbol and the local definition; ensure any
other code referencing the external RQAddMapper is updated if you choose to
rename the local variable.

Comment on lines +42 to +56
for(uint32_t i=0;i<${batch};i++){
MatMul_s${A_type.referencedType.typeWidth}_s${B_type.referencedType.typeWidth}_s${data_out_type.referencedType.typeWidth}(
ref_${data_out}_${A},
ref_${data_out}_${B},
ref_${data_out}_${data_out},
${M},
${N},
${O},
${A_offset}, ${B_offset}, ${C_offset}
);

ref_${data_out}_${A} += ${M} * ${N};
ref_${data_out}_${B} += ${N} * ${O};
ref_${data_out}_${data_out} += ${M} * ${O};
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Search for MatMul-related parser and binding logic
rg -n "MatMul" Deeploy/Targets/Snitch -g '*.py' -A 3

Repository: pulp-platform/Deeploy

Length of output: 8307


🏁 Script executed:

# Search in Generic targets for broadcast/batch handling patterns
rg -n "MatMul" Deeploy/Targets/Generic -g '*.py' -A 3

Repository: pulp-platform/Deeploy

Length of output: 10799


🏁 Script executed:

# Examine the MatMulTemplate.py file directly
cat -n Deeploy/Targets/Snitch/Templates/MatMulTemplate.py

Repository: pulp-platform/Deeploy

Length of output: 2731


🏁 Script executed:

# Search for batch and broadcast handling in the codebase
rg -n "broadcast" Deeploy/Targets -g '*.py' -B 2 -A 2

Repository: pulp-platform/Deeploy

Length of output: 22048


🏁 Script executed:

# Read MatMulParser implementation
sed -n '1775,1850p' Deeploy/Targets/Generic/Parsers.py

Repository: pulp-platform/Deeploy

Length of output: 3828


🏁 Script executed:

# Check MatMulLayer validation
sed -n '191,225p' Deeploy/Targets/Generic/Layers.py

Repository: pulp-platform/Deeploy

Length of output: 1374


🏁 Script executed:

# Search for batch variable computation in MatMul context
rg -n "batch" Deeploy/Targets/Generic/Parsers.py -B 3 -A 3 | grep -A 5 -B 5 "MatMul"

Repository: pulp-platform/Deeploy

Length of output: 47


🏁 Script executed:

# Check if there's any broadcast constraint validation in the parser
rg -n "broadcast.*MatMul\|MatMul.*broadcast" Deeploy/Targets -g '*.py'

Repository: pulp-platform/Deeploy

Length of output: 47


Guard against broadcasted A/B in batched MatMul.
The template unconditionally advances A and B pointers each batch iteration. The parser already computes A_batched and B_batched flags to indicate which inputs carry the full batch dimension, but the template ignores them. When either input is broadcast (batch dimension = 1), those unconditional increments will skip past valid data and corrupt results.

Conditionally increment only inputs where the batched flag is true, similar to how FloatAddTemplate and other Snitch templates handle broadcasting.

🤖 Prompt for AI Agents
In `@Deeploy/Targets/Snitch/Templates/MatMulTemplate.py` around lines 42 - 56, The
loop always advances input pointers ref_${data_out}_${A} and
ref_${data_out}_${B} for each batch, which breaks broadcasting; update the
batched MatMul loop to only increment those pointers when the corresponding
A_batched or B_batched flag is true (leave ref_${data_out}_${data_out} always
incremented), i.e. after calling
MatMul_s${A_type.referencedType.typeWidth}_s${B_type.referencedType.typeWidth}_s${data_out_type.referencedType.typeWidth}(...)
wrap the pointer advances in conditional checks using the parser-provided
A_batched and B_batched flags so broadcasted inputs (batch dim == 1) are not
advanced.

Comment on lines +23 to +47
def addGeometricalConstraint(tilerModel: TilerModel, parseDict: Dict, ctxt: NetworkContext) -> TilerModel:

inputBufferName = parseDict['data_in']
outputBufferName = parseDict['data_out']

pointer: List[str] = []

for key, value in parseDict.items():
if not isinstance(value, str):
continue

if ctxt.is_global(value) or ctxt.is_local(value):
pointer.append(value)

# Add I/O dimensions to the model as variables
for bufferName in [inputBufferName, outputBufferName]:
_buffer = ctxt.lookup(bufferName)
tilerModel.addTensorDimToModel(ctxt, bufferName)

for idx, shapeDim in enumerate(_buffer.shape):
tilerModel.addConstraint(tilerModel.getTensorDimVar(tensorName = bufferName, dimIdx = idx) <= shapeDim)

# Constrain total elements to be equal
inputBuffer = ctxt.lookup(inputBufferName)
outputBuffer = ctxt.lookup(outputBufferName)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Remove unused buffer lookups to avoid Ruff F841.

inputBuffer and outputBuffer are assigned but never used.

🧹 Suggested cleanup
-        inputBuffer = ctxt.lookup(inputBufferName)
-        outputBuffer = ctxt.lookup(outputBufferName)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def addGeometricalConstraint(tilerModel: TilerModel, parseDict: Dict, ctxt: NetworkContext) -> TilerModel:
inputBufferName = parseDict['data_in']
outputBufferName = parseDict['data_out']
pointer: List[str] = []
for key, value in parseDict.items():
if not isinstance(value, str):
continue
if ctxt.is_global(value) or ctxt.is_local(value):
pointer.append(value)
# Add I/O dimensions to the model as variables
for bufferName in [inputBufferName, outputBufferName]:
_buffer = ctxt.lookup(bufferName)
tilerModel.addTensorDimToModel(ctxt, bufferName)
for idx, shapeDim in enumerate(_buffer.shape):
tilerModel.addConstraint(tilerModel.getTensorDimVar(tensorName = bufferName, dimIdx = idx) <= shapeDim)
# Constrain total elements to be equal
inputBuffer = ctxt.lookup(inputBufferName)
outputBuffer = ctxt.lookup(outputBufferName)
def addGeometricalConstraint(tilerModel: TilerModel, parseDict: Dict, ctxt: NetworkContext) -> TilerModel:
inputBufferName = parseDict['data_in']
outputBufferName = parseDict['data_out']
pointer: List[str] = []
for key, value in parseDict.items():
if not isinstance(value, str):
continue
if ctxt.is_global(value) or ctxt.is_local(value):
pointer.append(value)
# Add I/O dimensions to the model as variables
for bufferName in [inputBufferName, outputBufferName]:
_buffer = ctxt.lookup(bufferName)
tilerModel.addTensorDimToModel(ctxt, bufferName)
for idx, shapeDim in enumerate(_buffer.shape):
tilerModel.addConstraint(tilerModel.getTensorDimVar(tensorName = bufferName, dimIdx = idx) <= shapeDim)
# Constrain total elements to be equal
🧰 Tools
🪛 Ruff (0.14.14)

[warning] 30-30: Loop control variable key not used within loop body

Rename unused key to _key

(B007)


[error] 46-46: Local variable inputBuffer is assigned to but never used

Remove assignment to unused variable inputBuffer

(F841)


[error] 47-47: Local variable outputBuffer is assigned to but never used

Remove assignment to unused variable outputBuffer

(F841)

🤖 Prompt for AI Agents
In `@Deeploy/Targets/Snitch/TileConstraints/ReshapeTileConstraint.py` around lines
23 - 47, In addGeometricalConstraint remove the two unused local variables by
deleting the assignments "inputBuffer = ctxt.lookup(inputBufferName)" and
"outputBuffer = ctxt.lookup(outputBufferName)" (or alternatively use them if
needed); the current ctxt.lookup calls create unused values causing Ruff
F841—ensure any required buffer lookups are either consumed (e.g., used in
subsequent logic) or removed from the function body.

Comment on lines +83 to +134
for cube in outputCubes:
# Calculate the flat offset and size for the output cube
outSize = np.prod(cube.dims)
replacements["size"].append(outSize)

# For reshape, we need to map output cube to input cube
# Calculate flat index range for output cube
outOffset = 0
outStrides = []
stride = 1
for dim in reversed(outputShape):
outStrides.insert(0, stride)
stride *= dim

for i, (off, dim) in enumerate(zip(cube.offset, cube.dims)):
outOffset += off * outStrides[i]

# Convert flat offset to input coordinates
inStrides = []
stride = 1
for dim in reversed(inputShape):
inStrides.insert(0, stride)
stride *= dim

inOffset = []
remaining = outOffset
for i, stride in enumerate(inStrides):
inOffset.append(remaining // stride)
remaining = remaining % stride

# Calculate input cube dimensions
# For simplicity, treat as 1D cube in input space
inCubeDims = list(inputShape)
inCubeOffset = [0] * len(inputShape)

# Set the last dimension to the size, and offset based on flat index
totalSize = outSize
if len(inputShape) > 0:
# Compute proper input cube that covers the same elements
# Use a simple approach: linearize the input
inCubeOffset = list(inOffset)
inCubeDims = [1] * len(inputShape)
inCubeDims[-1] = min(totalSize, inputShape[-1] - inCubeOffset[-1])
remaining = totalSize - inCubeDims[-1]

for i in range(len(inputShape) - 2, -1, -1):
if remaining <= 0:
break
inCubeDims[i] = min(remaining // np.prod(inputShape[i + 1:]) + 1, inputShape[i])
remaining -= (inCubeDims[i] - 1) * np.prod(inputShape[i + 1:])

inputCube = HyperRectangle(tuple(inCubeOffset), tuple(inCubeDims))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Reshape tile → input cube mapping can drop or mis-map elements.

The current construction of inCubeOffset/inCubeDims assumes a contiguous flat segment always maps to a single axis-aligned input hyperrectangle. When an output tile crosses input row/plane boundaries, inCubeDims can under-cover outSize or include unrelated elements, leading to incorrect DMA loads. Consider either constraining tiling to contiguous last-dimension segments (or a flattened 1D representation) or splitting such tiles into multiple input cubes.

🧰 Tools
🪛 Ruff (0.14.14)

[warning] 97-97: Loop control variable dim not used within loop body

(B007)


[warning] 97-97: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


[warning] 109-109: Loop control variable i not used within loop body

(B007)

🤖 Prompt for AI Agents
In `@Deeploy/Targets/Snitch/TileConstraints/ReshapeTileConstraint.py` around lines
83 - 134, The mapping from output tiles to input HyperRectangle(s) in the loop
over outputCubes (using outputShape, inputShape, outOffset, outSize, inStrides,
inCubeOffset, inCubeDims and creating HyperRectangle) is incorrect for tiles
that cross input dimension boundaries; fix by detecting when the linear range
[outOffset, outOffset+outSize) spans multiple input strides and emit one or more
input HyperRectangle pieces that together cover exactly outSize elements instead
of assuming a single axis-aligned block. Concretely, replace the simplistic
inCubeDims/inCubeOffset construction with logic that: 1) computes the
remainingCount = outSize and a cursor = outOffset, 2) in a loop maps cursor to
multi-dimensional coords using inStrides to get the start coord, determines the
maximal contiguous run length along the last (fastest) dimension without
crossing that dimension limit (using inputShape and start coord), 3) create a
HyperRectangle for that run, subtract its size from remainingCount, advance
cursor by run length, and repeat until remainingCount == 0; ensure the created
HyperRectangle tuples (offset, dims) sum exactly to outSize and add all pieces
instead of a single inputCube.

Comment thread TargetLibraries/Snitch/inc/kernel/HardSwish.h
Comment thread TargetLibraries/Snitch/src/Div_fp32.c Outdated
Comment on lines +10 to +49
* Element-wise Division (FP32)
*
* Computes: output[i] = input1[i] / input2[i]
*
* Supports ONNX broadcasting rules:
* - If input2 is scalar (size=1): divides all elements of input1 by input2[0]
* - If both have same size: element-wise division
*
* input1: Numerator tensor (float32)
* input2: Denominator tensor (float32)
* output: Output tensor (same shape as input1)
* size: Total number of elements in input1
*
* multi-core = yes
* parallelization = element-wise across input1
*/
void Div_fp32(float32_t *input1, float32_t *input2, float32_t *output,
uint32_t size) {

uint32_t core_id = snrt_global_compute_core_idx();
uint32_t numThreads = snrt_global_compute_core_num();

// Parallelize across elements
uint32_t elements_per_core = size / numThreads;
uint32_t remainder = size % numThreads;

uint32_t start_elem, num_elems;
if (core_id < remainder) {
num_elems = elements_per_core + 1;
start_elem = core_id * num_elems;
} else {
num_elems = elements_per_core;
start_elem = core_id * elements_per_core + remainder;
}

// Check if input2 is a scalar (size=1, broadcasted)
// Note: This assumes the parser has set input2_size correctly
// For now, we assume element-wise division (same size)
for (uint32_t i = start_elem; i < start_elem + num_elems; i++) {
output[i] = input1[i] / input2[i];

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Align Div_fp32 docs with the actual scalar handling.
The function always uses input2[i], so the broadcast claim is misleading and could mask out-of-bounds access if a scalar is ever routed here. Consider documenting that Div_fp32_scalar is required for scalar inputs.

💡 Suggested doc fix
- * Supports ONNX broadcasting rules:
- * - If input2 is scalar (size=1): divides all elements of input1 by input2[0]
- * - If both have same size: element-wise division
+ * Expects input1 and input2 to have identical sizes (element-wise division).
+ * Use Div_fp32_scalar for scalar broadcasting.
@@
-  // Check if input2 is a scalar (size=1, broadcasted)
-  // Note: This assumes the parser has set input2_size correctly
-  // For now, we assume element-wise division (same size)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
* Element-wise Division (FP32)
*
* Computes: output[i] = input1[i] / input2[i]
*
* Supports ONNX broadcasting rules:
* - If input2 is scalar (size=1): divides all elements of input1 by input2[0]
* - If both have same size: element-wise division
*
* input1: Numerator tensor (float32)
* input2: Denominator tensor (float32)
* output: Output tensor (same shape as input1)
* size: Total number of elements in input1
*
* multi-core = yes
* parallelization = element-wise across input1
*/
void Div_fp32(float32_t *input1, float32_t *input2, float32_t *output,
uint32_t size) {
uint32_t core_id = snrt_global_compute_core_idx();
uint32_t numThreads = snrt_global_compute_core_num();
// Parallelize across elements
uint32_t elements_per_core = size / numThreads;
uint32_t remainder = size % numThreads;
uint32_t start_elem, num_elems;
if (core_id < remainder) {
num_elems = elements_per_core + 1;
start_elem = core_id * num_elems;
} else {
num_elems = elements_per_core;
start_elem = core_id * elements_per_core + remainder;
}
// Check if input2 is a scalar (size=1, broadcasted)
// Note: This assumes the parser has set input2_size correctly
// For now, we assume element-wise division (same size)
for (uint32_t i = start_elem; i < start_elem + num_elems; i++) {
output[i] = input1[i] / input2[i];
* Element-wise Division (FP32)
*
* Computes: output[i] = input1[i] / input2[i]
*
* Expects input1 and input2 to have identical sizes (element-wise division).
* Use Div_fp32_scalar for scalar broadcasting.
*
* input1: Numerator tensor (float32)
* input2: Denominator tensor (float32)
* output: Output tensor (same shape as input1)
* size: Total number of elements in input1
*
* multi-core = yes
* parallelization = element-wise across input1
*/
void Div_fp32(float32_t *input1, float32_t *input2, float32_t *output,
uint32_t size) {
uint32_t core_id = snrt_global_compute_core_idx();
uint32_t numThreads = snrt_global_compute_core_num();
// Parallelize across elements
uint32_t elements_per_core = size / numThreads;
uint32_t remainder = size % numThreads;
uint32_t start_elem, num_elems;
if (core_id < remainder) {
num_elems = elements_per_core + 1;
start_elem = core_id * num_elems;
} else {
num_elems = elements_per_core;
start_elem = core_id * elements_per_core + remainder;
}
for (uint32_t i = start_elem; i < start_elem + num_elems; i++) {
output[i] = input1[i] / input2[i];
🤖 Prompt for AI Agents
In `@TargetLibraries/Snitch/src/Div_fp32.c` around lines 10 - 49, The function
Div_fp32 currently performs element-wise division using input2[i] and does not
handle scalar broadcasting; update the top-of-file comment for Div_fp32 to
remove the "If input2 is scalar (size=1): divides all elements..." claim and
clearly state that Div_fp32 assumes input1 and input2 have the same size
(element-wise) and that scalar denominators must call Div_fp32_scalar (or
another scalar-specific routine); additionally, add a short runtime assert/check
in Div_fp32 (referencing the variables size, input2, and function name Div_fp32)
or document that input2_size must equal size to prevent out-of-bounds access if
scalar inputs are routed incorrectly.

Comment thread TargetLibraries/Snitch/src/Mul_fp32.c
@diaconuccalin diaconuccalin self-requested a review February 2, 2026 12:52
@Victor-Jung

Copy link
Copy Markdown
Member

Super cool stuff! 🔥🔥🔥

@diaconuccalin diaconuccalin marked this pull request as draft February 2, 2026 14:22

@diaconuccalin diaconuccalin left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have only managed to look through 40/56 of all the changed files. I will continue tomorrow to dive into tiling, tests, and the other remaining components. In the meantime, I've already left some comments on parts that need fixes, so I don't block you and you can start before I finish my review.

Comment thread .github/workflows/ci-platform-snitch-tiled.yml
Comment thread .github/workflows/ci-platform-snitch.yml
Comment thread Deeploy/Targets/Generic/Layers.py
Comment thread Deeploy/Targets/Generic/Layers.py Outdated
super().__init__()

def parseNode(self, node: gs.Node) -> bool:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remember to run make format to fix these weird blank line changes

Comment thread TargetLibraries/Snitch/src/RMSNrom_fp32.c
Comment thread TargetLibraries/Snitch/src/CycleCounter.c Outdated
Comment thread TargetLibraries/Snitch/src/Add_fp32.c Outdated
Comment thread TargetLibraries/Snitch/inc/kernel/Add.h Outdated
Comment thread DeeployTest/testUtils/typeMapping.py Outdated

@diaconuccalin diaconuccalin left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have left comments for the remaining files from yesterday.

I would also like to have more extensive test for the broadcasting operators. Please generate new tests (and include them in the CI pipeline) for all the new snitch kernels with broadcast (add, div, mul) covering all the edge cases (in terms of inputs dimensions) from DeeployTest/Tests/Kernels/FP32/ReduceMean and from the broadcasting documentations of Torch and ONNX.

Comment thread Deeploy/Targets/Snitch/TileConstraints/FloatDivTileConstraint.py Outdated
Comment thread Deeploy/Targets/Snitch/TileConstraints/FloatDivTileConstraint.py Outdated
Comment thread Deeploy/Targets/Snitch/TileConstraints/FloatDivTileConstraint.py Outdated
Comment thread Deeploy/Targets/Snitch/TileConstraints/FloatDivTileConstraint.py Outdated
Comment thread Deeploy/Targets/Snitch/TileConstraints/ReshapeTileConstraint.py Outdated
Comment thread Deeploy/Targets/Snitch/Parsers.py Outdated
Comment thread Deeploy/Targets/Snitch/Parsers.py Outdated
Comment thread DeeployTest/Tests/Kernels/FP32/RMSNorm/single_fused_op/network.onnx

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, we need more extensive testing:

  • multiple values for axis (both positive and negative, not just the default '-1')
  • 1 and non-1 all values for scale

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lee2716 this was not addressed. I see many of the tests only having a value of "1" in the weights

Comment thread DeeployTest/Tests/Models/microLlama/microLlama_fp32_1/network.onnx Outdated
@Xeratec Xeratec added this to the Release 0.2.2 milestone Feb 3, 2026
@Xeratec Xeratec added this to Deeploy Feb 3, 2026
@Xeratec Xeratec added the Feature Addition of new features label Feb 3, 2026
@Xeratec Xeratec moved this to In progress in Deeploy Feb 3, 2026
@lee2716 lee2716 force-pushed the devel branch 8 times, most recently from 8eaf843 to 9426ccf Compare February 9, 2026 17:04

@diaconuccalin diaconuccalin left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Submitting the first part of the second round of reviews so that it doesn't get lost. Will continue tomorrow

Comment thread DeeployTest/Tests/Models/microLlama/FP32/microLlama1/network.onnx
Comment thread .github/workflows/ci-platform-snitch-tiled.yml
Comment thread .github/workflows/ci-platform-snitch.yml
Comment thread Deeploy/Targets/Generic/TileConstraints/NOPTileConstraint.py
Comment thread Deeploy/Targets/Snitch/Templates/FloatSoftmaxTemplate.py Outdated
Comment thread Deeploy/Targets/Snitch/Templates/GatherTemplate.py Outdated
Comment thread Deeploy/Targets/Snitch/Templates/ReshapeTemplate.py Outdated
Comment thread Deeploy/Targets/Snitch/Templates/TransposeTemplate.py
Comment thread Deeploy/Targets/Snitch/Bindings.py

@diaconuccalin diaconuccalin left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Part 2/3 of the second review. Will come back to the last 12 files tomorrow.

Comment thread Deeploy/Targets/Snitch/TileConstraints/FloatScalarBOPTileConstraint.py Outdated
Comment thread Deeploy/Targets/Snitch/TileConstraints/FloatMulTileConstraint.py Outdated
Comment thread Deeploy/Targets/Snitch/Bindings.py Outdated
Comment thread Deeploy/Targets/Snitch/Bindings.py
Comment thread Deeploy/Targets/Snitch/Bindings.py Outdated
Comment thread Deeploy/Targets/Snitch/Parsers.py Outdated
Comment thread Deeploy/Targets/Snitch/Parsers.py Outdated
Comment thread Deeploy/Targets/Snitch/Parsers.py Outdated
Comment thread Deeploy/Targets/Snitch/Parsers.py Outdated
Comment thread Deeploy/Targets/Snitch/Parsers.py Outdated

@diaconuccalin diaconuccalin left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The second round of review is done :)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The weights in this op are all 1, even if they are FP32. Let's adjust these values to more interesting ones (non-int)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lee2716 this was not addressed

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking into this file, I think is related to quantized models, I don't think it should exist for FP32 models, we should delete this

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lee2716 this was not addressed

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed f2f, let's also use more interesting (non-int) values here as well, for FP32 weights.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lee2716 this was not addressed

print(f"Warning: Empty input array for type inference for {values}!")
return default

# For all-zero arrays, use original dtype to distinguish int vs float

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Victor-Jung @Xeratec These look fine to me, but since it's closer to Deeploy-Core, maybe take a closer look that it doesn't break the function's intended functionality.

Comment thread DeeployTest/test_platforms.py
Comment thread TargetLibraries/Snitch/src/Add_fp32.c Outdated
Add support for FP32 operators required by MicroLlama model:
- RMSNorm: Fused RMS normalization
- HardSwish: Activation function
- Div: Element-wise division
- Mul: Element-wise multiplication
- MatMul: Matrix multiplication
- Add: Element-wise addition (FP32 support)
- Reshape, Transpose, Concat, Gather: Shape operations

Components added:
- Generic: Parsers, TypeCheckers, Layers, Bindings
- Snitch Templates: FloatAdd, FloatDiv, FloatHardSwish, FloatMul,
  FloatRMSNorm, FloatMatMul, Reshape, Transpose, Gather
- Snitch Kernels: C implementations for all FP32 operators
- Test data: Hardswish, RMSNorm_fused kernels, microLlama_fp32_1 model

This enables running MicroLlama FP32 model on Snitch in untiled mode:
  python testRunner_snitch.py -t Tests/Models/microLlama/microLlama_fp32_1
Add SnitchTiledPlatform with TileConstraints for FP32 operators:
- FloatDivTileConstraint: Division tiling with scalar broadcast
- FloatMulTileConstraint: Multiplication tiling with scalar broadcast
- ReshapeTileConstraint: Pass-through tiling for reshape

Updates:
- SnitchClusterTiling with tiled code transformation passes
- Tiler.py with new tile constraints registration
- platformMapping.py adds Snitch_tiled platform
- testRunner_tiled_snitch.py for tiled model testing
- CI workflows for both untiled and tiled Snitch
Previously the Snitch platform had separate tiled/untiled mappings
(SnitchMapping vs SnitchTiledMapping), but mapPlatform() only returned
SnitchPlatform with the untiled mapping, causing tiled tests to fail
with missing tileConstraint. Unified to a single mapping using
TilingReadyBindings for all operators, matching the upstream pattern
used by PULPOpen.
The binding was lost when switching from BasicMatMulBindings to
Snitch-specific SnitchMatMulBindings, which only had FP32 MatMul.
- Switch Snitch GatherTemplate to Generic version (identical after macros
  fix, removes code duplication)
- Revert NOPTileConstraint.py to devel (serializeTilingSolution not needed
  since Snitch Reshape uses SkipTransformer)
- Simplify SnitchRMSNormParser/SnitchHardSwishParser via inheritance from Generic
- Remove unnecessary TilingVariableReplacement("L2") from TiledTransformer
- Delete unused Snitch GatherTemplate.py (now using Generic)
- Remove redundant comments in FloatAdd/FloatMul/FloatRMSNorm templates
- CI: split snitch-kernels and snitch-models into separate jobs (matching Siracusa pattern)
- Revert out-of-scope LoweringOptimizationPasses.py change
- Merge duplicate FloatDivTileConstraint/FloatMulTileConstraint into FloatScalarBOPTileConstraint
- Consolidate TileConstraint imports in Tiler.py
- Remove unused imports (NodeParser in Parsers.py, BasicConcatBindings in Platform.py)

@diaconuccalin diaconuccalin left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice idea to move the barrier to SnitchCoreFilterPass/SnitchSynchCoresPass :) maybe we can extend this adaptation to other targets as well and extend reuse of the generic kernels.

Otherwise, very close to moving the PR out of draft mode :)

Comment thread .github/workflows/ci-platform-snitch-tiled.yml
# Check if scalar broadcasting
is_scalar = operatorRepresentation.get('is_scalar', False)

# IMPORTANT: Must recompile self.template (Mako Template object),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand this comment. Is this a warning for future developers, is there a recompile step they need to do?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's my debug notes, when I writing switch template I used:
if is_scalar:
self.templateStr = FloatDivScalarTemplateStr
else:
self.templateStr = FloatDivTemplateStr
instead of using:
self.template = MakoTemplate(FloatDivScalarTemplateStr, strict_undefined=True)
Then generate() functions read the "self.template" instead of self.templateStr
I think I can delete this.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see this change has not been reverted

Comment thread Deeploy/Targets/Snitch/Templates/FloatDivTemplate.py
Comment thread Deeploy/Targets/Snitch/Templates/FloatMatMulTemplate.py
# Check if scalar broadcasting
is_scalar = operatorRepresentation.get('is_scalar', False)

# IMPORTANT: Must recompile self.template (Mako Template object),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, it's unclear to me the meaning of this comment + moving this under Generic.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lee2716 same here

Comment thread Deeploy/Targets/Snitch/Bindings.py
class FloatScalarBOPTileConstraint(TileConstraint):
"""Tile constraint for binary operators with scalar broadcasting support.

Extends BOPTileConstraint with scalar handling: when one input has size 1,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this extra handle needed, since the tiler can take care of this? And if it is indeed needed, why not extend the Generic tile constraint with this handling instead of using a separate snitch one?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tiler can not take care of this, But It's good idea to extend the Generic tile constraint.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lee2716 this doesn't seem to have been addressed

Comment thread DeeployTest/Platforms/Snitch/main.c Outdated
uint32_t const num_compute_cores = snrt_global_compute_core_num();
#endif

// All cores call InitNetwork: allocations inside are DM-core guarded

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not very clear to me why this move is needed, can you please explain?

@lee2716 lee2716 May 9, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add more explanation in the code comment, Calling InitNetwork only from snrt_is_dm_core() would deadlock the DM core on the first internal barrier.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lee2716 I see no new explanations here

Comment thread TargetLibraries/Snitch/src/Add_fp32.c Outdated

#include "DeeploySnitchMath.h"

void Add_fp32(float32_t *pIn1, float32_t *pIn2, float32_t *pOut,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need a separate adder? Shouldn't the broadcast version be able to handle a "non-broadcast" cast?

@lee2716 lee2716 May 9, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In order to improve the performance, we need a separate adder, if we use broadcast version in microllama it would wast around 100000 cycles. I can try to merge them by adding a if ..else.. which is a good balance btw code size and performance and only wasted 10000 cycles.

@diaconuccalin diaconuccalin Jun 27, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this mean that the broadcast version is not efficient? And please be more specific, 100k/10k extra cycles for what input sizes?

Comment thread TargetLibraries/Snitch/src/Softmax_fp32.c
lee2716 added 9 commits May 5, 2026 10:54
# Conflicts:
#	Deeploy/Targets/Snitch/Bindings.py
Restore PR pulp-platform#144's per-test-type job structure that was lost in PR pulp-platform#153:
- ci-platform-snitch-tiled.yml: keep snitch-kernels-tiled-singlebuffer-L2 with
  marker 'kernels and singlebuffer and l2', add new snitch-models-tiled-
  singlebuffer-L2 job with marker 'models and singlebuffer and l2'.
- ci-platform-snitch.yml: quote the kernels and models markers for consistency
  with other CI workflows.
…reuse Generic ReshapeTemplate

FloatDivTemplate / FloatMulTemplate (Snitch):
Replace the alignToContext-based template switching (which mutated the
module-level NodeTemplate singleton's self.template) with a single Mako
template that branches on `is_scalar` provided by the parser. This
eliminates a hidden state bug where, in networks containing both scalar
and element-wise Div/Mul nodes, the bind phase's last alignToContext
call would overwrite the template for all earlier-bound nodes; the
subsequent generate phase would then render them with whichever
template won the race. Same pattern as FloatPowTemplate.py.

ReshapeTemplate (Snitch):
Collapse to a re-export of the Generic referenceTemplate. The previous
Snitch subclass only added `bufferOut._alias = bufferIn.name`, which
PULPOpen carries as the same workaround.

ReshapeTemplate (Generic):
Set `bufferOut._alias = bufferIn.name` in alignToContext so the legacy
single-valued attribute that the TilingExtension still reads is
populated centrally, removing the need to carry this workaround in
per-platform Reshape subclasses (Snitch and PULPOpen both did).
…n/devel style

In Tiler.py, pull GemmTileConstraint and RqGemmTileConstraint out of the
package-level import line and import each via its full submodule path,
and drop the corresponding two re-exports from TileConstraints/__init__.py.
This only restores the import style that was on origin/devel before this PR --
no functional change, and no other files affected.
Replace the inline "for good load balancing" comment, which was misleading
(both heuristics saturate all 8 cores when multiple dims >= 8), with the
actual rationale: outer-dim parallelization gives each core a single
contiguous slab of memory, simplifying the access pattern and reducing
per-core loop-control overhead vs fragmenting work across an inner dim.
The largest-dim fallback only triggers when no dim is big enough for
every core, to minimize idle cores.
…Constraint

BOPTileConstraint previously asserted dim equality across all three
tensors, which made it incompatible with binary ops where one input is
scalar (size 1). The Snitch FP32 Div/Mul path worked around this with
its own FloatScalarBOPTileConstraint subclass that detected is_scalar
and added the right constraints.

Push the scalar-aware logic into BOPTileConstraint itself: same-shape
inputs follow the existing element-wise path, scalar input2 keeps the
second tensor full-size and ties input1/output dims together. The
Snitch wrapper class is no longer needed -- Snitch Tiler.py now uses
Generic MulTileConstraint (a thin BOPTileConstraint subclass that
overrides the parser tensor names to A/B/C, which matches what
SnitchDivParser/SnitchMulParser produce).

Other BOPTileConstraint subclasses (Generic AddTileConstraint /
MulTileConstraint, PULPOpen SGDTileConstraint / GeluGradTileConstraint)
inherit the new is_scalar branch; their existing same-shape behavior
is unchanged.
Replace the vague "for barrier balance" comment with the actual reason:
InitNetwork now contains snrt_cluster_hw_barrier() calls after each
DM-core-guarded allocation. The guard + barrier is what makes the
allocation multi-core safe -- snrt_l1alloc / snrt_l3alloc would
otherwise bump a cluster-shared pointer once per core and leave each
core with a divergent address for the same global. Calling InitNetwork
only from snrt_is_dm_core() would deadlock the DM core on the first
internal barrier waiting for compute cores that never entered.
…wise fast path

Collapse the two FP32 Add kernels (simple Add_fp32 and Add_fp32_broadcast)
into a single Add_fp32_broadcast that detects identity strides upfront
and falls through to a plain elementwise loop. The stride check is an
ndim-step compare (a handful of cycles per call); the inner loop is
unchanged for the elementwise case, so no per-element overhead is added.

FloatAddTemplate now always synthesizes natural strides + out_shape +
ndim from the parser's data_out shape (instead of only on the
need_broadcast path), so every Add call goes through the unified kernel.

Measured impact on untiled microLlama1 (80 elementwise Adds, sizes
32/64/256, ndim 3-4): runtime went from 2,224,436 to 2,237,793 cycles
(+0.6%), which matches the expected stride-check and per-call array
setup overhead.
bufferIn.aliases.add(bufferOut.name)
bufferOut.aliases.add(bufferIn.name)

# Tiling still reads the legacy single-valued `_alias` attribute

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be better practice to do the fix in the tiler (prevent it from relying on _alias and instead use the new alises parameter)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This workload is too heavy, it would touch the deeploy core:
Deeploy/DeeployTypes.py
Deeploy/TilingExtension/TilerExtension.py
Deeploy/TilingExtension/MemoryScheduler.py
Deeploy/Targets/Generic/Templates/ReshapeTemplate.py
Deeploy/Targets/PULPOpen/Templates/ReshapeTemplate.py

@lee2716 lee2716 Jun 4, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest to open a new pr to fix this

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's get a second opinion on this one from @Victor-Jung maybe

Comment thread Deeploy/Targets/Generic/TileConstraints/BOPTileConstraint.py
Comment thread Deeploy/Targets/Generic/TileConstraints/BOPTileConstraint.py
Comment thread Deeploy/Targets/Generic/TileConstraints/BOPTileConstraint.py
Div_fp32_scalar(${A}, scalar, ${C}, ${size});
}
% else:
Div_fp32(${A}, ${B}, ${C}, ${size});

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the 2 functions, I think we can avoid some code duplication if we combine the 2 kernels in a single function, with an extra is_scalar function attribute, with an if over this parameter at the end of the kernel that changes switches between the 2 possible loops

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lee2716 I think that this was not addressed


# Reshape on Snitch reduces to a pointer alias (data is reinterpreted, not
# copied). The Generic implementation already covers this and now also sets
# the legacy `_alias` attribute required by the tiling extension, so we just

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, let's adjust the tiling to stop using the legacy _alias instead of doing workarounds

# loop-control overhead vs. fragmenting work across an inner dim.
# Fall back to the largest dim only when no dim is big enough for
# every core, so as few cores idle as possible.
parallelDims = [idx for idx, dim in enumerate(data_out_shape) if dim >= 8]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if it's good practice to have so much logic in here, do so many operations. Usually, these kind of complex operator adaptations are done in the parsers, but maybe let's wait for a second opinion from the other reveiwers.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly, I had the same hesitation when I wrote this

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then, if you also agree, let's adjust this

Comment thread Deeploy/Targets/Snitch/Bindings.py
Comment thread Deeploy/Targets/Snitch/Templates/ReshapeTemplate.py Outdated
Comment thread TargetLibraries/Snitch/src/Add_fp32.c
lee2716 added 8 commits June 4, 2026 13:00
Definition was inconsistent with the scalar broadcasting note below it and
narrower than ONNX's binary op concept. Also added a warning that only
scalar input2 is handled.
Drop Div_fp32_scalar / Mul_fp32_scalar. Single kernel with a runtime
is_scalar branch at the end -- avoids duplicating the per-core chunk
setup. Scalar branch keeps the 1/input2[0] hoist.
Import ReshapeTemplate from Generic.Templates directly in Bindings.py
instead of going through a Snitch wrapper file that only re-exported
the Generic referenceTemplate. Removes the noqa: F401 marker and one
layer of indirection.
Cover the (N,M) + (1,) -> (N,M) path that exercises BOPTileConstraint's
is_scalar branch and the is_scalar=1 path of the merged kernels.
Registered for both untiled and tiled (L1 = 2000/5000/10000) CI.
SnitchDivParser and SnitchMulParser were identical copies, and both set
is_scalar whenever either input had size 1. But Div_fp32/Mul_fp32 only
read input2[0], so an input1-scalar gave wrong/OOB results, and there's
no broadcasting kernel at all.

Share one mixin: set is_scalar only for a scalar input2 and reject real
broadcasts so they fail to bind instead of emitting OOB reads. This lines
up with BOPTileConstraint, which already treats scalar as input2 size 1.
Add SSR + FREP accelerated FP32 kernels for the Snitch tiled flow, where
DMA stages operands into TCDM/L1 (SSR can only stream from cluster memory):

- matmul_fp32_ssr_frep: M-parallel, splits rows across compute cores.
- matmul_fp32_ssr_frep_oparallel: O-parallel, keeps all cores busy when
  M=1 (autoregressive decode).
- RMSNorm_fp32_ssr_frep: SSR+FREP on the sum-of-squares reduction only
  (register accumulate, no DM2 write stream); element-wise scale and odd
  lastDimLength fall back to scalar.

The shared 8-column FREP inner sequence is factored into a single
static inline _matmul_fp32_ssr_frep_otile helper used by both matmul
variants. Add a v2f32 packed-pair typedef to DeeploySnitchMath.h for the
64-bit FPU register width.

Wire the new kernels through ssrFrep NodeTemplates and switch the tiled
FP32 MatMul and RMSNorm bindings to them.
Resolves two add/add conflicts from parallel development:
- Generic/Layers.py: keep both sides' new layers; use devel's
  HardSwishLayer (the MAGIA pulp-platform#193 version) and drop the duplicate
  microLlama-side definition; keep the new RMSNormLayer.
- Generic/Parsers.py: keep both compute_broadcast_strides and
  devel's UnaryElementWiseParser.
Keep the non-tiled FP32 MatMul on the scalar template.

@diaconuccalin diaconuccalin left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some leftover unadressed comments plus very few new ones. Also, please be sure to run make format, currently the format CI doesn't pass.

Other than this, I think it's ready to be pulled out of draft mode and receive a review also from one of the maintainers. Thanks a lot for your contribution and for your patience with the rather long review process :)

return ctxt, operatorRepresentation, []


FloatRMSNormTemplateStr = r"""

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the regular template used anywhere anymore?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move the Kernels/FP32/Div test leftovers (inputs, onnx, outputs) in a new subdir following the same pattern as other tests (Kernels/FP32/Div/Regular)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move the Kernels/FP32/Mul test leftovers (inputs, onnx, outputs) in a new subdir following the same pattern as other tests (Kernels/FP32/Mul/Regular)


L2_SINGLEBUFFER_MODELS = {}
L2_SINGLEBUFFER_MODELS = {
"Models/microLlama/FP32/microLlama1": [10000, 20000],

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does any tiling happen in the model at the 10000 L2 limit?

* multi-core = yes (splits O-tiles across cores; keeps cores busy when M=1)
* accel = SSR streams (DM0/DM1) + FREP 8x FMA (v2f32)
*/
void matmul_fp32_ssr_frep_oparallel(const float32_t *__restrict__ pSrcA,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does Snitch use all 3 versions of the matmul defined here?

@diaconuccalin diaconuccalin marked this pull request as ready for review June 27, 2026 15:10
@diaconuccalin diaconuccalin requested a review from runwangdl as a code owner June 27, 2026 15:10

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 8

🧹 Nitpick comments (1)
Deeploy/Targets/Snitch/Templates/TransposeTemplate.py (1)

72-82: 🚀 Performance & Scalability | 🔵 Trivial | ⚡ Quick win

Derive the parallelization threshold from the configured core count.

The generated C already reads snrt_cluster_compute_core_num(), but this heuristic still selects the parallel axis with dim >= 8. On non-8-core Snitch configs that picks the wrong axis and can leave a lot of cores idle. Please thread the platform's compute-core count into the template instead of hard-coding 8.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Deeploy/Targets/Snitch/Templates/TransposeTemplate.py` around lines 72 - 82,
The parallel-axis selection in TransposeTemplate currently hard-codes a
threshold of 8, which should instead come from the platform’s compute-core
count. Update the logic around parallelDims/parallelDim to use the configured
core count value already available from the generated C path via
snrt_cluster_compute_core_num(), so the heuristic matches the target Snitch
configuration. Keep the same fallback behavior to
data_out_shape.index(max(data_out_shape)) when no dimension is large enough, but
replace the fixed 8 check with the dynamic core-count-derived threshold.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@Deeploy/DeeployTypes.py`:
- Around line 526-535: The NetworkContext constructor currently accepts a name
but does not propagate it into the context state used by _mangle(), so
non-default network names are ignored. Update the NetworkContext.__init__ path
to store the provided name and ensure both NetworkContext construction sites
pass the intended name through so generated symbols use that value instead of
always defaulting to DeeployNetwork.

In `@Deeploy/Targets/Generic/Bindings.py`:
- Around line 185-186: Rename the comprehension loop variable in the NodeBinding
collection from type to a non-builtin name to satisfy Ruff A001; update the
generator expression that builds the FloatMaxPoolTemplate.reference1DTemplate
bindings and keep the rest of the MaxPoolChecker/PointerClass logic unchanged.

In `@Deeploy/Targets/Generic/Layers.py`:
- Around line 726-740: The op-count in computeOps is using NormalizedAxesSize
where it should scale by the number of RMS groups/slices. Update the formula in
Deeploy/Targets/Generic/Layers.py’s computeOps to derive the group count from
the input shape and normalized axes, then use that count for the reduce/add/sqrt
terms instead of treating NormalizedAxesSize as the number of groups. Keep the
XSquared and Normalize/Mul contributions tied to inputSize, and only replace the
reduction-related part so nodeOps matches shapes like [B, S, D] with axis=-1.

In `@Deeploy/Targets/Generic/Parsers.py`:
- Around line 131-133: The RMSNorm parser is assuming `node.inputs[1]` is always
a constant, which can raise an `AttributeError` for dynamic scale inputs. Update
the RMSNorm handling in `Parsers.py` to keep the existing `gs.Constant` fast
path for `self.operatorRepresentation['scale']`, but set a sentinel such as
`None` when the scale input is not constant. Then adjust
`RMSNormLayer.computeOps()` so it interprets `scale is None` as the non-unit
case and continues cleanly without dereferencing `values`.
- Around line 553-567: The broadcast metadata is being skipped because `shape1`
and `shape2` are read after `ONNXLayer.parse()`/`parseNodeCtxt()` has already
normalized variable inputs via `broadcast()` and `AddLayer.computeShapes()`.
Update the broadcast check in this parser block to detect the original
variable-input broadcast case before shapes are rewritten, using the
pre-broadcast operand shapes available in
`parseNodeCtxt()`/`operatorRepresentation`, so `need_broadcast`, `ndim`,
`strides1`, `strides2`, and `out_shape` are still emitted for non-constant
broadcasts.

In `@Deeploy/Targets/Generic/TypeCheckers.py`:
- Around line 629-635: Update _inferSignedness in TypeCheckers.py so RMSNorm
signedness is derived from both the data input and the weight input, not just
inputs[0]._signed. Locate the RMSNorm handling in _inferSignedness and
incorporate inputs[1] (the weight/scale tensor) when deciding whether the output
can be signed, then return the correct boolean list so downstream checkers do
not inherit a stale unsigned flag.

In `@Deeploy/Targets/Snitch/Platform.py`:
- Around line 129-133: The __init__ method in Platform uses mutable default
arguments for shape and values, which should be replaced with None so each
instance gets its own data. Update the Platform.__init__ signature to accept
optional defaults, then initialize shape and values inside the constructor
before passing them to super().__init__, keeping the behavior of the _type
initialization unchanged.

In `@Deeploy/Targets/Snitch/Tiler.py`:
- Around line 52-53: The Snitch tiled concat binding currently only handles two
inputs, while ConcatParser allows 3+ inputs, so multi-input Concat can
incorrectly reach SnitchConcatTilingReadyBindings with an incomplete schedule.
Update the Snitch concat path in Tiler.py by either adding a parser/type-check
guard in the Concat handling to require exactly two inputs, or extending
SnitchConcatBindings and ConcatTileConstraint to serialize and schedule all
inputs. Use the existing SnitchConcatTilingReadyBindings, SnitchConcatBindings,
and ConcatTileConstraint symbols to locate the fix.

---

Nitpick comments:
In `@Deeploy/Targets/Snitch/Templates/TransposeTemplate.py`:
- Around line 72-82: The parallel-axis selection in TransposeTemplate currently
hard-codes a threshold of 8, which should instead come from the platform’s
compute-core count. Update the logic around parallelDims/parallelDim to use the
configured core count value already available from the generated C path via
snrt_cluster_compute_core_num(), so the heuristic matches the target Snitch
configuration. Keep the same fallback behavior to
data_out_shape.index(max(data_out_shape)) when no dimension is large enough, but
replace the fixed 8 check with the dynamic core-count-derived threshold.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 7a29791a-c815-4dd3-9c8f-b10e5527e3eb

📥 Commits

Reviewing files that changed from the base of the PR and between 7c8f2d8 and 4ec3fb3.

📒 Files selected for processing (111)
  • .github/workflows/ci-platform-snitch-tiled.yml
  • .github/workflows/ci-platform-snitch.yml
  • .yamllint
  • Deeploy/DeeployTypes.py
  • Deeploy/Targets/Generic/Bindings.py
  • Deeploy/Targets/Generic/Layers.py
  • Deeploy/Targets/Generic/Parsers.py
  • Deeploy/Targets/Generic/Templates/FloatDivTemplate.py
  • Deeploy/Targets/Generic/Templates/ReshapeTemplate.py
  • Deeploy/Targets/Generic/TileConstraints/BOPTileConstraint.py
  • Deeploy/Targets/Generic/TypeCheckers.py
  • Deeploy/Targets/Snitch/Bindings.py
  • Deeploy/Targets/Snitch/Parsers.py
  • Deeploy/Targets/Snitch/Platform.py
  • Deeploy/Targets/Snitch/Templates/AllocateTemplate.py
  • Deeploy/Targets/Snitch/Templates/FloatAddTemplate.py
  • Deeploy/Targets/Snitch/Templates/FloatDivTemplate.py
  • Deeploy/Targets/Snitch/Templates/FloatHardSwishTemplate.py
  • Deeploy/Targets/Snitch/Templates/FloatMatMulTemplate.py
  • Deeploy/Targets/Snitch/Templates/FloatMulTemplate.py
  • Deeploy/Targets/Snitch/Templates/FloatRMSNormTemplate.py
  • Deeploy/Targets/Snitch/Templates/FloatSoftmaxTemplate.py
  • Deeploy/Targets/Snitch/Templates/TransposeTemplate.py
  • Deeploy/Targets/Snitch/Tiler.py
  • DeeployTest/Platforms/Snitch/main.c
  • DeeployTest/Tests/Kernels/FP32/Add/Scalar/inputs.npz
  • DeeployTest/Tests/Kernels/FP32/Add/Scalar/network.onnx
  • DeeployTest/Tests/Kernels/FP32/Add/Scalar/outputs.npz
  • DeeployTest/Tests/Kernels/FP32/Div/Scalar/inputs.npz
  • DeeployTest/Tests/Kernels/FP32/Div/Scalar/network.onnx
  • DeeployTest/Tests/Kernels/FP32/Div/Scalar/outputs.npz
  • DeeployTest/Tests/Kernels/FP32/Hardswish/inputs.npz
  • DeeployTest/Tests/Kernels/FP32/Hardswish/network.onnx
  • DeeployTest/Tests/Kernels/FP32/Hardswish/outputs.npz
  • DeeployTest/Tests/Kernels/FP32/Mul/Scalar/inputs.npz
  • DeeployTest/Tests/Kernels/FP32/Mul/Scalar/network.onnx
  • DeeployTest/Tests/Kernels/FP32/Mul/Scalar/outputs.npz
  • DeeployTest/Tests/Kernels/FP32/RMSNorm/separate_ops/inputs.npz
  • DeeployTest/Tests/Kernels/FP32/RMSNorm/separate_ops/network.onnx
  • DeeployTest/Tests/Kernels/FP32/RMSNorm/separate_ops/outputs.npz
  • DeeployTest/Tests/Kernels/FP32/RMSNorm/single_fused_op/inputs.npz
  • DeeployTest/Tests/Kernels/FP32/RMSNorm/single_fused_op/network.onnx
  • DeeployTest/Tests/Kernels/FP32/RMSNorm/single_fused_op/outputs.npz
  • DeeployTest/Tests/Models/microLlama/FP32/microLlama1/activations.npz
  • DeeployTest/Tests/Models/microLlama/FP32/microLlama1/inputs.npz
  • DeeployTest/Tests/Models/microLlama/FP32/microLlama1/network.onnx
  • DeeployTest/Tests/Models/microLlama/FP32/microLlama1/outputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama1/activations.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama1/inputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama1/network.onnx
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama1/outputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama128/activations.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama128/inputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama128/network.onnx
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama128/outputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama16/activations.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama16/inputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama16/network.onnx
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama16/outputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama16_parallel/activations.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama16_parallel/inputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama16_parallel/network.onnx
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama16_parallel/outputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama1_parallel/activations.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama1_parallel/inputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama1_parallel/network.onnx
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama1_parallel/outputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama2/activations.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama2/inputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama2/network.onnx
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama2/outputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama256/activations.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama256/inputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama256/network.onnx
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama256/outputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama2_parallel/activations.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama2_parallel/inputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama2_parallel/network.onnx
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama2_parallel/outputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama32/activations.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama32/inputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama32/network.onnx
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama32/outputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama32_parallel/activations.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama32_parallel/inputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama32_parallel/network.onnx
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama32_parallel/outputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama4/activations.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama4/inputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama4/network.onnx
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama4/outputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama4_parallel/activations.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama4_parallel/inputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama4_parallel/network.onnx
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama4_parallel/outputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama64/activations.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama64/inputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama64/network.onnx
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama64/outputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama64_parallel/activations.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama64_parallel/inputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama64_parallel/network.onnx
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama64_parallel/outputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama8/activations.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama8/inputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama8/network.onnx
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama8/outputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama8_parallel/activations.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama8_parallel/inputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama8_parallel/network.onnx
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama8_parallel/outputs.npz
✅ Files skipped from review due to trivial changes (1)
  • .yamllint
🚧 Files skipped from review as they are similar to previous changes (2)
  • Deeploy/Targets/Snitch/Templates/FloatHardSwishTemplate.py
  • DeeployTest/Tests/Kernels/FP32/Hardswish/network.onnx

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Inline review comments failed to post. This is likely due to GitHub's internal server error or limits when posting large numbers of comments. If you are seeing this consistently it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.

Actionable comments posted: 8

🧹 Nitpick comments (1)
Deeploy/Targets/Snitch/Templates/TransposeTemplate.py (1)

72-82: 🚀 Performance & Scalability | 🔵 Trivial | ⚡ Quick win

Derive the parallelization threshold from the configured core count.

The generated C already reads snrt_cluster_compute_core_num(), but this heuristic still selects the parallel axis with dim >= 8. On non-8-core Snitch configs that picks the wrong axis and can leave a lot of cores idle. Please thread the platform's compute-core count into the template instead of hard-coding 8.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Deeploy/Targets/Snitch/Templates/TransposeTemplate.py` around lines 72 - 82,
The parallel-axis selection in TransposeTemplate currently hard-codes a
threshold of 8, which should instead come from the platform’s compute-core
count. Update the logic around parallelDims/parallelDim to use the configured
core count value already available from the generated C path via
snrt_cluster_compute_core_num(), so the heuristic matches the target Snitch
configuration. Keep the same fallback behavior to
data_out_shape.index(max(data_out_shape)) when no dimension is large enough, but
replace the fixed 8 check with the dynamic core-count-derived threshold.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@Deeploy/DeeployTypes.py`:
- Around line 526-535: The NetworkContext constructor currently accepts a name
but does not propagate it into the context state used by _mangle(), so
non-default network names are ignored. Update the NetworkContext.__init__ path
to store the provided name and ensure both NetworkContext construction sites
pass the intended name through so generated symbols use that value instead of
always defaulting to DeeployNetwork.

In `@Deeploy/Targets/Generic/Bindings.py`:
- Around line 185-186: Rename the comprehension loop variable in the NodeBinding
collection from type to a non-builtin name to satisfy Ruff A001; update the
generator expression that builds the FloatMaxPoolTemplate.reference1DTemplate
bindings and keep the rest of the MaxPoolChecker/PointerClass logic unchanged.

In `@Deeploy/Targets/Generic/Layers.py`:
- Around line 726-740: The op-count in computeOps is using NormalizedAxesSize
where it should scale by the number of RMS groups/slices. Update the formula in
Deeploy/Targets/Generic/Layers.py’s computeOps to derive the group count from
the input shape and normalized axes, then use that count for the reduce/add/sqrt
terms instead of treating NormalizedAxesSize as the number of groups. Keep the
XSquared and Normalize/Mul contributions tied to inputSize, and only replace the
reduction-related part so nodeOps matches shapes like [B, S, D] with axis=-1.

In `@Deeploy/Targets/Generic/Parsers.py`:
- Around line 131-133: The RMSNorm parser is assuming `node.inputs[1]` is always
a constant, which can raise an `AttributeError` for dynamic scale inputs. Update
the RMSNorm handling in `Parsers.py` to keep the existing `gs.Constant` fast
path for `self.operatorRepresentation['scale']`, but set a sentinel such as
`None` when the scale input is not constant. Then adjust
`RMSNormLayer.computeOps()` so it interprets `scale is None` as the non-unit
case and continues cleanly without dereferencing `values`.
- Around line 553-567: The broadcast metadata is being skipped because `shape1`
and `shape2` are read after `ONNXLayer.parse()`/`parseNodeCtxt()` has already
normalized variable inputs via `broadcast()` and `AddLayer.computeShapes()`.
Update the broadcast check in this parser block to detect the original
variable-input broadcast case before shapes are rewritten, using the
pre-broadcast operand shapes available in
`parseNodeCtxt()`/`operatorRepresentation`, so `need_broadcast`, `ndim`,
`strides1`, `strides2`, and `out_shape` are still emitted for non-constant
broadcasts.

In `@Deeploy/Targets/Generic/TypeCheckers.py`:
- Around line 629-635: Update _inferSignedness in TypeCheckers.py so RMSNorm
signedness is derived from both the data input and the weight input, not just
inputs[0]._signed. Locate the RMSNorm handling in _inferSignedness and
incorporate inputs[1] (the weight/scale tensor) when deciding whether the output
can be signed, then return the correct boolean list so downstream checkers do
not inherit a stale unsigned flag.

In `@Deeploy/Targets/Snitch/Platform.py`:
- Around line 129-133: The __init__ method in Platform uses mutable default
arguments for shape and values, which should be replaced with None so each
instance gets its own data. Update the Platform.__init__ signature to accept
optional defaults, then initialize shape and values inside the constructor
before passing them to super().__init__, keeping the behavior of the _type
initialization unchanged.

In `@Deeploy/Targets/Snitch/Tiler.py`:
- Around line 52-53: The Snitch tiled concat binding currently only handles two
inputs, while ConcatParser allows 3+ inputs, so multi-input Concat can
incorrectly reach SnitchConcatTilingReadyBindings with an incomplete schedule.
Update the Snitch concat path in Tiler.py by either adding a parser/type-check
guard in the Concat handling to require exactly two inputs, or extending
SnitchConcatBindings and ConcatTileConstraint to serialize and schedule all
inputs. Use the existing SnitchConcatTilingReadyBindings, SnitchConcatBindings,
and ConcatTileConstraint symbols to locate the fix.

---

Nitpick comments:
In `@Deeploy/Targets/Snitch/Templates/TransposeTemplate.py`:
- Around line 72-82: The parallel-axis selection in TransposeTemplate currently
hard-codes a threshold of 8, which should instead come from the platform’s
compute-core count. Update the logic around parallelDims/parallelDim to use the
configured core count value already available from the generated C path via
snrt_cluster_compute_core_num(), so the heuristic matches the target Snitch
configuration. Keep the same fallback behavior to
data_out_shape.index(max(data_out_shape)) when no dimension is large enough, but
replace the fixed 8 check with the dynamic core-count-derived threshold.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 7a29791a-c815-4dd3-9c8f-b10e5527e3eb

📥 Commits

Reviewing files that changed from the base of the PR and between 7c8f2d8 and 4ec3fb3.

📒 Files selected for processing (111)
  • .github/workflows/ci-platform-snitch-tiled.yml
  • .github/workflows/ci-platform-snitch.yml
  • .yamllint
  • Deeploy/DeeployTypes.py
  • Deeploy/Targets/Generic/Bindings.py
  • Deeploy/Targets/Generic/Layers.py
  • Deeploy/Targets/Generic/Parsers.py
  • Deeploy/Targets/Generic/Templates/FloatDivTemplate.py
  • Deeploy/Targets/Generic/Templates/ReshapeTemplate.py
  • Deeploy/Targets/Generic/TileConstraints/BOPTileConstraint.py
  • Deeploy/Targets/Generic/TypeCheckers.py
  • Deeploy/Targets/Snitch/Bindings.py
  • Deeploy/Targets/Snitch/Parsers.py
  • Deeploy/Targets/Snitch/Platform.py
  • Deeploy/Targets/Snitch/Templates/AllocateTemplate.py
  • Deeploy/Targets/Snitch/Templates/FloatAddTemplate.py
  • Deeploy/Targets/Snitch/Templates/FloatDivTemplate.py
  • Deeploy/Targets/Snitch/Templates/FloatHardSwishTemplate.py
  • Deeploy/Targets/Snitch/Templates/FloatMatMulTemplate.py
  • Deeploy/Targets/Snitch/Templates/FloatMulTemplate.py
  • Deeploy/Targets/Snitch/Templates/FloatRMSNormTemplate.py
  • Deeploy/Targets/Snitch/Templates/FloatSoftmaxTemplate.py
  • Deeploy/Targets/Snitch/Templates/TransposeTemplate.py
  • Deeploy/Targets/Snitch/Tiler.py
  • DeeployTest/Platforms/Snitch/main.c
  • DeeployTest/Tests/Kernels/FP32/Add/Scalar/inputs.npz
  • DeeployTest/Tests/Kernels/FP32/Add/Scalar/network.onnx
  • DeeployTest/Tests/Kernels/FP32/Add/Scalar/outputs.npz
  • DeeployTest/Tests/Kernels/FP32/Div/Scalar/inputs.npz
  • DeeployTest/Tests/Kernels/FP32/Div/Scalar/network.onnx
  • DeeployTest/Tests/Kernels/FP32/Div/Scalar/outputs.npz
  • DeeployTest/Tests/Kernels/FP32/Hardswish/inputs.npz
  • DeeployTest/Tests/Kernels/FP32/Hardswish/network.onnx
  • DeeployTest/Tests/Kernels/FP32/Hardswish/outputs.npz
  • DeeployTest/Tests/Kernels/FP32/Mul/Scalar/inputs.npz
  • DeeployTest/Tests/Kernels/FP32/Mul/Scalar/network.onnx
  • DeeployTest/Tests/Kernels/FP32/Mul/Scalar/outputs.npz
  • DeeployTest/Tests/Kernels/FP32/RMSNorm/separate_ops/inputs.npz
  • DeeployTest/Tests/Kernels/FP32/RMSNorm/separate_ops/network.onnx
  • DeeployTest/Tests/Kernels/FP32/RMSNorm/separate_ops/outputs.npz
  • DeeployTest/Tests/Kernels/FP32/RMSNorm/single_fused_op/inputs.npz
  • DeeployTest/Tests/Kernels/FP32/RMSNorm/single_fused_op/network.onnx
  • DeeployTest/Tests/Kernels/FP32/RMSNorm/single_fused_op/outputs.npz
  • DeeployTest/Tests/Models/microLlama/FP32/microLlama1/activations.npz
  • DeeployTest/Tests/Models/microLlama/FP32/microLlama1/inputs.npz
  • DeeployTest/Tests/Models/microLlama/FP32/microLlama1/network.onnx
  • DeeployTest/Tests/Models/microLlama/FP32/microLlama1/outputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama1/activations.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama1/inputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama1/network.onnx
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama1/outputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama128/activations.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama128/inputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama128/network.onnx
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama128/outputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama16/activations.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama16/inputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama16/network.onnx
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama16/outputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama16_parallel/activations.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama16_parallel/inputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama16_parallel/network.onnx
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama16_parallel/outputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama1_parallel/activations.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama1_parallel/inputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama1_parallel/network.onnx
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama1_parallel/outputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama2/activations.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama2/inputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama2/network.onnx
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama2/outputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama256/activations.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama256/inputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama256/network.onnx
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama256/outputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama2_parallel/activations.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama2_parallel/inputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama2_parallel/network.onnx
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama2_parallel/outputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama32/activations.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama32/inputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama32/network.onnx
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama32/outputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama32_parallel/activations.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama32_parallel/inputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama32_parallel/network.onnx
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama32_parallel/outputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama4/activations.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama4/inputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama4/network.onnx
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama4/outputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama4_parallel/activations.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama4_parallel/inputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama4_parallel/network.onnx
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama4_parallel/outputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama64/activations.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama64/inputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama64/network.onnx
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama64/outputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama64_parallel/activations.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama64_parallel/inputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama64_parallel/network.onnx
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama64_parallel/outputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama8/activations.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama8/inputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama8/network.onnx
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama8/outputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama8_parallel/activations.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama8_parallel/inputs.npz
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama8_parallel/network.onnx
  • DeeployTest/Tests/Models/microLlama/INT8/microLlama8_parallel/outputs.npz
✅ Files skipped from review due to trivial changes (1)
  • .yamllint
🚧 Files skipped from review as they are similar to previous changes (2)
  • Deeploy/Targets/Snitch/Templates/FloatHardSwishTemplate.py
  • DeeployTest/Tests/Kernels/FP32/Hardswish/network.onnx
🛑 Comments failed to post (8)
Deeploy/DeeployTypes.py (1)

526-535: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Propagate the container name into NetworkContext.

_mangle() prefixes generated symbols with NetworkContext.name, but both context construction sites still omit this new parameter. Any non-default network name is ignored, so separate deployments still emit DeeployNetwork_* symbols and can collide.

Suggested fix
-        self.ctxt = NetworkContext(variableBuffer = self.Platform.VariableBuffer,
-                                   constantBuffer = self.Platform.ConstantBuffer,
-                                   structBuffer = self.Platform.StructBuffer,
-                                   transientBuffer = self.Platform.TransientBuffer)
+        self.ctxt = NetworkContext(variableBuffer = self.Platform.VariableBuffer,
+                                   constantBuffer = self.Platform.ConstantBuffer,
+                                   structBuffer = self.Platform.StructBuffer,
+                                   transientBuffer = self.Platform.TransientBuffer,
+                                   name = self.name)
...
-        self.ctxt = NetworkContext(variableBuffer = self.Platform.VariableBuffer,
-                                   constantBuffer = self.Platform.ConstantBuffer,
-                                   structBuffer = self.Platform.StructBuffer,
-                                   transientBuffer = self.Platform.TransientBuffer)
+        self.ctxt = NetworkContext(variableBuffer = self.Platform.VariableBuffer,
+                                   constantBuffer = self.Platform.ConstantBuffer,
+                                   structBuffer = self.Platform.StructBuffer,
+                                   transientBuffer = self.Platform.TransientBuffer,
+                                   name = self.name)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Deeploy/DeeployTypes.py` around lines 526 - 535, The NetworkContext
constructor currently accepts a name but does not propagate it into the context
state used by _mangle(), so non-default network names are ignored. Update the
NetworkContext.__init__ path to store the provided name and ensure both
NetworkContext construction sites pass the intended name through so generated
symbols use that value instead of always defaulting to DeeployNetwork.
Deeploy/Targets/Generic/Bindings.py (1)

185-186: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Rename the comprehension variable to avoid Ruff A001.

Line 185 uses type as the loop variable, which shadows Python's builtin and matches the static-analysis error on Line 186.

Suggested fix
-    NodeBinding(MaxPoolChecker([PointerClass(type)], [PointerClass(type)]), FloatMaxPoolTemplate.reference1DTemplate,
-                BasicTransformer) for type in FloatDataTypes
+    NodeBinding(MaxPoolChecker([PointerClass(dtype)], [PointerClass(dtype)]), FloatMaxPoolTemplate.reference1DTemplate,
+                BasicTransformer) for dtype in FloatDataTypes
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    NodeBinding(MaxPoolChecker([PointerClass(dtype)], [PointerClass(dtype)]), FloatMaxPoolTemplate.reference1DTemplate,
                BasicTransformer) for dtype in FloatDataTypes
🧰 Tools
🪛 Ruff (0.15.18)

[error] 186-186: Variable type is shadowing a Python builtin

(A001)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Deeploy/Targets/Generic/Bindings.py` around lines 185 - 186, Rename the
comprehension loop variable in the NodeBinding collection from type to a
non-builtin name to satisfy Ruff A001; update the generator expression that
builds the FloatMaxPoolTemplate.reference1DTemplate bindings and keep the rest
of the MaxPoolChecker/PointerClass logic unchanged.

Source: Linters/SAST tools

Deeploy/Targets/Generic/Layers.py (1)

726-740: 🚀 Performance & Scalability | 🟠 Major | ⚡ Quick win

Use the number of RMS groups, not NormalizedAxesSize, in the op count.

NormalizedAxesSize is the size of one normalization slice (prod(input_shape[axis:])), not the number of slices. For shapes like [B, S, D] with axis=-1, the extra reduce/add/sqrt terms scale with B*S, but this formula uses D, so nodeOps is badly inflated.

Suggested fix
-        scale_ops = 0 if (scale == 1.0).all() else inputSize
-        ops = 6 * inputSize - 3 * NormalizedAxesSize + scale_ops
+        group_count = inputSize // NormalizedAxesSize
+        scale_ops = 0 if scale is not None and (scale == 1.0).all() else inputSize
+        ops = 3 * inputSize + 2 * group_count + scale_ops
         return ops
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    def computeOps(self):
        inputSize = self.mapper.parser.operatorRepresentation['inputSize']
        NormalizedAxesSize = self.mapper.parser.operatorRepresentation['NormalizedAxesSize']
        scale = self.mapper.parser.operatorRepresentation['scale']

        # a. XSquared = Mul(X, X) => inputSize ops
        # b. XSquaredMean = ReduceMean<axes=normalized_axes>(XSquared)
        #    => inputSize ops (additions) + (inputSize - NormalizedAxesSize) ops (divisions)
        # c. MeanSquareEpsilon = Add(XSquaredMean, epsilon) => (inputSize - NormalizedAxesSize) ops
        # d. RMS = Sqrt(MeanSquareEpsilon) => (inputSize - NormalizedAxesSize) ops
        # e. Normalized = Div(X, RMS) => inputSize ops
        # f. Y = Mul(Normalized, Scale) => 0 if all(Scale == 1.0), else inputSize ops
        group_count = inputSize // NormalizedAxesSize
        scale_ops = 0 if scale is not None and (scale == 1.0).all() else inputSize
        ops = 3 * inputSize + 2 * group_count + scale_ops
        return ops
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Deeploy/Targets/Generic/Layers.py` around lines 726 - 740, The op-count in
computeOps is using NormalizedAxesSize where it should scale by the number of
RMS groups/slices. Update the formula in Deeploy/Targets/Generic/Layers.py’s
computeOps to derive the group count from the input shape and normalized axes,
then use that count for the reduce/add/sqrt terms instead of treating
NormalizedAxesSize as the number of groups. Keep the XSquared and Normalize/Mul
contributions tied to inputSize, and only replace the reduction-related part so
nodeOps matches shapes like [B, S, D] with axis=-1.
Deeploy/Targets/Generic/Parsers.py (2)

131-133: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Don't dereference values on dynamic RMSNorm scales.

node.inputs[1] is only guaranteed to be a tensor input. When it is not a gs.Constant, this line raises AttributeError and aborts parsing instead of cleanly supporting or rejecting the mapping. Keep the constant-only fast path for op counting, and fall back to a sentinel when the scale input is dynamic.

Suggested fix
-        self.operatorRepresentation['scale'] = node.inputs[1].values
+        self.operatorRepresentation['scale'] = (
+            node.inputs[1].values if isinstance(node.inputs[1], gs.Constant) else None
+        )

Then have RMSNormLayer.computeOps() treat scale is None as the non-unit case.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Deeploy/Targets/Generic/Parsers.py` around lines 131 - 133, The RMSNorm
parser is assuming `node.inputs[1]` is always a constant, which can raise an
`AttributeError` for dynamic scale inputs. Update the RMSNorm handling in
`Parsers.py` to keep the existing `gs.Constant` fast path for
`self.operatorRepresentation['scale']`, but set a sentinel such as `None` when
the scale input is not constant. Then adjust `RMSNormLayer.computeOps()` so it
interprets `scale is None` as the non-unit case and continues cleanly without
dereferencing `values`.

553-567: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

This no longer sees variable-input broadcasts.

ONNXLayer.parse() calls self.broadcast() before parseNodeCtxt(), and AddLayer.computeShapes() rewrites the smaller operand to the output shape first. By the time this block reads ctxt.lookup(...).shape, shape1 and shape2 already match out_shape, so need_broadcast stays false for non-constant broadcasts and the stride metadata is never emitted.

Suggested fix
 def parseNode(self, node: gs.Node) -> bool:
     ret = all([len(node.inputs) == 2, len(node.outputs) == 1])
+    if ret:
+        self.operatorRepresentation['_raw_shapes'] = (
+            list(node.inputs[0].shape),
+            list(node.inputs[1].shape),
+            list(node.outputs[0].shape),
+        )
     return ret
...
-        shape1 = list(data_in_1.shape)
-        shape2 = list(data_in_2.shape)
-        out_shape = list(data_out.shape)
+        shape1, shape2, out_shape = self.operatorRepresentation['_raw_shapes']
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Deeploy/Targets/Generic/Parsers.py` around lines 553 - 567, The broadcast
metadata is being skipped because `shape1` and `shape2` are read after
`ONNXLayer.parse()`/`parseNodeCtxt()` has already normalized variable inputs via
`broadcast()` and `AddLayer.computeShapes()`. Update the broadcast check in this
parser block to detect the original variable-input broadcast case before shapes
are rewritten, using the pre-broadcast operand shapes available in
`parseNodeCtxt()`/`operatorRepresentation`, so `need_broadcast`, `ndim`,
`strides1`, `strides2`, and `out_shape` are still emitted for non-constant
broadcasts.
Deeploy/Targets/Generic/TypeCheckers.py (1)

629-635: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Propagate the weight input into RMSNorm signedness.

Line 632 only mirrors inputs[0]._signed, but RMSNorm multiplies by weight after the positive normalization term. That can mark the output unsigned even when the scale tensor allows negative outputs, and the stale _signed flag will then propagate to downstream checkers.

Suggested fix
     def _inferSignedness(self, inputs: List[VariableBuffer],
                          operatorRepresentation: OperatorRepresentation) -> List[bool]:
-        # RMSNorm output can be signed (depending on input signedness)
-        if inputs[0]._signed:
-            return [True]
-        else:
-            return [False]
+        return [bool(inputs[0]._signed or inputs[1]._signed)]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    def _inferSignedness(self, inputs: List[VariableBuffer],
                         operatorRepresentation: OperatorRepresentation) -> List[bool]:
        return [bool(inputs[0]._signed or inputs[1]._signed)]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Deeploy/Targets/Generic/TypeCheckers.py` around lines 629 - 635, Update
_inferSignedness in TypeCheckers.py so RMSNorm signedness is derived from both
the data input and the weight input, not just inputs[0]._signed. Locate the
RMSNorm handling in _inferSignedness and incorporate inputs[1] (the weight/scale
tensor) when deciding whether the output can be signed, then return the correct
boolean list so downstream checkers do not inherit a stale unsigned flag.
Deeploy/Targets/Snitch/Platform.py (1)

129-133: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Replace mutable default arguments with None.

Ruff flags shape = [1] and values = [0]; keep defaults instance-local.

Proposed fix
-    def __init__(self, name: str = '', shape = [1], values = [0]):
+    def __init__(self, name: str = '', shape = None, values = None):
+        if shape is None:
+            shape = [1]
+        if values is None:
+            values = [0]
         super().__init__(name, shape, values)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    def __init__(self, name: str = '', shape = None, values = None):
        if shape is None:
            shape = [1]
        if values is None:
            values = [0]
        super().__init__(name, shape, values)
        # Initialize _type with a default value to prevent AttributeError
        # The actual type will be set later via annotateType
        self._type: Type[Pointer] = PointerClass(VoidType)
🧰 Tools
🪛 Ruff (0.15.18)

[warning] 129-129: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)


[warning] 129-129: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Deeploy/Targets/Snitch/Platform.py` around lines 129 - 133, The __init__
method in Platform uses mutable default arguments for shape and values, which
should be replaced with None so each instance gets its own data. Update the
Platform.__init__ signature to accept optional defaults, then initialize shape
and values inside the constructor before passing them to super().__init__,
keeping the behavior of the _type initialization unchanged.

Source: Linters/SAST tools

Deeploy/Targets/Snitch/Tiler.py (1)

52-53: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Reject or support multi-input Concat before enabling tiled Snitch concat.

ConcatParser accepts any len(inputs) >= 2, but ConcatTileConstraint only constrains/serializes data_in_1 and data_in_2. A 3+ input concat can bind here and produce an incomplete tiled load schedule. Add a Snitch parser/type-check guard for exactly two inputs, or extend the constraint/template to cover all inputs.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Deeploy/Targets/Snitch/Tiler.py` around lines 52 - 53, The Snitch tiled
concat binding currently only handles two inputs, while ConcatParser allows 3+
inputs, so multi-input Concat can incorrectly reach
SnitchConcatTilingReadyBindings with an incomplete schedule. Update the Snitch
concat path in Tiler.py by either adding a parser/type-check guard in the Concat
handling to require exactly two inputs, or extending SnitchConcatBindings and
ConcatTileConstraint to serialize and schedule all inputs. Use the existing
SnitchConcatTilingReadyBindings, SnitchConcatBindings, and ConcatTileConstraint
symbols to locate the fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Feature Addition of new features

Projects

Status: In progress

Development

Successfully merging this pull request may close these issues.

4 participants