CMSIS-NN
========

The operator list TFLM walks is mostly a handful of heavy operators: *convolution* sliding a small grid of learned weights over an input tensor and writing the weighted sum at each position; *depthwise convolution* doing the same per channel; *fully-connected* matrix multiplies between a vector of inputs and a matrix of weights; *pooling* shrinking a tensor by taking the max or average over small neighbourhoods; *activation functions* like ReLU and sigmoid running pointwise across every value. A vision inference spends most of its cycles inside those few operators.

Implemented the straightforward way, they would be slow on a microcontroller. *CMSIS-NN* is Arm's library of fast ones -- hand-tuned, integer-quantized to the ``int8`` and ``uint8`` values :doc:`tensor I/O <../pipeline/tensor-io>` described, and written against the CPU's *SIMD* instructions.

SIMD -- Single Instruction, Multiple Data -- lets the CPU run one arithmetic operation across several values with a single instruction. A plain scalar multiply-add produces one result at a time; a SIMD multiply-add packs several values into a wide register and processes them together. The Cortex-M7 on the H7 and the RT1062 has Arm's *DSP extension*, which packs four ``int8`` values (or two 16-bit values) into a 32-bit register; its multiply-accumulate instructions work on the 16-bit halves, so ``int8`` data is widened to 16 bits and each instruction accumulates two products instead of one. The Cortex-M55 on the AE3 has *Helium* -- formally *MVE*, the M-profile Vector Extension -- which holds sixteen ``int8`` lanes in a 128-bit register, four times the width of the DSP extension's 32-bit register. Helium is a wider CPU instruction set, not an accelerator; the :doc:`Ethos-U55 NPU ` on the same die is the accelerator.

The shipped TFLM builds are linked against CMSIS-NN, and at runtime TFLM dispatches each heavy operator to the SIMD variant that matches the cam.

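
The packed multiply-accumulate idea described above can be sketched in portable C. This is only an illustration of the technique, not CMSIS-NN code: the names ``dot_s8_scalar``, ``mac4_s8`` and ``dot_s8_packed`` are hypothetical, and the lane loop inside ``mac4_s8`` stands in for what SIMD hardware would do with packed instructions.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scalar reference: one multiply-accumulate per loop iteration. */
static int32_t dot_s8_scalar(const int8_t *a, const int8_t *b, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += (int32_t)a[i] * (int32_t)b[i];
    }
    return acc;
}

/* Hypothetical stand-in for a packed multiply-accumulate: each 32-bit
 * word carries four int8 lanes, and all four products are added to acc.
 * Plain C has to walk the lanes one by one; SIMD hardware would not. */
static int32_t mac4_s8(uint32_t pa, uint32_t pb, int32_t acc)
{
    int8_t la[4];
    int8_t lb[4];
    memcpy(la, &pa, sizeof la);   /* unpack lanes without aliasing issues */
    memcpy(lb, &pb, sizeof lb);
    for (int k = 0; k < 4; k++) {
        acc += (int32_t)la[k] * (int32_t)lb[k];
    }
    return acc;
}

/* Packed variant: consume four int8 values per step, then finish any
 * leftover elements with the scalar form. */
static int32_t dot_s8_packed(const int8_t *a, const int8_t *b, size_t n)
{
    int32_t acc = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        uint32_t pa;
        uint32_t pb;
        memcpy(&pa, a + i, sizeof pa);   /* pack four int8 into one word */
        memcpy(&pb, b + i, sizeof pb);
        acc = mac4_s8(pa, pb, acc);
    }
    for (; i < n; i++) {
        acc += (int32_t)a[i] * (int32_t)b[i];
    }
    return acc;
}
```

Both functions return the same sum; the packed form only changes how many values are handled per step, which is where the speed-up comes from once the lane loop is replaced by real SIMD instructions.
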
On the AE3 the dispatch is a little more involved: the Vela compiler has already walked the model offline and marked connected slices of NPU-eligible operators -- *subgraphs* -- for dispatch to the Ethos-U. At inference time those subgraphs run on the accelerator in one block, and the rest fall back to Helium CMSIS-NN on the M55.

Float operators bypass CMSIS-NN entirely and run through TFLM's portable reference kernels. The accuracy gap between an ``int8`` and a float model is usually small; the throughput gap is large. Models shipped on the cam are quantized to ``int8`` for this reason.
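
To get a feel for why the accuracy gap tends to be small, here is a minimal sketch of the affine ``int8`` mapping, ``real = scale * (q - zero_point)``. The helper names ``quantize_s8`` and ``dequantize_s8`` and the scale and zero-point values in the usage note are assumptions for illustration; a real model carries its own per-tensor parameters.

```c
#include <stdint.h>

/* Map a float to int8: divide by scale, round to nearest, add the zero
 * point, and clamp to the int8 range. Assumes x / scale fits in int32. */
static int8_t quantize_s8(float x, float scale, int32_t zero_point)
{
    float scaled = x / scale;
    /* Round half away from zero without pulling in math.h. */
    int32_t rounded = (int32_t)(scaled >= 0.0f ? scaled + 0.5f : scaled - 0.5f);
    int32_t q = rounded + zero_point;
    if (q < -128) {
        q = -128;
    }
    if (q > 127) {
        q = 127;
    }
    return (int8_t)q;
}

/* Map an int8 value back to the float it represents. */
static float dequantize_s8(int8_t q, float scale, int32_t zero_point)
{
    return scale * (float)((int32_t)q - zero_point);
}
```

With an assumed scale of ``0.5`` and zero point of ``0``, the value ``1.3`` quantizes to ``3`` and reads back as ``1.5``. For values inside the representable range the round-trip error is at most half the scale, which suggests why a well-chosen scale keeps the ``int8`` model close to the float one.
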