7.10. CMSIS-NN

The operator list TFLM walks is mostly a handful of heavy operators: convolution sliding a small grid of learned weights over an input tensor and writing the weighted sum at each position; depthwise convolution doing the same with a separate filter for each channel; fully-connected layers multiplying a vector of inputs by a matrix of weights; pooling shrinking a tensor by taking the max or average over small neighbourhoods; activation functions like ReLU and sigmoid running pointwise across every value. A vision inference spends most of its cycles inside those few operators.
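A minimal sketch of the convolution just described, in plain C, may make the cost concrete. It is not TFLM's or CMSIS-NN's kernel: the name `conv2d_naive`, the single-channel row-major layout, stride 1, no padding and the raw int32 output without requantization are all simplifying assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Naive single-channel 2-D convolution (hypothetical helper, not a TFLM API).
 * in : in_h x in_w int8 values, row-major
 * k  : k_h x k_w int8 weights, row-major
 * out: (in_h - k_h + 1) x (in_w - k_w + 1) raw int32 sums, row-major */
void conv2d_naive(const int8_t *in, int in_h, int in_w,
                  const int8_t *k, int k_h, int k_w,
                  int32_t *out)
{
    assert(k_h >= 1 && k_w >= 1 && k_h <= in_h && k_w <= in_w);

    int out_h = in_h - k_h + 1;
    int out_w = in_w - k_w + 1;

    for (int oy = 0; oy < out_h; oy++) {
        for (int ox = 0; ox < out_w; ox++) {
            int32_t acc = 0;  /* accumulate in 32 bits so int8 products cannot overflow */
            for (int ky = 0; ky < k_h; ky++) {
                for (int kx = 0; kx < k_w; kx++) {
                    acc += (int32_t)in[(oy + ky) * in_w + (ox + kx)]
                         * (int32_t)k[ky * k_w + kx];
                }
            }
            out[oy * out_w + ox] = acc;
        }
    }
}
```

Each output position costs `k_h * k_w` multiply-adds in the innermost loops, and a real layer repeats that across every input and output channel.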

Implemented the straightforward way, they would be slow on a microcontroller. CMSIS-NN is Arm’s library of fast ones – hand-optimized, integer-quantized to the int8 and uint8 values described under tensor I/O, and written against the CPU’s SIMD instructions. SIMD – Single Instruction, Multiple Data – lets the CPU apply one arithmetic operation to several values at once. A plain scalar multiply-add produces one result per instruction; a SIMD multiply-add packs several values into one register and processes all of them in a single instruction.

The Cortex-M7 on the H7 and the RT1062 has Arm’s DSP extension, which treats a 32-bit register as two 16-bit lanes and runs a multiply-accumulate over both in one instruction. It has no four-lane int8 multiply, so CMSIS-NN first widens pairs of int8 values to 16 bits and then feeds them to the dual multiply-accumulate. The Cortex-M55 on the AE3 has Helium – formally MVE, the M-profile Vector Extension – which holds sixteen int8 lanes in a 128-bit register and multiplies and accumulates them directly, for several times the M7’s throughput per cycle. Helium is a wider CPU instruction set, not an accelerator; the Ethos-U55 NPU on the same die is the accelerator.
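The packed multiply-accumulate pattern on the DSP extension can be sketched in portable C. The two helpers below emulate the `SXTB16` (sign-extend two bytes into two 16-bit lanes) and `SMLAD` (dual 16-bit multiply-accumulate) instructions in software so the example runs on any host; on a Cortex-M target these would be single instructions, reachable through the `__SXTB16` and `__SMLAD` intrinsics in CMSIS-Core. The function names and the loop structure are illustrative assumptions, not CMSIS-NN's actual source.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Sign-extend the low 8 or 16 bits of v without relying on
 * implementation-defined narrowing conversions. */
static int32_t sext8(uint32_t v)  { v &= 0xFFu;   return v >= 0x80u   ? (int32_t)v - 0x100   : (int32_t)v; }
static int32_t sext16(uint32_t v) { v &= 0xFFFFu; return v >= 0x8000u ? (int32_t)v - 0x10000 : (int32_t)v; }

/* Software stand-in for SXTB16: bytes 0 and 2 of x become two signed 16-bit lanes. */
static uint32_t sxtb16_emul(uint32_t x)
{
    uint32_t lo = (uint32_t)sext8(x) & 0xFFFFu;
    uint32_t hi = (uint32_t)sext8(x >> 16) & 0xFFFFu;
    return (hi << 16) | lo;
}

/* Software stand-in for SMLAD: acc + lo(a)*lo(b) + hi(a)*hi(b), lanes are signed 16-bit. */
static int32_t smlad_emul(uint32_t a, uint32_t b, int32_t acc)
{
    return acc + sext16(a) * sext16(b) + sext16(a >> 16) * sext16(b >> 16);
}

/* Scalar baseline: one multiply-add per loop iteration. */
int32_t dot_s8_scalar(const int8_t *a, const int8_t *b, int n)
{
    int32_t acc = 0;
    for (int i = 0; i < n; i++) acc += (int32_t)a[i] * (int32_t)b[i];
    return acc;
}

/* Packed version: load four int8 values per operand as one 32-bit word,
 * widen them to two pairs of 16-bit lanes, and issue two dual MACs. */
int32_t dot_s8_packed(const int8_t *a, const int8_t *b, int n)
{
    assert(n >= 0);
    int32_t acc = 0;
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        uint32_t wa, wb;
        memcpy(&wa, a + i, 4);   /* memcpy avoids unaligned or aliasing problems */
        memcpy(&wb, b + i, 4);
        /* Bytes 0 and 2 of each word; both operands use the same byte
         * positions, so the pairing is correct on either endianness. */
        acc = smlad_emul(sxtb16_emul(wa), sxtb16_emul(wb), acc);
        /* Bytes 1 and 3 (SXTB16 with an 8-bit rotate on real hardware). */
        acc = smlad_emul(sxtb16_emul(wa >> 8), sxtb16_emul(wb >> 8), acc);
    }
    for (; i < n; i++) acc += (int32_t)a[i] * (int32_t)b[i];  /* leftover elements */
    return acc;
}
```

Four int8 products therefore cost two widening steps and two dual multiply-accumulates per operand pair, rather than four separate scalar multiply-adds.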

The shipped TFLM builds are linked against CMSIS-NN. The SIMD variant is fixed when the firmware for each cam is built – DSP-extension kernels for the M7, Helium kernels for the M55 – and at runtime TFLM calls that variant for each heavy operator. On the AE3 the dispatch is a little more involved: the Vela compiler has already walked the model offline and marked connected slices of NPU-eligible operators – subgraphs – for dispatch to the Ethos-U. At inference time each of those subgraphs runs on the accelerator as one block, and the remaining operators fall back to Helium CMSIS-NN on the M55.
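The idea of grouping NPU-eligible operators into blocks can be sketched as follows. This is a deliberately simplified illustration, not how Vela is implemented: it assumes the model is a flat, ordered list of operators with a precomputed eligibility flag, whereas a real compiler works on a graph and applies many more constraints. The names `segment_t` and `partition_ops` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One contiguous run of operators that all execute in the same place
 * (hypothetical type, for illustration only). */
typedef struct {
    size_t first;   /* index of the first operator in the run */
    size_t count;   /* number of operators in the run */
    bool   on_npu;  /* true: run as one NPU block; false: run on the CPU */
} segment_t;

/* Split an ordered operator list into maximal runs with the same eligibility.
 * npu_ok[i] says whether operator i could run on the NPU.
 * Returns the number of segments written to segs. */
size_t partition_ops(const bool *npu_ok, size_t n_ops,
                     segment_t *segs, size_t max_segs)
{
    size_t n_segs = 0;
    size_t i = 0;
    while (i < n_ops) {
        size_t j = i + 1;
        while (j < n_ops && npu_ok[j] == npu_ok[i]) j++;  /* extend the run */

        assert(n_segs < max_segs);
        segs[n_segs].first  = i;
        segs[n_segs].count  = j - i;
        segs[n_segs].on_npu = npu_ok[i];
        n_segs++;

        i = j;  /* next run starts where this one ended */
    }
    return n_segs;
}
```

With this split done ahead of time, the runtime only has to walk the segments in order, handing each NPU segment to the accelerator as a single unit and running each CPU segment operator by operator.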

Float operators bypass CMSIS-NN entirely and run through TFLM’s portable reference kernels. The accuracy gap between an int8 and a float model is usually small; the throughput gap is large. Models shipped on the cam are quantized to int8 for this reason.
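To see why the accuracy gap tends to be small, it helps to look at the arithmetic behind int8 quantization. The sketch below uses the affine mapping from the TensorFlow Lite quantization scheme, real = scale × (q − zero_point); the function names and the example scale and zero point are assumptions chosen for illustration, not values taken from any shipped model.

```c
#include <assert.h>
#include <stdint.h>

/* Map a real value to int8: q = round(real / scale) + zero_point, clamped.
 * Assumes real / scale fits comfortably in an int32. */
int8_t quantize_s8(float real, float scale, int32_t zero_point)
{
    assert(scale > 0.0f);
    float x = real / scale;
    /* Round half away from zero without needing the math library. */
    int32_t q = (int32_t)(x >= 0.0f ? x + 0.5f : x - 0.5f) + zero_point;
    if (q < -128) q = -128;   /* saturate to the int8 range */
    if (q > 127)  q = 127;
    return (int8_t)q;
}

/* Map an int8 value back to a real value: real = scale * (q - zero_point). */
float dequantize_s8(int8_t q, float scale, int32_t zero_point)
{
    return scale * (float)((int32_t)q - zero_point);
}
```

For a value inside the representable range, the round trip is off by at most half a quantization step (scale / 2); values outside the range saturate at -128 or 127.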