7.10. CMSIS-NN
The operator list TFLM walks is mostly a handful of heavy operators: convolution sliding a small grid of learned weights over an input tensor and writing the weighted sum at each position; depthwise convolution doing the same per channel; fully-connected matrix multiplies between a vector of inputs and a matrix of weights; pooling shrinking a tensor by taking the max or average over small neighbourhoods; activation functions like ReLU and sigmoid running pointwise across every value. A vision inference spends most of its cycles inside those few operators.
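As a minimal sketch of what "sliding a small grid of learned weights over an input tensor" means, the pure-Python reference below computes a single-channel convolution and a pointwise ReLU the straightforward way. The function names and toy values are illustrative assumptions, not TFLM or CMSIS-NN APIs; real kernels also handle channels, strides, padding, and quantization parameters.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (both lists of rows) with no padding.

    Each output value is the weighted sum of the input patch under the kernel.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for y in range(out_h):
        row = []
        for x in range(out_w):
            acc = 0
            for ky in range(kh):
                for kx in range(kw):
                    acc += image[y + ky][x + kx] * kernel[ky][kx]
            row.append(acc)
        out.append(row)
    return out


def relu(tensor):
    """Pointwise activation: clamp every negative value to zero."""
    return [[max(0, v) for v in row] for row in tensor]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[0, 1],
          [1, 0]]

feature_map = conv2d_valid(image, kernel)  # [[6, 8], [12, 14]]
clamped = relu([[-3, 2]])                  # [[0, 2]]
```

The four nested loops are the cost: every output value touches every weight, which is why a handful of these operators dominates inference time.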
Implemented the straightforward way, these operators would be
slow on a microcontroller. CMSIS-NN is Arm’s library of fast
implementations – hand-tuned for Cortex-M cores,
integer-quantized to the int8 and uint8 values described
under tensor I/O, and written against the CPU’s SIMD
instructions. SIMD – Single Instruction, Multiple Data – lets
the CPU apply one arithmetic operation to several values with
a single instruction. A plain scalar multiply-add produces one
result per instruction; a SIMD multiply-add packs several
values into a wide register and processes all of them at once.
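The effect of lane width can be sketched by counting multiply-add steps for the same dot product at different widths. This is an idealized model under stated assumptions: `dot_lanes` and its `lanes` parameter are hypothetical names, one "step" stands for one issued multiply-add instruction, and real cycle counts also depend on loads, widening, and pipeline behaviour.

```python
def dot_lanes(xs, ws, lanes):
    """Dot product that consumes `lanes` value pairs per multiply-add step.

    Returns (result, steps). lanes=1 models a scalar loop.
    """
    acc = 0
    steps = 0
    for i in range(0, len(xs), lanes):
        # One modelled SIMD instruction: multiply a chunk lane by lane and accumulate.
        acc += sum(x * w for x, w in zip(xs[i:i + lanes], ws[i:i + lanes]))
        steps += 1
    return acc, steps


xs = [(i % 7) - 3 for i in range(64)]
ws = [(i % 5) - 2 for i in range(64)]

scalar_result, scalar_steps = dot_lanes(xs, ws, lanes=1)   # 64 steps
simd_result, simd_steps = dot_lanes(xs, ws, lanes=4)       # 16 steps
```

Both calls return the same sum; only the number of steps changes.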
The Cortex-M7 on the H7 and the RT1062 has Arm’s DSP extension,
which treats a 32-bit register as packed lanes. Four int8 values
fit in one register; the dual multiply-accumulate instructions
work on pairs of 16-bit lanes, so kernels widen int8 data to
16 bits before accumulating. The Cortex-M55 on the AE3
has Helium – formally MVE, the M-profile Vector Extension –
which holds sixteen int8 lanes in a 128-bit register, four
times as many as a 32-bit register. Helium is a wider CPU
instruction set, not an accelerator; the Ethos-U55 NPU on the
same die is the accelerator.
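To make the packed-register idea concrete, here is a bit-level Python emulation of one pattern commonly used with the DSP extension: four int8 values are packed into a 32-bit word, `SXTB16` widens two of them at a time to 16-bit lanes, and `SMLAD` performs a dual 16-bit multiply-accumulate. The helper names are hypothetical and the sketch models only the arithmetic of those two instructions, not timing or the actual CMSIS-NN source.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of `value` as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def pack_int8x4(values):
    """Pack four int8 values into one 32-bit word, first value in the low byte."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & 0xFF) << (8 * i)
    return word


def sxtb16(word, rotate=0):
    """Emulate SXTB16: rotate right, then sign-extend bytes 0 and 2 to 16 bits."""
    word &= 0xFFFFFFFF
    if rotate:
        word = ((word >> rotate) | (word << (32 - rotate))) & 0xFFFFFFFF
    lo = to_signed(word, 8) & 0xFFFF
    hi = to_signed(word >> 16, 8) & 0xFFFF
    return (hi << 16) | lo


def smlad(a, b, acc):
    """Emulate SMLAD: acc + a.lo * b.lo + a.hi * b.hi on signed 16-bit halves."""
    total = (acc
             + to_signed(a, 16) * to_signed(b, 16)
             + to_signed(a >> 16, 16) * to_signed(b >> 16, 16))
    return to_signed(total, 32)


def dot4(xs, ws, acc=0):
    """Dot product of four int8 pairs using two widen-and-dual-MAC rounds."""
    x, w = pack_int8x4(xs), pack_int8x4(ws)
    acc = smlad(sxtb16(x), sxtb16(w), acc)        # lanes 0 and 2
    acc = smlad(sxtb16(x, 8), sxtb16(w, 8), acc)  # lanes 1 and 3
    return acc


result = dot4([1, -2, 3, -4], [5, 6, -7, 8])  # 5 - 12 - 21 - 32 = -60
```

A Helium kernel follows the same idea with sixteen int8 lanes per 128-bit register instead of four per 32-bit word.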
The shipped TFLM builds are linked against CMSIS-NN, and at runtime TFLM dispatches each heavy operator to the SIMD variant that matches the cam's CPU. On the AE3 the dispatch is a little more involved: the Vela compiler has already walked the model offline and marked connected slices of NPU-eligible operators – subgraphs – for dispatch to the Ethos-U. At inference time each of those subgraphs runs on the accelerator as one block, and the remaining operators fall back to Helium CMSIS-NN on the M55.
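The offline partitioning step can be pictured with a deliberately simplified model: treat the model as a flat operator list and group consecutive NPU-eligible operators into one block. Everything here is an assumption made for illustration, including the `partition` function, the contents of `NPU_ELIGIBLE`, and the placeholder operator names; the real compiler works on a graph and applies its own eligibility rules.

```python
from itertools import groupby

# Hypothetical eligibility set, chosen only to make the example readable.
NPU_ELIGIBLE = {"CONV_2D", "DEPTHWISE_CONV_2D", "FULLY_CONNECTED", "MAX_POOL_2D"}


def partition(ops, npu_eligible=NPU_ELIGIBLE):
    """Split an operator list into NPU blocks and per-operator CPU fallbacks.

    Consecutive eligible operators become one ("npu", [...]) entry;
    every other operator becomes its own ("cpu", [op]) entry.
    """
    plan = []
    for on_npu, run in groupby(ops, key=lambda op: op in npu_eligible):
        run = list(run)
        if on_npu:
            plan.append(("npu", run))
        else:
            plan.extend(("cpu", [op]) for op in run)
    return plan


ops = ["CONV_2D", "DEPTHWISE_CONV_2D", "MY_CUSTOM_OP", "FULLY_CONNECTED", "MY_OTHER_OP"]
plan = partition(ops)
# [("npu", ["CONV_2D", "DEPTHWISE_CONV_2D"]),
#  ("cpu", ["MY_CUSTOM_OP"]),
#  ("npu", ["FULLY_CONNECTED"]),
#  ("cpu", ["MY_OTHER_OP"])]
```

In this model each `"npu"` entry corresponds to one hand-off to the accelerator, and each `"cpu"` entry to one operator run by the CPU kernels.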
Float operators bypass CMSIS-NN entirely and run through TFLM’s
portable reference kernels. The accuracy gap between an int8
and a float model is usually small; the throughput gap is large.
Models shipped on the cam are quantized to int8 for this
reason.
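Why the accuracy gap stays small can be seen from the affine mapping that int8 quantization in TensorFlow Lite is based on, real = scale * (q - zero_point). The sketch below round-trips a float through int8; the `scale` and `zero_point` values are made up for the example, and Python's `round` stands in for whatever rounding the converter applies.

```python
def quantize(x, scale, zero_point):
    """Map a float to int8: round to the nearest step, then saturate."""
    q = round(x / scale) + zero_point
    return max(-128, min(127, q))


def dequantize(q, scale, zero_point):
    """Map an int8 value back to the float it represents."""
    return scale * (q - zero_point)


scale, zero_point = 0.05, -10     # hypothetical tensor parameters

q = quantize(1.23, scale, zero_point)        # 15
approx = dequantize(q, scale, zero_point)    # about 1.25
error = abs(approx - 1.23)                   # about 0.02, within scale / 2
```

Inside the representable range, the round-trip error is at most half a quantization step, while values outside it saturate at -128 or 127.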