8.3.9. Tips, tricks and broadcasting
This page collects techniques that help you get the most out of
ulab on a memory-constrained, modestly-clocked microcontroller.
Hardware varies a lot, so always measure on the device you actually
target. The general rules outlined here are useful starting points,
but a particular code snippet that is fast on one MCU may not gain
the same factor on another – some MCUs have no FPU, others have very
fast caches, and so on.
8.3.9.1. Use an ndarray when you can
Many ulab reductions accept either an iterable or an
ndarray:
from ulab import numpy as np
np.sum([1, 2, 3, 4, 5]) # works, but slow
np.sum(np.array([1, 2, 3, 4, 5])) # ~3x faster
Iterables force the interpreter to fetch each Python object, convert
it to a C numeric type, and accumulate it. With an ndarray, the C
type is already known and the inner loop is a tight, type-specialised
for loop. Compared to a pure-Python implementation of the same
reduction, the speedup is typically 30-50x.
Counter-tip: if the data only exists as a Python list and you call
the reduction once, do not convert to ndarray first – the
constructor itself has to walk the list and allocate. The conversion
only pays off when you use the array more than once.
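Both tips can be sketched together; the try/except fallback to CPython numpy is only there so the snippet also runs off-device:

```python
try:
    from ulab import numpy as np
except ImportError:
    import numpy as np  # off-device fallback

samples = [1, 2, 3, 4, 5]

# Called once on a list: skip the conversion, pass the list directly.
one_shot = np.sum(samples)

# Used repeatedly: convert once, then every reduction takes the fast path.
a = np.array(samples)
total = np.sum(a)
average = np.mean(a)
largest = np.max(a)
```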
8.3.9.2. Pick a reasonable dtype
The default dtype is float (4 or 8 bytes per element). For
data that is naturally 8-bit or 16-bit (ADC samples, image pixels,
sensor readings) use np.uint8, np.int8, np.uint16 or
np.int16 instead. You cut RAM use by a factor of 4-8 and the
inner loops run faster.
adc = np.array(adc_samples, dtype=np.uint16) # not float!
Be aware of upcasting and overflow rules. Operations on two arrays of the same integer dtype keep that dtype, even when results overflow:
a = np.array([200, 200], dtype=np.uint8)
b = np.array([100, 100], dtype=np.uint8)
print(a + b) # array([44, 44], dtype=uint8) -- wraps!
If you need a wider intermediate, cast first:
c = np.array(a, dtype=np.uint16) + b
8.3.9.3. Broadcasting
Binary operators do not require shape equality – they broadcast.
The rules (which match numpy) are:
If the two operands have different rank, virtually prepend axes of size 1 to the smaller one until the ranks match.
Along each axis, the two sizes must be equal, or one of them must be 1. A size-1 axis is virtually stretched to match the other side.
If those two rules cannot be satisfied, you get a
ValueError("operands could not be broadcast together").
Add a scalar to every element:
a = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float)
a + 10 # adds 10 everywhere
Add a row vector to every row of a matrix:
row = np.array([100, 200, 300], dtype=np.float)
a + row # (2, 3) + (3,) -> (2, 3)
Subtract a column mean from each column:
means = np.array([np.mean(a[:, i]) for i in range(3)])
a - means
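Put together, the column-centering example runs end-to-end like this (float literals are used so both operands are float; the numpy fallback is for off-device testing):

```python
try:
    from ulab import numpy as np
except ImportError:
    import numpy as np  # off-device fallback

a = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])

# per-column means, shape (3,)
means = np.array([np.mean(a[:, i]) for i in range(3)])

# (2, 3) - (3,) broadcasts the row of means over every row of a
centered = a - means
```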
Broadcasting also works through universal functions that accept two
arrays, e.g. np.arctan2:
y = np.array([1, 2, 3, 4])
np.arctan2(y, 1.0)
np.arctan2(1.0, y)
np.arctan2(y, y)
For more on the internals (how ulab rewrites strides to make
broadcasting work) see Advanced patterns.
8.3.9.4. Comparison-operator side rule
In ulab, the ndarray must be on the left of a relational
operator with a scalar. a > 2 works; 2 < a raises
TypeError. If you need the symmetric form, use
np.less(2, a) / np.greater(2, a) etc.
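A minimal sketch of the two spellings that are safe under ulab's rule (the numpy fallback is for off-device testing; under CPython numpy both orders work, so only the ulab-safe forms are shown):

```python
try:
    from ulab import numpy as np
except ImportError:
    import numpy as np  # off-device fallback

a = np.array([1, 2, 3, 4])

mask = a > 2           # ndarray on the left: fine in ulab
mask2 = np.less(2, a)  # symmetric spelling, per the rule above
```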
8.3.9.5. Beware the axis (memory layout matters)
ulab always loops innermost over the last axis of an array.
That means an array shaped (2, 1000) is much faster to iterate
than an array shaped (1000, 2) holding the same data, because
the long axis lines up with the inner loop. When you have control
over how data is laid out, put the long axis last:
a = np.array(range(2000)).reshape((2, 1000)) # fast
b = np.array(range(2000)).reshape((1000, 2)) # slower
In numpy the last axis – the one whose elements sit next to each
other in memory – is sometimes called the “fast axis”, and numpy
is free to permute its loop order so that the cheapest axis runs
innermost. ulab does not do this – the loop order is fixed by
the strides.
If you find yourself with the wrong layout, a.transpose() /
a.T is cheap (it only flips strides; no data is copied).
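A quick way to convince yourself that the transpose shares data is to mutate the original and read the change back through the view (a sketch, following the claim above that transpose only flips strides; numpy fallback for off-device testing):

```python
try:
    from ulab import numpy as np
except ImportError:
    import numpy as np  # off-device fallback

a = np.array(range(6), dtype=np.uint8).reshape((2, 3))
t = a.T              # stride flip only; shares data with a

a[0, 0] = 99         # write through the original...
# ...and the change is visible through the transposed view
```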
8.3.9.6. Reduce intermediate arrays
Compound expressions like a + b * c allocate a temporary array
for b * c. On a microcontroller these temporaries cost both
allocation time and memory fragmentation. When a calculation
appears in a hot loop, prefer in-place operators:
# makes 1 temporary
b = (a + 1) * 2
# no temporary
b = np.array(a)
b += 1
b *= 2
The same idea applies to slice assignment. Instead of one big expression, write out simple sub-assignments that target slices of a pre-allocated output. The classic example is interpolation by 2 over a small array:
# Linear interpolation: place originals at even indices, mid-points
# at odd indices, with no temporary arrays.
a = np.array([0, 10, 2, 20, 4], dtype=np.uint8)
b = np.zeros(2 * len(a) - 1, dtype=np.uint8)
b[::2] = a
b[1::2] = a[:-1]
b[1::2] += a[1:]
# divide by 2 if you want the actual average
The compound version, b[1::2] = (a[:-1] + a[1:]) // 2, would
allocate a temporary the size of a[:-1] + a[1:] plus another for
the division. The four-instruction version above touches only the
two views a[:-1] and a[1:], which exist anyway.
8.3.9.6.1. Up-scaling images with slice assignment
The same idea generalises to two dimensions, which makes it useful for upscaling small images (think 8x8 thermal cameras for a human-friendly preview). To double the resolution, place originals on even (row, column) cells and fill in averages on odd cells:
# a is, say, 8x8; b is 15x15
b = np.zeros((15, 15), dtype=np.uint8)
# rows: even-row destinations get the originals
b[::2, ::2] = a
# rows: odd-row destinations get vertical averages
b[1::2, ::2] = a[:-1, :]
b[1::2, ::2] += a[1:, :]
b[1::2, ::2] //= 2
# columns: odd-column destinations get horizontal averages
b[:, 1::2] = b[:, :-1:2]
b[:, 1::2] += b[:, 2::2]
b[:, 1::2] //= 2
Going larger than 2x is the same pattern with more assignments.
Note that this technique stays inside a small dtype (uint8).
Doing the same with floats would burn 4-8x more RAM for no benefit if
all you want is a smoothed preview.
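Here is a self-contained miniature of the same recipe, shrunk to a 2x2 source so the result can be checked by hand (numpy fallback for off-device testing):

```python
try:
    from ulab import numpy as np
except ImportError:
    import numpy as np  # off-device fallback

# 2x2 source -> 3x3 preview
a = np.array([[0, 10], [20, 30]], dtype=np.uint8)
b = np.zeros((3, 3), dtype=np.uint8)

b[::2, ::2] = a            # originals on even/even cells
b[1::2, ::2] = a[:-1, :]   # vertical averages on odd rows
b[1::2, ::2] += a[1:, :]
b[1::2, ::2] //= 2
b[:, 1::2] = b[:, :-1:2]   # horizontal averages on odd columns
b[:, 1::2] += b[:, 2::2]
b[:, 1::2] //= 2
# b is now [[0, 5, 10], [10, 15, 20], [20, 25, 30]]
```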
8.3.9.7. Reuse buffers across iterations
Several APIs accept a pre-allocated buffer:
image.Image.to_ndarray() accepts buffer= so the conversion does not allocate every frame.
The image.Image constructor accepts buffer= for the same reason.
np.frombuffer aliases an existing buffer, never copying.
Universal functions (np.exp, np.sin, np.sqrt, …) accept out= – write into a pre-allocated float array.
utils.spectrogram accepts out= and scratchpad=.
In a hot loop, allocate once and reuse:
import csi
from ulab import numpy as np
csi0 = csi.CSI()
csi0.reset()
csi0.pixformat(csi.GRAYSCALE)
csi0.framesize(csi.QVGA)
csi0.snapshot(time=2000)
buf = bytearray(320 * 240)
while True:
    img = csi0.snapshot()
    a = img.to_ndarray('B', buffer=buf)  # no per-frame alloc
    # ... process a ...
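The csi module above is board-specific, but the buffer-aliasing and out= ideas can be sketched in a device-independent way (numpy fallback for off-device testing; out= support in ulab's universal functions is as listed above):

```python
try:
    from ulab import numpy as np
except ImportError:
    import numpy as np  # off-device fallback

# np.frombuffer aliases the bytearray -- no copy is made
raw = bytearray([1, 2, 3, 4])
a = np.frombuffer(raw, dtype=np.uint8)
raw[0] = 99            # writing the buffer shows through the array

# out= writes a ufunc result into a pre-allocated array
x = np.array([0.0, 1.0, 2.0])
out = np.zeros(3)
np.sin(x, out=out)     # no fresh allocation for the result
```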
8.3.9.8. Watch out for boolean masks in tight loops
Boolean indexing and np.where necessarily produce a new array on
each call (the size of the result depends on the data, so they
cannot reuse a pre-allocated buffer). If you do this many times in a
row, the heap fills up with throwaway boolean arrays. Periodically
calling gc.collect() keeps fragmentation under control:
import gc

for i in range(1000):
    mask = a < threshold
    selected = a[mask]
    # ... use selected ...
    if i % 100 == 0:
        gc.collect()
The same caveat applies to compound boolean expressions like
(a > lo) & (a < hi) – each operator allocates a new bool array.
8.3.9.9. Use views, not copies, where you can
Slicing produces a view that shares data with the source – no
allocation, no copy. np.frombuffer, a.reshape((...)),
a[::2], and column / row indexing all return views when
possible. Reach for .copy() only when you genuinely need an
independent buffer.
You can confirm two arrays share data by comparing the data
pointer line printed by np.ndinfo:
a = np.arange(10, dtype=np.uint8)
np.ndinfo(a) # data pointer: 0x...
np.ndinfo(a[::2]) # same data pointer
8.3.9.10. Build the result, do not append to it
ndarray has no append – and that is on purpose. Growing an
array is implemented (in CPython numpy) by allocating a new,
larger buffer and copying. On a microcontroller you can almost
always pre-allocate the final size and fill it:
out = np.zeros(N, dtype=np.float)
for i in range(N):
    out[i] = some_calculation(i)
If you genuinely don’t know N in advance, write to a Python
list and convert once at the end with np.array(list).
8.3.9.11. Where to go next
Advanced patterns – broadcasting internals and more advanced patterns.
Utilities – buffer conversions and spectrogram.
numpy — numpy-compatible array operations – numpy API reference.
scipy — subset of scipy via ulab – scipy API reference for filters, optimisation, integration and special functions.
numpy broadcasting basics – formal description of the broadcasting rules ulab follows.