8.3.10. Advanced patterns
This page is a collection of Python-level patterns for using ulab
effectively on a microcontroller. It assumes you already know the
basics from Getting started with numpy and Working with ndarrays.
8.3.10.1. Memory layout: the last axis is the fast axis
Inside an ndarray the data lives in a single contiguous chunk of
memory. The shape and the strides describe how that linear chunk is
read out as a tensor. ulab always loops innermost over the
last axis: for a (2, 1000) array it walks a[0, 0],
a[0, 1], …, a[0, 999], then jumps to a[1, 0].
Two consequences:
Long axes belong on the right. A (2, 1000) array iterates faster than a (1000, 2) array even though they hold the same values.
Transposes are essentially free, because they only flip strides.
If you have control over how the data is laid out (e.g. you decide how to reshape an input buffer), put the long axis last.
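The relationship between shape, strides and the linear buffer can be sketched in plain Python. This is an illustration under assumptions, not ulab's actual implementation: the helper names `c_strides` and `byte_offset` are hypothetical, and the 4-byte item size is an assumed float width.

```python
def c_strides(shape, itemsize):
    """Byte strides of a contiguous, row-major array with this shape."""
    strides = []
    step = itemsize
    for size in reversed(shape):   # start at the last (fastest) axis
        strides.append(step)
        step *= size
    return tuple(reversed(strides))


def byte_offset(index, strides):
    """Byte offset of one element inside the linear data buffer."""
    return sum(i * s for i, s in zip(index, strides))


wide = c_strides((2, 1000), itemsize=4)   # (4000, 4)
tall = c_strides((1000, 2), itemsize=4)   # (8, 4)

# Stepping along the last axis moves by one item; stepping along the
# first axis of the wide array jumps over a whole row.
last_of_row_0 = byte_offset((0, 999), wide)   # 3996
first_of_row_1 = byte_offset((1, 0), wide)    # 4000
```

In both layouts the last axis has the smallest stride. The wide array runs its inner loop 1000 times per outer step, while the tall array restarts the inner loop after only two elements.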
8.3.10.2. What “view” really means
A view is a second ndarray header that points at the same data
buffer as the source. The np.ndinfo function shows you the data
pointer; for any view, that pointer is the same as the source’s:
from ulab import numpy as np
a = np.arange(10, dtype=np.uint8)
np.ndinfo(a) # data pointer: 0x...
np.ndinfo(a[::2]) # data pointer: 0x... (SAME)
This is why writing to a view changes the source. It is also why
slicing is essentially free, even on large arrays – only a few
bytes of header are allocated. np.frombuffer, a[::2],
m[:, 0], a.reshape(...) and a.T all behave this way.
If you want an independent buffer (e.g. so that further mutations
don’t disturb the source), call .copy().
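The same view-versus-copy behaviour can be tried without ulab, using the standard library's `memoryview` as a stand-in for a second header over one buffer. This is only an analogy: `memoryview` is not an ndarray, and the variable names are illustrative.

```python
buf = bytearray(range(10))      # owns the data, like the source array
view = memoryview(buf)[::2]     # strided view; no bytes are copied

view[0] = 99                    # write through the view ...
source_first = buf[0]           # ... and the source sees 99

snapshot = view.tobytes()       # independent copy, in the spirit of .copy()
view[1] = 77                    # element 1 of the view is buf[2]
copy_second = snapshot[1]       # the copy still holds the old value, 2
```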
8.3.10.3. Broadcasting in detail
When two arrays appear as the operands of a binary operator
(a + b, np.arctan2(a, b), a < b, …), ulab applies
numpy’s broadcasting rules. The rules are:
If the two arrays have a different rank, virtually prepend size-1 axes to the lower-rank array until the ranks match.
Along each axis, the two sizes must be equal, or one of them must be 1. A size-1 axis is virtually stretched to match the other side.
If those two rules cannot be satisfied, you get
ValueError("operands could not be broadcast together").
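The two rules can be written down as a small shape calculation. The sketch below is plain Python and works on shape tuples only. `broadcast_shape` is a hypothetical helper for illustration, not a ulab function.

```python
def broadcast_shape(shape_a, shape_b):
    """Result shape of broadcasting two shapes, or ValueError."""
    rank = max(len(shape_a), len(shape_b))
    # Rule 1: virtually prepend size-1 axes to the lower-rank shape.
    a = (1,) * (rank - len(shape_a)) + tuple(shape_a)
    b = (1,) * (rank - len(shape_b)) + tuple(shape_b)
    result = []
    # Rule 2: sizes must match, or one of them must be 1.
    for size_a, size_b in zip(a, b):
        if size_a == size_b or size_b == 1:
            result.append(size_a)
        elif size_a == 1:
            result.append(size_b)
        else:
            raise ValueError("operands could not be broadcast together")
    return tuple(result)


centred_shape = broadcast_shape((3, 4), (4,))    # (3, 4)
pairwise_shape = broadcast_shape((4, 1), (3,))   # (4, 3)
```

Calling `broadcast_shape((3, 4), (3,))` raises `ValueError`, because the trailing axes (4 and 3) differ and neither is 1.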
Internally, ulab implements broadcasting by setting the stride
of any size-1 axis to zero. That way the iterator advances normally
for the other operand but never moves the data pointer for the
broadcast axis. There is no actual stretching of data and no
allocation involved.
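The zero-stride idea can be modelled in a few lines of plain Python. This is a sketch of the principle described above, not ulab's code. `broadcast_strides` and `read` are hypothetical names, strides are counted in elements rather than bytes, and the shapes are assumed to have already passed the broadcasting rules.

```python
def broadcast_strides(shape, strides, target_rank):
    """Element strides for reading `shape` data as a broadcast operand."""
    pad = target_rank - len(shape)
    shape = (1,) * pad + tuple(shape)      # virtual leading axes
    strides = (0,) * pad + tuple(strides)
    # A size-1 axis gets stride 0: the index advances, the pointer does not.
    return tuple(0 if size == 1 else stride
                 for size, stride in zip(shape, strides))


def read(data, strides, index):
    """Fetch one element from the flat buffer using the given strides."""
    return data[sum(i * s for i, s in zip(index, strides))]


col_means = [10.0, 20.0, 30.0, 40.0]                  # shape (4,), stride 1
strides = broadcast_strides((4,), (1,), target_rank=2)   # (0, 1)

# Reading it as a (3, 4) operand revisits the same four values per row.
rows = [[read(col_means, strides, (i, j)) for j in range(4)]
        for i in range(3)]
```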
Example – centring a 2-D array’s columns:
m = np.arange(12, dtype=np.float).reshape((3, 4))
col_means = np.array([np.mean(m[:, j]) for j in range(4)])
centred = m - col_means # (3, 4) - (4,) -> (3, 4)
Example – pairwise sums across two 1-D arrays:
x = np.array([1, 2, 3, 4]).reshape((4, 1)) # column
y = np.array([10, 20, 30]) # row
x + y # (4, 3) matrix
8.3.10.4. Reductions and the axis argument
Reductions like np.sum, np.mean, np.std, np.min,
np.max, np.median, np.argmin and np.argmax all take
an optional axis= argument. Without it, the reduction is over
all elements; with it, the named axis is contracted out.
m = np.arange(12, dtype=np.float).reshape((3, 4))
np.sum(m) # 66.0 -- scalar
np.sum(m, axis=0) # length-4 vector (column sums)
np.sum(m, axis=1) # length-3 vector (row sums)
Internally, the reduction places the named axis in the innermost loop and walks every other axis with the outer loops. Combined with the “last axis is fast” rule above, this means reducing along the last axis is the cheapest case.
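What "contracting an axis" means can be checked on nested lists in plain Python. The function `sum_axis` is a hypothetical stand-in that handles only 2-D input; it illustrates the result shapes and does not reproduce ulab's loop structure.

```python
def sum_axis(rows, axis=None):
    """Sum a 2-D nested list over all elements, columns (0) or rows (1)."""
    if axis is None:
        return sum(sum(row) for row in rows)
    if axis == 0:
        # Contract the first axis: one result per column.
        return [sum(column) for column in zip(*rows)]
    if axis == 1:
        # Contract the last axis: one result per row.
        return [sum(row) for row in rows]
    raise ValueError("axis must be None, 0 or 1")


# Same values as np.arange(12).reshape((3, 4)).
m = [[float(4 * i + j) for j in range(4)] for i in range(3)]

total = sum_axis(m)               # 66.0
column_sums = sum_axis(m, axis=0)   # [12.0, 15.0, 18.0, 21.0]
row_sums = sum_axis(m, axis=1)      # [6.0, 22.0, 38.0]
```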
8.3.10.5. Avoiding allocation: pre-allocate and write
The single biggest performance pitfall on a microcontroller is
allocation in the hot loop. Each new ndarray is a heap
allocation, and frequent allocations fragment the heap.
Strategies, roughly in order of effectiveness:
Pre-allocate the output buffer once, write into it. Most universal functions accept out=; image.Image.to_ndarray accepts buffer=; utils.spectrogram accepts out= and scratchpad=:
x = np.linspace(0, 2*np.pi, num=512)
y = np.zeros(512) # allocate once
while True:
    np.sin(x, out=y) # no allocation
    # ... use y ...
Use in-place operators. b += 1 is allocation-free; b = b + 1 allocates a temporary, copies, and reassigns:
# makes a temporary the size of b
b = b + 1
# no temporary
b += 1
Decompose compound expressions. Each operator produces a temporary. Splitting a complicated expression into several simple ones, each writing into a slice of a pre-allocated buffer, lets you skip the temporaries:
# one temporary for `a + b`, another for `* 2`
out = (a + b) * 2
# zero temporaries
out[:] = a
out += b
out *= 2
Slice-assign instead of building new arrays. Many “build a new array from pieces” patterns can be expressed as a single pre-allocated array with several slice assignments. See the up-scaling example in Tips, tricks and broadcasting.
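As a rough illustration of the slice-assignment strategy, the sketch below doubles the length of a 1-D signal by writing the source into the even and odd positions of a buffer that was allocated once. It uses `bytearray` so that it runs without ulab. The helper `upscale_2x_into` is hypothetical and is not necessarily the up-scaling example referred to above.

```python
def upscale_2x_into(out, src):
    """Repeat every element of `src` twice, writing into `out` in place."""
    out[0::2] = src   # even positions
    out[1::2] = src   # odd positions


src = bytearray([10, 20, 30])
out = bytearray(2 * len(src))   # allocated once, outside any loop

upscale_2x_into(out, src)
upscaled = list(out)            # [10, 10, 20, 20, 30, 30]
```

The function only uses strided slice assignment, so the same two lines should carry over to ndarrays; that is an assumption to verify on the target firmware.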
8.3.10.6. Boolean masks: be careful in tight loops
A boolean mask (a < threshold) is a real array on the heap, of
size equal to a. Building masks in a hot loop – in particular
combining them with & / | – spawns lots of throwaway
arrays.
If a mask is reused, build it once and keep it:
mask = a < threshold
foo[mask] = 0
bar[mask] = 1
If a mask depends on values that change every iteration,
reallocating it is unavoidable. A periodic gc.collect() reclaims
the discarded masks and helps keep the heap from fragmenting.
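One way to arrange a periodic collection is sketched below. It is plain Python: the list comprehension stands in for `frame < threshold`, and both the function `count_below` and the interval `GC_EVERY` are hypothetical and need tuning for a real application.

```python
import gc

GC_EVERY = 32   # hypothetical interval, in loop iterations


def count_below(frames, threshold):
    """Count elements below `threshold`, collecting garbage periodically."""
    total = 0
    for i, frame in enumerate(frames, 1):
        # A fresh mask per iteration: this allocation cannot be avoided
        # when the data changes every time.
        mask = [value < threshold for value in frame]
        total += sum(mask)
        if i % GC_EVERY == 0:
            gc.collect()   # reclaim the throwaway masks
    return total


below = count_below([[1, 5, 9], [2, 8, 3]], threshold=4)   # 3
```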
8.3.10.7. Building from raw buffers
Three useful building blocks:
np.frombuffer(buf, dtype=...) – view a bytes-like buffer as an ndarray of the given dtype, no copy. Use this for buffers filled by peripherals (UART, SPI, ADC, audio).
a.tobytes() – get the raw bytes underlying a dense ndarray (sliced views raise ValueError). Useful when you want to ship pixels out over a transport.
a.byteswap() / a.byteswap(inplace=True) – flip endianness for multi-byte dtypes. Use when the peripheral that produced the buffer disagrees with the MCU on byte order.
For non-native dtypes (e.g. 32-bit ADC samples) the utils module
has converters: see Utilities.
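The three building blocks have close analogues in CPython's standard library, which makes the byte-order issue easy to try on a desktop. The sketch below is an analogy only: it uses `memoryview.cast` and `array.array`, not ulab, and it assumes the usual 2-byte width for the "H" type code.

```python
import sys
from array import array

# Two big-endian uint16 samples, as a peripheral might deliver them.
raw = bytearray(b"\x01\x02\x03\x04")

# Zero-copy reinterpretation, in the spirit of np.frombuffer.
view = memoryview(raw).cast("H")
same_memory = view.nbytes == len(raw)

# Copy into a typed array, then fix the byte order if the host is
# little-endian (the analogue of a.byteswap()).
samples = array("H")
samples.frombytes(bytes(raw))
if sys.byteorder == "little":
    samples.byteswap()

decoded = samples.tolist()   # [258, 772], i.e. 0x0102 and 0x0304
```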
8.3.10.8. Handling complex output safely
A few universal functions can produce complex results – the obvious
ones are sqrt of a negative number and exp of a complex
input. ulab does not silently widen the dtype. If your input is
real but the result might be complex, you have to ask for the
complex output explicitly:
a = np.array([1, -1])
np.sqrt(a) # array([1.0, nan]) (NaN, real out)
np.sqrt(a, dtype=np.complex) # array([1+0j, 0+1j], complex out)
This is a common footgun in numerical code: if you see
NaN where you expected an imaginary number, you probably forgot
dtype=np.complex.
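The real-versus-complex distinction can be reproduced with the standard `math` and `cmath` modules. This is an analogy rather than ulab code: `real_sqrt` is a hypothetical helper that returns NaN for negative input to mirror the real-output case, and `cmath.sqrt` plays the role of the explicit complex request.

```python
import cmath
import math


def real_sqrt(x):
    """Real-only square root; NaN where the true result is imaginary."""
    return math.sqrt(x) if x >= 0 else float("nan")


values = [1, -1]

real_out = [real_sqrt(v) for v in values]        # second entry is NaN
complex_out = [cmath.sqrt(v) for v in values]    # second entry is 1j

# A cheap check for the footgun: did a real-only result pick up NaNs?
lost_imaginary = any(math.isnan(v) for v in real_out)
```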
8.3.10.9. Where to go next
Tips, tricks and broadcasting – short-form list of practical tips.
Utilities – buffer conversions and spectrogram.
numpy: numpy-compatible array operations – complete API reference.