8.3.10. Advanced patterns

This page is a collection of Python-level patterns for using ulab effectively on a microcontroller. It assumes you already know the basics from Getting started with numpy and Working with ndarrays.

8.3.10.1. Memory layout: the last axis is the fast axis

Inside an ndarray the data lives in a single contiguous chunk of memory. The shape and the strides describe how that linear chunk is read out as a tensor. ulab always loops innermost over the last axis: for a (2, 1000) array it walks a[0, 0], a[0, 1], …, a[0, 999], then jumps to a[1, 0].

Two consequences:

  • Long axes belong on the right. A (2, 1000) array iterates faster than a (1000, 2) array even though they hold the same values.

  • Transposes are essentially free, because they only flip strides.

If you have control over how the data is laid out (e.g. you decide how to reshape an input buffer), put the long axis last.

8.3.10.2. What “view” really means

A view is a second ndarray header that points at the same data buffer as the source. The np.ndinfo function shows you the data pointer; for any view, that pointer is the same as the source’s:

from ulab import numpy as np

a = np.arange(10, dtype=np.uint8)
np.ndinfo(a)            # data pointer: 0x...
np.ndinfo(a[::2])       # data pointer: 0x...   (SAME)

This is why writing to a view changes the source. It is also why slicing is essentially free, even on large arrays – only a few bytes of header are allocated. np.frombuffer, a[::2], m[:, 0], a.reshape(...) and a.T all behave this way.

If you want an independent buffer (e.g. so that further mutations don’t disturb the source), call .copy().

8.3.10.3. Broadcasting in detail

When two arrays appear as the operands of a binary operator (a + b, np.arctan2(a, b), a < b, …), ulab applies numpy’s broadcasting rules. The rules are:

  1. If the two arrays have a different rank, virtually prepend size-1 axes to the lower-rank array until the ranks match.

  2. Along each axis, the two sizes must be equal, or one of them must be 1. A size-1 axis is virtually stretched to match the other side.

If those two rules cannot be satisfied, you get ValueError("operands could not be broadcast together").

Internally, ulab implements broadcasting by setting the stride of any size-1 axis to zero. That way the iterator advances normally for the other operand but never moves the data pointer for the broadcast axis. There is no actual stretching of data and no allocation involved.

Example – centring a 2-D array’s columns:

m = np.arange(12, dtype=np.float).reshape((3, 4))
col_means = np.array([np.mean(m[:, j]) for j in range(4)])
centred = m - col_means          # (3, 4) - (4,) -> (3, 4)

Example – pairwise sums across two 1-D arrays:

x = np.array([1, 2, 3, 4]).reshape((4, 1))     # column
y = np.array([10, 20, 30])                     # row
x + y                                           # (4, 3) matrix

8.3.10.4. Reductions and the axis argument

Reductions like np.sum, np.mean, np.std, np.min, np.max, np.median, np.argmin and np.argmax all take an optional axis= argument. Without it, the reduction is over all elements; with it, the named axis is contracted out.

m = np.arange(12, dtype=np.float).reshape((3, 4))

np.sum(m)               # 66.0  -- scalar
np.sum(m, axis=0)       # length-4 vector (column sums)
np.sum(m, axis=1)       # length-3 vector (row sums)

Internally, the reduction places the named axis in the innermost loop and walks every other axis with the outer loops. Combined with the “last axis is fast” rule above, this means reducing along the last axis is the cheapest case.

8.3.10.5. Avoiding allocation: pre-allocate and write

The single biggest performance pitfall on a microcontroller is allocation in the hot loop. Each new ndarray is a heap allocation, and frequent allocations fragment the heap.

Strategies, roughly in order of effectiveness:

  1. Pre-allocate the output buffer once, write into it. Most universal functions accept out=; image.Image.to_ndarray accepts buffer=; utils.spectrogram accepts out= and scratchpad=:

    x   = np.linspace(0, 2*np.pi, num=512)
    y   = np.zeros(512)            # allocate once
    while True:
        np.sin(x, out=y)           # no allocation
        # ... use y ...
    
  2. Use in-place operators. b += 1 is allocation-free; b = b + 1 allocates a temporary, copies, and reassigns:

    # makes a temporary the size of b
    b = b + 1
    
    # no temporary
    b += 1
    
  3. Decompose compound expressions. Each operator produces a temporary. Splitting a complicated expression into several simple ones, each writing into a slice of a pre-allocated buffer, lets you skip the temporaries:

    # one temporary for `a + b`, another for `* 2`
    out = (a + b) * 2
    
    # zero temporaries
    out[:] = a
    out += b
    out *= 2
    
  4. Slice-assign instead of building new arrays. Many “build a new array from pieces” patterns can be expressed as a single pre-allocated array with several slice assignments. See the up-scaling example in Tips, tricks and broadcasting.

8.3.10.6. Boolean masks: be careful in tight loops

A boolean mask (a < threshold) is a real array on the heap, of size equal to a. Building masks in a hot loop – in particular combining them with & / | – spawns lots of throwaway arrays.

If a mask is reused, build it once and keep it:

mask = a < threshold
foo[mask] = 0
bar[mask] = 1

If a mask depends on values that change every iteration, it is unavoidable that you reallocate. A periodic gc.collect() keeps the heap from fragmenting.

8.3.10.7. Building from raw buffers

Three useful building blocks:

  • np.frombuffer(buf, dtype=...) – view a bytes-like buffer as an ndarray of the given dtype, no copy. Use this for in-memory peripherals (UART, SPI, ADC, audio buffers).

  • a.tobytes() – get the raw bytes underlying a dense ndarray (sliced views raise ValueError). Useful when you want to ship pixels out over a transport.

  • a.byteswap() / a.byteswap(inplace=True) – flip endianness for multi-byte dtypes. Use when the peripheral that produced the buffer disagrees with the MCU on byte order.

For non-native dtypes (e.g. 32-bit ADC samples) the utils module has converters: see Utilities.

8.3.10.8. Handling complex output safely

A few universal functions can produce complex results – the obvious ones are sqrt of a negative number and exp of a complex input. ulab does not silently widen the dtype. If your input is real but the result might be complex, you have to ask for the complex output explicitly:

a = np.array([1, -1])
np.sqrt(a)                       # array([1.0, nan])  (NaN, real out)
np.sqrt(a, dtype=np.complex)     # array([1+0j, 0+1j], complex out)

This is most often a footgun in numerical code – if you see NaN where you expected an imaginary number, you forgot dtype=np.complex.

8.3.10.9. Where to go next