Tips, tricks and broadcasting
=============================

This page collects techniques that help you get the most out of ``ulab``
on a memory-constrained, modestly-clocked microcontroller. Hardware
varies a lot, so always measure on the device you actually target. The
general rules outlined here are useful starting points, but a particular
code snippet that is fast on one MCU may not gain the same factor on
another -- some MCUs have no FPU, others have very fast caches, and so
on.

Use an ndarray when you can
---------------------------

Many ``ulab`` reductions accept either an iterable or an ``ndarray``::

    from ulab import numpy as np

    np.sum([1, 2, 3, 4, 5])            # works, but slow
    np.sum(np.array([1, 2, 3, 4, 5]))  # ~3x faster

Iterables force the interpreter to fetch each Python object, convert it
to a C numeric type, and accumulate it. With an ``ndarray``, the C type
is already known and the inner loop is a tight, type-specialised
``for``. Compared to a *pure-Python* implementation of the same
reduction, the speedup is typically 30-50x.

Counter-tip: if the data only exists as a Python list and you call the
reduction once, do *not* convert to ``ndarray`` first -- the constructor
itself has to walk the list and allocate. The conversion only pays off
when you use the array more than once.

Pick a reasonable dtype
-----------------------

The default ``dtype`` is ``float`` (4 or 8 bytes per element). For data
that is naturally 8-bit or 16-bit (ADC samples, image pixels, sensor
readings) use ``np.uint8``, ``np.int8``, ``np.uint16`` or ``np.int16``
instead. You save 4-8x RAM and the inner loops are faster. ::

    adc = np.array(adc_samples, dtype=np.uint16)  # not float!

Be aware of upcasting and overflow rules. Operations on two arrays of
the same integer dtype keep that dtype, even when results overflow::

    a = np.array([200, 200], dtype=np.uint8)
    b = np.array([100, 100], dtype=np.uint8)
    print(a + b)  # array([44, 44], dtype=uint8) -- wraps!
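The wraparound is ordinary modular arithmetic: a ``uint8`` slot keeps
the result modulo 256, so 200 + 100 = 300 is stored as 300 % 256 = 44.
You can verify the numbers on a desktop with CPython ``numpy`` (used
here only as a stand-in -- it follows the same same-dtype overflow rule
for arrays):

```python
import numpy as np

# Two uint8 arrays; the element-wise sum 200 + 100 = 300
# exceeds 255, so the stored value is 300 % 256 = 44.
a = np.array([200, 200], dtype=np.uint8)
b = np.array([100, 100], dtype=np.uint8)
s = a + b

print(s.tolist(), s.dtype)  # [44, 44] uint8 -- wrapped, still uint8
```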
If you need a wider intermediate, cast first::

    c = np.array(a, dtype=np.uint16) + b

Broadcasting
------------

Binary operators do not require shape equality -- they *broadcast*. The
rules (which match ``numpy``) are:

1. If the two operands have different rank, virtually prepend axes of
   size 1 to the smaller one until the ranks match.
2. Along each axis, the two sizes must be equal, or one of them must be
   1. A size-1 axis is virtually stretched to match the other side.

If those two rules cannot be satisfied, you get a
``ValueError("operands could not be broadcast together")``.

Add a scalar to every element::

    a = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float)
    a + 10  # adds 10 everywhere

Add a row vector to every row of a matrix::

    row = np.array([100, 200, 300], dtype=np.float)
    a + row  # (2, 3) + (3,) -> (2, 3)

Subtract a column mean from each column::

    means = np.array([np.mean(a[:, i]) for i in range(3)])
    a - means

Broadcasting also works through universal functions that accept two
arrays, e.g. ``np.arctan2``::

    y = np.array([1, 2, 3, 4])
    np.arctan2(y, 1.0)
    np.arctan2(1.0, y)
    np.arctan2(y, y)

For more on the internals (how ``ulab`` rewrites strides to make
broadcasting work) see :doc:`programming`.

Comparison-operator side rule
-----------------------------

In ``ulab``, the ``ndarray`` *must* be on the **left** of a relational
operator with a scalar. ``a > 2`` works; ``2 < a`` raises
``TypeError``. If you need the symmetric form, use ``np.less(2, a)`` /
``np.greater(2, a)`` etc.

Beware the axis (memory layout matters)
---------------------------------------

``ulab`` always loops innermost over the *last* axis of an array. That
means an array shaped ``(2, 1000)`` is much faster to iterate than an
array shaped ``(1000, 2)`` holding the same data, because the long axis
lines up with the inner loop.
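If you want to quantify the layout effect on your own board, a small
timing harness helps. The sketch below runs on a desktop with CPython
``numpy`` and ``time.perf_counter`` (both stand-ins -- on a MicroPython
target you would import ``ulab`` and time with ``time.ticks_us`` /
``time.ticks_diff``; desktop ``numpy`` may not show a gap, since it is
free to reorder its loops):

```python
# Timing sketch: sum the same 2000 values laid out two ways.
# On-device, replace the import with `from ulab import numpy as np`
# and perf_counter with time.ticks_us / time.ticks_diff.
import time
import numpy as np

a = np.array(range(2000)).reshape((2, 1000))   # long axis last
b = np.array(range(2000)).reshape((1000, 2))   # long axis first

def bench(arr, repeats=100):
    # Wall-clock time for `repeats` full reductions over `arr`.
    t0 = time.perf_counter()
    for _ in range(repeats):
        np.sum(arr)
    return time.perf_counter() - t0

t_fast = bench(a)
t_slow = bench(b)
print('(2, 1000):', t_fast, 's   (1000, 2):', t_slow, 's')

# The totals are identical either way; only the loop time differs.
assert np.sum(a) == np.sum(b)
```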
When you have control over how data is laid out, put the long axis
last::

    a = np.array(range(2000)).reshape((2, 1000))  # fast
    b = np.array(range(2000)).reshape((1000, 2))  # slower

In ``numpy`` the longest axis is sometimes called the "fast axis", and
``numpy`` is allowed to permute its loops to make the longest axis
innermost. ``ulab`` does *not* do this -- the order is fixed by the
strides. If you find yourself with the wrong layout, ``a.transpose()``
/ ``a.T`` is cheap (it only flips strides; no data is copied).

Reduce intermediate arrays
--------------------------

Compound expressions like ``a + b * c`` allocate a *temporary* array
for ``b * c``. On a microcontroller these temporaries cost both
allocation time and memory fragmentation. When a calculation appears in
a hot loop, prefer in-place operators::

    # makes 1 temporary
    b = (a + 1) * 2

    # no temporary
    b = np.array(a)
    b += 1
    b *= 2

The same idea applies to slice assignment. Instead of one big
expression, write out simple sub-assignments that target slices of a
pre-allocated output. The classic example is interpolation by 2 over a
small array::

    # Linear interpolation: place originals at even indices, mid-points
    # at odd indices, with no temporary arrays.
    a = np.array([0, 10, 2, 20, 4], dtype=np.uint8)
    b = np.zeros(2 * len(a) - 1, dtype=np.uint8)
    b[::2] = a
    b[1::2] = a[:-1]
    b[1::2] += a[1:]
    # divide by 2 if you want the actual average

The compound version, ``b[1::2] = (a[:-1] + a[1:]) // 2``, would
allocate a temporary the size of ``a[:-1] + a[1:]`` plus another for
the division. The slice assignments above touch only the two views
``a[:-1]`` and ``a[1:]``, which share their data with ``a``.

Up-scaling images with slice assignment
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The same idea generalises to two dimensions, which makes it useful for
upscaling small images (think 8x8 thermal cameras for a human-friendly
preview).
To double the resolution, place originals on even (row, column) cells
and fill in averages on odd cells::

    # a is, say, 8x8; b is 15x15
    b = np.zeros((15, 15), dtype=np.uint8)

    # originals go to the even-(row, column) cells
    b[::2, ::2] = a

    # odd rows (even columns) get vertical averages
    b[1::2, ::2] = a[:-1, :]
    b[1::2, ::2] += a[1:, :]
    b[1::2, ::2] //= 2

    # odd columns get horizontal averages
    b[:, 1::2] = b[:, :-1:2]
    b[:, 1::2] += b[:, 2::2]
    b[:, 1::2] //= 2

Going larger than 2x is the same pattern with more assignments. Note
that this technique stays inside a small dtype (``uint8``). Doing the
same with floats would burn 4-8x more RAM for no benefit if all you
want is a smoothed preview.

Reuse buffers across iterations
-------------------------------

Several APIs accept a pre-allocated buffer:

* :py:meth:`image.Image.to_ndarray` accepts ``buffer=`` so the
  conversion does not allocate every frame.
* The :py:class:`image.Image` constructor accepts ``buffer=`` for the
  same reason.
* ``np.frombuffer`` *aliases* an existing buffer, never copying.
* Universal functions (``np.exp``, ``np.sin``, ``np.sqrt``, ...) accept
  ``out=`` -- write into a pre-allocated float array.
* ``utils.spectrogram`` accepts ``out=`` and ``scratchpad=``.

In a hot loop, allocate once and reuse::

    import csi
    from ulab import numpy as np

    csi0 = csi.CSI()
    csi0.reset()
    csi0.pixformat(csi.GRAYSCALE)
    csi0.framesize(csi.QVGA)
    csi0.snapshot(time=2000)

    buf = bytearray(320 * 240)
    while True:
        img = csi0.snapshot()
        a = img.to_ndarray('B', buffer=buf)  # no per-frame alloc
        # ... process a ...

Watch out for boolean masks in tight loops
------------------------------------------

Boolean indexing and ``np.where`` necessarily produce a new array on
each call (the size of the result depends on the data, so they cannot
reuse a pre-allocated buffer). If you do this many times in a row, the
heap fills up with throwaway boolean arrays.
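To see why no pre-allocated output could work here, note that the
length of the result depends on the *values*, not on the shape. A quick
check, sketched with CPython ``numpy`` as a stand-in (boolean indexing
behaves the same way there):

```python
import numpy as np

a = np.array([3, 9, 1, 7, 5], dtype=np.uint8)

# Each boolean index allocates a fresh result array, and its
# length is only known once the data has been compared.
small = a[a < 4]   # only the elements below 4 survive
large = a[a < 8]   # a looser threshold keeps more elements

print(len(small), len(large))  # 2 4
```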
Periodically calling ``gc.collect()`` keeps fragmentation under
control::

    import gc

    for i in range(1000):
        mask = a < threshold
        selected = a[mask]
        if i % 100 == 0:
            gc.collect()

The same caveat applies to compound boolean expressions like
``(a > lo) & (a < hi)`` -- each operator allocates a new bool array.

Use views, not copies, where you can
------------------------------------

Slicing produces a view that shares data with the source -- no
allocation, no copy. ``np.frombuffer``, ``a.reshape((...))``,
``a[::2]``, and column / row indexing all return views when possible.
Reach for ``.copy()`` only when you genuinely need an independent
buffer.

You can confirm two arrays share data by comparing the *data pointer*
line printed by ``np.ndinfo``::

    a = np.arange(10, dtype=np.uint8)
    np.ndinfo(a)       # data pointer: 0x...
    np.ndinfo(a[::2])  # same data pointer

Build the result, do not append to it
-------------------------------------

``ndarray`` has no ``append`` -- and that is on purpose. Growing an
array is implemented (in CPython ``numpy``) by allocating a new, larger
buffer and copying. On a microcontroller you can almost always
pre-allocate the final size and *fill* it::

    out = np.zeros(N, dtype=np.float)
    for i in range(N):
        out[i] = some_calculation(i)

If you genuinely don't know ``N`` in advance, write to a Python
``list`` and convert once at the end with ``np.array(list)``.

Where to go next
----------------

* :doc:`programming` -- broadcasting internals and more advanced
  patterns.
* :doc:`utils` -- buffer conversions and ``spectrogram``.
* :doc:`/library/omv.ulab.numpy` -- ``numpy`` API reference.
* :doc:`/library/omv.ulab.scipy` -- ``scipy`` API reference for
  filters, optimisation, integration and special functions.
* `numpy broadcasting basics `_ -- formal description of the
  broadcasting rules ``ulab`` follows.