Performance
===========

The same design decisions that make :mod:`numpy` fast on the camera --
whole-array library calls, packed typed buffers, views that share data
with their source -- also expose a set of habits that are worth knowing
about. The :doc:`shape/shape-and-strides` page already covered the
last-axis layout rule; this page catalogues the allocation and dtype
habits that matter most in a streaming loop.

Pick a reasonable dtype
-----------------------

The default dtype of every constructor is ``float``. For data that is
naturally 8-bit or 16-bit -- ADC samples, image pixels, sensor readings --
pass ``dtype=`` explicitly to one of the integer types::

    adc = np.array(adc_samples, dtype=np.uint16)

The RAM saving is 4-8x. The math also runs faster because the integer
code paths inside :mod:`numpy` are tighter than the generic float ones.
The integer overflow rule covered on :doc:`basics/dtypes` applies -- cast
to a wider type before arithmetic that might overflow.

Use an ndarray when you can
---------------------------

Most reductions and universal functions accept either an iterable or an
:class:`~ulab.numpy.ndarray`::

    np.sum([1, 2, 3, 4, 5])              # works, but slow
    np.sum(np.array([1, 2, 3, 4, 5]))    # ~3x faster

The iterable form forces :mod:`numpy` to step through the input one
Python object at a time, converting each to a number before it can use
it. Against an :class:`~ulab.numpy.ndarray` the conversion is already
done and the call runs straight through the packed buffer.

When the same data is used more than once, build the
:class:`~ulab.numpy.ndarray` once and pass it around. When the data
exists only as a Python list and is consumed once, the conversion cost
can outweigh the speedup -- the :func:`~ulab.numpy.array` constructor
itself has to walk the list and allocate.

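The build-once advice can be sketched as follows. This is only an illustration: the ``readings`` values are made up, and the ``try``/``except`` import is an assumption that lets the same snippet run on desktop NumPy as well as on the camera.

```python
try:
    from ulab import numpy as np   # on the camera
except ImportError:
    import numpy as np             # desktop fallback (assumption, for testing only)

# Hypothetical sensor samples that arrive as a plain Python list.
readings = [3, 1, 4, 1, 5, 9, 2, 6]

# Convert once, with an explicit integer dtype.
samples = np.array(readings, dtype=np.uint16)

# Every later call runs over the packed buffer; the list is never re-walked.
total = np.sum(samples)
peak = np.max(samples)
low = np.min(samples)

# Bytes held by the data buffer: element count times bytes per element.
ram_bytes = samples.size * samples.itemsize
```

With three reductions over the same data, the list is converted once instead of three times.
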

Prefer views to copies
----------------------

Slicing, single-axis indexing of a higher-rank array,
:meth:`~ulab.numpy.ndarray.reshape`,
:meth:`~ulab.numpy.ndarray.transpose`, and
:func:`~ulab.numpy.frombuffer` all return *views* that share data with
the source. They are essentially free.

:meth:`~ulab.numpy.ndarray.copy`, :meth:`~ulab.numpy.ndarray.flatten`,
boolean indexing (``a[mask]``), and any arithmetic expression allocate a
*copy*. Reach for them only when an independent buffer is genuinely
needed.

When in doubt, :func:`~ulab.numpy.ndinfo` prints the location of the
underlying buffer; two arrays that report the same address share their
data. The complete view-vs-copy table is on
:doc:`shape/views-and-copies`.

Allocate once, then write
-------------------------

The single biggest performance pitfall on the camera is allocating fresh
arrays inside a loop that runs many times a second. Each new
:class:`~ulab.numpy.ndarray` asks the camera for RAM, and frequent fresh
allocations waste it.

Most universal functions accept ``out=`` so the result can be written
into an array that already exists::

    x = np.linspace(0, 2 * np.pi, num=512)
    y = np.zeros(512)                # allocate once

    while True:
        np.sin(x, out=y)
        # use y ...

:py:meth:`image.Image.to_ndarray` accepts ``buffer=`` for the same
reason; :func:`ulab.utils.spectrogram` and the
:func:`~ulab.utils.from_int32_buffer`-style converters accept both
``out=`` and ``scratchpad=``. Allocate everything once and reuse it.

Use in-place operators
----------------------

``b = b + 1`` allocates a temporary the size of ``b``, copies, and
re-assigns. ``b += 1`` modifies ``b`` directly::

    # makes a temporary
    b = b + 1

    # no temporary
    b += 1

The same idea applies to compound expressions. ``a + b * c`` allocates a
temporary for ``b * c``.
Splitting the expression into simple sub-assignments writing into a
pre-allocated buffer eliminates the temporaries::

    # one temporary for (a + b), another for the ``* 2``
    out = (a + b) * 2

    # zero temporaries
    out[:] = a
    out += b
    out *= 2

Build the result, do not append to it
-------------------------------------

:class:`~ulab.numpy.ndarray` has no ``append`` -- on purpose. Growing an
array would mean allocating a fresh, larger buffer and copying the old
contents into it. On a microcontroller, pre-allocate the final size and
*fill* it::

    out = np.zeros(N, dtype=np.float)
    for i in range(N):
        out[i] = some_calculation(i)

When ``N`` genuinely is not known in advance, write to a Python
:class:`list` and convert once at the end with
:func:`~ulab.numpy.array`.

Slice assignment instead of new arrays
--------------------------------------

Many "build a new array from pieces" patterns can be expressed as slice
assignments into a pre-allocated buffer. The classic example is linear
interpolation by 2::

    # Originals at even indices; midpoints at odd indices.
    a = np.array([0, 10, 2, 20, 4], dtype=np.uint8)
    b = np.zeros(2 * len(a) - 1, dtype=np.uint8)
    b[::2]  = a
    b[1::2] = a[:-1]
    b[1::2] += a[1:]
    # divide by 2 if you want the average

The compound form ``b[1::2] = (a[:-1] + a[1:]) // 2`` would allocate a
temporary the size of ``a[:-1] + a[1:]`` plus another for the division.
The four-line version above touches only views.

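For reference, the slice-assignment interpolation can be wrapped in a small helper that includes the final division. The name ``interpolate_by_2`` is hypothetical and not part of the library, and the import fallback to desktop NumPy is an assumption for running the sketch off-device. Because the sum of two neighbours is formed in the output dtype, a ``uint8`` buffer is only safe while each pair of neighbours sums to at most 255.

```python
try:
    from ulab import numpy as np   # on the camera
except ImportError:
    import numpy as np             # desktop fallback (assumption, for testing only)


def interpolate_by_2(a, out):
    """Write ``a`` plus the midpoints between its neighbours into ``out``.

    ``out`` must already exist and have length ``2 * len(a) - 1``.
    Hypothetical helper for illustration.
    """
    out[::2] = a          # originals at even indices
    out[1::2] = a[:-1]    # left neighbour at odd indices
    out[1::2] += a[1:]    # add the right neighbour in place
    out[1::2] //= 2       # integer average of the two neighbours
    return out


a = np.array([0, 10, 2, 20, 4], dtype=np.uint8)
b = np.zeros(2 * len(a) - 1, dtype=np.uint8)   # allocate once
interpolate_by_2(a, b)
# b now holds 0, 5, 10, 6, 2, 11, 20, 12, 4
```

In a streaming loop, ``b`` would be allocated once outside the loop and passed to the helper on every iteration.
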
The same idea generalises to two dimensions, which makes it the right
tool for upscaling small images -- think 8x8 thermal sensors that need a
human-friendly preview::

    # ``a`` is 8x8; ``b`` is 15x15
    b = np.zeros((15, 15), dtype=np.uint8)

    b[::2, ::2]   = a
    b[1::2, ::2]  = a[:-1, :]
    b[1::2, ::2] += a[1:, :]
    b[1::2, ::2] //= 2

    b[:, 1::2]  = b[:, :-1:2]
    b[:, 1::2] += b[:, 2::2]
    b[:, 1::2] //= 2

Watch out for boolean masks in streaming loops
----------------------------------------------

Boolean indexing and :func:`~ulab.numpy.where` produce a new array on
each call -- the size of the result depends on the data, so no
pre-allocated buffer can absorb the allocation. Repeated mask building in
a streaming loop fills RAM with throwaway arrays. A periodic
``gc.collect()`` reclaims the space::

    import gc

    for i in range(1000):
        mask = a < threshold
        _ = a[mask]
        if i % 100 == 0:
            gc.collect()

The same caveat applies to compound boolean expressions like
``(a > lo) & (a < hi)`` -- each operator allocates a new bool array.

When a mask is reused, build it once and keep it::

    mask = a < threshold
    foo[mask] = 0
    bar[mask] = 1
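
A compound mask can be reused in the same way. The sketch below uses made-up values for ``a``, ``lo``, and ``hi``, and falls back to desktop NumPy when ``ulab`` is not importable (an assumption for running it off-device). The three mask allocations (two comparisons and the ``&``) happen once, and both assignments then share the result.

```python
try:
    from ulab import numpy as np   # on the camera
except ImportError:
    import numpy as np             # desktop fallback (assumption, for testing only)

a = np.array([1, 5, 9, 3, 7], dtype=np.uint8)   # hypothetical samples
lo, hi = 2, 8                                   # hypothetical band limits

# Build the band mask once: True where lo < a < hi.
mask = (a > lo) & (a < hi)

foo = np.zeros(len(a), dtype=np.uint8)
bar = a.copy()

# Reuse the same mask for both writes; no further bool arrays are created.
foo[mask] = 1   # flag the in-band samples
bar[mask] = 0   # blank the in-band samples, keep the rest
```
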