Advanced patterns
=================

This page is a collection of Python-level patterns for using ``ulab``
effectively on a microcontroller. It assumes you already know the
basics from :doc:`getting-started` and :doc:`ndarray`.

Memory layout: the last axis is the fast axis
---------------------------------------------

Inside an ``ndarray`` the data lives in a single contiguous chunk of
memory. The shape and the strides describe how that linear chunk is
read out as a tensor. ``ulab`` always loops innermost over the
**last axis**: for a ``(2, 1000)`` array it walks ``a[0, 0]``,
``a[0, 1]``, ..., ``a[0, 999]``, then jumps to ``a[1, 0]``.

Two consequences:

* Long axes belong on the right. A ``(2, 1000)`` array iterates
  faster than a ``(1000, 2)`` array even though they hold the same
  values.
* Transposes are essentially free, because they only flip strides.

If you have control over how the data is laid out (e.g. you decide
how to reshape an input buffer), put the long axis last.

What "view" really means
------------------------

A view is a second ``ndarray`` header that points at the *same* data
buffer as the source. The ``np.ndinfo`` function shows you the data
pointer; for any view, that pointer is the same as the source's::

   from ulab import numpy as np

   a = np.arange(10, dtype=np.uint8)
   np.ndinfo(a)            # data pointer: 0x...
   np.ndinfo(a[::2])       # data pointer: 0x...   (SAME)

This is why writing to a view changes the source. It is also why
slicing is essentially free, even on large arrays -- only a few
bytes of header are allocated. ``np.frombuffer``, ``a[::2]``,
``m[:, 0]``, ``a.reshape(...)`` and ``a.T`` all behave this way.

If you want an independent buffer (e.g. so that further mutations
don't disturb the source), call ``.copy()``.

Broadcasting in detail
----------------------

When two arrays appear as the operands of a binary operator
(``a + b``, ``np.arctan2(a, b)``, ``a < b``, ...), ``ulab`` applies
``numpy``'s broadcasting rules. The rules are:

1. If the two arrays have a different rank, virtually prepend size-1
   axes to the lower-rank array until the ranks match.
2. Along each axis, the two sizes must be equal, or one of them must
   be 1. A size-1 axis is virtually stretched to match the other
   side.

If those two rules cannot be satisfied, you get
``ValueError("operands could not be broadcast together")``.

Internally, ``ulab`` implements broadcasting by setting the stride
of any size-1 axis to zero. That way the iterator advances normally
for the other operand but never moves the data pointer for the
broadcast axis. There is no actual stretching of data and no
allocation involved.

Example -- centring a 2-D array's columns::

   m = np.arange(12, dtype=np.float).reshape((3, 4))
   col_means = np.array([np.mean(m[:, j]) for j in range(4)])
   centred = m - col_means          # (3, 4) - (4,) -> (3, 4)

Example -- pairwise sums across two 1-D arrays::

   x = np.array([1, 2, 3, 4]).reshape((4, 1))     # column
   y = np.array([10, 20, 30])                     # row
   x + y                                           # (4, 3) matrix

Reductions and the ``axis`` argument
------------------------------------

Reductions like ``np.sum``, ``np.mean``, ``np.std``, ``np.min``,
``np.max``, ``np.median``, ``np.argmin`` and ``np.argmax`` all take
an optional ``axis=`` argument. Without it, the reduction is over
all elements; with it, the named axis is contracted out.

::

   m = np.arange(12, dtype=np.float).reshape((3, 4))

   np.sum(m)               # 66.0  -- scalar
   np.sum(m, axis=0)       # length-4 vector (column sums)
   np.sum(m, axis=1)       # length-3 vector (row sums)

Internally, the reduction places the named axis in the innermost
loop and walks every other axis with the outer loops. Combined with
the "last axis is fast" rule above, this means reducing along the
last axis is the cheapest case.

Avoiding allocation: pre-allocate and write
-------------------------------------------

The single biggest performance pitfall on a microcontroller is
allocation in the hot loop. Each new ``ndarray`` is a heap
allocation, and frequent allocations fragment the heap.

Strategies, roughly in order of effectiveness:

1. **Pre-allocate the output buffer once, write into it.** Most
   universal functions accept ``out=``; ``image.Image.to_ndarray``
   accepts ``buffer=``; ``utils.spectrogram`` accepts ``out=`` and
   ``scratchpad=``::

      x   = np.linspace(0, 2*np.pi, num=512)
      y   = np.zeros(512)            # allocate once
      while True:
          np.sin(x, out=y)           # no allocation
          # ... use y ...

2. **Use in-place operators.** ``b += 1`` is allocation-free; ``b =
   b + 1`` allocates a temporary, copies, and reassigns::

      # makes a temporary the size of b
      b = b + 1

      # no temporary
      b += 1

3. **Decompose compound expressions.** Each operator produces a
   temporary. Splitting a complicated expression into several simple
   ones, each writing into a slice of a pre-allocated buffer, lets
   you skip the temporaries::

      # one temporary for `a + b`, another for `* 2`
      out = (a + b) * 2

      # zero temporaries
      out[:] = a
      out += b
      out *= 2

4. **Slice-assign instead of building new arrays.** Many "build a
   new array from pieces" patterns can be expressed as a single
   pre-allocated array with several slice assignments. See the
   up-scaling example in :doc:`tricks`.

Boolean masks: be careful in tight loops
----------------------------------------

A boolean mask (``a < threshold``) is a real array on the heap, of
size equal to ``a``. Building masks in a hot loop -- in particular
combining them with ``&`` / ``|`` -- spawns lots of throwaway
arrays.

If a mask is reused, build it once and keep it::

   mask = a < threshold
   foo[mask] = 0
   bar[mask] = 1

If a mask depends on values that change every iteration, it is
unavoidable that you reallocate. A periodic ``gc.collect()`` keeps
the heap from fragmenting.

Building from raw buffers
-------------------------

Three useful building blocks:

* ``np.frombuffer(buf, dtype=...)`` -- view a ``bytes``-like buffer
  as an ``ndarray`` of the given dtype, no copy. Use this for
  in-memory peripherals (UART, SPI, ADC, audio buffers).
* ``a.tobytes()`` -- get the raw bytes underlying a *dense*
  ``ndarray`` (sliced views raise ``ValueError``). Useful when you
  want to ship pixels out over a transport.
* ``a.byteswap()`` / ``a.byteswap(inplace=True)`` -- flip endianness
  for multi-byte dtypes. Use when the peripheral that produced the
  buffer disagrees with the MCU on byte order.

For non-native dtypes (e.g. 32-bit ADC samples) the ``utils`` module
has converters: see :doc:`utils`.

Handling complex output safely
------------------------------

A few universal functions can produce complex results -- the obvious
ones are ``sqrt`` of a negative number and ``exp`` of a complex
input. ``ulab`` does not silently widen the dtype. If your input is
real but the result might be complex, you have to ask for the
complex output explicitly::

   a = np.array([1, -1])
   np.sqrt(a)                       # array([1.0, nan])  (NaN, real out)
   np.sqrt(a, dtype=np.complex)     # array([1+0j, 0+1j], complex out)

This is most often a footgun in numerical code -- if you see
``NaN`` where you expected an imaginary number, you forgot
``dtype=np.complex``.

Where to go next
----------------

* :doc:`tricks` -- short-form list of practical tips.
* :doc:`utils` -- buffer conversions and ``spectrogram``.
* :doc:`/library/omv.ulab.numpy` -- complete API reference.