9.18. Performance
The same design decisions that make numpy fast
on the camera – whole-array library calls, packed
typed buffers, views that share data with their source
– also expose a set of habits that are worth knowing
about. The Shape and strides page already
covered the last-axis layout rule; this page catalogues
the allocation and dtype habits that matter most in a
streaming loop.
9.18.1. Pick a reasonable dtype
The default dtype of every constructor is float. For
data that is naturally 8-bit or 16-bit – ADC samples,
image pixels, sensor readings – pass dtype=
explicitly to one of the integer types:
adc = np.array(adc_samples, dtype=np.uint16)
The RAM saving is 4-8x. The math also runs faster
because the integer code paths inside numpy are
tighter than the generic float ones. The integer
overflow rule covered on Dtypes applies
– cast to a wider type before arithmetic that might
overflow.
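A minimal sketch of the two points above: an explicit integer
dtype keeps each element small, and widening before arithmetic
avoids wrap-around. The sample values are hypothetical, and the
desktop numpy fallback import is only an assumption that makes
the snippet easy to try off the camera.

```python
try:
    from ulab import numpy as np  # on the camera
except ImportError:
    import numpy as np  # assumption: desktop fallback for experimenting

# Hypothetical 12-bit ADC readings stored as 16-bit unsigned integers.
adc_samples = [1023, 2047, 4095, 512]
adc = np.array(adc_samples, dtype=np.uint16)
bytes_per_sample = adc.itemsize  # 2 bytes per element for uint16

# Two hypothetical 8-bit pixel rows whose sum exceeds 255 in one position.
row_a = np.array([200, 10], dtype=np.uint8)
row_b = np.array([100, 20], dtype=np.uint8)

# Same-dtype arithmetic stays uint8, so 200 + 100 wraps around modulo 256.
wrapped = row_a + row_b

# Widen both operands first; the sum then fits in 16 bits.
wide = np.array(row_a, dtype=np.uint16) + np.array(row_b, dtype=np.uint16)
```

The widened copies cost extra RAM, so it is worth widening only
the operands of the one expression that can actually overflow.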
9.18.2. Use an ndarray when you can
Most reductions and universal functions accept either an
iterable or an ndarray:
np.sum([1, 2, 3, 4, 5]) # works, but slow
np.sum(np.array([1, 2, 3, 4, 5])) # ~3x faster
The iterable form forces numpy to step through
the input one Python object at a time, converting each
to a number before it can use it. Against an
ndarray the conversion is already
done and the call runs straight through the packed
buffer.
When the same data is used more than once, build the
ndarray once and pass it around.
When the data exists only as a Python list and is
consumed once, the conversion cost can outweigh the
speedup – the array() constructor
itself has to walk the list and allocate.
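The "build once, pass it around" advice might look like the
sketch below. The readings are made up for illustration, and
the fallback import is an assumption for running the snippet
on a desktop.

```python
try:
    from ulab import numpy as np  # on the camera
except ImportError:
    import numpy as np  # assumption: desktop fallback for experimenting

# Hypothetical sensor readings that arrive as a plain Python list.
readings = [3, 1, 4, 1, 5, 9, 2, 6]

# Pay the list-walking conversion cost exactly once ...
arr = np.array(readings, dtype=np.uint8)

# ... then run every reduction against the packed buffer.
total = np.sum(arr)
peak = np.max(arr)
low = np.min(arr)
```

Three reductions share one conversion here; calling each of
them on the list directly would repeat the conversion each time.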
9.18.3. Prefer views to copies
Slicing, single-axis indexing of a higher-rank array,
reshape(),
transpose(), and
frombuffer() all return views that
share data with the source. They are essentially free.
copy(),
flatten(), boolean indexing
(a[mask]), and any arithmetic expression allocate a
copy. Reach for them only when an independent buffer
is genuinely needed.
When in doubt, ndinfo() prints the
location of the underlying buffer; two arrays that
report the same address share their data. The
complete view-vs-copy table is on
Views and copies.
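A quick way to see the difference without inspecting
addresses is to write through the result and check the
source. This is only an illustrative sketch; the array
contents are arbitrary and the fallback import is an
assumption for desktop use.

```python
try:
    from ulab import numpy as np  # on the camera
except ImportError:
    import numpy as np  # assumption: desktop fallback for experimenting

a = np.array([0, 1, 2, 3, 4, 5], dtype=np.uint8)

# A slice is a view: writing to it writes into a's buffer.
view = a[1:4]
view[0] = 99  # this is a[1]

# copy() gives an independent buffer: a is not affected.
independent = a.copy()
independent[0] = 42
```

After this runs, a[1] holds 99 because the view shares a's
data, while a[0] is still 0 because the copy does not.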
9.18.4. Allocate once, then write
The single biggest performance pitfall on the camera is
allocating fresh arrays inside a loop that runs many
times a second. Each new
ndarray asks the camera for RAM, and
frequent fresh allocations waste it.
Most universal functions accept out= so the result
can be written into an array that already exists:
x = np.linspace(0, 2 * np.pi, num=512)
y = np.zeros(512)  # allocate once
while True:
    np.sin(x, out=y)
    # use y ...
image.Image.to_ndarray() accepts buffer=
for the same reason; ulab.utils.spectrogram() and
the from_int32_buffer()-style
converters accept both out= and scratchpad=.
Allocate everything once and reuse it.
9.18.5. Use in-place operators
b = b + 1 allocates a new array the size of b, fills
it with the result, and rebinds the name b to it.
b += 1 modifies b directly:
# makes a temporary
b = b + 1
# no temporary
b += 1
The same idea applies to compound expressions.
a + b * c allocates a temporary for b * c.
Splitting the expression into simple sub-assignments
writing into a pre-allocated buffer eliminates the
temporaries:
# one temporary for (a + b), another for the * 2
out = (a + b) * 2
# zero temporaries
out[:] = a
out += b
out *= 2
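The sketch below checks both claims on small arrays: the
in-place sequence leaves out as the same object while
producing the same values as the compound expression, and
the plain b = b + 1 form rebinds the name to a new array.
The array contents are arbitrary and the fallback import is
an assumption for desktop use.

```python
try:
    from ulab import numpy as np  # on the camera
except ImportError:
    import numpy as np  # assumption: desktop fallback for experimenting

a = np.array([1, 2, 3, 4], dtype=np.int16)
b = np.array([10, 20, 30, 40], dtype=np.int16)

# Pre-allocated destination, created once.
out = np.zeros(4, dtype=np.int16)
out_before = out

# Three in-place steps, all writing into the existing buffer.
out[:] = a
out += b
out *= 2
same_object = out is out_before

# Reference result from the compound form (allocates temporaries).
expected = (a + b) * 2

# The non-in-place form creates a new array and rebinds the name.
c = np.zeros(4, dtype=np.int16)
c_before = c
c = c + 1
rebound = c is not c_before
```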
9.18.6. Build the result, do not append to it
ndarray has no append – on
purpose. Growing an array would mean allocating a fresh,
larger buffer and copying the old contents into it. On a
microcontroller, pre-allocate the final size and fill
it:
out = np.zeros(N, dtype=np.float)
for i in range(N):
    out[i] = some_calculation(i)
When N genuinely is not known in advance, write to a
Python list and convert once at the end with
array().
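For the unknown-length case, one possible shape is sketched
below. The helper name read_until_sentinel, the sentinel
convention, and the sample stream are all hypothetical; the
fallback import is an assumption for desktop use.

```python
try:
    from ulab import numpy as np  # on the camera
except ImportError:
    import numpy as np  # assumption: desktop fallback for experimenting


def read_until_sentinel(source, sentinel=-1):
    """Collect values until the sentinel appears, then convert once."""
    readings = []  # a Python list grows cheaply, one item at a time
    for value in source:
        if value == sentinel:
            break
        readings.append(value)
    # Single conversion at the end, once the final length is known.
    return np.array(readings, dtype=np.uint16)


# Hypothetical stream: three valid readings, a stop marker, then junk.
samples = read_until_sentinel([512, 600, 580, -1, 999])
```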
9.18.7. Slice assignment instead of new arrays
Many “build a new array from pieces” patterns can be expressed as slice assignments into a pre-allocated buffer. The classic example is linear interpolation by 2:
# Originals at even indices; midpoints at odd indices.
a = np.array([0, 10, 2, 20, 4], dtype=np.uint8)
b = np.zeros(2 * len(a) - 1, dtype=np.uint8)
b[::2] = a
b[1::2] = a[:-1]
b[1::2] += a[1:]
b[1::2] //= 2  # halve the sums to get the midpoints
The compound form b[1::2] = (a[:-1] + a[1:]) // 2
would allocate a temporary the size of a[:-1] + a[1:]
plus another for the division. The four-line version
above touches only views.
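Wrapped in a helper, the slice-assignment recipe can be
reused with the same destination buffer on every pass. The
function name interpolate_2x_into is hypothetical, and the
sketch assumes the pairwise sums fit in the dtype (here they
stay below 256). The fallback import is an assumption for
desktop use.

```python
try:
    from ulab import numpy as np  # on the camera
except ImportError:
    import numpy as np  # assumption: desktop fallback for experimenting


def interpolate_2x_into(a, b):
    """Write a and its pairwise midpoints into b (length 2*len(a) - 1)."""
    b[::2] = a         # originals at even indices
    b[1::2] = a[:-1]   # left neighbour at odd indices
    b[1::2] += a[1:]   # add the right neighbour
    b[1::2] //= 2      # halve the sum to get the midpoint
    return b


a = np.array([0, 10, 2, 20, 4], dtype=np.uint8)
b = np.zeros(2 * len(a) - 1, dtype=np.uint8)  # allocate once
interpolate_2x_into(a, b)
```

For this input the odd positions hold 5, 6, 11 and 12, the
integer midpoints of each neighbouring pair.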
The same idea generalises to two dimensions, which makes it the right tool for upscaling small images – think 8x8 thermal sensors that need a human-friendly preview:
# a is 8x8; b is 15x15
b = np.zeros((15, 15), dtype=np.uint8)
b[::2, ::2] = a
b[1::2, ::2] = a[:-1, :]
b[1::2, ::2] += a[1:, :]
b[1::2, ::2] //= 2
b[:, 1::2] = b[:, :-1:2]
b[:, 1::2] += b[:, 2::2]
b[:, 1::2] //= 2
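The same steps work for any input of shape (h, w) with a
destination of shape (2h - 1, 2w - 1). The sketch below uses
a 3x3 input so the result is easy to check by hand; the
helper name upscale_2x_into is hypothetical, the values are
chosen so no pairwise sum exceeds 255, and the fallback
import is an assumption for desktop use.

```python
try:
    from ulab import numpy as np  # on the camera
except ImportError:
    import numpy as np  # assumption: desktop fallback for experimenting


def upscale_2x_into(a, b):
    """Fill b (2h-1 x 2w-1) with a (h x w) plus interpolated midpoints."""
    # Pass 1: originals, then vertical midpoints on the odd rows.
    b[::2, ::2] = a
    b[1::2, ::2] = a[:-1, :]
    b[1::2, ::2] += a[1:, :]
    b[1::2, ::2] //= 2
    # Pass 2: horizontal midpoints on the odd columns of every row.
    b[:, 1::2] = b[:, :-1:2]
    b[:, 1::2] += b[:, 2::2]
    b[:, 1::2] //= 2
    return b


a = np.array([[0, 10, 20],
              [30, 40, 50],
              [60, 70, 80]], dtype=np.uint8)
b = np.zeros((5, 5), dtype=np.uint8)  # allocate once, reuse per frame
upscale_2x_into(a, b)
```

The second pass reads the rows produced by the first, so the
centre of each 2x2 cell ends up as the average of two
vertical midpoints.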
9.18.8. Watch out for boolean masks in streaming loops
Boolean indexing and where() produce
a new array on each call – the size of the result
depends on the data, so no pre-allocated buffer can
absorb the allocation. Repeated mask building in a
streaming loop fills RAM with throwaway arrays. A
periodic gc.collect() reclaims the space:
import gc
for i in range(1000):
    mask = a < threshold
    _ = a[mask]
    if i % 100 == 0:
        gc.collect()
The same caveat applies to compound boolean expressions
like (a > lo) & (a < hi) – each operator allocates
a new bool array. When a mask is reused, build it once
and keep it:
mask = a < threshold
foo[mask] = 0
bar[mask] = 1
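A self-contained version of the reuse pattern, with a
compound range mask built once and kept, could look like
this. The arrays, the threshold and the lo/hi bounds are
made-up values, and the fallback import is an assumption for
desktop use.

```python
try:
    from ulab import numpy as np  # on the camera
except ImportError:
    import numpy as np  # assumption: desktop fallback for experimenting

a = np.array([1, 5, 9, 3, 7], dtype=np.uint8)
foo = np.ones(5, dtype=np.uint8)
bar = np.zeros(5, dtype=np.uint8)
threshold = 5

# Build the mask once, then apply it to several arrays.
mask = a < threshold
foo[mask] = 0
bar[mask] = 1

# A compound mask allocates one bool array per operator,
# so it is also worth building once and keeping.
lo, hi = 2, 8
in_band = (a > lo) & (a < hi)
band_values = a[in_band]  # new array; its size depends on the data
```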