7.1. The Image object
An image-processing algorithm walks across an image one
pixel at a time. At each position it does something
simple – read a value, compare it against a
threshold, combine it with the corresponding pixel of
a second image, write a result back. Repeated across
a whole frame, those simple per-pixel decisions are
what edge detection, blob tracking, QR-code decoding,
and every other classical computer-vision technique
are built out of. To do that work efficiently, the
algorithm has to know where each pixel sits in
memory, what each pixel’s value actually means, and
which portion of the image it should be looking at.
image.Image is the object that organises
that information.
Vision Sensors left off at the moment
csi.CSI.snapshot() returns. Whatever the
camera-side machinery did to produce the captured
frame is already done; the application has the
Image in hand and needs to know what to do with
it.
7.1.1. The buffer and its properties
Inside the Image is a pointer to a contiguous
block of bytes in RAM and a small header carrying
three pieces of metadata: the image’s width in
pixels, its height in pixels, and the pixel format
the bytes are in. The bytes are the pixels
themselves, stored in row-major order – all of the
top row’s pixels first, then all of the second row’s,
and so on down to the bottom. The properties describe
how to read them.
Width and height are plain integer counts. The pixel
format is the more interesting property, because it sets
how many bytes each pixel takes and what those bytes
encode. A grayscale image carries one byte per pixel
holding a brightness value. An RGB565 image carries
two bytes per pixel holding red, green, and blue
fields packed into a 16-bit word. A Bayer image
carries one byte per pixel, but each pixel is sampled
through one of three colour filters chosen by its
position in the mosaic.
Vision Sensors enumerated the whole catalogue; what
matters here is that exactly one of those formats is
set on every Image, and the choice drives the
bytes-per-pixel arithmetic and the meaning of any
single byte in the buffer.
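The packing of an RGB565 pixel can be sketched in plain Python. This is a minimal illustration that assumes the conventional RGB565 layout (5 bits of red in the top of the word, 6 bits of green in the middle, 5 bits of blue at the bottom). The helper names pack_rgb565 and unpack_rgb565 are hypothetical and are not part of the image module, and the sketch says nothing about the order in which the two bytes of the word are stored in the buffer.

```python
def pack_rgb565(r5, g6, b5):
    """Pack 5-bit red, 6-bit green, and 5-bit blue fields into a 16-bit word."""
    return ((r5 & 0x1F) << 11) | ((g6 & 0x3F) << 5) | (b5 & 0x1F)


def unpack_rgb565(word):
    """Split a 16-bit RGB565 word back into its (r5, g6, b5) fields."""
    return (word >> 11) & 0x1F, (word >> 5) & 0x3F, word & 0x1F


white = pack_rgb565(31, 63, 31)     # every field at its maximum
pure_red = pack_rgb565(31, 0, 0)    # only the red field set
green_fields = unpack_rgb565(0x07E0)  # only the green field set
```

Green gets the extra bit because the three fields have to share sixteen bits, which is why the green field ranges over 0 to 63 while red and blue range over 0 to 31.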
With a pointer to the buffer, the width, the height,
and the format, every other property an algorithm
might want falls out as a short calculation. The byte
that begins pixel (x, y) sits at offset
(y * width + x) * bytes_per_pixel from the start
of the buffer. The total byte count is
width * height * bytes_per_pixel. The address of
the next row down is exactly width *
bytes_per_pixel bytes after the start of the
current one. The Image exposes the three
properties through plain method calls –
width(), height(), and format() – plus the derived
size through size(). Methods
elsewhere in the module use those values to do the
offset arithmetic themselves; application code rarely
has to.
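The offset arithmetic described above can be written out directly. The following is a sketch in plain Python over an ordinary bytearray standing in for the pixel buffer; the function names are hypothetical and are not methods of the image module.

```python
def row_stride(width, bytes_per_pixel):
    """Number of bytes from the start of one row to the start of the next."""
    return width * bytes_per_pixel


def pixel_offset(x, y, width, bytes_per_pixel):
    """Byte offset of the first byte of pixel (x, y) in a row-major buffer."""
    return (y * width + x) * bytes_per_pixel


def buffer_size(width, height, bytes_per_pixel):
    """Total number of bytes occupied by the pixel data."""
    return width * height * bytes_per_pixel


# A stand-in buffer with the shape of a 160 x 120 frame at 2 bytes per pixel.
WIDTH, HEIGHT, BPP = 160, 120, 2
buf = bytearray(buffer_size(WIDTH, HEIGHT, BPP))

# Write two marker bytes into pixel (10, 3).
off = pixel_offset(10, 3, WIDTH, BPP)
buf[off:off + BPP] = b"\xAB\xCD"
```

Moving one row down from any pixel adds exactly one row stride to its offset, which is the property row-by-row algorithms rely on.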
An Image is a small Python wrapper that points
at a contiguous block of memory: a header carrying
the width, height, and pixel format, followed by
the pixel buffer itself.
7.1.2. Where the buffer comes from
The default story throughout this chapter is the one
Vision Sensors already covered: a captured frame
arrives from snapshot, the bytes are sitting in
the camera’s frame buffer, and the returned Image
points at them. Three other ways of obtaining one
come up regularly, and each implies something
different about where the buffer ends up.
Loading from a file looks like passing a path to the
constructor: image.Image("/sdcard/saved.jpg"). The
module reads the file into a freshly allocated
buffer on the Python heap. BMP, PGM, and PPM files
get decoded on the way in and the resulting
Image carries an uncompressed pixel format.
JPEG and PNG files stay compressed – the
Image carries the format JPEG or PNG, and the
buffer holds the file’s byte stream essentially
unchanged. To do any pixel-level work on a
compressed image, the application first converts it
through to_rgb565() or to_grayscale(). That
conversion is where decompression actually happens,
along with the corresponding heap balloon: a 30 KB
JPEG can become 600 KB of RGB565.
Loading from file is most useful during development,
when an algorithm needs to be tested against a
known reference frame stored alongside the script.
Building one from scratch is the canvas case:
image.Image(320, 240, image.RGB565) asks the
module to allocate that many bytes in that format,
zero the contents, and hand the wrapper back. The
pixels do not mean anything yet – they are all
zero – but the empty image is the workhorse for a
handful of recurring patterns: reference frames
against which a current frame gets subtracted,
canvases on which graphics overlays get composed,
binary buffers that get filled in and used as masks.
Constructing from an ndarray bridges in the other
direction, from any numerical computation back into
the image module. Passing a float32
ulab.numpy.ndarray to the constructor
produces an Image whose dimensions match the
ndarray – a two-axis (h, w) shape becomes a
grayscale image, a three-axis (h, w, 3) shape
becomes RGB565 – with the float values scaled from
0.0 – 255.0 into the integer pixel range. A
neural-network heatmap, a numerical array of any kind,
anything produced by ml or ulab becomes
something the drawing and inspection side of the
image module can use.
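As a conceptual model of the two-axis case, the sketch below turns an (h, w) grid of floats into a row-major grayscale byte buffer. It is plain Python over nested lists rather than a real ndarray, the function name floats_to_grayscale is hypothetical, and the clamping of out-of-range values and the round-to-nearest step are assumptions made for illustration; the constructor's exact handling may differ.

```python
def floats_to_grayscale(rows):
    """Convert an (h, w) grid of floats in 0.0-255.0 to one byte per pixel."""
    height = len(rows)
    width = len(rows[0]) if height else 0
    buf = bytearray(width * height)
    for y, row in enumerate(rows):
        for x, value in enumerate(row):
            # Assumption: out-of-range values are clamped, then rounded.
            clamped = min(255.0, max(0.0, value))
            buf[y * width + x] = int(clamped + 0.5)
    return width, height, buf


# A tiny 2 x 3 "heatmap" with two deliberately out-of-range entries.
heatmap = [
    [0.0, 127.6, 255.0],
    [300.0, -4.0, 64.2],
]
width, height, pixels = floats_to_grayscale(heatmap)
```

The first axis of the grid becomes the image height and the second becomes the width, matching the (h, w) convention in the text.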
All four sources hand back the same kind of
Image. Code that uses the returned object never
has to track where it came from.
7.1.3. Two views over the bytes
Most of the time application code treats an Image
as a typed image object – a thing with named
methods. The other half of the story is that the same
object also appears, transparently, as a flat
sequence of bytes to any MicroPython API that takes a
bytes argument. The bytes are not a copy of the
buffer; they are a direct view of it.
That arrangement is what makes pushing a captured frame out of the cam a one-liner. Hashing it, sending it over a serial port, forwarding it to a network socket – none of those needs a separate “convert the image to bytes” step:
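import csi
import hashlib
from machine import UART

csi0 = csi.CSI()
csi0.reset()
csi0.pixformat(csi.RGB565)
csi0.framesize(csi.QQVGA)
img = csi0.snapshot()

# Assumption: UART bus 1 at 115200 baud; the bus number is board-specific.
uart = UART(1, 115200)
uart.write(img)                        # transmits the raw pixel bytes
digest = hashlib.sha256(img).digest()  # hashes the same bytes
# With an already-connected socket object sock (setup not shown here):
# sock.send(img)                       # sends them over a socket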
The bytes-like view is read-only by default, on
purpose. Image buffers are large and sometimes shared
between layers of the imaging stack, so giving a
casual buf[0] = 0 somewhere deep in a call stack
the power to silently corrupt one is too sharp an
edge to leave exposed. When read-write byte-level
access is what the application actually needs –
writing a calibration value into a known offset, for
instance – the Image’s bytearray() method returns a
separate, explicitly read-write view over the same
memory, signposting the intent at the call site.
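The difference between a read-only view and a read-write view over the same memory can be demonstrated with standard Python objects. This sketch is a stand-in, not the image module's mechanism: it uses memoryview on desktop Python, and memoryview.toreadonly() (CPython 3.8 and later) may not be available on MicroPython.

```python
# A small bytearray stands in for an image's pixel buffer.
frame = bytearray(b"\x10\x20\x30\x40")

read_only = memoryview(frame).toreadonly()  # view that refuses writes
read_write = memoryview(frame)              # view that allows writes

# A casual write through the read-only view is rejected.
try:
    read_only[0] = 0
    write_blocked = False
except TypeError:
    write_blocked = True

# A write through the read-write view lands in the shared memory,
# so the read-only view sees the new value without any copy.
read_write[0] = 0xFF
same_memory = read_only[0] == 0xFF
```

Both views refer to the same four bytes, so the choice between them controls only who is allowed to modify the data, not which data is seen.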
7.1.4. Where the buffer lives
Pixel buffers are large enough that where they sit in RAM matters. A QQVGA RGB565 frame is 160 × 120 × 2 = 38,400 bytes; a VGA RGB565 frame is 614,400 bytes; a 224 × 224 RGB565 input that a neural-network classifier might consume is about 100 KB. The Python heap on the smallest cams can be only a few tens of kilobytes once the runtime has booted. Holding more than a frame or two of image data on the heap would crowd everything else off it.
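The frame sizes quoted above can be checked with a few lines of arithmetic. In this sketch the HEAP_BUDGET figure of 64 KiB is a hypothetical value chosen only to illustrate the comparison; the real heap size depends on the board and firmware.

```python
# (width, height, bytes per pixel) for the frames mentioned in the text.
FRAMES = {
    "QQVGA RGB565": (160, 120, 2),
    "VGA RGB565": (640, 480, 2),
    "224x224 RGB565": (224, 224, 2),
}

HEAP_BUDGET = 64 * 1024  # hypothetical heap size in bytes, for illustration


def frame_bytes(width, height, bytes_per_pixel):
    """Size in bytes of one uncompressed frame."""
    return width * height * bytes_per_pixel


sizes = {name: frame_bytes(*dims) for name, dims in FRAMES.items()}
fits_on_heap = {name: size <= HEAP_BUDGET for name, size in sizes.items()}

for name, size in sizes.items():
    print(name, size, "bytes;", "fits" if fits_on_heap[name] else "does not fit")
```

Under that assumed budget only the QQVGA frame would fit at all, and even then it would take more than half of the heap on its own.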
The way out is that image buffers mostly do not live
on the Python heap. They live in the dedicated
region of RAM that Vision Sensors introduced as the
frame buffer – the same memory the camera DMA
writes captured frames into and the IDE preview
reads finished frames out of. Most operations on an
Image modify their source in place: the
algorithm reads pixels, decides, writes new values
back, and no separate result image is allocated. The
operations that do produce a separate result –
format conversions and a handful of others – can
be asked to place that result in the frame buffer
through the copy_to_fb keyword argument.
copy_to_fb=True does two things at once: it
puts the result image into the frame buffer rather
than on the heap (sidestepping the heap pressure)
and it makes the result the next frame the IDE
preview will display. Tacking copy_to_fb=True
onto the final step of a pipeline, watching the
result appear on screen, and iterating from there
is one of the most useful debugging idioms in image
processing.
With a wrapper holding a labelled buffer, four ways
of getting one into existence, two views over its
bytes, and a switch deciding where new ones land,
the Image is no longer a mystery. The remaining
foundational questions – how a pixel position is
named, what each pixel actually holds, how to scope
an operation to a portion of one – are built on top
of it.