7.3. Hello BlazeFace
BlazeFace is a face-detection neural network from Google’s MediaPipe
collection. A single inference call returns a bounding rectangle
around each detected face along with six facial landmarks – right
eye, left eye, nose, mouth, right ear, left ear. Every OpenMV Cam
that ships with neural-network support carries the
blazeface_front_128.tflite model on flash, so running an
end-to-end face detector takes a few lines of Python.
7.3.1. The full script
import csi
import ml
from ml.postprocessing.mediapipe import BlazeFace

csi0 = csi.CSI()
csi0.reset()
csi0.pixformat(csi.RGB565)
csi0.framesize(csi.VGA)
csi0.window((400, 400))

model = ml.Model("/rom/blazeface_front_128.tflite",
                 postprocess=BlazeFace(threshold=0.4))

while True:
    img = csi0.snapshot()
    for (x, y, w, h), score, keypoints in model.predict([img]):
        img.draw_rectangle((x, y, w, h), color=(0, 255, 0))
        ml.utils.draw_keypoints(img, keypoints, color=(255, 0, 0))
That is the entire face detector: the script captures a frame, hands it to the model, walks the returned list of detections, and draws each face’s bounding rectangle and six landmarks back into the frame. The IDE preview shows the boxes and landmarks in real time.
7.3.2. What each line does
The first three lines import the modules the script needs.
csi is the camera-sensor interface; ml is the
machine-learning module the rest of this chapter is about;
BlazeFace is the post-processor
that turns BlazeFace’s raw output tensors into the bounding-box and
landmark list the script iterates over.
The next five lines create the sensor object and configure it: the camera is reset to a known state, set to RGB565 colour, set to VGA resolution, and then windowed to a 400-by-400 square. The window matters: BlazeFace was trained on square crops, so a square capture matches the aspect ratio the network expects.
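The arithmetic behind the window can be sketched in plain Python that runs on a desktop interpreter. This is a minimal sketch, not the sensor driver's code: it assumes that a two-element window is centred on the full frame (the documented behaviour of the older sensor.set_windowing call; that csi.window behaves the same way is an assumption here), and the helper name centered_window is hypothetical.

```python
# Plain-Python sketch of the windowing arithmetic; no camera required.
VGA = (640, 480)       # width and height of a VGA frame, in pixels
BYTES_PER_PIXEL = 2    # RGB565 packs one pixel into 16 bits


def centered_window(frame_size, window_size):
    """Return (x, y, w, h) for a window centred inside the frame.

    Hypothetical helper: assumes the sensor centres a (w, h) window.
    """
    frame_w, frame_h = frame_size
    win_w, win_h = window_size
    if win_w > frame_w or win_h > frame_h:
        raise ValueError("window does not fit inside the frame")
    return ((frame_w - win_w) // 2, (frame_h - win_h) // 2, win_w, win_h)


roi = centered_window(VGA, (400, 400))
frame_bytes = roi[2] * roi[3] * BYTES_PER_PIXEL

print(roi)          # (120, 40, 400, 400)
print(frame_bytes)  # 320000
```

Under that assumption the window trims 120 pixels from each side and 40 pixels from the top and bottom of the VGA frame, and the windowed RGB565 frame occupies 320,000 bytes.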
The model-loading line opens the model file:
model = ml.Model("/rom/blazeface_front_128.tflite",
                 postprocess=BlazeFace(threshold=0.4))
ml.Model reads the file at the given path – /rom/ is a
flash-resident filesystem covered later – and returns a model object
the script will run inferences against. The postprocess= keyword
registers the BlazeFace post-processor; without it, predict would
return the network’s raw output tensors and the application would
have to decode them by hand. With it, predict returns the decoded
result directly. The threshold=0.4 argument on the post-processor
sets the minimum confidence the network must report before a
detection is kept; lower values catch fainter faces at the cost of
more false positives.
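The effect of the threshold can be illustrated without a camera. The sketch below is plain Python over made-up detections in the (rectangle, score, keypoints) shape this section describes. The helper keep_confident is hypothetical, and whether the real post-processor compares with >= or > is an assumption.

```python
# Hypothetical detections: (rect, score, keypoints). The numbers are
# invented for illustration and keypoints are omitted.
detections = [
    ((120, 80, 90, 90), 0.92, None),
    ((300, 60, 70, 70), 0.55, None),
    ((10, 200, 40, 40), 0.31, None),
]


def keep_confident(dets, threshold):
    """Keep detections whose score is at or above the threshold."""
    return [d for d in dets if d[1] >= threshold]


print(len(keep_confident(detections, 0.4)))   # 2
print(len(keep_confident(detections, 0.25)))  # 3
```

Lowering the threshold from 0.4 to 0.25 lets the 0.31 candidate through; whether that candidate is a faint face or a false positive is exactly the trade-off the threshold controls.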
The remaining five lines are the main loop: while True: followed by a four-line body. Each pass through the body captures one frame and asks the model what it sees:
img = csi0.snapshot()
for (x, y, w, h), score, keypoints in model.predict([img]):
    img.draw_rectangle((x, y, w, h), color=(0, 255, 0))
    ml.utils.draw_keypoints(img, keypoints, color=(255, 0, 0))
predict() takes a list of inputs (here, one captured
image) and returns a list of detection tuples. Each tuple holds the
bounding rectangle (x, y, w, h), a confidence score between
zero and one, and a (6, 2) ndarray of
landmark coordinates – the right eye, left eye, nose, mouth, right
ear, and left ear in that order. The drawing call uses
draw_rectangle() – the same primitive every
classical detector in the image chapter ended with – to outline
the face. ml.utils.draw_keypoints() is a small helper from
the ml utilities that marks each keypoint with a cross at its
(x, y) position.
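Because the landmark order is fixed, an application can attach names to the rows of the keypoint array. The following is a minimal sketch in plain Python, with a list of (x, y) pairs standing in for the (6, 2) ndarray. The helper name_landmarks and the sample coordinates are hypothetical, and the sketch assumes that iterating the array yields one (x, y) row at a time.

```python
# Landmark order as listed in this section.
LANDMARK_NAMES = (
    "right_eye", "left_eye", "nose", "mouth", "right_ear", "left_ear",
)


def name_landmarks(keypoints):
    """Map each landmark name to its (x, y) position as integers."""
    if len(keypoints) != len(LANDMARK_NAMES):
        raise ValueError("expected six keypoints")
    return {
        name: (int(x), int(y))
        for name, (x, y) in zip(LANDMARK_NAMES, keypoints)
    }


# Made-up coordinates standing in for one detection's keypoints.
sample = [(170, 150), (230, 150), (200, 185), (200, 220), (140, 160), (260, 160)]
named = name_landmarks(sample)

print(named["nose"])   # (200, 185)
print(named["mouth"])  # (200, 220)
```

A lookup such as named["nose"] is often easier to read in application code than indexing row 2 of the array.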
7.3.3. What the script does not say
The script is seven runnable lines of inference work past the imports
and the sensor setup, but a great deal of arithmetic happens inside
those seven lines. The captured 400-by-400 RGB565 frame becomes a
128-by-128 quantized 8-bit tensor before reaching the network; the
network runs hundreds of operations against tens of thousands of
weights; the resulting tensors of confidence scores and box offsets
become a ranked list of non-overlapping bounding boxes with attached
landmarks before predict returns. Every one of those
transformations is something the application can control if it
needs to, and several of them have to be tuned for any non-default
model.
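The last of those steps, reducing many overlapping candidates to a ranked list of non-overlapping boxes, is commonly done with non-maximum suppression. The sketch below shows a greedy variant in plain Python. It illustrates the general technique, not the BlazeFace post-processor's actual code: the function names, the 0.3 overlap threshold, and the candidate boxes are all assumptions made for illustration.

```python
def iou(a, b):
    """Intersection over union of two (x, y, w, h) rectangles."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0


def greedy_nms(dets, iou_threshold=0.3):
    """Keep the highest-scoring boxes, dropping any that overlap a kept box.

    dets is a list of (rect, score) pairs; the result is sorted by score.
    """
    kept = []
    for rect, score in sorted(dets, key=lambda d: d[1], reverse=True):
        if all(iou(rect, k[0]) <= iou_threshold for k in kept):
            kept.append((rect, score))
    return kept


# Made-up candidates: the first two cover nearly the same face.
candidates = [
    ((104, 102, 80, 80), 0.80),
    ((100, 100, 80, 80), 0.90),
    ((300, 50, 60, 60), 0.70),
]
faces = greedy_nms(candidates)

print(len(faces))   # 2
print(faces[0][1])  # 0.9
```

Here the 0.80 candidate overlaps the 0.90 candidate almost completely and is suppressed, while the separate 0.70 box survives.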
The next four subsections open up those transformations one at a time. In order:
The ml module – what ml.Model exposes once a model is loaded, and where the model file actually lives on the cam.
The inference pipeline – the four stages of every predict() call.
Inference engines – the CPU and NPU paths that run the network’s arithmetic.
Decoding the output – the post-processors that convert raw output tensors into the detections this script iterated over.
By the end of the chapter the reader can write the equivalent script for a model that did not ship with the cam, decode a tensor whose post-processor does not exist yet, and reason about why a particular model runs at 30 FPS on one cam and 3 FPS on another.