7.12. Post-processors
A detection network does not emit boxes. It emits one or more
tensors whose layout depends on the architecture the model was
trained against – a 2-D tensor of candidate predictions for a
YOLO-family detector, a pair of (boxes, scores) tensors for a
MediaPipe detector, a flat list of keypoint coordinates for a pose
network. The application cannot read any of these directly; what
it wants – a list of boxes, a list of keypoints, a per-class
breakdown – has to be decoded out of the raw tensor.
That decoder is a post-processor. The ml.postprocessing
module groups the shipped post-processors by source ecosystem.
7.12.1. Darknet
ml.postprocessing.darknet decodes models from the original
YOLO era. YOLO v2 introduced the grid and anchor ideas most
later detectors inherited in some form, so the v2 layout is the
cleanest starting point.
YOLO v2 starts by dividing the input image into a coarse grid – a 13-by-13 layout for the canonical 416-pixel input, coarser for smaller input resolutions – and trains the network so each grid cell is responsible for detecting any object whose centre falls inside it. The spatial layout of the output tensor mirrors the layout of the input: one position in the output per cell in the image.
At each grid cell, the network does not predict a box out of thin
air. It picks from several pre-chosen reference shapes called
anchors – fixed (width, height) pairs derived offline by
clustering the box sizes in the training set so they cover the
typical objects the model is expected to see. The network’s job
at each cell is to predict, for each anchor, a small offset to
the box centre within the cell, a scale on the anchor’s width and
height, an objectness score (the likelihood that anything is
there), and a per-class probability vector. A 13-by-13 grid with
the default 5 anchors and 20 classes therefore emits
13 * 13 * 5 * (4 + 1 + 20) = 21,125 numbers per inference.
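The arithmetic above can be checked with a short sketch. The helper name and the (rows, cols, anchors, values) memory order used in the reshape are assumptions made for illustration; a real model may order its output differently.

```python
import numpy as np

def yolo_v2_output_size(grid, anchors, classes):
    """Count the values emitted for a square grid-by-grid layout."""
    # Per anchor: 4 box terms + 1 objectness score + one score per class.
    per_anchor = 4 + 1 + classes
    return grid * grid * anchors * per_anchor

size = yolo_v2_output_size(grid=13, anchors=5, classes=20)
print(size)  # 21125

# Assumed layout: a flat output viewed as (rows, cols, anchors, values).
raw = np.zeros(size, dtype=np.float32)
cells = raw.reshape(13, 13, 5, 25)
print(cells.shape)  # (13, 13, 5, 25)
```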
YoloV2 decodes that layout:
it walks the cells, applies each anchor’s offsets and scales to
recover absolute box coordinates, combines objectness with class
probability for a per-class score, thresholds, and pushes the
survivors to NMS. The class takes an anchors= constructor
argument when the model was trained against a custom anchor table
and falls back to a built-in default otherwise. Variants tuned
for specific class sets ship in the same submodule.
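To make the decode steps concrete, here is a minimal desktop NumPy sketch. It is not the YoloV2 class itself. It assumes the commonly published YOLO v2 parameterisation (sigmoid on the centre offsets and objectness, exponential on the anchor scale, softmax over classes), a tensor already shaped (rows, cols, anchors, 5 + classes), and anchors expressed in grid-cell units. The function names are hypothetical, only the best class per box is kept, and the NMS is class-agnostic.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def decode_yolo_v2(cells, anchors, threshold=0.5):
    """Decode raw (H, W, A, 5 + C) values into thresholded detections.

    Returns (boxes, scores, class_ids); boxes are (x1, y1, x2, y2)
    corners normalised to the 0..1 range.
    """
    grid_h, grid_w = cells.shape[:2]
    anchors = np.asarray(anchors, dtype=np.float32)   # (A, 2), grid-cell units
    col = np.arange(grid_w).reshape(1, grid_w, 1)     # cell x index
    row = np.arange(grid_h).reshape(grid_h, 1, 1)     # cell y index

    # Centre: offset inside the cell plus the cell index, scaled to 0..1.
    cx = (sigmoid(cells[..., 0]) + col) / grid_w
    cy = (sigmoid(cells[..., 1]) + row) / grid_h
    # Size: the anchor shape scaled by the predicted factor.
    w = anchors[:, 0] * np.exp(cells[..., 2]) / grid_w
    h = anchors[:, 1] * np.exp(cells[..., 3]) / grid_h

    # Score: objectness combined with class probability.
    scores = sigmoid(cells[..., 4])[..., None] * softmax(cells[..., 5:])
    class_ids = np.argmax(scores, axis=-1)
    best = np.max(scores, axis=-1)

    boxes = np.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], axis=-1)
    keep = best > threshold
    return boxes[keep], best[keep], class_ids[keep]

def nms(boxes, scores, iou_threshold=0.45):
    """Greedy non-maximum suppression; returns kept indices, best first."""
    order = np.argsort(scores)[::-1]
    kept = []
    while order.size > 0:
        i, rest = order[0], order[1:]
        kept.append(int(i))
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_rest = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_rest - inter)
        # Drop every remaining box that overlaps the kept one too much.
        order = rest[iou <= iou_threshold]
    return kept
```

A caller would run decode_yolo_v2 on the reshaped output and then index the returned arrays with the list from nms. The threshold values shown are illustrative defaults, not values taken from the library.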
7.12.2. Ultralytics
ml.postprocessing.ultralytics decodes the newer YOLO
generations. YoloV8 reads
a column-major output where each column is one anchor prediction
holding box coordinates and a per-class score vector – the
objectness channel earlier YOLO outputs carried has been dropped
in v8, and the class scores stand alone. The
YOLOv8 walkthrough steps through the
decode tensor-by-tensor. Decoders for older Ultralytics-era versions
ship in the same submodule for models trained against their layouts.
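The column-per-prediction layout can be illustrated with a short NumPy sketch. This is not the YoloV8 class. It assumes a (4 + classes, N) array whose first four rows hold (centre x, centre y, width, height) and whose remaining rows hold class scores that are already in the 0..1 range. The function name and threshold are hypothetical.

```python
import numpy as np

def decode_yolo_v8(output, threshold=0.25):
    """Decode a (4 + C, N) array where each column is one prediction.

    Returns (boxes, scores, class_ids); boxes are (x1, y1, x2, y2).
    """
    preds = output.T                      # (N, 4 + C): one row per prediction
    class_scores = preds[:, 4:]
    # No objectness channel: the best class score is the detection score.
    class_ids = np.argmax(class_scores, axis=1)
    scores = class_scores[np.arange(len(preds)), class_ids]

    keep = scores > threshold
    cx, cy, w, h = preds[keep, :4].T
    boxes = np.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], axis=1)
    return boxes, scores[keep], class_ids[keep]

# Synthetic example: 3 classes, 5 predictions, one confident column.
output = np.zeros((4 + 3, 5))
output[:4, 2] = [100, 50, 20, 10]
output[4 + 1, 2] = 0.9
boxes, scores, class_ids = decode_yolo_v8(output)
print(boxes, scores, class_ids)
```

Overlapping survivors would normally still be passed through a suppression step before use.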
7.12.3. MediaPipe
ml.postprocessing.mediapipe decodes Google’s lightweight
on-device family. BlazeFace
is the face detector covered in hello-blazeface: a fast anchor-based detector
that emits boxes and six landmark points per face, returned
as (box, score, keypoints) tuples with the landmarks attached
to each box rather than as a separate output list. Hand-detection,
landmark, and pose models from the same family ship alongside it
and follow the same attached-keypoint return shape.
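The attached-keypoint shape is easiest to see by consuming it. The sketch below uses hand-written sample data, not real model output. The (x, y, w, h) box layout, the pixel units, and the helper name are assumptions for illustration.

```python
# Hypothetical detections in the (box, score, keypoints) shape described above.
# Assumed for illustration: boxes are (x, y, w, h) in pixels.
detections = [
    ((40, 30, 80, 80), 0.92,
     [(60, 55), (95, 55), (78, 75), (78, 95), (45, 60), (112, 60)]),
    ((200, 10, 50, 50), 0.41,
     [(212, 25), (235, 25), (224, 35), (224, 47), (203, 28), (246, 28)]),
]

def confident_faces(detections, min_score=0.5):
    """Keep detections at or above min_score, keypoints still attached."""
    return [(box, score, kps) for box, score, kps in detections
            if score >= min_score]

for box, score, keypoints in confident_faces(detections):
    x, y, w, h = box
    # Each face carries its own landmarks, so no index matching is needed.
    print("face at", (x, y, w, h), "score", score, "landmarks", len(keypoints))
```

Because the landmarks travel with their box, filtering or sorting the detections never separates a face from its keypoints.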
7.12.4. Picking one
The right post-processor is determined by the architecture the
model was trained against, not by what the application wants. A
YOLOv8 .tflite only decodes correctly through
YoloV8; a BlazeFace
.tflite only through
BlazeFace. Picking the
post-processor is part of picking the model. When a model’s
architecture is not represented by a shipped post-processor,
writing your own is straightforward.
Classification networks are the exception. Their single output
tensor is already what the application wants – a list of
per-class scores – and no post-processor is needed. Loading the
model without postprocess= and reading the predict result as
a flat ndarray is the right path, as covered in tensor I/O.
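Reading a classification result needs no more than a sort over the scores. The sketch below uses plain NumPy with a hand-written array standing in for the flat result. The top_k helper and the label list are hypothetical.

```python
import numpy as np

def top_k(scores, labels, k=3):
    """Return the k highest-scoring (label, score) pairs, best first."""
    scores = np.asarray(scores).ravel()      # tolerate a leading batch axis
    order = np.argsort(scores)[::-1][:k]     # indices sorted by descending score
    return [(labels[i], float(scores[i])) for i in order]

# Stand-in for a flat per-class score array.
scores = np.array([0.02, 0.91, 0.07])
labels = ["cat", "dog", "bird"]
print(top_k(scores, labels, k=2))  # [('dog', 0.91), ('bird', 0.07)]
```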