7.12. Post-processors

A detection network does not emit boxes. It emits one or more tensors whose layout depends on the architecture the model was trained against – a 2-D tensor of candidate predictions for a YOLO-family detector, a pair of (boxes, scores) tensors for a MediaPipe detector, a flat list of keypoint coordinates for a pose network. The application cannot read any of these directly; what it wants – a list of boxes, a list of keypoints, a per-class breakdown – has to be decoded out of the raw tensor.

That decoder is a post-processor. The ml.postprocessing module groups its post-processors by source ecosystem: Darknet, Ultralytics, and MediaPipe.

7.12.1. Darknet

ml.postprocessing.darknet decodes models from the original YOLO era. YOLO v2 introduced the grid and anchor ideas most later detectors inherited in some form, so the v2 layout is the cleanest starting point.

YOLO v2 starts by dividing the input image into a coarse grid and trains the network so each grid cell is responsible for detecting any object whose centre falls inside it. The network downsamples by a factor of 32, so the canonical 416-pixel input yields a 13-by-13 grid, and models with smaller inputs yield smaller grids. The spatial layout of the output tensor mirrors the layout of the input: one position in the output per cell in the image.
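The cell-responsibility rule can be sketched in a few lines. This helper is illustrative only and is not part of ml.postprocessing: responsible_cell and its parameters are hypothetical names, and it assumes a square input whose side divides evenly into the grid.

```python
def responsible_cell(cx, cy, image_size=416, grid=13):
    """Return the (col, row) of the grid cell that contains the point (cx, cy).

    cx and cy are pixel coordinates of an object's centre in the input image.
    Hypothetical helper for illustration; not a library function.
    """
    stride = image_size / grid  # pixels per cell: 416 / 13 = 32
    # Clamp so a centre lying on the far edge still maps to the last cell.
    col = min(int(cx // stride), grid - 1)
    row = min(int(cy // stride), grid - 1)
    return col, row


# An object centred at pixel (208, 100) belongs to column 6, row 3.
print(responsible_cell(208, 100))
```

During training, only the anchors at that one cell are asked to account for the object, which is why the decoder later adds the cell's column and row back onto the predicted offsets.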

At each grid cell, the network does not predict a box out of thin air. It starts from several pre-chosen reference shapes called anchors: fixed (width, height) pairs derived offline by clustering the box sizes in the training set so they cover the typical objects the model is expected to see. The network’s job at each cell is to predict, for each anchor, a small offset to the box centre within the cell, a scale on the anchor’s width and height, an objectness score (the likelihood that anything is there), and a per-class probability vector. That is 4 box values, 1 objectness score, and one probability per class for every anchor, so a 13-by-13 grid with the default 5 anchors and 20 classes emits 13 * 13 * 5 * (4 + 1 + 20) = 21,125 numbers per inference.
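The same count can be checked for other configurations. The function below is a small sketch of the arithmetic only; yolo_v2_output_size is a hypothetical name, not a library call.

```python
def yolo_v2_output_size(grid, num_anchors, num_classes):
    """Number of values a YOLO v2 head emits for one inference."""
    # Each anchor carries 4 box values, 1 objectness score, and one
    # probability per class.
    values_per_anchor = 4 + 1 + num_classes
    return grid * grid * num_anchors * values_per_anchor


print(yolo_v2_output_size(grid=13, num_anchors=5, num_classes=20))  # 21125
print(yolo_v2_output_size(grid=13, num_anchors=5, num_classes=1))   # 5070
```

The class count dominates the per-anchor width, so a single-class model trained on the same grid emits roughly a quarter of the values.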

YoloV2 decodes that layout: it walks the cells, applies each anchor’s offsets and scales to recover absolute box coordinates, combines objectness with class probability into a per-class score, discards scores below the threshold, and passes the survivors to non-maximum suppression (NMS). The class takes an anchors= constructor argument for models trained against a custom anchor table and falls back to a built-in default otherwise. Variants tuned for specific class sets ship in the same submodule.
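A minimal sketch of that decode in plain Python follows. It is not the YoloV2 class's implementation. It assumes the conventional YOLO v2 formulation: sigmoid on the centre offsets and on objectness, an exponential scale on the anchor size, softmax over the class values, the score taken as objectness times class probability, and anchors expressed in grid-cell units. The nested-list layout raw[row][col][anchor] and the name decode_yolo_v2 are likewise assumptions made for readability. NMS is left out.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(values):
    peak = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def decode_yolo_v2(raw, anchors, threshold=0.5):
    """Decode raw[row][col][anchor] = [tx, ty, tw, th, to, c0, c1, ...].

    Returns (x, y, w, h, class_index, score) tuples, with the box given as a
    top-left corner plus size, normalised to the 0..1 range.
    Anchors are assumed to be (width, height) pairs in grid-cell units.
    """
    grid = len(raw)
    detections = []
    for row in range(grid):
        for col in range(grid):
            for (anchor_w, anchor_h), pred in zip(anchors, raw[row][col]):
                tx, ty, tw, th, to = pred[:5]
                # Centre: the sigmoid keeps the offset inside the current cell.
                cx = (col + sigmoid(tx)) / grid
                cy = (row + sigmoid(ty)) / grid
                # Size: an exponential scale applied to the anchor shape.
                w = anchor_w * math.exp(tw) / grid
                h = anchor_h * math.exp(th) / grid
                objectness = sigmoid(to)
                class_probs = softmax(pred[5:])
                for class_index, prob in enumerate(class_probs):
                    score = objectness * prob
                    if score >= threshold:
                        detections.append(
                            (cx - w / 2, cy - h / 2, w, h, class_index, score)
                        )
    return detections


# Toy 2-by-2 grid, one anchor, two classes; only the cell at row 0, col 1 fires.
background = [0.0, 0.0, 0.0, 0.0, -10.0, 0.0, 0.0]
hit = [0.0, 0.0, 0.0, 0.0, 10.0, 5.0, 0.0]
raw = [[[background], [hit]], [[background], [background]]]
print(decode_yolo_v2(raw, anchors=[(1.0, 1.0)]))
```

In the toy input every raw offset is zero, so the surviving box sits at the centre of its cell and has exactly the anchor's size.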

7.12.2. Ultralytics

ml.postprocessing.ultralytics decodes the newer YOLO generations. YoloV8 reads a column-major output in which each column is one candidate prediction holding box coordinates and a per-class score vector. The objectness channel that earlier YOLO outputs carried has been dropped in v8, so the class scores stand alone. The YOLOv8 walkthrough steps through the decode tensor-by-tensor. Older Ultralytics-era versions ship in the same submodule for models trained against their layouts.
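The column layout can be illustrated with a short sketch. This is not the YoloV8 class's code. It assumes the first four rows hold the box as (centre x, centre y, width, height), that the remaining rows hold already-activated class scores, and that the tensor is given as a list of rows; decode_yolov8 is a hypothetical name. NMS is again omitted.

```python
def decode_yolov8(output, threshold=0.25):
    """Decode a (4 + num_classes) x num_predictions output given as rows.

    Returns (x, y, w, h, class_index, score) tuples, with the box given as a
    top-left corner plus size in the same units the model emitted.
    """
    num_rows = len(output)
    num_predictions = len(output[0])
    detections = []
    for j in range(num_predictions):
        # Gather column j: one candidate prediction.
        column = [output[i][j] for i in range(num_rows)]
        cx, cy, w, h = column[:4]
        class_scores = column[4:]
        # No objectness channel: the best class score is the confidence.
        class_index = max(range(len(class_scores)), key=class_scores.__getitem__)
        score = class_scores[class_index]
        if score >= threshold:
            detections.append((cx - w / 2, cy - h / 2, w, h, class_index, score))
    return detections


# Toy output: 2 classes, 3 candidate predictions (one per column).
output = [
    [100, 50, 200],    # centre x
    [100, 60, 220],    # centre y
    [40, 20, 10],      # width
    [20, 10, 10],      # height
    [0.9, 0.1, 0.2],   # class 0 scores
    [0.05, 0.6, 0.1],  # class 1 scores
]
for detection in decode_yolov8(output):
    print(detection)
```

The third column is dropped because neither class score reaches the threshold, which is the whole confidence test once objectness is gone.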

7.12.3. MediaPipe

ml.postprocessing.mediapipe decodes Google’s lightweight on-device family. BlazeFace is the face detector covered in hello-blazeface: a fast anchor-based detector that emits a box and six landmark points per face, returned as (box, score, keypoints) tuples with the landmarks attached to each box rather than as a separate output list. Hand-detection, landmark, and pose models from the same family ship alongside it and follow the same attached-keypoint return shape.
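Consuming that return shape might look like the sketch below. The sample data is made up, and the box format (x, y, w, h) and the pixel coordinate system are assumptions; only the (box, score, keypoints) tuple structure comes from the description above. best_face is a hypothetical helper.

```python
def best_face(detections, min_score=0.5):
    """Return the highest-scoring (box, score, keypoints) tuple, or None."""
    candidates = [d for d in detections if d[1] >= min_score]
    if not candidates:
        return None
    return max(candidates, key=lambda d: d[1])


# Made-up sample data in the attached-keypoint shape: each detection carries
# its own six (x, y) landmarks, so no index matching between lists is needed.
detections = [
    ((40, 30, 80, 80), 0.92,
     [(60, 55), (95, 55), (78, 75), (78, 92), (45, 60), (112, 60)]),
    ((200, 50, 60, 60), 0.41,
     [(215, 70), (240, 70), (228, 85), (228, 98), (205, 74), (252, 74)]),
]

face = best_face(detections)
if face is not None:
    box, score, keypoints = face
    print(box, score, len(keypoints))
```

Because the landmarks travel with their box, filtering or sorting the detections never separates a face from its keypoints.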

7.12.4. Picking one

The right post-processor is determined by the architecture the model was trained against, not by what the application wants. A YOLOv8 .tflite only decodes correctly through YoloV8; a BlazeFace .tflite only through BlazeFace. Picking the post-processor is part of picking the model. When a model’s architecture is not represented by a shipped post-processor, writing your own is straightforward.

Classification networks are the exception. Their single output tensor is already what the application wants – a list of per-class scores – and no post-processor is needed. Loading the model without postprocess= and reading the predict result as a flat ndarray is the right path, as covered under tensor I/O.
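Reading such a score vector needs nothing more than a sort. In the sketch below a plain list stands in for the flat ndarray, and the labels list and the top_k helper are hypothetical names chosen for illustration.

```python
def top_k(scores, labels, k=3):
    """Return the k highest-scoring (label, score) pairs, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(labels[i], scores[i]) for i in order[:k]]


scores = [0.05, 0.80, 0.15]  # stand-in for the flat per-class output
labels = ["cat", "dog", "bird"]
print(top_k(scores, labels, k=2))  # [('dog', 0.8), ('bird', 0.15)]
```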