7.14. YOLOv8 walkthrough

The YOLOv8 post-processor is small enough to step through from input tensor to returned list. Reading it once shows what every other post-processor in the catalogue is doing: threshold quantized scores cheaply, take what survives, dequantize, decode geometry, push to NMS, return per-class lists.

7.14.1. The starting tensor

A YOLOv8 model emits a single output tensor whose shape is (1, C, A) – one frame, C channels, A anchor predictions. The first four channels are box geometry – cx, cy, w, h – normalized to [0, 1] of the network’s input dimensions. The remaining C - 4 channels are per-class scores, already in [0, 1] for a trained model. Each anchor is a column down the channels.

Figure: One YOLOv8 anchor is one column down the channels: four box numbers and N class scores. The grid is laid on its side, with rows labelled cx, cy, w, h, score_0, score_1, ..., score_(N-1) and columns labelled anchor 0 through anchor A-1; one column is outlined to mark a single anchor's prediction.

The shipped yolov8n_192.tflite is a single-class person detector, so C = 5 and A is in the thousands; a custom model trained on the full 80-class COCO set has C = 84 against the same A. The decode below holds for any class count.
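The column layout can be made concrete with a small sketch. It uses plain nested lists rather than a real model output; the values, the helper name split_anchor, and the three-class, two-anchor shape are all invented for illustration.

```python
# Toy (C, A) tensor: C = 4 box channels + 3 class scores, A = 2 anchors.
# Values are invented for illustration, not taken from a real model.
tensor = [
    [0.50, 0.20],  # cx
    [0.40, 0.70],  # cy
    [0.30, 0.10],  # w
    [0.60, 0.25],  # h
    [0.05, 0.80],  # score_0
    [0.90, 0.10],  # score_1
    [0.02, 0.03],  # score_2
]


def split_anchor(tensor, a):
    """Return (box, scores) for anchor a: one column down the channels."""
    column = [row[a] for row in tensor]
    return tuple(column[:4]), column[4:]


box, scores = split_anchor(tensor, 0)
best_class = scores.index(max(scores))
print(box, best_class)
```

Changing the number of score rows changes C but not the way a column is read, which is why the same decode applies to any class count.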

7.14.2. Thresholding before dequantizing

The cheap step first. The output tensor is in the model’s quantized integer dtype, and dequantizing every value would touch every element of a tensor with thousands of entries – most of which are below the score threshold and end up discarded. The post-processor instead quantizes the threshold once and compares in raw quantized space:

from ml.utils import quantize, threshold, dequantize, NMS
from ulab import numpy as np

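# Output shape is (1, C, A), so oh = 1, ow = C and oc = A.
oh, ow, oc = model.output_shape[0]
scale = model.output_scale[0]
# Quantize the float threshold once so scores can be compared as integers.
t = quantize(model, self.threshold)

# View the output as (C, A): channels are rows, anchors are columns.
column_outputs = outputs[0].reshape((oh * ow, oc))

# Rows 4 and below hold the class scores.
score_block = column_outputs[4:, :]
# Indices of anchors whose best class score passes the quantized threshold.
score_indices = threshold(score_block, t, scale,
                          find_max=True,
                          find_max_axis=0)
if not len(score_indices):
    return ()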

column_outputs is the tensor reshaped to (C, A) so that the channels are rows and the anchors are columns. score_block is the class-score sub-tensor: everything from row 4 down. ml.utils.threshold() reduces that block along axis 0 (find_max=True, find_max_axis=0) to the per-anchor maximum score, then returns the indices of the anchors whose maximum passes the quantized threshold. The full tensor is never dequantized; the only work done is a per-column max-reduction and comparison in quantized integer space.

If no anchor passes, the post-processor returns the empty tuple that predict() interprets as no detection.
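The idea of comparing in quantized space can be sketched on a desktop with standard NumPy. This is not the ml.utils implementation: the helpers quantize_value and surviving_anchors are hypothetical, the affine mapping real = scale * (q - zero_point) is an assumption about the model's quantization, and the scale, zero point, and scores are invented.

```python
import numpy as np

# Assumed affine quantization: real = scale * (q - zero_point).
# SCALE and ZERO_POINT are illustrative, not read from a real model.
SCALE = 1.0 / 255.0
ZERO_POINT = -128


def quantize_value(x, scale=SCALE, zero_point=ZERO_POINT):
    """Map one real value into the quantized integer domain."""
    return int(round(x / scale)) + zero_point


def surviving_anchors(score_block, q_threshold):
    """Indices of columns whose maximum quantized score passes the threshold.

    Whether the real comparison is inclusive is an assumption of this sketch.
    """
    col_max = np.max(score_block, axis=0)      # per-anchor max, still int8
    return np.nonzero(col_max >= q_threshold)[0]


# Two classes (rows) by four anchors (columns), already quantized.
score_block = np.array([[-128, 100, -90, 30],
                        [-100, -20, 60, 20]], dtype=np.int8)

q_t = quantize_value(0.6)          # threshold converted once
keep = surviving_anchors(score_block, q_t)
print(q_t, keep.tolist())
```

The threshold is converted a single time, and every score is then compared as an integer, so no float copy of score_block is ever made.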

Two decisions in this code make the difference between a runnable post-processor and an unusably slow one. The first is driving the arithmetic through numpy: the output tensor has thousands of elements, and iterating over it in plain Python takes whole seconds per inference, where the same arithmetic vectorised through numpy runs in milliseconds. The second is dequantizing after the threshold filter rather than before. Dequantizing first would allocate a float tensor four times the size of the quantized one (assuming 8-bit quantized values and 32-bit floats) and walk every element before discarding nearly all of them; dequantizing only the surviving columns touches a small fraction of the values and saves both the time and the RAM the full conversion would have consumed.
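A rough back-of-the-envelope calculation shows the scale of the RAM saving. The sizes below are illustrative assumptions (an 80-class tensor, 2000 anchors, six survivors, 8-bit quantized values and 32-bit floats), not measurements from a shipped model.

```python
# Illustrative sizes only; none of these are measured from a real model.
C = 84            # 4 box channels + 80 class scores
A = 2000          # assumed anchor count
K = 6             # assumed number of anchors that pass the threshold

INT8_BYTES = 1
FLOAT32_BYTES = 4   # assumes dequantized values are 32-bit floats

quantized_bytes = C * A * INT8_BYTES
full_dequant_bytes = C * A * FLOAT32_BYTES
survivor_dequant_bytes = C * K * FLOAT32_BYTES

print(quantized_bytes, full_dequant_bytes, survivor_dequant_bytes)
```

Under these assumptions the full float copy would need 672000 bytes, while the survivors need 2016.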

7.14.3. Dequantizing the survivors

Only the surviving anchors need their geometry decoded. numpy.take() pulls those columns out and a single ml.utils.dequantize() call converts them to floats:

bb = dequantize(model,
                np.take(column_outputs, score_indices, axis=1))

bb now has shape (C, K), where K is the number of surviving anchors, typically a handful even when A is in the thousands.
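The column selection and conversion can be tried with standard NumPy on a desktop. In this sketch dequantize_array is a hypothetical stand-in for ml.utils.dequantize(), the affine mapping real = scale * (q - zero_point) is an assumption, and the scale, zero point, and tensor values are invented.

```python
import numpy as np

SCALE = 0.125      # illustrative quantization parameters
ZERO_POINT = -4


def dequantize_array(q, scale=SCALE, zero_point=ZERO_POINT):
    """Assumed affine dequantization: real = scale * (q - zero_point)."""
    return (q.astype(np.float32) - zero_point) * scale


# (C, A) = (5, 4): four box rows plus one score row, four anchors.
column_outputs = np.array([[-4, 0, 2, 4],      # cx
                           [-4, -2, 0, 2],     # cy
                           [-2, -2, -2, -2],   # w
                           [0, 0, 0, 0],       # h
                           [-4, 3, -3, 4]],    # score_0
                          dtype=np.int8)
score_indices = np.array([1, 3])       # anchors that passed the threshold

survivors = np.take(column_outputs, score_indices, axis=1)   # shape (C, K)
bb = dequantize_array(survivors)
print(bb.shape, bb[:, 0].tolist())
```

Only the two selected columns are converted to floats; the other two anchors stay in their quantized form and are never touched again.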

7.14.4. Reading the geometry

The four box channels and the per-class scores are pulled out directly:

bb_scores  = np.max(bb[4:, :],    axis=0)
bb_classes = np.argmax(bb[4:, :], axis=0)

x_center = bb[0, :]
y_center = bb[1, :]
w_half   = bb[2, :] * 0.5
h_half   = bb[3, :] * 0.5

bb_scores is the best class score per surviving anchor; bb_classes is the index of the class that delivered that score. The box geometry is still normalized to [0, 1] of the network input dimensions, so the next step scales it to pixels:

ib, ih, iw, ic = model.input_shape[0]
xmin = (x_center - w_half) * iw
ymin = (y_center - h_half) * ih
xmax = (x_center + w_half) * iw
ymax = (y_center + h_half) * ih

After this the boxes are in network input pixel space – the coordinate space NMS expects on input.
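A single box worked through by hand makes the centre-to-corner arithmetic easy to check. The helper center_to_corners is hypothetical, and the 192x192 input size is an assumption suggested by the shipped model's filename.

```python
def center_to_corners(cx, cy, w, h, iw, ih):
    """Convert a normalized centre/size box to pixel corner coordinates."""
    w_half = w * 0.5
    h_half = h * 0.5
    return ((cx - w_half) * iw, (cy - h_half) * ih,
            (cx + w_half) * iw, (cy + h_half) * ih)


# A box centred in an assumed 192x192 input, a quarter wide and half tall.
corners = center_to_corners(0.5, 0.5, 0.25, 0.5, iw=192, ih=192)
print(corners)
```

The result is (xmin, ymin, xmax, ymax) in network input pixels: the box spans x from 72 to 120 and y from 48 to 144.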

7.14.5. Non-max suppression

The survivors go through NMS and are returned as per-class lists:

nms = NMS(iw, ih, inputs[0].roi)
for i in range(bb.shape[1]):
    nms.add_bounding_box(xmin[i], ymin[i],
                         xmax[i], ymax[i],
                         bb_scores[i], bb_classes[i])
return nms.get_bounding_boxes(threshold=self.nms_threshold,
                              sigma=self.nms_sigma)

NMS reads inputs[0].roi so the returned boxes are in the original image’s coordinate space, not the network’s – the application draws them onto the captured frame directly without further remapping.
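For readers unfamiliar with non-max suppression, the following is a minimal sketch of the plain greedy variant. It is not the ml.utils.NMS implementation: it ignores class indices, the sigma parameter, and the remapping to the ROI, and the helper names iou and greedy_nms and the candidate boxes are invented for illustration.

```python
def iou(a, b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, iou_threshold):
    """boxes: list of (corners, score). Keep the best box of each overlapping group."""
    kept = []
    # Visit boxes from highest to lowest score.
    for box in sorted(boxes, key=lambda b: b[1], reverse=True):
        # Keep a box only if it does not overlap any already-kept box too much.
        if all(iou(box[0], k[0]) <= iou_threshold for k in kept):
            kept.append(box)
    return kept


candidates = [
    ((10, 10, 50, 50), 0.9),
    ((12, 12, 52, 52), 0.8),      # heavy overlap with the first box
    ((100, 100, 140, 140), 0.7),  # separate object
]
kept = greedy_nms(candidates, iou_threshold=0.5)
print(kept)
```

The second candidate overlaps the first with an IoU of roughly 0.82, so it is dropped, while the distant third box survives.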

7.14.6. What the script gets back

The return value is a list of per-class lists indexed by class. A three-class example might look like:

[
    [((23, 41, 95, 142), 0.92), ((180, 60, 88, 130), 0.71)],
    [],
    [((310, 95, 55, 70), 0.85)],
]

Each entry is a ((x, y, w, h), score) tuple: (x, y) is the top-left corner of the bounding box in the original image’s pixel coordinates, w and h are its width and height in pixels, and score is the confidence the network assigned to the detection. So ((180, 60, 88, 130), 0.71) reads as a box whose top-left corner sits at pixel (180, 60), extends 88 pixels right and 130 pixels down, and was reported with confidence 0.71.

The outer list shows two surviving boxes for class 0, nothing for class 1, and one for class 2. The empty list for class 1 is kept in place so that the outer index always matches the class index. For the shipped person detector the outer list has a single element whose inner list contains the surviving person boxes. For an 80-class model it has 80 inner lists, most of them empty on any given frame, with the non-empty entries holding the boxes for the classes that fired. The application reads the result with enumerate(boxes) to walk the class indices alongside the box lists. This is the same return shape that the other detection post-processors in the catalogue produce.
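Walking the result with enumerate() might look like the sketch below. It reuses the three-class example value from above; the label names are hypothetical and exist only to make the output readable.

```python
# The three-class example result from above.
boxes = [
    [((23, 41, 95, 142), 0.92), ((180, 60, 88, 130), 0.71)],
    [],
    [((310, 95, 55, 70), 0.85)],
]

# Hypothetical label names, only for this example.
labels = ["person", "bicycle", "car"]

lines = []
for class_index, detections in enumerate(boxes):
    # Empty inner lists are simply skipped by the inner loop.
    for (x, y, w, h), score in detections:
        lines.append("%s at (%d, %d) size %dx%d score %.2f"
                     % (labels[class_index], x, y, w, h, score))

for line in lines:
    print(line)
```

Because empty classes keep their slot in the outer list, class_index can be used directly to look up a label.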