YOLOv8 walkthrough
==================

The YOLOv8 post-processor is small enough to step through from
input tensor to returned list. Reading it once shows what every
other post-processor in the catalogue is doing: threshold
quantized scores cheaply, take what survives, dequantize, decode
geometry, push to NMS, return per-class lists.

The starting tensor
-------------------

A YOLOv8 model emits a single output tensor whose shape is
``(1, C, A)`` -- one frame, ``C`` channels, ``A`` anchor
predictions. The first four channels are box geometry --
``cx``, ``cy``, ``w``, ``h`` -- normalized to ``[0, 1]`` of the
network's input dimensions. The remaining ``C - 4`` channels are
per-class scores, already in ``[0, 1]`` for a trained model. Each
anchor is a column down the channels.

.. figure:: ../figures/yolov8-tensor.svg
   :alt: A grid laid on its side: rows labelled cx, cy, w, h,
         score_0, score_1, ..., score_(N-1); columns labelled
         anchor 0 through anchor A-1. One column is outlined to
         show that a single anchor's prediction is one column
         down the channels.

   One YOLOv8 anchor is one column down the channels: four box
   numbers and ``N`` class scores.

The shipped ``yolov8n_192.tflite`` is a single-class person
detector, so ``C = 5`` and ``A`` is in the thousands; a custom
model trained on the full 80-class COCO set has ``C = 84`` against
the same ``A``. The decode below holds for any class count.
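
As a rough illustration of that layout, the sketch below builds a toy
``(C, A)`` tensor from nested Python lists and reads one anchor as a
column. The values and the helper name ``anchor_column`` are invented
for the example; a real output tensor holds quantized integers.

```python
# Toy (C, A) tensor: C = 5 channels (cx, cy, w, h, score_0), A = 3 anchors.
# All values are made up for illustration.
toy = [
    [0.50, 0.25, 0.75],  # cx
    [0.50, 0.25, 0.75],  # cy
    [0.20, 0.10, 0.30],  # w
    [0.40, 0.10, 0.30],  # h
    [0.90, 0.05, 0.60],  # score_0
]


def anchor_column(tensor, a):
    """Return anchor `a` as a list: one value from every channel row."""
    return [row[a] for row in tensor]


first = anchor_column(toy, 0)
box, scores = first[:4], first[4:]  # four box numbers, then N class scores
```

With a single class, ``scores`` has one element; an 80-class model
would give 80.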

Thresholding before dequantizing
--------------------------------

The cheap step first. The output tensor is in the model's
quantized integer dtype, and dequantizing every value would touch
every element of a tensor with thousands of entries -- most of
which are below the score threshold and end up discarded. The
post-processor instead quantizes the threshold once and compares
in raw quantized space::

    from ml.utils import quantize, threshold, dequantize, NMS
    from ulab import numpy as np

    oh, ow, oc = model.output_shape[0]
    scale = model.output_scale[0]
    t = quantize(model, self.threshold)

    column_outputs = outputs[0].reshape((oh * ow, oc))

    score_block = column_outputs[4:, :]
    score_indices = threshold(score_block, t, scale,
                              find_max=True,
                              find_max_axis=0)
    if not len(score_indices):
        return ()

``column_outputs`` is the tensor reshaped to ``(C, A)`` so the
channels are rows and the anchors are columns. ``score_block`` is
the class-score sub-tensor -- everything from row ``4`` down.
:func:`ml.utils.threshold` reduces that block along axis ``0``
(``find_max=True``, ``find_max_axis=0``) to the per-anchor maximum
score, then returns the indices of anchors whose maximum passes
the quantized threshold. The full tensor is never dequantized;
the only work done on it is a per-column max-reduction in
quantized integer space.

If no anchor passes, the post-processor returns the empty tuple,
which :meth:`~ml.Model.predict` interprets as no detection.
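
The idea of comparing in quantized space can be sketched in plain
Python. This is not the library's implementation: the helpers
``quantize_value`` and ``surviving_anchors`` are hypothetical, the
standard affine mapping ``real = scale * (q - zero_point)`` is assumed,
and so is the choice of ``>=`` for the comparison.

```python
def quantize_value(x, scale, zero_point):
    """Map a real value to quantized space (assumed affine scheme)."""
    return round(x / scale) + zero_point


def surviving_anchors(score_rows, q_threshold):
    """Indices of anchors whose best quantized class score passes."""
    num_anchors = len(score_rows[0])
    keep = []
    for a in range(num_anchors):
        best = max(row[a] for row in score_rows)  # per-anchor max over classes
        if best >= q_threshold:
            keep.append(a)
    return keep


# Assumed int8 parameters, chosen so that [0, 1) maps onto [-128, 127].
scale, zero_point = 1 / 256, -128
q_t = quantize_value(0.5, scale, zero_point)  # threshold quantized once

# Two class rows, four anchors, raw int8 scores (invented values).
score_rows = [
    [-120, 10, -128, 90],
    [-100, -50, -128, -10],
]
kept = surviving_anchors(score_rows, q_t)
```

Only the threshold is converted; the scores themselves are compared
as integers.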

Two decisions in this code make the difference between a runnable
post-processor and an unusably slow one. The first is driving the
arithmetic through :mod:`numpy`: the output tensor has thousands
of elements, and iterating it in raw Python takes whole seconds
per inference, where the same arithmetic vectorised through
:mod:`numpy` runs in milliseconds. The second is dequantizing
*after* the threshold filter rather than before. Dequantizing
first would allocate a float tensor four times the size of the
quantized one and walk every element before discarding nearly
all of them; dequantizing only the surviving columns touches a
handful of values at most and saves both the time and the RAM
the full conversion would have consumed.
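
A back-of-the-envelope calculation makes the RAM argument concrete.
The sizes here are assumptions for illustration only: an 80-class
model with a hypothetical ``A = 2000`` anchors, ``K = 6`` survivors,
one byte per quantized value and four bytes per float.

```python
# Hypothetical sizes, not measured on any particular model.
C, A, K = 84, 2000, 6
INT8_BYTES, FLOAT32_BYTES = 1, 4

quantized_bytes = C * A * INT8_BYTES          # tensor as the model emits it
full_float_bytes = C * A * FLOAT32_BYTES      # dequantize everything first
survivor_float_bytes = C * K * FLOAT32_BYTES  # dequantize after filtering

print(quantized_bytes, full_float_bytes, survivor_float_bytes)
```

Under these assumptions the full conversion needs several hundred
kilobytes of floats, while the survivors fit in about two kilobytes.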

Dequantizing the survivors
--------------------------

Only the surviving anchors need their geometry decoded.
:func:`numpy.take` pulls those columns out and a single
:func:`ml.utils.dequantize` call converts them to floats::

    bb = dequantize(model,
                    np.take(column_outputs, score_indices, axis=1))

``bb`` is now ``(C, K)`` where ``K`` is the number of surviving
anchors -- typically a handful even when ``A`` was in the
thousands.
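
What the two calls do together can be sketched without the library.
The helpers ``take_columns`` and ``dequantize_rows`` are hypothetical
stand-ins, and the affine mapping ``real = scale * (q - zero_point)``
with the parameters shown is an assumption, not something read from a
real model.

```python
def take_columns(tensor, indices):
    """Keep only the listed columns of a (C, A) nested-list tensor."""
    return [[row[i] for i in indices] for row in tensor]


def dequantize_rows(rows, scale, zero_point):
    """Convert quantized integers to floats (assumed affine scheme)."""
    return [[scale * (q - zero_point) for q in row] for row in rows]


scale, zero_point = 1 / 256, -128  # assumed int8 parameters

# Toy quantized (C, A) tensor with C = 5 and A = 4 (invented values).
quantized = [
    [-64, 0, 64, 0],         # cx
    [0, 0, 0, 0],            # cy
    [-96, -64, -96, -64],    # w
    [-96, -64, -96, -64],    # h
    [-128, 102, -128, 64],   # score_0
]
score_indices = [1, 3]  # anchors that passed the threshold

bb = dequantize_rows(take_columns(quantized, score_indices),
                     scale, zero_point)
# bb has C rows and K = 2 columns.
```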

Reading the geometry
--------------------

The four box channels and the per-class scores are pulled out
directly::

    bb_scores  = np.max(bb[4:, :],    axis=0)
    bb_classes = np.argmax(bb[4:, :], axis=0)

    x_center = bb[0, :]
    y_center = bb[1, :]
    w_half   = bb[2, :] * 0.5
    h_half   = bb[3, :] * 0.5

``bb_scores`` is the best class score per surviving anchor;
``bb_classes`` is the class index that delivered that score. The
box geometry is still in normalized ``[0, 1]`` of network input
dimensions, so the next step scales it to pixels::

    ib, ih, iw, ic = model.input_shape[0]
    xmin = (x_center - w_half) * iw
    ymin = (y_center - h_half) * ih
    xmax = (x_center + w_half) * iw
    ymax = (y_center + h_half) * ih

After this the boxes are in network input pixel space -- the
coordinate space :class:`~ml.utils.NMS` expects on input.
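
A worked example of the centre-to-corner conversion for one box. The
function name ``center_to_corners`` is made up for this sketch, and
the 192 by 192 input size is an assumption based on the shipped
model's filename.

```python
def center_to_corners(cx, cy, w, h, input_w, input_h):
    """Normalized (cx, cy, w, h) to pixel (xmin, ymin, xmax, ymax)."""
    half_w, half_h = w * 0.5, h * 0.5
    return (
        (cx - half_w) * input_w,
        (cy - half_h) * input_h,
        (cx + half_w) * input_w,
        (cy + half_h) * input_h,
    )


# A box centred in the frame, a quarter of the width and half the height.
corners = center_to_corners(0.5, 0.5, 0.25, 0.5, 192, 192)
print(corners)
```

The vectorised code above performs the same arithmetic on all ``K``
survivors at once.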

Non-max suppression
-------------------

The survivors go through NMS and are returned as per-class lists::

    nms = NMS(iw, ih, inputs[0].roi)
    for i in range(bb.shape[1]):
        nms.add_bounding_box(xmin[i], ymin[i],
                             xmax[i], ymax[i],
                             bb_scores[i], bb_classes[i])
    return nms.get_bounding_boxes(threshold=self.nms_threshold,
                                  sigma=self.nms_sigma)

:class:`~ml.utils.NMS` reads ``inputs[0].roi`` so the returned
boxes are in the original image's coordinate space, not the
network's -- the application draws them onto the captured frame
directly without further remapping.
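
For readers unfamiliar with the technique, the sketch below shows
plain greedy non-max suppression on one class. It is not
:class:`~ml.utils.NMS`: it ignores the ``sigma`` parameter and the
ROI remapping, and the names ``iou`` and ``greedy_nms`` are
illustrative only.

```python
def iou(a, b):
    """Intersection over union of two (xmin, ymin, xmax, ymax, ...) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, iou_threshold):
    """Keep the best-scoring box, drop boxes overlapping it, repeat."""
    kept = []
    for box in sorted(boxes, key=lambda b: b[4], reverse=True):
        if all(iou(box, k) <= iou_threshold for k in kept):
            kept.append(box)
    return kept


# (xmin, ymin, xmax, ymax, score); the first two overlap heavily.
boxes = [
    (10, 10, 110, 110, 0.9),
    (20, 20, 120, 120, 0.8),
    (200, 200, 260, 260, 0.7),
]
kept = greedy_nms(boxes, 0.5)
```

The second box is dropped because it overlaps the higher-scoring
first box by more than the threshold; the third is far away and
survives.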

What the script gets back
-------------------------

The return value is a list of per-class lists indexed by class.
A three-class example might look like::

    [
        [((23, 41, 95, 142), 0.92), ((180, 60, 88, 130), 0.71)],
        [],
        [((310, 95, 55, 70), 0.85)],
    ]

Each entry is a ``((x, y, w, h), score)`` tuple: ``(x, y)`` is
the top-left corner of the bounding box in the original image's
pixel coordinates, ``w`` and ``h`` are its width and height in
pixels, and ``score`` is the confidence the network assigned to
the detection. So ``((180, 60, 88, 130), 0.71)`` reads as a
box whose top-left corner sits at pixel ``(180, 60)``, which
extends 88 pixels right and 130 pixels down and was reported
with confidence ``0.71``.

The outer list shows two surviving boxes for class ``0``, nothing
for class ``1``, and one for class ``2``. The empty list for class
``1`` is kept in place so that the outer index always matches
the class index.

For the shipped person detector the outer list has a single
element whose inner list contains the surviving person boxes.
For an 80-class model it has 80 inner lists, most empty on any
given frame, with the non-empty entries holding the boxes for
the classes that fired. The application reads the result with
``enumerate(boxes)`` to walk the class indices alongside the box
lists. Detection post-processors across the catalogue return this
same shape.
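
One way an application might walk that structure, using the
three-class example from this section. The helper name
``flatten_detections`` is hypothetical; only the nesting of the
result is taken from the text.

```python
boxes = [
    [((23, 41, 95, 142), 0.92), ((180, 60, 88, 130), 0.71)],
    [],
    [((310, 95, 55, 70), 0.85)],
]


def flatten_detections(per_class):
    """Turn per-class lists into (class_index, x, y, w, h, score) tuples."""
    found = []
    for class_index, detections in enumerate(per_class):
        for (x, y, w, h), score in detections:
            found.append((class_index, x, y, w, h, score))
    return found


detections = flatten_detections(boxes)
for det in detections:
    print(det)
```

The empty inner list for class ``1`` contributes nothing, and the
class index stays aligned with the outer position.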