7.2. What ML changed

The image module carries a handful of legacy detection methods – find_features() for Haar-cascade face detection, find_eye() for the fixed pupil finder, find_hog() for gradient-direction summaries, find_lbp() for local texture patterns, and find_keypoints() for rotation-tolerant corner descriptors. All of them still work; all of them have been superseded by the machine-learning pipeline.

7.2.1. The classical split: hand-designed summaries, learned decisions

A classical vision pipeline had two steps. The first step turned raw pixels into a compact set of numbers chosen to summarise what was in the picture – not the pixel values themselves, but a shorter description of which patterns showed up where. The second step took that summary and made a decision: face or not, this object or that one, same target or different.

The split mattered because the two steps had different authors. The first step was written by a human. Someone sat down and decided that the brightness difference between two specific rectangles was a good summary of an eye region, that the dominant edge direction in each cell of a grid was a good summary of a standing person’s outline, that the bright-or-dark pattern around each pixel was a good summary of local texture. Each of those choices was a hand-written algorithm – written, debugged, and published. Four of the legacy methods above compute summaries of this kind, each of which became a standard tool:

  • find_features() summarises a window of the image by adding up the brightness inside several rectangles and comparing the totals. The rectangle layouts were chosen because human faces show reliable bright-against-dark contrasts: eyebrows against cheeks, eye sockets against forehead, nose against surrounding skin.

  • find_hog() summarises an image by walking a grid of small cells and recording which edge direction dominates in each cell. The grid was chosen because a standing person’s outline produces a recognisable pattern of edge directions regardless of clothing or lighting.

  • find_lbp() summarises each pixel’s neighbourhood by encoding which of its surrounding pixels are brighter and which are darker. The encoding was chosen because these brighter-than / darker-than patterns capture the texture of a surface independently of overall lighting.

  • find_keypoints() finds corner points in the image and describes the area around each corner in a way that stays the same when the corner is rotated. The corner-and-rotation scheme was chosen because the same corners reappear when a scene is viewed from a different angle.
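To make "hand-written summary" concrete, here is a minimal plain-Python sketch of two of the ideas above: a rectangle-difference value and a brighter-than / darker-than neighbourhood code. It is not the image module's implementation; the function names (rect_sum, rect_difference, lbp_code) and the tiny test images are made up for illustration, and images are assumed to be lists of rows of greyscale values.

```python
# Illustrative only: plain-Python versions of two hand-designed summaries.
# These are NOT the image module's implementations; all names are hypothetical.

def rect_sum(img, x, y, w, h):
    """Total brightness inside the rectangle (x, y, w, h)."""
    return sum(img[row][col] for row in range(y, y + h) for col in range(x, x + w))

def rect_difference(img, bright_rect, dark_rect):
    """Rectangle-difference summary: bright-region total minus dark-region total."""
    return rect_sum(img, *bright_rect) - rect_sum(img, *dark_rect)

def lbp_code(img, x, y):
    """8-bit code recording which neighbours of (x, y) are at least as bright as it."""
    centre = img[y][x]
    # Eight neighbours, clockwise starting from the top-left one.
    offsets = [(-1, -1), (0, -1), (1, -1), (1, 0), (1, 1), (0, 1), (-1, 1), (-1, 0)]
    code = 0
    for bit, (dx, dy) in enumerate(offsets):
        if img[y + dy][x + dx] >= centre:
            code |= 1 << bit
    return code

# A 4x4 patch: a dark band on top (think eye socket), a brighter band below (cheek).
patch = [
    [20, 20, 20, 20],
    [20, 20, 20, 20],
    [90, 90, 90, 90],
    [90, 90, 90, 90],
]
contrast = rect_difference(patch, (0, 2, 4, 2), (0, 0, 4, 2))

# The same texture under brighter lighting: every pixel raised by a constant.
texture = [
    [10, 50, 10],
    [50, 30, 50],
    [10, 50, 10],
]
brighter = [[value + 100 for value in row] for row in texture]
code_dim = lbp_code(texture, 1, 1)
code_bright = lbp_code(brighter, 1, 1)
```

In this toy case the neighbourhood code is identical for the dim and the brightened texture, which is the lighting independence the encoding was chosen for. A human decided where the rectangles go and how the neighbours are compared; nothing here was learned.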

Once a summary had been hand-written, a small learning step on top of it could combine the numbers into a decision. The face-detection algorithm bolted a learning step onto the rectangle-difference summary, training it on labelled face and non-face images to learn which combinations of differences signal a face. The edge-direction summary fed into a learning step trained on labelled person and non-person images. The corner descriptors fed into a matching step that learned how much weight to give each corner. Each of these second steps is a learning algorithm – a small one by modern standards, but a learning algorithm.
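The smallest version of such a learning step is a single learned threshold on one summary value. The sketch below is a simplification under stated assumptions, not the algorithm behind find_features(): the name train_stump and the training numbers are invented for illustration, and a real detector would combine many such decisions rather than rely on one.

```python
# Illustrative only: learning a decision on top of a hand-designed summary.
# train_stump and the training values below are hypothetical.

def train_stump(values, labels):
    """Pick the threshold on one summary value that misclassifies the fewest examples.

    values: one hand-designed summary number per training image.
    labels: True for target, False for non-target.
    The learned rule predicts "target" when value >= threshold.
    """
    best_threshold, best_errors = None, len(values) + 1
    for candidate in sorted(set(values)):
        errors = sum((v >= candidate) != label for v, label in zip(values, labels))
        if errors < best_errors:
            best_threshold, best_errors = candidate, errors
    return best_threshold, best_errors

# Made-up rectangle-difference values for labelled training windows.
face_values = [540, 610, 575, 498]
non_face_values = [120, -40, 300, 210]
values = face_values + non_face_values
labels = [True] * len(face_values) + [False] * len(non_face_values)

threshold, errors = train_stump(values, labels)

# Applying the learned rule to a new window's summary value.
new_window_value = 560
is_face = new_window_value >= threshold
```

The division of labour is visible in the code: the numbers going in come from a summary a person designed, and only the threshold comes from the labelled data.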

The contribution split is what mattered. The human contributed the summary. The machine learned the combination. Adding a new target meant writing a new summary.

7.2.2. What neural networks changed

A neural network erases the split. The first layers of the network do the summary work the hand-written algorithms used to do – detecting edges, corners, oriented bars, textures, exactly the things the legacy methods listed above were each tuned to detect – but they are not hand-written. They are learned from the same training data the decision step is learned from, in a single training run that adjusts both halves of the network at once. The deeper layers do the combining that the small learning step on top of the hand-written summaries used to do, also learned, in the same run.

The change in who designs what is total. The human’s contribution shrinks to three things:

  • The human designs the input – captured frames of a given size and format.

  • The human designs the output – the layout of the result tensor (one score per class for classification, a list of boxes for detection, a grid of keypoints for landmarks).

  • The human supplies labelled training data – enough examples of the target and enough examples of non-targets that the training process has something to learn from.

Everything between input and output is generated by the training process. There is no separate summary-writing step. The early layers settle into edge and texture detectors not because anyone wrote them that way, but because edge and texture detection is what makes the network’s predictions match the labels. The deeper layers settle into shape and object detectors for the same reason. Both halves are trained together, so the summaries each layer produces are shaped by what the next layer needs, rather than being the generic ones a hand-written pipeline had to settle for.
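A toy sketch of training both halves together follows. It is far smaller than any real network and makes several assumptions for illustration: the "summary" half is a two-weight filter over a pair of neighbouring pixels, the "decision" half is a scale and a bias feeding a sigmoid, the data is invented, and plain gradient descent stands in for whatever optimiser a real training tool would use. The point is only that one loop updates the filter weights and the decision weights from the same labels.

```python
# Illustrative only: a toy "summary + decision" model trained jointly.
# The data, model shape, and hyperparameters are all hypothetical.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Each example is (left pixel, right pixel) in [0, 1] with a label:
# 1 = dark-to-bright edge, 0 = flat region (at several brightness levels).
examples = [
    ((0.1, 0.9), 1), ((0.2, 0.8), 1), ((0.0, 0.6), 1), ((0.3, 1.0), 1),
    ((0.1, 0.1), 0), ((0.5, 0.5), 0), ((0.9, 0.9), 0), ((0.3, 0.3), 0),
]

def mean_loss(params):
    """Mean cross-entropy between predictions and labels."""
    w0, w1, a, b = params
    total = 0.0
    for (x0, x1), y in examples:
        p = sigmoid(a * (w0 * x0 + w1 * x1) + b)
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # keep log() away from zero
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(examples)

def train(params, learning_rate=0.1, steps=5000):
    """Gradient descent that updates filter (w0, w1) and decision (a, b) together."""
    w0, w1, a, b = params
    n = len(examples)
    for _ in range(steps):
        g_w0 = g_w1 = g_a = g_b = 0.0
        for (x0, x1), y in examples:
            h = w0 * x0 + w1 * x1          # output of the learned "summary" filter
            err = sigmoid(a * h + b) - y   # gradient of the loss w.r.t. the pre-sigmoid value
            g_w0 += err * a * x0           # gradients for the summary half
            g_w1 += err * a * x1
            g_a += err * h                 # gradients for the decision half
            g_b += err
        w0 -= learning_rate * g_w0 / n
        w1 -= learning_rate * g_w1 / n
        a -= learning_rate * g_a / n
        b -= learning_rate * g_b / n
    return (w0, w1, a, b)

# Start from a filter that just averages the two pixels; nobody tells it to look for edges.
start = (0.1, 0.1, 1.0, 0.0)
initial_loss = mean_loss(start)
trained = train(start)
final_loss = mean_loss(trained)
```

On this data, any setting that separates edges from flat regions has to weight the left pixel negatively and the right pixel positively, so the pressure from the labels pushes the filter toward a difference (edge) detector even though no one wrote one.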

7.2.3. Composing with the image module

Neural-network pipelines still capture through the same sensor APIs, draw results through the same draw_rectangle() and draw_circle() primitives, and scope work through the same (x, y, w, h) ROIs. A typical pipeline captures a frame, optionally finds a coarse target with a classical detector like find_blobs() and passes its bounding box to the inference as an ROI, runs the inference, and draws the returned detections onto the original frame. The classical primitives are the substrate; the network is the new step in the middle.
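The shape of that pipeline can be sketched in plain Python. This is not code for the image module or for any particular inference API: coarse_roi stands in for a classical detector such as find_blobs(), fake_infer stands in for a network, and all function names are hypothetical. The sketch also assumes the inference step sees only the cropped patch and reports boxes relative to it, so they must be shifted back into frame coordinates before drawing; whether a real inference call needs that shift depends on the API in use.

```python
# Illustrative only: classical ROI -> inference -> results in frame coordinates.
# Every function here is a hypothetical stand-in, not a real module API.

def coarse_roi(frame, threshold):
    """Bounding box (x, y, w, h) of pixels brighter than threshold, or None."""
    xs, ys = [], []
    for y, row in enumerate(frame):
        for x, value in enumerate(row):
            if value > threshold:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None
    return (min(xs), min(ys), max(xs) - min(xs) + 1, max(ys) - min(ys) + 1)

def crop(frame, roi):
    """Copy of the pixels inside roi."""
    x, y, w, h = roi
    return [row[x:x + w] for row in frame[y:y + h]]

def to_frame_coords(box, roi):
    """Shift a patch-relative (x, y, w, h) box back into frame coordinates."""
    bx, by, bw, bh = box
    rx, ry, _, _ = roi
    return (bx + rx, by + ry, bw, bh)

def run_pipeline(frame, infer, threshold=128):
    """Find a coarse ROI, run inference on it, return boxes in frame coordinates."""
    roi = coarse_roi(frame, threshold)
    if roi is None:
        return []
    detections = infer(crop(frame, roi))
    return [to_frame_coords(box, roi) for box in detections]

def fake_infer(patch):
    """Placeholder for a network: reports one box covering the whole patch."""
    return [(0, 0, len(patch[0]), len(patch))]

# An 8x6 dark frame with a bright 3x3 target at x=3..5, y=2..4.
frame = [[0] * 8 for _ in range(6)]
for y in range(2, 5):
    for x in range(3, 6):
        frame[y][x] = 200

boxes = run_pipeline(frame, fake_infer)
empty = run_pipeline([[0] * 8 for _ in range(6)], fake_infer)
```

The returned boxes are in the same (x, y, w, h) form the drawing primitives already take, which is what lets the network slot into the middle of an otherwise classical loop.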