7.7. Normalization
ml.Model.predict() takes a list of inputs because some
networks have more than one input tensor, but the list has no way
to carry per-input arguments inline – there is no kwarg slot for
“crop this input to (x, y, w, h) but leave the other inputs
alone”. ml.preprocessing.Normalization is the wrapper that
fills that gap. A Normalization instance holds the
parameters for one input; the script passes the wrapped input in
the predict list whenever it needs anything other than the defaults.
The most common reason to reach for it is to crop a specific region of the captured frame into the network instead of the whole image.
7.7.1. Parameters
Normalization(scale=(0.0, 1.0),
              mean=(0.0, 0.0, 0.0),
              stdev=(1.0, 1.0, 1.0),
              roi=None)
roi – (x, y, w, h) rectangle in the source frame to crop before resizing. Defaults to the whole frame. Most uses of Normalization set just this parameter.
scale – the (min, max) range that floating-point input tensors expect after normalization. The pixel range 0..255 is mapped linearly into this range. Common values are (0.0, 1.0) for ReLU-trained networks and (-1.0, 1.0) for symmetrically normalised networks.
mean – per-channel (R, G, B) mean subtracted from the image after scaling. Matches the channel statistics the network was trained against – (0.485, 0.456, 0.406) for ImageNet-derived networks is the canonical example. Grayscale networks reduce the mean to a luma value using the standard 0.299*R + 0.587*G + 0.114*B.
stdev – per-channel (R, G, B) standard deviation the image is divided by after the mean is subtracted, again matching the network’s training statistics. Reduced to luma the same way for grayscale networks.
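The arithmetic behind scale, mean, and stdev can be sketched in plain Python. This is only an illustration of the order of operations described above (map 0..255 into the scale range, subtract the mean, divide by the standard deviation), not the engine's actual implementation; normalize_pixel and luma are hypothetical helper names introduced for this example.

```python
def normalize_pixel(rgb, scale=(0.0, 1.0),
                    mean=(0.0, 0.0, 0.0), stdev=(1.0, 1.0, 1.0)):
    """Map one 8-bit (R, G, B) pixel to floats: scale, subtract mean, divide by stdev.

    Hypothetical helper for illustration; not part of the ml module.
    """
    lo, hi = scale
    out = []
    for value, m, s in zip(rgb, mean, stdev):
        scaled = lo + (value / 255.0) * (hi - lo)  # 0..255 -> lo..hi, linearly
        out.append((scaled - m) / s)
    return tuple(out)


def luma(rgb):
    """Reduce a per-channel (R, G, B) triple to a single luma value."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b


# With the defaults, a white pixel maps to 1.0 on every channel.
white = normalize_pixel((255, 255, 255))

# With a symmetric range, a black pixel maps to -1.0 on every channel.
black = normalize_pixel((0, 0, 0), scale=(-1.0, 1.0))

# ImageNet channel means reduced to one value for a grayscale network.
gray_mean = luma((0.485, 0.456, 0.406))
```

The same luma reduction would apply to a stdev triple when the network takes grayscale input.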
7.7.2. When parameters matter
scale, mean, and stdev are ignored when the network’s
input_dtype is int8 or uint8. For
integer-input networks the cropped image bytes are written into the
tensor directly and the network’s own
input_scale and
input_zero_point handle the int-to-real
conversion. The three parameters matter only when the network
expects floating-point input.
roi is read in every case – it controls which part of the
source frame reaches the network regardless of the input dtype.
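The rule above can be summarised as a small decision sketch. It assumes the dtype is spelled as the strings "int8" and "uint8" (the real attribute may use a different representation) and that the int-to-real conversion follows the conventional affine quantization formula used by TensorFlow Lite; float_params_apply and dequantize are hypothetical names, not part of the ml module.

```python
INTEGER_DTYPES = ("int8", "uint8")  # assumed spelling, for illustration only


def float_params_apply(input_dtype):
    """True when scale, mean, and stdev influence the tensor contents."""
    return input_dtype not in INTEGER_DTYPES


def dequantize(q, input_scale, input_zero_point):
    """Conventional affine int-to-real conversion: real = scale * (q - zero_point)."""
    return input_scale * (q - input_zero_point)


# Integer-input networks ignore the float parameters; float-input networks use them.
ignored_for_uint8 = not float_params_apply("uint8")
used_for_float = float_params_apply("float32")

# A network with input_scale 1/256 and input_zero_point 0 reads byte 128 as 0.5.
half = dequantize(128, 1.0 / 256.0, 0)
```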
7.7.3. ROI and resize
The ROI is bilinearly scaled from its source dimensions to the network’s input dimensions. The image is centred in the destination and the scaling fills the destination – it does not preserve aspect ratio. A non-square ROI fed to a square network input comes out horizontally or vertically stretched.
Whether the stretch matters depends on the network. Face, hand, and
pose models such as the MediaPipe family (BlazeFace, FaceLandmarks,
HandLandmarks) and MoveNet were trained against square
crops and degrade quickly when the input aspect ratio is off; for
those, the application needs to give them a square ROI – either by
capturing at a square framesize through window() or
by cropping with the roi= parameter. YOLO-family object
detectors are typically trained with augmentation that includes
random stretches and accept non-square ROIs without much accuracy
loss; passing the full captured frame straight in is usually fine.
When the ROI’s dimensions match the network’s input dimensions exactly, the resize collapses to a plain copy, which is the cheapest case.
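For networks that need a square input, the ROI can be computed from the frame size. The sketch below picks the largest centred square and also shows how to check whether a given ROI will be stretched; centered_square_roi and stretch_factors are hypothetical helpers, and the 320x240 frame and 96x96 network input are example values only.

```python
def centered_square_roi(frame_w, frame_h):
    """Largest square (x, y, w, h) centred in a frame_w x frame_h frame."""
    side = min(frame_w, frame_h)
    x = (frame_w - side) // 2
    y = (frame_h - side) // 2
    return (x, y, side, side)


def stretch_factors(roi, net_w, net_h):
    """Horizontal and vertical scale factors applied when the ROI fills the input."""
    _, _, w, h = roi
    return (net_w / w, net_h / h)


# Square crop from a 320x240 frame.
roi = centered_square_roi(320, 240)

# Equal factors: the crop is resized without distortion.
even = stretch_factors(roi, 96, 96)

# Unequal factors: the full frame would be stretched to fit.
skewed = stretch_factors((0, 0, 320, 240), 96, 96)
```

The resulting tuple has the (x, y, w, h) layout that the roi= parameter expects.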
7.7.4. Overriding the default
predict() automatically wraps each image.Image input
with Normalization(), using the default parameters
above. Most models that ship with the camera were trained against
pixel ranges the defaults already cover, so the common case is to
pass the image directly:
result = model.predict([img])
To use a custom ROI – the most common override – build a
Normalization with the ROI set and bind the image to it:
from ml.preprocessing import Normalization
norm = Normalization(roi=(80, 60, 160, 120))
result = model.predict([norm(img)])
To match a network’s training-time channel statistics, set the floating-point parameters:
norm = Normalization(scale=(0.0, 1.0),
                     mean=(0.485, 0.456, 0.406),
                     stdev=(0.229, 0.224, 0.225))
result = model.predict([norm(img)])
Calling the Normalization instance on the image returns a
new bound instance, which the engine uses to fill the input tensor.
predict() accepts the bound instance in place of the raw image, and
because it is a per-input object, a multi-input network can mix
images with different ROIs in the same predict list.
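As an illustration of mixing ROIs, the sketch below splits one frame into left and right halves for a hypothetical two-input network. half_rois is a helper invented for this example, and the 320x240 frame size is an assumption; the on-camera calls are shown as comments because they need the ml module and a loaded model.

```python
def half_rois(frame_w, frame_h):
    """Split a frame into left and right (x, y, w, h) halves."""
    half = frame_w // 2
    left = (0, 0, half, frame_h)
    right = (half, 0, frame_w - half, frame_h)
    return left, right


left, right = half_rois(320, 240)

# On the camera, each ROI gets its own Normalization bound to the same image:
#   from ml.preprocessing import Normalization
#   result = model.predict([Normalization(roi=left)(img),
#                           Normalization(roi=right)(img)])
```

Each entry in the list carries its own parameters, so one input could equally keep the defaults while the other sets an ROI.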
For networks that expect inputs the application has already
produced in tensor form – a buffer from a peripheral, an
ndarray computed by another pipeline,
non-image numeric data – skip Normalization entirely and
pass the ndarray or a callable that produces it. predict()
passes those through to the engine without wrapping.