Normalization
=============

:meth:`ml.Model.predict` takes a *list* of inputs because some networks
have more than one input tensor, but the list has no way to carry
per-input arguments inline -- there is no kwarg slot for "crop *this*
input to ``(x, y, w, h)`` but leave the other inputs alone".
:class:`ml.preprocessing.Normalization` is the wrapper that fills that
gap.

A :class:`Normalization` instance holds the parameters for one input; the
script passes the wrapped input in the predict list whenever it needs
anything other than the defaults. The most common reason to reach for it
is to crop a specific region of the captured frame into the network
instead of the whole image.

Parameters
----------

::

    Normalization(scale=(0.0, 1.0), mean=(0.0, 0.0, 0.0),
                  stdev=(1.0, 1.0, 1.0), roi=None)

* ``roi`` -- ``(x, y, w, h)`` rectangle in the source frame to crop
  before resizing. Defaults to the whole frame. Most uses of
  :class:`Normalization` set just this parameter.
* ``scale`` -- the ``(min, max)`` range floating-point input tensors
  expect after normalization. The pixel range ``0..255`` is mapped
  linearly into this range. Common values are ``(0.0, 1.0)`` for
  ReLU-trained networks and ``(-1.0, 1.0)`` for symmetrically normalised
  networks.
* ``mean`` -- per-channel ``(R, G, B)`` mean subtracted from the image
  after scaling. Matches the channel statistics the network was trained
  against -- ``(0.485, 0.456, 0.406)`` for ImageNet-derived networks is
  the canonical example. Grayscale networks reduce the mean to a luma
  value using the standard ``0.299*R + 0.587*G + 0.114*B``.
* ``stdev`` -- per-channel ``(R, G, B)`` standard deviation the image is
  divided by after the mean is subtracted, again matching the network's
  training statistics. Reduced to luma the same way for grayscale
  networks.

When parameters matter
----------------------

``scale``, ``mean``, and ``stdev`` are ignored when the network's
:attr:`~ml.Model.input_dtype` is ``int8`` or ``uint8``.
For integer-input networks the cropped image bytes are written into the
tensor directly and the network's own :attr:`~ml.Model.input_scale` and
:attr:`~ml.Model.input_zero_point` handle the int-to-real conversion.
The three parameters matter only when the network expects floating-point
input.

``roi`` is read in every case -- it controls which part of the source
frame reaches the network regardless of the input dtype.

ROI and resize
--------------

The ROI is bilinearly scaled from its source dimensions to the network's
input dimensions. The image is centred in the destination and the
scaling fills the destination -- it does not preserve aspect ratio. A
non-square ROI fed to a square network input comes out horizontally or
vertically stretched.

Whether the stretch matters depends on the network. Face, hand, and pose
landmark models such as the MediaPipe family (BlazeFace, FaceLandmarks,
HandLandmarks) and MoveNet were trained against square crops and degrade
quickly when the input aspect ratio is off; for those, the application
needs to supply a square ROI -- either by capturing at a square
framesize through :meth:`~csi.CSI.window` or by cropping with the
``roi=`` parameter. YOLO-family object detectors are typically trained
with augmentation that includes random stretches and accept non-square
ROIs without much accuracy loss; passing the full captured frame
straight in is usually fine.

When the network's input dimensions match the ROI exactly, the scale
collapses to a copy, which is the cheapest case.

Overriding the default
----------------------

:meth:`~ml.Model.predict` wraps each :class:`image.Image` input with
``Normalization()`` automatically, using the default parameters listed
above.
Most models that ship with the cam were trained against pixel ranges the
defaults already cover, so the common case is to pass the image
directly::

    result = model.predict([img])

To use a custom ROI -- the most common override -- build a
:class:`Normalization` with the ROI set and bind the image to it::

    from ml.preprocessing import Normalization

    norm = Normalization(roi=(80, 60, 160, 120))
    result = model.predict([norm(img)])

To match a network's training-time channel statistics, set the
floating-point parameters::

    norm = Normalization(scale=(0.0, 1.0),
                         mean=(0.485, 0.456, 0.406),
                         stdev=(0.229, 0.224, 0.225))
    result = model.predict([norm(img)])

Calling the :class:`Normalization` instance on the image returns a new
bound instance from which the engine fills the tensor. The bound
instance is what predict accepts in place of the raw image, and because
it is a per-input object, a multi-input network can mix images with
different ROIs in the same predict list.

For networks that expect inputs the application has already produced in
tensor form -- a buffer from a peripheral, an
:class:`~ulab.numpy.ndarray` computed by another pipeline, non-image
numeric data -- skip :class:`Normalization` entirely and pass the
ndarray or a callable that produces it. :meth:`~ml.Model.predict` passes
those through to the engine without wrapping.
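
An application that builds its own floating-point tensor has to
reproduce the normalization arithmetic itself. The following is a
minimal sketch in plain Python, assuming the order the parameter
descriptions give: map ``0..255`` linearly into ``scale``, subtract
``mean``, then divide by ``stdev``. ``normalize_pixel`` and ``luma`` are
hypothetical helper names, not part of the ``ml`` module, and the
engine's exact rounding and evaluation order may differ.

```python
def normalize_pixel(rgb, scale=(0.0, 1.0), mean=(0.0, 0.0, 0.0),
                    stdev=(1.0, 1.0, 1.0)):
    """Normalize one (R, G, B) pixel with 0..255 channel values.

    Assumed order: linear map into `scale`, subtract `mean`,
    divide by `stdev` (hypothetical helper, for illustration only).
    """
    lo, hi = scale
    out = []
    for value, m, s in zip(rgb, mean, stdev):
        scaled = lo + (value / 255.0) * (hi - lo)  # 0..255 -> lo..hi
        out.append((scaled - m) / s)               # match training statistics
    return tuple(out)


def luma(rgb):
    """Reduce a per-channel (R, G, B) triple to a single luma value."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b
```

With ``scale=(-1.0, 1.0)`` and the default ``mean`` and ``stdev``, a
white pixel ``(255, 255, 255)`` maps to ``(1.0, 1.0, 1.0)`` and a black
pixel to ``(-1.0, -1.0, -1.0)``. For a grayscale network, the same
arithmetic would apply with ``luma(mean)`` and ``luma(stdev)`` in place
of the per-channel triples.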