7.8. Tensor I/O

The engine accepts a single tensor on the input side and produces one or more on the output side. The tensors are ndarray objects with the shape, dtype, and descriptor vocabulary the numpy chapter introduced. Their shapes and dtypes come from the model file and are reported through input_shape / output_shape and input_dtype / output_dtype.

7.8.1. Quantization

Most networks the cam runs operate on quantized integer tensors – int8 or uint8 – to fit within the cam’s RAM and compute budget. A quantized tensor carries integer values that represent real-valued numbers through a per-tensor scale and zero point:

\[\text{real} = \text{scale} \times (q - \text{zero_point})\]
\[q = \mathrm{round}(\text{real} / \text{scale}) + \text{zero_point}\]

The scale and zero point come from the model’s training-time calibration and are stored in the model file. They are exposed as input_scale, input_zero_point, output_scale, and output_zero_point – each a list with one entry per input or output tensor.
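The two formulas can be checked by hand with a small sketch in plain Python. Nothing here comes from the ml module: the scale, zero point, and helper names are made up for illustration, and clamping to the int8 range is an assumption about how out-of-range values would be handled.

```python
# Illustrative values only; a real model reports its own scale and zero point.
SCALE = 0.125
ZERO_POINT = -10
Q_MIN, Q_MAX = -128, 127  # int8 range


def dequantize_value(q, scale=SCALE, zero_point=ZERO_POINT):
    """real = scale * (q - zero_point)"""
    return scale * (q - zero_point)


def quantize_value(real, scale=SCALE, zero_point=ZERO_POINT):
    """q = round(real / scale) + zero_point, clamped to int8 (assumption)."""
    q = round(real / scale) + zero_point
    return max(Q_MIN, min(Q_MAX, q))


print(dequantize_value(30))     # 0.125 * (30 - (-10)) = 5.0
print(quantize_value(5.0))      # round(40.0) + (-10) = 30
print(quantize_value(1000.0))   # far out of range, clamps to 127

# A round trip loses at most half a quantization step (SCALE / 2).
error = abs(dequantize_value(quantize_value(0.3)) - 0.3)
print(error <= SCALE / 2)       # True
```

The round trip shows why the scale matters: any real value inside the representable range comes back within half a step of where it started, so a smaller scale gives finer resolution over a narrower range.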

ml.utils.quantize() and ml.utils.dequantize() apply these formulas using the scale and zero point of the output tensor at the given index:

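import ml.utils

# Quantized integers -> real values, using output 0's scale and zero point.
real_tensor = ml.utils.dequantize(model, q_tensor, index=0)
# Real values -> quantized integers in output 0's representation.
q_tensor    = ml.utils.quantize(model, real_tensor, index=0)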

Both functions return the value unchanged when the output dtype at the given index is already float, so the call is safe regardless of the model’s quantization status.
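The pass-through behaviour can be pictured with a stand-in written in plain Python. This is a sketch under stated assumptions, not the ml.utils implementation: FakeModel and dequantize_sketch are hypothetical names, tensors are plain lists, and the single-character dtype codes ('f' for float, 'b' for int8) are assumed for illustration.

```python
class FakeModel:
    """Hypothetical stand-in exposing the per-output attributes named above."""

    def __init__(self, scales, zero_points, dtypes):
        self.output_scale = scales
        self.output_zero_point = zero_points
        self.output_dtype = dtypes  # assumption: 'f' marks a float output


def dequantize_sketch(model, values, index=0):
    # Float outputs are returned untouched, so callers need not check first.
    if model.output_dtype[index] == "f":
        return values
    scale = model.output_scale[index]
    zero_point = model.output_zero_point[index]
    return [scale * (q - zero_point) for q in values]


# Output 0 is quantized int8, output 1 is already float.
model = FakeModel(scales=[0.5, 0.0], zero_points=[3, 0], dtypes=["b", "f"])

print(dequantize_sketch(model, [3, 5, 1], index=0))         # [0.0, 1.0, -1.0]

floats = [0.25, 0.75]
print(dequantize_sketch(model, floats, index=1) is floats)  # True
```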

7.8.2. What the script sees on the output side

What predict() returns depends on whether a post-processor is registered.

With no post-processor, the engine’s raw integer outputs are auto-dequantized to float and returned as a list of float ndarray objects. The script receives real-valued numbers ready to read. This is the right path for classification networks, whose single output tensor is already a list of per-class confidence scores the application iterates over – no decoding step needed. It is also the easy path for getting an unknown model running quickly or for ad-hoc inspection from the REPL.
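For a classifier, reading the result usually amounts to ranking the scores. The sketch below assumes the single output tensor has already been flattened into a plain list of floats. The scores, the label names, and the top_k helper are all illustrative rather than part of the ml module.

```python
def top_k(scores, labels, k=3):
    """Return the k (label, score) pairs with the highest scores."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return [(label, score) for score, label in ranked[:k]]


# Stand-in for one flattened, dequantized classifier output.
scores = [0.02, 0.91, 0.07]
labels = ["cat", "dog", "bird"]  # hypothetical class names

for label, score in top_k(scores, labels, k=2):
    print(label, score)
# dog 0.91
# bird 0.07
```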

With a post-processor registered (through postprocess= on the constructor or callback= on the predict call), the raw quantized tensors are handed directly to the post-processor’s callable, which is responsible for whatever dequantization it needs.

The split is a performance choice. Auto-dequantization allocates a new float tensor for each output and walks every element. A post-processor that only needs a few values from each tensor – threshold the confidence scores, then decode boxes for the survivors – skips the cost of dequantizing the rest. The box decoders shipped under ml.postprocessing all take this route, and ml.utils.threshold() is built for exactly this case: it takes a quantized score tensor and returns the indices whose dequantized values pass a real-valued threshold, without dequantizing the whole tensor.
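One way to threshold without dequantizing the whole tensor is to move the threshold into the integer domain once and then compare raw integers. The sketch below illustrates that idea in plain Python. It is not claimed to be how ml.utils.threshold() is implemented: the helper name, the use of >= for "pass", and the requirement that scale be positive are all assumptions.

```python
import math


def threshold_indices(q_scores, scale, zero_point, real_threshold):
    """Indices whose dequantized score is >= real_threshold.

    scale * (q - zero_point) >= t  is equivalent to  q >= t / scale + zero_point
    when scale > 0, so only the threshold is converted, not the tensor.
    """
    q_threshold = math.ceil(real_threshold / scale + zero_point)
    return [i for i, q in enumerate(q_scores) if q >= q_threshold]


# Illustrative int8 scores mapped onto [0, 1) with scale 1/256.
q_scores = [-128, -20, 50, 127, 0]
scale = 1 / 256
zero_point = -128

keep = threshold_indices(q_scores, scale, zero_point, 0.5)
print(keep)  # [2, 3, 4]

# Only the survivors are dequantized.
survivors = {i: scale * (q_scores[i] - zero_point) for i in keep}
print(survivors)  # {2: 0.6953125, 3: 0.99609375, 4: 0.5}
```

Here the per-element work is an integer comparison, and the floating-point arithmetic runs only for the entries that pass. With a scale that is not a power of two, rounding in the threshold conversion could move a borderline element across the cutoff, so a careful version would re-check values right at the boundary.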