16.3.2. The watchdog
The hardware watchdog is the floor every other hardening choice sits on. It is a tiny independent timer that resets the processor when it has not been told otherwise for too long. A script that wedges on a flaky sensor, a network call that blocks past its timeout, a memory allocator stuck in a corner of the heap, an exception that escaped the loop – none of them stop the watchdog. The timer counts down regardless, and the cam reboots.
For a shipped product, a watchdog is not optional. Without it, any of the failure modes above leaves the cam dead until someone notices and power-cycles it. With it, the cam comes back up by itself and the only evidence of the failure is one line in the log.
See also
The hardware chapter’s watchdog timer page
covers what a watchdog is at the hardware level and the
basics of the machine.WDT API. This page covers
what changes for a production deployment.
16.3.2.1. Starting the watchdog
machine.WDT is the API. It is hardware-backed: once
constructed, the timer runs until the next reset. There is no
stop(), no deinit(), no Ctrl-C escape. That is the
point.
A typical setup at the top of main.py, immediately before
the loop it protects:
from machine import WDT
wdt = WDT(timeout=10_000) # milliseconds
main.py is the right home for the watchdog because that
is where the loop lives. A watchdog reset is a hardware
reset, so the cold-boot path re-runs and main.py re-enters
the loop on its own – recovery works without any wiring in
boot.py. Starting the watchdog in boot.py instead
means every soft reset (a developer’s Ctrl-D, for instance)
hands the application a hardware timer it has no way to
stop. That is an annoyance at the bench, and in production
it is a trap for any setup code that runs before the loop is
ready to feed the timer.
Pick a timeout of 2 to 3 times the worst observed iteration time of the main loop. Frame-rate jitter, a slow read from a cold sensor, a brief Wi-Fi hiccup – none of those should trip the watchdog. A real hang (an infinite loop, a blocked I/O call) should. Too-short timeouts turn the watchdog into a source of false resets; too-long timeouts let the cam sit unresponsive for minutes before recovery fires.
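The sizing rule can be sketched as a small helper. This is only an illustration: `pick_timeout_ms`, its `factor` and `round_to_ms` parameters, and the sample numbers are hypothetical names and values, not part of any library. On the cam, the samples would come from timing real iterations, for example with `time.ticks_ms()` and `time.ticks_diff()`.

```python
import math


def pick_timeout_ms(iteration_times_ms, factor=3, round_to_ms=1000):
    """Return factor x the worst measured iteration, rounded up."""
    if not iteration_times_ms:
        raise ValueError("need at least one measured iteration time")
    worst = max(iteration_times_ms)
    # Round up so the result never drops below factor x worst.
    return int(math.ceil(worst * factor / round_to_ms) * round_to_ms)


# Illustrative numbers only, not measurements from a real cam.
samples_ms = [310, 295, 1240, 330, 2900]
timeout_ms = pick_timeout_ms(samples_ms)
```

The worst sample drives the result, not the average: one slow iteration that is still healthy must fit inside the timeout.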
16.3.2.2. Feeding it
wdt.feed() resets the countdown. Call it once per
iteration of the main loop, at the top of the loop body so
the feed happens unconditionally before any work that might
hang:
while True:
    wdt.feed()
    frame = csi0.snapshot()
    process(frame)
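The feed-and-countdown contract can be exercised off-device with a stand-in. The sketch below is a hypothetical `FakeWDT` class written for illustration; it mirrors the `timeout=` argument and `feed()` method of machine.WDT but reports expiry instead of resetting anything, and the injected `clock` callable is an assumption made so the example runs without real delays.

```python
class FakeWDT:
    """Host-side stand-in for machine.WDT; hypothetical, for tests only."""

    def __init__(self, timeout, clock):
        self.timeout = timeout          # milliseconds, as with machine.WDT
        self._clock = clock             # callable returning the time in ms
        self._deadline = clock() + timeout

    def feed(self):
        # Feeding pushes the deadline one full timeout into the future.
        self._deadline = self._clock() + self.timeout

    def expired(self):
        # Real hardware resets the processor here instead of reporting.
        return self._clock() >= self._deadline


now_ms = [0]                            # simulated clock, advanced by hand
wdt = FakeWDT(timeout=10_000, clock=lambda: now_ms[0])

now_ms[0] += 9_000                      # a slow but healthy iteration
healthy = wdt.expired()
wdt.feed()
now_ms[0] += 12_000                     # a hang longer than the timeout
hung = wdt.expired()
```

Here `healthy` is False and `hung` is True: an iteration shorter than the timeout survives, while a stall longer than the timeout since the last feed does not.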
16.3.2.3. Surviving exceptions
The watchdog handles hangs. Exceptions are a different
failure mode. An unhandled exception bubbles up to the
script’s top level, main.py exits, and the cam drops to
the REPL. The watchdog then trips after its timeout because
nothing is feeding it from the REPL, the cam resets, and
main.py runs again – so recovery does work, but the
field pays a full timeout-plus-reboot for every crash, the
traceback goes to USB stdout that nothing reads, and any
in-memory state the application was keeping is gone.
Wrapping the main loop in a top-level try / except
turns a crash into a logged event the application continues
through, without paying for a reset:
import logging

log = logging.getLogger(__name__)

while True:
    wdt.feed()
    try:
        frame = csi0.snapshot()
        process(frame)
    except Exception:
        log.exception("frame loop iteration failed")
Catching Exception (not BaseException) keeps
KeyboardInterrupt and SystemExit working, which
is what a developer attached over USB wants.
This pattern is the software half of liveness: the watchdog catches the hangs, the wrapper catches the crashes, and the log records what either of them caught.
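The difference between catching Exception and BaseException can be checked with a small off-device sketch. `run_guarded`, `bad_frame`, and `ctrl_c` are hypothetical names used only for illustration; the behavior shown relies on the standard Python hierarchy, in which KeyboardInterrupt and SystemExit derive from BaseException rather than Exception.

```python
def run_guarded(step, on_error):
    """Run one iteration; report ordinary exceptions, let the rest through."""
    try:
        step()
    except Exception as exc:
        on_error(exc)
        return False
    return True


def bad_frame():
    raise ValueError("bad frame")       # an ordinary application failure


def ctrl_c():
    raise KeyboardInterrupt             # what a developer's Ctrl-C raises


errors = []
completed = run_guarded(bad_frame, errors.append)   # caught and recorded

try:
    run_guarded(ctrl_c, errors.append)
    interrupted = False
except KeyboardInterrupt:
    interrupted = True                  # the interrupt escaped the wrapper
```

The ValueError ends up in `errors` and the loop could continue, while the KeyboardInterrupt passes straight through the wrapper.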
16.3.2.4. Knowing why a boot happened
Every soft reset and every watchdog reset eventually shows up
as a fresh boot. The boot-time diagnostics helper logs
machine.reset_cause() on every cold start; that line tells
the field whether a boot was a watchdog recovery or an
ordinary power cycle.
The reset-cause line is what makes the watchdog’s work
visible in the log. A log full of watchdog timeout
resets says the application has been hanging and the
watchdog has been recovering it. A log without them says the
watchdog has not had to fire – which usually means the
application is healthy, but can also mean the timeout is set
too long to catch the hangs that are actually happening.
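A reset-cause line might be produced along these lines. This is a sketch, not the boot-time diagnostics helper itself: `reset_cause_name` is a hypothetical function, and the constant names are the ones MicroPython's machine module commonly defines, though not every port provides all of them. A stand-in object with made-up values replaces the real module so the example runs off-device.

```python
from types import SimpleNamespace

# Constant names commonly found in MicroPython's machine module.
# Ports differ, so missing names are skipped rather than assumed.
_CAUSE_NAMES = (
    "PWRON_RESET",
    "HARD_RESET",
    "WDT_RESET",
    "DEEPSLEEP_RESET",
    "SOFT_RESET",
)


def reset_cause_name(machine_module):
    """Return the name of the constant matching reset_cause()."""
    cause = machine_module.reset_cause()
    for name in _CAUSE_NAMES:
        if getattr(machine_module, name, None) == cause:
            return name
    return "UNKNOWN(%r)" % (cause,)


# Stand-in for the real module; the numeric values are made up.
fake_machine = SimpleNamespace(
    PWRON_RESET=1, HARD_RESET=2, WDT_RESET=3, reset_cause=lambda: 3
)
cause_line = "reset cause: " + reset_cause_name(fake_machine)
```

On the cam, passing the real `machine` module in place of `fake_machine` would give the line to log. If a port assigns the same value to two constants, the first name in the tuple wins.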
16.3.2.5. A complete starter
A main.py that pulls watchdog, logging setup, boot-time
diagnostics, and the wrapper together looks like:
import logging
from machine import WDT
from app.logging_setup import setup_logging, log_boot_diagnostics

setup_logging('/sdcard/logs/app.log')
log_boot_diagnostics()
log = logging.getLogger(__name__)

wdt = WDT(timeout=10_000)  # milliseconds

while True:
    wdt.feed()
    try:
        step()
    except Exception:
        log.exception("loop iteration failed")
step() is the application’s per-iteration work; the rest
of this scaffold does not change between products. Hardening
is one watchdog, one wrapper, and a logged boot every cold
start – not much code, and the difference between a cam that
recovers on its own and one that needs a service call.