Sometimes Your Device Is Alive But Is Actually Dead

A hardware watchdog timer is a standard mechanism in embedded systems. The idea is simple: a countdown timer resets the microcontroller unless the firmware restarts the countdown (often called kicking the watchdog) before it reaches zero. If the firmware gets stuck in an infinite loop or crashes, the timer runs out and the device reboots. When it comes back up, it starts from a known state and will (hopefully) work again.
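To make the mechanism concrete, here is a minimal sketch of a watchdog as a toy simulation in Python. It is not code for any real microcontroller: the class name, the 8-second timeout and the explicit `now_s` timestamps are all assumptions made for illustration.

```python
class SimulatedWatchdog:
    """Toy model of a hardware watchdog; time is passed in explicitly."""

    def __init__(self, timeout_s, now_s=0.0):
        self.timeout_s = timeout_s
        self.deadline_s = now_s + timeout_s

    def kick(self, now_s):
        """Called by healthy firmware: restart the countdown."""
        self.deadline_s = now_s + self.timeout_s

    def expired(self, now_s):
        """True once the countdown has run out and the MCU would be reset."""
        return now_s >= self.deadline_s


wdt = SimulatedWatchdog(timeout_s=8.0)       # hypothetical 8 s timeout
wdt.kick(now_s=5.0)                          # main loop is still running
alive_at_10 = wdt.expired(now_s=10.0)        # kicked 5 s ago
stuck_at_20 = wdt.expired(now_s=20.0)        # no kick for 15 s
```

On real hardware the countdown runs in a peripheral that the firmware cannot stall, which is what makes the reset reliable even when the code is stuck.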

A hardware watchdog is great, but on its own it is often not enough.

The problem is that the hardware watchdog only checks whether the firmware is alive. It does not know whether the firmware is doing anything useful. This makes it possible for a device to end up in a state where it is technically alive but dead to the outside world.

Watchdogs At More Than One Level

A simple solution is to define software-level watchdogs that understand the semantics of the system, at multiple levels. They ask not only “is the CPU running?” but also “has a measurement been sent in the last hour?”, “is the wireless stack alive?” and “has any data actually reached the cloud?”

We can call this structure a multi-layered watchdog:

Layer 1 – Hardware watchdog: Is the CPU running?
Layer 2 – Software watchdogs: Is the firmware doing its job?
Layer 3 – Cloud watchdog: Is the system observable from the outside?

Each layer catches failure modes the others cannot see. They all react to failure in the same way, though: reboot the device and start from scratch. This may bother users, at least those who catch the device in the act, but it will bother them far less than a device that appears to be dead.

The multi-layered watchdog is a simple design principle: just have a watchdog at every layer of the system.

Defining the Liveliness Criterion

For each watchdog layer to work, it needs a liveliness criterion: a testable condition that defines what “alive” means at that level. Without one, the watchdog has nothing to watch for.

The hardware watchdog has its liveliness criterion built in: the firmware must kick the timer within a fixed interval. That part is simple.

The software watchdogs (there may be more than one) need criteria that are specific to what the system is supposed to do. For a sensor that reports temperature every 15 minutes, a reasonable criterion might be: has a measurement been sent in the last 30 minutes? If not, something is wrong. Reboot.
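The temperature sensor example could be sketched roughly as follows. This is an illustration in Python rather than real firmware, and the class, its method names and the explicit timestamps are assumptions, not part of any particular SDK.

```python
class SoftwareWatchdog:
    """Tracks one liveliness criterion: 'X succeeded within max_silence_s'."""

    def __init__(self, max_silence_s, now_s=0):
        self.max_silence_s = max_silence_s
        self.last_success_s = now_s      # boot counts as the starting point

    def report_success(self, now_s):
        """Call only after the measurement has really been sent."""
        self.last_success_s = now_s

    def should_reboot(self, now_s):
        """True when the criterion has been violated."""
        return now_s - self.last_success_s > self.max_silence_s


MINUTE = 60
sent_watchdog = SoftwareWatchdog(max_silence_s=30 * MINUTE)
sent_watchdog.report_success(now_s=15 * MINUTE)                # normal send
ok_at_40 = sent_watchdog.should_reboot(now_s=40 * MINUTE)      # 25 min silent
stuck_at_50 = sent_watchdog.should_reboot(now_s=50 * MINUTE)   # 35 min silent
```

The important design choice is where `report_success` is called: after the send has completed, not when the measurement is merely queued. Otherwise the watchdog only proves that the code is looping, which the hardware watchdog already covers.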

The cloud watchdog needs a criterion from the outside looking in: has the cloud received data from this device within the expected window? A device that sends data every 15 minutes but has been silent for two hours has failed its liveliness criterion, regardless of what the device itself believes.

The key is to make the criterion specific enough to catch real failures, but not so tight that normal variation in timing triggers false alarms. A device that sends every 15 minutes should have a cloud liveliness criterion of an hour or more – enough to tolerate a few missed transmissions without crying wolf.
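One way to turn "tolerate a few missed transmissions" into a number is sketched below. The function names, the `jitter_s` margin and the choice of three tolerated misses are assumptions for illustration; the right values depend on the system.

```python
def cloud_window_s(send_interval_s, tolerated_misses, jitter_s=0):
    """Longest silence the cloud accepts before declaring a device dead.

    If k transmissions in a row are lost, the next one is due
    (k + 1) intervals after the last one that arrived.
    """
    return send_interval_s * (tolerated_misses + 1) + jitter_s


def is_silent(last_seen_s, now_s, window_s):
    """True if nothing has arrived from the device within the window."""
    return now_s - last_seen_s > window_s


# A device that sends every 15 minutes, tolerating 3 missed transmissions.
window = cloud_window_s(send_interval_s=15 * 60, tolerated_misses=3)
on_time = is_silent(last_seen_s=0, now_s=60 * 60, window_s=window)
two_hours = is_silent(last_seen_s=0, now_s=2 * 60 * 60, window_s=window)
```

With these numbers the window comes out at 3600 seconds, the one hour mentioned above. A small `jitter_s` margin helps when transmissions are not perfectly periodic.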

The Cloud Watchdog

The cloud watchdog is special because the trigger does not happen inside the device itself. The trigger happens in the cloud, but the reboot must happen on the device.

The cloud checks its liveliness criterion for each device, and if a device does not meet it, the cloud sends that device a reboot command. Delivery is not guaranteed, so we can only hope that the reboot command reaches the device.
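The check described above could look something like this on the cloud side. It is a sketch under assumptions: the device IDs, the `last_seen_by_device` dictionary and the `send_reboot` callback are hypothetical stand-ins for whatever database and downlink channel the real system uses.

```python
def devices_to_reboot(last_seen_by_device, now_s, window_s):
    """Return IDs of devices that have been silent longer than window_s."""
    return sorted(
        device_id
        for device_id, last_seen_s in last_seen_by_device.items()
        if now_s - last_seen_s > window_s
    )


def run_cloud_watchdog(last_seen_by_device, now_s, window_s, send_reboot):
    """Send a reboot command to every silent device; return the targets."""
    targets = devices_to_reboot(last_seen_by_device, now_s, window_s)
    for device_id in targets:
        send_reboot(device_id)    # best effort: delivery is not guaranteed
    return targets


# Timestamps in seconds; the device IDs are made up for the example.
last_seen = {"sensor-a": 7000, "sensor-b": 1000}
sent = []
targets = run_cloud_watchdog(
    last_seen, now_s=7200, window_s=3600, send_reboot=sent.append
)
```

Here only `sensor-b` has been silent for longer than the window, so it is the only device that gets a reboot command.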

The device may not know it is dead. From its own perspective, everything is fine: the firmware is running, the software watchdog is not detecting any issues. So while the device has no reason to reboot, it must act on reboot commands from the cloud unconditionally, even when it believes it is healthy.

It is a good idea to build in such a remote reboot command from the start. It is easy to add when the system is young but may be painful to retrofit once devices are in the field.

It is also a good idea to protect against too many or too frequent reboots of this kind. The cloud software itself may be faulty and start sending reboot commands it should not, so we want to guard against that too.
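One possible guard is a reboot budget on the device: accept at most a fixed number of remote reboots per time period and ignore the rest. The sketch below assumes a limit of three per 24 hours; the class name and the limits are illustrative, not a recommendation.

```python
class RebootLimiter:
    """Allow at most max_reboots remote reboots within any period_s window."""

    def __init__(self, max_reboots, period_s, history=None):
        self.max_reboots = max_reboots
        self.period_s = period_s
        # Timestamps of recent remote reboots. On a real device this list
        # would have to live in storage that survives a reboot (assumption).
        self.history = list(history or [])

    def allow(self, now_s):
        """Record and permit a reboot, or refuse it if the budget is spent."""
        self.history = [t for t in self.history if now_s - t < self.period_s]
        if len(self.history) >= self.max_reboots:
            return False
        self.history.append(now_s)
        return True


limiter = RebootLimiter(max_reboots=3, period_s=24 * 3600)
# Four reboot commands arrive one hour apart, then one more a day later.
decisions = [limiter.allow(now_s=t) for t in (0, 3600, 7200, 10800)]
later = limiter.allow(now_s=24 * 3600)
```

The first three commands are honored and the fourth is refused. By the next day the oldest entry has aged out, so a new command is accepted again. The same kind of budget could also be kept on the cloud side, per device, before a command is ever sent.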

Conclusion

A device that is alive but not useful is not a working device. Hardware watchdogs are a standard feature, but they are often not enough. Multi-layered watchdogs are a simple design principle that keeps devices working even in the face of harsh conditions in the field.