Unplugging the Debugger: Live and post-mortem debugging in a remote system

Main Track

When building an embedded system, there comes a time when you must unplug your debugger and seal the device in a box. Unfortunately, things can still go wrong after this point! The Hubris embedded OS has put a strong emphasis on debuggability, even to the point of writing Humility, an in-house debugger. However, Humility has historically assumed a physical connection to a running microcontroller. As we begin sealing up boxes, we’ve had to develop a range of tricks for mostly-seamless debugging of a network-connected system. Strategies range from forging inter-task IPC messages and calling arbitrary functions over the network, to (safely) reading and decoding memory from the running system, and even using a secondary microcontroller to take a full memory image of the main system for post-mortem debugging!

This talk is a whirlwind tour of adapting a system from physically connected to networked debuggability, without breaking existing user workflows and tools. It focuses on the Hubris embedded OS and Humility debugger, and is specifically motivated by their use in the Service Processor (SP) for the Oxide Computer Company server rack. Like a baseboard management controller, the SP is responsible for low-level system control, e.g. thermal management. It also provides visibility into other hardware on the board – for example, if the host CPU isn't responding, the SP can connect to its serial console for debugging.

Debugging the SP – and everything that it’s connected to – requires both specialized and general-purpose tooling. Some use cases are known in advance; for example, Humility includes a purpose-built command to show a live graph of temperature readings. Other use cases have to be improvised on the fly, e.g. restarting the system thousands of times and reading back register values to track down a rare bug. With the right abstractions, everything from dashboards and hardware interfaces to ad-hoc experiments can work seamlessly in both attached and networked configurations.
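
As a rough illustration of what "the right abstractions" could look like, the sketch below hides the transport behind a single trait, so a tool written once works against either an attached probe or a network link. Every name here (`Target`, `AttachedProbe`, `NetworkLink`, `read_u32_le`) is hypothetical and chosen for this example; none of it is taken from Humility's or Hubris's actual API, and both transports are simulated with an in-memory byte array.

```rust
// Hypothetical sketch: these names are illustrative, not Humility's real API.

/// Errors a transport might report when reading target memory.
#[derive(Debug, PartialEq)]
pub enum AccessError {
    OutOfRange,
}

/// The one operation a tool needs, independent of how the target is reached.
pub trait Target {
    fn read_bytes(&mut self, addr: u32, buf: &mut [u8]) -> Result<(), AccessError>;
}

/// Stand-in for a physically attached debug probe (simulated RAM).
pub struct AttachedProbe {
    pub base: u32,
    pub ram: Vec<u8>,
}

/// Stand-in for a network connection. A real one would send a request
/// and wait for a reply; here we only count the round trips.
pub struct NetworkLink {
    pub base: u32,
    pub ram: Vec<u8>,
    pub requests: usize,
}

/// Shared bounds-checked copy out of the simulated RAM.
fn copy_from(base: u32, ram: &[u8], addr: u32, buf: &mut [u8]) -> Result<(), AccessError> {
    let start = addr.checked_sub(base).ok_or(AccessError::OutOfRange)? as usize;
    let end = start.checked_add(buf.len()).ok_or(AccessError::OutOfRange)?;
    let src = ram.get(start..end).ok_or(AccessError::OutOfRange)?;
    buf.copy_from_slice(src);
    Ok(())
}

impl Target for AttachedProbe {
    fn read_bytes(&mut self, addr: u32, buf: &mut [u8]) -> Result<(), AccessError> {
        copy_from(self.base, &self.ram, addr, buf)
    }
}

impl Target for NetworkLink {
    fn read_bytes(&mut self, addr: u32, buf: &mut [u8]) -> Result<(), AccessError> {
        self.requests += 1; // one simulated request/response exchange
        copy_from(self.base, &self.ram, addr, buf)
    }
}

/// A "tool" written once against the trait: read a little-endian u32.
pub fn read_u32_le<T: Target>(target: &mut T, addr: u32) -> Result<u32, AccessError> {
    let mut raw = [0u8; 4];
    target.read_bytes(addr, &mut raw)?;
    Ok(u32::from_le_bytes(raw))
}
```

Because `read_u32_le` depends only on the trait, the same call works with either transport; only the construction of the target differs between the attached and networked cases.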