Keeping humans in the loop
When designing complex technical systems, you should ask yourself: “How does the human operator fit into the picture?”
Duct Tape and Baling Wire
There’s a sort of in-joke in programming that all software is broken. Bugs abound, and errors are constant. By all rights, the complex ball of mud we’ve built should not exist; it should come crashing down all around us. And yet the world does not come crashing down. The sun rises, and reddit has new videos of cats falling over. I believe the reason things stay up is that we keep humans in the loop.
Operating Complex Systems
Operating a complex system is fucking hard. Most of the time things work perfectly: the hallmark of a well-made system is that it works well for what it was designed to do. That said, any piece of software that gets significant adoption is also going to get abused. Sometimes that “abuse” comes from people working at the same company as the people writing the software! If the software you’re writing is a tool, then sometimes people are going to use it like a hammer, even though you intended it to work like a saw. This is a good thing: it’s a mark of success that people are finding novel ways to use the thing. It’s also a bad thing: there’s a mismatch between the scenarios the software was written to support and the context it’s actually being used in. Eventually, things might come crashing down. In that moment, the human operator needs to enter the loop.
When software breaks, there’s an intended outcome: the user wanted some result, and they aren’t getting it. Note that in this case, “user” could just as well be another piece of software. We need humans to mediate between the software and its users. Only humans can hold in their heads two abstract ideas at once: how the software actually works, and what the users expect of it. By synthesizing the two, they can find a way to work within the limits of the software to get the result the user wants. This usually takes the form of some corrective action, and that action often works by producing effects the software never anticipated, nudging it into the “right” state.
Putting this into practice
The fact of the matter is that we’re all doing this every day, constantly, and we don’t even realize it. Every time you ssh into the production database and do some data manipulation by hand, you’re the human entering the loop. The software doesn’t have a control to make the change you need to make, and so you edit the underlying state by hand.
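For a flavor of what that hand edit often looks like, here’s a minimal sketch, assuming a Postgres database reached through psycopg2. The orders table, the "stuck" status, and the order id are all made up for illustration.

```python
# A one-off fix run by hand against the production database.
# Everything here (table, columns, status values, id) is hypothetical;
# the point is that the operator is editing state the application
# exposes no control for.
import psycopg2

conn = psycopg2.connect("dbname=app host=db.internal user=oncall")
try:
    with conn, conn.cursor() as cur:
        # Nudge a stuck order back into a state the software knows how
        # to process, then let the normal machinery take over.
        cur.execute(
            "UPDATE orders SET status = 'pending_retry' "
            "WHERE id = %s AND status = 'stuck'",
            (12345,),
        )
        print(f"rows updated: {cur.rowcount}")
finally:
    conn.close()
```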
I once worked at a place that heavily used RabbitMQ. Lots of different systems listened on queues for work to perform, and there was an admin interface where humans could manually put messages into those queues. This was incredibly useful: if a message ever errored, it could be retried by re-entering it by hand through the admin interface. Over time we saw a pattern emerge in the message formats themselves. Originally, the messages were very imperative: a message would cause some destructive action to happen in the system, and if it failed halfway through, it was on the human operator to finish the job. Gradually, we tended towards idempotent messages. These were messages that said “hey, there’s probably some work for you to do over here”, without explicitly saying what that work was. These messages were great for operators because they were always safe: if there wasn’t any work to be done, the system did nothing. You could spam the message all day and nothing problematic would happen.
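To make the distinction concrete, here’s a minimal sketch of the two message styles, assuming JSON bodies. The invoice names and the in-memory “ledger” standing in for the real system are hypothetical.

```python
import json

# Stand-in for real application state: invoice_id -> settled flag.
LEDGER = {42: {"settled": False}}

# Imperative style: the message carries the destructive action itself.
# If the consumer dies halfway through, a human finishes the job.
imperative_msg = json.dumps({"action": "settle_invoice", "invoice_id": 42})

# Idempotent style: the message only says "there may be work over here".
idempotent_msg = json.dumps({"action": "check_invoice", "invoice_id": 42})

def handle_check_invoice(body: str) -> None:
    msg = json.loads(body)
    invoice = LEDGER.get(msg["invoice_id"])
    if invoice is None or invoice["settled"]:
        return  # nothing to do; replaying the message is harmless
    invoice["settled"] = True  # converge toward the desired state

# An operator can paste the same message into the admin interface as
# many times as they like and nothing problematic happens.
for _ in range(3):
    handle_check_invoice(idempotent_msg)
print(LEDGER)  # {42: {'settled': True}}
```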
A third example is a strategy I see at my current employer. We have a lot of feature flags that we dub “control rods”, a name we stole from nuclear reactors, where giant rods are used to control the balance of reactivity and power output. A control rod, in our lingo, is a feature flag that can adjust the runtime behavior of a component in some way. Some are as simple as completely turning a feature off in extreme emergencies. Others are a bit more subtle, such as having a minute-by-minute cron do less work each run, load-shedding so that we can keep operating in a degraded state while we fix the system. Finally, others are even more subtle than that, such as ones that control the length of an in-memory queue. These have complicated trade-offs, and we tend to treat them with a bit more respect and caution.
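Here’s a minimal sketch of how a control rod might look in code, assuming a flag store you can flip at runtime. The flag names, the plain dict standing in for a real feature-flag service, and the sync job itself are all hypothetical.

```python
# A dict standing in for whatever feature-flag service actually backs
# these flags; in practice the values change at runtime without a deploy.
FLAGS = {
    "sync_job.enabled": True,     # emergency kill switch
    "sync_job.batch_size": 500,   # how much work each cron tick takes on
}

def get_flag(name, default):
    return FLAGS.get(name, default)

def process(item):
    pass  # placeholder for the real per-item work

def run_sync_tick(pending_items):
    """Runs once a minute; the flags let an operator shed load mid-incident."""
    if not get_flag("sync_job.enabled", True):
        return pending_items  # extreme emergency: do nothing at all

    batch_size = get_flag("sync_job.batch_size", 500)
    batch, rest = pending_items[:batch_size], pending_items[batch_size:]
    for item in batch:
        process(item)
    return rest  # leftover work waits for the next tick

# During an incident, an operator dials the rod down rather than
# turning the whole feature off:
FLAGS["sync_job.batch_size"] = 50
```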
There are lots of ways to have humans enter the loop: I’m sure you can think of recent times when you’ve had to do so yourself. But I think great software systems are the ones designed with those humans in mind. Using commodity databases means humans can update the underlying state of your application. The message formats between your components matter not just because distributed systems are complicated, but also because humans are going to be writing those messages someday. Finally, making your software configurable at runtime can turn an outage into a mere degradation of service.