Kafka: Ideas about Events & Commands
I’ve been thinking about patterns for handling events & commands when you’re communicating with an event broker like Kafka.
I suppose I shouldn’t lock myself into a specific technology, but Kafka imposes certain constraints that I think are worth exploring.
For this, I’ll assume a baseline level of knowledge of Kafka, which you can get from watching these videos.
The scenario
To give the discussion some context, I’m going to make up a fake scenario to ground things.
Let’s sayyyyyy I’m building a GitHub-style pull-request system. Users can open up diffs for review, set titles, approve diffs, check on the results of CI, and finally merge the PR if everything is working well. My hope is that most readers will have some basic familiarity with this process.
Events vs Commands
I think it’s important to distinguish between events & commands. It’s something I get asked about a bit, and it can be a little confusing at first!
Commands
A Command is a description of something you want to happen within a system. For example, we could make a MergePullRequest command that looks like this:
[MergePullRequest]
pull_request_id=10
merged_by_user="Nate Lincoln"
timestamp="yyyy-mm-dd"
commit_message_override="New blog post about kafka!"
What’s vitally important here is that commands can fail. I can want my PR to be merged all day long, but if I don’t fix my linting errors then it ain’t gonna happen. However, commands can fail for other reasons too. If there’s a filesystem issue when git is merging the branch, then the command will fail through no fault of my own.
(ideas time) I think that Commands have both synchronous and asynchronous components. First, there is the synchronous validation phase, where we check that the command is something that we want to happen. If that succeeds, we return a message that we intend to make the change, then perform the actions to change the system to match what the command wants.
This is actually why I’m dubious about having command processors consume directly from a Kafka topic. With Kafka, it’s difficult to reply with a success or failure code, which means we need some other mechanism to return the status to the command issuer.
This problem could probably be alleviated if we separated our system into two components: a synchronous command validation layer, and an asynchronous command processing layer. The validation layer would know enough about the domain to say whether a command could succeed or fail, and do so synchronously, and then hand off the work to the command processor, which assumes that it’s allowed to process the command. At this point the possibility of failure is smaller, and many errors can be recovered from.
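To make that split concrete, here’s a minimal sketch in Python using the confluent-kafka client. The pull-request-commands topic, the PullRequest model, and the load_pull_request lookup are all made up for illustration:

import json
from dataclasses import dataclass
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

@dataclass
class PullRequest:
    # Hypothetical domain model, just enough for the sketch.
    id: int
    ci_status: str

def load_pull_request(pr_id: int):
    # Stand-in for a real lookup against the PR store.
    return PullRequest(id=pr_id, ci_status="passing")

def merge_pull_request(command: dict) -> dict:
    # Synchronous validation layer: reject commands that can't succeed.
    pr = load_pull_request(command["pull_request_id"])
    if pr is None:
        return {"accepted": False, "reason": "unknown pull request"}
    if pr.ci_status != "passing":
        return {"accepted": False, "reason": "CI has not passed"}
    # Validation passed: hand off to the async command processor via Kafka.
    # The processor consuming this topic can assume the command is allowed;
    # the remaining failure modes (disk errors, etc.) are rarer and often
    # retryable.
    producer.produce(
        "pull-request-commands",
        key=str(command["pull_request_id"]),
        value=json.dumps({"type": "MergePullRequest", **command}),
    )
    producer.flush()
    return {"accepted": True}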
That’s also a lot of complexity! If the command were simpler, like ChangePullRequestTitle, then we would probably combine command validation & processing into a single atomic step.
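Here’s a rough sketch of what that single-step shape could look like, with a hypothetical in-memory store standing in for the real database:

titles: dict[int, str] = {}  # hypothetical store: pull_request_id -> title

def change_pull_request_title(pr_id: int, new_title: str) -> dict:
    # Validation and processing happen together, in one atomic step.
    if not new_title.strip():
        return {"accepted": False, "reason": "title cannot be empty"}
    titles[pr_id] = new_title
    return {"accepted": True}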
Honestly there are still more things I’d like to think about here (like the availability of the command validation side), but I’ve talked about Commands way too much. Let’s move on to other topics.
Events
An event is a description of something that happened within the system. For example, our MergePullRequest command could cause the following events:
[PullRequestStatusChanged]
pull_request_id=10
status="merged"
[CommitAddedToBranch]
branch_name="main"
commit_sha="sha"
The key difference is that there is no notion of failure for these events! You cannot look at the PullRequestStatusChanged event and say “nah, not gonna happen”. It’s too late for that! It’s already said and done; the state of the system is irrevocably changed. If you want, you could issue a corrective command like RevertPullRequest, but then you’re back into failure modes.
This “no failures” rule might seem a bit restrictive, but in practice it doesn’t matter. If an event was emitted about the state of the system changing, then it’s likely that you intended for that to happen. A good way to think about this, for developers who are used to CRUD-like application development: you don’t bother validating that the data you pulled from the database is correct, right? You assume that when you inserted the data, you did the proper validations to make sure it wouldn’t cause issues for the rest of the system. It’s the same way with events: if an event is emitted, then the rest of the system should be able to handle it.
This, of course, opens up the can of worms of “what if my validation requirements change over time?”. I won’t delve into that here - I don’t think the solution is complicated enough to warrant much discussion. If you’d like to chat about it, email me or something.
Considerations with Kafka
Kafka is a combination of a message broker & a database. This means that while we get the benefits of both, we also get some of the drawbacks of both.
The thing that sticks out most in my head is topic deletion and compaction. If we want to keep our topics from growing unbounded, then we need to have a strategy for deleting the data in them.
For commands, the solution is pretty obvious: we really don’t care that much about historically-issued commands, so we can set a retention period of, say, 1 day on the topics. If we need historical data for auditing or on-call reasons, then we can kafka-connect it into a data lakehouse. It’ll be much more convenient to work with anyways.
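As a sketch, here’s what setting that retention could look like with confluent-kafka’s admin client (the topic name and sizing are assumptions):

from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})

# Commands are only interesting briefly, so cap retention at one day.
admin.create_topics([
    NewTopic(
        "pull-request-commands",
        num_partitions=6,
        replication_factor=3,
        config={"retention.ms": str(24 * 60 * 60 * 1000)},  # 1 day
    )
])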
For events, however, there are other considerations. Consumers of event topics typically want to build up some kind of local state about the rest of the system around them. For example, we could have a consumer on PullRequestStatusChanged events that builds up a list of the current status of all the pull-requests in the system. Or one that sends emails when PRs are merged. How do we keep the PullRequestStatusChanged topic from becoming a multi-terabyte monstrosity that takes new consumers several days to consume, and requires a dedicated set of brokers? I’m being hyperbolic, but these are real considerations you need to keep in mind.
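To make the “local state” idea concrete, here’s a minimal consumer sketch that folds PullRequestStatusChanged events into a dictionary of current statuses (the topic and group names are made up):

import json
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "pr-status-view",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["pull-request-status-changed"])

statuses: dict[int, str] = {}  # pull_request_id -> latest known status

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())
    statuses[event["pull_request_id"]] = event["status"]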
First off, we need to think about what use-cases we have for the data. If we’re sending emails based on the events, then we really don’t care about events from a long time ago. So we could theoretically set a retention period for the data.
For the “get the status of all pull-requests in the system” problem, it’s a bit more difficult. We want to use topic compaction to keep only the latest event per PR. This only works if the topic exclusively contains PullRequestStatusChanged events. If there’s a combination of multiple types of events, then you need to somehow keep the latest of each event type, which gets complicated quickly.
The fix here, I think, is to realize that those consumers don’t really care about PullRequestStatusChanged events, they care about the status of the pull-request (stay with me here). What if, instead, we output events that contained the entire PullRequest object, status and all? That means we can compact all we want, because the history becomes unimportant to the consumers.
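A sketch of that snapshot pattern: create a compacted topic and key every record by PR id, so compaction keeps exactly one latest snapshot per PR (names are assumptions, as before):

import json
from confluent_kafka import Producer
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})
admin.create_topics([
    NewTopic(
        "pull-requests",
        num_partitions=6,
        replication_factor=3,
        config={"cleanup.policy": "compact"},  # keep the latest record per key
    )
])

producer = Producer({"bootstrap.servers": "localhost:9092"})

def publish_snapshot(pr: dict) -> None:
    # Keying by PR id means compaction retains only the newest snapshot.
    producer.produce("pull-requests", key=str(pr["id"]), value=json.dumps(pr))
    producer.flush()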
These two patterns can coexist, of course. There’s no rule that says you can’t put data into multiple topics, after all :). You could have a PullRequestEvents topic that has a retention period set, and a PullRequests topic that is compacted.
If you want to get really fancy, you could even have a kafka-streams app that consumes from PullRequestEvents and produces to PullRequests, but I wouldn’t recommend it unless you know what you’re doing.
Conclusion
Honestly, I really just wanted to put down some thoughts about Kafka, informed by my limited usage of it. I’d really advise against taking anything I say as gospel: I’m by no means an expert, I’m just some random person on the internet who can afford the $12/yr for my domain.
If you like this post, consider giving some money to an organization that is helping people with whatever climate-change-related crisis we’re currently going through :).