Application File Formats

Posted on Oct 5, 2019

On Application File Formats

So a few weeks ago, I was talking to a friend of mine. They were talking about the D&D campaign they were starting in a few weeks, and all the pain they had to go through to DM. Keeping track of all the NPCs, generating encounters, choosing loot, etc. They wanted to build a small desktop app that could handle all of these concerns, allowing them to focus on worldbuilding and storytelling. They had a lot of concerns figured out. Python for the language, and PyQT for the UI. Deployment didn’t matter, because all of this would run on their laptop. After they got to work, however, they quickly ran into a problem. How do you manage saving the data to disk? Keeping everything in memory at once became very cumbersome, especially as the size of the data grew. More and more time was being spent writing a huge json file to disk.

With all that said, I couldn’t help but feel that it would have been easier for them if they had used some database like SQLite. With SQLite, they could have avoided having all of the data in memory at once. It would have been easier to save data, because SQLite would have handled it. Perhaps most of all, SQLite would have made it easy to do certain operations one would need in a D&D campain. Need to get all the NPCs who live in a given town? SQL’s got you covered.

The whole situation has me thinking about what I like to call application file formats. Inevitably, every app has to persist some kind of information somehow. Why is SQLite such a popular choice? Can we do better?

Why do people use SQLite?

Advantage over memdumping

When I first talked about this blogpost with some friends of mine, one question they had was: what is different about this from just dumping a bunch of JSON to disk when your app shuts down, and then reloading it when your app starts again? There are a few disadvantages that come to mind.

Amount of memory

If you’re going to keep all of your application state in memory, then there’s no way to limit your applications memory usage down as your app grows. Every piece of information, no matter how inconsequential, must be loaded from disk, parsed into some in-memory representation, and then never touched again until it’s time to quit the app, at which point you must save every bit of information back to disk.

Looking at this from a D&D perspective, it’d be like loading every NPC in the entire game, just to find out which characters live in a given town.

I/O efficiency

With the load-and-save-everything-at-once approach, not only do you have to spend a great amount of time just reading and writing from the disk, there is also a lot of time spent transforming those on-disk structures into their in-memory counterparts. This means that every integer must be transformed from a textual representation to a base-2 representation, but also that every string must be checked for proper utf8 encoding. With SQLite, all of these things are handled for you. Maybe it doesn’t completely remove the cost of parsing & instantiating numbers or utf8 strings, but it at least makes it so you don’t have to care.

Being able to query data

I think that this is the most compelling reason to go with a database file format, as opposed to something handmade. Efficient queries are available out of the box! Need to quickly find out what the name of that one tavern owner is? Want to know if a player is telling the truth about how much gold they’ve acquired? All these queries and more are at your fingertips! And I’d like to emphasize that these queries can be answered without parsing and processing an entire array of JSON.

A huge part of this is indexing. With a database like SQLite, the process of looking up an invididual record by ID is almost instantaneous. Other constraints, such as foreign keys and uniqueness, are also trivial to implement with SQLite, and are incredibly fast.

Perhaps the most important feature of SQLite, however, is it’s resiliency. Lose power in the middle of an intense gaming session? No worries, everything has been saved to the database.

Advantage of a single file

So there are a lot of compelling reasons to use a database to store application state. SQLite has an interesting restriction, however: it requires the entire database to reside in a single file. “Industrial-grade” databases like PostgreSQL or MySQL take over entire directories on the filesystem. From the perspective of the database, this is actually much simpler! Instead of having to interleave data for multiple tables together in some fashion, each table can have its own file. So what benefits does this approach have, that SQLite would accept the constraint?

First of all, it is incredible easy to implement a backup system on top of a single database file. The backup script could even be as simple as cp data.db backup-$(date -Iseconds).db, which copies the data in data.db to a new file named backup- and the current time. If you want to get really fancy, you put the database file inside of git, allowing you to attach messages to individual backups.

A second benefit is debugging. While this might not be applicable to a highly personalized software like my friends D&D database, it isn’t hard to imagine a scenario where they had distributed the software to multiple people. If someone has a bug, it is much easier to say, “send me the data.db file in this folder”, as opposed to having to zip up an entire directory of data and send it as an email attachment.

Can we do better?

So I’ve spent a lot of time discussing the benefits of SQLite. I truly believe that, for most desktop applications, it is the Way To Go for state. With that being said, there’s a lot of complexity with SQLite. It requires you to learn some additional skills you might not have otherwise. Why bother with ORM models, SQL schemas, and constraints, when I can just… dump things to some JSON? There is even a benefit to this approach: it’s far easier to read the saved file by hand.

There must be some middle ground. Can we keep the efficient storage, fast lookup, quick updates, and have an API that developers will prefer over dumping JSON?

Removing SQL

I mean, it makes a good start, doesn’t it? Plus, it’s a pretty huge differentiator from SQLite, seeing as SQL is in the name and all.

In all seriousness, I’m not convinced that SQL gives us much benefit for the use cases we’re concerned with. Most of the operations we would want to perform are things like “find the entity with this ID” or “find all the NPCs that live in this town”. Maybe some aggregate operations like summing up some values. While SQL does offer a great way to express these queries, it can also be a bit much for some applications. Sometimes as a developer it’s easier to write a for loop than it is to write a SQL query.

Other improvements

My ideal for this would be something like the following: An API that allows for some kind of structured data to be stored in a single file. It should support extremely fast lookup of individual entities, as well the ability to filter down entities by some simple constraints. Of course, it should be possible to store multiple types of entities in the file. Indexes should work easily out of the box. Updates & Deletes should also be simple.

What would be Very Cool is if you could layer a more expressive API on top! Aggregate functions, sorting, etc can all be expressed via the above API.

This blog post is mostly a brain dump of something that I’ve been thinking about for awhile now. Frankly, I don’t have the time to actually work on something of this magnitude at the moment. I just feel that there are a lot of people who struggle with storing things on disk, and I want to make something that makes it easy. In short, I want to empower any developer to write database applications, by only abstracting over the very lowest-level concerns, and allowing the application programmer to handle the rest.