A Very Relatable Database

Posted on Jan 19, 2019

Note

A lot of the things recommended in this article are pretty bad ideas, if you know anything about databases. If you want to get some better ideas, send me an email & we can chat about other databases.

If you don’t really care about how to store data on a disk, then a library like sled.rs might be for you.

The problem

I like to find ways to fill up my free time over winter breaks. Since I’m not in school for a month, I usually take the time to do some self-study on a topic I find interesting. This past month, it was building a database. Unfortunately, I had only a casual understanding of how a database actually works under the hood, and no idea how to get started. Additionally, I decided to write the database in Rust, since I knew there would be complicated management of data, and I wanted to avoid headaches involving invalid data. Finally, rust’s static garbage collector meant I wouldn’t have to rely on a runtime to guarantee this.

That was a month ago. While I’m not too far along (mostly due to being busier than expected during the break), I still have things I think are worth sharing.

What do we mean by database?

It’s important to start out by defining the goals for the project. Having an idea at the start of what the project should accomplish is a good way to start on the path to success. At the very start, I wanted to design a database with the following constraints in mind:

  • It must use SQL as its query language.
  • All of the information on the database - including the schema - must be on a single file on disk
  • Its design must be amenable to adding features like transactions.
  • It must be ACID (although this one is a stretch goal).

I took a great amount of inspiration from sqlite. Their docs on how the internals of sqlite work are simply amazing!

Writing things to the disk

The constraint having everything on one file poses some interesting problems. Let’s examine them.

We have to be careful that when we add a new row in the database, or a new table, or make any kind of change that inserts data, we don’t overwrite the data that was already there. This means that whatever layout we use for the data, it must be able to handle tables growing indefinitely in size.

So the naive database layout would look something like:

|----------------------------------|
| pre-allocated schema information |
|----------------------------------|
| pre-allocated space for table 1  |
|----------------------------------|
| pre-allocated space for table 2  |
|----------------------------------|

The problem is what happens when table 1 grows beyond our pre-allocated space? What happens when a table is small, and we waste most of the pre-allocated space? We can think of the previous solution as analogous to a Vec: A single block of memory that will copy its data when it runs out of space. While it may be a good default for in-memory storage, it runs into problems when applied to on-disk data. What we’d really like to do is allocate a reasonably small amount of space up-front for a table, but when the table outgrew that amount of space, we didn’t have to copy anything around. Instead, the new data could be placed somewhere else in the file.

Something like the following:

before:

|---------------------|
| schema              |
|---------------------|
| users table: 4kb    |
| users table: 4kb    |
| users table: 4kb    | We want to gracefully handle writing
|                     | more than 4kb of data to this table
|---------------------|
| products table: 4kb |
|                     |
|                     |
|                     |
|---------------------|

after inserting more data

|--------------------------------------------|
| schema                                     |
|--------------------------------------------|
| users table: 4kb                           |
| users table: 4kb                           |
| users table: 4kb                           |
| users table: 4kb                           |
| location of where the rest of the table is |
|--------------------------------------------|
| products table: 4kb                        |
|                                            |
|                                            |
|                                            |
|--------------------------------------------|
| users table: 4kb                           |
|                                            |
|                                            |
|                                            |
|--------------------------------------------|

Basically we want to allocate data in Blocks that form a Linked List structure on disk, where each block points to the next block in the series. If there is no next block, we use some null value.

Implementing Blocks

Now that we have a good understanding of the problem we’re trying to solve, and the way we’re going to solve it, it’s time to actually implement a solution.

The current code for this can be found on GitHub.

First off, what does a block look like? Well, we know at the moment we need two things: the actual data inside of the block, as well as the location of the next block:

#[derive(Debug)]
pub struct Block {
  data: Vec<u8>,
  next_block_location: u64
}

Hmmmm ok, but what else would be useful here? Well, if every block on the disk is a set number of bytes, how will we know where the data ends? If we didn’t know that, we could keep reading zeroes infinitely off the end of the block!

#[derive(Debug)]
pub struct Block {
  data: Vec<u8>,
  next_block_location: u64,
  size: u64,
}

One final change: there is a good chance that there might not be a next block. What do we put for next_block_location then? (spoiler: 0). We can represent this nicely with Option<u64>, however. I’m aware that NonZeroU64 now exists, but I haven’t gotten around to updating my code to use it. Finally, we’re going to extract out the metadata out to a BlockMeta.

#[derive(Debug)]
pub struct BlockMeta {
  next_block: Option<u64>,
  size: u64,
}

#[derive(Debug)]
pub struct Block {
  data: Vec<u8>,
  meta: BlockMeta
}

Reading & Writing

Rusts I/O system is really well-designed, in my opinion. There are three core traits: Read, Write, and Seek. Files implement all three of these (as you would expect), but so do Cursors, which are sort’ve like an in-memory file that is really good for testing.

For example, if you had a function that just needed to write some data to some kind of disk, you could write it as:

use std::io::{self, Write};

fn write_some_data(disk: &mut impl Write) -> io::Result<()> {
  disk.write_all(vec![0u8, 1, 2])?;
  Ok(())
}

And you could call it with either a File or Cursor:

// this might not actually work, but it nicely illustrates the principle
write_some_data(&mut File::open("filename.txt"))?;
write_some_data(&mut Cursor::new(vec![]))?;

And of course, with rust being rust, there is no runtime overhead for doing this. With that out of the way, we should be able to implement two methods for a block: persist, and from_persisted:

impl Block {
  /// Write the contents of the block onto the disk
  pub fn persist(&self, disk: &mut (impl Write + Seek)) -> io::Result<()> {
    use std::io::SeekFrom;
    disk.seek(SeekFrom::Start(self.meta.offset))?;
    // write_u64 and BigEndian come from the excellent byteorder crate
    disk.write_u64::<BigEndian>(self.meta.next_block.unwrap_or(0))?;
    disk.write_u64::<BigEndian>(self.meta.size)?;
    Ok(())
  }

  /// Read some data from the disk that was written with `persist`
  pub fn from_persisted(offset: u64, blocksize: u64, disk: &mut (impl Read + Seek)) -> io::Result<Self> {
    use std::io::SeekFrom;
    disk.seek(SeekFrom::Start(offset))?;

    let meta = {
      let next_block = disk.read_u64::<BigEndian>()?;
      let next_block = if next_block == 0 {
        None
      } else {
        Some(next_block)
      };
      let size = disk.read_u64::<BigEndian>()?;
      Ok(BlockMeta {
        next_block,
        size,
        offset,
      })
    };
    let bytes_to_read = blocksize as usize - BlockMeta::size_on_disk(); // size_on_disk is 16
    let mut buf = vec![0; bytes_to_read];
    disk.read_exact(&mut buf)?;
    Ok(Block { data: buf, meta })
  }
}

The implementation that I have on GitHub is a tiny bit different, and has been slightly modified for brevity. I know, this example doesn’t seem short, but it follows a surprisingly nice structure. First, we seek to the proper location on the disk. Then we read or write the data in the same order each time.

Closing Thoughts

We started by looking at the constraints we wanted for our database, and from that, a workable design for storing data on disk. Originally, I wanted to blog more than this, but I’ve since learned that the above is not the way you want to write a database. Still, I think it’s an interesting bit of learning, so I’m leaving it up :).