An overview of Leveled Compaction in LSM-trees

LSM-trees never overwrite existing data on disk (“differential index”). Instead incoming writes are buffered in a sorted in-memory data structure (MemTable). When a size threshold is reached, the memtable is flushed to the first level (L0) as an immutable, sorted file on disk, called a Segment (a.k.a. SSTable, SST). As more data arrives, segments build up and read performance suffers. We either need to reduce the amount of segments, or reorganize them in a way to make searching requested data easier. Reducing the amount of segments is the goal of Tiered Compaction (STCS), but today we will look at Leveled Compaction (LCS), which may be a bit harder to grasp initially.

Overview of Leveling

The basic idea of Leveling is to work with fixed-size segments and organize them to be disjoint to each other in any level that is not L0. Initially, as segments are written in L0 we cannot guarantee they do not overlap.

Once L0 grows too large, its segments can be written out in sorted order and split into fixed-size segments, making L1 disjoint. When L1 grows too large, segments will be picked to reduce it back to its intended size. If these selected segments have some overlap with segments in L2, they will be merged and written out in new segments, keeping L2 disjoint. This process is repeated for any level that grows too large.

Point reads

To read a single key in a Leveled LSM-tree, first the write buffer is checked. The write buffer consists of the active memtable and possibly multiple, temporary immutable memtables. If the key can be found in either of these memtables, the search can be terminated without ever touching the disk.

If the key is not found in memory, first the segments in L0 are checked, from most recently written to least recently written. Because the segments in L0 may overlap, multiple segments may need to be checked. While segments that do can not possibly contain the requested key (because it lies outside their key range) can be discarded immediately, any segment which could possibly contain the key needs to be checked, which may involve unnecessary disk I/O if the key is not contained in them.

Every level after L0 segments is disjoint, so only one segment in each level can be a candidate containing the searched key:

The worst-case amount of disk I/O operations in a leveled LSM-tree is L0_segments + (level_count - 1).

Consider 1’000 segments in a single non-L0 level: We know the level is disjoint, and we know what part of the keyspace each segment holds. This enables us to binary-search a segment for any requested key, effectively nullifying the impact of storing many segments on read performance:

(For a level with less than 5 segments, lsm-tree uses linear search instead of binary search)

Please note that this is a log scale, with a linear scale the magnitude of binary search’s impact becomes more obvious:

Bloom filters

Consider in the tree above, our requested key (blue line) lies in L3. We need to check all segments in L0 that may contain the key, plus one segment in L1 and L2 each, and then read our item from L3. This degrades the read performance from 1 disk I/O to 7 disk I/Os.

To reduce unnecessary disk I/O, we can build a Bloom filter for each segment. Bloom filters use little memory, but can filter out a lot of superfluous work done if no bloom filters were used. Especially in L0 and L1, bloom filters are trivially small, so very low false positive rates can be achieved without using a lot of memory. Because the level’s segments do not contain a huge amount of items, we can essentially eliminate superfluous I/O operations in those levels, which is especially great because these levels are always visited first. This is well described in Niv Dayan et. al.: Monkey: Optimal Navigable Key-Value Store, 2017.

Example: With a FPR = 0.0001%, we only get 1 false positive per 10’000 items. For a segment with 100’000 items, the bloom filter only needs 234 KiB of memory (and disk space, which is essentially negligible).

Point reads in L1 with and without bloom filters

Point reads in L1 with and without bloom filters (logarithmic)

Range reads

Range reads need to reconcile a range by reading multiple runs of segments. To do so, segments are collected, discarding all those that fall outside the requested range (“culling”). Those segments are then scanned and merged to get the final, reconciled range.

This is done by creating a segment reader over each “run”, seeked to the range start, and advancing it, and comparing all segment readers (k-way merge). The first read will incur the highest cost, as each reader’s start block needs to be loaded from disk (unless it is cached), because the minimum item needs to be found first. In the image above, this results in 8 I/O operations, highlighted using the blue arrows.

But the requested range is affecting way more than 8 segments, so why are there only 8 arrows? Due to the disjoint nature of levels L1 up to Lmax, only one segment in those levels needs to be checked initially. When that segment is fully read, the next segment reader is initialized. In the LSM-tree above, this optimization saves 5 I/O operations when initiating the shown range.

Range reads over a disjoint set of segments ("run")

Compare this to reading a disjoint set of segments the naive way:

The naive implementation of the range needs 4 I/O operations initially, decreasing performance of short ranges by 4x

In lsm-tree this is implemented in the SegmentMultiReader, which stores each level as a double-ended queue and only consumes from the first (forward iteration) or last segment (reverse iteration), discarding fully read segments, until it is completely done:

// The implementation is simplified a bit by
// omitting unnecessary details such as lifetimes.

pub struct MultiReader {
    readers: VecDeque<BoxedIterator>,
}

impl Iterator for MultiReader {
    fn next(&mut self) -> Option<Self::Item> {
        loop {
            if let Some(item) = self.readers.front_mut()?.next() {
                return Some(item);
            }
            self.readers.pop_front();
        }
    }
}

// DoubleEndedIterator::next_back works the same way, just inverted

Scanning monotonic data

The above optimization can be further improved if the entire tree is disjoint. In that case, we do not need to create a SegmentMultiReader per level, but only a single one for the entire tree and enqueue each level. This only works for monotonic key inserts (e.g. time series data), but will reduce the initialization cost of the tree from N (number of non-vacant levels) to 1:

The tree is disjoint: no segment overlaps with another in any level

The resulting segment read order is fully ordered, so only a single block read is needed to initialize the range read. This reduces the above tree’s initialization cost down from 4 to 1 I/O.

While this initialization cost reduction is negligible for a large OLAP-style range scan, it can be valuable in a scenario in which a limited range needs to be read, e.g. in a FIFO queue, or a small slice in a time series data set.

Reading 100 latest data points in a time series data set (no cache) - greatly reduced read latency as initialization cost became O(1)

Range filters

Setting up a range read over all affected segments is expensive, and be undesirable for short range scans. As with point reads, in-memory filters may be used to avoid unnecessary disk seeks, especially in L0, where checking most, if not all, segments is unavoidable. There are a bunch of proposed range filters, such as SuRF, Rosetta, Grafite, GRF, REMIX and others. Some range filters may also double as a point read filter, replacing the default bloom filter implementation, which may save some memory. These filters are called PRF (Point-read-Range-Filter).

Range filters are currently not implemented in lsm-tree

See https://github.com/fjall-rs/lsm-tree/issues/46

Configuring Leveling

L0 compaction trigger

Because segments in L0 may always span the entire keyspace, they may have to be merged with every segment in L1, causing high write amplification and preventing parallel compactions in those levels. On the other hand, waiting for more segments to arrive in L0 will decrease read performance as segments in L0 are generally not disjoint.

We can control how aggressive to compact into L1 by configuring how many segments to amass in L0, allowing to adjust between better read performance or lower write amplification.

Intra-L0 compaction

If L1 is blocked by a running compaction, L0 will be prevented from being cleaned up. To avoid L0 growing infinitely, write stalling may occur to try and balance out the tree. This hurts write performance, in order to keep read performance acceptable. If there are many small segments in L0, it may be advantageous to consolidate them into a larger segment that is still kept in L0 (because L1 is busy). This reduces the amount of segments in L0, improving read performance.

This is called Intra-L0 compaction and is enabled by default in RocksDB, and explained in more detail in this blog post.

Another reason why intra-L0 compaction may desirable is that segments in L0 may be smaller than the configured target size. Then, if l0_compaction_trigger segments were present in L0, these small segments would be merged with the full-sized L1, resulting in high write amplification. As an example, it is more efficient to (a) merge 4 x 16 MB segments in L0, and repeating that four times, so we end up with 4 x 64 MB segments in L0 that can be merged into L1, than (b) merging those 4 x 16 MB segments into L1 (which can be ~500-600 MB in size) four times. If L1 is 512 MB, (a) writes ~1 GB of data, while (b) writes ~2.3 GB of data.

Segment size

Segments in a Leveled tree have a fixed size. Smaller segments allow more granular more compaction, especially in larger levels, where the keyspace is fragmented into a large amount of segments. Larger segments cause the tree to grow wider, amassing more data in each level, which decreases the amount of levels needed, thus increasing read performance as more levels may be vacant. Smaller segment cause each level to overflow more quickly, thus data needs to be moved down more quickly, decreasing read performance.

However, having a lot of segments comes with a caveat: unlike a B-tree, which stored in a single file, we may have hundreds or thousands of segment files! Ideally, we want to keep file descriptors open to enable faster reads by avoiding rather expensive fopen syscalls. File descriptors are kept in a global cache called the DescriptorTable (TableCache in RocksDB). If the descriptor table limit is much lower than amount of segments, file descriptors may get evicted resulting in higher read latency because fopen needs to be called more often.

Level base size

This is the size of L1. This is typically segmment_base_size * l0_compaction_trigger.

Level fanout

The level fanout has a pretty simple effect on LSM performance. Consider the tree configuration in the image above: L1 is configured to store 128 MB of data. The tree has a level fanout of 8, meaning L2 has a maximum size of 1024. When L1 grows too large, its overshoot will be merged into the next level, by picking as many L1 segments as needed to reduce the level back to its intended size, plus all the overlapping segments in L2 (coloured in red). Picking overlapped segments is required for the invariant of Leveling to hold: every level after L0 is disjoint. This operation needs to rewrite 320 MB of data. The target size of any level after L0 is: level_fanout ^ level_idx.

Example: With a segment size of 64 MB and level base size of 256 MB, L1 contains 4 segments. With a fanout of 10, L2 should then contain a maximum of 40 segments, L3 could house 400 segments etc.

Now, consider this LSM-tree with an extreme level fanout of 2: When L1 grows too large, the L1’s overshoot only needs to be merged with one segment in L2, resulting in 128 MB of rewritten data, which is a much lower write amplification than the tree with a fanout of 8. But the downside is how deep the tree has grown. Both trees store the same amount of data, but the narrow tree has twice as many levels (ignoring L0). This effectively doubles the worse case read latency of the tree compared to the wide tree.

With the level fanout we effectively control how much the tree behaves like a log (low fanout) or a sorted array (high fanout). Consider an absurd of level fanout of 1’000: With the default segment size, we can store 64 GB in just one level, which results in very fast writes, but any overshoot of the previous level may result in 100s of segments to be rewritten, resulting in extreme write amplification. The tree thus becomes more like a flat file that we are overwriting again and again.

Summarized:

High level fanout = faster reads, higher write amplification

Low level fanout = slower reads, lower write amplification

L0 fragmentation

By now you probably have realized that most issues stem from the fact L0 is not disjoint. While this is an unavoidable fact of life, it can be alleviated by fragmenting L0 segments across the key boundaries of L1 segments. Because we know have disjoint sets of segments, we can now be smarter about picking a subset of L0 (instead of all of it) and merge that into L1. More importantly, compaction can be parallelized because the compaction workers never get in the way of other compaction worker’s key ranges.

Fragmentation increases flushing times as one L0 segment may involve up to level_ratio fsyncs instead of 1, and also rolling over a segment writer incurs some overhead. Also, L0 will house more very small segments which increases bookkeeping costs a bit. But as shown above, point reads and range reads can both be optimized for disjoint sets of segments.

The key boundaries (vertical dotted lines) in L1 determine the fragmentation in L0, and thus the maximum possible parallelism for L0->L1 compactions

L0 fragmentation is currently not implemented in lsm-tree

See https://github.com/fjall-rs/lsm-tree/issues/48

Summary

LCS needs some care to both understand, implement and optimize effectively. It is the default compaction strategy used in RocksDB, CockroachDB’s Pebble, Dgraph’s BadgerDB, Fjall, and the only compaction strategy implemented in LevelDB (probably where it got its name from, really). Interestingly, Cassandra and ScyllaDB opt for STCS as default compaction strategy instead. While suffering from higher compaction overhead, LCS has more predictable read performance and is much more tuned for range-based queries, which a lot of databases depend on.

Leveling and Tiering are a spectrum [1], and switching between compaction strategies and fine-tuning sizing ratios [2] between different levels, thus allowing LSM-trees to morph depending on workload spikes (lazy leveling [3], leveled-N [4] and incremental [5] compaction strategies to name a few), is an interesting topic for future investigations.

Footnotes

[1] RocksDB Wiki - Tiered + Leveled

[2] Niv Dayan et. al.: The Log-Structured Merge-Bush & the Wacky Continuum, 2019

[3] Niv Dayan et. al.: Dostoevsky: Better Space-Time Trade-Offs for LSM-Tree Based Key-Value Stores via Adaptive Removal of Superfluous Merging, 2018

[4] RocksDB Wiki - Leveled N

[5] ScyllaDB Blog - Incremental Compaction 2.0

Discord

Join the discord server if you want to chat about databases, Rust or whatever: https://discord.gg/HvYGp4NFFk

Interested in LSM-trees and Rust?

Check out fjall, an MIT-licensed LSM-based storage engine written in Rust.