Log Structured Merge Trees

LSMTs are primarily used in databases where the write load is much heavier than the read load.

There are four primary concepts:

  • In-memory memtables and WAL
  • SSTables (Sorted String tables) on Disk
  • Compaction
  • Bloom Filters

LSMT-based databases are built on logs: a write is just an append to a log file. However, appending alone makes reading much harder; to find a key you would have to scan through the entire log, which becomes an O(n) operation. Not really scalable.

What if the logs were sorted? Doesn't the read operation now drop to O(log n)? That is the underlying concept for LSMT-based databases.

Memtable

The in-memory memtable has a finite amount of memory allocated to it. All writes go to the memtable first, and its entries are kept sorted by key. The memtable could be backed by something like an AVL tree or a Red-Black tree to maintain sorted order while keeping writes O(log n).

When the memtable reaches its maximum allocated size, it is flushed to disk.

Writes are also always recorded in a WAL (Write-Ahead Log). Since the memtable lives entirely in memory, it would be lost in a database crash; replaying the WAL allows for a smooth recovery.
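The memtable flow above can be sketched in a few lines. This is a minimal illustration, not a real engine: the class name, file format, and tiny size limit are all made up here, and a plain dict sorted at flush time stands in for the AVL/Red-Black tree a production memtable would use.

```python
import json

class Memtable:
    """Sketch of a memtable backed by a WAL (illustrative only)."""

    def __init__(self, wal_path, max_entries=4):
        self.wal = open(wal_path, "a")
        self.entries = {}
        self.max_entries = max_entries

    def put(self, key, value):
        # 1. Append to the WAL first, so a crash loses nothing.
        self.wal.write(json.dumps([key, value]) + "\n")
        self.wal.flush()
        # 2. Then update the in-memory table.
        self.entries[key] = value
        # Signal the caller when it is time to flush to disk.
        return len(self.entries) >= self.max_entries

    def flush(self):
        # Emit entries in sorted key order -- this is what makes the
        # on-disk file an SSTable rather than a plain log.
        sstable = sorted(self.entries.items())
        self.entries.clear()
        return sstable
```

On recovery, the database would simply replay the WAL file line by line to rebuild the memtable it lost in the crash.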

Sorted String Tables

As the memtable is flushed to disk, its contents are written out as Sorted String Tables. When a read comes through, the database searches some number 'x' of SSTables. Since each table is sorted, a lookup is a binary search per table, so read performance is O(x log n).

This can be made better by storing an SSTable index that tells us which SSTables to search based on the input key.
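The per-table binary search can be sketched with the standard library's `bisect` module. The helper name and the list-of-pairs representation of an SSTable are assumptions made for this illustration; real SSTables are block-structured files on disk.

```python
from bisect import bisect_left

def sstable_get(sstable, key):
    """Point lookup in one SSTable, modeled here as a list of
    (key, value) pairs sorted by key. Binary search makes this
    O(log n) per table instead of an O(n) scan."""
    # (key,) sorts just before (key, value), so bisect_left lands
    # exactly on the matching entry if it exists.
    i = bisect_left(sstable, (key,))
    if i < len(sstable) and sstable[i][0] == key:
        return sstable[i][1]
    return None  # not in this table; try the next SSTable
```

A read across 'x' tables just calls this once per table (newest first), returning the first hit.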

Compaction

What if a separate process ran in the background, merging the 'x' Sorted String Tables (SSTs) into a smaller number of SSTs? This would dramatically reduce the number of searches to make across SSTs. Let's say compaction reduces the number of SSTs from 'x' to 'y'.

Note: while merging, the compactor also removes stale entries and keeps only the latest write for each key.

Read performance is now O(y log n), where y << x.
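Because every SSTable is already sorted, compaction is essentially a k-way merge. A minimal sketch under the same list-of-pairs assumption as above (the function name is made up, and tombstones/deletes are omitted for brevity):

```python
import heapq

def compact(sstables):
    """Merge several sorted SSTables into one, keeping only the
    newest value per key. `sstables` is ordered oldest -> newest."""
    # Tag each entry with its table's age so that, after merging,
    # duplicates of a key appear consecutively in oldest-first order.
    streams = [
        [(key, age, value) for key, value in table]
        for age, table in enumerate(sstables)
    ]
    result = []
    for key, _, value in heapq.merge(*streams):
        if result and result[-1][0] == key:
            result[-1] = (key, value)  # newer write wins over stale entry
        else:
            result.append((key, value))
    return result
```

`heapq.merge` streams the tables rather than loading them all at once, which is why real compactors can merge tables far larger than memory.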

Bloom Filters

Can we do even better? Yes! As the number of SSTables increases, read performance starts to take a beating.

Bloom filters to the rescue. A Bloom filter is a probabilistic data structure that tells us whether an entry might be present in a set: it can return false positives, but never false negatives. If each compacted SSTable has a Bloom filter attached, we get an O(1) check for whether an SSTable could hold an entry.

This reduces the read cost to roughly y O(1) Bloom-filter checks plus an O(log n) binary search only in the tables that probably contain the key.

Woohoo! Another blog post done.

