Some study on database storage internals

Kousik Nath
10 min read · Nov 1, 2017


I have been studying and stumbling upon how databases (relational and non-relational) store data internally. I must admit the topic is vast and complex, and what I have managed to learn is probably 0.0000000000001% of the internals. In this article I will try to present some design internals, how databases optimise operations to provide impressive speed, and the advantages and disadvantages of such designs.

I guess when we talk about database internal storage, many people imagine a B-Tree or B+ Tree, as they may have heard of this use case during their college or university life. We will not discuss how these data structures work, as there are many articles on that; instead we will see how they fit into these scenarios and what purpose they serve. From here on I will refer to both B-Trees and B+ Trees as B-Trees.

Traditional relational databases store data on disk in B-Tree structures. They also maintain an index to this data, again as a B-Tree, mostly in RAM, to guarantee faster access to the original data. The rows we insert are stored at the leaf nodes of the B-Tree; all intermediate nodes store metadata used to navigate CRUD queries through the tree. When there are millions of rows in the database, the index and data storage can grow huge. Ideally you would load the data from disk into memory for efficiency, but your RAM probably cannot hold it all, so the database is compelled to read data from disk. The scenario typically looks like the following:

When a loaded B-Tree does not fit into RAM completely. Courtesy: Tokutek

Disk I/O is very costly, and it dominates database performance. A B-Tree is a random read/write, in-place-update, very compact, self-balancing data structure, and it suffers from this disk I/O limitation. Random means that to access some data on disk, the disk controller has to move the head around the spinning platter, which is naturally a costly operation: the head must seek back and forth for the data.

A B-Tree is designed to store data in blocks, since it is efficient for operating systems to read and write data in blocks rather than as individual bytes. MySQL's InnoDB engine uses a block (page) size of 16 KB: every time you read or write data, a 16 KB page is fetched from disk into RAM, manipulated, and, if required, written back to disk. The B-Tree takes advantage of this block-oriented operation. Say the average size of a row is 128 bytes (the actual size may vary); a disk block (in this case, a leaf node) of size 16 KB can then store (16 * 1024) / 128 = 128 rows. This is a very rough estimate, just to show that a disk block can accommodate a bunch of rows together. A B-Tree is a wide but shallow data structure, typically of height ≤ 10, and in such an arrangement it can contain millions of entries. These properties make the B-Tree suitable for databases. Since all the leaves of a B-Tree reside at the same level, it takes O(log N) steps to find a row by its primary key.
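To get a feel for these numbers, here is a back-of-the-envelope estimate. The 16 KB page size and 128-byte row size come from the text above; the ~16-byte key-plus-pointer size for internal pages is my own assumption, purely for illustration:

```python
# Back-of-the-envelope B-Tree capacity estimate using the numbers above.
# The ~16-byte key+pointer size for internal pages is an assumption.

PAGE_SIZE = 16 * 1024   # InnoDB default page size: 16 KB
ROW_SIZE = 128          # assumed average row size in bytes

rows_per_leaf = PAGE_SIZE // ROW_SIZE
print(rows_per_leaf)                # 128 rows per leaf page

fanout = PAGE_SIZE // 16            # ~1024 children per internal page
# A tree of height 3 (root -> internal -> leaf) then addresses:
print(fanout ** 2 * rows_per_leaf)  # 134217728 rows, i.e. over 100 million
```

This is why the tree stays so shallow: with a fanout in the thousands, even billions of rows fit within a handful of levels.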

So reads are comparatively fast in a B-Tree, since the operation traverses very few nodes and issues very few disk I/O calls. Range queries can also perform well when the rows to be fetched or manipulated live in the same page blocks and those blocks are already loaded in memory.

You can also make a B-Tree based database perform faster in the following ways:

  1. Minimise the number of indexes: This is a very common strategy for improving relational database performance, since more indexes mean more index updates on every INSERT and UPDATE operation. As the database grows older, delete old and probably unused indexes, and hesitate before creating new ones as your database strains. Be extra careful, though: fewer indexes mean slower SELECT queries. You pay a penalty either way if you have too few indexes or very heavy ones.
  2. Sequential insert: If you can manage to insert a lot of rows together in sequential order, where the primary keys of the rows are sequential, the inserts will be faster: the relevant page blocks are likely already in memory, so possibly all the insertions are applied in the same page blocks and then committed to disk at once in a single I/O. Much depends on the database here, but I assume modern databases are designed to apply such optimizations when possible.

But the B-Tree is not the optimal storage structure for all situations. Writes are very problematic, far worse than random reads: the database often has to load the page corresponding to the data from disk, modify it, and write the whole page back, so the B-Tree suffers from write amplification. On average a magnetic disk can serve only on the order of 100 random writes per second; this limitation comes from the physical design of how the disk works. Scaling reads is comparatively easy, depending on the use case, by using caches, search indexes, more memory, and so on, but scaling writes is painful. When you need to write or update tons of data, a B-Tree is not the right choice. That said, traditional databases have accumulated a lot of optimizations over time. InnoDB, for example, tries to minimise disk I/O by using a buffer. Following is the representation:

InnoDB using buffer to minimise disk I/O

InnoDB buffers inserts, deletes, and updates when the needed leaf node is not in memory. The buffer is flushed when it fills up or when the corresponding leaf nodes are brought into memory; this way InnoDB defers the disk I/O. Still, database writes could be made much, much faster by fully leveraging the available disk bandwidth, which existing relational databases fail to do. Relational database systems are also very complex inside, with locking, concurrency control, ACID transaction semantics, and so on, which makes reads and writes more complex still.
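As a rough illustration of this buffering idea, a conceptual sketch only, not InnoDB's actual change-buffer implementation, deferring changes to pages that are not in memory might look like this:

```python
# Toy sketch of buffered writes: changes to pages that are not in memory
# are recorded instead of triggering disk I/O, and merged when the page
# is eventually loaded. Conceptual only; not InnoDB's real implementation.

class BufferedStore:
    def __init__(self):
        self.pages_in_memory = {}   # page_id -> {key: value}
        self.change_buffer = {}     # page_id -> list of pending (key, value)

    def write(self, page_id, key, value):
        if page_id in self.pages_in_memory:
            self.pages_in_memory[page_id][key] = value  # apply directly
        else:
            # page is on disk: record the change instead of doing I/O now
            self.change_buffer.setdefault(page_id, []).append((key, value))

    def load_page(self, page_id):
        # simulate reading the page from disk, then merge pending changes
        page = self.pages_in_memory.setdefault(page_id, {})
        for key, value in self.change_buffer.pop(page_id, []):
            page[key] = value
        return page

store = BufferedStore()
store.write(7, "a", 1)    # page 7 not in memory: buffered, no disk I/O
store.write(7, "b", 2)    # buffered again, still no I/O
page = store.load_page(7) # one disk read, both pending changes merged
print(page)               # {'a': 1, 'b': 2}
```

The point is that two writes cost zero page I/Os up front; the work is paid once, when the page is read anyway.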

In the modern information age, there are millions to billions of writes happening, probably per hour, in consumer-centric services like messaging, chat, real-time communication, and IoT, and in distributed systems built around big unstructured data. These are high write-contention systems, and to meet their needs a database must keep up with both insertion and retrieval speed. Typical relational databases are not the right match for such scenarios, as they cannot meet the requirements of high availability, (so-called) eventual consistency, flexible unstructured data definitions, low latency, and so on.

The LSM Tree, or ‘Log-Structured Merge Tree’, comes to the rescue here. Remember, the requirement is to make writing data insanely fast while maintaining acceptable read speed. LSM is not actually a single data structure like the B-Tree. In this scheme, unlike a B-Tree, there is no in-place update of data: once data is written to some portion of the disk, it is never modified. It is an append-only write stream. Log-structured filesystems work this way, and journaling filesystems like ext3/ext4 use a similar append-only log for their journals. So it is like logging the data sequentially somewhere; hence the term ‘log-structured’. Basically, an LSM system takes advantage of sequential writes. Traditional magnetic hard drives can write data sequentially at up to around 100 MB/second, and modern SSDs can write far faster; SSDs also have internal parallelism that lets them write to several flash channels at once, which makes the LSM tree a very good match for SSDs. Just imagine how much faster it is to write data sequentially on disk than randomly. To support writing huge amounts of data at high speed, nothing beats sequential writes.
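The core of ‘log-structured’ is just this append-only discipline. A minimal sketch (the file name and records are made up for illustration):

```python
# Minimal sketch of an append-only log: every write is appended
# sequentially; existing bytes are never modified in place.

import os
import tempfile

log_path = os.path.join(tempfile.mkdtemp(), "data.log")

def append(record: str):
    # open in append mode: the OS turns every write into a sequential write
    with open(log_path, "a") as f:
        f.write(record + "\n")

append("set user:1 alice")
append("set user:2 bob")
append("set user:1 alicia")   # an "update" is just another appended record

with open(log_path) as f:
    lines = f.read().splitlines()
print(lines)
# The latest record for a key wins; older records are reclaimed later
# by compaction rather than overwritten in place.
```

Note that the “update” to `user:1` never touched the earlier bytes; reclaiming the stale record is compaction's job, which is exactly the trade the LSM design makes.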

To understand this scenario properly, let's briefly see how Cassandra, a very popular row-oriented database created at Facebook, works on the LSM principle.

Cassandra write path. Courtesy: Datastax

Cassandra, like any LSM system, maintains one or more in-memory data structures (the memtable in the image above), such as a self-balancing tree (AVL tree, red-black tree, B-Tree) or a skip list, where data is written before it moves to disk. This in-memory structure keeps the data sorted. Different LSM implementations may use different such data structures as their requirements dictate; there is no standard LSM implementation. When the memtable contents exceed a configurable threshold, the memtable is put in a queue to be flushed to disk. To flush it, Cassandra writes the sorted data to disk sequentially, into an on-disk structure called an ‘SSTable’, or Sorted Strings Table. It is nothing but a sorted, immutable snapshot of data written to a file. An LSM system can maintain many such files on disk, so if a piece of data is not found in the memtable, Cassandra may need to scan all the on-disk SSTables, as the data can be scattered across them. Cassandra reads are therefore logically slower than writes. But there are remedies. Cassandra, like any LSM system, runs a compaction process continuously in the background to minimise the number of SSTables; compaction essentially merge-sorts SSTables, writes the merged data into a new SSTable, and deletes the old ones. The catch is that compaction sometimes cannot keep up with the tons of updates hitting the database. So probabilistic data structures like Bloom filters are used to tell, in constant time, whether some data might exist in a given SSTable. Bloom filters work best when they reside in RAM, as they internally make many random probes to produce their probabilistic answer. They can significantly reduce the effort of scanning all the SSTables, although (as I heard in a tech talk) Bloom filters cannot help with range queries.
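Greatly simplified, the memtable-plus-SSTable write and read path described above can be sketched like this (the tiny threshold and plain dicts are stand-ins for Cassandra's real structures):

```python
# Conceptual LSM write/read path: a sorted in-memory memtable flushed
# into immutable, sorted on-disk "SSTables". Hypothetical and greatly
# simplified compared to Cassandra's real implementation.

MEMTABLE_LIMIT = 3   # artificially small flush threshold for the demo

memtable = {}        # stand-in for a sorted structure (skip list, RB-tree)
sstables = []        # immutable sorted snapshots, newest last

def put(key, value):
    memtable[key] = value
    if len(memtable) >= MEMTABLE_LIMIT:
        # flush: write a sorted, immutable snapshot "to disk"
        sstables.append(dict(sorted(memtable.items())))
        memtable.clear()

def get(key):
    if key in memtable:               # check memory first
        return memtable[key]
    for table in reversed(sstables):  # then newest SSTable to oldest
        if key in table:
            return table[key]
    return None

for i in range(1, 7):
    put(f"k{i}", i)
print(len(sstables))   # 2 flushes happened
print(get("k2"))       # 2, found by scanning back to an older SSTable
```

Notice that `get` may touch every SSTable in the worst case, which is precisely the read cost that compaction and Bloom filters exist to contain.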
Anyway, to make reads faster, LSM systems rely on a lot of housekeeping tasks and workarounds like Bloom filters. So LSM systems solve the problem of writing giant amounts of data at huge speed, but they can fail to retrieve data at optimal speed: an LSM system suffers from read amplification, reading more data than is actually required, as a consequence of this design. So is there anything that sits between the B-Tree and the LSM Tree and gives us a good balance of read and write speed?
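A Bloom filter itself is small enough to sketch. This is a minimal illustrative version, not Cassandra's implementation; the bit-array size and hash count are arbitrary here:

```python
# Minimal Bloom filter: answers "definitely not present" or "maybe
# present" in constant time, letting an LSM read skip SSTables that
# cannot contain the key. Sizes here are arbitrary demo values.

import hashlib

class BloomFilter:
    def __init__(self, size=1024, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = [False] * size

    def _positions(self, key):
        # derive several bit positions from independent-ish hashes
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # False means definitely absent; True means possibly present
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
bf.add("user:42")
print(bf.might_contain("user:42"))   # True
print(bf.might_contain("user:99"))   # False, barring a rare false positive
```

A negative answer is always trustworthy, so a per-SSTable Bloom filter lets the read path skip most files outright; the price is a small, tunable false-positive rate.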

Fractal Tree Index: The fractal tree is based on the B-Tree structure, but with certain optimizations that make it perform fast, very fast at least according to its published benchmarks. A fractal tree maintains message buffers inside all nodes except leaf nodes. The fractal tree is used in TokuDB, built by Tokutek, which was later acquired by Percona. TokuDB can be used as a MySQL storage engine, just as InnoDB is.

Fractal Tree message buffers. Courtesy: Tokutek

When you apply any operation, add column, delete column, insert, update, whatever it is, it is saved as a message somewhere among the intermediate nodes of the fractal tree (and in any secondary index buffer if necessary). So the operation logically completes faster than in InnoDB, since it is simply saved in a buffer. Messages cascade down from parent to leaf nodes through the intermediate nodes as the buffers fill up; the leaves still store the actual data, as in a B-Tree. When reading, the operation has to take into account all pending messages on its path to determine the actual state of the data, but this is still relatively fast, as TokuDB tries to keep the intermediate nodes in memory as much as possible. This messaging scheme looks simple on the surface, but exactly how it is implemented inside can only be described by Percona's engineers. Also, TokuDB's block size is as large as 4 MB, compared to InnoDB's 16 KB. Such a size allows more data to be fetched from or saved to disk in a single I/O, and it also allows more data to be compressed at once, reducing the resulting on-disk size; TokuDB thus emphasises better data compression and fewer disk I/Os via its larger block size. Percona claims that its storage engine is considerably faster than InnoDB, providing far more read/write throughput. TokuDB also claims fewer fragmentation issues than InnoDB, support for multiple clustering indexes, and more. The disadvantage is that TokuDB compensates for the saved disk I/O with more CPU work, as it must process the messages in the buffers as they apply. It also seems not to be as popular as InnoDB.
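The message-buffering idea can be caricatured in a few lines. This is a conceptual two-level toy, not TokuDB's actual algorithm; the buffer limit and two-leaf layout are invented for illustration:

```python
# Toy sketch of the fractal tree idea: writes are buffered as messages at
# an internal node and cascaded toward the leaves only when the buffer
# fills. Reads must consult pending messages. Two levels only, for clarity.

BUFFER_LIMIT = 4

class Root:
    def __init__(self):
        self.buffer = []                  # pending (key, value) messages
        self.leaves = {0: {}, 1: {}}      # two leaf "pages"

    def _leaf_for(self, key):
        return self.leaves[hash(key) % 2]

    def insert(self, key, value):
        self.buffer.append((key, value))  # cheap: no leaf touched yet
        if len(self.buffer) >= BUFFER_LIMIT:
            self._flush()

    def _flush(self):
        # cascade all buffered messages down to the leaves in one pass
        for key, value in self.buffer:
            self._leaf_for(key)[key] = value
        self.buffer = []

    def lookup(self, key):
        # the newest pending message wins over what the leaf holds
        for k, v in reversed(self.buffer):
            if k == key:
                return v
        return self._leaf_for(key).get(key)

tree = Root()
tree.insert("a", 1)        # buffered at the root, leaves untouched
print(tree.lookup("a"))    # 1, served from the pending messages
```

The write is acknowledged after touching only the root's buffer; the cost of reaching the leaves is amortised over a whole batch of messages when the buffer flushes.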

Following is a benchmark:

Benchmark from the TokuDB team

Only benchmarking on your own system can help you identify the correct data points and the required solution. But yes, database storage has been continuously reforming to support modern requirements as they appear. Thus the LSM tree is intended for write-heavy systems, while the B-Tree still supports traditional needs. And it is genuinely commendable that fractal tree indexing identified some real weaknesses in B-Tree indexing and used them to make itself a better indexing alternative. Still, the future will tell whether TokuDB survives, as technology keeps improving constantly, and so do the hardware, disk drives, and operating systems.

Let me know if you have ideas about other efficient database storage internals and can explain them, perhaps in a post of your own. Also let me know if I have missed any point in this discussion.



Kousik Nath

Deep discussions on problem solving, distributed systems, computing concepts, real life systems designing. Developer @Uber.