How do Databases Find Data So Fast?

What’s the secret sauce that makes databases so fast?

Looking up a specific row in the database row by row might mean, at worst, visiting every single row (if the target data were in the last one).

Even if you were to guarantee the rows were ordered by their primary key, and you could then perform binary search to speed it up, any queries filtering based on a column that isn’t your primary key would not be able to use that search and would still have to look at all rows. Or what happens to your binary search if you delete rows?

This is where database indexes come in.

Database Indexes

While all of the data is stored for permanence “on disk” - on the hard disk drive or the SSD - the database will actually duplicate some data into a specific data structure in memory in order to help it find the data on disk faster.

If there are millions of rows, the index could potentially be too big to fit in memory too, in which case it would have to be on disk as well. In fact, it has to be on disk in some form no matter what, in case the database server crashes or needs maintenance - re-indexing a massive database can block read and write operations for hours depending on the sophistication and size of the database.

Most of today’s relational and document databases have chosen from three types of data structures to build indexes:

B trees
B+ trees
Log Structured Merging (LSM)

We’ll focus on B trees and B+ trees here.

B trees take your index value and create a tree-like structure where each node points to a range of values within that index. Let’s say you often query the “age” column of a “Person” table, so you configure the database to make an index on “age”. If age could go from 1 to 100, the root node might point at nodes with values of age ranging from 1-25, 26-50, 51-75, and 76-100. Then the first node might subdivide further by pointing to a few more nodes with values of age ranging from 1-8, 9-16, and 17-25.

Though Reads Speed Up, Writes Slow Down

As you insert, update, and delete data, these trees are kept up to date. This means your writes will be slower, since you have to update the data on disk and the index in memory. Add to this that you are likely duplicating a representation of the B tree onto disk so the database doesn’t have to re-index if it crashes and restarts, and you can see why indexes can be harmful to write speeds.

Many applications read data much more frequently than they write though, and that’s where indexes shine by speeding up database reads significantly.

Depending on your queries, speed-ups can increase even further when you use:

Additional indexes on different columns
Composite indexes where you sort on values from multiple columns
Covered/included indexes, where you sort on one column but add values from another column in order to prevent needing to read the actual row from disk.

Dive a bit deeper into your queries, you might just find some opportunities for performance improvements with a simple index or two.