Book notes

Designing Data-Intensive Applications: Ch 6. – Partitioning

After a certain point your data becomes a bottleneck, either because of its size or its access patterns, and at that point you’ll need to partition it. There are a few ways to do that, and several open source solutions for partitioning massive data sets, but each has its own downsides that need to be weighed carefully against your specific use case.

The simplest thing to do is to partition key/value data based on the key. The idea is to spread the data across the partitions as evenly as possible to avoid skew. Skew is the opposite of that even spread: some nodes end up with far more than their fair share of the workload, whether in terms of data access or data volume. There is no truly generic way to distribute the data that will be optimal. Even if the data is spread evenly it is still possible to get hot spots and overload a single node. Avoiding hot spots requires domain knowledge and potentially more control over how the database partitions the data than you actually have. So in general folks resort to tricks like compound primary keys, partitioning on only part of such a key so that related records stay together while the load still spreads out. The keys can be partitioned by sorted key range or by a hashing function. With a hashing function we lose the ability to do efficient range queries and have to fall back on a scatter/gather scheme to query the data. No matter which strategy is chosen, avoiding hot spots is a hard problem. Did I mention avoiding hot spots is a hard problem?
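To make the range-vs-hash trade-off concrete, here is a minimal Python sketch. The boundaries, partition count, and the user_id/timestamp compound key are made up for illustration, and md5 just stands in for whatever stable hash a real database uses: key-range partitioning keeps neighbouring keys together so range scans are cheap but sequential keys pile up on one partition, while hash partitioning spreads keys evenly but scatters any range.

```python
import bisect
import hashlib

# Hypothetical key-range boundaries: partition 0 holds keys < "g",
# partition 1 holds "g".."n", partition 2 holds "n".."t", partition 3 holds >= "t".
RANGE_BOUNDARIES = ["g", "n", "t"]
NUM_HASH_PARTITIONS = 4

def range_partition(key: str) -> int:
    """Keys stay sorted, so a range scan touches few partitions,
    but sequential keys (e.g. timestamps) all pile onto one partition."""
    return bisect.bisect_right(RANGE_BOUNDARIES, key)

def hash_partition(key: str) -> int:
    """A stable hash spreads keys evenly, but adjacent keys land on
    different partitions, so range queries become scatter/gather."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_HASH_PARTITIONS

def compound_partition(user_id: str, timestamp: str) -> int:
    """Compound-key trick: hash only the first part of the key so that one
    user's records share a partition; the timestamp only orders rows inside it."""
    return hash_partition(user_id)
```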

After partitioning the data we also need to consider our query patterns and how to serve them efficiently. This is where secondary indices come in. A secondary index is exactly what it sounds like: an index on non-primary fields that makes queries on those fields efficient. There is a wrinkle when the data is partitioned, because the indices have to be partitioned too. Either each partition keeps a local index over its own records, in which case a query has to scatter/gather across all partitions, or we maintain a global index. A global index can in turn be partitioned to spread the load, but that introduces a new problem: a single data update may now need to touch index partitions on other nodes, which requires coordination and leads us back to the problems of synchronous versus asynchronous replication and index update lag. Since the indices are themselves just data, we can use whatever partitioning scheme suits the workloads we plan to run against them and the trade-offs we want to make between complexity and performance.
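A minimal Python sketch of the two index layouts, using a toy Partition class and a "color" field I made up (this is my own model, not any real database’s API): the local index forces a scatter/gather read, while the global index makes reads cheap but spreads a single write across index partitions.

```python
from collections import defaultdict

class Partition:
    def __init__(self):
        self.rows = {}                    # primary key -> record
        self.by_color = defaultdict(set)  # local secondary index: color -> primary keys

    def put(self, pk, record):
        self.rows[pk] = record
        self.by_color[record["color"]].add(pk)

partitions = [Partition() for _ in range(4)]

def query_by_color_local(color):
    """Local (per-partition) index: the query is scattered to every
    partition and the results are gathered back together."""
    results = []
    for p in partitions:                                          # scatter
        results.extend(p.rows[pk] for pk in p.by_color[color])    # gather
    return results

# Global index: all entries for one color live on a single index partition,
# so a read hits one place, but a write may have to update an index partition
# that lives on a different node than the row itself.
global_index = [defaultdict(set) for _ in range(4)]

def put_globally_indexed(partition_id, pk, record):
    partitions[partition_id].rows[pk] = record
    idx = hash(record["color"]) % len(global_index)
    global_index[idx][record["color"]].add((partition_id, pk))
```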

So far we have assumed a fixed set of partitions, but that is an unnecessary assumption if we are willing to re-balance the partitions. This can be done statically or dynamically. The static approach is to choose some large number of partitions up front and assign a share of them to each existing node. As new nodes are added they steal some whole partitions from the existing nodes, so only entire partitions move and the keys inside a partition never change partitions (see the sketch below). The dynamic scheme, where partitions split and merge as they grow and shrink, is slightly more complicated and requires some kind of coordination mechanism: HBase uses ZooKeeper, other databases accomplish the same with gossip protocols. Again there are trade-offs. When re-balancing the goal is to preserve data locality and move as little data as possible, which is where consistent hashing schemes and other desiderata come in. This part still doesn’t make much sense to me, because doing the re-balancing and partitioning properly seems to require a lot of domain-specific knowledge about the data distribution and access patterns, and from what I’ve been reading the databases don’t expose the right set of knobs for that.
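Here is a minimal sketch, with made-up node names, of the fixed-number-of-partitions idea: create far more partitions than nodes, and when a node joins it takes over whole partitions from the busiest nodes instead of re-hashing keys against the node count.

```python
from collections import Counter

NUM_PARTITIONS = 256  # fixed up front, much larger than the expected node count

# Initial assignment: spread the 256 partitions over three made-up nodes.
assignment = {p: f"node{p % 3}" for p in range(NUM_PARTITIONS)}

def add_node(assignment, new_node):
    """A joining node steals whole partitions from over-loaded nodes until the
    load is even again; untouched partitions stay exactly where they were."""
    new_assignment = dict(assignment)
    counts = Counter(new_assignment.values())
    target = NUM_PARTITIONS // (len(counts) + 1)  # fair share for the new node
    stolen = 0
    for p, owner in assignment.items():
        if stolen == target:
            break
        if counts[owner] > target:
            new_assignment[p] = new_node
            counts[owner] -= 1
            stolen += 1
    return new_assignment

after = add_node(assignment, "node3")
moved = sum(1 for p in assignment if assignment[p] != after[p])
print(f"{moved}/{NUM_PARTITIONS} partitions moved")  # 64 of 256 move here,
# whereas hash(key) % num_nodes would remap most keys when the node count changes.
```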

Now that the data is nicely partitioned and the imaginary load is spread as evenly as possible, we somehow have to tell clients how to reach the data across all the different partitions. This is where coordination comes in again, and there are various designs for letting clients discover which nodes they need to talk to: the client can track the partition assignment itself, a routing tier can do it on its behalf, or any node can accept the request and forward it. There are simple ways to do it, but executing complex queries in parallel across many partitions is basically an entire research area called massively parallel processing (MPP). As you’d imagine, people want to query their data with regular SQL even if behind the scenes there are 100 servers holding the partitioned data. Seems like something one could build a business around.
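A minimal sketch of the routing problem (toy hostnames and a stand-in hash, not any real driver’s API): something has to map a key to a partition and a partition to the node that currently owns it, and that map has to live somewhere, whether in the client, a routing tier, or the nodes themselves.

```python
import hashlib

# Hypothetical partition-to-node map; in practice this changes as partitions
# rebalance, which is why it is kept in a coordination service such as
# ZooKeeper or gossiped between the nodes.
NODE_FOR_PARTITION = {0: "node-a", 1: "node-b", 2: "node-c", 3: "node-a"}

def partition_for(key: str) -> int:
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % len(NODE_FOR_PARTITION)

def route(key: str) -> str:
    """Return the node a request for this key should be sent to."""
    return NODE_FOR_PARTITION[partition_for(key)]

print(route("user:42"))  # the client (or routing tier) sends the request here
```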