All Things Sharding: Techniques and Real-Life Examples in NoSQL Data Storage Systems

A comprehensive guide to sharding

Kousik Nath
51 min read · Jun 25, 2022

Goal

There are many materials on sharding on the internet. So why one more?

Well, a lot of those materials are shallow, mostly describing the theory of sharding without really describing how it’s implemented in real-life products.

We are going to glance at the internals of sharding as implemented in some of the most popular NoSQL-based engineering systems in the world. We will go into as much detail as we can — shard routing, metadata management, shard movement across the cluster etc. It’s up to you to extract the critical aspects of sharding in each of these systems and connect the dots, so that the next time someone talks about sharding, or you need to implement it in real life, you can draw some value out of this article.

As usual, there is no right or wrong design. Every sharding implementation comes with its own pros, cons and complexity. I hope as you go through the article, you get enough sense of what is right or wrong for your use cases.

You can also use this article to build up your knowledge base even if you don’t have a concrete use case to solve for. Hopefully, it can also serve as a source for system design and architecture interview preparation at the Senior or Staff engineer level and above.

Introduction

Sharding is a popular technique for database performance optimization. When data size grows, it becomes difficult to fit the data on a single machine, performance and scalability take a hit beyond some level of vertical scaling, and cost skyrockets. This is especially true for companies that serve a huge number of users and store huge amounts of data.

There are 2 types of sharding:

Horizontal sharding is effective when queries tend to return a subset of rows that are often grouped together. For example, queries that filter data based on short date ranges are ideal for horizontal sharding since the date range will necessarily limit querying to only a subset of the servers.

Vertical sharding is effective when queries tend to return only a subset of columns of the data. For example, if some queries request only names, and others request only addresses, then the names and addresses can be sharded onto separate servers.

In short, horizontally sharded tables have the same schema as the originally defined table; different nodes just contain different sets of entries.

Vertically sharded tables have a subset of the originally defined table schema. So, for a given piece of data, some columns exist in one table and the rest of the columns exist in other tables — the tables themselves might reside on a single node or on multiple nodes.

Courtesy: Hazelcast

Partitioning: Partitioning is often used as a synonym for sharding; however, partitioning usually refers to grouping of data within a single database instance only. We will use partitioning as an alternative name for sharding.

General Definition of Key Terminologies

Different systems use different terminologies to denote shards. We will use the following definitions:

Shard / Logical Shard: A chunk of data (or entry, record etc) logically grouped together is called a shard. A shard key / partition key is something which is common across every entry that lands into a shard.

Physical Shard: A physical machine that generally hosts multiple logical shards.

Re-Sharding: The process of splitting an existing shard into smaller shards and moving those shards from one node to another in order to balance the system.

Partitioned Data: When data is distributed across multiple logical or physical shards, it’s called partitioned data.

Fundamental Questions to ask before Sharding

Q. Where do we apply sharding?
A.
Sharding is often applied to the caching and database layers.

However, in general, the concepts and techniques of partitioning can be applied to other computing paradigms as well — for example, how to partition data among different processes in a stream processing system like Apache Spark, a batch processing system like MapReduce, or other in-memory systems like Elasticsearch.

Q. When do you need sharding?
A.
Sharding is not the “only” solution for scaling any system at any random company. Each company solves different problems, and each problem domain addresses a different set of users and operates in different regions of the world; some companies are global, with operations spread worldwide or at least across a few countries, while many companies operate in a single country yet serve huge user traffic. All of this means that not all applications experience very high scale or a huge volume of consumer data.

Some time ago, I wrote another article — “Understanding database scaling patterns” — which describes several techniques to scale a system. Before jumping into a sharding solution, as engineers, we should explore whether any of those techniques can help our system scale.

Sharding is often employed by companies having a massive user base — a high volume of requests and data to process. Over the last few years, many database solutions have come up with inbuilt support for sharding. We will see a couple of examples in this article as we progress.

Q. Should you shard a table or the whole logical database?
A.
A logical database or namespace consists of multiple tables or collections.

Shard by logical database: It means there are multiple physical machines, each running a database server process where an instance of your logical database is available. If there are n tables in your logical database, each of the machines would host all n tables, but each would contain a subset of the data.

In the following example, each physical machine contains a logical database each containing 2 tables: Table 1 and Table 2. Each table contains mutually exclusive subset of the original data.

This technique is comparatively easier to implement.

Shard by tables: In reality, not all tables grow at the same rate. Hence, Table 1 in the above diagram might have huge data whereas Table 2 could be serving less data. In that case it only makes sense to shard Table 1 while the other tables remain unsharded. So, every database server would contain the same logical database, but the internal tables would vary, as shown in the diagram below.

You need to come up with the list of tables, their possible data set size and growth expectation. Then you can decide to go with one of the techniques.

Q. Do you need sharding for all the tables?
A.
You might have hundreds of tables, but their growth rates are not the same. In an e-commerce application, order-related tables would experience more growth than user-profile-related tables. So, segregate the tables based on their usage in the system; that would give you a better idea about which tables need sharding.

Q. Should you buy a sharded solution or build your own?
A.
For this, you need to consider the cost of 3rd party integration in terms of the following factors (this is by no means an exhaustive list):

  • The data model and API offered by the system: Every data system has its own way of representing data inside it. How easy and convenient it is to integrate your applications with the product has to be evaluated.
  • Developer skills & learnability of the product would impact the time to integration.
  • Expertise in debugging, fixing and making changes in the product if required.
  • Whether your organization has the right tools and framework in place to streamline the deployment and regular maintenance of the product plays a role. If there is nothing in place, how much time it would take to create such support has to be factored in.
  • Online documentation and customer support make a difference too during integration.
  • Nowadays, almost every company looks for cloud based solutions. Whether the 3rd party product is deployed in the cloud and how many regions are supported has to be evaluated.
  • License cost of the product etc.
  • Last but not least, the performance, latency, availability and security features of the 3rd party product have to be carefully examined with production traffic, or at least some key use cases, simulated against it.

Once you have the above metrics, it is easier to decide whether to build your own solution or buy an external one.

Q. How many shards do you need? What would be the size?
A.
This is where capacity estimation comes into the picture. It could very well be a hard exercise in real life. Often it’s not clear what the size of each shard should be and how many shards we really need. You might need to simulate production traffic in a staging environment and experimentally figure out roughly the right shard size for your use case.

But to conduct those experiments, you need to know:

  • Throughput: What is the desired throughput of the system — it would provide an indication on total network bandwidth required per unit time.
  • Read / Write ratio: What is the expected read : write ratio for your use cases. This would help you to identify if there are any use cases requiring any special shard consideration.
  • Size of data: In case of a database storage, do you have any rough idea on the current table (if exists) dump size or expected data size + its growth rate over time.
  • Does your organization already have some level of sharding strategy defined? For example, based on timestamp, is all the latest data expected to go to the same shard, or is there any other requirement?
  • What would be an ideal shard key in your business use case? Depending on the shard key distribution, the traffic pattern across nodes changes, and thus resource consumption could vary across nodes.

Depending on these factors, you can decide what the ideal size of each shard would be, or whether there is a special requirement to maintain different sizes for different shards.

Note: When you use cloud-based database solutions, they might already have a limit on physical shard storage size.

Q. How will client applications discover shards?
A.
It depends on the implementation:

  • You can have a router service whose responsibility is to route the client queries to the right shard abstracting away the logic from the client.
  • You can use some configuration management system like Zookeeper to keep track of key to shard mapping.
  • You can expose the shard key to node mapping in some way to the clients transparently so that the clients can directly talk to the shard.

All the approaches have their pros and cons. We’ll discuss more as we progress in the article.

Q. What to do if a logical shard grows very high in size?
A.
If a logical shard grows in size, it can be migrated to a more powerful physical shard or, depending on the sharding strategy, it can be split into smaller logical shards. We will get into some examples in this article.

Benefits of Sharding?

Horizontal Scaling: Sharding helps to logically divide the data into smaller portions which can be accommodated on low-cost / budget-friendly machine instances. Since data is spread across multiple machines, we are not constrained by the CPU, memory, disk or network limit of any single machine. Thus, API response time, query latency and database index building time improve.

Increased Read / Write Throughput: With a uniformly distributed shard key, read-write load is distributed across nodes almost equally. Reads and writes can also be performed in parallel. If a node can handle X requests / minute, adding an additional node to the system means adding another X requests / minute of traffic capacity to the system. Thus it increases the throughput of the system.

Storage Capacity: Since multiple machines host subset of data, multiple machines means more storage capacity in the system.

Protection from outage: If a machine goes down due to an unplanned outage, only a small fraction of users is impacted, so the blast radius of the impact / bad user experience can be kept well under control. In an unsharded system, the impact would have been massive since the whole database would have gone down.

Parallel Processing: Parallel queries can work on different shards together, thus improving performance.

High Availability: Shards can be replicated to multiple data centres or across regions, making sure that if traffic to one data centre is cut off, accessing another data centre still fetches the expected data. Irrespective of how shards are placed physically, the whole system appears as a single logical database to the clients.

Geo-distribution & Performance: Since replicated shards can be placed in different regions, consumers can be provided with low-latency access to their data, i.e., consumer requests are redirected to the shard nearest to them. Depending on the data governance policy of a region (ex: GDPR regulation), specific shards can be configured to be placed only in specific regions.

Overall, sharding can increase total cluster storage capacity, speed up processing, and offer higher availability at a lower cost than vertical scaling.

Sharding Complexities

Sharded systems are often very complex. It’s easy to think that sharding magically scales your system, but on the flip side, there are some trade-offs:

  • Operational complexity:
    1. Managing shards means managing multiple nodes; hence managing the health, reliability and availability (among many other parameters) of all those nodes becomes very important.
    2. There should be some coordination mechanism which would take care of identifying which shard holds the requested piece of data. It introduces some level of complexity or sometimes even a point of failure in the system depending on the coordinator implementation.
  • Re-balancing: With increasing size of data, any node might hit the limit of its storage, in that case, a new node needs to be added in the system. On the flip side, if a node crashes, the system needs to perform a fail-over for that failed node or redistribute the node’s data to other nodes. In any case, re-balancing of nodes could be a complex process. During shard re-balancing, the system performance might get impacted too as a side-effect.
  • Hot-spot: Sharding often introduces hot-spots if data is not homogeneously distributed among all the nodes. A hot-spot means a particular shard gets comparatively more traffic than others, leading to machine-level resource shortages — CPU, memory, network bandwidth etc.
  • Transactions on a sharded system might span across multiple machines making the process complex and slow.

Sharding Strategies

Hash Based Sharding

This is the easiest sharding implementation. In my early years of career, I was part of a startup in the travel domain that built a B2B ticket booking & inventory platform. Many extremely popular travel apps in India used to consume this B2B platform to book tickets, so the scale of the platform was quite significant. We had a Memcached layer accessible from the application — it was a cache-aside pattern implementation. There were 4 cache nodes. Distribution of data to all those nodes was the responsibility of the application. The application used the simplest possible function to distribute the data: cache node id = hash(some key available in the API request) % 4.

The above implementation is also called manual sharding, since the application takes charge of distributing the data — it exactly knows where to place data and how to query the data.

This strategy can also be implemented in automatic mode by extracting the hashing logic out of the application code and putting it in a wrapper on the cache or database layer. The application layer is not aware of how the hashing happens in this case since the wrapper takes care of it.
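To make the idea concrete, here is a minimal sketch of that kind of modulo-based routing; the node addresses and the request key are hypothetical, and CRC32 stands in for whatever hash function the application picks:

import zlib

CACHE_NODES = ["cache-1:11211", "cache-2:11211", "cache-3:11211", "cache-4:11211"]

def pick_cache_node(request_key: str) -> str:
    # cache node id = hash(some key available in the API request) % number_of_nodes
    node_id = zlib.crc32(request_key.encode()) % len(CACHE_NODES)
    return CACHE_NODES[node_id]

print(pick_cache_node("booking:TRIP-2034"))  # always routes this key to the same node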

Courtesy: Yugabyte

Pros

  • If you manage to find a suitable key to homogeneously distribute data across nodes, then this strategy would result in fast data access. But be aware of the cons mentioned below.
  • Suitable for queries with equality conditions; not suitable for range use cases.

Cons

  • In manual mode, since the client (in our example, the application) orchestrates the data distribution, the distribution logic is tightly coupled to the application, introducing some complexity there. As an application developer, you have to decide how many nodes to use, what hash function to use etc. If business needs change, the sharding strategy might also need to change, meaning a change in the application code.
  • Homogeneous distribution of data is one of the biggest challenges in this strategy since it’s often not clear which key would help developers create such a distribution. Thus, one or more nodes can suddenly turn into hot-spots and, in turn, a part of the system might crash.
  • Another problem appears when you want to add / remove a node to / from the system, which is a fundamental requirement in a distributed system. Since we are using the node id = hash(key) % n formula, when you change n, the node id changes for the majority of keys, causing a huge re-distribution of data. Thus, you need to expire data from almost all the nodes and feed them again with a new set of data (see the sketch after this list).
  • If you query multiple keys together in a single transaction and they happen to be spread across different nodes, the transaction would be fairly slow and the application would suffer.
  • If a database employs this sharding strategy, range queries would be expensive since hash(key) does not necessarily put contiguous data on the same node. JOIN queries also won’t scale since different data sets might reside on different nodes.
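To see how disruptive the modulo change is, here is a quick, hypothetical experiment (MD5 is used only to get a hash that is stable across runs):

import hashlib

def node_for(key: str, n: int) -> int:
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return digest % n

keys = [f"user:{i}" for i in range(10_000)]
moved = sum(node_for(k, 4) != node_for(k, 5) for k in keys)
print(f"{moved / len(keys):.0%} of keys changed nodes")  # typically around 80%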

Suitable Systems

  • Suitable for caching systems since it’s relatively easier to warm up cache when a node leaves the cluster or a new node is added.

Q. What are some good hash functions suitable for hashing strategy?
A.
The hash function should be extremely random, robust against collisions and deterministic, i.e., it should always produce the same output given the same input.

A few such hash functions are CRC32, RIPEMD160 etc.

Real life Example: Couchbase Hash Based Sharding

Couchbase is a document database that stores data in binary or JSON format. Users can define buckets (equivalent to tables in an RDBMS) which logically organize a set of documents.

Courtesy: Couchbase

A bucket is further divided into 1024 (64 on macOS) virtual buckets (vBuckets). Each virtual bucket is assigned to a physical node as shown below. Each vBucket acts as a logical shard. The vBucket-to-node mapping is handled by the Couchbase cluster manager automatically.

Source: Couchbase

Every document is assigned an id / key by the concerned application. The Couchbase client creates a CRC32 hash of the key which acts as the shard key. The hash % 1024 (or 64 on macOS) produces a number indicating which vBucket contains / should contain the data.

Once the vBucket is identified, the physical node is determined from the vBucket-to-node mapping, and the actual document is written to / fetched from there.

Thus, a vBucket actually divides a bucket into smaller chunks. Since a vBucket is automatically assigned to a physical machine by the Couchbase cluster manager based on different parameters, each physical node shares the data set almost homogeneously, making the system more scalable, high-performing and reliable.
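A rough sketch of that lookup path, with a hypothetical vBucket-to-node map standing in for the one the cluster manager maintains:

import zlib

NUM_VBUCKETS = 1024  # 64 on macOS, as noted above

# Hypothetical map; in Couchbase the cluster manager maintains and distributes this.
vbucket_to_node = {vb: f"node-{vb % 3}" for vb in range(NUM_VBUCKETS)}

def locate(document_key: str) -> tuple[int, str]:
    vbucket = zlib.crc32(document_key.encode()) % NUM_VBUCKETS  # CRC32 hash of the key
    return vbucket, vbucket_to_node[vbucket]                    # then vBucket -> node

print(locate("order::2022::1001"))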

Q. What is the benefit of using vBucket?
A.
Deterministic hash functions always map the same data to the same vBucket for its lifetime, as the number of vBuckets is constant per bucket. Thus the data-to-vBucket mapping remains constant from a client’s point of view, but the cluster has full flexibility to move vBuckets around when a node is added or removed, or some other cluster change event happens. This decouples the physical layer of the system from the logical layer.

Also, even if individual buckets are extremely skewed in terms of the number of documents they contain, vBuckets roughly equally share the burden of the records, making the distribution better and the scalability and performance of the system predictable.

Real life Example: Aerospike Hash Based Sharding

Aerospike is a very popular distributed, schema-less key-value store.

In Aerospike:

  • A namespace is equivalent to database schema in RDBMS.
  • A set is equivalent to a table in RDBMS.
  • A record is comparable to a row in a table in RDBMS but without any strict schema. Each record is uniquely identified by a key of type string or number or Buffer.
  • Each namespace is divided into 4096 partitions.
  • Each partition has one master and a configurable number of replicas.
  • Each cluster node is identical. The partition masters and replicas are distributed in a roughly homogeneous manner across the cluster (multiple machines running Aerospike form a cluster). Hence, if there are n nodes, each node contains roughly 1/n of the data. Ex: if there are 4 nodes, each node contains roughly 1024 partitions for a given namespace.

So, how does Aerospike actually partition the data?

  • For each record, Aerospike creates a 160-bit digest by applying the RIPEMD160 hashing algorithm to the record’s namespace, set and key combination.
  • 12 bits of this digest are used to derive the partition id of the record (see the sketch after this list).
Courtesy: Aerospike
  • Each node in the cluster and each client maintains a partition map — a dynamic in-memory mapping stating which partition is mapped to which master and replicas. Essentially, using a gossip protocol, nodes communicate with each other about the current cluster configuration.
Courtesy: Aerospike
  • Aerospike uses Paxos based voting mechanism to determine which nodes are part of the cluster and which are not.
  • If any node goes down (due to a crash, network issue, human error etc.), the master and replica partitions residing on this node also vanish. Aerospike uses different parameters and heuristics to identify which of the alive nodes would become the new master and which would act as replicas for the dead partitions.
    The reverse happens when a new node is added to the cluster. In order to maintain a homogeneous distribution of data, the system has to decide how to re-balance the partitions and which partitions to assign to the new node.
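The flow above can be sketched roughly as below. This is only an illustration of the idea: the exact bytes Aerospike feeds into the digest and the exact 12 bits it picks differ in detail, and the partition map here is hypothetical.

import hashlib

NUM_PARTITIONS = 4096  # 2**12

def partition_id(namespace: str, set_name: str, user_key: str) -> int:
    # 160-bit RIPEMD-160 digest of the record identity; availability of the
    # ripemd160 algorithm depends on the local OpenSSL build.
    digest = hashlib.new("ripemd160", f"{namespace}:{set_name}:{user_key}".encode()).digest()
    return int.from_bytes(digest[:2], "little") & 0x0FFF  # keep 12 bits -> 0..4095

# Hypothetical partition map: partition id -> master / replica nodes.
partition_map = {pid: {"master": f"node-{pid % 4}", "replica": f"node-{(pid + 1) % 4}"}
                 for pid in range(NUM_PARTITIONS)}

print(partition_map[partition_id("test", "users", "user:42")])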

Q. What is the benefit of maintaining cluster map?
A.
Both Aerospike and Couchbase work in pretty much the same way in that they maintain a dynamic cluster map. In Aerospike, it maps a partition to its master and replicas. In case a cluster change event happens, the map is updated on all the nodes and connected clients. The clients are cluster-aware. The benefit of this approach is that while reading or writing some data, the clients already know exactly which node contains the required data; they don’t have to depend on any external system to find the partition map, thus saving extra network hops and latency.

Q. Why 12-bit partition key?
A.
12 bits means 2¹² = 4096 possible values, hence 4096 partitions per namespace. For Aerospike, 12 bits is big enough to distribute partitions across the cluster and small enough to keep the partition map and other metadata manageable.

Q. What happens if some namespace or set has more data than the other ones?
A.
Even if a namespace or set has skewed data, the partitions won’t be skewed due to the fact that records are roughly uniformly distributed across partitions. Thus the performance and scalability aspects are predictable from physical storage point of view.

Real life Example: Elastic Search Hash based sharding

Elasticsearch is a distributed, lightning-fast RESTful search and analytics engine. Many organizations use Elasticsearch to power text-based search, aggregation and geo-location use cases.

Like a relational database has tables, Elasticsearch has indices. Each index is a collection of documents spread across multiple logical shards. The shards are distributed across all the nodes in the cluster (a cluster can contain one, but usually more than one, node).

Courtesy: codingexplained

In elastic search:

  • Each shard has one primary and, by default, one replica. For a given shard, the primary and replica are usually placed on physically different machines.

The following is how a 4-node cluster might look: a0 and a1 are shards of some index, where the primary of a0 resides on node1 and the replica of a0 resides on node4. Similarly, the primary of a1 resides on node2 and the replica of a1 resides on node3.

Courtesy: elastic.co
  • Each index by default has 1 primary shard assigned, but the number can be changed at index creation time.
  • Each shard can be of any size, but shards sized between 20 GB and 50 GB are believed to perform well.
  • A good rule of thumb is to keep the number of shards below 20 per GB of heap space on a physical node. Hence, a node with a 30 GB heap should have a maximum of 600 shards — but the further below that limit, the better.
  • On a physical machine, one thread runs per shard.
Courtesy: codingexplained

So, how does shard routing work in Elasticsearch?
- Elasticsearch uses the following formula to route a document to a particular shard in an index:

routing_factor = num_routing_shards / num_primary_shards
shard_num = (hash(_routing) % num_routing_shards) / routing_factor

num_routing_shards is the value of the index.number_of_routing_shards index setting. num_primary_shards is the value of the index.number_of_shards index setting.

The value of _routing is usually the document’s _id. But the user can specify any custom value by using routing parameter while creating and retrieving the same document:

PUT my-index-000001/_doc/1?routing=user1&refresh=true 
{
"title": "This is a document"
}

GET my-index-000001/_doc/1?routing=user1
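Putting the routing formula above into code makes it concrete. Elasticsearch actually hashes _routing with Murmur3; CRC32 stands in for it here, and the number_of_routing_shards value is just an assumed example:

import zlib

def shard_num(routing: str, num_primary_shards: int, num_routing_shards: int) -> int:
    routing_factor = num_routing_shards // num_primary_shards
    return (zlib.crc32(routing.encode()) % num_routing_shards) // routing_factor

# e.g. an index with 5 primary shards and an assumed number_of_routing_shards of 640
print(shard_num("user1", num_primary_shards=5, num_routing_shards=640))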

Q. What is the disadvantage of using such hash based routing in elastic search?
A.
Once you create an index with a certain number of shards specified, you can’t add or remove shards from that index anymore. If you want to change the shard count, you need to create a new index with the desired shard configuration and re-index the data from the current index into it.

Q. How does elastic search move shards from one node to another?
A.
Shard movement (dynamic or even manual) is an important part of sharding in a cluster. For the curious mind, here is some discussion on how elastic search does it.

In general, every dynamically sharded system probably considers the following parameters while allocating new shards to nodes or moving shards from one node to another:

  • How many shards each of the nodes currently handles.
  • How many shards each collection or index currently has.
  • What the current disk & network usage of each node looks like when the shard re-balancing decision is made.
  • Whether there is any node where multiple shards are currently going through re-balancing or replication.
  • Whether there are any user preferences for how shards should be moved, e.g., whether a specific type of shard should move to, or stick to, a specific type of node.

Real life Example: Redis Hash Slots based Sharding

Redis Enterprise is a scalable deployment on top of open-source Redis. It supports sharding. It uses hashing over the key only, but in a different way.

  • The keys are divided into hash slots — ranges of key hash values. A range of slots is allocated to a shard — a shard is made up of a master node and one or more replica nodes for high availability (HA). For a shard, the master replicates its hash slots asynchronously to all its replicas. Generally, the master is responsible for serving all read and write operations on its hash slots, but to scale reads, clients can use replicas in read-only mode.
Courtesy: goodrequest
Courtesy: noise
  • Redis supports a maximum of 2¹⁴ = 16384 slots. So, theoretically, a maximum of 16384 shards can be added to the system, each handling one slot; however, the recommended maximum number of master nodes (essentially shards) is 1000 for better performance.
  • The CRC16 hashing algorithm is used to find the hash slot corresponding to a given key: slot = CRC16(key) % 16384. The same algorithm is used by clients and shards for computing the hash of keys.
Courtesy: brandur.org
  • Example, in a cluster with 3 master nodes, the hash slot distribution looks the following:
Courtesy: severalnines
  • A Redis cluster uses a mesh topology where a node can talk to every other node in the cluster. If there are n nodes in the system, each node maintains n - 1 incoming connections and n - 1 outgoing connections. Nodes ping each other with heartbeat messages to check their health and update the cluster information accordingly.
  • The current cluster state — which node is handling which hash slots, the replica and leader nodes for different hash slots, the status of cluster nodes, the status of hash slots and some other cluster-related metadata — is constantly propagated among the nodes using a gossip mechanism. Hence, each node in the cluster is aware of which hash slots are being served by which nodes.
  • On startup, a client can issue the CLUSTER SHARDS command to retrieve the cluster topology and the hash-slot-to-node mapping; clients can cache this data as well. Whenever the client wants to get data for a key, it can compute the corresponding hash slot, check which node handles that slot using the (probably cached) slot-to-node mapping, and make a direct request to the concerned node. Caching the map helps to retrieve the value for a key in a single network round trip, as sketched below.
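A minimal sketch of that client-side routing: the CRC16 variant below (CCITT / XModem) is the one the Redis Cluster spec uses for key hashing, while the slot ranges and node addresses are hypothetical, as if cached from CLUSTER SHARDS.

def crc16(data: bytes) -> int:
    # CRC16-CCITT (XModem), the polynomial Redis Cluster specifies for key hashing.
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if (crc & 0x8000) else (crc << 1)
            crc &= 0xFFFF
    return crc

def slot_for(key: str) -> int:
    return crc16(key.encode()) % 16384

# Hypothetical topology cached from CLUSTER SHARDS: each master owns a slot range.
slot_ranges = [(0, 5460, "127.0.0.1:7000"),
               (5461, 10922, "127.0.0.1:7001"),
               (10923, 16383, "127.0.0.1:7002")]

def node_for(key: str) -> str:
    slot = slot_for(key)
    return next(addr for lo, hi, addr in slot_ranges if lo <= slot <= hi)

print(slot_for("user1000"), node_for("user1000"))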

Q. What happens when the cluster topology changes — a new master node is added to or removed from the system?
A.
When a new node is added or removed, hash slots have to be adjusted across nodes.

A new master node (essentially a new shard) is added: If there are x keys and n shards currently, each shard handles roughly x / n keys. Now if n changes to n + 1 (a new node is added), ideally the number of hash slots handled per shard should also change accordingly.

Redis needs administrators to run the rebalance command manually (auto rebalance is not supported right now), ex: redis-cli --cluster rebalance any_node_address:port --cluster-use-empty-masters, to start moving hash slots to newly added nodes. The command automatically tries to allocate an almost equal number of slots across all the nodes.

Consider the previous 3 nodes example, let’s see how the cluster config changes when we add 2 more nodes:

Courtesy: severalnines

Alternatively, administrators can use the reshard command (ex: redis-cli --cluster reshard any_host_ip:port) to move hash slots from one node to another. The admin has to specify the source and destination nodes, how many slots to move etc. — reshard will move those slots from the source to the destination.

A master node (a shard) is removed: The same reshard command can be used.

Note: As we can see here, cluster rebalance and reshard commands trigger a good amount of data re-distribution. This could prove to be a costly operation.

Q. What is the difference between re-balancing vs re-sharding?
A.
Both fundamentally move hash slots from one node to another. The rebalance command calculates the weight of the master nodes in terms of how many hash slots each of them currently hosts — optionally, you can specify a weight for each node in the command using --cluster-weight <node1=w1...nodeN=wN>. It compares the weights of all cluster nodes and decides how many slots to move from one node to another. rebalance by default does not consider empty master nodes (nodes without any assigned slots, ex: new nodes); you need to set the --cluster-use-empty-masters flag in the command to distribute hash slots to empty master nodes.

On the other hand, reshard asks the admin to specify the source and destination nodes (irrespective of whether they are existing nodes or new ones), how many hash slots to move etc. reshard can be used on an existing cluster, when you add a new node or when you remove an existing one.

Q. What happens if a client has invalid hash slots to node mapping in the event of a cluster change event?
A.
It’s possible for the client to have a stale cluster map. In that case, if the client request for a key reaches some master which does not contain the key, that master checks its local mapping, finds out which host contains the key and returns the host address with a MOVED redirection.

GET x
-MOVED 3999 127.0.0.1:6381

Here,
-MOVED means the key and associated value have been permanently moved to some other host.
3999 represents the hash slot of the key.
127.0.0.1:6381 means the key x is now hosted on a Redis instance running at host 127.0.0.1 on port 6381.

Courtesy: brandur.org

Now, client can either cache the mapping for this particular key to the new host or it can again fetch the cluster topology using CLUSTER SHARDS command and update the mapping.

From the next query on-wards, request for the same key will always be served by the new host.

Q. Why does Redis cluster support only 16,384 slots?
A.
Let’s hear the answer from Salvatore Sanfilippo himself:

In short, choosing, 2¹⁴ = 16384 optimizes the gossip message size and hash slot distribution across all masters.

Q. What happens if a master node fails?
A.
Redis cluster has auto fail-over capability. If a master fails, other nodes identify the absence of the master and internally promote one of its replicas as the new master for the concerned hash slots. Redis cluster supports manual fail-over as well.

Cluster fail-over is essentially a switch in the configuration marking a replica as the new master. If a master node abruptly fails due to some issue, a replica can be forced to become the new master, meaning there could be some data loss. In the other case, when a replica has to be failed over for maintenance or machine health reasons, admins can use the appropriate command to achieve that. You can check out the failure documentation here.

Q. If you have a use case to place a group of keys on the same shard, how would you achieve that in Redis cluster?
A.
Redis cluster has a feature called hash tags. Generally, Redis hashes the full key, but with a hash tag, only the portion within { } is hashed. If multiple keys have the same content (string or binary data) inside {}, then irrespective of how the full keys look, all of them end up in the same shard (see the sketch after the examples below).

Examples:

  • The two keys {user1000}.following and {user1000}.followers will hash to the same hash slot since only the sub string user1000 will be hashed in order to compute the hash slot.
  • For the key foo{}{bar} the whole key will be hashed as usual since the first occurrence of { is followed by } on the right without characters in the middle.
  • For the key foo{{bar}}zap the sub string {bar will be hashed, because it is the sub string between the first occurrence of { and the first occurrence of } on its right.
  • For the key foo{bar}{zap} the sub string bar will be hashed, since the algorithm stops at the first valid or invalid (without bytes inside) match of { and }.
  • If the key starts with {}, it is guaranteed to be hashed as a whole. This is useful when using binary data as key names.
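The rules above boil down to a small amount of string handling. A sketch (the function name is mine), which can be combined with the slot computation sketched earlier via slot_for(hash_tag_or_key(key)):

def hash_tag_or_key(key: str) -> str:
    # Hash only the substring between the first '{' and the first '}' to its
    # right, unless that substring is empty -- otherwise hash the whole key.
    start = key.find("{")
    if start == -1:
        return key
    end = key.find("}", start + 1)
    if end == -1 or end == start + 1:        # no closing '}' at all, or empty '{}'
        return key
    return key[start + 1:end]

assert hash_tag_or_key("{user1000}.following") == "user1000"
assert hash_tag_or_key("{user1000}.followers") == "user1000"
assert hash_tag_or_key("foo{}{bar}") == "foo{}{bar}"   # whole key is hashed
assert hash_tag_or_key("foo{{bar}}zap") == "{bar"
assert hash_tag_or_key("foo{bar}{zap}") == "bar"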

Redis has the MGET command where the client can request multiple keys together in the same request; the request is blocking. If all the requested keys don’t reside on the same shard, Redis cluster fails the command with the error message CROSSSLOT Keys in request don't hash to the same slot. For such use cases, hash tags are super useful.

Risk: Since hash tags give the clients / users the power over data distribution, if a huge number of popular keys are tagged with the same content, it can create a hot-shard problem. So, users have to evaluate wisely when and how to go with this option as per the business use case.

Q. What are the limitations of redis cluster?
A.
There are few limitations:

  • Client Support: Before moving to Redis cluster, check if your client supports it; not all clients may.
  • Multi Get or Set commands: As discussed above, Redis cluster does not support MGET or MSET commands across slots. For that, hash tags need to be used so that the keys land in the same slot.
  • Supports only one database: Standalone open-source Redis supports multiple databases (indexed from 0 - 15) and the SELECT command to choose a database. Redis cluster supports only one database and the SELECT command is disabled.

Q. Edge Case: What happens when some key is under migration from one node to another — who answers the client query?
A.
When a slot is under migration from node A to node B, the keys that map to the slot might be present on either of the nodes. Hence, clients first query node A. If the key does not exist in A, it has probably landed in B or is in flight; A responds with an -ASK redirection.

-ASK means reach out to the other server (in our example, B) which possibly has the key. The client then sends the ASKING command before sending the query to B. B checks whether the slot’s current status is IMPORTING (this is the status of a slot at the destination while it’s being copied from a source); if yes, it serves the key if available. If somehow B does not have the key yet either, it issues a -MOVED redirection to the other server (here A), and the client then goes back to A.

Once the slot has been completely moved, A can send a -MOVED redirection, and the clients then know that the key has been moved permanently to B; they can query B directly from then on.

Note

  • ASK makes the client query the other node only for the next request; subsequent requests are still made to the old node.
  • MOVED makes the client query the other node permanently from next request on-wards.
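For illustration, here is a hedged sketch of the redirect-handling loop a cluster-aware client runs internally; send, cached_node and refresh_topology are hypothetical callables, and real client libraries implement this for you:

def execute(send, cached_node, refresh_topology, command: str, key: str):
    node = cached_node(key)                   # from the cached slot map
    for _ in range(5):                        # bound the redirect loop
        reply = send(node, command, key)
        if reply.startswith("-MOVED"):
            _, _slot, node = reply.split()    # permanent move: switch node,
            refresh_topology()                # refresh the cached slot map
        elif reply.startswith("-ASK"):
            _, _slot, node = reply.split()    # one-off redirect during migration:
            send(node, "ASKING", key)         # ask first, keep the old mapping
        else:
            return reply
    raise RuntimeError("too many redirections")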

Range Sharding

In range sharding, every shard is responsible for holding a subset of rows sorted by the chosen shard key (usually part of the primary key in the case of databases). Every individual shard is sorted, and thus the collection of such shards is also globally sorted.

In the following illustration, all records with shard key in the range 0x0000 — 0x1000 always go to the Shard #1 , similarly, records with shard key in the range 0x2000 — 0x3000 always end up in Shard #3.

Courtesy: Yugabyte

Generally, the system initializes with only one shard. As the shard size crosses some threshold, it is divided at the middle of its key range, transforming the shard into two shards. The process goes on and applies to all of the shards in the system over time, as in the sketch below.
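A minimal sketch of range-shard lookup and the midpoint split described above, using hypothetical numeric boundaries like the ones in the illustration:

import bisect

shard_starts = [0x0000, 0x1000, 0x2000, 0x3000]        # sorted shard start keys
shard_names  = ["shard-1", "shard-2", "shard-3", "shard-4"]

def shard_for(key: int) -> str:
    # A key belongs to the right-most shard whose start key is <= the key.
    return shard_names[bisect.bisect_right(shard_starts, key) - 1]

def split(idx: int, upper_bound: int = 0x4000) -> None:
    # When a shard crosses its size threshold, split it at the middle of its range.
    lo = shard_starts[idx]
    hi = shard_starts[idx + 1] if idx + 1 < len(shard_starts) else upper_bound
    shard_starts.insert(idx + 1, (lo + hi) // 2)
    shard_names.insert(idx + 1, shard_names[idx] + "-b")

print(shard_for(0x0ABC))   # shard-1
print(shard_for(0x2FFF))   # shard-3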

Pros

  • Range sharding enables range look up by shard key.
  • If implemented correctly, the system could be very robust because of its dynamic nature.

Cons

  • While starting out, a single shard on some physical machine possibly takes all the traffic, so depending on the system architecture, it could be a single point of failure. If you know the data distribution and key ranges well, you might start with multiple shards right away, making the system less risky.
  • Since keys in a particular range always end up in the same shard, depending on the key distribution in the system, any shard could become hot sooner or later; traffic and data distribution could thus become skewed, making the operational load and management of shards extremely complex.
  • If the data distribution changes dynamically over time, managing these shards can become a practically tough challenge.

Suitable Systems

  • If you need global ordering of records by some key or column and you have range queries to perform within some lower or upper boundary, this sharding technique could be used. But be mindful of the data distribution before implementing the technique.

Real life example: HBase Range Sharding

HBase, modelled after Google Bigtable, is a NoSQL data store providing random, real time access to Big Data applications. It uses HDFS for storage.

  • There are a couple of concepts in HBase: Region, Region Server, HMaster, Zookeeper, and the hbase:meta table (previously called .META.).
  • A Table in HBase is a collection of entries. Each entry has a unique row key and a couple of column families. The user has to provide values for all these fields.
Table = Map[row key =>
    Map[column family =>
        Map[column name => value]
    ]
]
  • Each Table is horizontally divided into multiple regions.
  • A region is a collection of entries sorted by row key. So each region has a start key and an end key.
  • Regions are non-overlapping. At a given point in time, a row exactly belongs to a single region.
  • A region belongs to a region server. A region server can contain multiple regions. Hence a table, when divided into multiple regions, is distributed across multiple physical machines. These regions are nothing but logical shards in the world of HBase.
Courtesy: developpaper
  • Each region is replicated across multiple region servers. Write operations for a region are always handled by the Primary region, thus ensuring strong write consistency. Secondary regions are just replica of the primary. In the following image, green dotted lines flow towards secondary regions and black ones flow towards primary regions.
Courtesy: Cloudera
  • HMaster or HBase Master primarily takes care of allocating, assigning and balancing regions across the region servers to ensure roughly equal data distribution in the system.
  • Zookeeper maintains the cluster state information; it also tracks the location of the region server that contains hbase:meta. This is a special table that maintains a table’s key-range-to-region mapping. It has keys of the format ([table],[region start key],[region id]), with region info and server information as the value. The hbase:meta table is kept sorted. So, by comparing the region start keys of two consecutive rows, you can determine which region a key actually belongs to.
Courtesy: developpaper
  • HBase initially starts with a single region since it has no way to know how many regions to start with. The number of regions depends on factors like data size, traffic pattern etc. There are tools clients can use to do pre-splitting, i.e., initialize multiple regions from the start. But unless the data distribution is known already, it would be difficult to pick the number of initial regions and homogeneously distribute traffic among them. Once data ingestion starts, automatic splitting can take over.

Q. How does a client perform a read operation in HBase?
A. A client first talks to Zookeeper to identify the hbase:meta location. It then queries hbase:meta with whatever table and key range it’s interested in, to figure out which region should be accessed for reading the data. From this point on, the client can talk directly to the region server for the same table and key range. The client can also cache the hbase:meta information. If a new region is introduced or some region dies, causing updates in hbase:meta, talking to the cached region server raises an exception, and the client can then restart from the first step.
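A rough sketch of that meta lookup: hbase:meta rows are sorted by (table, region start key), so the region serving a row key is the last entry whose start key is <= the row key. The entries below are hypothetical.

import bisect

# Hypothetical hbase:meta content for one table: (region start key, region server).
meta = [(b"",  "rs-1.example:16020"),
        (b"g", "rs-2.example:16020"),
        (b"p", "rs-3.example:16020")]

def region_server_for(row_key: bytes) -> str:
    starts = [start for start, _ in meta]
    return meta[bisect.bisect_right(starts, row_key) - 1][1]

print(region_server_for(b"hello"))   # rs-2.example:16020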

Courtesy: developpaper

Q. How does a write operation happen?
A.
A write operation also has similar steps, but they are abstracted from the user. The HBase client automatically figures out the region from the meta table, writes the data and caches the meta information on the client side.

Q. What are the possible point of contention in HBase architecture?
A.
Zookeeper and the hbase:meta table. These components are frequently accessed by clients, which makes them potential points of failure and contention.

Q. What factors are considered for automated region split?
A.
Several configurable region split policies are available for clients:

  • IncreasingToUpperBoundRegionSplitPolicy: region split happens depending on the formula min(R² * "hbase.hregion.memstore.flush.size", "hbase.hregion.max.filesize"), where R is the number of regions of the same table hosted on the same region server. So for example, with the default memstore flush size of 128 MB and the default max store size of 10 GB, the first region on the region server will be split just after the first flush at 128 MB. As the number of regions hosted on the region server increases, it will use increasing split sizes: 512 MB, 1152 MB, 2 GB, 3.2 GB, 4.6 GB, 6.2 GB, etc. After reaching 9 regions, the split size will go beyond the configured “hbase.hregion.max.filesize”, at which point a 10 GB split size will be used from then on (see the computation sketch after this list).

  • KeyPrefixRegionSplitPolicy: If there are known prefixes for keys, this policy can ensure that all keys with the same prefix end up in the same region. This feature is useful if some application wants to take advantage of local transaction in a region.
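The split sizes quoted above fall straight out of the min(R² * flush_size, max_file_size) formula; a quick computation sketch:

FLUSH_SIZE_MB = 128            # hbase.hregion.memstore.flush.size (default)
MAX_FILE_SIZE_MB = 10 * 1024   # hbase.hregion.max.filesize (default, 10 GB)

for regions_on_server in range(1, 10):
    split_at = min(regions_on_server ** 2 * FLUSH_SIZE_MB, MAX_FILE_SIZE_MB)
    print(f"R={regions_on_server}: split at {split_at} MB")
# R=1: 128, R=2: 512, R=3: 1152, R=4: 2048, R=5: 3200 ... R=9: capped at 10240 MB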

For the curious mind: How does HBase implement region split?

Region split is a complex process. The following excerpts are taken from an old cloudera blog, but it should give you a good rough idea on how such splitting processes are implemented in real life:

As write requests are handled by the region server, they accumulate in an in-memory storage system called the “memstore”. Once the memstore fills, its content are written to disk as additional store files. This event is called a “memstore flush”. As store files accumulate, the RegionServer will “compact” them into combined, larger files. After each flush or compaction finishes, a region split request is enqueued if the RegionSplitPolicy decides that the region should be split into two. Since all data files in HBase are immutable, when a split happens, the newly created daughter regions will not rewrite all the data into new files. Instead, they will create small sym-link like files, named Reference files, which point to either top or bottom part of the parent store file according to the split point. The reference file will be used just like a regular data file, but only half of the records. The region can only be split if there are no more references to the immutable data files of the parent region. Those reference files are cleaned gradually by compaction, so that the region will stop referring to its parents files, and can be split further.

Although splitting the region is a local decision made at the RegionServer, the split process itself must coordinate with many actors. The RegionServer notifies the Master before and after the split, updates the .META. table so that clients can discover the new daughter regions, and rearranges the directory structure and data files in HDFS. Split is a multi task process. To enable rollback in case of an error, the RegionServer keeps an in-memory journal about the execution state. The steps taken by the RegionServer to execute the split are illustrated by the following figure. Each step is labelled with its step number. Actions from RegionServers or Master are shown in red, while actions from the clients are show in green.

Courtesy: blog.cloudera.com

1. RegionServer decides locally to split the region, and prepares the split. As a first step, it creates a znode in zookeeper under /hbase/region-in-transition/region-name in SPLITTING state.
2. The Master learns about this znode, since it has a watcher for the parent region-in-transition znode.
3. RegionServer creates a sub-directory named “.splits” under the parent’s region directory in HDFS.
4. RegionServer closes the parent region, forces a flush of the cache and marks the region as offline in its local data structures. At this point, client requests coming to the parent region will throw NotServingRegionException. The client will retry with some backoff.
5. RegionServer create the region directories under .splits directory, for daughter regions A and B, and creates necessary data structures. Then it splits the store files, in the sense that it creates two Reference files per store file in the parent region. Those reference files will point to the parent regions files.
6. RegionServer creates the actual region directory in HDFS, and moves the reference files for each daughter.
7. RegionServer sends a Put request to the .META. table, and sets the parent as offline in the .META. table and adds information about daughter regions. At this point, there won’t be individual entries in .META. for the daughters. Clients will see the parent region is split if they scan .META., but won’t know about the daughters until they appear in .META.. Also, if this Put to .META. succeeds, the parent will be effectively split. If the RegionServer fails before this RPC succeeds, Master and the next region server opening the region will clean dirty state about the region split. After the .META. update, though, the region split will be rolled-forward by Master.
8. RegionServer opens daughters in parallel to accept writes.
9. RegionServer adds the daughters A and B to .META. together with information that it hosts the regions. After this point, clients can discover the new regions, and issue requests to the new region. Clients cache the .META. entries locally, but when they make requests to the region server or .META., their caches will be invalidated, and they will learn about the new regions from .META..
10. RegionServer updates znode /hbase/region-in-transition/region-name in zookeeper to state SPLIT, so that the master can learn about it. The balancer can freely re-assign the daughter regions to other region servers if it chooses so.
11. After the split, meta and HDFS will still contain references to the parent region. Those references will be removed when compactions in daughter regions rewrite the data files. Garbage collection tasks in the master periodically checks whether the daughter regions still refer to parents files. If not, the parent region will be removed.

Real life example: MongoDB Range Sharding

MongoDB is a document based NoSQL storage. It supports hash and range sharding.

Range sharding in MongoDB. Courtesy: MongoDB

In MongoDB terminologies:

  • Shard refers to a physical node.
  • A chunk refers to a logical shard. The default chunk size is 64 megabytes. A shard (physical node) might contain multiple chunks (logical shards).
  • Shards must be deployed as a replica set where a member designated as primary serves all the write traffic; the rest are secondary members replicating data from the primary. If the primary replica fails, the secondary replicas can initiate a voting process and one of them becomes the new primary.
  • Config Server: It contains the config database that is the source of truth for the current configuration of the cluster. It contains several collections holding the mapping of shard key ranges to chunks, the mapping of chunks to shards, information about primary and secondary shards, and cluster re-balancing related information like which chunk is currently in migration and the associated distributed lock. The config server is deployed as a replica set in the MongoDB cluster to ensure high availability.
  • mongos: mongos is an interface between clients (applications) and the sharded cluster; it acts as a query router and aggregator for clients. Clients talk to mongos directly. mongos can talk to the config servers to identify the chunk which contains the client-requested key or key range. It can cache the information locally so that in the future it can talk to the shards directly.
Courtesy: MongoDB
Courtesy: MongoDB

Q. How can a user choose a shard key?
A.
In MongoDB, user can choose a single field or a compound index field as a shard key. More details can be found here.

Q. Can you shard an already populated collection?
A.

To shard a populated collection, the collection must have an index that starts with the shard key. When sharding an empty collection, MongoDB creates the supporting index if the collection does not already have an appropriate index for the specified shard key. See Shard Key Indexes.

Q. What if a query does not specify shard key?
A.
If a read query does not specify a shard key, the query is broadcast to all the shards in the cluster for evaluation. This kind of query is called a scatter-gather query; such queries are very expensive in nature.
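A sketch of the difference, with a hypothetical chunk map and shard key: when the filter contains the shard key, mongos can target one shard; otherwise it has to fan out to all of them.

# Hypothetical shard-key-range -> shard map (what mongos learns from the config server).
chunk_map = [((float("-inf"), 500), "shard-A"), ((500, float("inf")), "shard-B")]
all_shards = {shard for _, shard in chunk_map}

def shards_to_query(query: dict) -> set:
    if "user_id" in query:                       # shard key present -> targeted query
        value = query["user_id"]
        return {shard for (lo, hi), shard in chunk_map if lo <= value < hi}
    return all_shards                            # no shard key -> scatter-gather

print(shards_to_query({"user_id": 42}))          # {'shard-A'}
print(shards_to_query({"email": "a@b.com"}))     # both shards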

Q. What are the possible bottlenecks in the MongoDB sharding architecture?
A.
The config server can become a significant bottleneck in the system. It manages the config database, which itself is a set of MongoDB collections. If the config server is down, system behaviour would be in question until fail-over happens. Any manual change by administrators to the config database is highly discouraged.

Config Server Availability

If the config server replica set loses its primary and cannot elect a primary, the cluster’s metadata becomes read only. You can still read and write data from the shards, but no chunk migration or chunk splits will occur until the replica set can elect a primary.

In a sharded cluster, mongod and mongos instances monitor the replica sets in the sharded cluster (e.g. shard replica sets, config server replica set).

If all config servers become unavailable, the cluster can become inoperable. To ensure that the config servers remain available and intact, backups of config servers are critical. The data on the config server is small compared to the data stored in a cluster, and the config server has a relatively low activity load.

Q. What is Primary Shard in MongoDB?
A.
By default, collections in a database are non-sharded. The user can choose to shard individual collections. Sharded collections are spread across multiple shards (physical nodes) through replica sets, as discussed above. In MongoDB, non-sharded collections for a given database reside on a single shard — traffic for all those non-sharded collections goes to this shard. This is called the Primary Shard. At the time of database creation, a primary shard is allocated to the database. Out of all available shards, MongoDB chooses the shard with the minimum number of chunks as the primary shard.

Courtesy: MongoDB

In the above illustration, Collection 1 is sharded by the user. Hence, the chunks of this collection are spread across Shard A and Shard B . But Collection 2 is non-sharded. Hence it resides only in Shard A .

Q. Which component takes care of managing chunks?
A.
Since MongoDB offers auto-sharding, it has to figure out when to split and move chunks across the shards. There is a component called the balancer, a background process that runs on the primary of the config server replica set. It looks at the number of chunks on each shard and, depending on different strategies, decides how to move chunks. The sole responsibility of the balancer is to ensure chunks are evenly distributed across nodes.

Q. How does the chunk migration algorithm look like at a high level?
A.
According to the chunk migration procedure, following steps are followed:

  1. The balancer process sends the moveChunk command to the source shard.
  2. The source starts the move with an internal moveChunk command. During the migration process, operations to the chunk route to the source shard. The source shard is responsible for incoming write operations for the chunk.
  3. The destination shard builds any indexes required by the source that do not exist on the destination.
  4. The destination shard begins requesting documents in the chunk and starts receiving copies of the data.
  5. After receiving the final document in the chunk, the destination shard starts a synchronization process to ensure that it has the changes to the migrated documents that occurred during the migration.
  6. When fully synchronized, the source shard connects to the config database and updates the cluster metadata with the new location for the chunk.
  7. After the source shard completes the update of the metadata, and once there are no open cursors on the chunk, the source shard deletes its copy of the documents.

Q. How does chunk splitting work in MongoDB?
A.
When a sharded collection is created by invoking sh.shardCollection(), a chunk with range [minKey, maxKey] for that collection is created and stored on the database’s primary shard. As write operations happen on the chunk and its size crosses a threshold, or the chunk reaches the maximum number of documents eligible for migration, it’s split, and the resulting chunks are migrated to other suitable shards automatically by default.

Split Example, Courtesy: MongoDB
Courtesy: Alibaba

Since a collection initiates sharding with a single shard by default, for optimization purposes, different thresholds based on the number of chunks are defined to trigger chunk splits for a given collection.

A chunk split might also be triggered when chunks become unevenly distributed across shards after a previous split.

Q. How does chunk migration work in MongoDB?
A.
MongoDB chunk migration is enabled by default to balance load across shards. It can also be triggered manually by calling moveChunk(). There are a couple of factors that affect chunk migration:

Shard Tag: MongoDB has a feature that allows to tag a shard and key range in a collection. Balancer then ensures that “the range with a tag is allocated to the shard with the same tag”.

Remove Shard: Removing a shard also triggers chunk migration. All the chunks from the to-be removed shard are migrated to existing shards.

Chunk difference between shards: Thresholds can be defined based on the chunk difference between “shard with the most chunks” and the “shard with the least chunks”. If the threshold is crossed, balancer can automatically trigger chunk migration.

Q. What are the side effects of chunk migration?
A.
Since chunk migration moves data from one shard to another, it consumes network bandwidth and can therefore impact database performance. To limit this, a MongoDB shard can participate in only one chunk migration at a given point in time, although migrations involving different pairs of shards can run in parallel.

MongoDB briefly pauses all application reads and writes to the collection being migrated on the source shard before updating the config servers with the new location of the chunk, and resumes them after the update. The chunk move requires all writes to be acknowledged by a majority of the members of the replica set both before and after committing the chunk move to the config servers.

Q. What is the impact of chunk size on chunk splitting and migration?
A.

  • A smaller chunk size means more chunk split and migration activity but a more balanced data distribution. A larger chunk size means fewer splits and migrations but possibly uneven data distribution.
  • Since a chunk covers a min-max key range, with a small chunk size and a shard key value that appears very frequently (a possible edge case), all the documents with that value may end up in one chunk that cannot be split or merged any further. With a larger chunk size, a chunk might end up with too many documents to be eligible for migration (a chunk with more than 250,000 documents cannot be migrated).
  • Automatic chunk splitting is only triggered on data writes. If you reduce the chunk size, the system needs some time to split existing chunks down to the new size.
  • A chunk can only be split, not merged. Even if you increase the chunk size value, the number of existing chunks will not decrease; instead, existing chunks grow with new data writes until they reach the new target size.

Q. What kind of optimizations could be done to optimize chunk migration?
A.
For hashed sharding, the initial number of chunks can be specified when calling shardCollection() on a brand new (empty) collection. The number of initial chunks can be at most 8192. This process is called pre-sharding.
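
As a hedged example (the database, collection and key names are made up), pre-sharding a brand new, empty collection on a hashed shard key could look like this with pymongo:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://my-mongos:27017")  # hypothetical mongos address

# Enable sharding on the (hypothetical) database.
client.admin.command("enableSharding", "mydb")

# Shard the empty collection on a hashed key and pre-create initial chunks.
# numInitialChunks applies only to hashed shard keys on empty collections
# and cannot exceed 8192.
client.admin.command(
    "shardCollection",
    "mydb.events",
    key={"user_id": "hashed"},
    numInitialChunks=4096,
)
```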

The MongoDB balancer can be configured to turn migration on or off for specific collections, or to run migrations only within a specified time window.

Also, as mentioned earlier, to reduce the network bandwidth consumed by chunk migration, a single shard participates in only one migration at a given point in time.

Q. Is pre-sharding applicable for range based sharding?
A.
Pre-sharding won’t work for range based sharding because while creating the chunks on the collection, the system has no idea about what is the range (min key, max key) for the initial chunks. The chunk min, max range are dynamically decided by the system. Hence many chunks may remain empty.

Q. Can shard key be null or absent in a document?
A.
MongoDB allows a null shard key value. If a document does not contain the shard key field, it falls into the same chunk as documents with a null shard key.

Consistent Hash Sharding

Consistent hashing is a fancy term in tech that almost every engineer knows, and there are many materials on it already. But let's discuss a few critical things about it.

We already know some prominent problems with plain hash based sharding (the first sharding strategy we discussed); as a reminder:

  • If the number of nodes changes (when a node is added to or removed from the cluster), the assignment of keys to nodes changes drastically because a key's position is decided by hash(key) % n. Hence smooth horizontal scaling (up or down) is a problem.
  • This near-total key redistribution is unnecessary and wastes network bandwidth in the cluster while data is moved around.
  • Also, depending on how the system is implemented, uneven hash key distribution can create hot-spots.
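
To see how drastic this reshuffle is with plain hash-mod placement, here is a small, self-contained Python experiment (the key set and hash function are arbitrary choices for illustration):

```python
import hashlib

def node_for(key: str, n: int) -> int:
    # Plain hash-mod placement: hash(key) % n decides the owning node.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % n

keys = [f"user-{i}" for i in range(100_000)]

# Grow the cluster from 4 to 5 nodes and count keys whose owner changes.
moved = sum(1 for k in keys if node_for(k, 4) != node_for(k, 5))
print(f"{moved / len(keys):.0%} of keys change owner")  # roughly 80% move
```

Adding just one node forces roughly 4 out of every 5 keys to move, which is exactly the problem consistent hashing addresses next.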

Consistent hashing comes to the rescue. It:

  • primarily minimizes key redistribution when cluster membership changes.
  • helps balance the load (read / write traffic) across nodes, minimizing the chance of hot-spots.
  • thus supports incremental scalability of the system.

In consistent hashing, the data distribution does not directly depend on the number of nodes. The hash space for keys is huge and constant; typically, the hash range could be something like [0, 2¹²⁸ - 1].

If there are x keys and n nodes in the system (n counted after a new node is added or an existing node is removed), theoretically only about x / n keys need to be moved around:

  • If a new node is added, roughly x / n keys move to the new node from adjacent nodes.
  • If a node is removed, adjacent nodes absorb its roughly x / n keys.

Suitable Systems: Large scale systems requiring petabyte scale storage or more and low latency read or write operations.

Real Life Example: Cassandra

Cassandra and DynamoDB follow a token ring based consistent hashing approach.

  • A ring is theoretically nothing but a circular buffer. You can visualize a circular linked list or a circular array where the first and last node or index meet.
  • A token is nothing but the hash of a key: the output produced by a hash function applied to the given key. Tokens are evenly and randomly distributed across the ring.
  • On this ring, physical nodes are placed evenly. Visualize designated nodes in the linked list, or certain indices in the array, holding references to physical nodes (address and other metadata). The idea is that all the physical nodes should own a roughly equal share of tokens. As you can see in the following picture, there are 8 nodes n1, n2, ..., n8 evenly placed on the ring. Tokens that fall between n1 and n2 are stored on n2, tokens between n5 and n6 are stored on n6, and so on.
  • Dynamo-like systems are peer-to-peer: every node is a peer responsible for its token range. The node that accepts a write request for a key acts as a coordinator, and to maintain high availability the data is replicated to multiple other nodes in a specific direction (typically clock-wise), depending on the configured replication factor of the system.
    In the following example, when n2 locally accepts a write request for any key in the token range (t1, t2], it also replicates the data to n3 and n4; essentially, n2 looks for a configured number of distinct physical nodes in a clock-wise manner and replicates the data to those nodes.
Courtesy: cassandra.apache.org
  • Hence, every node is primarily responsible for hosting the range of keys determined by its token range, and secondarily responsible for hosting replicas of keys owned by adjacent nodes.

Q. What happens when a new physical node is added in the system?
A.
If a new node, say n9, is added between n1 and n2, the token range (t1, t2] has to be divided into two parts: (t1, t9] and (t9, t2]. Keys with tokens in the range (t1, t9] now move to n9, whereas keys with tokens in (t9, t2] continue to go to n2. But adding the new node introduces a problem: in this scheme, the token range is fixed per physical node. The moment you add a new node within some token range, that range is divided while the other ranges remain as they are, which means the ranges no longer share equal load. Assuming your hash function distributes keys homogeneously across the ring, the new token ranges (t1, t9] and (t9, t2] are bound to take less load than the other ranges.

The same happens when a node is removed from the system. If n2 is removed, the existing ranges (t1, t2] and (t2, t3] merge into one range (t1, t3], making n3 overloaded and the new range (t1, t3] completely imbalanced.

Q. What is an easy way to tackle the above scenario?
A.
One way is to keep each fixed token range small by adding as many physical nodes as needed. But this is not a very feasible solution: each physical node costs money, and not every system needs that many physical machines.

To tackle such cases, Dynamo like systems use virtual nodes or vnodes.

A vnode is a logical identifier for a node. Multiple vnodes can be mapped to the same physical node. Instead of placing physical nodes on the ring at fixed points, we place vnodes frequently, each covering a smaller key range, which load balances the system much better.

In the following image, each node has 3 virtual nodes: A has A0, A1 and A2; B has B0, B1 and B2; and so on. Instead of placing physical nodes on the ring, we place these virtual nodes in such a manner that the key range between any two virtual nodes is small.

Courtesy: https://liuzhenglaichn.gitbook.io/

If we add a new server D, we essentially add new virtual nodes like D0, D1 and D2 at different places in the ring. Small key ranges across the ring get moved to the physical node D.

Courtesy: https://liuzhenglaichn.gitbook.io/

Now, if the physical node C is to be removed, only the key ranges mapped to the virtual nodes C0, C1 and C2, present at discrete places on the ring, have to be remapped to the next virtual nodes in the clock-wise direction. Essentially, the key ranges that shift to the next virtual nodes (and hence to other physical nodes) are very small, minimizing load imbalance.

Courtesy: https://liuzhenglaichn.gitbook.io/

Q. What happens if a physical node abruptly goes down?
A.
The system maintains internal metadata about which virtual node is mapped to which token range. When a node goes down abruptly, the data from its replicas (a separate replica map per physical node can also be maintained as metadata) can be retrieved and handed over to the physical nodes corresponding to the adjacent virtual nodes.

Q. Do all physical servers have the same number of virtual nodes?
A.
Not necessarily. If a physical machine is bigger (has more resources) than other machines, more virtual nodes can be mapped to it.

Q. What happens when a key / token range becomes very hot?
A.
The whole purpose of consistent hashing is to minimize hot shards and data redistribution by using virtual nodes and good hash functions. If a range still becomes hot, one option is to add more virtual nodes in the hot key range; another is to vertically scale up the affected physical machines, up to an allowed limit.

Q. How can we implement consistent hashing? Are there any popular frameworks for it?
A.
Here is a good example: you can take a look at how Ably implemented it.

In general, any data structure that supports binary search, e.g. a TreeMap in Java or a sorted array, can be used to implement consistent hashing in memory.

You can also take a look at libraries like Twemproxy which support consistent hashing.
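
As a minimal in-memory sketch (not Ably's, Cassandra's or Twemproxy's implementation), the ring below keeps vnode tokens in a sorted list and locates owners with binary search, mirroring what a Java TreeMap's ceiling lookup would do. The hash function, vnode count and node names are arbitrary choices.

```python
import bisect
import hashlib

def _token(value: str) -> int:
    # Hash a string to a point on the ring.
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, vnodes_per_node: int = 8):
        self.vnodes_per_node = vnodes_per_node
        self._tokens: list[int] = []       # sorted vnode tokens
        self._owner: dict[int, str] = {}   # vnode token -> physical node

    def add_node(self, node: str) -> None:
        # Place several virtual nodes for this physical node on the ring.
        for i in range(self.vnodes_per_node):
            t = _token(f"{node}#{i}")
            bisect.insort(self._tokens, t)
            self._owner[t] = node

    def remove_node(self, node: str) -> None:
        for i in range(self.vnodes_per_node):
            t = _token(f"{node}#{i}")
            self._tokens.remove(t)
            del self._owner[t]

    def get_node(self, key: str) -> str:
        # Walk clockwise: the first vnode token >= hash(key) owns the key,
        # wrapping around to the start of the ring if necessary.
        idx = bisect.bisect_left(self._tokens, _token(key)) % len(self._tokens)
        return self._owner[self._tokens[idx]]

    def get_replicas(self, key: str, replication_factor: int = 3) -> list[str]:
        # Keep walking clockwise and collect the first N distinct physical nodes.
        replicas: list[str] = []
        start = bisect.bisect_left(self._tokens, _token(key))
        for i in range(len(self._tokens)):
            node = self._owner[self._tokens[(start + i) % len(self._tokens)]]
            if node not in replicas:
                replicas.append(node)
            if len(replicas) == replication_factor:
                break
        return replicas

ring = ConsistentHashRing()
for n in ("A", "B", "C", "D"):
    ring.add_node(n)
print(ring.get_node("user-42"), ring.get_replicas("user-42"))
```

Adding or removing a node only remaps the small token ranges owned by that node's vnodes, which is the incremental scalability property discussed above.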

Choosing a shard / partition key

By now, we have talked a lot about homogeneous distribution of data across shards. So how do we choose a shard key that helps us achieve that?

The following factors are important when identifying a good shard key candidate:

  • High Cardinality: Large number of different possible values. A unique field (a key with a random value) is a good candidate. Ex: user id, email id, phone number, etc. are better choices than the country or city where the user lives. But it's not necessary to choose only unique fields as the shard key; any high cardinality field works. How high is high enough also depends on the business domain.

Sharding Anti-pattern: Using a low cardinality key like product_id, customer_name, city or country as a partition key.

  • Monotonic Key: If you need to use a monotonic key (such as an auto-incremented field) as a shard key, consider a hash based sharding strategy. With a range based strategy, monotonically increasing or decreasing keys are likely to end up in the same shard, resulting in possibly uneven load. Use a field as the shard key in range based sharding only if you are going to run range queries on it.
  • Compound / Composite Key: If a single key does not work, a combination of keys can be used. How to choose the combination depends on the business use cases. Ex: in an e-commerce application, a combination of customer_id, product_id and country_code could be used as a partition key.
  • Adding random number to partition key: For write heavy use cases, one option is to add a small random number (as a prefix or suffix) to the partition key to better distribute the data. Example: if your partition key is a date field like 2022-06-22, you can append a random number within some range, e.g. with a range of [1, 300] your partition keys become 2022-06-22.1, 2022-06-22.2, ..., 2022-06-22.300, which distribute much better across shards. The number does not need to be assigned in a completely random way; it can be generated deterministically from some attribute of the data. For instance, if you have a use case to access an object by id and date, you can hash the id into the range [1, 300] and append the result to the date to form the final partition key. Such a deterministic scheme makes it easy to retrieve individual objects. If you want to access all objects for a given date, you have to fire parallel queries to all shards (2022-06-22.X where X is in [1, 300]). A minimal sketch of this approach appears a bit further below.
  • Frequency: High cardinality alone is not a sufficient signal; the frequency of each key value also matters. If you expect very high read or write frequency for entities with certain shard key values, your system can still face the hot-spot problem. In the worst case, you might not even be able to break the entity set into smaller sets. The following image shows what such skewed data access looks like:
Courtesy: MongoDB

You might now be tempted to choose a low frequency key, but as stated earlier, even distribution of data is a function of the key's cardinality, frequency, randomness, etc.
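
Here is a minimal sketch of the deterministic-suffix idea from the "Adding random number to partition key" bullet above; the field names and the suffix range [1, 300] are just the values used in that example.

```python
import hashlib

SUFFIX_RANGE = 300  # the [1, 300] range from the example above

def partition_key(date: str, object_id: str) -> str:
    # Derive the suffix deterministically from the object id so the same
    # object always maps to the same partition key.
    suffix = int(hashlib.md5(object_id.encode()).hexdigest(), 16) % SUFFIX_RANGE + 1
    return f"{date}.{suffix}"

# Point lookup: the key is recomputable from (id, date).
print(partition_key("2022-06-22", "order-9137"))  # e.g. '2022-06-22.118'

# "All objects for a date" requires fanning out a query per suffix.
all_keys_for_day = [f"2022-06-22.{x}" for x in range(1, SUFFIX_RANGE + 1)]
```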

Optimizing partition key choice
There are often situations where no single partition key solves all use cases: different use cases demand different partition keys, but the database table / collection accepts only one partition key. In such scenarios:

  • Identify the data ingestion and access patterns for all such use cases and rank them by usage. You need to optimize for the most popular use case first: choose a partition key for the table that solves the most popular use case without hitting the hot-spot issue.
  • Then, to solve the rest of the cases, use secondary indexes. Database systems like DynamoDB and Couchbase have Global Secondary Indexes (GSI), where you can define a custom partition key per use case (a hedged sketch follows below).
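
As a hedged illustration of the GSI approach on DynamoDB (the table, index and attribute names are made up, and the sketch assumes an on-demand / PAY_PER_REQUEST table so no provisioned throughput is specified):

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Add a GSI whose partition key serves a secondary access pattern:
# querying orders by product instead of by customer.
dynamodb.update_table(
    TableName="orders",
    AttributeDefinitions=[
        {"AttributeName": "product_id", "AttributeType": "S"},
    ],
    GlobalSecondaryIndexUpdates=[
        {
            "Create": {
                "IndexName": "orders_by_product",
                "KeySchema": [
                    {"AttributeName": "product_id", "KeyType": "HASH"},
                ],
                "Projection": {"ProjectionType": "KEYS_ONLY"},
            }
        }
    ],
)
```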

Sort Key

Dynamo-like systems offer the option to use a sort key. It's an optional field.

In the absence of a sort key, the partition key alone works as the primary key. Hence, each document needs to have a unique partition key.

In the presence of a sort key, the combination of partition key and sort key works as the primary key. Here, the partition key can be non-unique, but the sort key has to be unique within a partition to uniquely identify each document.

The partition key determines which shard handles a particular record, whereas the sort key orders all the records / documents within a given shard. This supports range queries within a shard, as the sketch below illustrates.
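
For example, with a DynamoDB-style table whose partition key is customer_id and sort key is order_date (hypothetical names), a range query stays pinned to a single partition key; a hedged boto3 sketch:

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("orders")  # hypothetical table name

# The partition key pins the query to one logical partition; the sort key
# condition turns it into an efficient range scan within that partition.
response = table.query(
    KeyConditionExpression=(
        Key("customer_id").eq("cust-42")
        & Key("order_date").between("2022-01-01", "2022-06-30")
    )
)
for item in response["Items"]:
    print(item)
```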

Q. Cloud based solutions like DynamoDB usually put a hard limit on space per partition, e.g. the maximum partition (physical shard) size in DynamoDB is 10 GB. What happens if a partition has to grow beyond this limit?
A.
All the records in a partition have the same partition key, but different sort keys. Hence, if a partition has to grow beyond the threshold, a possible option is to split the partition into two halves: both have the same partition key, but their sort key ranges differ. To serve a read query involving that partition key, both partitions then have to be queried. For a write operation, a single partition can be chosen based on the sort key value of the document.

Conclusion

This has been quite a comprehensive article explaining different aspects of sharding: shard allocation, request routing, movement of logical shards across physical nodes, choice of shard keys, different kinds of possible edge cases, etc.

I hope this article helped you gain a good understanding of how different real-life systems implement sharding, and that you can extract the concepts used in these systems to build your own.

Please like and share for more such articles. Feedback is appreciated in the comments section.

References

1. https://hazelcast.com/glossary/sharding/

2. https://blog.yugabyte.com/how-data-sharding-works-in-a-distributed-sql-database/

3. https://medium.com/@jeeyoungk/how-sharding-works-b4dec46b3f6

4. https://docs.couchbase.com/server/current/n1ql/n1ql-language-reference/index-partitioning.html

5. https://docs.couchbase.com/server/current/learn/buckets-memory-and-storage/vbuckets.html

6. https://docs.couchbase.com/php-sdk/current/concept-docs/documents.html

7. https://aerospike.com/products/features/dynamic-cluster-management/

8. https://docs.aerospike.com/server/architecture/data-model

9. https://docs.aerospike.com/apidocs/nodejs/Key.html

10. https://docs.aerospike.com/server/architecture/clustering

11. https://discuss.aerospike.com/t/partition-map-logic-partitioning-algorithm/1828

12. https://docs.aerospike.com/server/architecture/data-distribution

13. https://medium.com/@lagrang09/introduction-to-apache-hbase-part-2-ee00bd1ec9bb

14. https://hbase.apache.org/book.html#arch.catalog.meta

15. https://developpaper.com/deep-understanding-of-hbase-architecture-translation/

16. https://www.mongodb.com/docs/manual/sharding/

17. https://www.alibabacloud.com/blog/an-insight-into-mongodb-sharding-chunk-splitting-and-migration_339556

18. https://www.mongodb.com/docs/manual/reference/method/sh.shardCollection/

19. https://www.mongodb.com/docs/manual/core/sharding-balancer-administration/

20. https://www.mongodb.com/docs/manual/core/sharding-shard-key/#std-label-sharding-shard-key-indexes

21. https://www.mongodb.com/basics/sharding

22. https://www.mongodb.com/docs/manual/core/sharded-cluster-config-servers/

23. https://www.mongodb.com/docs/manual/tutorial/remove-shards-from-cluster/

24. https://www.mongodb.com/docs/manual/core/zone-sharding/#balancer

25. https://www.mongodb.com/docs/manual/reference/config-database/

26. https://www.mongodb.com/docs/manual/core/sharding-choose-a-shard-key

27. https://www.tugberkugurlu.com/archive/redis-cluster-benefits-of-sharding-and-how-it-works

28. https://severalnines.com/database-blog/hash-slot-resharding-and-rebalancing-redis-cluster

29. https://redis.io/docs/reference/cluster-spec/

30. https://blog.fearcat.in/a?ID=01700-00c937a7-4425-4581-bce3-64bf06e90f1c

31. https://scalegrid.io/blog/intro-to-redis-cluster-sharding-advantages-limitations-deploying-and-client-connections/

32. https://blog.fearcat.in/a?ID=01700-00c937a7-4425-4581-bce3-64bf06e90f1c

33. https://www.tugberkugurlu.com/archive/redis-cluster-benefits-of-sharding-and-how-it-works

34. https://quoraengineering.quora.com/MySQL-sharding-at-Quora

35. https://success.outsystems.com/Documentation/How-to_Guides/Infrastructure/Configuring_OutSystems_with_Redis_in-memory_session_storage/Scale_a_Redis_Cluster_by_adding_more_nodes

36. https://stackoverflow.com/questions/55206257/redis-create-cluster-with-redis-cli-non-interactively

37. https://redis.io/docs/manual/scaling/#resharding-the-cluster

38. https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/best-practices.html#GuidelinesForTables.Partitions

39. https://stackoverflow.com/questions/40272600/is-there-a-dynamodb-max-partition-size-of-10gb-for-a-single-partition-key-value

40. https://liuzhenglaichn.gitbook.io/system-design/advanced/consistent-hashing

41. https://www.elastic.co/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster

42. https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-routing-field.html

43. https://aws.amazon.com/premiumsupport/knowledge-center/primary-key-dynamodb-table/

44. https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-partition-key-sharding.html

45. https://aws.amazon.com/blogs/database/choosing-the-right-dynamodb-partition-key/

46. https://www.elastic.co/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster

47. https://blog.cloudera.com/apache-hbase-region-splitting-and-merging/

Kousik Nath

Deep discussions on problem solving, distributed systems, computing concepts, real life systems designing. Developer @Uber. https://in.linkedin.com/in/kousikn