Choosing a good Shard Key in MongoDB

I discussed, at a high level, the concept of sharding in MongoDB in a previous post. I mentioned that I would talk about what makes for a good shard key later on, and here we are. Before we talk about how to choose a good shard key, let’s first discuss its purpose.

When do we need to shard?

There are three typical reasons that we will want to shard a collection.

  1. The write workload on a single server exceeds that server’s capacity.
  2. The working set no longer fits into RAM.
  3. A single server can no longer handle the size of the dataset.

However, it is better to shard early in an application’s lifecycle. Sharding an existing collection, especially a large one, comes with significant performance considerations.

Why do we need a Shard Key?

The shard key determines how data, specifically a collection’s documents, will be distributed across a sharded cluster in MongoDB. In a sharded collection, MongoDB uses the shard key to partition the data into ranges of key values. The ranges do not overlap, and each range is assigned to a specific chunk on the cluster. A sharded cluster generally tries to keep the chunks evenly distributed across the shards.
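To make the idea of non-overlapping ranges a bit more concrete, here is a minimal Python sketch. It is not how MongoDB implements chunks internally; the split points and the `chunk_for` helper are made up for illustration, and the shard key is assumed to be a plain number.

```python
from bisect import bisect_right

# Hypothetical split points for a numeric shard key. They define four
# half-open, non-overlapping ranges:
#   (-inf, 100), [100, 200), [200, 300), [300, +inf)
SPLIT_POINTS = [100, 200, 300]


def chunk_for(key_value, split_points=SPLIT_POINTS):
    """Return the index of the chunk whose range contains key_value."""
    # bisect_right counts how many split points are <= key_value,
    # which is exactly the index of the owning range.
    return bisect_right(split_points, key_value)


for value in (5, 100, 250, 999):
    print(value, "-> chunk", chunk_for(value))
```

Because the ranges are half-open, a value sitting exactly on a split point (100 here) belongs to exactly one chunk, which is what keeps the ranges from overlapping.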

Sharding Example with Shard Key

As an application requests data, the query router (the mongos server) directs the query to the shard holding the appropriate chunk. It does the same for write requests: based on the shard key, it directs each write to the appropriate chunk.

Choosing a Shard Key

As with MongoDB schema design considerations, choosing a shard key is heavily dependent on your application. The data that is most frequently read from and/or written to the database should be a key contributing factor. Depending on the application, perhaps using a hashed ObjectId would be enough. In other applications, a single field may not be enough and a compound shard key will be necessary.
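As a rough sketch of what those two options look like in practice, the Python below builds the documents for MongoDB’s shardCollection admin command, once with a hashed _id and once with a compound key. The namespace `shop.orders` and the fields `customer_id` and `order_date` are invented for the example, and the two commands are alternatives for the same collection, not steps to run together. Nothing is sent to a server here.

```python
def shard_collection_command(namespace, key):
    """Build the document for MongoDB's shardCollection admin command."""
    # The command name has to be the first key; dicts keep insertion order.
    return {"shardCollection": namespace, "key": key}


# Option 1: hashed single-field key on _id.
hashed_cmd = shard_collection_command("shop.orders", {"_id": "hashed"})

# Option 2: compound ranged key, customer first, then order date.
compound_cmd = shard_collection_command(
    "shop.orders", {"customer_id": 1, "order_date": 1}
)

print(hashed_cmd)
print(compound_cmd)

# With a driver such as PyMongo and a connection to a mongos, one of these
# would be sent roughly like this (not executed here):
#   client.admin.command(hashed_cmd)
```

The order of the fields in the compound key matters, since the first field is the one that does most of the work when routing queries.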

The choice of shard key impacts the cluster balancer and the distribution of data across the shards. This has a large impact on sharded cluster performance and efficiency. An ideal shard key will allow for an even distribution of documents throughout the entire cluster. Beyond application-specific requirements, there are a few technical factors to consider when choosing a shard key. Historically, a shard key could not be changed once chosen for a collection, and even in newer MongoDB releases that support resharding (5.0 and later) changing it is an expensive operation, so these considerations are important.

Technical Considerations

The first consideration is cardinality. A shard key with high cardinality allows for better horizontal scaling. It does not, however, guarantee even data distribution across a sharded cluster.
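One way to see why cardinality matters: all documents with the same shard key value end up in the same chunk, so the number of distinct values puts a ceiling on how many chunks there can be. The small Python sketch below just counts distinct values; the continent and customer data are hypothetical.

```python
def max_chunks(shard_key_values):
    """Upper bound on chunks: at most one per distinct shard key value."""
    return len(set(shard_key_values))


# Hypothetical data: 10,000 documents keyed by continent vs. by customer id.
continents = ["Africa", "Asia", "Europe", "North America", "South America",
              "Oceania", "Antarctica"]
by_continent = [continents[i % len(continents)] for i in range(10_000)]
by_customer = [f"customer-{i}" for i in range(10_000)]

print(max_chunks(by_continent))  # 7
print(max_chunks(by_customer))   # 10000
```

With the continent key, adding an eighth shard would buy nothing, because there are only seven possible chunks to hand out.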

The frequency of values is another factor in a good shard key. If the shard key field is, say, user_name, and the data set holds genealogical data for the Alger family, it would be expected that “Alger” occurs much more frequently than other names. This would cause the balance of documents to be uneven across the shards. One wants to choose a key whose values each occur with low frequency, but that by itself won’t guarantee even data distribution either.
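A quick way to check for this kind of skew before committing to a key is to count how often each value occurs. Here is a small Python sketch; the list of names is made-up sample data standing in for the user_name field.

```python
from collections import Counter


def top_value_share(shard_key_values):
    """Return the most frequent value and the fraction of documents it holds."""
    counts = Counter(shard_key_values)
    value, count = counts.most_common(1)[0]
    return value, count / len(shard_key_values)


# Hypothetical genealogy data set: most records carry the same surname.
user_names = ["Alger"] * 800 + ["Smith"] * 120 + ["Jones"] * 80

value, share = top_value_share(user_names)
print(value, share)  # Alger 0.8
```

If a single value holds 80% of the documents, then 80% of the documents have to sit in one chunk, no matter how many shards are available.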

High cardinality and low frequency are indeed important factors in shard key selection. The rate of change is the third leg of the shard key selection stool. Selecting a key that does not increase, or decrease, monotonically is important as well. A key that is always increasing results in every insert being routed to the chunk with maxKey as its upper bound, so the shard holding that chunk takes all of the new writes and becomes unbalanced. Likewise, a key that is always decreasing sends every insert to the chunk with minKey as its lower bound.
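The effect of a monotonically increasing key is easy to simulate. The Python sketch below uses made-up split points and a simple counter as the key; it ignores chunk splitting and balancing, which would eventually move data around but would not change where the new inserts land.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical split points; the last chunk runs from 3000 up to maxKey.
SPLIT_POINTS = [1000, 2000, 3000]


def chunk_for(key_value):
    """Index of the range-based chunk that owns key_value."""
    return bisect_right(SPLIT_POINTS, key_value)


# New documents get ever-increasing keys, e.g. a counter or a timestamp.
new_keys = range(3001, 3501)
inserts_per_chunk = Counter(chunk_for(k) for k in new_keys)
print(inserts_per_chunk)  # Counter({3: 500})
```

All 500 inserts hit the last chunk, so one shard does all of the write work while the others sit idle.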

Each of these three legs of our stool contributes to data distribution. We need to consider each factor when selecting a key.

Example Shard Keys

Now that we have seen some of the factors that play into sharding performance, let’s look at some examples. To borrow a Hollywood movie title, our keys can be The Good, the Bad & the Ugly as well.

Shard keys must be based on an indexed field in the collection which exists across all documents. The supporting index can also be a compound index, as long as the shard key is a prefix of that index. I covered indexing in MongoDB previously, as it is a topic all its own. The things to think about when indexing also impact the choice of shard key.
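The prefix rule can be expressed in a few lines of Python. This is a simplified check, not MongoDB’s own validation: it treats key patterns as ordered dicts and requires the field names and index types to match exactly. The field names are hypothetical.

```python
def is_index_prefix(shard_key, index_key):
    """True if shard_key's fields are the leading fields of index_key, in order."""
    shard_fields = list(shard_key.items())
    index_fields = list(index_key.items())
    return index_fields[:len(shard_fields)] == shard_fields


compound_index = {"customer_id": 1, "order_date": 1}

print(is_index_prefix({"customer_id": 1}, compound_index))  # True
print(is_index_prefix({"order_date": 1}, compound_index))   # False
print(is_index_prefix({"_id": "hashed"}, {"_id": 1}))       # False
```

The second case fails because order_date is not the leading field of the index, and the third because a hashed key is not supported by a regular ascending index in this simplified model.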

The Ugly

I would say that ugly shard keys are those that not only don’t help your application but actually result in worse performance than what you started with. This would be something akin to using the _id field without hashing. Since the default ObjectId value of _id increases monotonically, it would lead to an unbalanced system.
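To see what hashing buys you, here is a small Python comparison of where the most recent inserts land under range and hash partitioning. It is only a sketch: plain integers stand in for ObjectId values, md5 stands in for MongoDB’s own hash function, and a simple modulo stands in for ranges over the hashed values.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 4


def ranged_shard(value, total=1000):
    """Range partitioning: consecutive ids share a shard."""
    return min((value - 1) * NUM_SHARDS // total, NUM_SHARDS - 1)


def hashed_shard(value):
    """Hash partitioning: md5 is an assumption, not MongoDB's actual hash."""
    digest = hashlib.md5(str(value).encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


latest = range(901, 1001)  # the 100 most recent of 1,000 increasing ids
ranged_counts = Counter(ranged_shard(v) for v in latest)
hashed_counts = Counter(hashed_shard(v) for v in latest)

print("ranged:", dict(ranged_counts))
print("hashed:", dict(hashed_counts))
```

With the ranged key, every recent insert lands on the last shard. With the hashed key, the same inserts are spread over the shards, at the cost of losing efficient range queries on _id.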

The Bad

I would categorize as bad those shard keys that don’t fully take advantage of the concept of sharding. For example, a collection’s documents may be distributed across the shards and chunks while the application’s read requests typically use an index different from the shard key index. Such keys may not have a negative impact on physical data distribution, but they just aren’t properly optimized.

This results in a scatter/gather approach to the queries. The mongos server will scatter the request to all shards, then gather the results together to send to the application. With a well-chosen key, the query would instead be targeted at a single shard.
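Whether a query can be targeted comes down to whether it includes the shard key, or at least the leading field(s) of a compound shard key. The Python sketch below applies that rule in a simplified form. The shard key and the queries are hypothetical, and real routing also depends on the kind of condition placed on each field.

```python
SHARD_KEY = ["customer_id", "order_date"]  # hypothetical compound shard key


def matched_prefix(query, shard_key=SHARD_KEY):
    """Return the leading shard key fields that the query constrains."""
    prefix = []
    for field in shard_key:
        if field not in query:
            break
        prefix.append(field)
    return prefix


def routing(query, shard_key=SHARD_KEY):
    """Classify a query as targeted or scatter/gather."""
    return "targeted" if matched_prefix(query, shard_key) else "scatter/gather"


print(routing({"customer_id": 42}))           # targeted
print(routing({"order_date": "2018-03-01"}))  # scatter/gather
print(routing({"status": "shipped"}))         # scatter/gather
```

Note that the second query uses a field that is part of the shard key but not its prefix, so it still has to go to every shard.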

The Good

This is where we all want to live, right? In a highly performant system environment with everything highly scaled and efficient. This can be achieved by selecting a shard key for a collection which optimizes an application’s needs with the technical strategies listed above.

Wrap Up

Choosing a correct shard key and developing a sharding strategy is not always easy. Don’t make the mistake of assuming that because one application is sharded in a particular way, with a particular shard key, that approach is universal. Your application’s strategy may be entirely different, or sharding may not be the right solution for a particular collection.

Remember though, that it is important to establish a shard key and split your data early on. Sharding becomes more challenging in production with large data sets.

There are a lot of MongoDB specific terms in this post. I created a MongoDB Dictionary skill for the Amazon Echo line of products. Check it out and you can say “Alexa, ask MongoDB what is a Shard Key?” and get a helpful response.


Follow me on Twitter @kenwalger to get the latest updates on my postings.


MongoDB Horizontal Scaling through Sharding

There comes a time in the life cycle of many MongoDB databases, or any database for that matter, when our data outgrows our servers. Either it physically outgrows the storage capabilities or it grows so large that performance is degraded. Even scaling our physical servers up with a more powerful CPU, more RAM, or more disk space (vertical scaling) may not be enough. This is where horizontal scaling through sharding comes into practice.

Sharding

What is sharding exactly? It is the practice of distributing data across multiple machines. In MongoDB, it supports deployments with very large data sets and high-throughput operations. The data is distributed across the shards, allowing the workload to be shared among them. This has the potential for much better efficiency than a single server.

Sharded data distribution example

Each shard should, for redundancy, also be a replica set with primary and secondary servers.

Sharded Cluster Example

MongoDB data is sharded at the collection level. It therefore isn’t necessary to distribute the entire database across a sharded environment.

Cluster Configuration

There are three components that we need for a sharded cluster: the shards themselves, a query router in the form of a mongos server, and configuration (config) servers.

The shards each store a subset of the data. The config servers store all of the metadata about the cluster, including which shard houses what data. Data is stored in chunks on each shard and the config servers keep track of all that information.

The mongos, in its role as a query router, acts as the interface between the application and the data. It routes queries and write operations to the appropriate shards. An application, therefore, will only access the data through a mongos, never by connecting to a shard directly. Queries are routed via the mongos to all shards unless it can be determined that the data resides on a particular shard.
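Here is a toy version of that routing decision in Python. It is only a sketch under simple assumptions: a single numeric shard key called `user_id`, three made-up chunks, equality queries only, and a hard-coded table in place of the metadata that config servers would hold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    low: float   # inclusive lower bound of the shard key range
    high: float  # exclusive upper bound
    shard: str   # shard that currently owns the chunk


# Hypothetical chunk metadata of the kind config servers keep track of.
CHUNKS = [
    Chunk(float("-inf"), 100, "shardA"),
    Chunk(100, 200, "shardB"),
    Chunk(200, float("inf"), "shardC"),
]
SHARD_KEY = "user_id"


def route(query):
    """Return the shards a mongos-like router would contact for a query."""
    if SHARD_KEY in query:
        value = query[SHARD_KEY]
        return [c.shard for c in CHUNKS if c.low <= value < c.high]
    # No shard key in the query: every shard has to be asked.
    return sorted({c.shard for c in CHUNKS})


print(route({"user_id": 150}))   # ['shardB']
print(route({"name": "Alger"}))  # ['shardA', 'shardB', 'shardC']
```

The first query names the shard key, so only one shard is contacted. The second does not, so the request goes out to all three.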

Broadcast Sharding

There has to be a better way than broadcasting that request to all shards though, right? As you might imagine, this “scatter/gather” approach to querying can result in some long-running operations. Well, I kind of alluded to it before by qualifying it with the “unless”, so there is a better way! Enter the shard key.

Shard Keys

A shard key determines how documents in a collection are distributed across the shards. It is an indexed field, or indexed compound fields, that exists in every document in the collection. Recall that MongoDB allows for a flexible schema within the document model. This is one consideration when choosing a shard key; every document must have the indexed field.
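Since the schema is flexible, it can be worth checking existing documents for the candidate shard key field before sharding. A minimal Python sketch, with made-up documents and `user_id` as the assumed shard key:

```python
def missing_shard_key(documents, shard_key_fields):
    """Return the documents that lack at least one shard key field."""
    return [doc for doc in documents
            if any(field not in doc for field in shard_key_fields)]


# Hypothetical documents from a collection with a flexible schema.
docs = [
    {"_id": 1, "user_id": 10, "name": "Alger"},
    {"_id": 2, "name": "Smith"},  # no user_id
    {"_id": 3, "user_id": 12},
]

print([d["_id"] for d in missing_shard_key(docs, ["user_id"])])  # [2]
```

Any documents this turns up would need the field added, or a different key chosen, before the collection is sharded on it.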

If provided with a shard key during a query, the mongos knows how to route the request.

Targeted shard query

This can greatly enhance performance. Choosing a good shard key, however, is very important. I’ve covered what goes into the selection of a good shard key in a different post.

Trade-offs to Sharding

As the saying goes, there’s no such thing as a free lunch. Sharding is the same way. Sharding your data sets increases infrastructure complexity as well as maintenance. One solution to help mitigate both of these is to utilize a DBaaS such as Atlas to host your MongoDB data.

If queries are run without including a shard key, the “scatter/gather” approach is used. This can result in slow queries, so it is definitely something to remember when writing your applications.

Historically, once a collection was sharded it could not be unsharded, and once a shard key was selected it could not be changed. Newer MongoDB releases relax this (resharding arrived in 5.0), but the operation is expensive. So these steps need to be undertaken with careful planning.

If you are handling things yourself on your own hardware, I have briefly discussed some of the tools which can be used to check the performance of a sharded collection in previous posts, specifically in MongoDB CLI Tools and briefly in MongoDB explain() explained.

Wrap Up

When your data has outgrown a single server, sharding is a great approach to keep your database performing well. There are some things to watch out for though. Make sure you choose a good shard key and stay up to date with database maintenance.

There are a lot of MongoDB specific terms in this post. I created a MongoDB Dictionary skill for the Amazon Echo line of products. Check it out and you can say “Alexa, ask MongoDB what is a Shard?” and get a helpful response.

