MongoDB Horizontal Scaling through Sharding

There comes a time in many MongoDB database, or any database for that matter, life cycles in which our data outgrows our servers. Either physically outgrows storage capabilities or the data grows so large that performance is degraded. Even scaling our physical servers up with a more powerful CPU, more RAM, or hard drives (vertical scaling) may not be enough. This is where horizontal scaling through sharding comes into practice.

Sharding

What is sharding exactly? It is the practice of distributing data across multiple machines. In MongoDB it supports instances with large data sets and needed high throughput operations. The data is distributed across all shards allowing the workload to be evenly shared. This has the potential for much better efficiency than a single server.

Sharded data distribution example

Each cluster should, for redundancy, also be replica sets with primary and secondary servers.

Sharded Cluster Example

MongoDB data is sharded at a collection level, therefore it isn’t necessary to distribute the entire database across a sharded environment.

Cluster Configuration

There are three components to a sharded cluster that we need. We need the shards themselves, a query router in the way of a mongos server, and configuration or config servers.

The shards are what store the subset of the data. The config servers store all of the metadata about the cluster and which shard houses what data. Data is stored in chunks on each shard and the config server keeps track of all that information.

The mongos, in its role as a query router, acts as the interface between the application and the data. It routes queries and write operations to the appropriate shards. An application, therefore, will only access the data through a mongos, never by touching the data itself. Queries are routed via the mongos to all shards unless it can be determined the data resides on a particular shard.

Broadcast Sharding

There has to be a better way than broadcasting that request to all shards though, right? As you might imagine, this “scatter/gather” approach to querying can result in some long running operations. Well, I kind of eluded to it before by qualifying it with the “unless”, so there is a better way! Enter the shard key.

Shard Keys

A shard key determines how documents in a collection are distributed across the shards. It is an indexed field, or indexed compound fields, that exists in every document in the collection. Recall that MongoDB allows for a flexible schema within the document model. This is one consideration when choosing a shard key; every document must have the indexed field.

If provided with a shard key during a query, the mongos knows how to route the request.

Targeted shard query

This can greatly enhance performance. Choosing a good shard key, however, is very important. I’ve covered what goes into the selection of a good shard key in a different post.

Trade offs to Sharding

As the saying goes, there’s no such thing as a free lunch. Sharding is the same way. Sharding your data sets increases infrastructure complexity as well as maintenance. One solution to help mitigate both of these is to utilize a DBaaS such as Atlas to host your MongoDB data.

If queries are run without including a shard key, the “scatter/gather” approach is used. This can result in slow queries, therefore it is definitely something to remember when writing your applications.

Once a collection is sharded, it cannot be unsharded. Similarly, once a shard key is selected, it cannot be changed. So these steps need to be undertaken with careful planning.

If you are handling things yourself on your own hardware I have briefly discussed some of the tools which can be used to check performance of a sharded collection in previous posts. Specifically in MongoDB CLI Tools and briefly in MongoDB explain() explained.

Wrap Up

When your data has outgrown a single server, sharding is a great approach to keep your database performing well. There are some things to watch out for though. Make sure you choose a good shard key and stay up to date with database maintenance.

There are a lot of MongoDB specific terms in this post. I created a MongoDB Dictionary skill for the Amazon Echo line of products. Check it out and you can say “Alexa, ask MongoDB what is a Shard?” and get a helpful response.


Follow me on Twitter @kenwalger to get the latest updates on my postings.

Facebooktwitterredditlinkedinmail

Overview of the MQTT Protocol

There are several different ways one can utilize to connect an Internet of Things (IoT) device to a network. I have previously discussed some of the setup for a NodeMCU development board to connect to a WiFi network. In an IoT environment full blown network protocols such as HTTP can be a bit heavy. A popular, lighter weight, solution I’d like to discuss the basics of is MQTT.

MQTT

While MQTT used to stand for MQ Telemetry Transport, it is now classified as not being an acronym. So what is it? What does it do? Well, it is a messaging protocol using TCP of TCP/IP fame to provide publish/subscribe services. It was, and is, designed for small, constrained devices and makes design decisions based on those constraints. Concepts which are important in the IoT world, such as memory, bandwidth, latency, power consumption, and network reliability. Let’s focus in on one of the main MQTT concepts, the publish/subscribe pattern.

MQTT Publish/Subscribe Pattern

In a publish/subscribe pattern a client publishes information and another client can subscribe to the information it wants. In many cases there is a broker between the clients who facilitates and/or filters the information. This allows for a loose coupling between entities.

The decoupling can occur in a few different ways, space, time, and synchronization.

  • Space – the subscriber doesn’t need to know who the publisher is, for example by IP address, and vice-versa
  • Time – the two clients don’t have to be running at the same time
  • Synchronization – Publishing and receiving doesn’t halt operations

Through the filtering done on the broker not all subscribers have to get the same messages. The broker can filter on subject, content, or type of message. A client, therefore could subscribe to only messages about temperate data. Or only messages with content about centrifuge machines. Or, perhaps, we only want to receive information about specific types of errors.

Thinking about an IoT situation with, for example, a NodeMCU device with some environmental sensors connected to it such as a TMP36 temperature sensor and a moisture sensor we could be publishing that data. Some clients that are connected to our broker may be interested in that, others may not. They would subscribe to the information they wanted.

MQTT Publish-Subscribe Model

 

Once connected to the broker the publishing client simply sends its data to the broker. Once there, the broker relays the appropriate data onto the clients who have subscribed for that data. Again, those subscriptions can be filtered. All of this data transferring is done in a light weight fashion designed for small, resource limited devices.

Message Packet

The message packet shown in the above diagram is just an example. Along with the message, or payload, a real packet would include additional information such as a packet ID, topic name, quality of service (QoS) level. Also included in the packet would be flags so the broker knows how long to retain the message and if the message is a duplicate.

Wrap Up

This is just the tip of the iceberg for MQTT, there are several other features of interest as well. Features such as Retained Messages, Quality of Service, Last Will and Testament, Persistent Sessions, and SYS Topics. It should also come as no surprise given the importance of security in today’s world, that MQTT has security features as well.


Follow me on Twitter @kenwalger to get the latest updates on my postings on IoT topics. If you enjoyed this article, or have questions, leave comments below.

Facebooktwitterredditlinkedinmail