Schema Design Considerations in MongoDB

I’ve previously touched on some of the benefits and a few examples of how to do schema design in MongoDB. One often raised question when it comes to modeling data in MongoDB is how best to handle data schema in a non-relational database. I’d like to explore in more depth some of the considerations required for effective schema design for MongoDB implementations.

One of the key things to remember when modeling your data in MongoDB is how the application is going to use the data. Your data access patterns should be of foremost thought when designing your data model. Unlike data normalization concerns in relational databases, embedding data in a document often provides better performance.

When, however, does one decide to embed documents inside another document? What are some of the considerations for doing so when thinking about schema design?

Types of Relationships

In the relational database world modeling different relationships comes down to examining how to model “One-to-N” relationships and the normalization of data. In MongoDB, there are a variety of ways to model these relationships. When doing schema design in MongoDB there is more to consider than a blanket model for a “One-to-N” relationship model.

We need to consider the size of “N” for our modeling because in this instance, size matters. One-to-one relationships can easily be handled with embedding a document inside another document. But what happens if “N” grows? Let’s have a look at the following cases, “One-to-few”, “One-to-Many”, and “One-to-Tons”.

One-to-Few

This is a pretty common occurrence, even in the relational database world. A single record that needs to be associated with a relatively small number of other data points. Something like keeping customer information and their associated phone numbers or addresses. We can embed an array of information inside the document for the customer.

{ 
  "_id" : ObjectId("56cb1cfb72d245023179fda4"),
  "name" :  "Harvey Waldrip",
  "phone" : [
     { "type" : "mobile", "number" : "503-555-5555" }, 
     { "type" : "home", "number" : "503-555-1111"}
  ]
}

This showcases the benefits, and drawbacks, of embedding. We can easily get the embedded information with a single query. The downside, however, is that the embedded can’t be accessed as autonomous data.

One-to-Many

“Many” here covers up to a few thousand or so in number. Say that we are modeling a product made up of smaller parts. For example, if we had an electronic parts kit each part in the kit could be referenced as a separate part.

{ 
  "_id" : ObjectId("AAAA"),
  "part_no" : "150ohm-0.5W"
  "name" : "150ohm 1/2 Watt Resistor"
  "qty" : 1
  "cost" : { NumberDecimal("0.13"), currency: "USD" }
}

Each piece in the kit would have its own document. Notice the format of the “cost” value, I discussed that in a post on Modeling Monetary Data in MongoDB. Each final product, or kit in our example, will contain an array of references to the necessary parts.

{
  "_id" : ObjectId("57d7a121fa937f710a7d486e"),
  "manufacturer" : "Elegoo",
  "catalog_number" : 123789,
  "parts" : [
     ObjectID("AAAA"),
     ObjectID("AAAB"),
     ObjectID("G9D6"),
     ...
  ]
}

Now we can utilize an application level join or depending on the use case the $lookup aggregation pipeline operator to get information about specific parts in a kit. For best performance, we need to make sure we have proper indexes in place on our collections as well.

This style of reference allows for quick and easy search and updating of the parts in the kit. It has basically become an “N-to-N” schema design without needing a separate join table. Pretty slick.

One-to-Tons

As I mentioned, “One-to-Many” is great for up to several thousand references. But what about cases when that isn’t enough?  Further, what if the referencing poses schema design concerns around the document limitation of 16MB? This is where parent referencing becomes very useful.

Let’s imagine an event log situation. We would have a document for the host machine and store that host machine in the log message documents.

Host

{ "_id" : "Bunyan", 
  "name" : "logger.lumberjack.com", 
  "ip_address" : "127.55.55.55"
}

Message

{ "_id" : "MongoDB", 
  "time" : ISODate("2017-08-29T17:25:00.000Z"),
  "message" : "Timber!!!", 
  "host" : ObjectId("Bunyan")
}

Again, for optimum searching, we would want to make sure indexes are properly in place.

Schema Design – Key Considerations

Now that we have seen some of the schema design options, how do we determine which is the best one to utilize? There are a few things to think about before choosing and have somewhat become the standard questions to ask when doing schema design in MongoDB.

Golden Rules for MongoDB Schema Design
  1. Unless there is a compelling reason to do so, favor embedding.
  2. Needing to access an object on its own is a compelling reason to not embed the object in another document.
  3. Unbounded array growth is a bad design.
  4. Don’t be afraid of joins on the application side. A properly indexed collection and query can have highly performant results.
  5. When denormalizing your data, consider the read to write ratio of the application.
  6. Finally, how you model your data depends on your application’s data access patterns. Match your schema design to how your application reads and writes the data.

Wrap Up

There are some great references available for designing your schemas in MongoDB. Some of my favorites are MongoDB Applied Design Patterns and MongoDB in Action. While I have not seen or read it, The Little Mongo DB Schema Design Book looks like a promising resource as well.

Juan Roy has a nice slide deck available on this topic as well. Definitely worth having a look.

There are several MongoDB specific terms in this post. I created a MongoDB Dictionary skill for the Amazon Echo line of products. Check it out and you can say “Alexa, ask MongoDB what is a document?” and get a helpful response.


Follow me on Twitter @kenwalger to get the latest updates on my postings.

Facebooktwittergoogle_plusredditlinkedinmail

Why NoSQL, and what are some of the options?

Have you noticed how locked into a particular technology developers get? Python vs. Java, Angular vs. React, Windows vs. Mac, cats vs. dogs… all topics and subjects that are capable of starting a good “flame war” on most any discussion outlet. Alas, the same goes for relational databases (RDBMS) versus non-relational databases. A large percentage of RDBMS utilize Structured Query Language (SQL) as their language. Non-relational databases, generally speaking, don’t use SQL. The debate, therefore, has been framed as SQL vs. NoSQL.

SQL and RDMS database design have been around for a long time and are widely used technologies. In fact, they are excellent solutions for many different data models. Does all of your data have the same properties? SQL is great for that. Are you working with complex transactional requests? SQL shines in that environment.

As a real world example, think of your doctor’s office and all of the neatly sorted files in their office. Do they store them in a relational way with all diagnosis of diabetes in this file? Cancer over here, and then include a reference to a patient’s contact information, and another file for everyone with Blue Cross insurance, again with more links? This is the SQL way. Or do they store all of the information about a given patient in one file? That’s an example of a NoSQL data model.

Similarly, in the day and age of data coming in from a variety of devices with a wide range of data types, being able to store data in a more flexible format seems to make a lot of sense. A format which should be dependent on your application’s access pattern and not necessarily a rigidly formatted data model that has to fit into an existing table structure. NoSQL data stores allow for this flexibility. Often allowing for faster application development and iterations as the flexibility of design is carried through from the schema itself down to the data record level.

It should come as no surprise that there are a lot of different formats of NoSQL databases. There are key-value store options such as Redis, graph models such as Neo4j or Giraph, column models such as Cassandra and HBase, multi-model options like Couchbase or MarkLogic, and document data models like IBM Domino and MongoDB. Yes, I know, I left out many examples in each of those models, let the flame wars begin…

What are the strengths and weaknesses of each of these different database models? There are several different ways to evaluate the variety of features each database model offers to determine if it is a good fit for a particular organization or application structure. I’d like to look at two of the larger considerations, the way the data itself is modeled, and the way in which the data is queried.

NoSQL Data Models

Key-Value and Column Model

If we have a look at how key-value models store their data, which is similar to the column model, we see that they fall into a rather basic type of model. Each database item is stored as a key, or attribute name, and is associated with a value or in the case of the column model a multi-dimensional sorted map. This can work very well for unstructured data as the database does not require a set schema across key-value pairs. This can scale very well and have high performance due to simplicity of design and the fact that only the key is of interest to the database.

While this is a fast way to represent data, the values in this design cannot be queried, only the keys can. This, obviously, makes it more challenging to do complex queries and aggregations. Having the ability to only query data by a single key value can be limiting.

Graph Model

Graph models use nodes, edges, and properties to represent data. A good example here is a social network. Each person is a node, their connections to other people are edges, and properties are the information about a given node, such as a person’s name. These databases can require a bit of a learning curve to understand, but work well for things like business supply chains, social networks, or complex hierarchical structures which are often challenging to model in a relational database.

However, it tends to be a rather niche sort of data store. Many traditional sorts of information don’t fit well into the graph data store concept. They perform well when the relationship between data records is the important concept than the actual data itself. Exploration of how many relationships away from someone else you are would be a good use case, i.e. the “6 degrees of Kevin Bacon” problem.

Document Model

Instead of rows and columns, document model databases store their information in, well, documents. If you missed my post on Modeling your data with Documents, it is goes into more detail. Most often the data is stored in a structure similar to JSON (JavaScript Object Notation). This allows data to be stored in a manner familiar to most developers. Further it allows for each piece of data to effectively become an object. A key and important concept which is closely aligned with the familiar object-oriented programming pattern.

Data in a document database can have a dynamic schema. Each document can have different fields and fields can be represented by different data types. This can make it very appealing due to the ease of adding new fields during development. The loss of being able to directly do multi-record transactions and JOIN operations is often overcome by this flexible and dynamic schema ability.

NoSQL Query Models

Since each application has it’s own requirements for data retrieval and storage, determining how your data needs to be retrieved is an important consideration when making a decision on how to store your data. Some applications may have very basic query needs. Others may have complex queries which search for a variety of values on each record.

If we look back at our patient record example, a doctor’s office may have some queries that only require the lookup of a patient’s name. Most of the time, however, they will be looking up additional information such as appointment schedule, patient’s with a specific diagnosis, certain age ranges, etc. Having a robust enough query language and database to handle such requests is important. Further, since data is rarely static and often needs to be updated, having the ability to update records based on one or more fields is important.

KEY-VALUE AND COLUMN MODEL

These systems can search and retrieve information based on a single, or limited number of, keys. Users are often required, in fact encouraged, to develop and maintain their own indexes for more complex queries. Updating records in these systems are often expensive, requiring multiple steps and trips to the database. In fact, it is often required to do an entire rewrite of the record to update, regardless of the size of the update.

GRAPH MODEL

As you might imagine with the requirements of these systems, an abundant query language is needed to explore simple and complex relationships. The query language needs to be able to provide direct and indirect assumptions about the data. Therefore, relationship style analysis is of prime importance here with other, general purpose applications being less commonly implemented.

DOCUMENT MODEL

One of the highlights of many document model databases are rich query languages. This allows for searches to be performed on any field within a document. Often this includes the ability to add secondary indexes to the database to further increase performance. Data updates can frequently be done in a single trip to the database. This can be accomplished with some version of a find and modify method.

Takeaways

At the end of the day then where does this leave us? How can we decide on which non-relational database to use? For starters, I would recommend looking at your specific application needs. If schema flexibility, rapid development, and/or efficient data queries are important, NoSQL is definitely worth considering. That choice being made, the selection of which type of non-relational database is next. For many cases, the document model has the broadest, most developer friendly option set.

Follow me on Twitter @kenwalger and let me know how you are using NoSQL for your application needs.

Facebooktwittergoogle_plusredditlinkedinmail