Building With Patterns: The Outlier Pattern

So far in this Building with Patterns series, we’ve looked at the Polymorphic, Attribute, and Bucket patterns. While the document schema in these patterns has slight variations, from an application and query standpoint, the document structures are fairly consistent. What happens, however, when this isn’t the case? What happens when there is data that falls outside the “normal” pattern? What if there’s an outlier?

Imagine you are starting an e-commerce site that sells books. One of the queries you might be interested in running is “who has purchased a particular book”. This could be useful for a recommendation system to show your customers similar books of interest. You decide to store the user_id of a customer in an array for each book. Simple enough, right?

Well, this may indeed work for 99.99% of the cases, but what happens when J.K. Rowling releases a new Harry Potter book and sales spike in the millions? The 16MB BSON document size limit could easily be reached. Redesigning our entire application for this outlier situation could result in reduced performance for the typical book, but we do need to take it into consideration.

The Outlier Pattern

With the Outlier Pattern, we are working to prevent a few queries or documents from driving our solution toward one that would not be optimal for the majority of our use cases. Not every book will sell millions of copies.

A typical book document storing user_id information might look something like:

{
    "_id": ObjectId("507f1f77bcf86cd799439011"),
    "title": "A Genealogical Record of a Line of Alger",
    "author": "Ken W. Alger",
    …,
    "customers_purchased": ["user00", "user01", "user02"]
}
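As a minimal sketch, assuming a books collection, recording a purchase then amounts to appending the buyer's user_id to that array; in the mongo shell it might look like:

// Hypothetical sketch: record a purchase by appending the buyer's user_id.
// The "books" collection name and "user03" value are assumptions for illustration.
// $addToSet also guards against duplicate entries for the same customer.
db.books.updateOne(
    { _id: ObjectId("507f1f77bcf86cd799439011") },
    { $addToSet: { customers_purchased: "user03" } }
)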

This would work well for the large majority of books that aren’t likely to reach the “best seller” lists. To account for outliers, though, when the customers_purchased array expands beyond the 1,000-item limit we have set, we’ll add a new field to “flag” the book as an outlier.

{
    "_id": ObjectId("507f191e810c19729de860ea"),
    "title": "Harry Potter, the Next Chapter",
    "author": "J.K. Rowling",
    …,
    "customers_purchased": ["user00", "user01", "user02", …, "user999"],
    "has_extras": true
}

We’d then move the overflow information into a separate document linked with the book’s id. Inside the application, we can check whether a document has a has_extras field set to true. If it does, the application retrieves the extra information. This can be handled so that it remains largely transparent to most of the application code.
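As a rough sketch of what that might look like, assuming a separate books_extras collection and a book_id link field (both names are illustrative, not from the original post):

// Hypothetical overflow document holding the purchasers beyond the first 1,000.
{
    "_id": ObjectId("507f191e810c19729de860eb"),
    "book_id": ObjectId("507f191e810c19729de860ea"),
    "customers_purchased": ["user1000", "user1001", "user1002"]
}

// Application-side sketch: fetch the overflow document(s) only when the flag is set.
// bookId stands in for the _id of the book being looked up.
const book = db.books.findOne({ _id: bookId });
if (book.has_extras) {
    const extras = db.books_extras.find({ book_id: book._id }).toArray();
}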

Many design decisions will be based on the application workload, so this solution is intended to show an example of the Outlier Pattern. The important concept to grasp here is that the outliers have a substantial enough difference in their data that, if they were considered “normal”, changing the application design for them would degrade performance for the more typical queries and documents.

Sample Use Case

The Outlier Pattern is an advanced pattern, but one that can result in large performance improvements. It is frequently used in situations when popularity is a factor, such as in social network relationships, book sales, movie reviews, etc. The Internet has transformed our world into a much smaller place and when something becomes popular, it transforms the way we need to model the data around the item.

One example is a customer that has a video conferencing product. The list of authorized attendees in most video conferences can be kept in the same document as the conference. However, there are a few events, like a company’s all hands, that have thousands of expected attendees. For those outlier conferences, the customer implemented “overflow” documents to record those long lists of attendees.

Conclusion

The problem that the Outlier Pattern addresses is preventing a few documents or queries from determining an application’s solution, especially when that solution would not be optimal for the majority of use cases. We can leverage MongoDB’s flexible data model to add a field to the document, “flagging” it as an outlier. Then, inside the application, we handle the outliers slightly differently. By tailoring your schema for the typical document or query, application performance will be optimized for those normal use cases, and the outliers will still be addressed.

One thing to consider with this pattern is that it often is tailored for specific queries and situations. Therefore, ad hoc queries may result in less than optimal performance. Additionally, as much of the work is done within the application code itself, additional code maintenance may be required over time.

In our next Building with Patterns post, we’ll take a look at the Computed Pattern and how to optimize schema for applications that can result in unnecessary waste of resources. If you have questions, please leave comments below.

This post was originally published on the MongoDB Blog.


MongoDB Schema Design Patterns – The Bucket Pattern

In this edition of the Building with Patterns series, we’re going to cover the Bucket Pattern. This pattern is particularly effective when working with Internet of Things (IoT), Real-Time Analytics, or Time-Series data in general. By bucketing data together, we make it easier to organize specific groups of data, discover historical trends, forecast future needs, and optimize our use of storage.

The Bucket Pattern

With data coming in as a stream over a period of time (time series data) we may be inclined to store each measurement in its own document. However, this inclination is a very relational approach to handling the data. If we have a sensor taking the temperature and saving it to the database every minute, our data stream might look something like:


{
   sensor_id: 12345,
   timestamp: ISODate("2019-01-31T10:00:00.000Z"),
   temperature: 40
}

{
   sensor_id: 12345,
   timestamp: ISODate("2019-01-31T10:01:00.000Z"),
   temperature: 40
}

{
   sensor_id: 12345,
   timestamp: ISODate("2019-01-31T10:02:00.000Z"),
   temperature: 41
}

This can pose some issues as our application scales in terms of data and index size. For example, we could end up having to index sensor_id and timestamp for every single measurement to enable rapid access at the cost of RAM. By leveraging the document data model though, we can “bucket” this data, by time, into documents that hold the measurements from a particular time span. We can also programmatically add additional information to each of these “buckets”.
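To illustrate the index cost, here is a rough sketch of the two approaches, assuming a measurements collection (the name is an assumption); with one document per reading, every measurement adds an index entry, while the bucketed design only needs one entry per bucket:

// One document per reading: an index entry for every single measurement.
db.measurements.createIndex({ sensor_id: 1, timestamp: 1 })

// After bucketing: one index entry per bucket (e.g. per sensor, per hour).
db.measurements.createIndex({ sensor_id: 1, start_date: 1 })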

By applying the Bucket Pattern to our data model, we get some benefits in terms of index size savings, potential query simplification, and the ability to use that pre-aggregated data in our documents. Taking the data stream from above and applying the Bucket Pattern to it, we would wind up with:


{
    sensor_id: 12345,
    start_date: ISODate("2019-01-31T10:00:00.000Z"),
    end_date: ISODate("2019-01-31T10:59:59.000Z"),
    measurements: [
        {
            timestamp: ISODate("2019-01-31T10:00:00.000Z"),
            temperature: 40
        },
        {
            timestamp: ISODate("2019-01-31T10:01:00.000Z"),
            temperature: 40
        },
        …
        {
            timestamp: ISODate("2019-01-31T10:42:00.000Z"),
            temperature: 42
        }
    ],
    transaction_count: 42,
    sum_temperature: 2413
}

By using the Bucket Pattern, we have “bucketed” our data into, in this case, a one-hour bucket. This particular data stream would still be growing, as it currently only has 42 measurements; there are still more measurements for that hour to be added to the “bucket”. When they are added to the measurements array, the transaction_count will be incremented and sum_temperature will also be updated.
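A sketch of that update is below, again assuming a measurements collection; $push appends the new reading while $inc maintains the pre-aggregated fields, and upsert: true starts a fresh bucket when the hour rolls over:

// Hypothetical sketch: append a reading to the current hourly bucket and keep
// the pre-aggregated fields up to date. Field names follow the document above.
db.measurements.updateOne(
    { sensor_id: 12345,
      start_date: ISODate("2019-01-31T10:00:00.000Z"),
      end_date: ISODate("2019-01-31T10:59:59.000Z") },
    { $push: { measurements: {
          timestamp: ISODate("2019-01-31T10:43:00.000Z"),
          temperature: 42 } },
      $inc: { transaction_count: 1, sum_temperature: 42 } },
    { upsert: true }
)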

With the pre-aggregated sum_temperature value, it then becomes possible to easily pull up a particular bucket and determine the average temperature (sum_temperature / transaction_count) for that bucket. When working with time-series data, it is frequently more interesting and important to know what the average temperature was from 2:00 to 3:00 pm in Corning, California, on 13 July 2018 than what the temperature was at 2:03 pm. By bucketing and pre-aggregating the data, we can provide that information much more easily.
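For example, a sketch of retrieving a bucket’s average temperature using those pre-aggregated fields (collection name assumed as above):

// Hypothetical sketch: derive the average temperature for one bucket from the
// pre-aggregated fields, with no need to scan the measurements array.
db.measurements.aggregate([
    { $match: { sensor_id: 12345,
                start_date: ISODate("2019-01-31T10:00:00.000Z") } },
    { $project: { avg_temperature:
          { $divide: [ "$sum_temperature", "$transaction_count" ] } } }
])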

Additionally, as we gather more and more information we may determine that keeping all of the source data in an archive is more effective. How frequently do we need to access the temperature for Corning from 1948, for example? Being able to move those buckets of data to a data archive can be a large benefit.

Sample Use Case

One example of making time-series data valuable in the real world comes from an IoT implementation by Bosch. They are using MongoDB and time-series data in an automotive field data app. The app captures data from a variety of sensors throughout the vehicle allowing for improved diagnostics of the vehicle itself and component performance.

Other examples include major banks that have incorporated this pattern in financial applications to group transactions together.

Conclusion

When working with time-series data, using the Bucket Pattern in MongoDB is a great option. It reduces the overall number of documents in a collection, improves index performance, and by leveraging pre-aggregation, it can simplify data access.

The Bucket Design pattern works great for many cases. But what if there are outliers in our data? That’s where the next pattern we’ll discuss, the Outlier Design Pattern, comes into play.

If you have questions, please leave comments below.


This post was originally published on the MongoDB Blog.
