February 2019 | Blog of Ken W. Alger

We’ve looked at various ways of optimally storing data in the Building with Patterns series. Now, we’re going to look at a different aspect of schema design. Just storing data and having it available isn’t, typically, all that useful. The usefulness of data becomes much more apparent when we can compute values from it. What’s the total sales revenue of the latest Amazon Alexa? How many viewers watched the latest blockbuster movie? These types of questions can be answered from data stored in a database but must be computed.

Running these computations every time they’re requested though becomes a highly resource-intensive process, especially on huge datasets. CPU cycles, disk access, memory all can be involved.

Think of a movie information web application. Every time we visit the application to look up a movie, the page provides information about the number of cinemas the movie has played in, the total number of people who’ve watched the movie, and the overall revenue. If the application has to constantly compute those values for each page visit, it could use a lot of processing resources on popular movies

Most of the time, however, we don’t need to know those exact numbers. We could do the calculations in the background and update the main movie information document once in a while. These _computations _then allow us to show a valid representation of the data without having to put extra effort on the CPU.

The Computed Pattern

The Computed Pattern is utilized when we have data that needs to be computed repeatedly in our application. The Computed Pattern is also utilized when the data access pattern is read intensive; for example, if you have 1,000,000 reads per hour but only 1,000 writes per hour, doing the computation at the time of a write would divide the number of calculations by a factor 1000.

In our movie database example, we can do the computations based on all of the screening information we have on a particular movie, compute the result(s), and store them with the information about the movie itself. In a low write environment, the computation could be done in conjunction with any update of the source data. Where there are more regular writes, the computations could be done at defined intervals – every hour for example. Since we aren’t interfering with the source data in the screening information, we can continue to rerun existing calculations or run new calculations at any point in time and know we will get correct results.

Other strategies for performing the computation could involve, for example, adding a timestamp to the document to indicate when it was last updated. The application can then determine when the computation needs to occur. Another option might be to have a queue of computations that need to be done. Selecting the update strategy is best left to the application developer.

Sample Use Case

The Computed Pattern can be utilized wherever calculations need to be run against data. Datasets that need sums, such as revenue or viewers, are a good example, but time series data, product catalogs, single view applications, and event sourcing are prime candidates for this pattern too.

This is a pattern that many customers have implemented. For example, a customer does massive aggregation queries on vehicle data and store the results for the server to show the info for the next few hours.

A publishing company compiles all kind of data to create ordered lists like the “100 Best …”. Those lists only need to be regenerated once in a while, while the underlying data may be updated at other times.

Conclusion

This powerful design pattern allows for a reduction in CPU workload and increased application performance. It can be utilized to apply a computation or operation on data in a collection and store the result in a document. This allows for the avoidance of the same computation being done repeatedly. Whenever your system is performing the same calculations repeatedly and you have a high read to write ratio, consider the Computed Pattern.

We’re over a third of the way through this Building with Patterns series. Next time we’ll look at the features and benefits of the Subset Pattern and how it can help with memory shortage issues. If you have questions, please leave comments below.

Previous Parts of Building with Patterns:

The Polymorphic pattern
The Attribute pattern
The Bucket pattern
The Outlier pattern

This post was originally published on the MongoDB Blog.

So far in this Building with Patterns series, we’ve looked at the Polymorphic, Attribute, and Bucket patterns. While the document schema in these patterns has slight variations, from an application and query standpoint, the document structures are fairly consistent. What happens, however, when this isn’t the case? What happens when there is data that falls outside the “normal” pattern? What if there’s an outlier?

Imagine you are starting an e-commerce site that sells books. One of the queries you might be interested in running is “who has purchased a particular book”. This could be useful for a recommendation system to show your customers similar books of interest. You decide to store the user_id of a customer in an array for each book. Simple enough, right?

Well, this may indeed work for 99.99% of the cases, but what happens when J.K. Rowling releases a new Harry Potter book and sales spike in the millions? The 16MB BSON document size limit could easily be reached. Redesigning our entire application for this outlier situation could result in reduced performance for the typical book, but we do need to take it into consideration.

The Outlier Pattern

With the Outlier Pattern, we are working to prevent a few queries or documents driving our solution towards one that would not be optimal for the majority of our use cases. Not every book sold will sell millions of copies.

A typical book document storing user_id information might look something like:

{
    "_id": ObjectID("507f1f77bcf86cd799439011")
    "title": "A Genealogical Record of a Line of Alger",
    "author": "Ken W. Alger",
    …,
    "customers_purchased": ["user00", "user01", "user02"]

}

This would work well for a large majority of books that aren’t likely to reach the “best seller” lists. Accounting for outliers though results in the customers_purchased array expanding beyond a 1000 item limit we have set, we’ll add a new field to “flag” the book as an outlier.

{
    "_id": ObjectID("507f191e810c19729de860ea"),
    "title": "Harry Potter, the Next Chapter",
    "author": "J.K. Rowling",
    …,
   "customers_purchased": ["user00", "user01", "user02", …, "user999"],
   "has_extras": "true"
}

We’d then move the overflow information into a separate document linked with the book’s id. Inside the application, we would be able to determine if a document has a has_extras field with a value of true. If that is the case, the application would retrieve the extra information. This could be handled so that it is rather transparent for most of the application code.

Many design decisions will be based on the application workload, so this solution is intended to show an example of the Outlier Pattern. The important concept to grasp here is that the outliers have a substantial enough difference in their data that, if they were considered “normal”, changing the application design for them would degrade performance for the more typical queries and documents.

Sample Use Case

The Outlier Pattern is an advanced pattern, but one that can result in large performance improvements. It is frequently used in situations when popularity is a factor, such as in social network relationships, book sales, movie reviews, etc. The Internet has transformed our world into a much smaller place and when something becomes popular, it transforms the way we need to model the data around the item.

One example is a customer that has a video conferencing product. The list of authorized attendees in most video conferences can be kept in the same document as the conference. However, there are a few events, like a company’s all hands, that have thousands of expected attendees. For those outlier conferences, the customer implemented “overflow” documents to record those long lists of attendees.

Conclusion

The problem that the Outlier Pattern addresses is preventing a few documents or queries to determine an application’s solution. Especially when that solution would not be optimal for the majority of use cases. We can leverage MongoDB’s flexible data model to add a field to the document “flagging” it as an outlier. Then, inside the application, we handle the outliers slightly differently. By tailoring your schema for the typical document or query, application performance will be optimized for those normal use cases and the outliers will still be addressed.

One thing to consider with this pattern is that it often is tailored for specific queries and situations. Therefore, ad hoc queries may result in less than optimal performance. Additionally, as much of the work is done within the application code itself, additional code maintenance may be required over time.

In our next Building with Patterns post, we’ll take a look at the Computed Pattern and how to optimize schema for applications that can result in unnecessary waste of resources. If you have questions, please leave comments below.

This post was originally published on the MongoDB Blog.

M	T	W	T	F	S	S
				1	2	3
4	5	6	7	8	9	10
11	12	13	14	15	16	17
18	19	20	21	22	23	24
25	26	27	28

Month: February 2019

Building with Patterns: The Computed Pattern

The Computed Pattern

Sample Use Case

Conclusion

Previous Parts of Building with Patterns:

Building With Patterns: The Outlier Pattern

The Outlier Pattern

Sample Use Case

Conclusion