March 2019 | Blog of Ken W. Alger

Imagine a fairly decent sized city of approximately 39,000 people. The exact number is pretty fluid as people move in and out of the city, babies are born, and people die. We could spend our days trying to get an exact number of residents each day. But most of the time that 39,000 number is “good enough.” Similarly, in many applications we develop, knowing a “good enough” number is sufficient. If a “good enough” number is good enough then this is a great opportunity to put the Approximation Pattern to work in your schema design.

The Approximation Pattern

We can use the Approximation Pattern when we need to display calculations that are challenging or resource expensive (time, memory, CPU cycles) to calculate and for when precision isn’t of the highest priority. Think again about the population question. What would the cost be to get an exact calculation of that number? Would, or could, it have changed since I started the calculation? What’s the impact on the city’s planning strategy if it’s reported as 39,000 when in reality it’s 39,012?

From an application standpoint, we could build in an approximation factor which would allow for fewer writes to the database and still provide statistically valid numbers. For example, let’s say that our city planning strategy is based on needing one fire engine per 10,000 people. 100 people might seem to be a good “update” period for planning. “We’re getting close to the next threshold, better start budgeting.”

In an application then, instead of updating the population in the database with every change, we could build in a counter and only update by 100, 1% of the time. Our writes are significantly reduced here, in this example by 99%. Another option might be to have a function that returns a random number. If, for example, that function returns a number from 0 to 100, it will return 0 around 1% of the time. When that condition is met, we increase the counter by 100.

Why should we be concerned with this? Well, when working with large amounts of data or large numbers of users, the impact on performance of write operations can get to be large too. The more you scale up, the greater that impact is too and at scale, that’s often your most important consideration. By reducing writes and reducing resources for data that doesn’t need to be “perfect,” it can lead to huge improvements in performance.

Sample Use Case

Population patterns are an example of the Approximation Pattern. An additional use case where we could use this pattern is for website views. Generally speaking, it isn’t vital to know if 700,000 people visited the site, or 699,983. Therefore we could build into our application a counter and update it in the database when our thresholds are met.

This could have a tremendous reduction in the performance of the site. Spending time and resources on business critical writes of data makes sense. Spending them all on a page counter doesn’t seem to be a great use of resources.

The effect of approximation on write workload

Movie Website – Write Workload Reduction

In the image above we see how we could use the Approximation Pattern and reduce not only writes for the counter operations, but we might also see a reduction in architecture complexity and cost by reducing those writes. This can lead to further savings beyond just the time for writing data. Similar to the Computed Pattern we explored earlier, it saves on overall CPU usage by not having to run calculations as frequently.

Conclusion

The Approximation Pattern is an excellent solution for applications that work with data that is difficult and/or expensive to compute and the accuracy of those numbers isn’t mission critical. We can make fewer writes to the database increasing performance and still maintain statistically valid numbers. The cost of using this pattern, however, is that exact numbers aren’t being represented and that the implementation must be done in the application itself.

The next post in this series will look at the Tree Pattern.

If you have questions, please leave comments below.

Previous Parts of Building with Patterns:

The Polymorphic pattern
The Attribute pattern
The Bucket pattern
The Outlier pattern
The Computed pattern
The Subset pattern
The Extended Reference pattern

This post was originally published on the MongoDB Blog.

Throughout this Building With Patterns series, I hope you’ve discovered that a driving force in what your schema should look like, is what the data access patterns for that data are. If we have a number of similar fields, the Attribute Pattern may be a great choice. Does accommodating access to a small portion of our data vastly alter our application? Perhaps the Outlier Pattern is something to consider. Some patterns, such as the Subset Pattern, reference additional collections and rely on JOIN operations to bring every piece of data back together. What about instances when there are lots of JOIN operations needed to bring together frequently accessed data? This is where we can use the Extended Reference pattern.

The Extended Reference Pattern

There are times when having separate collections for data make sense. If an entity can be thought of as a separate “thing”, it often makes sense to have a separate collection. For example, in an e-commerce application, the idea of an order exists, as does a customer, and inventory. They are separate logical entities.

From a performance standpoint, however, this becomes problematic as we need to put the pieces of information together for a specific order. One customer can have N orders, creating a 1-N relationship. From an order standpoint, if we flip that around, they have an N-1 relationship with a customer. Embedding all of the information about a customer for each order just to reduce the JOIN operation results in a lot of duplicated information. Additionally, not all of the customer information may be needed for an order.

The Extended Reference pattern provides a great way to handle these situations. Instead of duplicating all of the information on the customer, we only copy the fields we access frequently. Instead of embedding all of the information or including a reference to JOIN the information, we only embed those fields of the highest priority and most frequently accessed, such as name and address.

Something to think about when using this pattern is that data is duplicated. Therefore it works best if the data that is stored in the main document are fields that don’t frequently change. Something like a user_id and a person’s name are good options. Those rarely change.

Also, bring in and duplicate only that data that’s needed. Think of an order invoice. If we bring in the customer’s name on an invoice, do we need their secondary phone number and non-shipping address at that point in time? Probably not, therefore we can leave that data out of the invoice collection and reference a customer collection.

When information is updated, we need to think about how to handle that as well. What extended references changed? When should those be updated? If the information is a billing address, do we need to maintain that address for historical purposes, or is it okay to update? Sometimes duplication of data is better because you get to keep the historical values, which may make more sense. The address where our customer lived at the time we ship the products make more sense in the order document, then fetching the current address through the customer collection.

Sample Use Case

An order management application is a classic use case for this pattern. When thinking about N-1 relationships, orders to customers, we want to reduce the joining of information to increase performance. By including a simple reference to the data that would most frequently be JOINed, we save a step in processing.

If we continue with the example of an order management system, on an invoice Acme Co. may be listed as the supplier for an anvil. Having the contact information for Acme Co. probably isn’t super important from an invoice standpoint. That information is better served to reside in a separate supplier collection, for example. In the invoice collection, we’d keep the needed information about the supplier as an extended reference to the supplier information.

Conclusion

The Extended Reference pattern is a wonderful solution when your application is experiencing many repetitive JOIN operations. By identifying fields on the lookup side and bringing those frequently accessed fields into the main document, performance is improved. This is achieved through faster reads and a reduction in the overall number of JOINs. Be aware, however, that data duplication is a side effect of this schema design pattern.

The next post in this series will look at the Approximation Pattern.

If you have questions, please leave comments below.

Previous Parts of Building with Patterns:

The Polymorphic pattern
The Attribute pattern
The Bucket pattern
The Outlier pattern
The Computed pattern
The Subset pattern

This post was originally published on the MongoDB Blog.

Help Support the Site

If you enjoy reading my blog and would like to help offset the costs of hosting and maintenance, I would be honored to accept your donations.

donate monthly
donate once only

M	T	W	T	F	S	S
				1	2	3
4	5	6	7	8	9	10
11	12	13	14	15	16	17
18	19	20	21	22	23	24
25	26	27	28	29	30	31

Month: March 2019

Building with Patterns: The Approximation Pattern

The Approximation Pattern

Sample Use Case

Conclusion

Previous Parts of Building with Patterns:

Building with Patterns: The Extended Reference Pattern

The Extended Reference Pattern

Sample Use Case

Conclusion

Previous Parts of Building with Patterns: