MongoDB Storage Engine Journaling

The other day I came across a question about journaling in MongoDB: specifically, how it is handled in the different supported storage engines, and whether it is necessary to use at all. The discussion that followed was interesting, so I thought I would write up some thoughts and explanations.

To start with, some questions arise. What exactly is a MongoDB journal? Why is journaling important? For the sake of this post, everything I describe applies to 64-bit builds of mongod and is based on version 3.4 of the database.

What is Journaling?

Much like one uses a journal to record thoughts and daily events, MongoDB uses a journal to ensure data integrity. This is accomplished through writing data first to the journal files and then to the core data files. In the event of an untimely server shutdown, the data can be restored to a consistent state.

This is how MongoDB provides its write operation durability guarantee. If your mongod process stops in an unexpected manner, data from the journal will be used to re-apply the write operations when it is restarted. When journaling is enabled, MongoDB creates a subdirectory called journal for the journal data. It resides under the dbPath directory and contains the write-ahead logs.
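
To make the write-ahead idea concrete, here is a minimal toy sketch in Python. It is not how mongod is implemented: the ToyJournaledStore class, the toy.journal file name, and the JSON line format are all my own assumptions. It only illustrates the ordering described above, where a write is recorded in the journal first, applied to the data second, and replayed from the journal after a restart.

```python
import json
import os
import tempfile


class ToyJournaledStore:
    """A toy key-value store that journals each write before applying it."""

    def __init__(self, journal_path):
        self.journal_path = journal_path
        self.data = {}  # stands in for the core data files

    def write(self, key, value):
        # 1. Record the operation in the journal and force it to disk.
        with open(self.journal_path, "a", encoding="utf-8") as journal:
            journal.write(json.dumps({"key": key, "value": value}) + "\n")
            journal.flush()
            os.fsync(journal.fileno())
        # 2. Only then apply the change to the data itself.
        self.data[key] = value

    def recover(self):
        """Rebuild the state by re-applying every journaled write in order."""
        self.data = {}
        if os.path.exists(self.journal_path):
            with open(self.journal_path, encoding="utf-8") as journal:
                for line in journal:
                    op = json.loads(line)
                    self.data[op["key"]] = op["value"]
        return self.data


def demo():
    """Write three operations, then recover them from a fresh instance."""
    with tempfile.TemporaryDirectory() as db_path:
        # Mirror the layout from the text: a journal directory under dbPath.
        journal_dir = os.path.join(db_path, "journal")
        os.makedirs(journal_dir)
        journal_file = os.path.join(journal_dir, "toy.journal")  # hypothetical name

        store = ToyJournaledStore(journal_file)
        store.write("a", 1)
        store.write("a", 2)
        store.write("b", 3)

        # Simulate an unexpected shutdown: a new instance starts with no data.
        restarted = ToyJournaledStore(journal_file)
        return restarted.recover()
```

Calling demo() replays the three journaled writes into the fresh instance, so the restarted store ends up with the last value written for each key.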

Since each different storage engine in MongoDB implements crash resiliency and data persistence slightly differently, let’s see how journaling is utilized.

Storage Engine Implementations

There are three storage engines that are predominantly used with MongoDB: MMAPv1, WiredTiger, and In-Memory. They each have their own strengths and weaknesses. Those differences are beyond the scope of this post, but I would like to look at how journaling is implemented in each.

MMAPv1

Starting in version 3.2 of MongoDB, MMAPv1 is no longer the default storage engine. However, it is still in use and in certain circumstances is a better option. Therefore, it is still good to understand how journaling works with this storage engine in its default configuration.

In a nutshell, when a write command is issued, the operation is applied to an internal private view, then written to the journal. Once the data has been updated in the journal the changes are applied to an internal shared view and then written to disk.

In MMAPv1, the journal is updated every 100 milliseconds in batches called group commits. The data files, though, are only written every 60 seconds, when the shared view is flushed to disk. Depending on the quantity and availability of system memory, the flushing of data may occur more often.
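
As a back-of-the-envelope illustration of those two intervals, the sketch below compares the worst-case window of writes that have not yet reached disk, with and without the journal. The constant and function names are my own, and it deliberately ignores the extra flushes that memory pressure can trigger.

```python
JOURNAL_COMMIT_INTERVAL_MS = 100  # group commit interval quoted in the text
DATA_FLUSH_INTERVAL_MS = 60_000   # shared view flush interval quoted in the text


def worst_case_unflushed_window_ms(journaling_enabled):
    """Longest stretch of writes that may not be on disk when a crash hits."""
    if journaling_enabled:
        return JOURNAL_COMMIT_INTERVAL_MS
    return DATA_FLUSH_INTERVAL_MS


def window_ratio():
    """How many times larger the window is without the journal."""
    return worst_case_unflushed_window_ms(False) // worst_case_unflushed_window_ms(True)
```

With the default intervals, the window without a journal is 600 times longer than the window with one.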

Where then does the importance of the journal come in? Well, in the case of an unexpected shutdown of the mongod process, the journal can be used to restore the data. Without journaling on a standalone server, recovery means a lengthier and more involved repair process.

On systems using a properly configured replica set, recovering without a journal may be simpler than running the repair process, since the data can be resynced from another member of the set. It is still not as clean as with journaling enabled, however.

WiredTiger

The WiredTiger storage engine takes a different approach to write durability. WiredTiger uses checkpoints in conjunction with a journal. These checkpoints allow data to be recovered up to the last checkpoint.

When a write operation is called, a snapshot is taken of the data. When data is written to disk (every 60 seconds by default), the data is written across all data files and becomes durable. This becomes a new checkpoint and can be used as a recovery point.

This allows WiredTiger to recover from the last checkpoint without a journal. Pretty slick. However, if an unexpected shutdown occurs between checkpoints and journaling is disabled, the writes made since the last checkpoint will be lost. The journal in WiredTiger, therefore, is a write-ahead log, similar to MMAPv1’s, that provides data durability between checkpoints.
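
The interplay between checkpoints and the journal can be sketched with a few lines of Python. This is a toy model under my own assumptions (the recover function and its arguments are hypothetical, and real WiredTiger recovery is far more involved): the checkpoint is a durable snapshot, and the journal holds the writes made since that snapshot.

```python
def recover(checkpoint, journal_since_checkpoint, journaling_enabled=True):
    """Return the state available after an unexpected shutdown.

    checkpoint: dict holding the last durable snapshot of the data.
    journal_since_checkpoint: list of (key, value) writes made after it.
    """
    state = dict(checkpoint)  # recovery always starts from the checkpoint
    if journaling_enabled:
        # Replay the journaled writes on top of the checkpoint, in order.
        for key, value in journal_since_checkpoint:
            state[key] = value
    return state


checkpoint = {"x": 1}
writes_after_checkpoint = [("x", 2), ("y", 3)]

with_journal = recover(checkpoint, writes_after_checkpoint, journaling_enabled=True)
without_journal = recover(checkpoint, writes_after_checkpoint, journaling_enabled=False)
```

In this model, with_journal contains both later writes, while without_journal falls back to the checkpoint alone and the two writes made after it are gone.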

So journaling and replica sets are still important pieces of a server environment when using WiredTiger. It just is implemented in a slightly different way than MMAPv1.

In-Memory

For those who are running an Enterprise version of MongoDB, there is a storage engine that stores data in memory. Because the data is stored in memory, it is non-persistent. The concept of a journal does not apply in this situation.

Study Question

I have seen questions similar to “Why is the journal unnecessary for WiredTiger?” listed in various study guides. As we have learned, it is indeed not required for data consistency, at least not in the same fashion as it is for MMAPv1. That being said, I might argue that “unnecessary” is a bit of a misleading word. WiredTiger’s data consistency model is just different from MMAPv1’s. Journaling may not be “necessary” perhaps, but I wouldn’t run a system without it.

Wrap Up

All of these details of journaling can be a lot to think about and potentially manage. This is one of the great advantages of MongoDB Atlas, in that these internal matters are handled for you. If you are running and/or managing a MongoDB server yourself, it is a best practice to leave journaling on for data integrity. Further, it is recommended to run at least a replica set, as that often simplifies data recovery even more.

There are several MongoDB specific terms in this post. I created a MongoDB Dictionary skill for the Amazon Echo line of products. Check it out and you can say “Alexa, ask MongoDB for the definition of a journal?” and get a helpful response.


Follow me on Twitter @kenwalger to get the latest updates on my postings.


Flask Blood Glucose Tracker

My oldest daughter was diagnosed with Type 1 Diabetes at the age of two. Technology has come a long way in terms of tracking blood sugar levels, but I thought I would start a Python web application to do so. I’ll be using the Flask web framework for the project and, since I’m not a marketing or product naming genius, will call the project Flask Blood Glucose Tracker. I’m certainly open to other, more catchy, names.

This is somewhat of a different post for me, in that I’ll be walking through the generation of an actual application. The application will mostly be a minimal viable product (MVP) though. It should, however, be a good tutorial on how to integrate some concepts and technologies together. As such, in this first post, I’ll cover some of the project specifications and features that I’d like to include and work on implementing them in future posts.

Application for Diabetes

Diabetics have to check their blood glucose levels frequently, typically using a blood sugar monitor. These checks involve getting a small amount of blood from a finger (or arm) prick and having the monitor test it. It then returns a measurement of the amount of glucose in the blood as a number of milligrams per deciliter (mg/dL). These readings will be the numbers we want to record in our application.

It is also important to know whether the reading was taken before or after eating, and the time of day it was taken. Therefore, the application will need to accommodate that as well. The readings themselves, along with the time of day and the relation to meal time, all factor into the amount of insulin that needs to be injected.
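
Those three pieces of information (the value, the time, and the relation to a meal) could be modeled roughly like this. It is only a sketch: the Reading and MealRelation names and their fields are placeholders of mine, not the final design of the application.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class MealRelation(Enum):
    """Whether a reading was taken before or after eating."""
    BEFORE_MEAL = "before"
    AFTER_MEAL = "after"


@dataclass(frozen=True)
class Reading:
    """One blood glucose reading (hypothetical shape)."""
    value_mg_dl: int              # glucose level in mg/dL
    taken_at: datetime            # date and time of the reading
    meal_relation: MealRelation   # before or after a meal or snack

    def __post_init__(self):
        # Reject values that cannot be a real measurement.
        if self.value_mg_dl <= 0:
            raise ValueError("a reading must be a positive number of mg/dL")


morning = Reading(112, datetime(2017, 7, 1, 7, 30), MealRelation.BEFORE_MEAL)
```

Keeping the meal relation as an enum rather than a free-form string means a typo cannot quietly create a third category.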

Throughout this and subsequent posts, I will do my best to explain diabetes-specific terms in as user-friendly a way as possible. I am basing many of the numbers that I use, ranges of “good and bad”, etc. on managing my daughter’s diabetes over the last 14 years and the countless doctor appointments I’ve attended. As a disclaimer, I am not a medical professional. Please check with your physician about specifics with diabetes. There is a Diabetes For Dummies book which provides a decent overview as well.

Flask Blood Glucose Tracker Application Features

There are already a lot of excellent products on the market for keeping track of one’s blood sugar levels. With that in mind, this application is going to be fairly simple to start. I would like to build it with growth in mind, however, so building a REST API into the project seems like a good idea. My basic feature list to start with is:

  • Register for the BGT site, with email confirmation.
  • Login/Logout, with different access based on role.
    • Patient to access and edit their own data.
    • Physician to get a list of all current patient records.
    • Administrator for site maintenance.
  • Input blood sugar levels with date and time of reading and indication of before or after a meal or snack.
  • Display the data in a table with averages.
  • Display the data in a chart or visual format.
  • Typical CRUD operations for the data.
  • REST API to expose patient blood sugars in a secure fashion.
  • Data is stored in the cloud for accessibility and ease of database maintenance.
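
The role-based access in that list could start out as a simple lookup table. This is a sketch only: the action names and the is_allowed helper are made up for illustration, and a real application would enforce this in its login and routing layer.

```python
# Hypothetical mapping of each role to the actions it may perform.
PERMISSIONS = {
    "patient": {"read_own_records", "edit_own_records"},
    "physician": {"list_patient_records"},
    "admin": {"manage_users"},
}


def is_allowed(roles, action):
    """Return True if any of the user's roles grants the action."""
    return any(action in PERMISSIONS.get(role, set()) for role in roles)
```

For example, is_allowed(["patient"], "edit_own_records") is true, while an administrator is not granted access to anyone’s records.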

For the reporting features, highlighting high and low blood sugar levels in the report would be helpful. Since what is “good” and “bad” can differ for each individual, I’ll include a field for each individual to set their own range.
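
Here is a rough sketch of how a report might flag readings against a per-person range and compute the average. The function names are mine, and the range used in the example is an arbitrary placeholder rather than medical guidance.

```python
from statistics import mean


def classify(value, low, high):
    """Label one reading against the individual's own range."""
    if value < low:
        return "low"
    if value > high:
        return "high"
    return "in range"


def summarize(values, low, high):
    """Average the readings and flag each one for the report."""
    return {
        "average": mean(values),
        "flags": [classify(value, low, high) for value in values],
    }


# Placeholder range of 80 to 170 mg/dL, chosen only for the example.
report = summarize([60, 120, 180], low=80, high=170)
```

Because low and high are passed in rather than hard-coded, each person’s own settings can drive the highlighting.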

Application Stack

I have already mentioned that this application will be built with Flask. For the data store, I will use MongoDB. To keep in line with the feature request of storing data in the cloud, I will use Atlas, the Database as a Service (DBaaS) that MongoDB offers.

I haven’t quite figured out yet where I’ll ultimately host this application, perhaps Heroku? Or maybe on my own server.

Application Libraries

When it comes to libraries, there are a lot of choices. Here’s what I’ll be using, which will also be included in a requirements.txt file.

  • Flask, 0.12.2
  • Flask-Login, 0.4.0
  • Flask-PyMongo, 0.5.1
  • Bokeh, 0.12.6
  • Jinja2, 2.9.6
  • pandas, 0.20.2

There are other libraries that will be used as well, but those are the main ones of interest.

I’m also more of a fan of Zurb Foundation than Twitter Bootstrap, so I’ll be using that for styling.

Pages and Routes

Web Pages

To start with we need a way for a user to log in, enter their personal data, enter a new blood sugar record, edit their record, and view their information in both a tabular format and then in a chart format.

We’ll need the following pages, at least to start.

  • Index
  • Registration
  • Login/Logout Page
  • Profile page, login required
  • Records page (create, read, update, delete), login required
  • Chart page, login required

For a Physician we would want to be able to:

  • Display all of their patients, login required

For an Administrator, we want to be able to:

  • Manage users (patients & physicians) without being able to see patient medical data.

REST API Routes

From the API I want to expose the ability to securely read and write (GET and POST) data to a patient’s record. This will make it easier to, for example, write a mobile application to connect to our data. Or, with the advances in blood sugar monitors, perhaps automatically update our application with readings from a device.
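
Before any web framework is involved, the GET and POST behavior can be sketched as two plain functions. Everything here is hypothetical: the function names, the payload keys, the in-memory store, and the status codes chosen. Authentication is left out entirely, so this covers only the read and write halves, not the secure part.

```python
def get_readings(store, patient_id):
    """GET: return (status, body) holding the patient's readings."""
    if patient_id not in store:
        return 404, {"error": "unknown patient"}
    return 200, {"readings": store[patient_id]}


def post_reading(store, patient_id, payload):
    """POST: validate one reading and append it to the patient's record."""
    if patient_id not in store:
        return 404, {"error": "unknown patient"}
    value = payload.get("value")
    # bool is a subclass of int in Python, so rule it out explicitly.
    if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
        return 400, {"error": "value must be a positive integer (mg/dL)"}
    if payload.get("meal") not in ("before", "after"):
        return 400, {"error": "meal must be 'before' or 'after'"}
    store[patient_id].append({"value": value, "meal": payload["meal"]})
    return 201, {"readings": store[patient_id]}
```

A mobile app or a connected monitor would only ever see the status code and body, so the storage behind these functions can change later without affecting clients.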

Document Model

Since I will be using MongoDB to store data for this application, I’ll be leveraging the document model. This offers a lot of flexibility in how data is stored, among other benefits. I would encourage you to read my blog post on the document model if you are not familiar with it.

To start with, the basic data we want to capture and model will be as follows:

[Image: BGT Sample Document. Caption: Sample Patient Record document]

In looking at this sample document, the groups field will keep track of values such as patient, physician, and admin. I have chosen to implement the postal_code and MRN values as strings instead of integers to accommodate alpha-numeric values.
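
As a rough illustration in Python, a patient document along these lines might look like the dictionary below. Only groups, postal_code, and MRN come from the description above; every other field name and every value is a hypothetical placeholder of mine.

```python
from datetime import datetime

sample_patient = {
    "MRN": "MRN-00123A",       # string, so letters and dashes are allowed
    "postal_code": "K1A 0B1",  # string, so alpha-numeric codes fit
    "groups": ["patient"],     # could also hold "physician" or "admin"
    # Everything below is an assumed shape, not the real schema.
    "name": "Example Patient",
    "readings": [
        {
            "value": 112,  # mg/dL
            "taken_at": datetime(2017, 7, 1, 7, 30),
            "meal": "before",
        },
    ],
}
```

Storing the readings as an embedded array is one option the document model allows; a separate collection of readings would be another.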

Through the course of developing this application, we will see the flexibility of the document model in action. We’ll see how we can utilize some of the features from MongoDB’s aggregation pipeline to handle our data processing as well.
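
To give a flavor of what that might look like, here is a sketch of an aggregation pipeline built as plain Python data. It assumes a hypothetical document shape in which each patient document has an MRN field and an embedded readings array whose items carry value and meal fields; none of that is final. The $match, $unwind, $group, and $avg operators are standard aggregation stages and operators.

```python
def average_by_meal_pipeline(mrn):
    """Build a pipeline averaging one patient's readings per meal relation."""
    return [
        # Keep only the requested patient's document.
        {"$match": {"MRN": mrn}},
        # Produce one document per embedded reading.
        {"$unwind": "$readings"},
        # Average the values separately for "before" and "after" readings.
        {
            "$group": {
                "_id": "$readings.meal",
                "average": {"$avg": "$readings.value"},
            }
        },
    ]


pipeline = average_by_meal_pipeline("MRN-00123A")
```

With PyMongo, a list like this would be passed to a collection’s aggregate() method; the collection itself is not shown here.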

Wrap Up

I have outlined a nice project here which will utilize several different bits of programming. In the next few posts then, I’ll cover how to implement all of these features into an MVP application. I would definitely enjoy receiving feedback, so please leave comments below.

 

