Using R with MongoDB

NOTE: MongoDB 3.6 has new R language support. See my other blog post for the latest information.

The R programming language is a powerful language for statistical computing. In statistical work, the data being explored frequently comes from a database. R excels at working with data in tables and matrices and at joining columns and rows together. This seems like a great fit for SQL databases, but what about a NoSQL database like MongoDB? Can R analyze a MongoDB document as easily as a SQL table?

Well, this post would be pretty short if the answer was “No”, right? So let’s take a look at how to pull data into R from a MongoDB collection. Then we’ll take a brief look at examining our data.

Setting Up

While there are plugins available for a variety of IDEs, such as those by JetBrains, it is pretty common to use RStudio when working with R. Somewhere along the line, I picked up a “scores” database in MongoDB that we’ll use as our sample data. I’ve posted it here for download.

We can easily import the data into our MongoDB database using mongoimport. In my case, I put it into a database called kenblog and a collection called scores. Pretty creative, eh? Here’s what a sample document in the collection looks like:

{
   "_id" : ObjectId("5627207b33ff2cf40effc25e"),
   "student" : 2,
   "type" : "quiz",
   "score" : 74
}

There are 1,787 records in our collection with the type of assignment being either quiz, essay, or exam. Let’s see how we can access our data with R.

First, we need to get and load a package for interfacing MongoDB with R. For this example I’ll be using RMongo, but there is another package available, rmongodb. Sadly, neither package’s GitHub repository shows much recent activity. Even so, we can still connect and run some queries.

Connecting R to MongoDB

We need to bring in our package and establish our connection:

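# Load the RMongo package; library() stops with an error if it is missing
library(RMongo)

# Connect to the kenblog database on the local MongoDB server
mongo <- mongoDbConnect('kenblog', 'localhost', 27017)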

The mongoDbConnect function takes the name of the database, the server host, and the port number to which we want to connect.

Next, we will want to send a query. For this example, let’s get only the exam data from our scores collection. We can use the dbGetQuery function for this, which takes a connection object, the collection name, and the query as a JSON string.

examQuery <- dbGetQuery(mongo, 'scores', "{'type': 'exam'}")
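Because the query is just a JSON string, it can be assembled in R before it is sent. Here is a minimal sketch of that idea. Note that build_type_query is a hypothetical helper name of my own, not part of RMongo, and it assumes the type value is a simple word that needs no escaping.

```r
# Hypothetical helper: builds the JSON query string for one assignment type.
# Assumes `type` contains no quotes or backslashes that would need escaping.
build_type_query <- function(type) {
  sprintf("{'type': '%s'}", type)
}

exam_query_string <- build_type_query("exam")
quiz_query_string <- build_type_query("quiz")

print(exam_query_string)
```

The resulting string can be passed as the third argument to dbGetQuery in place of the hand-written query, which makes it easy to repeat the same steps for the quiz or essay documents.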

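This loads the exam records from our scores collection into a data frame. (dbGetQuery also takes skip and limit arguments; as I read the RMongo documentation, limit defaults to 1000, so a larger result set would need an explicit limit.) Let’s pull out the score column on its own. Indexing with single brackets keeps it as a one-column data frame rather than a bare vector.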

exam_scores <- examQuery[c('score')]
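One detail worth knowing: single-bracket indexing on a data frame returns another data frame, while double brackets (or $) return the column as a plain vector. The sketch below shows the difference using a small hand-made data frame. The rows are invented for illustration and simply stand in for what a query returns.

```r
# Hand-made stand-in for a query result; these rows are illustrative only
sample_result <- data.frame(
  student = c(1, 2, 3),
  type    = c("exam", "exam", "exam"),
  score   = c(60, 79, 100)
)

# Single brackets keep the data frame structure (one column)
scores_df <- sample_result[c("score")]

# Double brackets extract the column as a plain numeric vector
scores_vec <- sample_result[["score"]]

print(class(scores_df))   # "data.frame"
print(class(scores_vec))  # "numeric"
```

Either form works with summary(), but functions such as mean() or sd() expect the vector form.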

Nice! Now we can utilize some of the power of R to do some data analysis. Let’s get a simple summary of our data with summary(exam_scores):

     score       
 Min.   : 60.00  
 1st Qu.: 72.00  
 Median : 79.00  
 Mean   : 79.45  
 3rd Qu.: 86.00  
 Max.   :100.00 

Neat. I realize that this particular example could be computed using MongoDB’s powerful aggregation framework. However, there are times when using outside resources and languages, like R, for processing is called for.
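As a small illustration of the kind of processing R makes easy once the documents are in a data frame, the sketch below summarizes scores per assignment type. The data frame is hand-made with invented values; in practice it would come from a query without the type filter.

```r
# Illustrative data only; a real run would use the data frame returned by a query
all_scores <- data.frame(
  student = c(1, 1, 1, 2, 2, 2),
  type    = c("quiz", "essay", "exam", "quiz", "essay", "exam"),
  score   = c(70, 80, 90, 74, 84, 94)
)

# Mean score for each assignment type, one named value per type
mean_by_type <- tapply(all_scores$score, all_scores$type, mean)

# Lowest and highest score within each type: a matrix with one column per type
range_by_type <- sapply(split(all_scores$score, all_scores$type), range)

print(mean_by_type)
print(range_by_type)
```

From here the same data frame can go straight into plotting or modeling functions, which is where pulling the data into R starts to pay off.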

Wrap Up

Connecting to MongoDB from R is pretty straightforward using the RMongo package. However, many of the features that MongoDB has added in the last few years have not made it into the community R drivers. Further, as of this post, there isn’t an “official” R driver supported by MongoDB.

R is a great statistical language and can definitely be used to query and analyze MongoDB collections. If you are using R in your work today, MongoDB is a definite option for storing your data to be analyzed.


Follow me on Twitter @kenwalger to get the latest updates on my postings.

There are a few MongoDB-specific terms in this post. I created a MongoDB Dictionary skill for the Amazon Echo line of products. Check it out and you can say “Alexa, ask MongoDB for the definition of a document” and get a helpful response.
