Importing data with mongoimport

There comes a time in almost everyone’s experience with databases when it would be great to bring in data from an outside source. Often the data is in a spreadsheet format (CSV or TSV) or perhaps a JSON format. I discussed some of the command line tools MongoDB provides in a previous post. Importing data into a MongoDB database is made easy with the CLI tool, mongoimport.

For many use cases, mongoimport is pretty straightforward. It is, in fact, used heavily in the MongoDB University courses as a way to quickly populate a database. I’d like to look at some use cases beyond simply populating an empty collection, however.

mongoimport

Connections

By default, mongoimport connects to a mongod or mongos instance running on localhost at port 27017. The syntax of the mongoimport command is fairly straightforward. If, for example, we want to populate the posts collection in the blog database with a posts.json file, it is simple enough to run the following command.

mongoimport --db blog --collection posts --file posts.json

That is pretty easy. We can make it easier too by using the shorthand version of those flags.

mongoimport -d blog -c posts --file posts.json

If we want to make sure that our posts collection is dropped and only the new data is there, we can use the --drop flag.

mongoimport -d blog -c posts --drop --file posts.json

If you need to change the host or port number, there are flags for that as well, --host and --port, respectively. --host is even more convenient because it allows you to add the port at the end, and use a shorter flag -h. So the following are the same:

mongoimport --host 123.123.123.1 --port 1234 -d blog -c posts --file posts.json
mongoimport -h 123.123.123.1:1234 -d blog -c posts --file posts.json
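
If you script your imports, the flags above map neatly onto an argument list. Here is a minimal Python sketch of that idea. The helper name build_import_args is hypothetical (it is not part of mongoimport), and it only assembles the flags shown above without running anything.

```python
def build_import_args(db, collection, file, host=None, port=None, drop=False):
    """Assemble a mongoimport argument list from the flags discussed above.

    Hypothetical helper for illustration; it does not invoke mongoimport.
    """
    args = ["mongoimport"]
    if host is not None:
        # -h accepts "host" or "host:port", so a separate --port is not needed.
        target = f"{host}:{port}" if port is not None else host
        args += ["-h", target]
    elif port is not None:
        args += ["--port", str(port)]
    args += ["-d", db, "-c", collection]
    if drop:
        # --drop removes the existing collection before importing.
        args.append("--drop")
    args += ["--file", file]
    return args


if __name__ == "__main__":
    print(" ".join(build_import_args("blog", "posts", "posts.json",
                                     host="123.123.123.1", port=1234)))
```

The resulting list could be handed to subprocess.run(args, check=True) on a machine where mongoimport is installed.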

That’s easy enough as well. What if, however, our MongoDB server requires user authentication, like any good server should?

mongoimport Server Authentication

Data security should be on everyone’s mind when it comes to server management. With that in mind, MongoDB offers a variety of ways to secure your data. Assuming that one needs to get authenticated access to the server, how can one use mongoimport to do so? You guessed it, there are flags for that too. --username or -u and --password or -p are your friends.

mongoimport -h 123.123.123.1:1234 -u user -p "pass" -d blog -c posts --file posts.json

We can add some extra assurance by supplying --username but leaving off the --password flag. mongoimport will then prompt for the password, which keeps it out of your shell history.

That works great for simpler authentication setups, but what if the user’s credentials live in a separate authentication database? We can specify one with the --authenticationDatabase flag. That’s pretty handy for making sure that only authorized people can import data into your collection.
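
For scripted imports, the authentication flags can be assembled the same way as any other argument. The following is a small Python sketch under stated assumptions: build_auth_args is a hypothetical helper, and the "admin" database name in the usage line is just an example value. When no password is passed, the -p flag is left off so that mongoimport prompts for it.

```python
def build_auth_args(username, password=None, auth_db=None):
    """Return the authentication flags for a mongoimport command line.

    Hypothetical helper for illustration only.
    """
    args = ["-u", username]
    if password is not None:
        # Only include -p when a password is supplied explicitly;
        # omitting it lets mongoimport prompt for the password instead.
        args += ["-p", password]
    if auth_db is not None:
        args += ["--authenticationDatabase", auth_db]
    return args


if __name__ == "__main__":
    # "admin" is an assumed example value for the authentication database.
    print(" ".join(["mongoimport"] + build_auth_args("user", auth_db="admin")))
```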

mongoimport provides a great range of flag options for connecting to secured servers. I would highly recommend looking at the documentation for specifics based on your environment.

File & Column Types

As stated earlier, mongoimport works on CSV, TSV, and JSON documents. By default the import format is JSON. With the --type flag we can import CSV or TSV files. Since CSV and TSV files can contain some special features, let’s look at some of the options for working with them and mongoimport.

Many times a CSV or TSV file will include a header line. It would be handy if we could utilize those header values as field names in our MongoDB documents, right? Well, mongoimport offers a --headerline flag that accomplishes that for us.
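
To make the header line idea concrete, here is a minimal Python sketch that produces a CSV whose first row holds the field names. The function name posts_to_csv and the sample data are assumptions made for illustration.

```python
import csv
import io


def posts_to_csv(posts, fieldnames):
    """Render a list of dicts as CSV text with a header line as the first row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()      # first line: the field names
    writer.writerows(posts)   # remaining lines: one row per document
    return buffer.getvalue()


if __name__ == "__main__":
    sample = [
        {"title": "Hello", "author": "ken"},
        {"title": "Sharding", "author": "ken"},
    ]
    print(posts_to_csv(sample, ["title", "author"]))
    # Saved as posts.csv, a file like this could be imported with:
    #   mongoimport -d blog -c posts --type csv --headerline --file posts.csv
```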

For times in which our CSV or TSV file doesn’t include header information, mongoimport has a solution for that as well. With the --fields flag, one can provide a comma-separated list of field names. Alternatively, you can generate a file of field names, with one name per line, and pass it along with the --fieldFile flag.

Along with some of the other features new to MongoDB version 3.4, there are some new features added to mongoimport. One of them is the --columnsHaveTypes flag. When used in conjunction with the --fields, --fieldFile, or --headerline flag, it allows you to specify the type of each field. You pass in the field name in the format columnName.type(), along with any arguments to type(). So, for example, if you were typing a field called isAdmin you would use isAdmin.boolean(). Have a look at the --columnsHaveTypes documentation for a list of available types and supported arguments.
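
As a rough illustration of the columnName.type() format, this Python sketch builds a value for the --fields flag from name and type pairs. The helper typed_fields and the column names are hypothetical; the type names used (string(), int32(), boolean()) are ones I believe are in the documented list, but do check the --columnsHaveTypes documentation for your version.

```python
def typed_fields(columns):
    """Join (name, type_call) pairs into a typed --fields value.

    Example pair: ("isAdmin", "boolean()").
    """
    return ",".join(f"{name}.{type_call}" for name, type_call in columns)


if __name__ == "__main__":
    spec = typed_fields([
        ("name", "string()"),
        ("age", "int32()"),
        ("isAdmin", "boolean()"),
    ])
    print(spec)
    # The value could then be used along these lines (users.csv is assumed):
    #   mongoimport -d blog -c users --type csv --columnsHaveTypes \
    #     --fields "name.string(),age.int32(),isAdmin.boolean()" --file users.csv
```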

Fair warning here, the flags dealing with header information are for CSV and/or TSV files. If one attempts to use them with a JSON formatted file, mongoimport gets grumpy and returns an error.

Importing into an existing collection

One last concept and list of flags I’d like to cover is for those instances in which you want to import data into an existing collection. The --mode flag offers a way to tell mongoimport how to handle existing documents in the collection that match incoming ones. There are three options for the --mode flag: insert, upsert, and merge.

  • Insert allows the documents to get put into the collection with the only check being on fields with a unique index. If there are duplicate values, mongoimport logs an error.
  • Upsert replaces documents in the database with the new documents from the import file. All other documents get inserted.
  • Merge, well, it merges existing documents with matching incoming documents and inserts the others. This is another new feature of version 3.4.
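
The difference between the three modes is easiest to see on a tiny in-memory example. The Python sketch below is only a model of the behavior described above, matching on _id. It is not how mongoimport is implemented, and apply_import is a hypothetical name.

```python
def apply_import(existing, incoming, mode="insert"):
    """Model how each --mode treats incoming documents that match on _id.

    existing: dict mapping _id -> document.
    Returns (resulting collection, list of logged errors).
    """
    result = {key: dict(doc) for key, doc in existing.items()}
    errors = []
    for doc in incoming:
        key = doc["_id"]
        if key not in result:
            result[key] = dict(doc)          # no match: always inserted
        elif mode == "insert":
            errors.append(f"duplicate key: {key!r}")  # existing doc is kept
        elif mode == "upsert":
            result[key] = dict(doc)          # whole document is replaced
        elif mode == "merge":
            result[key].update(doc)          # incoming fields are merged in
        else:
            raise ValueError(f"unknown mode: {mode}")
    return result, errors


if __name__ == "__main__":
    existing = {1: {"_id": 1, "title": "Old", "views": 5}}
    incoming = [{"_id": 1, "title": "New"}, {"_id": 2, "title": "Second"}]
    for mode in ("insert", "upsert", "merge"):
        print(mode, apply_import(existing, incoming, mode))
```

Notice that upsert drops the views field from the matching document, while merge keeps it and only overwrites title.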

If you need to import documents in one of these ways, look at the documentation for options on upserting and merging based on fields other than _id, such as the --upsertFields flag.

Wrap Up

MongoDB also provides a similar, but inverse, tool: mongoexport. While both tools are powerful, they do not reliably preserve all of the BSON data types that MongoDB uses. As such, these tools should not be used for production backups. MongoDB provides other tools for backups.

I hope that this post has given you some insights into one of the powerful MongoDB Package Component tools that are provided “out of the box”, mongoimport. Some programming languages have developed their own separate tools for importing data. Some of them are better than others. For me, since such a powerful import tool is already provided, I find myself using mongoimport more often than not.

If you haven’t tried it out yet yourself, I would encourage you to do so.

There are several MongoDB specific terms in this post. I created a MongoDB Dictionary skill for the Amazon Echo line of products. Check it out and you can say “Alexa, ask MongoDB for the definition of authentication?” and get a helpful response.


Follow me on Twitter @kenwalger to get the latest updates on my postings.


Choosing a good Shard Key in MongoDB

I discussed, at a high level, the concept of sharding in MongoDB in a previous post. I mentioned that I would talk about what makes for a good shard key later on, and here we are. Before we talk about how to choose a good shard key, let’s first discuss its purpose.

When do we need to shard?

There are three typical reasons that we will want to shard a collection.

  1. The write workload on a single server exceeds that server’s capacity.
  2. The working set no longer fits into RAM.
  3. A single server can no longer handle the size of the dataset.

That said, it is better to shard early in an application’s lifecycle, because sharding an existing collection carries significant performance considerations.

Why do we need a Shard Key?

The shard key determines how data, specifically a collection’s documents, will be distributed across a sharded cluster in MongoDB. In a sharded collection, MongoDB uses the shard key to partition the data into ranges of key values. The ranges do not overlap, and each range is associated with a specific chunk on the cluster. The cluster then attempts to distribute the chunks evenly across the shards.
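
Range partitioning can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the split points, the chunk-to-shard assignment, and the function names are all made up for illustration, and a real cluster tracks this metadata on its config servers.

```python
import bisect

# Hypothetical split points on a string shard key. They define four chunks:
#   chunk 0: [minKey, "g")   chunk 1: ["g", "n")
#   chunk 2: ["n", "t")      chunk 3: ["t", maxKey)
SPLIT_POINTS = ["g", "n", "t"]

# Hypothetical assignment of each chunk to a shard.
CHUNK_TO_SHARD = ["shardA", "shardB", "shardA", "shardB"]


def chunk_for(key, split_points=SPLIT_POINTS):
    """Return the index of the chunk whose range contains key."""
    # bisect_right makes each chunk's lower bound inclusive.
    return bisect.bisect_right(split_points, key)


def shard_for(key):
    """Return the shard that owns the chunk containing key."""
    return CHUNK_TO_SHARD[chunk_for(key)]


if __name__ == "__main__":
    for name in ("alger", "miller", "walger"):
        print(name, chunk_for(name), shard_for(name))
```

Because the ranges never overlap, every key value lands in exactly one chunk.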

Sharding Example with Shard Key

As an application requests data, the query router (the mongos server) uses the shard key to direct the query to the appropriate chunk. It does the same for write requests: based on the shard key, it directs each write to the appropriate chunk.

Choosing a Shard Key

As with MongoDB schema design considerations, choosing a shard key is heavily dependent on your application. The data your application reads and writes most frequently should be a key contributing factor. Depending on the application, perhaps using a hashed ObjectId would be enough. In other applications, a single field may not be enough and a compound shard key will be necessary.

The choice of shard key affects the cluster balancer and the distribution of data across the shards. This has a large impact on sharded cluster performance and efficiency. An ideal shard key will allow for an even distribution of documents throughout the entire cluster. Beyond application-specific requirements, there are a few technical factors to consider when choosing a shard key. In the MongoDB versions discussed here, a shard key cannot be changed once chosen for a collection, so these considerations are important.

Technical Considerations

The first consideration is cardinality. A shard key with high cardinality allows for better horizontal scaling. It does not, however, guarantee even data distribution across a sharded cluster.

The frequency of values is another factor in a good shard key. If the shard key field is, say, user_name, and the data set is genealogical data for the Alger family, it would be expected that “Alger” occurs much more frequently than other names. This would cause the balance of documents to be uneven across the shards. One wants to choose a key whose values occur with low frequency, but that by itself won’t guarantee even data distribution either.
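
Cardinality and frequency are both easy to measure on a sample of your data before committing to a key. Here is a minimal Python sketch, where key_stats is a hypothetical helper and the names are invented sample values.

```python
from collections import Counter


def key_stats(values):
    """Summarize a candidate shard key: distinct values and the most frequent one."""
    counts = Counter(values)
    top_value, top_count = counts.most_common(1)[0]
    return {
        "cardinality": len(counts),             # number of distinct values
        "most_common": top_value,               # the most frequent value
        "top_share": top_count / len(values),   # fraction of docs holding it
    }


if __name__ == "__main__":
    # Invented sample: one family name dominates the data set.
    names = ["Alger"] * 8 + ["Smith", "Jones"]
    print(key_stats(names))
```

A value held by most of the documents, as “Alger” is in this sample, cannot be spread across chunks no matter how many other distinct values exist.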

High cardinality and low frequency are indeed important factors in shard key selection. The rate of change is the third leg of the shard key selection stool. Selecting a key that does not increase, or decrease, monotonically is important as well. A key that is always increasing results in all inserts being routed to the shard holding the chunk with maxKey as its upper bound, so that shard becomes unbalanced. Likewise, a key that is always decreasing results in all inserts being routed to the shard holding the chunk with minKey as its lower bound.
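
The hot spot caused by a monotonically increasing key shows up clearly in a small simulation. This Python sketch assumes made-up split points on a numeric key and simply counts which chunk each new insert would land in. A real cluster would also split and migrate chunks over time, which this toy model ignores.

```python
import bisect
from collections import Counter


def insert_counts(keys, split_points):
    """Count how many inserts land in each range-based chunk."""
    counts = Counter()
    for key in keys:
        # Chunk index: number of split points less than or equal to the key.
        counts[bisect.bisect_right(split_points, key)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical chunks: [minKey,100) [100,200) [200,300) [300,maxKey)
    split_points = [100, 200, 300]
    # New documents get ever-increasing keys, all above the last split point.
    new_keys = range(301, 401)
    print(insert_counts(new_keys, split_points))
```

Every one of the new inserts falls into the last chunk, the one bounded above by maxKey, while the other chunks receive nothing.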

Each of these three legs of our stool contributes to data distribution. We need to consider each factor when selecting a key.

Example Shard Keys

Now that we have seen some of the factors that play into sharding performance, let’s look at some examples. To borrow a Hollywood movie title, our keys can be The Good, the Bad & the Ugly as well.

Shard keys must be based on an indexed field that exists in every document in the collection. They can also be built on a compound index in which the shard key is a prefix of that index. I covered indexing in MongoDB previously, as it is a topic unto itself. The considerations that go into indexing therefore carry over to shard keys as well.

The Ugly

I would say that ugly shard keys are those that not only don’t help your application but actually result in worse performance than when starting out. This would be something akin to using the _id field without hashing. Since the default ObjectId values in _id increase monotonically, this would lead to an unbalanced system.
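
To see why hashing helps, here is a rough Python sketch that hashes sequential keys before placing them. It rests on loud assumptions: md5 from hashlib is only a stand-in hash, and taking the hash modulo the shard count is a simplification, since MongoDB actually partitions ranges of hashed values into chunks. The function names are hypothetical.

```python
import hashlib
from collections import Counter


def hashed_shard(key, shard_count):
    """Map a key to a shard index via a hash (illustrative stand-in only)."""
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer hash value.
    return int.from_bytes(digest[:8], "big") % shard_count


def distribution(keys, shard_count):
    """Count how many keys land on each shard index."""
    return Counter(hashed_shard(key, shard_count) for key in keys)


if __name__ == "__main__":
    # Sequential keys stand in for monotonically increasing _id values.
    print(distribution(range(1000), 4))
```

Even though the input keys only ever increase, the hashed values scatter them across all of the shards instead of piling onto one.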

The Bad

I would categorize as bad those shard keys that don’t fully take advantage of the concept of sharding. Suppose, for example, that a collection is distributed across the shards and chunks, but the application’s read requests typically use an index other than the shard key index. While such keys may not have a negative impact on physical data distribution, they just aren’t properly optimized.

This results in a scatter/gather approach to the queries. The mongos server scatters the request to all shards, then gathers the results together to send to the application. With a well-chosen key, the query would instead be targeted at a single shard.
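
The routing decision can be modeled in a few lines of Python. Everything in this sketch is assumed for illustration: the shard names, the user_id shard key, and the modulo placement rule are stand-ins, not how mongos actually locates chunks.

```python
SHARDS = ["shardA", "shardB", "shardC"]   # hypothetical shard names
SHARD_KEY = "user_id"                      # hypothetical shard key field


def shard_for_value(value, shards=SHARDS):
    """Stand-in placement rule: pick a shard from an integer key value."""
    return shards[value % len(shards)]


def shards_to_query(query, shard_key=SHARD_KEY, shards=SHARDS):
    """Return the shards a router would need to contact for this query."""
    if shard_key in query:
        # Targeted query: the shard key pins the request to one shard.
        return [shard_for_value(query[shard_key], shards)]
    # Scatter/gather: without the shard key, every shard must be asked.
    return list(shards)


if __name__ == "__main__":
    print(shards_to_query({"user_id": 7}))          # targeted
    print(shards_to_query({"email": "a@b.c"}))      # scatter/gather
```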

The Good

This is where we all want to live, right? In a highly performant system environment with everything highly scaled and efficient. This can be achieved by selecting a shard key for a collection which optimizes an application’s needs with the technical strategies listed above.

Wrap Up

Choosing a correct shard key and developing a sharding strategy is not always easy. Don’t make the mistake of assuming that because one application is sharded in a particular way, with a particular shard key, that approach is universal. Your application’s strategy may be entirely different, or sharding may not be the right solution for a particular collection.

Remember though, that it is important to establish a shard key and split your data early on. Sharding becomes more challenging in production with large data sets.

There are a lot of MongoDB specific terms in this post. I created a MongoDB Dictionary skill for the Amazon Echo line of products. Check it out and you can say “Alexa, ask MongoDB what is a Shard Key?” and get a helpful response.


