Setting up Java Applications to Communicate with MongoDB, Kerberos and SSL

Aug 19

By Alex Komyagin, Technical Services Engineer at MongoDB

Setting up Kerberos authentication and SSL encryption in a MongoDB Java application is not as simple as it is in other languages. In this post, I’m going to show you how to create a Kerberos and SSL enabled Java application that communicates with MongoDB.

My original setup consists of the following:

1) KDC server:

kdc.mongotest.com

kerberos config file (/etc/krb5.conf):

KDC has the following principals:

  • gssapitest@MONGOTEST.COM - user principal (for the Java app)
  • mongodb/rhel64.mongotest.com@MONGOTEST.COM - service principal (for the MongoDB server)

2) MongoDB server:

rhel64.mongotest.com

MongoDB version: 2.6.0

MongoDB config file:

This server also has the global environment variable $KRB5_KTNAME set to the path of the keytab file exported from the KDC.

Application user is configured in the admin database like this:

3) Application server: a stock OS with krb5 installed

All servers are running RHEL 6.4.

Now let’s talk about how to create a Java application, with Kerberos and SSL enabled, that will run on the application server. Here is the sample code that we will use (SSLApp.java):
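A minimal sketch of what such a program can look like, using the 2.12-series driver API and the host and user principal names from this setup (the test database and the trust store wiring are assumptions for illustration):

import com.mongodb.DB;
import com.mongodb.MongoClient;
import com.mongodb.MongoClientOptions;
import com.mongodb.MongoCredential;
import com.mongodb.ServerAddress;

import javax.net.ssl.SSLSocketFactory;
import java.util.Arrays;

public class SSLApp {
    public static void main(String[] args) throws Exception {
        // SSL: use the default SSL socket factory; the trust store created below
        // is supplied at runtime via -Djavax.net.ssl.trustStore
        MongoClientOptions options = MongoClientOptions.builder()
                .socketFactory(SSLSocketFactory.getDefault())
                .build();

        // Kerberos: authenticate the user principal via GSSAPI
        MongoCredential credential =
                MongoCredential.createGSSAPICredential("gssapitest@MONGOTEST.COM");

        MongoClient client = new MongoClient(
                new ServerAddress("rhel64.mongotest.com", 27017),
                Arrays.asList(credential),
                options);

        DB db = client.getDB("test");
        System.out.println(db.getCollectionNames());

        client.close();
    }
}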

Download the Java driver:

Install Java and the JDK:

sudo yum install java-1.7.0
sudo yum install java-1.7.0-devel

Create a certificate store for Java and store the server certificate there, so that Java knows who it should trust:
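The exact keytool invocation will depend on your environment; assuming a new trust store file called truststore.jks and the store password changeit, it looks something like this:

keytool -importcert -trustcacerts -alias mongodb -file mongodb.crt -keystore truststore.jks -storepass changeit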

(mongodb.crt is just the public certificate part of mongodb.pem)

Copy the Kerberos config file to the application server, as /etc/krb5.conf (or C:\WINDOWS\krb5.ini on Windows); otherwise you’ll have to specify the KDC and realm as Java runtime options.

Use kinit to obtain and cache a Kerberos ticket for the principal on the application server:

kinit gssapitest@MONGOTEST.COM

As an alternative to kinit, you can use JAAS to cache Kerberos credentials.

Compile and run the Java program:
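Assuming the driver jar (here mongo-java-driver-2.12.3.jar) and the truststore.jks created earlier sit in the working directory, the commands look roughly like this:

javac -cp mongo-java-driver-2.12.3.jar SSLApp.java

java -cp .:mongo-java-driver-2.12.3.jar \
     -Djavax.security.auth.useSubjectCredsOnly=false \
     -Djavax.net.ssl.trustStore=truststore.jks \
     -Djavax.net.ssl.trustStorePassword=changeit \
     SSLApp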

It is important to specify useSubjectCredsOnly=false, otherwise you’ll get the “No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)” exception from Java. As we discovered, this is not strictly necessary in all cases, but it is if you are relying on kinit to get the service ticket.

The Java driver needs to construct the MongoDB service principal name in order to request the Kerberos ticket. The service principal is constructed from the server name you provide (unless you explicitly ask it to canonicalize the server name). For example, if I change rhel64.mongotest.com to the host IP address in the connection URI, I get Kerberos exceptions like “No valid credentials provided (Mechanism level: Server not found in Kerberos database (7) - UNKNOWN_SERVER)”. So be sure you specify the same server host name that you used in the Kerberos service principal. Adding -Dsun.security.krb5.debug=true to the Java runtime options helps a lot in debugging Kerberos auth issues.

These steps should help simplify the process of connecting Java applications to MongoDB with Kerberos and SSL. Before deploying any application with MongoDB, be sure to read through our 12 tips for going into production and the Security Checklist, which outlines recommended security measures to protect your MongoDB installation. More information on configuring MongoDB security can be found in the MongoDB Manual.

For further questions, feel free to reach out to the MongoDB team through Google Groups.

Getting Started with MongoDB and Java: Part II

Aug 14

By Trisha Gee, Java Engineer and Advocate at MongoDB

In the last article, we covered the basics of installing and connecting to MongoDB via a Java application. In this post, I’ll give an introduction to CRUD (Create, Read, Update, Delete) operations using the Java driver. As in the previous article, if you want to follow along and code as we go, you can use these tips to get the tests in the Getting Started project to go green.

Creating documents

In the last article, we introduced documents and how to create them from Java and insert them into MongoDB, so I’m not going to repeat that here. But if you want a reminder, or simply want to skip to playing with the code, you can take a look at Exercise3InsertTest.

Querying

Putting stuff in the database is all well and good, but you’ll probably want to query the database to get data from it.

In the last article we covered some basics on using find() to get data from the database. We also showed an example in Exercise4RetrieveTest. But MongoDB supports more than simply getting a single document by ID or getting all the documents in a collection. As I mentioned, you can query by example, building up a query document that has a similar shape to the documents you want.

For the following examples I’m going to assume a document which looks something like this:

person = {
  _id: "anId",
  name: "A Name",
  address: {
    street: "Street Address",
    city: "City",
    phone: "12345"
  },
  books: [ 27464, 747854, ...]
}  

Find a document by ID

To recap, you can easily get a document back from the database using the unique ID:

…and you get the values out of the document (represented as a DBObject) using a Map-like syntax:
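A rough sketch of both steps, assuming collection is the DBCollection holding documents shaped like the person above:

DBObject query = new BasicDBObject("_id", "anId");
DBCursor cursor = collection.find(query);
DBObject result = cursor.one();   // exactly one document matches a unique _id

String name = (String) result.get("name");              // "A Name"
DBObject address = (DBObject) result.get("address");    // the nested subdocument
String phone = (String) address.get("phone");           // "12345"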

In the above example, because you’ve queried by ID (and you knew that ID existed), you can be sure that the cursor has a single document that matches the query. Therefore you can use cursor.one() to get it.

Find all documents matching some criteria

In the real world, you won’t always know the ID of the document you want. You could be looking for all the people with a particular name, for example.

In this case, you can create a query document that has the criteria you want:

You can find out the number of results:

and you can, naturally, iterate over them:
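Sketching those three steps together (assuming collection as before):

DBObject query = new BasicDBObject("name", "A Name");
DBCursor results = collection.find(query);

System.out.println(results.count());   // the number of matching documents

for (DBObject person : results) {      // DBCursor is Iterable, so you can loop over it
    System.out.println(person);
}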

A note on batching

The cursor will fetch results in batches from the database, so if you run a query that matches a lot of documents, you don’t have to worry that every document is loaded into memory immediately. For most queries, the first batch returned will be 101 documents. But as you iterate over the cursor, the driver will automatically fetch further batches from the server. So you don’t have to worry about managing batching in your application. But you do need to be aware that if you iterate over the whole of the cursor (for example to put it into a List), you will end up fetching all the results and putting them in memory.

You can get started with Exercise5SimpleQueryTest.

Selecting Fields

Generally speaking, you will read entire documents from MongoDB most of the time. However, you can choose to return just the fields that you care about (for example, you might have a large document and not need all the values). You do this by passing a second parameter into the find method that’s another DBObject defining the fields you want to return. In this example, we’ll search for people called “Smith”, and return only the name field. To do this we pass in a DBObject representing {name: 1}:

You can also use this method to exclude fields from the results. Maybe we want to exclude an unnecessary subdocument from the results - let’s say we want to find everyone called “Smith”, but we don’t want to return the address. We do this by passing in a zero for the field name, i.e. {address: 0}:
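A sketch of both projections (assuming collection as before):

DBObject query = new BasicDBObject("name", "Smith");

// return only the name field (_id is also returned unless explicitly excluded)
DBCursor namesOnly = collection.find(query, new BasicDBObject("name", 1));

// return every field except the address subdocument
DBCursor withoutAddress = collection.find(query, new BasicDBObject("address", 0));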

With this information, you’re ready to tackle Exercise6SelectFieldsTest.

Query Operators

As I mentioned in the previous article, your fields can be one of a number of types, including numeric. This means that you can do queries for numeric values as well. Let’s assume, for example, that our person has a numberOfOrders field, and we want to find everyone who has ordered more than, let’s say, 10 items. You can do this using the $gt operator:
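For example (numberOfOrders is a made-up field for this illustration; collection as before):

DBObject query = new BasicDBObject("numberOfOrders", new BasicDBObject("$gt", 10));
DBCursor frequentBuyers = collection.find(query);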

Note that you have to create a further subdocument containing the $gt condition to use this operator. All of the query operators are documented, and work in a similar way to this example.

You might be wondering what terrible things could happen if you try to perform some sort of numeric comparison on a field that is a String, since the database supports any type of value in any of the fields (and in Java the values are Objects so you don’t get the benefit of type safety). So, what happens if you do this?
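For example, something like this, comparing the String name field against a number:

DBObject query = new BasicDBObject("name", new BasicDBObject("$gt", 10));
DBCursor results = collection.find(query);   // no error - see below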

The answer is you get zero results (assuming all your documents contain names that are Strings), and you don’t get any errors. The flexible nature of the document schema allows you to mix and match types and query without error.

You can use this technique to get the test in Exercise7QueryOperatorsTest to go green - it’s a bit of a daft example, but you get the idea.

Querying Subdocuments

So far we’ve assumed that we only want to query values in our top-level fields. However, we might want to query for values in a subdocument - for example, with our person document, we might want to find everyone who lives in the same city. We can use dot notation like this:
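For example, to find everyone in the same city as the person above (collection as before):

DBObject query = new BasicDBObject("address.city", "City");
DBCursor sameCity = collection.find(query);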

We’re not going to use this technique in a query test, but we will use it later when we’re doing updates.

Familiar methods

I mentioned earlier that you can iterate over a cursor, and that the driver will fetch results in batches. However, you can also use the familiar-looking skip() and limit() methods. You can use these to fix up the test in Exercise8SkipAndLimitTest.
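For example, a sketch that skips the first 20 Smiths and returns the next 10 (collection as before):

DBCursor page = collection.find(new BasicDBObject("name", "Smith"))
                           .skip(20)
                           .limit(10);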

A last note on querying: Indexes

As with a traditional database, you can add indexes to improve the speed of regular queries. There’s extensive documentation on indexes which you can read at your leisure. However, it is worth pointing out that, if necessary, you can programmatically create indexes via the Java driver, using createIndex. For example:
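A sketch of two such indexes (collection as before; the fields indexed here are just for illustration):

collection.createIndex(new BasicDBObject("name", 1));           // ascending index on name
collection.createIndex(new BasicDBObject("address.city", 1));   // index on a field inside the subdocument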

There is a very simple example for creating an index in Exercise9IndexTest, but indexes are a full topic on their own, and the purpose of this part of the tutorial is to merely make you aware of their existence rather than provide a comprehensive tutorial on their purpose and uses.

Updating values

Now you can insert into and read from the database. But your data is probably not static, especially as one of the benefits of MongoDB is a flexible schema that can evolve with your needs over time.

In order to update values in the database, you’ll have to define the query criteria that states which document(s) you want to update, and you’ll have to pass in the document that represents the updates you want to make.

There are a few things to be aware of when you’re updating documents in MongoDB; once you understand these, it’s as simple as everything else we’ve seen so far.

Firstly, by default only the first document that matches the query criteria is updated.

Secondly, if you pass in a whole document as the new value, this document will replace the whole existing document. If you think about it, the common use-case will be: you retrieve something from the database; you modify it based on some criteria from your application or the user; then you save the updated document to the database.

I’ll show the various types of updates (and point you to the code in the test class) to walk you through these different cases.

Simple Update: Find a document and replace it with an updated one

We’ll carry on using our simple Person document for our examples. Let’s assume we’ve got a document in our database that looks like:

person = {
  _id: "jo",
  name: "Jo Bloggs",
  address: {
    street: "123 Fake St",
    city: "Faketon",
    phone: "5559991234"
  },
  books: [ 27464, 747854, ...]
} 

Maybe Jo goes into witness protection and needs to change his/her name. Assuming we’ve got jo populated in a DBObject, we can make the appropriate changes to the document and save it into the database:
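A rough sketch (collection as before; the new name is made up):

DBObject jo = collection.find(new BasicDBObject("_id", "jo")).one();

jo.put("name", "Jo Smith");   // change the field on the retrieved document

collection.update(new BasicDBObject("_id", "jo"), jo);   // replace the whole document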

You can make a start with Exercise10UpdateByReplacementTest.

Update Operators: Change a field

But sometimes you won’t have the whole document to replace the old one; sometimes you just want to update a single field in whichever document matches your criteria.

Let’s imagine that we only want to change Jo’s phone number, and we don’t have a DBObject with all of Jo’s details but we do have the ID of the document. If we use the $set operator, we’ll replace only the field we want to change:
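A sketch (collection as before; the new number is made up):

DBObject query = new BasicDBObject("_id", "jo");
DBObject setPhone = new BasicDBObject("$set",
        new BasicDBObject("address.phone", "5559998888"));

collection.update(query, setPhone);   // only the phone field changes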

There are a number of other operators for performing updates on documents, for example $inc which will increment a numeric field by a given amount.

Now you can do Exercise11UpdateAFieldTest.

Update Multiple

As I mentioned earlier, by default the update operation updates the first document it finds and no more. You can, however, set the multi flag on update to update everything.

So maybe we want to update everyone in the database to have a country field, and for now we’re going to assume all the current people are in the UK:
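A sketch (collection as before):

DBObject everyone = new BasicDBObject();   // an empty query matches every document
DBObject addCountry = new BasicDBObject("$set", new BasicDBObject("country", "UK"));

collection.update(everyone, addCountry, false, true);   // upsert = false, multi = true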

The query parameter is an empty document, which matches everything; the second boolean (set to true) is the flag that says to update all the documents that were found.

Now we’ve learnt enough to complete the two tests in Exercise12UpdateMultipleDocumentsTest.

Upsert

Finally, the last thing to mention when updating documents is Upsert (Update-or-Insert). This will search for a document matching the criteria and either update it if it’s there, or insert it into the database if it isn’t.

Like “update multiple”, you define an upsert operation with a magic boolean. It shouldn’t come as a surprise to find it’s the first boolean param in the update statement (since “multi” was the second):
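A sketch (collection as before; “charlie” is a made-up person who may or may not already exist):

DBObject query = new BasicDBObject("_id", "charlie");
DBObject update = new BasicDBObject("$set", new BasicDBObject("name", "Charlie"));

collection.update(query, update, true, false);   // upsert = true, multi = false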

Now you know everything you need to complete the test in Exercise13UpsertTest.

Removing from the database

Finally, the D in CRUD: Delete. The syntax of a remove should look familiar now that we’ve got this far: you pass a document that represents your selection criteria into the remove method. So if we wanted to delete jo from our database, we’d do:
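As a sketch (collection as before):

collection.remove(new BasicDBObject("_id", "jo"));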

Unlike update, if the query matches more than one document, all those documents will be deleted (something to be aware of!). If we wanted to remove everyone who lives in London, we’d need to do:
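Again as a sketch:

collection.remove(new BasicDBObject("address.city", "London"));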

That’s all there is to remove; you’re ready to finish off Exercise14RemoveTest.

Conclusion

Unlike with traditional databases, you don’t write SQL queries in MongoDB to perform CRUD operations. Instead, operations are done by constructing documents both to query the database and to define the operations to perform.

While we’ve covered what the basics look like in Java, there’s loads more documentation on all the core concepts in the MongoDB documentation:

Getting Started with MongoDB and Java: Part I

Aug 7

By Trisha Gee, Java Engineer and Advocate at MongoDB

Java is one of the most popular programming languages in the MongoDB Community. For new users, it’s important to provide an overview of how to work with the MongoDB Java driver and how to use MongoDB as a Java developer.

In this post, which is aimed at Java/JVM developers who are new to MongoDB, we’re going to give you a guide on how to get started, including:

  • Installation
  • Setting up your dependencies
  • Connecting
  • What are Collections and Documents?
  • The basics of writing to and reading from the database
  • An overview of some of the JVM libraries

Installation

The installation instructions for MongoDB are extensively documented, so I’m not going to repeat any of that here. If you want to follow along with this “getting started” guide, you’ll want to download the appropriate version of MongoDB and unzip/install it. At the time of writing, the latest version of MongoDB is 2.6.3, which is the version I’ll be using.

A note about security

In a real production environment, of course you’re going to want to consider authentication. This is something that MongoDB takes seriously and there’s a whole section of documentation on security. But for the purpose of this demonstration, I’m going to assume you’ve either got that working or you’re running in “trusted mode” (i.e. that you’re in a development environment that isn’t open to the public).

Take a look around

Once you’ve got MongoDB installed and started (a process that should only take a few minutes), you can connect to the MongoDB shell. Most of the MongoDB technical documentation is written for the shell, so it’s always useful to know how to access it, and how to use it to troubleshoot problems or prototype solutions.

When you’ve connected, you should see something like:

MongoDB shell version: 2.6.3                           
connecting to: test
> _  

Since you’re in the console, let’s take it for a spin. Firstly we’ll have a look at all the databases that are there right now:

> show dbs

Assuming this is a clean installation, there shouldn’t be much to see:

> show dbs
admin  (empty)
local  0.078GB
>

That’s great, but as you can see there’s loads of documentation on how to play with MongoDB from the shell. The shell is a really great environment for trying out queries and looking at things from the point-of-view of the server. However, I promised you Java, so we’re going to step away from the shell and get on with connecting via Java.

Getting started with Java

First, you’re going to want to set up your project/IDE to use the MongoDB Java Driver. These days IDEs tend to pick up the correct dependencies through your Gradle or Maven configuration, so I’m just going to cover configuring these.

At the time of writing, the latest version of the Java driver is 2.12.3 - this is designed to work with the MongoDB 2.6 series.

Gradle

You’ll need to add the following to your dependencies in build.gradle:

compile 'org.mongodb:mongo-java-driver:2.12.3'

Maven

For Maven, you’ll want:

<dependencies>
    <dependency>
        <groupId>org.mongodb</groupId>
        <artifactId>mongo-java-driver</artifactId>
        <version>2.12.3</version>
    </dependency>
</dependencies>

Alternatively, if you’re really old-school and like maintaining your dependencies the hard way, you can always download the JAR file.

If you don’t already have a project that you want to try with MongoDB, I’ve created a series of unit tests on github which you can use to get a feel for working with MongoDB and Java.

Connecting via Java

Assuming you’ve resolved your dependencies and you’ve set up your project, you’re ready to connect to MongoDB from your Java application.

Since MongoDB is a document database, you might not be surprised to learn that you don’t connect to it via traditional SQL/relational DB methods like JDBC. But it’s simple all the same:
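For example, a minimal sketch with the 2.12 driver:

MongoClient mongoClient = new MongoClient(new MongoClientURI("mongodb://localhost:27017"));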

Where I’ve put mongodb://localhost:27017, you’ll want to put the address of where you’ve installed MongoDB. There’s more detailed information on how to create the correct URI, including how to connect to a Replica Set, in the MongoClientURI documentation.

If you’re connecting to a local instance on the default port, you can simply use:
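That is, something like:

MongoClient mongoClient = new MongoClient();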

Note that this does throw a checked Exception, UnknownHostException. You’ll either have to catch this or declare it, depending upon what your policy is for exception handling.

The MongoClient is your route into MongoDB; from this you’ll get your database and collections to work with (more on this later). Your instance of MongoClient (e.g. mongoClient above) will ordinarily be a singleton in your application. However, if you need to connect via different credentials (different user names and passwords) you’ll want a MongoClient per set of credentials.

It is important to limit the number of MongoClient instances in your application, which is why we suggest a singleton - the MongoClient is effectively the connection pool, so for every new MongoClient you are opening a new pool. Using a single MongoClient (and optionally configuring its settings) will allow the driver to correctly manage your connections to the server. This MongoClient singleton is safe to be used by multiple threads.

One final thing you need to be aware of: you want your application to shut down the connections to MongoDB when it finishes running. Always make sure your application or web server calls MongoClient.close() when it shuts down.

Try out connecting to MongoDB by getting the test in Exercise1ConnectingTest to pass.

Where are my tables?

MongoDB doesn’t have tables, rows, columns, joins etc. There are some new concepts to learn when you’re using it, but nothing too challenging.

While you still have the concept of a database, the documents (which we’ll cover in more detail later) are stored in collections, rather than your database being made up of tables of data. But it can be helpful to think of documents like rows and collections like tables in a traditional database. And collections can have indexes like you’d expect.

Selecting Databases and Collections

You’re going to want to define which databases and collections you’re using in your Java application. If you remember, a few sections ago we used the MongoDB shell to show the databases in your MongoDB instance, and you had an admin and a local.

Creating and getting a database or collection is extremely easy in MongoDB:
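Something like the following (assuming the mongoClient from the previous section):

DB database = mongoClient.getDB("TheDatabaseName");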

You can replace "TheDatabaseName" with whatever the name of your database is. If the database doesn’t already exist, it will be created automatically the first time you insert anything into it, so there’s no need for null checks or exception handling on the off-chance the database doesn’t exist.

Getting the collection you want from the database is simple too:
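For example:

DBCollection collection = database.getCollection("TheCollectionName");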

Again, replacing "TheCollectionName" with whatever your collection is called.

If you’re playing along with the test code, you now know enough to get the tests in Exercise2MongoClientTest to pass.

An introduction to documents

Something that is, hopefully, becoming clear to you as you work through the examples in this blog is that MongoDB is different from the traditional relational databases you’ve used. As I’ve mentioned, there are collections rather than tables, and documents rather than rows and columns.

Documents are much more flexible than a traditional row, as you have a dynamic schema rather than an enforced one. You can evolve the document over time without incurring the cost of schema migrations and tedious update scripts. But I’m getting ahead of myself.

Although documents don’t look like the tables, columns and rows you’re used to, they should look familiar if you’ve done anything even remotely JSON-like. Here’s an example:

person = {
  _id: "jo",
  name: "Jo Bloggs",
  age: 34,
  address: {
    street: "123 Fake St",
    city: "Faketon",
    state: "MA",
    zip: "12345"
  },
  books: [ 27464, 747854, ...]
}  

There are a few interesting things to note:

  1. Like JSON, documents are structures of name/value pairs, and the values can be one of a number of primitive types, including Strings and various number types.
  2. It also supports nested documents - in the example above, address is a subdocument inside the person document. Unlike a relational database, where you might store this in a separate table and provide a reference to it, in MongoDB if that data benefits from always being associated with its parent, you can embed it in its parent.
  3. You can even store an array of values. The books field in the example above is an array of integers that might represent, for example, IDs of books the person has bought or borrowed.

You can find out more detailed information about Documents in the documentation.

Creating a document and saving it to the database

In Java, if you wanted to create a document like the one above, you’d do something like:
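A rough sketch using the driver’s BasicDBObject (Arrays here is java.util.Arrays):

DBObject address = new BasicDBObject("street", "123 Fake St")
        .append("city", "Faketon")
        .append("state", "MA")
        .append("zip", "12345");

DBObject person = new BasicDBObject("_id", "jo")
        .append("name", "Jo Bloggs")
        .append("age", 34)
        .append("address", address)
        .append("books", Arrays.asList(27464, 747854));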

At this point, it’s really easy to save it into your database:
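A sketch of the save, using the Examples database and people collection that appear in the shell output below:

MongoClient mongoClient = new MongoClient(new MongoClientURI("mongodb://localhost:27017"));
DB database = mongoClient.getDB("Examples");
DBCollection collection = database.getCollection("people");

collection.insert(person);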

Note that the first three lines are set-up, and you don’t need to re-initialize those every time.

Now if we look inside MongoDB, we can see that the database has been created:

> show dbs
Examples  0.078GB
admin     (empty)
local     0.078GB
> _

…and we can see the collection has been created as well:

> use Examples
switched to db Examples
> show collections
people
system.indexes
> _ 

…finally, we can see that our person, “Jo”, was inserted:

> db.people.findOne()
{
    "_id" : "jo",
    "name" : "Jo Bloggs",
        "age": 34,
    "address" : {
        "street" : "123 Fake St",
        "city" : "Faketon",
        "state" : "MA",
        "zip" : "12345"
    },
    "books" : [
        27464,
        747854
    ]
}
> _

As a Java developer, you can see the similarities between the Document that’s stored in MongoDB, and your domain object. In your code, that person would probably be a Person class, with simple primitive fields, an array field, and an Address field.

So rather than building your DBObject manually like the above example, you’re more likely to be converting your domain object into a DBObject. It’s best not to have the MongoDB-specific DBObject class in your domain objects, so you might want to create a PersonAdaptor that converts your Person domain object to a DBObject:

As before, once you have the DBObject, you can save this into MongoDB:

Now you’ve got all the basics to get the tests in Exercise3InsertTest to pass.

Getting documents back out again

Now you’ve saved a Person to the database, and we’ve seen it in the database using the shell, you’re going to want to get it back out into your Java application. In this post, we’re going to cover the very basics of retrieving a document - in a later post we’ll cover more complex querying.

You’ll have guessed by the fact that MongoDB is a document database that we’re not going to be using SQL to query. Instead, we query by example, building up a document that looks like the document we’re looking for. So if we wanted to look for the person we saved into the database, “Jo Bloggs”, we remember that the _id field had the value of “jo”, and we create a document that matches this shape:

As you can see, the find method returns a cursor for the results. Since _id needs to be unique, we know that if we look for a document with this ID, we will find only one document, and it will be the one we want:

Earlier we saw that documents are simply made up of name/value pairs, where the value can be anything from a simple String or primitive, to more complex types like arrays or subdocuments. Therefore in Java, we can more or less treat DBObject as a Map<String, Object>. So if we wanted to look at the fields of the document we got back from the database, we can get them with:
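Sketching those steps (assuming the collection used for the earlier insert):

DBObject query = new BasicDBObject("_id", "jo");
DBCursor cursor = collection.find(query);

DBObject jo = cursor.one();               // the single matching document

String name = (String) jo.get("name");    // "Jo Bloggs"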

Note that you’ll need to cast the value to a String, as the compiler only knows that it’s an Object.

If you’re still playing along with the example code, you’re now ready to take on all the tests in Exercise4RetrieveTest.

Overview of JVM Libraries

So far I’ve shown you the basics of the official Java Driver, but you’ll notice that it’s quite low-level - you have to do a lot of taking things out of your domain objects and poking them into MongoDB-shaped DBObjects, and vice-versa. If this is the level of control you want, then the Java driver makes this easy for you. But if it seems like this is extra work that you shouldn’t have to do, there are plenty of other options for you.

The tools I’m about to describe all use the MongoDB Java Driver at their core to interact with MongoDB. They provide a high-level abstraction for converting your domain objects into MongoDB documents, whilst also giving you a way to get to the underlying driver in case you need to use it at a lower level.

Morphia

Morphia is a really lightweight ODM (Object Document Mapper), so it’s similar to ORMs like Hibernate. Documents can be in a fairly similar shape to your Java domain objects, so this mapping can be automatic, but Morphia also lets you point the mapper in the right direction.

Morphia is open source, and has contributors from MongoDB. Sample code and documentation can be found here.

Spring Data

Another frequently used ODM is Spring Data. This supports traditional relational and non-relational databases, including MongoDB. If you’re already using Spring in your application, this should be a familiar way to work.

As always with Spring projects, there’s a lot of really great documentation, including a Getting Started guide with example code.

MongoJack

If you’re working with web services or something else that supports JSON, and you’re using Jackson to work with this data, it probably seems like a waste to be turning it from this form into a Java object and then into a MongoDB DBObject. But MongoJack might make your job easier, as it’s designed to map JSON objects directly into MongoDB. Take a look at the example code and documentation.

Jongo

This is another Jackson-based ODM, but it provides an interesting extra in the form of supporting queries the way you’d write them in the shell. Documentation and example code are available on the website.

Grails MongoDB GORM

The Grails web application framework also supports its own Object-Relational Mapping (GORM), including support for MongoDB. More documentation for this plugin can be found here.

Casbah

This isn’t an ODM like the other tools mentioned, but the officially supported Scala driver for MongoDB. Like the previous libraries, it uses the MongoDB Java Driver under the covers, but it provides a Scala API for application developers to work with. If you like working with Scala but are searching for an async solution, consider ReactiveMongo, a community-supported driver that provides async and non-blocking operations.

Other libraries and tools

This is far from an extensive list, and I apologise if I’ve left a favourite out. But we’ve compiled a list of many more libraries for the JVM, which includes community projects and officially supported drivers.

Conclusion

We’ve covered the basics of using MongoDB from Java - we’ve touched on what MongoDB is, and you can find out a lot more detailed information about it from the manual; we’ve installed it somewhere that lets us play with it; we’ve talked a bit about collections and documents, and what these look like in Java; and we’ve started inserting things into MongoDB and getting them back out again.

If you haven’t already started playing with the test code, you can find it in this github repository. And if you get desperate and look hard enough, you’ll even find the answers there too.

Finally, there are more examples of using the Java Driver in the Quick Tour, and there is example code in github, including examples for authentication.

If you want to learn more, try our 7-week online course, "Intro to MongoDB and Java" which started on MongoDB University this week.

Try it out, and hopefully you’ll see how easy it is to use MongoDB from Java.

MongoDB & The Soundwave Music Map

Aug 5

By David Lynch: Principal Engineer at Soundwave

Soundwave is a smartphone app that tracks music as quickly as it is played. Soundwave tracks each user’s music-listening habits across multiple platforms and streaming services1, creating a listening profile. Users follow each other, facilitating listening, sharing, discovery, and discussion of music old and new, popular and niche.

MongoDB is our database of choice. We track around 250,000 plays per day from a user base that has grown to over 1 million in the past 13 months. Soundwave has users in all time zones, making availability critical. MongoDB replica sets provide us fault tolerance and redundancy, allowing us to perform scheduled maintenance without affecting users.

We consider responsiveness to be a key part of the Soundwave user experience so we use a sharded configuration that allows us to add compute and storage resources as needed.

We use indexes liberally, and app usage patterns require maintaining a fairly large working set in memory. Sharding allows us to deploy a larger number of relatively smaller, and disproportionately cheaper, AWS instances2 while maintaining the snappy user experience we require.

The volume of data we process today is significant, but not huge. It was important for us to architect a system from the outset that would scale easily and economically.

The complexity of data retrieval is moderate but well suited to MongoDB. Our App has been featured a couple of times by both Apple and Google. At our peak rate, we handle 30,000 new sign-ups per day. Our MongoDB configuration handles this load comfortably. In fact, with respect to scale and schema, our deployment is pretty boring and by the book.

Music Map

One of Soundwaves’ most compelling features is the Music Map. Our users can draw a free-form enclosing polygon on a Google Map of any area, at any zoom level, and see what music was recently played in that area. This is captured in the screenshot below. These constraints required some interesting engineering outside the scope of the MongoDB playbook.

We’re kind of proud of our implementation. Many location aware apps with similar features reduce the user experience to some form of circular $geoNear around a point. In our minds, this is not as interesting as a randomly drawn polygon - nor is it a great user experience to promise polygons and return some other approximation of what a user wanted.

Constraints

We set a target maximum 95th percentile latency of 1.5 seconds for the map feature. From a client perspective, we decided anything over this wait time is just boring for the user. We discovered that users perform searches that are either pretty-zoomed-out - like Ireland - or pretty-zoomed-in - like Landsdowne Road.

For lots of reasons, we have clusters of data around certain areas. Big cities typically have lots of dense data points, but some cities are better represented than others. This makes pretty-zoomed-out and pretty-zoomed-in hard to quantify. It follows also that zoom-level of the map is somewhat useless as an optimization datum. We wanted all our location data to be available to the user - there are a few plays in Antarctica, but they happened a while ago - it would be a shame to cut the search space by time and miss this.

Schema Design

Our deployment contains around 90 million play actions, each one represented by a single document. For any interaction with the app, for example capturing a play, there is an associated geo-coordinate and document. Our first implementation of this feature leveraged the 2d index type with a legacy location data point associated with each action document.

Our geo-query therefore took the form of a $within statement using a polygon built of points on the free form finger-drag.
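In Java-driver terms, that query looked roughly like the following (the action collection, the location field name and the coordinates are illustrative):

DBCollection action = db.getCollection("action");   // db: the application's DB handle

// vertices of the finger-drawn polygon as [longitude, latitude] pairs
List<List<Double>> points = Arrays.asList(
        Arrays.asList(-6.27, 53.34),
        Arrays.asList(-6.24, 53.35),
        Arrays.asList(-6.25, 53.32));

DBObject query = new BasicDBObject("location",
        new BasicDBObject("$within", new BasicDBObject("$polygon", points)));

DBCursor plays = action.find(query);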

The supporting index was constructed as follows.
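Again roughly, in driver terms:

action.createIndex(new BasicDBObject("location", "2d"));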

This actually worked quite nicely out of the box. For legacy queries, MongoDB’s $within is not too fussy about self-intersecting polygons, nor is it concerned about the uniqueness of each point or the completeness of the polygon.

The query, however, was dog-slow. For dense areas at city-sized zoom levels, we could not get response times under multiple seconds for our queries. For areas where significant numbers of plays were happening, hinting an index on action time performed much better than 2d in all cases.

For areas of low activity, this strategy didn’t work particularly well. Searches at the country or continent scale didn’t complete in any reasonable time - tens of seconds - and began to affect the latency of other parts of the app. MongoDB’s explain showed scans over half, if not more, of the records in the collection for city-sized searches, yet also confirmed that the 2d index was used. This led us to conclude that the 2d index may not be powerful enough for our needs.

Migration to GeoJSON & 2dsphere

After some hints from the MongoDB community and migration to MongoDB 2.4, we decided to move to the 2dsphere index. This required a rework of our schema and a rework of our query engine. Firstly, we needed GeoJSON points.
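Roughly, each geo-tagged action gets a document with a GeoJSON point (longitude first); the collection, field names and values here are illustrative:

DBCollection actionLoc = db.getCollection("action_loc");

DBObject point = new BasicDBObject("type", "Point")
        .append("coordinates", Arrays.asList(-6.2603, 53.3498));

actionLoc.insert(new BasicDBObject("action", actionId)   // actionId: _id of the corresponding action document
        .append("loc", point));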

Next, the index:
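Again as a sketch:

actionLoc.createIndex(new BasicDBObject("loc", "2dsphere"));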

Alert readers will notice the emergence of a new collection, action_loc, in our schema. Version 2.4 2dsphere indexes do not respect the sparse property, but this was changed in 2.6. Roughly 65% of our Action collection consists of documents that do not have an associated location.

Creating the 2dsphere index on the action collection results in 57 million documents indexed on location with no benefit. In practice this resulted in an index roughly 10GB in size. Moving to a collection where all records had locations resulted in a 7.3GB decrease in the size of the index. Given that reads dominate writes in our application, we were happy to incur the latency of one additional write per play to the action_loc collection when location is available for specific actions.

GeoJSON support has some tougher constraints for queries. There can be no duplicate points. Polygons must form a closed linear ring. In practice, our client application sometimes produced duplicate points (for example, when the polygon self-intersected), and never provided a closed linear ring. Finally, self-intersecting polygons, which are a common side effect of finger-led free-form drawing and interesting random searches, are not acceptable. Technically, a self-intersecting polygon could be considered as a geometry containing n distinct non-self-intersecting polygons; multi-polygon geometries are not supported in MongoDB 2.4 but were added in 2.5.1.

The JTS Java Library helped us fix all this at the application tier. De-duplication and closing of linear rings was trivial; however, supporting self-intersecting polygons was a little trickier. The final solution involved calculating the convex hull of the search polygon. This guarantees a closed linear ring around any polygon geometry, therefore removing the chance of self-intersection. As illustrated in the graphic above, it does, however, result in a larger result set than intended.

Rather than show these points and confuse the user, we cull them from the response. This preserves the bounding-box user experience we were after. In other words, users get what they would expect. With regard to wasted resources, this is not ideal in theory, but works well in practice. Users typically draw self-intersecting polygons by accident3. The 2nd polygon is usually an order of magnitude smaller than the larger of two polygons that result. Therefore the wasted search space is typically small.

Once migrated, city and street level searches were an order of magnitude faster in generating results. We attribute this to the 2dsphere index and the s2 cursor. It still remained, though, that low zoom level searches, at state and country level, pushed the 95th percentile latency beyond 1.5 seconds on average and were often a painfully slow user experience.

Optimization

We limit results of music map searches to 100 plays. If there are more than 100, we show the 100 most recent.

Queries at a low zoom level are usually dominated by plays that have happened recently. Tests showed that for low-zoom-level, wide-area searches an inverse time index was much faster at finding records than the 2dsphere index. Hence we created two indexes: one based on time, the other the 2dsphere index on location.
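The additional index is simply a descending index on the action timestamp (field name illustrative):

actionLoc.createIndex(new BasicDBObject("time", -1));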

However, in our experience, MongoDB’s query planner always chose the 2dsphere index over the time-based index for $geoWithin queries, even when tests showed that the time-based index would provide faster results in a specific case.

MongoDB’s query planner works by periodically using multiple indexes in parallel and seeing which one returns faster. The problem is that most of the time, the right answer is to use the 2dsphere index and MongoDB only periodically re-evaluates this choice. If the best choice of index varies for two queries that have the same signature, MongoDB will make the wrong choice some percentage of the time. MongoDB makes an index choice that works more often than not, but leaves the remaining cases unacceptably slow.

Some conjunction of polygon-area and pre-generated metrics of data-point density for the area was explored but quickly abandoned due to difficulty of implementation.

Finally, a simple strategy of parallel execution was used. We run the query two ways, in parallel: one query hinting the 2dsphere index, the other hinting the time index. The winner is the query that returns results the fastest. If the loser runs for longer than 10 seconds, it is forcibly killed to save wasted resources.
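A sketch of the idea, assuming the actionLoc collection and field names from above (the GeoJSON polygon is an illustrative convex hull, and a real implementation needs error handling around the executor):

// GeoJSON polygon: one closed linear ring of [longitude, latitude] positions
final DBObject polygon = new BasicDBObject("type", "Polygon")
        .append("coordinates", Arrays.asList(Arrays.asList(
                Arrays.asList(-6.30, 53.30),
                Arrays.asList(-6.20, 53.30),
                Arrays.asList(-6.20, 53.40),
                Arrays.asList(-6.30, 53.30))));

final DBObject geoQuery = new BasicDBObject("loc",
        new BasicDBObject("$geoWithin", new BasicDBObject("$geometry", polygon)));

ExecutorService pool = Executors.newFixedThreadPool(2);

Callable<List<DBObject>> viaGeoIndex = new Callable<List<DBObject>>() {
    public List<DBObject> call() {
        return actionLoc.find(geoQuery)
                .hint(new BasicDBObject("loc", "2dsphere"))
                .sort(new BasicDBObject("time", -1))
                .limit(100)
                .toArray();
    }
};

Callable<List<DBObject>> viaTimeIndex = new Callable<List<DBObject>>() {
    public List<DBObject> call() {
        return actionLoc.find(geoQuery)
                .hint(new BasicDBObject("time", -1))
                .sort(new BasicDBObject("time", -1))
                .limit(100)
                .toArray();
    }
};

// invokeAny hands back whichever query finishes first and cancels the other;
// queries that ignore cancellation are cleaned up by the killer described below
List<DBObject> plays = pool.invokeAny(Arrays.asList(viaGeoIndex, viaTimeIndex));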

In production, we found that if we did not kill long-running speculative queries, the resulting load on the database adversely affected the read latency for other queries, and thus user experience.

With MongoDB 2.6 it’s possible to set a user-defined time limit for queries, so this step could be avoided. At the time of writing, we were not using this feature in production, so we needed a JavaScript worker that periodically killed queries on the collection that had been running for over 10 seconds. The function is below.
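A sketch of such a function, written for the mongo shell (the namespace and threshold here are illustrative, not Soundwave’s exact code):

function killLongRunningGeoQueries() {
    // look at everything currently running on the server
    var ops = db.currentOp().inprog;
    ops.forEach(function (op) {
        // only queries against the geo collection that have been running too long
        if (op.op === "query" &&
            op.ns === "soundwave.action_loc" &&
            op.secs_running > 10) {
            db.killOp(op.opid);
        }
    });
}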

Some care is needed to ensure that the long-running operations on the collection are not ones related to the redistribution of chunks across shards.

Explaining Results

In production, our inverse time index beats our 2dsphere index roughly one third of the time, typically for pretty zoomed-out searches. Intuitively, this makes sense, because a continent is large and encompasses a significant portion of the geographic space. For example, if you evaluate the 1000 most recent items, 100 of them are likely to also satisfy your geographical constraint if your geographic constraint is the size of a continent.

Using $geoWithin when very zoomed out performs poorly. Again looking at the example of a continent: at least 10% of all plays are probably within the continent, but as we only want to return the most recent data, we wind up discarding nearly all the data we pull from the database as we search for the most recent data from the continent.

In the first example [Tbl. 1], the 2dsphere index wins by a huge margin. The cells covered by the areas that are considered in the query do not have very many recent plays in relation to the global set of recent plays. Therefore, the thread using time index needs to work much harder to fill the 100 pin bucket from a global perspective, scanning a factor of 40 more records to do the same work. Meanwhile, the query that uses the 2dsphere index finds that a relatively high proportion of the plays that satisfy the geographic constraint also satisfy the time constraint.

In the second example [Tbl. 2], the probability that a recent play is also geographically compatible is high, causing the time based index to win comfortably, scanning half the records of 2dsphere and bringing us back under our desired latency limit.

To reiterate, we always use the result from the faster query. The result from the slower query is discarded if it returns prior to being killed. By speculatively going down the path of both queries, we are able to get a result back to the user within our requirements without analyzing the query and its parameters. We don’t need to understand whether it’s zoomed out or zoomed in.

Parallel Index Usage

The graphic below shows a few days’ worth of queries and the respective index used. The data points have been plotted for both winning and losing queries. For many of the points shown, results were not used in user responses, but latency was measured anyway.

The graph serves to show that it is the combination of indexes - that is, both the red and the green points - that is responsible for the p95 latency shown in the graphic below. Moreover, it illustrates roughly how the p95 might be affected should we use only one of the indexes in question. The proportion of 2dsphere searches to time searches can also be roughly deduced.

We process roughly 0.4 map searches per second on average. The p95 response time is well under 1.5 seconds. We are able to achieve these results even though we are performing every query two ways, for up to 10 seconds. The graph below shows these results over a week in March.

Conclusion

MongoDB’s geo-search capability enables one of Soundwave’s most unique features: the Music Map. The Music Map is one of the main reasons we chose MongoDB as our primary data store.

Compared to older features like automatic sharded cluster rebalancing and aggregation, geo-search and supporting documentation are still somewhat immature. That said, with some extra work we managed to build a reasonably complex search feature without having to deploy a separate GIS stack.

Furthermore, MongoDB’s strategy of incorporating the S2 library and more GeoJSON geometries leaves us confident that we can easily build upon our map feature, now and in the future. This allows us to keep location awareness front and center as a feature of our app. Hopefully this article will clear up a few pain points for users implementing similar features.

footnotes

1. Soundwave is available for iOS and Android. Installation details at http://www.soundwave.com

2. We use r3.4xlarge instances.

3. We observed queries on self-intersecting polygons about half the time. This is mostly an accident; it’s hard to draw an exact linear-ring with your finger. However, sometimes people just want to draw figures of 8.

MongoDB Innovation Award Winners

Jun 24

The MongoDB Innovation Awards recognize applications that are transforming businesses, enabling organizations to actualize trends like Big Data and the Internet of Things and attain impressive scale and agility. MongoDB customers, community members and partners nominated hundreds of organizations, tools and applications for the award. Tonight, as part of MongoDB World 2014, we announced the following award winners.

Analytics

Genentech

Genentech cut the time to introduce new genetic tests from 6-9 months to a matter of weeks. Testing instruments have gotten better at capturing complex and varied genetic data, while legacy technologies struggle to store it. Using MongoDB, Genentech can store and analyze that data more quickly, making it easier to develop medicines that matter.

eBay

eBay uses MongoDB to help ensure smooth operations for their Marketplace and PayPal businesses. The Topostore system indexes metric metadata with MongoDB to monitor and track metrics around essential internal systems including applications, systems, network devices and databases. As a result, their internal teams are able to expose real-time data and insights in an extensible visualization platform to keep operations running smoothly.

Big Apple

SumAll

SumAll is a business intelligence tool for small and medium sized businesses with over 200,000 companies on its platform across 35+ countries. They help clients derive insights from bringing all their data into single view by aggregating more than 30 business data sources (revenue, analytics, social, marketing services). With MongoDB, SumAll achieves real-time data processing at scale.

Cool Data

Met Office Weather Project

Coronal Mass Ejections (CMEs), solar flares and solar wind can impact the performance of the electricity grid, satellites, GPS systems, aviation and mobile communications. MongoDB allows the UK Met Office to collect and analyze vast amounts of different types of data - including solar flare imagery from NASA - to provide warnings of space weather events to government and business stakeholders.

Data Science

eHarmony

eHarmony uses MongoDB to help singles find their soulmates. The Compatibility Matching System crunches hundreds of traits and preferences to generate over 3 billion potential matches per day. By migrating from a relational database to MongoDB, eHarmony reduced the time to match by 95% — from 2+ weeks to under 12 hours — to make some of the happiest couples on the planet. Since migrating to MongoDB, eHarmony has increased communication between users by 30%, paid subscriptions by 50% and increased unique visitors by 60% year over year.

EA

EA SPORTS FIFA Online 3 is a free-to-play, massively multiplayer online football game, and the most popular sports game in Korea. EA SPORTS runs FIFA Online 3 on MongoDB. Players can choose to play and customize a team from any of over 30 leagues and 15,000 real world players. This large deployment includes 80 shards and over 250 servers.

Education

LinkedIn

LinkedIn is helping its employees accelerate their career growth with MongoDB. LinkedIn sought to build a learning management system with a learner-first mentality. The system, named LearnIn, needed to serve curated content to its users based on their needs and goals. With a modern database in hand, a team of 3 built LearnIn in just a few months, delivering a modern learning experience to thousands of LinkedIn employees.

Internet of Things

Bosch

Bosch Software Innovations, the software and systems house of the Bosch-Group, aims to connect billions of devices to the Internet of Things (IoT) using MongoDB. The IoT enables a vast range of applications, so to help businesses take advantage of it, Bosch Software Innovations is building a cross-industry IoT platform. As the underlying database, MongoDB stores, manages, and analyzes the data from these billions of devices and other data sources, helping Bosch and its customers bring modern IoT products to market on a repeatable and profitable basis.

Hadoop

UnitedHealth

UnitedHealth Group (UHG) uses MongoDB for its OptumInsight platform which gives business analysts in the healthcare industry the ability to reduce costs, meet compliance mandates, improve clinical performance and adapt to the changing health system landscape. Using OptumInsight, healthcare analysts can access a single view of Medicare and retirement claims data which allows them to audit, research and analyze this data in a much simpler way than before. UHG chose MongoDB for its high availability, agile development and schema flexibility features.

Open Source

3D Repo

Collaborating on 3D assets like CAD drawings is difficult. While back and forth sharing of these massive files is feasible, the process itself can be cumbersome as users often have trouble determining if they are working with the latest version. 3D Repo allows coordinated management, development, and transmission of large 3D data in the cloud. Users commit revisions to a common MongoDB repository, ensuring that interested parties can always work from a single master copy of the data. Over 100 international companies have already expressed interest in 3D Repo, including Arup, Boeing, and Lenovo.

Scale

Adobe

Adobe Experience Manager 6.0, the market leader in Web experience management, uses MongoDB to facilitate the storage of data. Experience Manager enables organizations to create, manage, and optimize customer-facing experiences such as web, mobile, social, video, and in-store. With MongoDB, Adobe Experience Manager customers can store, manage, and access petabytes of digital assets across an expanding number of channels and platforms.

Lockheed Martin

MongoDB is helping Lockheed Martin secure cyberspace. Lockheed Martin’s Advanced Threat Monitoring (ATM) not only identifies known signature-based threats, but also analyzes emails, links, attachments, DNS transactions, firewall traffic and multitudes of other data sources in real-time to track cyber events. Using MongoDB as a data store, the system stores tens of billions of documents and consumes hundreds of thousands of documents per second. By incorporating the massive volume and variety of data and analyzing it in-place, Lockheed Martin is securing its global network.

Startup

TwineHealth

Twine is a cloud-based software platform that offers collaborative care by connecting chronic disease patients and clinicians through synchronized apps that work seamlessly across devices. With MongoDB, Twine can develop their platform quickly, conceptualize and design data easily and grow to accommodate an increasing number of data points to support their mission of improving chronic disease care for both patients and healthcare providers.

Tools

Meteor

Meteor is an open-source platform for writing MongoDB-based web and mobile applications in pure JavaScript. Meteor allows developers to build apps in record time by making many common tasks simple. Logins and accounts, user-generated content, data coming from Twitter and Facebook, and performance and analytics statistics are all easy to add to a Meteor app and fit well into MongoDB.

JSON Studio

JSON Studio is democratizing data in MongoDB by making it accessible through a modern GUI. Developers, analysts, data scientists and anyone else that can use a mouse and keyboard can browse data, build queries, run aggregations, and visualize results using JSON Studio. By exposing MongoDB to every corner of the organization, JSON Studio is helping companies get more out of their data.

Thank you for your contributions and the applications you are building to challenge the world. We look forward to seeing how the community continues to realize their ideas with MongoDB.

6 Rules of Thumb for MongoDB Schema Design: Part 3

Jun 11

By William Zola, Lead Technical Support Engineer at MongoDB

This is our final stop in this tour of modeling One-to-N relationships in MongoDB. In the first post, I covered the three basic ways to model a One-to-N relationship. Last time, I covered some extensions to those basics: two-way referencing and denormalization.

Denormalization allows you to avoid some application-level joins, at the expense of having more complex and expensive updates. Denormalizing one or more fields makes sense if those fields are read much more often than they are updated.

Read part one and part two if you’ve missed them.

Whoa! Look at All These Choices!

So, to recap:

  • You can embed, reference from the “one” side, or reference from the “N” side, or combine a pair of these techniques
  • You can denormalize as many fields as you like into the “one” side or the “N” side

Denormalization, in particular, gives you a lot of choices: if there are 8 candidates for denormalization in a relationship, there are 2^8 (256) different ways to denormalize (including not denormalizing at all). Multiply that by the three different ways to do referencing, and you have over 750 different ways to model the relationship.

Guess what? You now are stuck in the “paradox of choice” — because you have so many potential ways to model a “one-to-N” relationship, your choice on how to model it just got harder. Lots harder.

Rules of Thumb: Your Guide Through the Rainbow

Here are some “rules of thumb” to guide you through these innumerable (but not infinite) choices:

  • One: favor embedding unless there is a compelling reason not to
  • Two: needing to access an object on its own is a compelling reason not to embed it
  • Three: Arrays should not grow without bound. If there are more than a couple of hundred documents on the “many” side, don’t embed them; if there are more than a few thousand documents on the “many” side, don’t use an array of ObjectID references. High-cardinality arrays are a compelling reason not to embed.
  • Four: Don’t be afraid of application-level joins: if you index correctly and use the projection specifier (as shown in part 2) then application-level joins are barely more expensive than server-side joins in a relational database.
  • Five: Consider the write/read ratio when denormalizing. A field that will mostly be read and only seldom updated is a good candidate for denormalization: if you denormalize a field that is updated frequently then the extra work of finding and updating all the instances is likely to overwhelm the savings that you get from denormalizing.
  • Six: As always with MongoDB, how you model your data depends — entirely — on your particular application’s data access patterns. You want to structure your data to match the ways that your application queries and updates it.

Your Guide To The Rainbow

When modeling “One-to-N” relationships in MongoDB, you have a variety of choices, so you have to carefully think through the structure of your data. The main criteria you need to consider are:

  • What is the cardinality of the relationship: is it “one-to-few”, “one-to-many”, or “one-to-squillions”?
  • Do you need to access the object on the “N” side separately, or only in the context of the parent object?
  • What is the ratio of updates to reads for a particular field?

Your main choices for structuring the data are:

  • For “one-to-few”, you can use an array of embedded documents
  • For “one-to-many”, or on occasions when the “N” side must stand alone, you should use an array of references. You can also use a “parent-reference” on the “N” side if it optimizes your data access pattern.
  • For “one-to-squillions”, you should use a “parent-reference” in the document storing the “N” side.
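As a quick illustration of those three shapes (the field names and values below are invented for the example):

// one-to-few: embed the "N" side as an array of subdocuments
person = {
  name: "Jo Bloggs",
  addresses: [
    { street: "123 Fake St", city: "Faketon" },
    { street: "1 Work Plaza", city: "Faketon" }
  ]
}

// one-to-many: the "one" side holds an array of references to stand-alone documents
product = {
  name: "Widget",
  part_ids: [ 1234, 5678 ]   // _ids of documents in a separate parts collection
}

// one-to-squillions: each document on the "N" side points back to its parent
logmsg = {
  message: "out of memory",
  host_id: 42   // _id of the owning host document
}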

Once you’ve decided on the overall structure of the data, then you can, if you choose, denormalize data across multiple documents, by either denormalizing data from the “One” side into the “N” side, or from the “N” side into the “One” side. You’d do this only for fields that are frequently read, get read much more often than they get updated, and where you don’t require strong consistency, since updating a denormalized value is slower, more expensive, and is not atomic.

Productivity and Flexibility

The upshot of all of this is that MongoDB gives you the ability to design your database schema to match the needs of your application. You can structure your data in MongoDB so that it adapts easily to change, and supports the queries and updates that you need to get the most out of your application.

How Buffer uses MongoDB to power its Growth Platform

Jun 9 • Posted 2 months ago

By Sunil Sadasivin, CTO at Buffer

Buffer, powered by experiments and metrics

At Buffer, every product decision we make is driven by quantitative metrics. We have always sought to be lean in our decision making, and one of the core tenets of being lean is launching experimental features early and measuring their impact.

Buffer is a social media tool to help you schedule and space out your posts on social media networks like Twitter, Facebook, Google+ and Linkedin. We started in late 2010 and thanks to a keen focus on analytical data, we have now grown to over 1.5 million users and 155k unique active users per month. We’re now responsible for sharing 3 million social media posts a week.

When I started at Buffer in September 2012 we were using a mixture of Google Analytics, Kissmetrics and an internal tool to track our app usage and analytics. We struggled to move fast and effectively measure product and feature usage with these disconnected tools. We didn’t have an easy way to generate powerful reports like cohort analysis charts or measure things like activation segmented by signup sources over time. Third party tracking services were great for us early on, but as we started to dig deeper into our app insights, we realized there was no way around it—we needed to build our own custom metrics and event tracking.

We took the plunge in April 2013 to build our own metrics framework using MongoDB. While we’ve had some bumps and growing pains setting this up, it’s been one of the best decisions we’ve made. We are now in control of all metrics and event tracking and are able to understand what’s going on with our app at a deeper level. Here’s how we use MongoDB to power our metrics framework.

Why we chose MongoDB

At the time we were evaluating datastores, we had no idea what our data would look like. When I started designing our schema, I quickly found that we needed something that would let us change the metrics we track over time and on the fly. Today, I’ll want to measure our signup funnel based on referrals, tomorrow I might want to measure some custom event and associated data that is specific to some future experiment. I needed to plan for the future, and give our developers the power to track any arbitrary data. MongoDB and its dynamic schema made the most sense for us. MongoDB’s super powerful aggregation framework also seemed perfect for creating the right views with our tracking data.

Our Metrics Framework Architecture

In our app, we’ve set up an AWS SQS queue and any data we want to track from the app goes immediately to this queue. We use SQS heavily in our app and have found it to be a great tool to manage messaging at high throughput levels. We have a simple python worker which picks off messages from this queue and writes them to our metrics database. The reason we’ve done this, instead of connecting and writing directly to our metrics MongoDB database, is that we wanted our metrics setup to have absolutely zero impact on application performance. Much like Google Analytics offers no overhead to an application, our event tracking had to do the same. The MongoDB database that would store our events would be extremely write heavy since we would be tracking anything we could think of, including every API request, page visited, Buffer user/profile/post/email created, etc. If, for whatever reason, our metrics db goes down or starts having write-locking issues, our users shouldn’t be impacted. Using SQS as a middleman would allow tracking data to queue up if any of these issues occur. SQS gives us enough time to figure out what the issue is, fix it, and then process the backlog. We’ve had quite a few times in the past year where using Amazon’s robust SQS service has saved us from losing any data during the maintenance or downtime that comes with building a high-throughput metrics framework from scratch.

We use MongoHQ to host our data. They’ve been super helpful with any challenges in scaling a db like ours. Since our setup is write heavy, we initially set up a 400GB SSD replica set. As of today (May 16) we have 90 collections and are storing over 500 million documents.

We wrote simple client libraries for tracking data for every language that we use (PHP, Python, Java, NodeJS, Javascript, Objective-C). In addition to bufferapp.com, our API, mobile apps and internal tools all plug into this framework.

Tracking events

Our event tracking is super simple. When a developer creates a new event message, our python worker creates a generic event collection (if it doesn’t exist) and stores event data that’s defined by the developer. It will store the user or visitor id, and the date that the event occurred. It’ll also store the user_joined_at date which is useful for cohort analysis.

Here are some examples of event tracking our metrics platform lets us do.

Visitor page views in the app.

Like many other apps, we want to track every visitor that hits our page. There is a bunch of data that we want to store to understand the context around the event. We’d like to know the IP address, the URI they viewed, and the user agent they’re using, among other data.

Here’s what the tracking would look like in our app written in PHP:

Here’s the corresponding result in our MongoDB metrics db:
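The exact shape depends on our client library, but a stored page-view event document might look roughly like this (field names and values are hypothetical):

{
    visitor_id : "5329c931b3",                       // hypothetical visitor id
    uri : "/app/dashboard",
    ip : "192.0.2.15",
    user_agent : "Mozilla/5.0 ...",
    date : ISODate("2014-05-16T18:02:11Z"),
    user_joined_at : ISODate("2014-02-03T09:41:00Z") // useful for cohort analysis
}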

Logging User API calls

We track every API call our clients make to the Buffer API. Essentially what we’ve done here is create query-able logging for API requests. This has been way more effective than standard web server logs and has allowed us to dig deeper into API bugs, security issues and understanding the load on our API.

Experiment data

With this type of event tracking, our developers are able to track anything by writing a single line of code. This has been especially useful for measuring events specific to a feature experiment. This frictionless process helps keep us lean: we can measure feature usage as soon as a feature is launched. For example, we recently launched a group sharing feature for business customers so that they can group their Buffer social media accounts together. Our hypothesis was that people with several social media accounts prefer to share specific content to subsets of accounts. We wanted to quantifiably validate whether this is something many would use, or whether it’s a niche or power user feature. After a week of testing this out, we had our answer.

This example shows our tracking of our ‘group sharing’ experiment. We wanted to track each group that was created with this new feature. With this, we’re able to track the user, the groups created, the name of the group, and the date it was created.
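Roughly, such an event document might look like this (field names are hypothetical):

{
    user_id : "5329c931b3",                          // hypothetical user id
    event : "group_created",
    group_id : "g_2041",
    group_name : "Marketing accounts",
    date : ISODate("2014-05-10T14:22:05Z"),
    user_joined_at : ISODate("2013-11-20T08:15:00Z")
}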

Making sense of the data

We store a lot of tracking data. While it’s great that we’re tracking all this data, there would be no point if we weren’t able to make sense of it. Our goal for tracking this data was to create our own growth dashboard so we can keep track of key metrics, and understand results of experiments. Making sense of the data was one of the most challenging parts of setting up our growth platform.

MongoDB Aggregation

We rely heavily on MongoDB’s aggregation framework. It has been super handy for things like gauging API client requests by hour, response times separated by API endpoint, number of visitors based on referrers, cohort analysis and so much more.

Here’s a simple example of how we use MongoDB aggregation to obtain our average API response times between April 8th and April 9th:
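The exact pipeline depends on our schema, but a sketch (collection and field names are hypothetical) would match the date range, then group API request events by endpoint and average their response times:

db.api_requests.aggregate([
    { $match : { date : { $gte : ISODate("2014-04-08"), $lt : ISODate("2014-04-10") } } },
    { $group : { _id : "$endpoint",                        // or null for a single overall average
                 avg_response_ms : { $avg : "$response_time" },
                 requests : { $sum : 1 } } },
    { $sort : { avg_response_ms : -1 } }
])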

With the aggregation framework, we have powerful insight into how clients are using our platform, which users are power users and a lot more. We previously created long running scripts to generate our cohort analysis reports. Now we can use MongoDB aggregation for much of this.

Running ETL jobs

We have several ETL jobs that run periodically to power our growth dashboard. This is at the core of how we make sense of our data; some of the more complex reports need this level of processing. For example, the way we measure product activation is whether someone has posted an update within a week of joining. With the way we’ve structured our data, this requires a join across two different collections. All of this processing is done in our ETL jobs. We upload the results to a separate database, which is used to power the views in our growth dashboard for faster loading.

Here are some reports on our growth dashboard that are powered by ETL jobs

Scaling Challenges and Pitfalls

We’ve faced a few different challenges and we’ve iterated to get to a point where we can make solid use out of our growth platform. Here are a few pitfalls and examples of challenges that we’ve faced in setting this up and scaling our platform.

Plan for high disk I/O and write throughput.

The DB server size and type play a key role in how quickly we can process and store events. In planning for the future we knew that we’d be tracking quite a lot of data at a fast pace, so a db with high disk write throughput was key for us. We ended up going for a large SSD replica set. This of course really depends on your application and use case. If you use an intermediate datastore like SQS, you can always start small, and upgrade db servers when you need to without any data loss.

We keep an eye on mongostat and SQS queue size almost daily to see how our writes are doing.

One of the good things about an SSD-backed DB is that disk reads are much quicker than with a spinning disk. This means it’s much more feasible to run ad hoc queries on un-indexed fields. We do this all the time whenever we have a hunch about something to dig into further.

Be mindful of the MongoDB document limit and how data is structured

Our first iteration of schema design was not scalable. True, MongoDB does not perform schema validation, but that doesn’t mean it’s not important to think about how data is structured. Originally, we tracked all events in single user_metrics and visitor_metrics collections. An event was stored as an embedded object in an array in a user document. Our hope was that we wouldn’t need to do any joins and we could effectively segment out tracking data super easily by user.

We had fields as arrays that were unbounded and could grow infinitely, causing the document size to grow. For some highly active users (and bots), after a few months of tracking data in this way, some documents in this collection would hit the 16MB document limit and fail to accept any more writes. This created various performance issues in processing updates, and in our growth worker and ETL jobs, because these huge documents were transferred over the wire. When this happened we had to move quickly to restructure our data.

Moving to a single collection per event type has been the most scalable and flexible solution.

Reading from secondaries

Some of our ETL jobs read and process a lot of data. If you end up querying documents that haven’t been read or written to recently, it is very possible that this data is not in memory and has to be read from disk. Querying this data means MongoDB will page out some documents that have been touched recently and bring the query results into memory. This will then make writing to those paged-out documents slower. It’s for this reason that we have set up our ETL and aggregation queries to read only from the secondaries in our replica set, even though they may not be consistent with the primary.

Our secondaries have a high number of faults because of paging due to reading ‘stale’ data

Visualizing results

As I mentioned before, one of the more challenging parts about maintaining our own growth platform is extracting and visualizing the data in a way that makes a lot of sense. I can’t say that we’ve come to a great solution yet. We’ve put a lot of effort into building out and maintaining our growth dashboard, and creating visualizations is the bottleneck for us today. There is really a lot of room to reduce the turnaround time. We have started to experiment a bit with using Stripe’s MoSQL to map results from MongoDB to PostgreSQL and connect with something like Chart.io to make this a bit more seamless. If you’ve come across some solid solutions for visualizing event tracking with MongoDB, I’d love to hear about them!

Event tracking for everyone!

We would love to open source our growth platform. It’s something we’re hoping to do later this year. We’ve learned a lot by setting up our own tracking platform. If you have any questions about any of this or would like to have more control of your own event tracking with MongoDB, just hit me up @sunils34

Want to help build out our growth platform? Buffer is looking to grow its growth team and reliability team!

Like what you see? Sign up for the MongoDB Newsletter and get MongoDB updates straight to your inbox

6 Rules of Thumb for MongoDB Schema Design: Part 2

Jun 5 • Posted 2 months ago

By William Zola, Lead Technical Support Engineer at MongoDB

This is the second stop on our tour of modeling One-to-N relationships in MongoDB. Last time I covered the three basic schema designs: embedding, child-referencing, and parent-referencing. I also covered the two factors to consider when picking one of these designs:

  • Will the entities on the “N” side of the One-to-N ever need to stand alone?
  • What is the cardinality of the relationship: is it one-to-few; one-to-many; or one-to-squillions?

With these basic techniques under our belt, I can move on to covering more sophisticated schema designs, involving two-way referencing and denormalization.

Intermediate: Two-Way Referencing

If you want to get a little bit fancier, you can combine two techniques and include both styles of reference in your schema, having both references from the “one” side to the “many” side and references from the “many” side to the “one” side.

For an example, let’s go back to that task-tracking system. There’s a “people” collection holding Person documents, a “tasks” collection holding Task documents, and a One-to-N relationship from Person -> Task. The application will need to track all of the Tasks owned by a Person, so we will need to reference Person -> Task.

With the array of references to Task documents, a single Person document might look like this:
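A sketch of such a document (names and the short ObjectIDs are illustrative):

> db.person.findOne()
{
    _id: ObjectID("AAF1"),
    name: "Kate Monster",
    tasks: [            // array of references to Task documents
        ObjectID("ADF9"),
        ObjectID("AE02"),
        ObjectID("AE73")
        // etc
    ]
}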

On the other hand, in some other contexts this application will display a list of Tasks (for example, all of the Tasks in a multi-person Project) and it will need to quickly find which Person is responsible for each Task. You can optimize this by putting an additional reference to the Person in the Task document.
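For instance, a Task document might then look something like this (values are illustrative):

> db.tasks.findOne()
{
    _id: ObjectID("ADF9"),
    description: "Write lesson plan",
    due_date: ISODate("2014-04-01"),
    owner: ObjectID("AAF1")     // reference back to the owning Person document
}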

This design has all of the advantages and disadvantages of the “One-to-Many” schema, but with some additions. Putting the extra ‘owner’ reference into the Task document means that it’s quick and easy to find the Task’s owner, but it also means that if you need to reassign the task to another person, you need to perform two updates instead of just one. Specifically, you’ll have to update both the reference from the Person to the Task document, and the reference from the Task to the Person. (And to the relational gurus who are reading this — you’re right: using this schema design means that it is no longer possible to reassign a Task to a new Person with a single atomic update. This is OK for our task-tracking system: you need to consider if this works with your particular use case.)

Intermediate: Denormalizing With “One-To-Many” Relationships

Beyond just modeling the various flavors of relationships, you can also add denormalization into your schema. This can eliminate the need to perform the application-level join for certain cases, at the price of some additional complexity when performing updates. An example will help make this clear.

Denormalizing from Many -> One

For the parts example, you could denormalize the name of the part into the ‘parts[]’ array. For reference, here’s the version of the Product document without denormalization.
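Something like this sketch, reusing the illustrative Product from part 1:

> db.products.findOne()
{
    name : 'left-handed smoke shifter',
    manufacturer : 'Acme Corp',
    catalog_number : 1234,
    parts : [     // array of references to Part documents
        ObjectID('AAAA'),    // reference to the #4 grommet Part
        ObjectID('F17C'),
        ObjectID('D2AA')
        // etc
    ]
}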

Denormalizing would mean that you don’t have to perform the application-level join when displaying all of the part names for the product, but you would have to perform that join if you needed any other information about a part.

While making it easier to get the part names, this would add just a bit of client-side work to the application-level join:
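A sketch of both the denormalized ‘parts[]’ array and the adjusted join (names and values are illustrative):

> db.products.findOne()
{
    name : 'left-handed smoke shifter',
    manufacturer : 'Acme Corp',
    catalog_number : 1234,
    parts : [
        { id : ObjectID('AAAA'), name : '#4 grommet' },         // Part name is denormalized
        { id : ObjectID('F17C'), name : 'fan blade assembly' },
        { id : ObjectID('D2AA'), name : 'power switch' }
        // etc
    ]
}

// Fetch the Product document identified by this catalog number
> product = db.products.findOne({catalog_number : 1234});
// Build an array containing just the Part ObjectIDs
> part_ids = product.parts.map( function(doc) { return doc.id } );
// Fetch all the Parts that are linked to this Product
> product_parts = db.parts.find({_id : { $in : part_ids } } ).toArray();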

Denormalizing saves you a lookup of the denormalized data at the cost of a more expensive update: if you’ve denormalized the Part name into the Product document, then when you update the Part name you must also update every place it occurs in the ‘products’ collection.

Denormalizing only makes sense when there’s a high ratio of reads to updates. If you’ll be reading the denormalized data frequently, but updating it only rarely, it often makes sense to pay the price of slower updates — and more complex updates — in order to get more efficient queries. As updates become more frequent relative to queries, the savings from denormalization decrease.

For example: assume the part name changes infrequently, but the quantity on hand changes frequently. This means that while it makes sense to denormalize the part name into the Product document, it does not make sense to denormalize the quantity on hand.

Also note that if you denormalize a field, you lose the ability to perform atomic and isolated updates on that field. Just like with the two-way referencing example above, if you update the part name in the Part document, and then in the Product document, there will be a sub-second interval where the denormalized ‘name’ in the Product document will not reflect the new, updated value in the Part document.

Denormalizing from One -> Many

You can also denormalize fields from the “One” side into the “Many” side:
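For example, you might copy the Product name (and perhaps the catalog number) into each Part document, as in this sketch:

> db.parts.findOne()
{
    _id : ObjectID('AAAA'),
    partno : '123-aff-456',
    name : '#4 grommet',
    product_name : 'left-handed smoke shifter',   // denormalized from the Product document
    product_catalog_number : 1234,                // ditto
    qty : 94,
    cost : 0.94,
    price : 3.99
}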

However, if you’ve denormalized the Product name into the Part document, then when you update the Product name you must also update every place it occurs in the ‘parts’ collection. This is likely to be a more expensive update, since you’re updating multiple Parts instead of a single Product. As such, it’s significantly more important to consider the read-to-write ratio when denormalizing in this way.

Intermediate: Denormalizing With “One-To-Squillions” Relationships

You can also denormalize the “one-to-squillions” example. This works in one of two ways: you can either put information about the “one” side (from the ‘hosts’ document) into the “squillions” side (the log entries), or you can put summary information from the “squillions” side into the “one” side.

Here’s an example of denormalizing into the “squillions” side. I’m going to add the IP address of the host (from the ‘one’ side) into the individual log message:
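A log entry might then look something like this (values are illustrative):

> db.logmsg.findOne()
{
    time : ISODate("2014-03-28T09:42:41.382Z"),
    message : 'cpu is on fire!',
    ipaddr : '127.66.66.66',            // denormalized from the 'hosts' document
    host : ObjectID('AAAB')             // reference to the 'one' side
}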

Your query for the most recent messages from a particular IP address just got easier: it’s now just one query instead of two.
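Something like this sketch, with the same illustrative values:

// an index on { ipaddr : 1, time : -1 } keeps this efficient
> last_5k_msg = db.logmsg.find( { ipaddr : '127.66.66.66' } ).sort( { time : -1 } ).limit(5000).toArray();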

In fact, if there’s only a limited amount of information you want to store at the “one” side, you can denormalize it ALL into the “squillions” side and get rid of the “one” collection altogether:
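A sketch of such a fully denormalized log entry:

> db.logmsg.findOne()
{
    time : ISODate("2014-03-28T09:42:41.382Z"),
    message : 'cpu is on fire!',
    ipaddr : '127.66.66.66',
    hostname : 'goofy.example.com'      // everything we need about the host, right here
}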

On the other hand, you can also denormalize into the “one” side. Lets say you want to keep the last 1000 messages from a host in the ‘hosts’ document. You could use the $each / $slice functionality introduced in MongoDB 2.4 to keep that list sorted, and only retain the last 1000 messages:
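A sketch of the update sequence (the first three variables are placeholders standing in for values from the monitoring system); note that in 2.4, using $slice inside $push requires $each and $sort:

// placeholders for values coming from the monitoring system
now = new Date();
log_message_here = 'cpu is on fire!';
log_ip = '127.66.66.66';

// find the _id for the host being updated; only return the _id field
host_doc = db.hosts.findOne( { ipaddr : log_ip }, { _id : 1 } );
host_id = host_doc._id;

// save the log message, with the parent reference and denormalized data, on the 'squillions' side
db.logmsg.save( { time : now, message : log_message_here, ipaddr : log_ip, host : host_id } );

// push the denormalized message onto the 'one' side, keeping only the latest 1000
db.hosts.update( { _id : host_id },
    { $push : { logmsgs : { $each : [ { time : now, message : log_message_here } ],
                            $sort : { time : 1 },
                            $slice : -1000 } } } );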

The log messages get saved in the ‘logmsg’ collection as well as in the denormalized list in the ‘hosts’ document: that way the message isn’t lost when it ages out of the ‘hosts.logmsgs’ array.

Note the use of the projection specification ( {_id:1} ) to prevent MongoDB from having to ship the entire ‘hosts’ document over the network. By telling MongoDB to only return the _id field, I reduce the network overhead down to just the few bytes that it takes to store that field (plus just a little bit more for the wire protocol overhead).

Just as with denormalizing in the “One-to-Many” case, you’ll want to consider the ratio of reads to updates. Denormalizing the log messages into the Host document makes sense only if log messages are infrequent relative to the number of times the application needs to look at all of the messages for a single host. This particular denormalization is a bad idea if you want to look at the data less frequently than you update it.

Recap

In this post, I’ve covered the additional choices that you have past the basics of embed, child-reference, or parent-reference.

  • You can use bi-directional referencing if it optimizes your schema, and if you are willing to pay the price of not having atomic updates
  • If you are referencing, you can denormalize data either from the “One” side into the “N” side, or from the “N” side into the “One” side

When deciding whether or not to denormalize, consider the following factors:

  • You cannot perform an atomic update on denormalized data
  • Denormalization only makes sense when you have a high read to write ratio

Next time, I’ll give you some guidelines to pick and choose among all of these options.

Efficient Indexing in MongoDB 2.6

Jun 4 • Posted 2 months ago

By Osmar Olivo, Product Manager at MongoDB

One of the most powerful features of MongoDB is its rich indexing functionality. Users can specify secondary indexes on any field, compound indexes, geospatial, text, sparse, TTL, and others. Having extensive indexing functionality makes it easier for developers to build apps that provide rich functionality and low latency.

MongoDB 2.6 introduces a new query planner, including the ability to perform index intersection. Prior to 2.6 the query planner could only make use of a single index for most queries. That meant that if you wanted to query on multiple fields together, you needed to create a compound index. It also meant that if there were several different combinations of fields you wanted to query on, you might need several different compound indexes.

Each index adds overhead to your deployment - indexes consume space, on disk and in RAM, and indexes are maintained during updates, which adds disk IO. In other words, indexes improve the efficiency of many operations, but they also come at a cost. For many applications, index intersection will allow users to reduce the number of indexes they need while still providing rich features and low latency.

In the following sections we will take a deep dive into index intersection and how it can be applied to applications.

An Example - The Phone Book

Let’s take the example of a phone book with the following schema.

{
    FirstName
    LastName
    Phone_Number
    Address
}

If I were to search for “Smith, John” how would I index the following query to be as efficient as possible?

db.phonebook.find({ FirstName : "John", LastName : "Smith" })

I could use an individual index on FirstName and search for all of the “Johns”.

This would look something like ensureIndex( { FirstName : 1 } )

We run this query and we get back 200,000 John Smiths. Looking at the explain() output below however, we see that we scanned 1,000,000 “Johns” in the process of finding 200,000 “John Smiths”.

> db.phonebook.find({ FirstName : "John", LastName : "Smith"}).explain()
{
    "cursor" : "BtreeCursor FirstName_1",
    "isMultiKey" : false,
    "n" : 200000,
    "nscannedObjects" : 1000000,
    "nscanned" : 1000000,
    "nscannedObjectsAllPlans" : 1000101,
    "nscannedAllPlans" : 1000101,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 2,
    "nChunkSkips" : 0,
    "millis" : 2043,
    "indexBounds" : {
        "FirstName" : [
            [
                "John",
                "John"
            ]
        ]
    },
    "server" : "Oz-Olivo-MacBook-Pro.local:27017"
}

How about creating an individual index on LastName?

This would look something like ensureIndex( { LastName : 1 } )

Running this query we get back 200,000 “John Smiths” but our explain output says that we now scanned 400,000 “Smiths”. How can we make this better?

db.phonebook.find({ FirstName : "John", LastName : "Smith"}).explain()
{
    "cursor" : "BtreeCursor LastName_1",
    "isMultiKey" : false,
    "n" : 200000,
    "nscannedObjects" : 400000,
    "nscanned" : 400000,
    "nscannedObjectsAllPlans" : 400101,
    "nscannedAllPlans" : 400101,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 1,
    "nChunkSkips" : 0,
    "millis" : 852,
    "indexBounds" : {
        "LastName" : [
            [
                "Smith",
                "Smith"
            ]
        ]
    },
    "server" : "Oz-Olivo-MacBook-Pro.local:27017"
}

So we know that there are 1,000,000 “John” entries, 400,000 “Smith” entries, and 200,000 “John Smith” entries in our phonebook. Is there a way that we can scan just the 200,000 we need?

In the case of a phone book this is somewhat simple; since we know that we want it to be sorted by LastName, FirstName, we can create a compound index on them, like the one below.

ensureIndex( { LastName : 1, FirstName : 1 } )

db.phonebook.find({ FirstName : "John", LastName : "Smith"}).explain()
{
    "cursor" : "BtreeCursor LastName_1_FirstName_1",
    "isMultiKey" : false,
    "n" : 200000,
    "nscannedObjects" : 200000,
    "nscanned" : 200000,
    "nscannedObjectsAllPlans" : 200000,
    "nscannedAllPlans" : 200000,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "millis" : 370,
    "indexBounds" : {
        "LastName" : [
            [
                "Smith",
                "Smith"
            ]
        ],
        "FirstName" : [
            [
                "John",
                "John"
            ]
        ]
    },
    "server" : "Oz-Olivo-MacBook-Pro.local:27017"
}

Looking at the explain on this, we see that the index only scanned the 200,000 documents that matched, so we got a perfect hit.

Beyond Compound Indexes

The compound index is a great solution in the case of a phonebook in which we always know how we are going to be querying our data. Now what if we have an application in which users can arbitrarily query for different fields together? We can’t possibly create a compound index for every possible combination because of the overhead imposed by indexes, as we discussed above, and because MongoDB limits you to 64 indexes per collection. Index intersection can really help.

Imagine the case of a medical application which doctors use to filter through patients. At a high level, omitting several details, a basic schema may look something like the below.

{
      Fname
      LName
      SSN
      Age
      Blood_Type
      Conditions : [] 
      Medications : [ ]
      ...
      ...
}

Some sample searches that a doctor/nurse may run on this system would look something like the below.

Find me a Patient with Blood_Type = O under the age of 50

db.patients.find( { Blood_Type : "O", Age : { $lt : 50 } } )

Find me all patients over the age of 60 on Medication X

db.patients.find( { Medications : "X", Age : { $gt : 60 } } )

Find me all Diabetic patients on medication Y

db.patients.find( { Conditions : "Diabetes", Medications : "Y" } )

With all of the unstructured data in modern applications, along with the desire to be able to search for things as needed in an ad-hoc way, it can become very difficult to predict usage patterns. Since we can’t possibly create compound indexes for every combination of fields, because we don’t necessarily know what those will be ahead of time, we can try indexing individual fields to try to salvage some performance. But as shown above in our phone book application, this can lead to performance issues in which we pull documents into memory that are not matches.

To avoid the paging of unnecessary data, the new index intersection feature in 2.6 increases the overall efficiency of these types of ad-hoc queries by processing the indexes involved individually and then intersecting the result set to find the matching documents. This means you only pull the final matching documents into memory and everything else is processed using the indexes. This processing will utilize more CPU, but should greatly reduce the amount of IO done for queries where all of the data is not in memory as well as allow you to utilize your memory more efficiently.

For example, looking at the earlier example:

db.patients.find( { Blood_Type : "O", Age : { $lt : 50 } } )

It is inefficient to find all patients with Blood_Type : "O" (which could be millions) and then pull each document into memory to find the ones with Age < 50, or vice versa.

Instead, the query planner finds all patients with Blood_Type : "O" using the index on Blood_Type, and all patients with Age < 50 using the index on Age, and then only pulls the intersection of these 2 result sets into memory. The query planner only needs to fit the subsets of the indexes in memory, instead of pulling in all of the documents. This in turn causes less paging, and less thrashing of the contents of memory, which will yield overall better performance.
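As a sketch, the ad-hoc query above can be served by two single-field indexes rather than a dedicated compound index; when the 2.6 planner chooses an intersected plan, the explain() output typically reports a "Complex Plan" cursor:

// two single-field indexes instead of one compound index per query shape
db.patients.ensureIndex( { Blood_Type : 1 } )
db.patients.ensureIndex( { Age : 1 } )

// in 2.6 the planner can answer this by intersecting the two indexes
db.patients.find( { Blood_Type : "O", Age : { $lt : 50 } } ).explain()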

Index intersection allows for much more efficient use of existing RAM, so less total memory will usually be required to fit the working set than previously. Also, if you had several compound indexes that were made up of different combinations of fields, then you can reduce the total number of indexes on the system. This means storing fewer indexes in memory as well as achieving better insert/update performance since fewer indexes must be updated.

As of version 2.6.0, you cannot intersect with geo or text indexes, and you can intersect at most 2 separate indexes with each other. These limitations are likely to change in a future release.

Optimizing Multi-key Indexes

It is also possible to intersect an index with itself in the case of multi-key indexes. Consider the query below:

Find me all patients with Diabetes & High Blood Pressure

db.patients.find( { Conditions : { $all : [ "Diabetes", "High Blood Pressure" ] } } )

In this case we will find the result set of all Patients with Diabetes, and the result set of all patients with High Blood Pressure, and intersect the two to get all patients with both. Again, this requires less memory and less disk IO, for better overall performance. As of the 2.6.0 release, an index can intersect with itself up to 10 times.

Do We Still Need Compound Indexes?

To be clear, compound indexing will ALWAYS be more performant IF you know what you are going to be querying on and can create one ahead of time. Furthermore, if your working set is entirely in memory, then you will not reap any of the benefits of Index Intersection as it is primarily based on reducing IO. But in a more ad-hoc case where one cannot predict the shape of the queries and the working set is much larger than available memory, index intersection will automatically take over and choose the most performant path.

MongoDB Security Part II: 10 mistakes that can compromise your database

Jun 3 • Posted 2 months ago

This is the second in our 3-part series on MongoDB Security by Andreas Nilsson, Lead Security Engineer at MongoDB

This post outlines 10 things to avoid when configuring security for MongoDB. These types of mistakes can lead to the loss of sensitive data, disrupted operations and have the potential to put entire companies out of business. These recommendations are based on my experience working with MongoDB users, and building security systems for databases and financial services organizations. Items are ordered by a combination of severity and frequency.

Mistake #1: Directly exposing a MongoDB server to the Internet

It is surprisingly common to deploy MongoDB database servers directly online or in a DMZ. The MongoDB server network interface is designed to be stable under rogue conditions but exposing the database to the Internet is an unnecessary risk. This holds true for any backend system and is not specific for MongoDB.

If public network exposure is combined with lack of access control, the entire content of the database is up for grabs for anyone who cares to look. In addition, an attacker could intentionally or accidentally change the database configuration, modify the application behavior or perform a Denial of Service (DoS) attack.

Recommendation

Design web applications with a multi-tier architecture in mind, use firewalls to segment the network layers appropriately, and place your database server at the inner data storage layer.

Mistake #2: No access control

Access control is not enabled by default when installing MongoDB. If access control is not enabled, anyone with network access to the server can perform any operation. This includes extracting all data and configuration, running arbitrary JavaScript using the eval command, modifying any data in the database, creating and removing shards, etc.

Recommendation

Always enable access control, unless it is guaranteed that no untrusted entities will gain access to the server; see the section on Role-Based Access Control in the MongoDB security manual.

Mistake #3 - Not enabling SSL

It is fairly straightforward to protect the network communication using SSL, and enabling SSL in MongoDB does not impact the database performance. SSL also protects against man-in-the-middle attacks, where an attacker would intercept and modify communication between two parties.

Recommendation

Enable SSL to protect network communication against eavesdropping between the clients and the servers and within MongoDB clusters and replica sets.

Mistake #4 - Unnecessary exposure of interfaces

MongoDB ships with an HTTP server and REST interface. By default this interface is turned off in MongoDB 2.6. Do not enable the HTTP server interface unless it is used for backwards compatibility. Instead use the wire API for communication with the server.

We also recommend binding only to necessary network interfaces and turning off server-side JavaScript execution if it is not needed.

Recommendation

Run MongoDB with secure configuration options as described in the documentation.

Mistake #5 - Poor user account configuration

There are a few ways to configure user accounts incorrectly, for MongoDB as well as for other systems. These include but are not limited to:

  • Use of a single high-privilege admin account for all purposes.
  • Granting high privileges and roles to users who do not need them.
  • Use of weak passwords or the same password for multiple accounts.
  • Orphaned user accounts belonging to decommissioned users or applications.

Recommendation

Use the principle of least privilege when configuring user accounts and utilize the flexibility available in the MongoDB access control system. Use unique, complex passwords and be mindful to decommission deprecated user accounts.

Mistake #6 - Insecure OS privileges

Running the mongod or mongos processes using a non-dedicated, high-privilege account like root puts your Operating System at unnecessary risk. Instead use a dedicated, special purpose account.

Avoid lax, insecure OS file permissions on:

  • Data files
  • Keyfile
  • SSL private key files
  • Log files

Recommendation

Database data files, the keyfile and SSL private key files should only be readable by the mongod/mongos user. Log files should only be writable by the mongod/mongos user and readable only by root.

Mistake #7 - Insecure replica set keyfile configuration

The content of the keyfile used for authentication in sharded clusters and replica sets is in essence a password and, as such, should be long and of high entropy. Avoid:

  • Short or low-entropy password in the keyfile
  • Inadequate protection of the keyfile

Recommendation

Use a long, complex password if using a keyfile and protect it using adequate file permissions.

Mistake #8 - Poor SSL configuration

SSL is a complex protocol that needs to be configured properly to avoid leaving unexpected security holes.

Recommendation

Always provide MongoDB servers or the mongo shell with one or several CA certificates to establish a basis of trust.

Do not use self-signed certificates unless you are only looking for the encryption parts of SSL. Using a self-signed certificate invalidates substantial parts of SSL. Use certificates issued by a commercial or internal Certificate Authority.

Avoid using the same certificate across servers or clients. This exposes the private key in multiple places and unless a wildcard (*) certificate is used no hostname validation can be performed.

Mistake #9 - Unprotected backups

Care should be taken to adequately protect backup files generated by copying the data files or using the mongodump tool. If the content of the database is sensitive, the content of the backup is equally sensitive and should be treated as such.

Recommendation

Treat database backup files with the same level of care as the original database storage files.

Mistake #10 - Conscious or unconscious ignorance

A guaranteed way to create an insecure system is to ignore the topic altogether, or hope someone else thinks about it.

Recommendation

Before deploying a MongoDB instance with sensitive data, please consult the MongoDB Security Manual and the MongoDB Security Architecture Whitepaper and stay conscious about potential threats to your application.

MongoDB subscriptions provide access to additional enterprise grade capabilities. The subscription includes all the ease-of-use, broad driver support and scalability features of MongoDB, while addressing the more demanding security and certification requirements of corporate and government information security environments. To see more, download the development version of the MongoDB Enterprise edition here

6 Rules of Thumb for MongoDB Schema Design: Part 1

May 29 • Posted 2 months ago

By William Zola, Lead Technical Support Engineer at MongoDB

“I have lots of experience with SQL, but I’m just a beginner with MongoDB. How do I model a one-to-N relationship?” This is one of the more common questions I get from users attending MongoDB office hours.

I don’t have a short answer to this question, because there isn’t just one way, there’s a whole rainbow’s worth of ways. MongoDB has a rich and nuanced vocabulary for expressing what, in SQL, gets flattened into the term “One-to-N”. Let me take you on a tour of your choices in modeling One-to-N relationships.

There’s so much to talk about here, I’m breaking this up into three parts. In this first part, I’ll talk about the three basic ways to model One-to-N relationships. In the second part I’ll cover more sophisticated schema designs, including denormalization and two-way referencing. And in the final part, I’ll review the entire rainbow of choices, and give you some suggestions for choosing among the thousands (really — thousands) of choices that you may consider when modeling a single One-to-N relationship.

Many beginners think that the only way to model “One-to-N” in MongoDB is to embed an array of sub-documents into the parent document, but that’s just not true. Just because you can embed a document, doesn’t mean you should embed a document.

When designing a MongoDB schema, you need to start with a question that you’d never consider when using SQL: what is the cardinality of the relationship? Put less formally: you need to characterize your “One-to-N” relationship with a bit more nuance: is it “one-to-few”, “one-to-many”, or “one-to-squillions”? Depending on which one it is, you’d use a different format to model the relationship.

Basics: Modeling One-to-Few

An example of “one-to-few” might be the addresses for a person. This is a good use case for embedding — you’d put the addresses in an array inside of your Person object:
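A sketch (names and addresses are made up):

> db.person.findOne()
{
    name : 'Kate Monster',
    ssn : '123-456-7890',
    addresses : [
        { street : '123 Sesame St', city : 'Anytown', cc : 'USA' },
        { street : '123 Avenue Q', city : 'New York', cc : 'USA' }
    ]
}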

This design has all of the advantages and disadvantages of embedding. The main advantage is that you don’t have to perform a separate query to get the embedded details; the main disadvantage is that you have no way of accessing the embedded details as stand-alone entities.

For example, if you were modeling a task-tracking system, each Person would have a number of Tasks assigned to them. Embedding Tasks inside the Person document would make queries like “Show me all Tasks due tomorrow” much more difficult than they need to be. I will cover a more appropriate design for this use case in the next post.

Basics: One-to-Many

An example of “one-to-many” might be parts for a product in a replacement parts ordering system. Each product may have up to several hundred replacement parts, but never more than a couple thousand or so. (All of those different-sized bolts, washers, and gaskets add up.) This is a good use case for referencing — you’d put the ObjectIDs of the parts in an array in the product document. (For these examples I’m using 2-byte ObjectIDs because they’re easier to read: real-world code would use 12-byte ObjectIDs.)

Each Part would have its own document:
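Something like this, with illustrative values:

> db.parts.findOne()
{
    _id : ObjectID('AAAA'),
    partno : '123-aff-456',
    name : '#4 grommet',
    qty : 94,
    cost : 0.94,
    price : 3.99
}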

Each Product would have its own document, which would contain an array of ObjectID references to the Parts that make up that Product:
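A sketch, continuing with illustrative values:

> db.products.findOne()
{
    name : 'left-handed smoke shifter',
    manufacturer : 'Acme Corp',
    catalog_number : 1234,
    parts : [     // array of references to Part documents
        ObjectID('AAAA'),    // reference to the #4 grommet above
        ObjectID('F17C'),    // reference to a different Part
        ObjectID('D2AA')
        // etc
    ]
}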

You would then use an application-level join to retrieve the parts for a particular product:
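For example, a two-step join in the shell:

// Fetch the Product document identified by this catalog number
> product = db.products.findOne({catalog_number : 1234});
// Fetch all the Parts that are linked to this Product
> product_parts = db.parts.find({_id : { $in : product.parts } } ).toArray();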

For efficient operation, you’d need to have an index on ‘products.catalog_number’. Note that there will always be an index on ‘parts._id’, so that query will always be efficient.

This style of referencing has a complementary set of advantages and disadvantages to embedding. Each Part is a stand-alone document, so it’s easy to search them and update them independently. One trade off for using this schema is having to perform a second query to get details about the Parts for a Product. (But hold that thought until we get to denormalizing in part 2.)

As an added bonus, this schema lets you have individual Parts used by multiple Products, so your One-to-N schema just became an N-to-N schema without any need for a join table!

Basics: One-to-Squillions

An example of “one-to-squillions” might be an event logging system that collects log messages for different machines. Any given host could generate enough messages to overflow the 16 MB document size, even if all you stored in the array was the ObjectID. This is the classic use case for “parent-referencing” — you’d have a document for the host, and then store the ObjectID of the host in the documents for the log messages.
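A sketch of the two document shapes (values are illustrative):

> db.hosts.findOne()
{
    _id : ObjectID('AAAB'),
    name : 'goofy.example.com',
    ipaddr : '127.66.66.66'
}

> db.logmsg.findOne()
{
    time : ISODate("2014-03-28T09:42:41.382Z"),
    message : 'cpu is on fire!',
    host : ObjectID('AAAB')       // reference to the Host document
}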

You’d use a (slightly different) application-level join to find the most recent 5,000 messages for a host:
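A sketch, assuming an index on ‘logmsg.host’ (and on ‘hosts.ipaddr’) to keep both steps efficient:

// find the parent 'host' document
> host = db.hosts.findOne( { ipaddr : '127.66.66.66' } );
// find the most recent 5000 log message documents linked to that host
> last_5k_msg = db.logmsg.find( { host : host._id } ).sort( { time : -1 } ).limit(5000).toArray();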

Recap

So, even at this basic level, there is more to think about when designing a MongoDB schema than when designing a comparable relational schema. You need to consider two factors:

  • Will the entities on the “N” side of the One-to-N ever need to stand alone?
  • What is the cardinality of the relationship: is it one-to-few; one-to-many; or one-to-squillions?

Based on these factors, you can pick one of the three basic One-to-N schema designs:

  • Embed the N side if the cardinality is one-to-few and there is no need to access the embedded object outside the context of the parent object
  • Use an array of references to the N-side objects if the cardinality is one-to-many or if the N-side objects should stand alone for any reasons
  • Use a reference to the One-side in the N-side objects if the cardinality is one-to-squillions

Next time we’ll see how to use two-way relationships and denormalization to enhance the performance of these basic schemas.

Appboy’s co-founder and CIO Jon Hyman discusses how the leading platform for app marketing automation leverages MongoDB and ObjectRocket for real-time data aggregation and scale and gives a preview of his talk with Kenny Gorman of ObjectRocket at MongoDB World.

Want to see more? MongoDB World will feature over 80 MongoDB experts from around the world. Early bird ticket prices for the event end May 23. Register now to grab your seat

MongoDB Security Part 1 - Design and Configuration

May 21 • Posted 3 months ago

By Andreas Nilsson, Lead Security Engineer at MongoDB

With increased regulatory compliance, heightened concerns around privacy and growing risk from hackers and organized crime, the need to secure access to data has never been more urgent.

MongoDB 2.6 provides a number of features to facilitate building secure applications, such as auditing and authentication with Kerberos and LDAP. MongoDB now features a more competent and complete role-based access control system, x.509 authentication, an improved SSL stack and revamped security documentation.

In a short series of blog posts I will attempt to explain the philosophy and design of the security model of MongoDB. The first post covers the basics of securing a MongoDB server and application and gives an overview of the options available. The second post lists the most common security mistakes when configuring MongoDB. The final post is a deep dive into the authentication and authorization subsystems, specifically covering sharded systems with multiple databases and how to use the new Role-Based Access Control system.

For details of the MongoDB security architecture and a complete list of features, please refer to the MongoDB Security Architecture Whitepaper.

Security Model

Before going into details let’s start with the basics. A database security model needs to offer a basic set of features:

  • control of read and write access to data
  • protection of the integrity and confidentiality of the data stored.
  • control of modifications to the database system configuration.
  • privilege levels for different user types, administrators, applications etc.
  • auditing of sensitive operations.
  • stable and secure operation in a potentially hostile environment.

These security requirements can be achieved in different ways. A disconnected database server in a locked room would constitute a secure deployment regardless of how the database was being protected. However, it would not be very useful. A database is often placed unprotected on a “secured”, internal network. This is an idealized scenario since no network is entirely secure, architecture changes over time, and a considerable number of successful breaches are from internal sources. A defense-in-depth approach is therefore recommended when implementing your application’s infrastructure. While MongoDB’s newest security features help to improve your overall security posture, security is only as strong as the weakest link in the chain.

A conceptual view of the MongoDB security architecture is represented in the image below. The security model is divided into the four pillars of authentication, authorization, auditing and encryption.

Secure Your Deployment

In the discussion below we will assume a simple web application which reads from and writes to a replicated database. Each of the steps below is described in greater detail in the documentation and in the MongoDB Security Checklist. The focus is primarily on enabling access control and transport encryption.

Infrastructure Prerequisites

Design the application to work in a multilayer fashion, and place the database server(s) on a dedicated network segment, isolated from the DMZ where the web application resides. Configure firewall rules to limit network access to the database server. Lock down MongoDB user and file permissions. The database files should be protected from unauthorized access and the mongod/mongos daemons should be running with minimum privileges, specifically not as root.

Enable Access Control

By default there are no users configured in MongoDB. In order to enable access control, users need to be created and assigned appropriate privileges. Access control is enabled using the --auth or --keyfile command line flags. When access control is enabled, clients and drivers are required to authenticate to the server, and servers are required to authenticate to each other.

The following is the recommended series of steps to enable access control.

Design

  1. Determine which type of authentication method to use for client authentication. The options are challenge-response based username/password (MONGODB-CR), x.509, LDAP and Kerberos.
  2. Determine which type of authentication method to use for server-server cluster authentication. The options are username/password and x.509. Please note that x.509 requires the use of SSL.
  3. List the different user types that will exist in the system: administrators, support staff, different types of application users, etc.
  4. For each of the user types, determine which built-in roles are required. Optionally, create customized roles tailored for your deployment.

Deployment

  1. Configure a keyfile or x.509 certificates for the cluster nodes.
  2. Start the mongod servers with the --auth flag set and the appropriate cluster authentication options, as determined in step 2 of the design phase.
  3. Start the mongos servers with the appropriate cluster authentication options.
  4. Create the desired users with correct permissions (a sketch follows this list). Please note that after creating the first user, access control is enabled. Therefore, at a minimum, the first user should have the userAdminAnyDatabase or root role in order to be able to create other users.
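For example, a minimal first administrative user could be created from the mongo shell like this (the user name and password are placeholders):

use admin
db.createUser({
    user: "siteAdmin",                         // placeholder name
    pwd: "choose-a-long-random-password",      // placeholder password
    roles: [ { role: "userAdminAnyDatabase", db: "admin" } ]
})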

Enable Transport Encryption

In order to protect the network traffic, SSL should be enabled between clients and the server and in between servers. Enabling SSL is well described in the security documentation http://docs.mongodb.org/master//tutorial/configure-ssl/

MongoDB supports the use of any server SSL certificate as long as the corresponding root CA certificate is provided with the configuration parameter --sslCAFile. If no CA certificate is specified, no certificate validation is performed and the certificate is only used for encryption purposes. Although supported, use of self-signed certificates is not recommended, since there is no basis for trust, and hence no certificate validation can be performed.

There are several different ways to configure SSL with MongoDB.

Mutual Validation

In MongoDB 2.4 the recommended configuration was to issue SSL certificates to the server and to the connecting clients, enabling mutual validation. The server certificate is validated against the CA certificate file provided on the client side, and the client certificates are validated against the CA certificate provided on the server side. However, no authentication is performed.

In order to allow clients to connect without a certificate, the server can be started with the command line flag --sslWeakCertificateValidation.

Mutual validation is still supported, but MongoDB 2.6 includes several new SSL options.

X.509 Authentication

If SSL is enabled, clients can use the new authentication mechanism MONGODB-X509 to authenticate using x.509 certificates.

It is also possible to use x.509 authentication between the servers in a cluster. From a security perspective this is a great improvement to the default keyfile solution.

Mixed Mode SSL

MongoDB 2.6 introduces the option of mixing encrypted and non-encrypted connections. That is, the server will listen for and detect both SSL and non-SSL inbound connections.

This feature enables cluster members running SSL to talk to non-SSL nodes and vice versa. It also enables a rolling configuration “upgrade” from a non-SSL to an SSL cluster without downtime.

The mixed mode behavior is controlled by the --sslMode parameter. From a security hardening perspective SSL mixed mode should be turned off, unless explicitly needed for one of the two scenarios discussed above.

Disable Unused Exposed Interfaces

Disable sensitive interfaces and functionality that is not needed.

MongoDB supports the execution of JavaScript code for certain server-side operations: mapReduce, group, eval, and $where. If you do not use these operations, disable server-side scripting by setting --noscripting to true.

Use only the MongoDB wire protocol on production deployments. The following interfaces are disabled by default in MongoDB 2.6: --httpinterface, --jsonp, and --rest. Leave these disabled, unless required for backwards compatibility. If using MongoDB 2.4, disable the HTTP interface using --nohttpinterface.

The bind_ip setting for mongod and mongos instances limits the network interfaces on which MongoDB programs will listen for incoming connections. Configure the server only to bind on desired interfaces.

Auditing

MongoDB Enterprise logs all administrative actions made against the database. Schema operations (such as creating or dropping databases, collections and indexes), replica set reconfiguration events along with authentication and authorization activities are all captured, along with the administrator’s identity and timestamp of the operation, enabling compliance and security analysis.

By default, MongoDB auditing logs all administrative actions, but can also be configured with filters to capture only specific events. The audit log can be written to multiple destinations in a variety of formats including to the console and syslog (in JSON format), and to a file (JSON or BSON), which can then be loaded to MongoDB and analyzed to find relevant events.

MongoDB Maintains an Audit Trail of Administrative Actions Against the Database

Each MongoDB server logs events to its local destination. The DBA can then merge these into a single log, enabling a cluster-wide view of operations that affected multiple nodes.

The MongoDB auditing documentation includes information on how to configure auditing and all of the operations that can be captured.

Summary

Securing the database layer of an application is a necessary step to protect the data from unauthorized access. MongoDB offers a flexible and competent security model, but as with all security solutions, care should be taken to enable and configure the system correctly.

In part 2, we will closely examine some common configuration mistakes and security pitfalls based on a number of existing MongoDB deployments and users.

MongoDB subscriptions provide access to additional enterprise grade capabilities. The subscription includes all the ease-of-use, broad driver support and scalability features of MongoDB, while addressing the more demanding security and certification requirements of corporate and government information security environments. To see more, download the development version of the MongoDB Enterprise edition here

Powering Social Insights with MongoDB at uberVu

May 19 • Posted 3 months ago

This is a guest post from the uberVU team.

Today, more than ever, marketers are being asked to show how their financial investments are driving tangible business results. We help them accomplish that goal. uberVU is a real-time social media marketing platform that allows organizations to leverage their social media data to better understand, connect with, and grow their online communities. We have an extensive client list including enterprise customers such as Heinz, NBC, World Bank, and Fujitsu.

We were recently acquired by HootSuite, and together our two products offer a complete and integrated feature set that addresses the entire social media lifecycle:

  • monitoring
  • metrics
  • reporting
  • engagement
  • collaboration

We have a five-year history in the social media monitoring market, and our evolving data storage architecture has played a key role in elevating our application’s value and user experience. For our data handling needs, we started with Tokyo Cabinet, SimpleDB, and MySQL and now use MongoDB, DynamoDB, S3, Glacier, and ElasticSearch.

Originally our team intended to use MongoDB as a secondary data store, but after a short implementation and adoption period of 3 months in which it quickly gained traction internally, MongoDB was promoted to our primary data store.

Challenges

We collect and store social media content such as tweets, Facebook posts, blog posts, blog comments, etc. Each item is stored in the database as a separate document.

A stored tweet might look something like this:

{
    generator: 'twitter',
    content: 'This is a tweet example for ubervu',
    author: 'Vladimir Oane',
    gender: 'male',
    language: 'english',
    sentiment: 'positive',
    search: 'ubervu',
    published: 1391767879,
    ...
}

For our clients, relevant social media content must contain or match a predefined expression of interest in the designated ‘search’ field. In the example above, the tweet is collected and stored because it contains the string ‘ubervu’ in the content body.

Unique Index Structure

Our most common use case with MongoDB is performing a range query over a time frame for a fixed expression. For example, we might want to retrieve social media content that contains or matches the expression ‘ubervu’ between October 1st and November 23rd.

We constructed the value of MongoDB’s unique _id index so that this query can be answered directly from the index. For space considerations, we opted for a 64-bit integer and divided it into three parts:

  1. A hash of the search expression
  2. The entire published timestamp, in seconds
  3. An item id, which together with the timestamp should uniquely identify a document

To conduct a search for all “ubervu” mentions between timestamp1 and timestamp2, we simply run a range query on “_id” between:

and:

Note above how the lowest and highest bounds have the item id portion filled entirely with 0’s and 1’s, respectively. This allows us to cover edge cases of items that fall between timestamp1 and timestamp2.
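To make the scheme concrete, here is a minimal sketch of how such an _id could be packed and queried from the mongo shell. The field widths (12-bit hash, 32-bit timestamp, 8-bit item id), the hash value, and the collection name are illustrative assumptions, not uberVu’s actual layout:

    // Pack hash | published timestamp | item id into a single 64-bit value.
    // Widths are chosen so the arithmetic stays within JavaScript's exact
    // integer range; the real layout may differ.
    function packId(searchHash, publishedTs, itemId) {
        return NumberLong(searchHash * Math.pow(2, 40) +
                          publishedTs * Math.pow(2, 8) +
                          itemId);
    }

    // Range query for 'ubervu' mentions between two timestamps: the item-id
    // bits are all 0s in the lower bound and all 1s in the upper bound.
    var hashOfUbervu = 1417;          // hypothetical hash of 'ubervu'
    var timestamp1   = 1380585600;    // Oct 1 (Unix seconds)
    var timestamp2   = 1385164800;    // Nov 23 (Unix seconds)

    db.items.find({_id: {$gte: packId(hashOfUbervu, timestamp1, 0),
                         $lte: packId(hashOfUbervu, timestamp2, 255)}})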

Efficient Filtering

Another very common use case is retrieving all the data that matches a criteria set. Within our application, the fields we can filter on are predefined (generator, language, sentiment, gender).

Efficient filtering is a challenge because the most obvious approach - creating indexes for every combination of filters - is not scalable as every added index costs storage space and has the potential to adversely affect write performance.

To improve query efficiency, we added an ‘attributes’ field into each document that consists of an encoded array of all the field values that can be used in a query. It looks like this:

attributes: [2041, 15, 178, 23 …]

Each numeric code in the array represents a property, such as “sentiment: positive” or “language: English”. We added an index over the ‘attributes’ field to speed up queries.

To retrieve all items matching a criteria set using the ‘attributes’ field, queries are run using the $all operator:

collection.find({attributes: {$all: [...]}})

A shortcoming of using the $all operator is that, prior to MongoDB 2.6, the index is only used to match the first code in the ‘attributes’ array; the remaining codes must be checked against documents retrieved from disk, together with the rest of the query criteria.

In an effort to reduce the number of documents that need to be checked from disk for each query, we developed a system that first sorts all numeric codes by the frequency that they appear in the data store and then orders the elements in the ‘attributes’ fields according to their ranking.

For example, the property “generator: Twitter” (representing all tweets) is far more prevalent in the data store than the property “language: Romanian” (representing all content in Romanian). To fetch all tweets written in Romanian, it is more efficient to place the numeric code for “language: Romanian” first in the ‘attributes’ array: it is faster to retrieve all Romanian content from disk and check whether each item is a tweet than to retrieve all tweets and check whether each one is written in Romanian.
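A minimal sketch of this ordering step, assuming a hypothetical frequency table (the codes and counts below are invented for illustration):

    // Order a document's attribute codes so the rarest code comes first.
    // 'codeFrequency' maps each numeric code to how often it appears in the
    // data store; it would be recomputed periodically.
    var codeFrequency = {2041: 91000000, 23: 3300000, 15: 450000, 178: 12000};

    function orderAttributes(codes) {
        return codes.slice().sort(function(a, b) {
            return codeFrequency[a] - codeFrequency[b];   // ascending frequency
        });
    }

    orderAttributes([2041, 15, 178, 23]);   // => [178, 15, 23, 2041]

The ordered array is what gets stored in the ‘attributes’ field, so the first (most selective) code is the one the index match is performed against.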

The solution described above dramatically improved our query response times. MongoDB’s dynamic schema and rich query model made this possible.

Saving Storage Space

After realizing the fields in our documents would be relatively small in number and mostly consistent across the database, our team decided to impose a two-character limit on all field names (“generator” became “g”, “sentiment” became “s”, etc.).

This small change saved us 16% of our storage space, without any loss in information.
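A sketch of the renaming idea, applied at insertion time. Apart from the two substitutions mentioned above, the mapping and the collection name here are hypothetical:

    // Map verbose field names to short keys before writing the document.
    var SHORT_NAMES = {generator: "g", sentiment: "s", content: "c",
                       author: "a", language: "l", search: "q", published: "p"};

    function shorten(doc) {
        var out = {};
        for (var key in doc) {
            out[SHORT_NAMES[key] || key] = doc[key];
        }
        return out;
    }

    db.items.insert(shorten({generator: "twitter", sentiment: "positive",
                             search: "ubervu", published: 1391767879}));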

Our Infrastructure Setup

We have taken full advantage of the cloud computing resources available to efficiently deploy and scale our offerings. Our current infrastructure resides entirely on the AWS stack.

We currently have 30+ instances deployed and over 30 terabytes of storage in permanent use. All EBS-backed production data stores currently reside on xfs RAID arrays.

This storage architecture provides us with not only volume redundancy, but also performance boosts, which were much needed in the beginning when provisioned IOPS was not yet available to ensure EBS performance.

Our MongoDB setup consists of six production clusters, each with its own unique scope and usage pattern.

Our MongoDB clusters all have the same topology:

  • Each shard (shardA, shardB … shardZ) consists of a three-member replica set
  • Three ‘config server’ processes are deployed on separate instances
  • ‘mongos’ routing processes are spread throughout the whole system (webnode, API nodes, worker nodes, etc …)

Optimizing Performance and Redundancy with MongoDB

For us, relying on EBS-backed MongoDB clusters meant we had to familiarize ourselves with the concept of the working set, a number which represents the amount of data that is regularly accessed during day-to-day operations. In situations when the working set is larger than RAM, our application would be forced to read from disk, resulting in an immediate loss in performance due to EBS I/O latency. Now working sets can be estimated using the working set estimator, which was first introduced in MongoDB v2.4.
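For reference, the estimator can be sampled from the mongo shell (MongoDB 2.4 and later, MMAP-based storage); it returns an estimate of the pages touched over a recent interval:

    // Ask serverStatus to include the working set estimator's output
    db.serverStatus({workingSet: 1}).workingSet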

To prevent the working set from exceeding RAM, we first viewed our data usage patterns in Graphite:

The graphic above represents the ‘write’ working set for one of the clusters.

The graphic above represents the ‘read’ working set generated by our API.

Using our data usage information, we defined the following access patterns:

                        RecentDB      HistoricalDB
Read Working Set        0-20 days     20-90 days
Write Working Set       0-20 days     0-90 days (data older than 20 days can be written directly here)

The architecture represented by the table above has been in place for more than three years and has proven itself on multiple occasions from both a performance and redundancy standpoint.

We currently use MongoDB Management Service (MMS) in addition to tools such as Graphite and Collectd for both monitoring and backup. These applications have been critical to managing our cloud-based cluster backups.

As our MongoDB-powered data store grew in size, the decision was made to implement an ‘inverted pyramid’ mechanism in an effort to provide the best possible response time while remaining cost efficient.

This mechanism relies on two main data stores, RecentDB and HistoricalDB, with the use of an in-house oplog replay tool that keeps the two clusters in sync.

The oplog - short for operations log - is a special capped collection that keeps a rolling record of all operations that modify the data stored in a MongoDB database, and is the basic mechanism that enables replication in MongoDB. Secondary nodes tail the oplog for new operations and replay them locally.

To implement the ‘inverted pyramid’ mechanism, we developed a process that connects the source cluster shards (RecentDB) to a destination cluster mongos router instance, verifies the last written timestamp, tails the source oplog from that timestamp, and finally replays all executed operations.

After taking into consideration our current settings and data volumes, we determined that an oplog replay timeframe of 72 - 96 hrs worked best for our clusters, as it ensured there was enough time to recover from any major failure at the cluster level (e.g. full replica set downtime, storage replacements, etc).
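The replay tool itself is internal, but the core idea can be sketched roughly as follows; the hostnames, the checkpoint collection, and the absence of batching, throttling, and error handling are all simplifications:

    // Connect to one RecentDB shard and to the HistoricalDB mongos (hypothetical hosts)
    var source = new Mongo("recentdb-shard-a.example.com:27017");
    var dest   = new Mongo("historicaldb-mongos.example.com:27017");

    // Last oplog timestamp already applied for this shard (hypothetical checkpoint collection)
    var lastTs = dest.getDB("admin").replay_checkpoints.findOne({shard: "shardA"}).ts;

    source.getDB("local").oplog.rs
          .find({ts: {$gt: lastTs}, op: {$in: ["i", "u", "d"]}})
          .sort({$natural: 1})
          .forEach(function(entry) {
              var dot  = entry.ns.indexOf(".");
              var coll = dest.getDB(entry.ns.substring(0, dot))
                             .getCollection(entry.ns.substring(dot + 1));
              if (entry.op === "i")      coll.insert(entry.o);            // insert
              else if (entry.op === "u") coll.update(entry.o2, entry.o);  // update
              else                       coll.remove(entry.o);            // delete
              dest.getDB("admin").replay_checkpoints.update(
                  {shard: "shardA"}, {$set: {ts: entry.ts}});
          });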

In the current implementation, all five oplog replay processes (one per source shard) run on an administrative instance that is continually monitored for delays.

A key design step required in making our inverted pyramid possible was splitting our data store into five ‘segment’ databases, which are provisioned and depleted by two external jobs. This made it possible to drop data (at the db level) from the first two layers, RecentDB and HistoricalDB, in an orderly fashion without impacting any part of the application or compromising performance.

The last step of our data migration consists of offloading all data that passes the 90-day mark from the segment databases to S3. To accomplish this, each HistoricalDB secondary node is provisioned with two Python modules that parse through, collect, and export (in CSV format) all data older than 90 days. The legacy data is then uploaded into an S3 bucket and made available to other parts of our system.

An added benefit of our data architecture is the ability to fall back on HistoricalDB in the event that a major issue impacts RecentDB. Although storing the 0-20 day interval on both clusters costs extra storage space, having HistoricalDB on hand has proven useful in the past, most recently during the AWS US-EAST-1 outage in the summer of 2012.

Conclusion

With MongoDB, we were able to quickly develop the query processes we needed to efficiently serve our customers, all on a flexible database architecture that stresses high performance and redundancy. MongoDB has been a partner that continues to deliver as we grow and tackle new challenges.

To learn more about how MongoDB can have a significant impact on your business, download our whitepaper How a Database Can Make Your Organization Faster, Better, Leaner.

Tiered Storage Models in MongoDB: Optimizing Latency and Cost

May 14 • Posted 3 months ago

By Rohit Nijhawan, Senior Consulting Engineer at MongoDB with André Spiegel and Chad Tindel

For a user-facing application, speed and uptime are critical to success. There are a number of ways you can tune your application and hardware setup to provide the best experience for your customers — the trick is doing so at optimal cost. Here we provide an example for improving performance and lowering costs with MongoDB using Tiered Storage, a method of prioritizing data storage based on latency requirements.

In this example, we will be segmenting data by date: recent data is more frequently accessed and should exhibit lower latency than less recent data. However, the idea applies to other ways of segmenting data, such as location, user, source, size, or other criteria. This approach takes advantage of a powerful feature in MongoDB called tag-aware sharding that has been available since MongoDB 2.2.

Example Application: Insurance Claims

In many applications, low-latency access to data becomes less important as data ages. For example, an insurance company might prioritize access to claims from the last 12 months. Users should be able to view recent claims quickly, but once claims are more than a year old they tend to be accessed much less frequently, and the latency requirements tend to become less demanding.

By creating tiers of storage with different performance and cost profiles, the insurance company can provide a better experience for users while optimizing their costs. Older claims can be stored in a storage tier with more cost-effective hardware such as commodity hard drives. More recent data can be stored in a high-performance storage tier that provides lower latency such as SSD. Because the majority of the claims are more than a year old, storing older data in the lower-cost tier can provide significant cost advantages. The insurance company can optimize their hardware spread across the two tiers, providing a great user experience at an optimized cost point.

The requirements for this application can be summarized as:

  • The trailing 12 months of claims should reside on the faster storage tier
  • Claims over a year old should move to the slower storage tier
  • Over time, new claims arrive and older claims need to move from the faster tier to the slower tier

For simplicity, throughout this overview, we’ll distinguish the claims data by “current” and “tier-2” data.

Building Your Own Process: An Operational Headache

One approach to these requirements is to use periodic batch jobs: selecting data, loading it into the archive, and erasing it from the faster storage. However, this is inherently complex:

  • The move process must be carefully coded to fail gracefully. In the event that a load fails, you don’t want to delete the original data!
  • If the data to be moved is large, you may wish to throttle the operations.
  • If moves succeed partially, you have to retry the unfinished data.
  • Unless you plan on halting your application during the move (generally unacceptable), your application needs custom code to find the data before, during, and after the move.
  • Your application needs to understand the physical location of the data, which unnecessarily couples your code to the partitioning logic.

Furthermore, introducing another custom component to your operations requires additional maintenance and monitoring.

It’s an operational headache that many teams are forced to endure, but there is a simpler way: have MongoDB handle the load of migrating documents from the recent storage machines to the tier 2 storage machines, transparently. As it turns out, you can easily implement this approach with a feature called Tag-Aware Sharding.

The MongoDB Way: Tag-aware Sharding

MongoDB provides a feature called sharding to scale systems horizontally across multiple machines. Sharding is transparent to your application - whether you have 1 or 100 shards, your application code is the same. For a comprehensive description of sharding please see the Sharding Guide.

A key component of sharding is a process called the balancer. As collections grow, the balancer operates in the background to carefully move documents between shards. Normally the balancer works to achieve a uniform distribution of documents across shards. However, with tag-aware sharding we can create policies that affect where documents are stored. This feature can be applied in many use cases. One example is to keep user data in data centers that are near the user. In our application, we can use this feature to keep current data on our fast servers, and tier 2 data on cheaper, slower servers.

Here’s how it works:

  • Shards are assigned tags. A tag is an alphanumeric alias like “London-DC”.
  • Unique shard key ranges are ‘pinned’ to tags.
  • During normal balancing operations, chunks migrate only to shards whose tag is associated with a key range which contains the chunk’s key range*.
  • There are a few subtleties regarding what happens when a chunk’s key range overlaps more than one tag range. Please read the documentation carefully regarding this particular case.

This means that we can assign the “tier-2” tag to shards running on slow servers and “current” tags to shards running on fast servers, and the balancer will handle migrating the data between tiers automatically. What’s great is that we can keep all the data in one database, so our application code doesn’t need to change as data moves between storage tiers.

Determining the shard key

When you query a sharded collection, the query router will do its best to only inspect the shards holding your data, but it can only do this if you provide the shard key as part of your query. (See Sharded Cluster Query Routing for more information.)

So we need to make sure that we look up documents by the shard key. We also know that time is the basis for determining the location of documents in our two storage tiers. Accordingly, the shard key must contain an explicit timestamp. In our example, we’ll be using Enron’s email dataset, and we’ll set the top-level “date” as the shard key. Here’s a sample document:

Because the time is stored in the most significant digits of the date, messages from any given day will numerically precede messages from subsequent days.
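For instance, a date-bounded query that includes the shard key can be routed to just the shards whose chunks cover that interval (the dates here are only illustrative):

    // Including the shard key lets mongos target the relevant shards instead
    // of broadcasting the query to the whole cluster.
    db.messages.find({date: {$gte: ISODate("2001-06-01"),
                             $lt:  ISODate("2001-06-08")}})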

Implementation

Here are the steps to set up this system:

  1. Set up an empty, sharded MongoDB cluster
  2. Create a target database to host the sharded collection
  3. Assign tags to the different shards corresponding to the storage tiers
  4. Assign tag ranges to the shards
  5. Load data into the MongoDB cluster

Set up the cluster

The first thing you will want to do is set up your sharded cluster. You can see more information on how to set this up here.

In this case we will have a database called “enron” and a collection called “messages” which holds part of the Enron email corpus. In this example, we’ve set up a cluster with three shards. The first, shard0000, is optimized for low-latency access to data. The other two, shard0001 and shard0002, use more cost effective hardware for data that is older than the identified cutoff date.

Here’s our sharded cluster. These are empty machines with no data:

Adding the tags

We can “tag” each of these shards to associate them with documents that should belong to our “current” tier or to “tier-2.” In the absence of tags and tag ranges, balancing will try to ensure that the number of chunks on each shard is equal, without regard to any other data in the fields. Before we add the data to our collection, let’s tag shard0000 as “current” and the other two as “tier-2”:
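In the mongo shell this is done with the sh.addShardTag() helper; a sketch using the shard names from this example:

    // shard0000 holds the low-latency tier, shard0001/shard0002 the cheaper tier
    sh.addShardTag("shard0000", "current")
    sh.addShardTag("shard0001", "tier-2")
    sh.addShardTag("shard0002", "tier-2")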

Now we can verify our tags by calling sh.status():

Next, we need to set up a database and collection for the Enron emails. We’ll set up a new database ‘enron’ with a collection called ‘messages’ and enable sharding on that collection:

Since we’re going to shard the collection, we’ll need to set up a shard key. We will use the ‘date’ field as our shard key since this is the field that will define how the documents are distributed across shards:
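A sketch of the corresponding shell commands (sharding is enabled on the database first, then the collection is sharded on the ascending ‘date’ key):

    sh.enableSharding("enron")
    sh.shardCollection("enron.messages", {date: 1})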

Defining the cutoff date between tiers

The cutoff point between “current” data and “tier-2” data is a point in time that we will update periodically to keep the most recent documents in our “current” shard. We will start with a cutoff of July 1, 2001, saved as an ISODate: ISODate(“2001-07-01”). Once we add the data to our collection, we will set this as the tag range. Going forward, when we add documents to the “messages” collection, any documents newer than July 1, 2001 will end up on the “current” shard, and documents older than that will end up on the “tier-2” shard.

It’s important that the two ranges meet at exactly the same point in time. The lower bound of a tag range is inclusive, and the upper bound is exclusive. This means a document with a date of exactly ISODate(“2001-07-01”) will go on the “current” shard, not the “tier-2” shard.
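A sketch of how the two ranges can be declared with sh.addTagRange(), using MinKey and MaxKey for the open ends so the boundaries meet exactly at the cutoff:

    // Everything strictly before the cutoff is pinned to 'tier-2';
    // the cutoff itself and everything after it is pinned to 'current'.
    sh.addTagRange("enron.messages", {date: MinKey}, {date: ISODate("2001-07-01")}, "tier-2")
    sh.addTagRange("enron.messages", {date: ISODate("2001-07-01")}, {date: MaxKey}, "current")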

Below you will see each of the shard’s new tag ranges:

As a final check, look in the config database for the tag range definitions.

Now that all the shards and ranges are defined, we are ready to load the message data into the cluster. The documents will follow the rules given by the tag ranges and land on the correct machines.

Now let’s check the sharding status to see where the documents reside:

That’s it! The mongos process automatically moves documents to comply with the tag ranges. In this example, it took all documents still on the “current” shard with a date older than ISODate(“2001-07-01T00:00:00Z”) and moved them to the “tier-2” shards.

The tag ranges must be updated on a regular basis to keep the cutoff point at the correct interval of time in the past (1 year, in our case), which means both ranges need to be updated. To perform this change, the balancer should be temporarily disabled so there is no point at which the ranges overlap. Stopping the balancer temporarily is a safe operation - it will not affect the application or the experience of users.

If you wanted to move the cutoff back another month, to August 1, 2001, you just need to follow these steps:

  1. Stop the balancer: sh.setBalancerState(false)
  2. Create a chunk split at August 1: sh.splitAt('enron.messages', {"date" : ISODate("2001-08-01")})
  3. Move the cutoff date to ISODate(“2001-08-01T00:00:00Z”):
     var configdb = db.getSiblingDB("config");
     configdb.tags.update({tag: "tier-2"}, {$set: {'max.date': ISODate("2001-08-01")}})
     configdb.tags.update({tag: "current"}, {$set: {'min.date': ISODate("2001-08-01")}})
  4. Re-start the balancer: sh.setBalancerState(true)
  5. Verify the sharding status: sh.status()

By moving the cutoff to August 1, we have migrated all the documents dated after July 1 but before August 1 from the “current” shard to the “tier-2” shards. The good news is that we were able to perform this operation without changing our application code and with no database downtime. We can also see that it would be simple to schedule this process to run automatically through an external process.

From Operational Headache to Simplicity

The end result is one collection spread across three shards and two different storage systems. This solution allows you to lower your storage costs without adding complexity to the architecture of your system. Instead of a complex setup with different databases on different machines we have one database to query, and instead of a data migration we update some simple rules to control the location of data in the system.

Like what you see? Sign up for the MongoDB Newsletter
