Posts tagged:

mongodb

Getting Started with MongoDB and Java: Part II

Aug 14 • Posted 1 month ago

By Trisha Gee, Java Engineer and Advocate at MongoDB

In the last article, we covered the basics of installing and connecting to MongoDB via a Java application. In this post, I’ll give an introduction to CRUD (Create, Read, Update, Delete) operations using the Java driver. As in the previous article, if you want to follow along and code as we go, you can use these tips to get the tests in the Getting Started project to go green.

Creating documents

In the last article, we introduced documents and how to create them from Java and insert them into MongoDB, so I’m not going to repeat that here. But if you want a reminder, or simply want to skip to playing with the code, you can take a look at Exercise3InsertTest.

Querying

Putting stuff in the database is all well and good, but you’ll probably want to query the database to get data from it.

In the last article we covered some basics on using find() to get data from the database. We also showed an example in Exercise4RetrieveTest. But MongoDB supports more than simply getting a single document by ID or getting all the documents in a collection. As I mentioned, you can query by example, building up a query document with a similar shape to the documents you want.

For the following examples I’m going to assume a document which looks something like this:

person = {
  _id: "anId",
  name: "A Name",
  address: {
    street: "Street Address",
    city: "City",
    phone: "12345"
  },
  books: [ 27464, 747854, ...]
}

Find a document by ID

To recap, you can easily get a document back from the database using the unique ID:
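
A minimal sketch, assuming collection is the DBCollection for our people and "anId" is the ID from the example document above:

DBObject query = new BasicDBObject("_id", "anId");
DBCursor cursor = collection.find(query);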

…and you get the values out of the document (represented as a DBObject) using a Map-like syntax:
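
For example, reading the top-level name field and reaching into the address subdocument (field names and values come from the example document above):

DBObject result = cursor.one();
String name = (String) result.get("name");                    // "A Name"
DBObject address = (DBObject) result.get("address");
String phone = (String) address.get("phone");                 // "12345"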

In the above example, because you’ve queried by ID (and you knew that ID existed), you can be sure that the cursor has a single document that matches the query. Therefore you can use cursor.one() to get it.

Find all documents matching some criteria

In the real world, you won’t always know the ID of the document you want. You could be looking for all the people with a particular name, for example.

In this case, you can create a query document that has the criteria you want:
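
A sketch of what that might look like, assuming collection is the people collection and we want everyone named "Smith":

DBObject query = new BasicDBObject("name", "Smith");
DBCursor results = collection.find(query);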

You can find out the number of results:
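
For example, continuing with the results cursor from the sketch above:

int numberOfPeopleCalledSmith = results.count();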

and you can, naturally, iterate over them:
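
For example, printing the name of every matching person:

while (results.hasNext()) {
    DBObject person = results.next();
    System.out.println(person.get("name"));
}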

A note on batching

The cursor will fetch results in batches from the database, so if you run a query that matches a lot of documents, you don’t have to worry that every document is loaded into memory immediately. For most queries, the first batch returned will be 101 documents. But as you iterate over the cursor, the driver will automatically fetch further batches from the server. So you don’t have to worry about managing batching in your application. But you do need to be aware that if you iterate over the whole of the cursor (for example to put it into a List), you will end up fetching all the results and putting them in memory.

You can get started with Exercise5SimpleQueryTest.

Selecting Fields

Generally speaking, you will read entire documents from MongoDB. However, you can choose to return just the fields you care about (for example, you might have a large document and not need all the values). You do this by passing a second parameter into the find method: another DBObject defining the fields you want to return. In this example, we’ll search for people called “Smith”, and return only the name field. To do this we pass in a DBObject representing {name: 1}:
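
In Java, that might look something like this, with collection as the people DBCollection:

DBObject query = new BasicDBObject("name", "Smith");
DBObject justTheName = new BasicDBObject("name", 1);
DBCursor results = collection.find(query, justTheName);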

You can also use this method to exclude fields from the results. We might want to exclude an unnecessary subdocument from the results - let’s say we want to find everyone called “Smith”, but we don’t want to return the address. We do this by passing in a zero for this field name, i.e. {address: 0}:
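
Something like this sketch, with the same query but a different second parameter:

DBCursor results = collection.find(new BasicDBObject("name", "Smith"),
                                    new BasicDBObject("address", 0));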

With this information, you’re ready to tackle Exercise6SelectFieldsTest

Query Operators

As I mentioned in the previous article, your fields can be one of a number of types, including numeric. This means that you can do queries for numeric values as well. Let’s assume, for example, that our person has a numberOfOrders field, and we wanted to find everyone who had ordered more than, let’s say, 10 items. You can do this using the $gt operator:
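
A sketch, assuming the numberOfOrders field described above:

DBObject query = new BasicDBObject("numberOfOrders", new BasicDBObject("$gt", 10));
DBCursor bigCustomers = collection.find(query);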

Note that you have to create a further subdocument containing the $gt condition to use this operator. All of the query operators are documented, and work in a similar way to this example.

You might be wondering what terrible things could happen if you try to perform some sort of numeric comparison on a field that is a String, since the database supports any type of value in any of the fields (and in Java the values are Objects so you don’t get the benefit of type safety). So, what happens if you do this?
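
Assume, for the sake of argument, that “this” means using $gt to compare the String name field with a number:

DBCursor results = collection.find(new BasicDBObject("name", new BasicDBObject("$gt", 10)));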

The answer is you get zero results (assuming all your documents contain names that are Strings), and you don’t get any errors. The flexible nature of the document schema allows you to mix and match types and query without error.

You can use this technique to get the test in Exercise7QueryOperatorsTest to go green - it’s a bit of a daft example, but you get the idea.

Querying Subdocuments

So far we’ve assumed that we only want to query values in our top-level fields. However, we might want to query for values in a subdocument - for example, with our person document, we might want to find everyone who lives in the same city. We can use dot notation like this:
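
A sketch using dot notation on the address subdocument (the city value here is just illustrative):

DBObject query = new BasicDBObject("address.city", "City");
DBCursor peopleInTheSameCity = collection.find(query);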

We’re not going to use this technique in a query test, but we will use it later when we’re doing updates.

Familiar methods

I mentioned earlier that you can iterate over a cursor, and that the driver will fetch results in batches. However, you can also use the familiar-looking skip() and limit() methods. You can use these to fix up the test in Exercise8SkipAndLimitTest.
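
For example, to fetch the third page of ten results (this sketch re-uses the query document from earlier):

DBCursor thirdPage = collection.find(query).skip(20).limit(10);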

A last note on querying: Indexes

As with a traditional database, you can add indexes to improve the speed of common queries. There’s extensive documentation on indexes which you can read at your leisure. However, it is worth pointing out that, if necessary, you can programmatically create indexes via the Java driver, using createIndex. For example:
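
A minimal sketch, creating an ascending index on the name field of our people collection:

collection.createIndex(new BasicDBObject("name", 1));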

There is a very simple example of creating an index in Exercise9IndexTest, but indexes are a full topic on their own; the purpose of this part of the tutorial is merely to make you aware of their existence, rather than to provide a comprehensive tutorial on their purpose and uses.

Updating values

Now you can insert into and read from the database. But your data is probably not static, especially as one of the benefits of MongoDB is a flexible schema that can evolve with your needs over time.

In order to update values in the database, you’ll have to define the query criteria that states which document(s) you want to update, and you’ll have to pass in the document that represents the updates you want to make.

There are a few things to be aware of when you’re updating documents in MongoDB; once you understand these, it’s as simple as everything else we’ve seen so far.

Firstly, by default only the first document that matches the query criteria is updated.

Secondly, if you pass in a document as the update parameter, this new document will replace the whole existing document. If you think about it, the common use-case will be: you retrieve something from the database; you modify it based on some criteria from your application or the user; then you save the updated document to the database.

I’ll show the various types of updates (and point you to the code in the test class) to walk you through these different cases.

Simple Update: Find a document and replace it with an updated one

We’ll carry on using our simple Person document for our examples. Let’s assume we’ve got a document in our database that looks like:

person = {
  _id: "jo",
  name: "Jo Bloggs",
  address: {
    street: "123 Fake St",
    city: "Faketon",
    phone: "5559991234"
  },
  books: [ 27464, 747854, ...]
}

Maybe Jo goes into witness protection and needs to change his/her name. Assuming we’ve got jo populated in a DBObject, we can make the appropriate changes to the document and save it into the database:
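
A sketch of the replacement, assuming jo is the DBObject we fetched earlier and the new name is made up:

DBObject query = new BasicDBObject("_id", "jo");
jo.put("name", "Jo In Disguise");
collection.update(query, jo);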

You can make a start with Exercise10UpdateByReplacementTest.

Update Operators: Change a field

But sometimes you won’t have the whole document to replace the old one; sometimes you just want to update a single field in whichever document matched your criteria.

Let’s imagine that we only want to change Jo’s phone number, and we don’t have a DBObject with all of Jo’s details but we do have the ID of the document. If we use the $set operator, we’ll replace only the field we want to change:
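
A sketch using $set together with the dot notation we saw earlier (the new phone number is made up):

collection.update(new BasicDBObject("_id", "jo"),
                  new BasicDBObject("$set", new BasicDBObject("address.phone", "5559874321")));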

There are a number of other operators for performing updates on documents, for example $inc which will increment a numeric field by a given amount.

Now you can do Exercise11UpdateAFieldTest

Update Multiple

As I mentioned earlier, by default the update operation updates the first document it finds and no more. You can, however, set the multi flag on update to update everything.

So maybe we want to update everyone in the database to have a country field, and for now we’re going to assume all the current people are in the UK:
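
A sketch using the four-argument form of update, where the two booleans are upsert and multi:

collection.update(new BasicDBObject(),
                  new BasicDBObject("$set", new BasicDBObject("country", "UK")),
                  false, true);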

The query parameter is an empty document which finds everything; the second boolean (set to true) is the flag that says to update all the values which were found.

Now we’ve learnt enough to complete the two tests in Exercise12UpdateMultipleDocumentsTest

Upsert

Finally, the last thing to mention when updating documents is Upsert (Update-or-Insert). This will search for a document matching the criteria and either: update it if it’s there; or insert it into the database if it wasn’t.

Like “update multiple”, you define an upsert operation with a magic boolean. It shouldn’t come as a surprise to find it’s the first boolean param in the update statement (since “multi” was the second):
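
A sketch with a made-up person, charlie, who may or may not already be in the database:

DBObject query = new BasicDBObject("_id", "charlie");
DBObject charlie = new BasicDBObject("_id", "charlie").append("name", "Charlie Croker");
collection.update(query, charlie, true, false);   // upsert = true, multi = false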

Now you know everything you need to complete the test in Exercise13UpsertTest

Removing from the database

Finally, the D in CRUD: Delete. The syntax of a remove should look familiar now that we’ve got this far: you pass a document that represents your selection criteria into the remove method. So if we wanted to delete jo from our database, we’d do:
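
For example:

collection.remove(new BasicDBObject("_id", "jo"));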

Unlike update, if the query matches more than one document, all those documents will be deleted (something to be aware of!). If we wanted to remove everyone who lives in London, we’d need to do:
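
Using the dot notation from earlier:

collection.remove(new BasicDBObject("address.city", "London"));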

That’s all there is to remove; you’re ready to finish off Exercise14RemoveTest.

Conclusion

Unlike traditional databases, you don’t create SQL queries in MongoDB to perform CRUD operations. Instead, operations are done by constructing documents both to query the database, and to define the operations to perform.

While we’ve covered what the basics look like in Java, there’s loads more documentation on all the core concepts in the MongoDB documentation.

Getting Started with MongoDB and Java: Part I

Aug 7 • Posted 1 month ago

By Trisha Gee, Java Engineer and Advocate at MongoDB

Java is one of the most popular programming languages in the MongoDB Community. For new users, it’s important to provide an overview of how to work with the MongoDB Java driver and how to use MongoDB as a Java developer.

In this post, which is aimed at Java/JVM developers who are new to MongoDB, we’re going to give you a guide on how to get started, including:

  • Installation
  • Setting up your dependencies
  • Connecting
  • What are Collections and Documents?
  • The basics of writing to and reading from the database
  • An overview of some of the JVM libraries

Installation

The installation instructions for MongoDB are extensively documented, so I’m not going to repeat any of that here. If you want to follow along with this “getting started” guide, you’ll want to download the appropriate version of MongoDB and unzip/install it. At the time of writing, the latest version of MongoDB is 2.6.3, which is the version I’ll be using.

A note about security

In a real production environment, of course you’re going to want to consider authentication. This is something that MongoDB takes seriously and there’s a whole section of documentation on security. But for the purpose of this demonstration, I’m going to assume you’ve either got that working or you’re running in “trusted mode” (i.e. that you’re in a development environment that isn’t open to the public).

Take a look around

Once you’ve got MongoDB installed and started (a process that should only take a few minutes), you can connect to the MongoDB shell. Most of the MongoDB technical documentation is written for the shell, so it’s always useful to know how to access it, and how to use it to troubleshoot problems or prototype solutions.

When you’ve connected, you should see something like

MongoDB shell version: 2.6.3                           
connecting to: test
> _  

Since you’re in the console, let’s take it for a spin. Firstly we’ll have a look at all the databases that are there right now:

> show dbs

Assuming this is a clean installation, there shouldn’t be much to see:

> show dbs
admin  (empty)
local  0.078GB
>

That’s great, but as you can see there’s loads of documentation on how to play with MongoDB from the shell. The shell is a really great environment for trying out queries and looking at things from the point-of-view of the server. However, I promised you Java, so we’re going to step away from the shell and get on with connecting via Java.

Getting started with Java

First, you’re going to want to set up your project/IDE to use the MongoDB Java Driver. These days IDEs tend to pick up the correct dependencies through your Gradle or Maven configuration, so I’m just going to cover configuring these.

At the time of writing, the latest version of the Java driver is 2.12.3 - this is designed to work with the MongoDB 2.6 series.

Gradle

You’ll need to add the following to your dependencies in build.gradle:

compile 'org.mongodb:mongo-java-driver:2.12.3'

Maven

For maven, you’ll want:

<dependencies>
    <dependency>
        <groupId>org.mongodb</groupId>
        <artifactId>mongo-java-driver</artifactId>
        <version>2.12.3</version>
    </dependency>
</dependencies>

Alternatively, if you’re really old-school and like maintaining your dependencies the hard way, you can always download the JAR file.

If you don’t already have a project that you want to try with MongoDB, I’ve created a series of unit tests on github which you can use to get a feel for working with MongoDB and Java.

Connecting via Java

Assuming you’ve resolved your dependencies and you’ve set up your project, you’re ready to connect to MongoDB from your Java application.

Since MongoDB is a document database, you might not be surprised to learn that you don’t connect to it via traditional SQL/relational DB methods like JDBC. But it’s simple all the same:
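
A minimal sketch, using MongoClient and MongoClientURI from com.mongodb:

MongoClient mongoClient = new MongoClient(new MongoClientURI("mongodb://localhost:27017"));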

Where I’ve put mongodb://localhost:27017, you’ll want to put the address of where you’ve installed MongoDB. There’s more detailed information on how to create the correct URI, including how to connect to a Replica Set, in the MongoClientURI documentation.

If you’re connecting to a local instance on the default port, you can simply use:
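
That is, the no-argument constructor:

MongoClient mongoClient = new MongoClient();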

Note that this does throw a checked Exception, UnknownHostException. You’ll either have to catch this or declare it, depending upon what your policy is for exception handling.

The MongoClient is your route into MongoDB; from this you’ll get your database and collections to work with (more on this later). Your instance of MongoClient (e.g. mongoClient above) will ordinarily be a singleton in your application. However, if you need to connect via different credentials (different user names and passwords) you’ll want a MongoClient per set of credentials.

It is important to limit the number of MongoClient instances in your application, which is why we suggest a singleton - the MongoClient is effectively the connection pool, so for every new MongoClient you are opening a new pool. Using a single MongoClient (and optionally configuring its settings) will allow the driver to correctly manage your connections to the server. This MongoClient singleton is safe to be used by multiple threads.

One final thing you need to be aware of: you want your application to shut down the connections to MongoDB when it finishes running. Always make sure your application or web server calls MongoClient.close() when it shuts down.

Try out connecting to MongoDB by getting the test in Exercise1ConnectingTest to pass.

Where are my tables?

MongoDB doesn’t have tables, rows, columns, joins etc. There are some new concepts to learn when you’re using it, but nothing too challenging.

While you still have the concept of a database, the documents (which we’ll cover in more detail later) are stored in collections, rather than your database being made up of tables of data. But it can be helpful to think of documents like rows and collections like tables in a traditional database. And collections can have indexes like you’d expect.

Selecting Databases and Collections

You’re going to want to define which databases and collections you’re using in your Java application. If you remember, a few sections ago we used the MongoDB shell to show the databases in your MongoDB instance, and you had an admin and a local.

Creating and getting a database or collection is extremely easy in MongoDB:
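
A sketch, assuming mongoClient is the client we created above:

DB database = mongoClient.getDB("TheDatabaseName");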

You can replace "TheDatabaseName" with whatever the name of your database is. If the database doesn’t already exist, it will be created automatically the first time you insert anything into it, so there’s no need for null checks or exception handling on the off-chance the database doesn’t exist.

Getting the collection you want from the database is simple too:
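
For example:

DBCollection collection = database.getCollection("TheCollectionName");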

Again, replacing "TheCollectionName" with whatever your collection is called.

If you’re playing along with the test code, you now know enough to get the tests
in Exercise2MongoClientTest to pass.

An introduction to documents

Something that is, hopefully, becoming clear to you as you work through the examples in this blog, is that MongoDB is different from the traditional relational databases you’ve used. As I’ve mentioned, there are collections, rather than tables, and documents, rather than rows and columns.

Documents are much more flexible than a traditional row, as you have a dynamic schema rather than an enforced one. You can evolve the document over time without incurring the cost of schema migrations and tedious update scripts. But I’m getting ahead of myself.

Although documents don’t look like the tables, columns and rows you’re used to, they should look familiar if you’ve done anything even remotely JSON-like. Here’s an example:

person = {
  _id: "jo",
  name: "Jo Bloggs",
  age: 34,
  address: {
    street: "123 Fake St",
    city: "Faketon",
    state: "MA",
    zip: "12345"
  },
  books: [ 27464, 747854, ...]
}

There are a few interesting things to note:

  1. Like JSON, documents are structures of name/value pairs, and the values can be one of a number of primitive types, including Strings and various number types.
  2. It also supports nested documents - in the example above, address is a subdocument inside the person document. Unlike a relational database, where you might store this in a separate table and provide a reference to it, in MongoDB if that data benefits from always being associated with its parent, you can embed it in its parent.
  3. You can even store an array of values. The books field in the example above is an array of integers that might represent, for example, IDs of books the person has bought or borrowed.

You can find out more detailed information about Documents in the documentation.

Creating a document and saving it to the database

In Java, if you wanted to create a document like the one above, you’d do something like:
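
A sketch using BasicDBObject (this assumes the com.mongodb imports plus java.util.Arrays for the books list):

DBObject address = new BasicDBObject("street", "123 Fake St")
        .append("city", "Faketon")
        .append("state", "MA")
        .append("zip", "12345");

DBObject person = new BasicDBObject("_id", "jo")
        .append("name", "Jo Bloggs")
        .append("age", 34)
        .append("address", address)
        .append("books", Arrays.asList(27464, 747854));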

At this point, it’s really easy to save it into your database:
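
A sketch, using the Examples database and people collection that show up in the shell output below:

MongoClient mongoClient = new MongoClient(new MongoClientURI("mongodb://localhost:27017"));
DB database = mongoClient.getDB("Examples");
DBCollection collection = database.getCollection("people");
collection.insert(person);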

Note that the first three lines are set-up, and you don’t need to re-initialize those every time.

Now if we look inside MongoDB, we can see that the database has been created:

> show dbs
Examples  0.078GB
admin     (empty)
local     0.078GB
> _

…and we can see the collection has been created as well:

> use Examples
switched to db Examples
> show collections
people
system.indexes
> _ 

…finally, we can see that our person, “Jo”, was inserted:

> db.people.findOne()
{
    "_id" : "jo",
    "name" : "Jo Bloggs",
        "age": 34,
    "address" : {
        "street" : "123 Fake St",
        "city" : "Faketon",
        "state" : "MA",
        "zip" : "12345"
    },
    "books" : [
        27464,
        747854
    ]
}
> _

As a Java developer, you can see the similarities between the Document that’s stored in MongoDB, and your domain object. In your code, that person would probably be a Person class, with simple primitive fields, an array field, and an Address field.

So rather than building your DBObject manually like the above example, you’re more likely to be converting your domain object into a DBObject. It’s best not to have the MongoDB-specific DBObject class in your domain objects, so you might want to create a PersonAdaptor that converts your Person domain object to a DBObject:
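
A sketch of what such an adaptor might look like; the Person class and its getters here are hypothetical stand-ins for your own domain object:

public class PersonAdaptor {
    // Maps the fields of our (hypothetical) Person domain object onto a DBObject
    // that the driver can persist.
    public static DBObject toDBObject(Person person) {
        DBObject address = new BasicDBObject("street", person.getAddress().getStreet())
                .append("city", person.getAddress().getCity())
                .append("state", person.getAddress().getState())
                .append("zip", person.getAddress().getZip());

        return new BasicDBObject("_id", person.getId())
                .append("name", person.getName())
                .append("age", person.getAge())
                .append("address", address)
                .append("books", person.getBookIds());
    }
}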

As before, once you have the DBObject, you can save this into MongoDB:
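
For example, assuming the adaptor sketched above:

collection.insert(PersonAdaptor.toDBObject(person));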

Now you’ve got all the basics to get the tests in Exercise3InsertTest to pass.

Getting documents back out again

Now you’ve saved a Person to the database, and we’ve seen it in the database using the shell, you’re going to want to get it back out into your Java application. In this post, we’re going to cover the very basics of retrieving a document - in a later post we’ll cover more complex querying.

You’ll have guessed by the fact that MongoDB is a document database that we’re not going to be using SQL to query. Instead, we query by example, building up a document that looks like the document we’re looking for. So if we wanted to look for the person we saved into the database, “Jo Bloggs”, we remember that the _id field had the value of “jo”, and we create a document that matches this shape:
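
A sketch, assuming collection is the people collection from earlier:

DBObject query = new BasicDBObject("_id", "jo");
DBCursor cursor = collection.find(query);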

As you can see, the find method returns a cursor for the results. Since _id needs to be unique, we know that if we look for a document with this ID, we will find only one document, and it will be the one we want:
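
For example:

DBObject jo = cursor.one();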

Earlier we saw that documents are simply made up of name/value pairs, where the value can be anything from a simple String or primitive, to more complex types like arrays or subdocuments. Therefore in Java, we can more or less treat DBObject as a Map<String, Object>. So if we wanted to look at the fields of the document we got back from the database, we can get them with:
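
For example, pulling out the name and the nested address:

String name = (String) jo.get("name");              // "Jo Bloggs"
DBObject address = (DBObject) jo.get("address");
String city = (String) address.get("city");         // "Faketon"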

Note that you’ll need to cast the value to a String, as the compiler only knows that it’s an Object.

If you’re still playing along with the example code, you’re now ready to take on all the tests in Exercise4RetrieveTest

Overview of JVM Libraries

So far I’ve shown you the basics of the official Java Driver, but you’ll notice that it’s quite low-level - you have to do a lot of taking things out of your domain objects and poking them into MongoDB-shaped DBObjects, and vice-versa. If this is the level of control you want, then the Java driver makes this easy for you. But if it seems like this is extra work that you shouldn’t have to do, there are plenty of other options for you.

The tools I’m about to describe all use the MongoDB Java Driver at their core to interact with MongoDB. They provide a high-level abstraction for converting your domain objects into MongoDB documents, whilst also giving you a way to get to the underlying driver as well in case you need to use it at a lower level.

Morphia

Morphia is a really lightweight ODM (Object Document Mapper), so it’s similar to ORMs like Hibernate. Documents can be in a fairly similar shape to your Java domain objects, so this mapping can be automatic, but Morphia allows you to point the mapper in the right direction.

Morphia is open source, and has contributors from MongoDB. Sample code and documentation can be found here.

Spring Data

Another frequently used ODM is Spring Data. This supports traditional relational and non-relational databases, including MongoDB. If you’re already using Spring in your application, this should be a familiar way to work.

As always with Spring projects, there’s a lot of really great documentation, including a Getting Started guide with example code.

MongoJack

If you’re working with web services or something else that supports JSON, and you’re using Jackson to work with this data, it probably seems like a waste to be turning it from this form into a Java object and then into a MongoDB DBObject. But MongoJack might make your job easier, as it’s designed to map JSON objects directly into MongoDB. Take a look at the example code and documentation.

Jongo

This is another Jackson-based ODM, but provides an interesting extra in the form of supporting queries the way you’d write them in the shell. Documentation and example code is available on the website.

Grails MongoDB GORM

The Grails web application framework also supports its own Object-Relational Mapping (GORM), including support for MongoDB. More documentation for this plugin can be found here.

Casbah

This isn’t an ODM like the other tools mentioned, but the officially supported Scala driver for MongoDB. Like the previous libraries, it uses the MongoDB Java Driver under the covers, but it provides a Scala API for application developers to work with. If you like working with Scala but are searching for an async solution, consider ReactiveMongo, a community-supported driver that provides async and non-blocking operations.

Other libraries and tools

This is far from an extensive list, and I apologise if I’ve left a favourite out. But we’ve compiled a list of many more libraries for the JVM, which includes community projects and officially supported drivers.

Conclusion

We’ve covered the basics of using MongoDB from Java - we’ve touched on what MongoDB is, and you can find out a lot more detailed information about it from the manual; we’ve installed it somewhere that lets us play with it; we’ve talked a bit about collections and documents, and what these look like in Java; and we’ve started inserting things into MongoDB and getting them back out again.

If you haven’t already started playing with the test code, you can find it in this github repository. And if you get desperate and look hard enough, you’ll even find the answers there too.

Finally, there are more examples of using the Java Driver in the Quick Tour, and there is example code in github, including examples for authentication.

If you want to learn more, try our 7-week online course, "Intro to MongoDB and Java" which started on MongoDB University this week.

Try it out, and hopefully you’ll see how easy it is to use MongoDB from Java.

How Buffer uses MongoDB to power its Growth Platform

Jun 9 • Posted 3 months ago

By Sunil Sadasivin, CTO at Buffer

Buffer, powered by experiments and metrics

At Buffer, every product decision we make is driven by quantitative metrics. We have always sought to be lean in our decision making, and one of the core tenets of being lean is launching experimental features early and measuring their impact.

Buffer is a social media tool to help you schedule and space out your posts on social media networks like Twitter, Facebook, Google+ and LinkedIn. We started in late 2010 and, thanks to a keen focus on analytical data, we have now grown to over 1.5 million users and 155k unique active users per month. We’re now responsible for sharing 3 million social media posts a week.

When I started at Buffer in September 2012 we were using a mixture of Google Analytics, Kissmetrics and an internal tool to track our app usage and analytics. We struggled to move fast and effectively measure product and feature usage with these disconnected tools. We didn’t have an easy way to generate powerful reports like cohort analysis charts or measure things like activation segmented by signup sources over time. Third party tracking services were great for us early on, but as we started to dig deeper into our app insights, we realized there was no way around it—we needed to build our own custom metrics and event tracking.

We took the plunge in April 2013 to build our own metrics framework using MongoDB. While we’ve had some bumps and growing pains setting this up, it’s been one of the best decisions we’ve made. We are now in control of all metrics and event tracking and are able to understand what’s going on with our app at a deeper level. Here’s how we use MongoDB to power our metrics framework.

Why we chose MongoDB

At the time we were evaluating datastores, we had no idea what our data would look like. When I started designing our schema, I quickly found that we needed something that would let us change the metrics we track over time and on the fly. Today, I’ll want to measure our signup funnel based on referrals, tomorrow I might want to measure some custom event and associated data that is specific to some future experiment. I needed to plan for the future, and give our developers the power to track any arbitrary data. MongoDB and its dynamic schema made the most sense for us. MongoDB’s super powerful aggregation framework also seemed perfect for creating the right views with our tracking data.

Our Metrics Framework Architecture

In our app, we’ve set up an AWS SQS queue and any data we want to track from the app goes immediately to this queue. We use SQS heavily in our app and have found it to be a great tool to manage messaging at high throughput levels. We have a simple python worker which picks off messages from this queue and writes them to our metrics database. The reason we’ve done this, instead of connecting and writing directly to our metrics MongoDB database, is that we wanted our metrics setup to have absolutely zero impact on application performance. Much like Google Analytics offers no overhead to an application, our event tracking had to do the same. The MongoDB database that would store our events would be extremely write heavy, since we would be tracking anything we could think of, including every API request, page visited, Buffer user/profile/post/email created etc.

If, for whatever reason, our metrics db goes down or starts having write locking issues, our users shouldn’t be impacted. Using SQS as a middleman allows tracking data to queue up if any of these issues occur. SQS gives us enough time to figure out what the issue is, fix it, and then process that backlog. We’ve had quite a few times in the past year where using Amazon’s robust SQS service has saved us from losing any data during maintenance or downtime that would otherwise occur when creating a robust high throughput metrics framework from scratch.

We use MongoHQ to host our data. They’ve been super helpful with any challenges in scaling a db like ours. Since our setup is write heavy, we’ve initially set up a 400GB SSD replica set. As of today (May 16) we have 90 collections and are storing over 500 million documents.

We wrote simple client libraries for tracking data for every language that we use (PHP, Python, Java, NodeJS, Javascript, Objective-C). In addition to bufferapp.com, our API, mobile apps and internal tools all plug into this framework.

Tracking events

Our event tracking is super simple. When a developer creates a new event message, our python worker creates a generic event collection (if it doesn’t exist) and stores event data that’s defined by the developer. It will store the user or visitor id, and the date that the event occurred. It’ll also store the user_joined_at date which is useful for cohort analysis.

Here are some examples of event tracking our metrics platform lets us do.

Visitor page views in the app.

Like many other apps, we want to track every visitor that hits our page. There is a bunch of data that we want to store to understand the context around the event. We’d like to know the IP address, the URI they viewed, the user agent they’re using among other data.

Here’s what the tracking would look like in our app written in PHP:

Here’s the corresponding result in our MongoDB metrics db:

Logging User API calls

We track every API call our clients make to the Buffer API. Essentially what we’ve done here is create query-able logging for API requests. This has been way more effective than standard web server logs and has allowed us to dig deeper into API bugs, security issues and understanding the load on our API.

Experiment data

With this type of event tracking, our developers are able to track anything by writing a single line of code. This has been especially useful for measuring events specific to a feature experiment. This frictionless process helps keep us lean: we can measure feature usage as soon as a feature is launched. For example, we recently launched a group sharing feature for business customers so that they can group their Buffer social media accounts together. Our hypothesis was that people with several social media accounts prefer to share specific content to subsets of accounts. We wanted to quantifiably validate whether this is something many would use, or whether it’s a niche or power user feature. After a week of testing this out, we had our answer.

This example shows our tracking of our ‘group sharing’ experiment. We wanted to track each group that was created with this new feature. With this, we’re able to track the user, the groups created, the name of the group, and the date it was created.

Making sense of the data

We store a lot of tracking data. While it’s great that we’re tracking all this data, there would be no point if we weren’t able to make sense of it. Our goal for tracking this data was to create our own growth dashboard so we can keep track of key metrics, and understand results of experiments. Making sense of the data was one of the most challenging parts of setting up our growth platform.

MongoDB Aggregation

We rely heavily on MongoDB’s aggregation framework. It has been super handy for things like gauging API client requests by hour, response times separated by API endpoint, number of visitors based on referrers, cohort analysis and so much more.

Here’s a simple example of how we use MongoDB aggregation to obtain our average API response times between April 8th and April 9th:

With the aggregation framework, we have powerful insight into how clients are using our platform, which users are power users and a lot more. We previously created long running scripts to generate our cohort analysis reports. Now we can use MongoDB aggregation for much of this.

Running ETL jobs

We have several ETL jobs that run periodically to power our growth dashboard. This is how we make sense of our core data. Some of the more complex reports need this level of processing. For example, the way we measure product activation is whether someone has posted an update within a week of joining. With the way we’ve structured our data, this requires a join across two different collections. All of this processing is done in our ETL jobs. We upload the results to a different database which is used to power the views in our growth dashboard for faster loading.

Here are some reports on our growth dashboard that are powered by ETL jobs

Scaling Challenges and Pitfalls

We’ve faced a few different challenges and we’ve iterated to get to a point where we can make solid use out of our growth platform. Here are a few pitfalls and examples of challenges that we’ve faced in setting this up and scaling our platform.

Plan for high disk I/O and write throughput.

The DB server size and type play a key role in how quickly we can process and store events. In planning for the future, we knew that we’d be tracking quite a lot of data at a fast pace, so a db with high disk write throughput was key for us. We ended up going for a large SSD replica set. This of course really depends on your application and use case. If you use an intermediate datastore like SQS, you can always start small, and upgrade db servers when you need to without any data loss.

We keep an eye on mongostat and SQS queue size almost daily to see how our writes are doing.

One of the good things about an SSD backed DB is that disk reads are much quicker compared to hard disk. This means it’s much more feasible to run ad hoc queries on un-indexed fields. We do this all the time whenever we have a hunch of something to dig into further.

Be mindful of the MongoDB document limit and how data is structured

Our first iteration of schema design was not scalable. True, MongoDB does not perform schema validation but that doesn’t mean it’s not important to think about how data is structured. Originally, we tracked all events in a single user_metrics and visitor_metrics collection. An event was stored as an embedded object in an array in a user document. Our hope was that we wouldn’t need to do any joins and we could effectively segment out tracking data super easily by user.

We had fields as arrays that were unbounded and could grow infinitely causing the document size to grow. For some highly active users (and bots), after a few months of tracking data in this way some documents in this collection would hit the 16MB document limit and fail to write any more. This created various performance issues in processing updates, and in our growth worker and ETL jobs because there were these huge documents transferred over the wire. When this happened we had to move quickly to restructure our data.

Moving to a single collection per event type has been the most scalable and flexible solution.

Reading from secondaries

Some of our ETL jobs read and process a lot of data. If you end up querying documents that haven’t been read or written to recently, it is very possible this may be stored out of memory and on disk. Querying this data means MongoDB will page out some documents that have been touched recently and bring query results into memory. This will then make writing to that paged out document slower. It’s for this reason that we have set up our ETL and aggregation queries to read only from our secondaries in our replica-set, even though they may not be consistent with the primary.

Our secondaries have a high number of faults because of paging due to reading ‘stale’ data

Visualizing results

As I mentioned before, one of the more challenging parts about maintaining our own growth platform is extracting and visualizing the data in a way that makes a lot of sense. I can’t say that we’ve come to a great solution yet. We’ve put a lot of effort into building out and maintaining our growth dashboard and creating visualizations is the bottleneck for us today. There is really a lot of room to reduce the turnaround time. We have started to experiment a bit with using Stripe’s MoSQL to map results from MongoDB to PostgreSQL and connect with something like Chart.io to make this a bit more seamless. If you’ve come across some solid solutions for visualizing event tracking with MongoDB, I’d love to hear about it!

Event tracking for everyone!

We would love to open source our growth platform. It’s something we’re hoping to do later this year. We’ve learned a lot by setting up our own tracking platform. If you have any questions about any of this or would like to have more control of your own event tracking with MongoDB, just hit me up @sunils34

Want to help build out our growth platform? Buffer is looking to grow its growth team and reliability team!


Efficient Indexing in MongoDB 2.6

Jun 4 • Posted 3 months ago

By Osmar Olivo, Product Manager at MongoDB

One of the most powerful features of MongoDB is its rich indexing functionality. Users can specify secondary indexes on any field, compound indexes, geospatial, text, sparse, TTL, and others. Having extensive indexing functionality makes it easier for developers to build apps that provide rich functionality and low latency.

MongoDB 2.6 introduces a new query planner, including the ability to perform index intersection. Prior to 2.6 the query planner could only make use of a single index for most queries. That meant that if you wanted to query on multiple fields together, you needed to create a compound index. It also meant that if there were several different combinations of fields you wanted to query on, you might need several different compound indexes.

Each index adds overhead to your deployment - indexes consume space, on disk and in RAM, and indexes are maintained during updates, which adds disk IO. In other words, indexes improve the efficiency of many operations, but they also come at a cost. For many applications, index intersection will allow users to reduce the number of indexes they need while still providing rich features and low latency.

In the following sections we will take a deep dive into index intersection and how it can be applied to applications.

An Example - The Phone Book

Let’s take the example of a phone book with the following schema.

{
    FirstName
    LastName
    Phone_Number
    Address
}

If I were to search for “Smith, John” how would I index the following query to be as efficient as possible?

db.phonebook.find({ FirstName : "John", LastName : "Smith" })

I could use an individual index on FirstName and search for all of the “Johns”.

This would look something like ensureIndex( { FirstName : 1 } )

We run this query and we get back 200,000 John Smiths. Looking at the explain() output below however, we see that we scanned 1,000,000 “Johns” in the process of finding 200,000 “John Smiths”.

> db.phonebook.find({ FirstName : "John", LastName : "Smith"}).explain()
{
    "cursor" : "BtreeCursor FirstName_1",
    "isMultiKey" : false,
    "n" : 200000,
    "nscannedObjects" : 1000000,
    "nscanned" : 1000000,
    "nscannedObjectsAllPlans" : 1000101,
    "nscannedAllPlans" : 1000101,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 2,
    "nChunkSkips" : 0,
    "millis" : 2043,
    "indexBounds" : {
        "FirstName" : [
            [
                "John",
                "John"
            ]
        ]
    },
    "server" : "Oz-Olivo-MacBook-Pro.local:27017"
}

How about creating an individual index on LastName?

This would look something like ensureIndex( { LastName : 1 } )

Running this query we get back 200,000 “John Smiths” but our explain output says that we now scanned 400,000 “Smiths”. How can we make this better?

db.phonebook.find({ FirstName : "John", LastName : "Smith"}).explain()
{
    "cursor" : "BtreeCursor LastName_1",
    "isMultiKey" : false,
    "n" : 200000,
    "nscannedObjects" : 400000,
    "nscanned" : 400000,
    "nscannedObjectsAllPlans" : 400101,
    "nscannedAllPlans" : 400101,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 1,
    "nChunkSkips" : 0,
    "millis" : 852,
    "indexBounds" : {
        "LastName" : [
            [
                "Smith",
                "Smith"
            ]
        ]
    },
    "server" : "Oz-Olivo-MacBook-Pro.local:27017"
}

So we know that there are 1,000,000 “John” entries, 400,000 “Smith” entries, and 200,000 “John Smith” entries in our phonebook. Is there a way that we can scan just the 200,000 we need?

In the case of a phone book this is somewhat simple; since we know that we want it to be sorted by LastName, FirstName, we can create a compound index on them, like the below.

ensureIndex( { LastName : 1, FirstName : 1 } )

db.phonebook.find({ FirstName : "John", LastName : "Smith"}).explain()
{
    "cursor" : "BtreeCursor LastName_1_FirstName_1",
    "isMultiKey" : false,
    "n" : 200000,
    "nscannedObjects" : 200000,
    "nscanned" : 200000,
    "nscannedObjectsAllPlans" : 200000,
    "nscannedAllPlans" : 200000,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "millis" : 370,
    "indexBounds" : {
        "LastName" : [
            [
                "Smith",
                "Smith"
            ]
        ],
        "FirstName" : [
            [
                "John",
                "John"
            ]
        ]
    },
    "server" : "Oz-Olivo-MacBook-Pro.local:27017"
}

Looking at the explain on this, we see that the index only scanned the 200,000 documents that matched, so we got a perfect hit.

Beyond Compound Indexes

The compound index is a great solution in the case of a phonebook in which we always know how we are going to be querying our data. Now what if we have an application in which users can arbitrarily query for different fields together? We can’t possibly create a compound index for every possible combination because of the overhead imposed by indexes, as we discussed above, and because MongoDB limits you to 64 indexes per collection. Index intersection can really help.

Imagine the case of a medical application which doctors use to filter through patients. At a high level, omitting several details, a basic schema may look something like the below.

{
      Fname
      LName
      SSN
      Age
      Blood_Type
      Conditions : [] 
      Medications : [ ]
      ...
      ...
}

Some sample searches that a doctor/nurse may run on this system would look something like the below.

Find me a Patient with Blood_Type = O under the age of 50

db.patients.find( { Blood_Type : "O", Age : { $lt : 50 } } )

Find me all patients over the age of 60 on Medication X

db.patients.find( { Medications : "X", Age : { $gt : 60 } } )

Find me all Diabetic patients on medication Y

db.patients.find( { Conditions : "Diabetes", Medications : "Y" } )

With all of the unstructured data in modern applications, along with the desire to be able to search for things as needed in an ad-hoc way, it can become very difficult to predict usage patterns. Since we can’t possibly create compound indexes for every combination of fields, because we don’t necessarily know what those will be ahead of time, we can try indexing individual fields to try to salvage some performance. But as shown above in our phone book application, this can lead to performance issues in which we pull documents into memory that are not matches.

To avoid the paging of unnecessary data, the new index intersection feature in 2.6 increases the overall efficiency of these types of ad-hoc queries by processing the indexes involved individually and then intersecting the result set to find the matching documents. This means you only pull the final matching documents into memory and everything else is processed using the indexes. This processing will utilize more CPU, but should greatly reduce the amount of IO done for queries where all of the data is not in memory as well as allow you to utilize your memory more efficiently.

For example, looking at the earlier example:

db.patients.find( { Blood_Type : "O", Age : { $lt : 50 } } )

It is inefficient to find all patients with Blood_Type : "O" (which could be millions) and then pull each of those documents into memory to find the ones with Age < 50, or vice versa.

Instead, the query planner finds all patients with Blood_Type : "O" using the index on Blood_Type, and all patients with Age < 50 using the index on Age, and then only pulls the intersection of these 2 result sets into memory. The query planner only needs to fit the subsets of the indexes in memory, instead of pulling in all of the documents. This in turn causes less paging, and less thrashing of the contents of memory, which will yield overall better performance.

Index intersection allows for much more efficient use of existing RAM, so less total memory will usually be required to fit the working set than previously. Also, if you had several compound indexes made up of different combinations of fields, you may be able to reduce the total number of indexes on the system. This means storing fewer indexes in memory as well as achieving better insert/update performance, since fewer indexes must be updated.

As of version 2.6.0, you cannot intersect with geo or text indexes, and you can intersect at most 2 separate indexes with each other. These limitations are likely to change in a future release.

Optimizing Multi-key Indexes

It is also possible to intersect an index with itself in the case of multi-key indexes. Consider the below query:

Find me all patients with Diabetes & High Blood Pressure

db.patients.find( { Conditions : { $all : [ "Diabetes", "High Blood Pressure" ] } } )

In this case we will find the result set of all patients with Diabetes, and the result set of all patients with High Blood Pressure, and intersect the two to get all patients with both. Again, this requires less memory and less disk IO for better overall performance. As of the 2.6.0 release, an index can intersect with itself up to 10 times.

Do We Still Need Compound Indexes?

To be clear, compound indexing will ALWAYS be more performant IF you know what you are going to be querying on and can create one ahead of time. Furthermore, if your working set is entirely in memory, then you will not reap any of the benefits of Index Intersection as it is primarily based on reducing IO. But in a more ad-hoc case where one cannot predict the shape of the queries and the working set is much larger than available memory, index intersection will automatically take over and choose the most performant path.

Appboy’s co-founder and CIO Jon Hyman discusses how the leading platform for app marketing automation leverages MongoDB and ObjectRocket for real-time data aggregation and scale and gives a preview of his talk with Kenny Gorman of ObjectRocket at MongoDB World.

Want to see more? MongoDB World will feature over 80 MongoDB experts from around the world. Early bird ticket prices for the event end May 23. Register now to grab your seat

MongoDB’s New Bulk API

May 6 • Posted 4 months ago

By Christian Kvalheim, Driver Lead and Node.js Driver Maintainer at MongoDB

The New Bulk API

One of the core new features in MongoDB 2.6 is the new bulk write operations. All the drivers include a new bulk api that allows applications to leverage these new operations using a fluid style API. Let’s explore the API and how it’s implemented in the Node.js driver.

The API

The API has two core concepts: the ordered and the unordered bulk operation. The main difference is in the way the operations are executed in bulk. In the case of an ordered bulk operation, every operation will be executed in the order it was added to the bulk operation. In the case of an unordered bulk operation, however, there is no guarantee of the order in which the operations are executed. Later we will look at how each is implemented.

Operations

You can initialize an ordered or unordered bulk operation in the following way.

    var ordered = db.collection('documents').initializeOrderedBulkOp();
    var unordered = db.collection('documents').initializeUnorderedBulkOp();

Both the ordered and unordered instances are bulk operation objects that we can add insert, update and remove operations to. The following operations are valid.

updateOne (update first matching document)

```ordered.find({ a : 1 }).updateOne({$inc : {x : 1}});```

update (update all matching documents)

```ordered.find({ a : 1 }).update({$inc : {x : 2}});```

replaceOne (replace entire document)

```ordered.find({ a : 1 }).replaceOne({ x : 2});```

updateOne or upsert (update first existing document or upsert)

```ordered.find({ a : 2 }).upsert().updateOne({ $inc : { x : 1}});```

update or upsert (update all or upsert)

```ordered.find({ a : 2 }).upsert().update({ $inc : { x : 2}});```

replace or upsert (replace first document or upsert)

```ordered.find({ a : 2 }).upsert().replaceOne({ x : 3 });```

removeOne (remove the first document matching)

```ordered.find({ a : 2 }).removeOne();```

remove (remove all documents matching)

```ordered.find({ a : 1 }).remove();```

insert

```ordered.insert({ a : 5});```

What happens under the covers when you start adding operations to a bulk operation? Let’s take a look at the new write operations to see how it works.

The New Write Operations

MongoDB 2.6 introduces a completely new set of write operations. Before 2.6 all write operations were done using wire protocol messages at the socket level. From 2.6 this changes to using commands.

Insert Write Command

The insert write commands allow an application to insert batches of documents. Here’s an example:

    {
        insert: 'collection name'
      , documents: [{ a : 1}, ...]
      , writeConcern: {
        w: 1, j: true, wtimeout: 1000
      }
      , ordered: true/false
    }

A couple of things to note. The documents field contains an array of all the documents that are to be inserted. The writeConcern field specifies what would previously have been a getLastError command following the pre-2.6 write operations. In other words, there is always a response from a write operation in 2.6. This means that w:0 has different semantics than what one is used to in pre-2.6. In this context, w:0 basically means only return an ack, without any information about the success or failure of the insert operations.

Let’s take a look at the update and remove write commands before seeing the results that are returned when executing these operations in 2.6.

Update Write Command

There are some slight differences in the update write command in comparison to the insert write command. Here’s an example:

    {
        update: 'collection name'
      , updates: [{ 
            q: { a : 1 }
          , u: { $inc : { x : 1}}
          , multi: true/false
          , upsert: true/false
        }, ...]
      , writeConcern: {
        w: 1, j: true, wtimeout: 1000
      }
      , ordered: true/false
    }

The main difference here is that the updates array is an array of update operations, where each entry in the array contains the q field that specifies the selector for the update. The u field contains the update operation. multi specifies whether to update only the first matching document or all documents that match the selector. Finally, upsert tells the server whether to perform an upsert if the document is not found.

Finally let’s look at the remove write command.

Remove Write Command

The remove write command is very similar to the update write command. Here’s an example:

    {
        delete: 'collection name'
      , deletes: [{ 
            q: { a : 1 }
          , limit: 0/1
        }, ...]
      , writeConcern: {
        w: 1, j: true, wtimeout: 1000
      }
      , ordered: true/false
    }

Similar to the update example, we can see that the entries in the deletes array contain documents with specific fields. The q field is the selector that will match which documents will be removed. The limit field sets the number of elements to be removed. Currently limit only supports two values, 0 and 1. The value 0 for limit removes all documents that match the selector. A value of 1 for limit removes the first matching document only.

Now let’s take a look at how results are returned for these new write commands.

Write Command Results

One of the best new aspects of the new write commands is that they can return information about each individual operation error in the batch. Results are efficient - only information about errors is returned, along with the aggregated counts of successful operations. Here’s an example of a comprehensive result:

    {
      "ok" : 1,
      "n" : 0,
      "nModified": 1, (Applies only to update)
      "nRemoved": 1, (Applies only to removes)
      "writeErrors" : [
        {
          "index" : 0,
          "code" : 11000,
          "errmsg" : "insertDocument :: caused by :: 11000 E11000 duplicate key error index: t1.t.$a_1  dup key: { : 1.0 }"
        }
      ],
      writeConcernError: {
        code : 22,
        errInfo: { wtimeout : true },
        errmsg: "Could not replicate operation within requested timeout"
      }      
    }

The two most interesting fields here are writeErrors and writeConcernError. If we take a look at writeErrors we can see how it’s an array of objects that include an index field as well as a code and errmsg. The index field references the position of the failing document in the original documents, updates or deletes array, allowing the application to identify the original batch document that failed.

The Effect of Ordered (true/false)

If ordered is set to true the write operation will fail on the first write error (meaning the first error that fails to apply the operation to memory). If one sets ordered to false the operation will continue until all operations have been executed (potentially in parallel), then return all the results. writeConcernError on the other hand does not stop the processing of a bulk operation if a document fails to be written to MongoDB.

It helps to think of writeErrors as hard errors and writeConcernError as a soft error.

The Special Case of w:0

As I mentioned previously, the semantics of w:0 changed for the write commands. The old style of write operations before 2.6 was a combination of a write wire message and a getLastError command, and w:0 meant that the driver would not send a getLastError command after the write operation.

In 2.6 the new insert/update/delete commands will always respond. While w:0 would not return a result in versions of MongoDB before 2.6, in 2.6 and above it will. However, the result is truncated: it only reports whether the command succeeded or failed, with no per-operation detail.

As a result, if you execute:

    {
        insert: 'collection name'
      , documents: [{ a : 1}, ...]
      , writeConcern: {
        w: 0
      }
      , ordered: true/false
    }

All you receive from the server is the result:

    { ok : 1 }

The Implication For The Bulk API

There are some implications to the fact that write commands are not mixed operations: each command contains only inserts, updates or removes. The Bulk API lets you mix operation types and then merges the results back into a single result that simulates a mixed-operations command in MongoDB. What does that mean in practice? Let's look at how the Node.js driver implements ordered and unordered bulk operations, using examples to show what happens.

Ordered Operations

Let’s take the following set of operations:

    var ordered = db.collection('documents').initializeOrderedBulkOp();
    ordered.insert({ a : 1 });
    ordered.find({ a : 1 }).update({ $inc: { x : 1 }});
    ordered.insert({ a: 2 });
    ordered.find({ a : 2 }).remove();
    ordered.insert({ a: 3 });

When running in ordered mode the bulk API guarantees the ordering of the operations and thus will execute this as 5 operations one after the other:

    insert bulk operation
    update bulk operation
    insert bulk operation
    remove bulk operation
    insert bulk operation

We have now reduced the bulk API to performing single operations, and throughput suffers accordingly.

If we re-order our bulk operations in the following way:

    var ordered = db.collection('documents').initializeOrderedBulkOp();
    ordered.insert({ a : 1 });
    ordered.insert({ a: 2 });
    ordered.insert({ a: 3 });
    ordered.find({ a : 1 }).update({ $inc: { x : 1 }});
    ordered.find({ a : 2 }).remove();

The execution is reduced to the following operations one after the other:

    insert bulk operation
    update bulk operation
    remove bulk operation

Thus, for ordered bulk operations, the order in which operations are added determines how many write commands need to be executed, and therefore the throughput that is possible.

Unordered Operations

Unordered operations do not guarantee the execution order of operations. Let’s take the example from above:

    var unordered = db.collection('documents').initializeUnorderedBulkOp();
    unordered.insert({ a : 1 });
    unordered.find({ a : 1 }).update({ $inc: { x : 1 }});
    unordered.insert({ a: 2 });
    unordered.find({ a : 2 }).remove();
    unordered.insert({ a: 3 });

The Node.js driver will collect the operations into separate, type-specific batches, so we get:

    insert bulk operation
    update bulk operation
    remove bulk operation

Unlike the ordered case, these batches are all executed in parallel in Node.js, and the results are merged once they have all finished.
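
In both modes, nothing is sent to the server until execute() is called. Here is a minimal sketch of running the unordered batch above and reading the merged result; the field names follow the Node.js driver's BulkWriteResult, but exact details may vary between driver versions:

    unordered.execute(function(err, result) {
      if (err) throw err;
      // the merged result aggregates the counts from the underlying write commands
      console.log(result.nInserted);        // documents inserted
      console.log(result.nModified);        // documents modified
      console.log(result.nRemoved);         // documents removed
      console.log(result.getWriteErrors()); // per-operation errors, if any
    });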

Takeaway

As of 2.6, MongoDB only allows batches of inserts, updates or removes, not a mixed batch containing all three operation types. When performing ordered bulk operations we need to keep this in mind to avoid the scenario above. For unordered bulk operations, however, the missing mixed batch type in 2.6 does not impact performance.

Note: Although the Bulk API supports downconversion to 2.4, the performance impact is considerable, as all operations are reduced to single write operations with a getLastError. It's recommended to use this API primarily with 2.6 or higher.


The MongoDB Open Source Hack Contest

Apr 28 • Posted 4 months ago

Some of the best MongoDB tools come from the Open Source community. Projects like the Node.js Driver, Mongoose and Meteor have become the backbone of many MongoDB apps and have helped support the developer community all over the world. We want to see more of what the community has built.

For the month of May, we’ll be hosting a worldwide hack contest for Open Source tools built on or connected to MongoDB. The winner of the contest will receive a ticket to OSCON, furnished by O’Reilly.

Guidelines:

  • All projects must be built with the MongoDB source code or on top of a MongoDB API, community or MongoDB supported driver
  • Any new drivers created should abide by the driver requirements listed in the MongoDB Manual
  • All hacks will be judged by MongoDB engineers

All entries can be submitted to community@MongoDB.com with the following information before May 31:

  • Github or Bitbucket URL
  • Description of the project
  • How do users benefit from this application?
  • Why did you choose to contribute to MongoDB?

We’re looking forward to seeing your hacks come in!


maxTimeMS() and Query Optimizer Introspection in MongoDB 2.6

Apr 23 • Posted 5 months ago

By Alex Komyagin, Technical Service Engineer in MongoDB’s New York Office

In this series of posts we cover some of the enhancements in MongoDB 2.6 and how they will be valuable for users. You can find a more comprehensive list of changes in our 2.6 release notes. These are changes I believe you will find helpful, especially for our advanced users.

This post introduces two new features: maxTimeMS and query optimizer introspection. We will also look at a specific support case where these features would have been helpful.

maxTimeMS: SERVER-2212

The socket timeout option (e.g. MongoOptions.socketTimeout in the Java driver) specifies how long to wait for responses from the server. This timeout governs all types of requests (queries, writes, commands, authentication, etc.). The default value is “no timeout” but users tend to set the socket timeout to a relatively small value with the intention of limiting how long the database will service a single request. It is thus surprising to many users that while a socket timeout will close the connection, it does not affect the underlying database operation. The server will continue to process the request and consume resources even after the connection has closed.

Many applications are configured to retry queries when the driver reports a failed connection. If a query whose connection is killed by a socket timeout continues to consume resources on the server, a retry can give rise to a cascading effect: the application may retry the same resource-intensive query multiple times, increasing the load on the system and leading to severe performance degradation. Before 2.6 the only way to cancel these queries was to issue db.killOp, which requires manual intervention.

MongoDB 2.6 introduces a new query parameter, maxTimeMS, which limits how long an operation can run in the database. When the time limit is reached, MongoDB automatically cancels the query. maxTimeMS can be used to efficiently prevent unexpectedly slow queries from overloading MongoDB. Following the same logic, maxTimeMS can also be used to identify slow or unoptimized queries.
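
For example, in the mongo shell you can cap a single query at 100 milliseconds; the collection name and filter below are placeholders:

    // the server aborts the query and returns an 'operation exceeded time limit' error
    // if it runs for longer than 100 ms
    db.userComments.find({ status: "active" }).maxTimeMS(100)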

There are a few important notes about using maxTimeMS:

  • maxTimeMS can be set on a per-operation basis — long running aggregation jobs can have a different setting than simple CRUD operations.
  • requests can block behind locking operations on the server, and that blocking time is not counted, i.e. the timer starts ticking only after the actual start of the query when it initially acquires the appropriate lock;
  • operations are interrupted only at interrupt points where an operation can be safely aborted — the total execution time may exceed the specified maxTimeMS value;
  • maxTimeMS can be applied to both CRUD operations and commands, but not all commands are interruptible;
  • if a query opens a cursor, subsequent getmore requests will be included in the total time (but network roundtrips will not be counted);
  • maxTimeMS does not override the inactive cursor timeout for idle cursors (default is 10 min).

All MongoDB drivers support maxTimeMS in their releases for MongoDB 2.6. For example, the Java driver will start supporting it in 2.12.0 and 3.0.0, and the Python driver in version 2.7.

To illustrate how maxTimeMS can come in handy, let’s talk about a recent case that the MongoDB support team was working on. The case involved performance degradation of a MongoDB cluster that is powering a popular site with millions of users and heavy traffic. The degradation was caused by a huge amount of read queries performing full scans of the collection. The collection in question stores user comment data, so it has billions of records. Some of these collection scans were running for 15 minutes or so, and they were slowing down the whole system.

Normally, these queries would use an index, but in this case the query optimizer was choosing an unindexed plan. While the solution to this kind of problem is to hint the appropriate index, maxTimeMS can be used to prevent unintentional runaway queries by controlling the maximum execution time. Queries that exceed the maxTimeMS threshold will error out on the application side (in the Java driver this surfaces as a MongoExecutionTimeoutException). maxTimeMS will help users to prevent unexpected performance degradation, and to gain better visibility into the operation of their systems.

In the next section we’ll take a look at another feature which would help in diagnosing the case we just discussed.

Query Optimizer Introspection: SERVER-666

To troubleshoot the support case described above and to understand why the query optimizer was not picking up the correct index for queries, it would have been very helpful to know a few things, such as whether the query plan had changed and to see the query plan cache.

MongoDB determines the best execution plan for a query and stores this plan in a cache for reuse. The cached plan is refreshed periodically and under certain operational conditions (which are discussed in detail below). Prior to 2.6 this caching behavior was opaque to the client. Version 2.6 provides a new query subsystem and query plan cache. Now users have visibility and control over the plan cache using a set of new methods for query plan introspection.

Each collection contains at most one plan cache. The cache is cleared every time a change is made to the indexes for the collection. To determine the best plan, multiple plans are executed in parallel and the winner is selected based on the number of results retrieved within a fixed number of steps, where each step is basically one "advance" of the query cursor. When a query has successfully passed the planning process, it is added to the cache along with the related index information.

A very interesting feature of the new query execution framework is the plan cache feedback mechanism. MongoDB stores runtime statistics for each cached plan in order to remove plans that are determined to be the cause of performance degradation. In practice, we don’t see these degradations often, but if they happen it is usually a consequence of a change in the composition of the data. For example, with new records being inserted, an indexed field may become less selective, leading to slower index performance. These degradations are extremely hard to manually diagnose, and we expect the feedback mechanism to automatically handle this change if a better alternative index is present.

The following events will result in a cached plan removal:

  • Performance degradation
  • Index add/drop
  • No more space in cache (the total number of plans stored in the collection cache is currently limited to 200; this is subject to change in future releases)
  • Explicit commands to mutate cache
  • After the number of write operations on a collection exceeds the built-in limit, the whole collection plan cache is dropped (data distribution has changed)

MongoDB 2.6 supports a set of commands to view and manipulate the cache: you can list all known query shapes (planCacheListQueryShapes), display the cached plans for a query shape (planCacheListPlans), and remove a query shape from the cache or empty the whole cache (planCacheClear).

Here is an example invocation of the planCacheListQueryShapes command, which lists the shape of the following query:

    db.test.find(
      { first_name: "john", last_name: "galt" },
      { _id: 0, first_name: 1, last_name: 1 }
    ).sort({ age: 1 })

> db.runCommand({planCacheListQueryShapes: "test"})
{
    "shapes" : [
        {
            "query" : {
                "first_name" : "alex",
                "last_name" : "komyagin"
            },
            "sort" : {
                "age" : 1
            },
            "projection" : {
                "_id" : 0,
                "first_name" : 1,
                "last_name" : 1
            }
        }
    ],
    "ok" : 1
}

The exact values in the query predicate are insignificant in determining the query shape. For example, the query predicate {first_name:"john", last_name:"galt"} is equivalent to the query predicate {first_name:"alex", last_name:"komyagin"}.

Additionally, with log level set to 1 or greater MongoDB will log plan cache changes caused by the events listed above. You can set the log level using the following command:

    use admin
    db.runCommand( { setParameter: 1, logLevel: 1 } )

Please don’t forget to change it back to 0 afterwards, since log level 1 logs all query operations and it can negatively affect the performance of your system.

Together, the new plan cache and the new query optimizer should make related operations more transparent and help users to have the visibility and control necessary for maintaining predictable response times for their queries.

You can see how these new features work for yourself by trying our latest release, MongoDB 2.6, available for download here. I hope you will find these features helpful. We look forward to hearing your feedback; please post your questions in the mongodb-user Google group.


MongoDB 2.6: Our Biggest Release Ever

Apr 8 • Posted 5 months ago

By Eliot Horowitz, CTO and Co-founder, MongoDB

Discuss on Hacker News

In the five years since the initial release of MongoDB, and after hundreds of thousands of deployments, we have learned a lot. The time has come to take everything we have learned and create a basis for continued innovation over the next ten years.

Today I’m pleased to announce that, with the release of MongoDB 2.6, we have achieved that goal. With comprehensive core server enhancements, a groundbreaking new automation tool, and critical enterprise features, MongoDB 2.6 is by far our biggest release ever.

You’ll see the benefits in better performance and new innovations. We re-wrote the entire query execution engine to improve scalability, and took our first step in building a sophisticated query planner by introducing index intersection. We’ve made the codebase easier to maintain, and made it easier to implement new features. Finally, MongoDB 2.6 lays the foundation for massive improvements to concurrency in MongoDB 2.8, including document-level locking.

From the very beginning, MongoDB has offered developers a simple and elegant way to manage their data. Now we’re bringing that same simplicity and elegance to managing MongoDB. MongoDB Management Service (MMS), which already provides 35,000 MongoDB customers with monitoring and alerting, now provides backup and point-in-time restore functionality, in the cloud and on-premises.

We are also announcing a game-changing feature coming later this year: automation, also with hosted and on-premises options. Automation will allow you to provision and manage MongoDB replica sets and sharded clusters via a simple yet sophisticated interface.

MongoDB 2.6 brings security, integration and analytics enhancements to ease deployment in enterprise environments. LDAP, x.509 and Kerberos authentication are critical enhancements for organizations that require a single authentication mechanism across their entire infrastructure. To enhance security, MongoDB 2.6 implements TLS encryption, user-defined roles, auditing and field-level redaction, a critical building block for trusted systems. IBM Guardium also now offers integration with MongoDB, providing more extensive auditing abilities.

These are only a few of the key improvements; read the full official release notes for more details.

MongoDB 2.6 was a major endeavor and bringing it to fruition required hard work and coordination across a rapidly growing team. Over the past few years we have built and invested in that team, and I can proudly say we have the experience, drive and determination to deliver on this and future releases. There is much still to be done, and with MongoDB 2.6, we have a foundation for the next decade of database innovation.


MongoDB Innovation Awards: Call for Nominations

Apr 2 • Posted 5 months ago

Fortune 500 enterprises, startups, hospitals, governments and organizations of all kinds use MongoDB because it is the best database for modern applications. The MongoDB Innovation Awards recognize organizations and individuals who are changing the world with MongoDB.

Nominations are open to any individual representing an organization that runs MongoDB, including partner solutions built on or for MongoDB.

The deadline to submit is midnight eastern time on May 30. A panel of judges will review nominations. Winners will be announced at MongoDB World on June 23-25.

Each winner will receive:

  • MongoDB Innovation Award
  • Pass to MongoDB World
  • Invitation to award winners reception at MongoDB World
  • Inclusion in Innovation Awards press release, blog post and email newsletter
  • $2,500 Amazon Gift Card

Submit a nomination by completing this nomination form — you’ll receive a discount code for MongoDB World.

Running MongoDB Queries Concurrently With Go

Mar 24 • Posted 6 months ago

This is a guest post by William Kennedy, managing partner at Ardan Studios in Miami, FL, a mobile and web app development company. Bill is also the author of the blog GoingGo.Net and the organizer for the Go-Miami and Miami MongoDB meetups in Miami. Bill looked for a new language in 2013 that would allow him to develop back end systems in Linux and found Go. He has never looked back.

If you are attending GopherCon 2014 or plan to watch the videos once they are released, this article will prepare you for the talk by Gustavo Niemeyer and Steve Francia. It provides a beginners view for using the Go mgo driver against a MongoDB database.

Introduction

MongoDB supports many different programming languages thanks to a great set of drivers. One such driver is the MongoDB Go driver which is called mgo. This driver was developed by Gustavo Niemeyer from Canonical with some assistance from MongoDB Inc. Both Gustavo and Steve Francia, the head of the drivers team, will be talking at GopherCon 2014 in April about “Painless Data Storage With MongoDB and Go”. The talk describes the mgo driver and how MongoDB and Go work well together for building highly scalable and concurrent software.

MongoDB and Go let you build scalable software on many different operating systems and architectures, without the need to install frameworks or runtime environments. Go programs are native binaries and the Go tooling is constantly improving to create binaries that run as fast as equivalent C programs. That wouldn’t mean anything if writing code in Go was complicated and as tedious as writing programs in C. This is where Go really shines because once you get up to speed, writing programs in Go is fast and fun.

In this post I am going to show you how to write a Go program using the mgo driver to connect and run queries concurrently against a MongoDB database. I will break down the sample code and explain a few things that can be a bit confusing to those new to using MongoDB and Go together.

Sample Program

The sample program connects to a public MongoDB database I have hosted with MongoLab. If you have Go and Bazaar installed on your machine, you can run the program against my database. The program is very simple - it launches ten goroutines that individually query all the records from the buoy_stations collection inside the goinggo database. The records are unmarshaled into native Go types and each goroutine logs the number of documents returned:

Now that you have seen the entire program, we can break it down. Let’s start with the type structures that are defined in the beginning:

The structures represent the data that we are going to retrieve and unmarshal from our query. BuoyStation represents the main document and BuoyCondition and BuoyLocation are embedded documents. The mgo driver makes it easy to use native types that represent the documents stored in our collections by using tags. With the tags, we can control how the mgo driver unmarshals the returned documents into our native Go structures.

Now let’s look at how we connect to a MongoDB database using mgo:

We start with creating a mgo.DialInfo object. Connecting to a replica set can be accomplished by providing multiple addresses in the Addrs field or with a single address. If we are using a single host address to connect to a replica set, the mgo driver will learn about any remaining hosts from the replica set member we connect to. In our case we are connecting to a single host.

After providing the host, we specify the database, username and password we need for authentication. One thing to note is that the database we authenticate against may not necessarily be the database our application needs to access. Some applications authenticate against the admin database and then use other databases depending on their configuration. The mgo driver supports these types of configurations very well.

Next we use the mgo.DialWithInfo method to create a mgo.Session object. Each session specifies a Strong or Monotonic mode, and other settings such as write concern and read preference. The mgo.Session object maintains a pool of connections to MongoDB. We can create multiple sessions with different modes and settings to support different aspects of our applications.

The next line of code sets the mode for the session. There are three modes that can be set, Strong, Monotonic and Eventual. Each mode sets a specific consistency for how reads and writes are performed. For more information on the differences between each mode, check out the documentation for the mgo driver.

We are using Monotonic mode which provides reads that may not entirely be up to date, but the reads will always see the history of changes moving forward. In this mode reads occur against secondary members of our replica sets until a write happens. Once a write happens, the primary member is used. The benefit is some distribution of the reading load can take place against the secondaries when possible.

With the session set and ready to go, next we execute multiple queries concurrently:

This code is classic Go concurrency in action. First we create a sync.WaitGroup object so we can keep track of all the goroutines we are going to launch as they complete their work. Then we immediately set the count of the sync.WaitGroup object to ten and use a for loop to launch ten goroutines using the RunQuery function. The keyword go is used to launch a function or method to run concurrently. The final line of code calls the Wait method on the sync.WaitGroup object which locks the main goroutine until everything is done processing.

To learn more about Go concurrency and better understand how this particular piece of code works, check out these posts on concurrency and channels.

Now let’s look at the RunQuery function and see how to properly use the mgo.Session object to acquire a connection and execute a query:

The very first thing we do inside of the RunQuery function is to defer the execution of the Done method on the sync.WaitGroup object. The defer keyword postpones the execution of the Done method until the RunQuery function returns. This guarantees that the sync.WaitGroup object's count will decrement even if an unhandled exception occurs.

Next we make a copy of the session we created in the main goroutine. Each goroutine needs to create a copy of the session so they each obtain their own socket without serializing their calls with the other goroutines. Again, we use the defer keyword to postpone and guarantee the execution of the Close method on the session once the RunQuery function returns. Closing the session returns the socket back to the main pool, so this is very important.

To execute a query we need a mgo.Collection object. We can get a mgo.Collection object through the mgo.Session object by specifying the name of the database and then the collection. Using the mgo.Collection object, we can perform a Find and retrieve all the documents from the collection. The All function will unmarshal the response into our slice of BuoyStation objects. A slice is a dynamic array in Go. Be aware that the All method will load all the data in memory at once. For large collections it is better to use the Iter method instead. Finally, we just log the number of BuoyStation objects that are returned.

Conclusion

The example shows how to use Go concurrency to launch multiple goroutines that can execute queries against a MongoDB database independently. Once a session is established, the mgo driver exposes all of the MongoDB functionality and handles the unmarshaling of BSON documents into Go native types.

MongoDB can handle a large number of concurrent requests when you architect your MongoDB databases and collections with concurrency in mind. Go and the mgo driver are perfectly aligned to push MongoDB to its limits and build software that can take advantage of all the computing power that is available.

The mgo driver provides a safe way to leverage Go’s concurrency support and you have the flexibility to execute queries concurrently and in parallel. It is best to take the time to learn a bit about MongoDB replica sets and load balancer configuration. Then make sure the load balancer is behaving as expected under the different types of load your application can produce.

Now is a great time to see what MongoDB and Go can do for your software applications, web services and service platforms. Both technologies are being battle tested everyday by all types of companies, solving all types of business and computing problems.

Processing 2 Billion Documents A Day And 30TB A Month With MongoDB

Mar 14 • Posted 6 months ago

This is a guest post by David Mytton. He has been programming Python for over 10 years and founded his website and server monitoring company, Server Density, back in 2009.

Server Density processes over 30TB/month of incoming data points from the servers and web checks we monitor for our customers, ranging from simple Linux system load average to website response times from 18 different countries. All of this data goes into MongoDB in real time and is pulled out when customers need to view graphs, update dashboards and generate reports.

We’ve been using MongoDB in production since mid-2009 and have learned a lot over the years about scaling the database. We run multiple MongoDB clusters but the one storing the historical data does the most throughput and is the one I shall focus on in this article, going through some of the things we’ve done to scale it.

1. Use dedicated hardware, and SSDs

All our MongoDB instances run on dedicated servers across two data centers at Softlayer. We’ve had bad experiences with virtualisation because you have no control over the host, and databases need guaranteed performance from disk i/o. When running on shared storage (e.g., a SAN) this is difficult to achieve unless you can get guaranteed throughput from things like AWS’s Provisioned IOPS on EBS (which are backed by SSDs).

MongoDB doesn’t really have many bottlenecks when it comes to CPU because CPU-bound operations are rare (usually things like building indexes), but what really causes problems is CPU steal - when other guests on the host are competing for the CPU resources.

The way we have combated these problems is to eliminate the possibility of CPU steal and noisy neighbours by moving onto dedicated hardware. And we avoid problems with shared storage by deploying the dbpath onto locally mounted SSDs.

I’ll be speaking in-depth about managing MongoDB deployments in virtualized or dedicated hardware at MongoDB World this June.

2. Use multiple databases to benefit from improved concurrency

Running the dbpath on an SSD is a good first step but you can get better performance by splitting your data across multiple databases, and putting each database on a separate SSD with the journal on another.

Locking in MongoDB is managed at the database level so moving collections into their own databases helps spread things out - mostly important for scaling writes when you are also trying to read data. If you keep databases on the same disk you’ll start hitting the throughput limitations of the disk itself. This is improved by putting each database on its own SSD by using the directoryperdb option. SSDs help by significantly alleviating i/o latency, which is related to the number of IOPS and the latency for each operation, particularly when doing random reads/writes. This is even more visible for Windows environments where the memory mapped data files are flushed serially and synchronously. Again, SSDs help with this.

The journal is always within a directory so you can mount this onto its own SSD as a first step. All writes go via the journal and are later flushed to disk so if your write concern is configured to return when the write is successfully written to the journal, making those writes faster by using an SSD will improve query times. Even so, enabling the directoryperdb option gives you the flexibility to optimise for different goals (e.g., put some databases on SSDs and some on other types of disk, or EBS PIOPS volumes, if you want to save cost).
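
As a rough sketch of how this fits together on a single node (the paths and layout are illustrative assumptions, not Server Density's exact setup), you might mount one SSD at the dbpath, separate SSDs at each per-database directory, another inside the dbpath for the journal, and start mongod with:

    mongod --dbpath /data/mongodb --directoryperdb --journal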

It’s worth noting that filesystem based snapshots where MongoDB is still running are no longer possible if you move the journal to a different disk (and so different filesystem). You would instead need to shut down MongoDB (to prevent further writes) then take the snapshot from all volumes.

3. Use hash-based sharding for uniform distribution

Every item we monitor (e.g., a server) has a unique MongoID and we use this as the shard key for storing the metrics data.

The query index is on the item ID (e.g. the server ID), the metric type (e.g. load average) and the time range; because every query always includes the item ID, it makes a good shard key. That said, it is important to ensure that there aren’t large numbers of documents under a single item ID because this can lead to jumbo chunks which cannot be migrated. Jumbo chunks arise from failed splits where they’re already over the chunk size but cannot be split any further.

To ensure that the shard chunks are always evenly distributed, we’re using the hashed shard key functionality in MongoDB 2.4. Hashed shard keys are often a good choice for ensuring uniform distribution, but if you end up not using the hashed field in your queries, you could actually hurt performance because then a non-targeted scatter/gather query has to be used.
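
In the shell, enabling a hashed shard key is straightforward; the database, collection and field names below are invented for illustration:

    sh.enableSharding("metrics")
    sh.shardCollection("metrics.values", { itemId: "hashed" })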

4. Let MongoDB delete data with TTL indexes

The majority of our users are only interested in the highest resolution data for a short period and more general trends over longer periods, so over time we average the time series data we collect then delete the original values. We actually insert the data twice - once as the actual value and once as part of a sum/count to allow us to calculate the average when we pull the data out later. Depending on the query time range we either read the average or the true values - if the query range is too long then we risk returning too many data points to be plotted. This method also avoids any batch processing so we can provide all the data in real time rather than waiting for a calculation to catch up at some point in the future.

Removal of the data after a period of time is done by using a TTL index. This is set based on surveying our customers to understand how long they want the high resolution data for. Using the TTL index to delete the data is much more efficient than doing our own batch removes and means we can rely on MongoDB to purge the data at the right time.
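
Creating such an index is a one-liner in the shell. Here is a hypothetical example that expires documents eight days after their date-typed ts field:

    // the TTL monitor removes documents once 'ts' is older than expireAfterSeconds
    db.metrics.ensureIndex({ ts: 1 }, { expireAfterSeconds: 60 * 60 * 24 * 8 })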

Inserting and deleting a lot of data can have implications for data fragmentation, but using a TTL index helps because it automatically activates PowerOf2Sizes for the collection, making disk usage more efficient. As of MongoDB 2.6, this storage option becomes the default.

5. Take care over query and schema design

The biggest hit on performance I have seen is when documents grow, particularly when you are doing huge numbers of updates. If the document size increases after it has been written then the entire document has to be read and rewritten to another part of the data file with the indexes updated to point to the new location, which takes significantly more time than simply updating the existing document.

As such, it’s important to design your schema and queries to avoid this, and to use the right modifiers to minimise what has to be transmitted over the network and then applied as an update to the document. A good example of what you shouldn’t do when updating documents is to read the document into your application, update it, then write it back to the database. Instead, use the appropriate update operators - such as $set, $unset and $inc - to modify documents directly.
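
To make the contrast concrete, here is a hypothetical shell sketch; the collection, field names and serverId variable are placeholders:

    // avoid: read-modify-write sends the whole (possibly grown) document back
    var doc = db.servers.findOne({ _id: serverId });
    doc.checkCount += 1;
    db.servers.save(doc);

    // better: transmit only the change, applied in place by the server
    db.servers.update({ _id: serverId }, { $inc: { checkCount: 1 } });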

This also means paying attention to the BSON data types and pre-allocating documents, things I wrote about in MongoDB schema design pitfalls.

6. Consider network throughput & number of packets

Assuming 100Mbps networking is sufficient is likely to cause you problems, perhaps not during normal operations, but probably when you have some unusual event like needing to resync a secondary replica set member.

When cloning the database, MongoDB is going to use as much network capacity as it can to transfer the data over as quickly as possible before the oplog rolls over. If you’re doing 50-60Mbps of normal network traffic, there isn’t much spare capacity on a 100Mbps connection so that resync is going to be held up by hitting the throughput limits.

Also keep an eye on the number of packets being transmitted over the network - it’s not just the raw throughput that is important. A huge number of packets can overwhelm low quality network equipment - a problem we saw several years ago at our previous hosting provider. This will show up as packet loss and be very difficult to diagnose.

Conclusions

Scaling is an incremental process - there’s rarely one thing that will give you a big win. All of these tweaks and optimisations together help us to perform thousands of write operations per second and get response times within 10ms whilst using a write concern of 1.

Ultimately, all this ensures that our customers can load the graphs they want incredibly quickly. Behind the scenes we know that data is being written quickly, safely and that we can scale it as we continue to grow.

Bug Hunt Winners for MongoDB 2.6-rc0

Mar 13 • Posted 6 months ago

We would like to thank everyone who participated in MongoDB’s inaugural Bug Hunt. Your efforts have helped to improve the 2.6 release.

The MongoDB Bug Hunt for 2.6-rc0 uncovered a number of bugs. From this pool, we’ve selected 5 winning Bug Reporters based on three criteria: user impact, severity and prevalence.

The Winners

First Prize: Roman Kuzmin, SERVER-12846
  • 1 ticket to MongoDB World — with a reserved front-row seat for keynote sessions
  • $1000 Amazon Gift Card
  • MongoDB Contributor T-shirt
Second Prize: Mark Callaghan, SERVER-13060 & SERVER-13041
  • 1 ticket to MongoDB World — with a reserved front-row seat for keynote sessions
  • $500 Donation to Khan Academy (at Mark’s request)
  • MongoDB Contributor T-shirt
Honorable Mentions: Tim Callaghan (SERVER-12878), David Glasser (SERVER-12981) and Jeff Lee (SERVER-12925)
  • 1 ticket to MongoDB World — with a reserved front-row seat for keynote sessions
  • $250 Amazon Gift Card
  • MongoDB Contributor T-shirt

Congratulations to the five winners of the first MongoDB Bug Hunt and thanks to everyone who downloaded, tested and gave feedback on the release candidates.

-Eliot, Dan, Alvin and the MongoDB team

Announcing the MongoDB Bug Hunt 2.6.0-rc0 

Feb 21 • Posted 7 months ago

The MongoDB team released MongoDB 2.6.0-rc0 today and is proud to announce the MongoDB Bug Hunt. The MongoDB Bug Hunt is a new initiative to reward our community members who contribute to improving this MongoDB release. We’ve put the release through rigorous correctness, performance and usability testing. Now it’s your turn. Over the next 10 days, we challenge you to test and uncover any lingering issues in MongoDB 2.6.0-rc0.

How it works

You can download this release at MongoDB.org/downloads. If you find a bug, submit the issue to Jira (Core Server project) by March 4 at 12:00AM GMT. Bug reports will be judged on three criteria: user impact, severity and prevalence.

We will review all bugs submitted against 2.6.0-rc0. Winners will be announced on the MongoDB blog and user forum by March 8. There will be one first place winner, one second place winner and at least two honorable mentions.

The Rewards
First Prize:
  • 1 ticket to MongoDB World — with a reserved front-row seat for keynote sessions
  • $1000 Amazon Gift Card
  • MongoDB Contributor T-shirt
Second Prize:
  • 1 ticket to MongoDB World — with a reserved front-row seat for keynote sessions
  • $500 Amazon Gift Card
  • MongoDB Contributor T-shirt
Honorable Mentions:
  • 1 ticket to MongoDB World — with a reserved front-row seat for keynote sessions
  • $250 Amazon Gift Card
  • MongoDB Contributor T-shirt

How to get started:

  • Deploy in your test environment: Software is best tested in a realistic environment. Help us see how 2.6 fares with your code and data so that others can build and run applications on MongoDB 2.6 successfully.
  • Test new features and improvements: Dozens of new features were added in 2.6. See the 2.6 Release Notes for a full list.
  • Log a ticket: If you find an issue, create a report in Jira. See the documentation for a guide to submitting well written bug reports.

If you are interested in doing this work full time, consider applying to join our engineering teams in New York City, Palo Alto and Austin, Texas.

Happy hunting!

Eliot, Dan and the MongoDB Team

Crittercism: Scaling To Billions Of Requests Per Day On MongoDB

Feb 20 • Posted 7 months ago

This is a guest post by Mike Chesnut, Director of Operations Engineering at Crittercism. This June, Mike will present at MongoDB World on how Crittercism scaled to 30,000 requests/second (and beyond) on MongoDB.

MongoDB is capable of scaling to meet your business needs — that is why its name is based on the word humongous. This doesn’t mean there aren’t some growing pains you’ll encounter along the way, of course. At Crittercism, we’ve had a huge amount of growth over the past 2 years and have hit some bumps in the road along the way, but we’ve also learned some important lessons that can hopefully be of use to others.

Background

Crittercism provides the world’s first and leading Mobile Application Performance Management (mAPM) solution. Our SDK is embedded in tens of thousands of applications, and used by nearly a billion users worldwide. We collect performance data such as error reporting, crash diagnostics details, network breadcrumbs, device/carrier/OS statistics, and user behavior. This data is largely unstructured and varies greatly among different applications, versions, devices, and usage patterns.

Storing all of this data in MongoDB allows us to collect raw information that may be of use in a variety of ways to our customers, while also providing the analytics necessary to summarize the data down to digestible, actionable metrics.

As our request volume has grown, so too has our data footprint; over the course of 18 months our daily request volume increased over 40x.

Our primary MongoDB cluster now houses over 20TB of data, and getting to this point has helped us learn a few things along the way.

Routing

The MongoDB documentation suggests that the most common topology is to include a router — a mongos process — on each client system. We started doing this and it worked well for quite a while.

As the number of front-end application servers in production grew from the order of 10s to several 100s, we found that we were creating heavy load via hundreds (or sometimes thousands) of connections between our mongos routers and our mongod shard servers. This meant that whenever chunk balancing occurred — something that is an integral part of maintaining a well-balanced, sharded MongoDB cluster — the chunk location information that is stored in the config database took a long time to propagate. This is because every mongos router needs to have a full picture of where in the cluster all of the chunks reside.

So what did we learn? We found that we could alleviate this issue by consolidating mongos routers onto a few hosts. Our production infrastructure is in AWS, so we deployed 2 mongos servers per availability zone. This gave us redundancy per AZ, as well as offered the shortest network path possible from the clients to the mongos routers. We were concerned about adding an additional hop to the request path, but using Chef to configure all of our clients to only talk to the mongos routers in their own AZ helped minimize this issue.

Making this topology change greatly reduced the number of open connections to our mongod shards, which we were able to measure using MMS, without a noticeable reduction in application performance. At the same time, there were several improvements to MongoDB that made both the mongos updates and the internal consistency checks more efficient in general. Combined with the new infrastructure this meant that we could now balance chunks throughout our cluster without causing performance problems while doing so.

Shard Replacement

Another scenario we’ve encountered is the need to dynamically replace mongod servers in order to migrate to larger shards. Again following the recommended best deployment practice, we deploy MongoDB onto server instances utilizing large RAID10 arrays running xfs. We use m2.4xlarge instances in AWS with 16 disks. We’ve used basic Linux mdadm for performance, but at the expense of flexibility in disk configuration. As a result when we are ready to allocate more capacity to our shards, we need to perform a migration procedure that can sometimes take several days. This not only means that we need to plan ahead appropriately, but also that we need to be aware of the full process in order to monitor it and react when things go wrong.

We start with a replica set where all replicas are at approximately the same disk utilization. We first create a new server instance with more disk allocated to it, and add it to this replica set with rs.add().

The new replica will enter the STARTUP2 state and remain there for a long time (in our case, usually 2-3 days) while it first clones data, then catches up via oplog replication and builds indexes. During this time, index builds will often stop the replication process (note that this behavior is set to change in MongoDB 2.6), and so the replication lag time does not strictly decrease the whole time — it will steadily decrease for a while, then an index build will cause it to pause and start to fall further behind. Once the index build completes the replication will resume. It’s worth noting that while index builds occur, mongostat and anything else that requires a read lock will be blocked as well.

Eventually the replica will enter the SECONDARY state and will be fully functional. At this point we can rs.stepDown() one of the old replicas, shut down the mongod process running on it, and then remove it from the replica set via rs.remove(), making the server ready for retirement.
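
In shell terms, the procedure for each old member looks roughly like this (hostnames are placeholders):

    rs.add("new-member.example.com:27017")    // new instance with larger disks
    // wait until rs.status() shows the new member in the SECONDARY state, then
    rs.stepDown()                             // only needed if the member being retired is the current primary
    // shut down the mongod process on the old member, then
    rs.remove("old-member.example.com:27017")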

We then repeat the process for each member of the replica set, until all have been replaced with the new instances with larger disks.

While this process can be time-consuming and somewhat tedious, it allows for a graceful way to grow our database footprint without any customer-facing impact.

Conclusion

Operating MongoDB at scale — as with any other technology — requires some knowledge that you can gain from documentation, and some that you have to learn from experience. By being willing to experiment with different strategies such as those shown above, you can discover flexibility that may not have been previously obvious. Consolidating the mongos router tier was a big win for Crittercism’s Ops team in terms of both performance and manageability, and developing the above described migration procedure has enabled us to continue to grow to meet our needs without affecting our service or our customers.

See how Crittercism, Stripe, Carfax, Intuit, Parse and Sailthru are building the next generation of applications at MongoDB World. Register now and join the MongoDB Community in New York City this June.