maxTimeMS() and Query Optimizer Introspection in MongoDB 2.6

Apr 23 • Posted 3 months ago

By Alex Komyagin, Technical Services Engineer in MongoDB’s New York Office

In this series of posts we cover some of the enhancements in MongoDB 2.6 and how they will be valuable for users. You can find a more comprehensive list of changes in our 2.6 release notes. These are changes I believe you will find helpful, especially if you are an advanced user.

This post introduces two new features: maxTimeMS and query optimizer introspection. We will also look at a specific support case where these features would have been helpful.

maxTimeMS: SERVER-2212

The socket timeout option (e.g. MongoOptions.socketTimeout in the Java driver) specifies how long to wait for responses from the server. This timeout governs all types of requests (queries, writes, commands, authentication, etc.). The default value is “no timeout” but users tend to set the socket timeout to a relatively small value with the intention of limiting how long the database will service a single request. It is thus surprising to many users that while a socket timeout will close the connection, it does not affect the underlying database operation. The server will continue to process the request and consume resources even after the connection has closed.

Many applications are configured to retry queries when the driver reports a failed connection. Since a query whose connection is killed by a socket timeout continues to consume resources on the server, retries can give rise to a cascading effect: while the original queries are still running in MongoDB, the application may retry the same resource-intensive query multiple times, increasing the load on the system and leading to severe performance degradation. Before 2.6 the only way to cancel these queries was to issue db.killOp(), which requires manual intervention.

MongoDB 2.6 introduces a new query parameter, maxTimeMS, which limits how long an operation can run in the database. When the time limit is reached, MongoDB automatically cancels the query. maxTimeMS can be used to efficiently prevent unexpectedly slow queries from overloading MongoDB. Following the same logic, maxTimeMS can also be used to identify slow or unoptimized queries.

There are a few important notes about using maxTimeMS:

  • maxTimeMS can be set on a per-operation basis, so long-running aggregation jobs can have a different setting than simple CRUD operations;
  • requests can block behind locking operations on the server, and that blocking time is not counted: the timer starts ticking only once the operation actually starts, after it initially acquires the appropriate lock;
  • operations are interrupted only at interrupt points, where an operation can be safely aborted, so the total execution time may exceed the specified maxTimeMS value;
  • maxTimeMS can be applied to both CRUD operations and commands, but not all commands are interruptible;
  • if a query opens a cursor, subsequent getmore requests will be included in the total time (but network roundtrips will not be counted);
  • maxTimeMS does not override the inactive cursor timeout for idle cursors (the default is 10 minutes).
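
To illustrate, here is a minimal mongo shell sketch; the collection name and query predicate are hypothetical:

// fail the operation if it runs in the database for more than 5 seconds
db.comments.find({ user_id: 12345 }).maxTimeMS(5000)

// commands accept maxTimeMS as a top-level field
db.runCommand({ count: "comments", query: { user_id: 12345 }, maxTimeMS: 5000 })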

All MongoDB drivers support maxTimeMS in their releases for MongoDB 2.6. For example, the Java driver will start supporting it in 2.12.0 and 3.0.0, and the Python driver in version 2.7.

To illustrate how maxTimeMS can come in handy, let’s talk about a recent case that the MongoDB support team was working on. The case involved performance degradation of a MongoDB cluster powering a popular site with millions of users and heavy traffic. The degradation was caused by a huge number of read queries performing full scans of the collection. The collection in question stores user comment data, so it has billions of records. Some of these collection scans were running for 15 minutes or so, and they were slowing down the whole system.

Normally, these queries would use an index, but in this case the query optimizer was choosing an unindexed plan. While the solution to this kind of problem is to hint the appropriate index, maxTimeMS can be used to prevent unintentional runaway queries by controlling the maximum execution time. Queries that exceed the maxTimeMS threshold will error out on the application side (in the Java driver, as a MongoExecutionTimeoutException). maxTimeMS will help users prevent unexpected performance degradation, and gain better visibility into the operation of their systems.

In the next section we’ll take a look at another feature which would help in diagnosing the case we just discussed.

Query Optimizer Introspection: SERVER-666

To troubleshoot the support case described above and to understand why the query optimizer was not picking up the correct index for queries, it would have been very helpful to know a few things, such as whether the query plan had changed and to see the query plan cache.

MongoDB determines the best execution plan for a query and stores this plan in a cache for reuse. The cached plan is refreshed periodically and under certain operational conditions (which are discussed in detail below). Prior to 2.6 this caching behavior was opaque to the client. Version 2.6 provides a new query subsystem and query plan cache. Now users have visibility and control over the plan cache using a set of new methods for query plan introspection.

Each collection contains at most one plan cache. The cache is cleared every time a change is made to the indexes for the collection. To determine the best plan, multiple candidate plans are executed in parallel and the winner is selected based on the number of results retrieved within a fixed number of steps, where each step is basically one “advance” of the query cursor. When a query has successfully passed the planning process, it is added to the cache along with the related index information.

A very interesting feature of the new query execution framework is the plan cache feedback mechanism. MongoDB stores runtime statistics for each cached plan in order to remove plans that are determined to be the cause of performance degradation. In practice, we don’t see these degradations often, but if they happen it is usually a consequence of a change in the composition of the data. For example, with new records being inserted, an indexed field may become less selective, leading to slower index performance. These degradations are extremely hard to manually diagnose, and we expect the feedback mechanism to automatically handle this change if a better alternative index is present.

The following events will result in a cached plan removal:

  • Performance degradation
  • Index add/drop
  • No more space in cache (the total number of plans stored in the collection cache is currently limited to 200; this is subject to change in future releases)
  • Explicit commands to mutate cache
  • After the number of write operations on a collection exceeds the built-in limit, the whole collection plan cache is dropped (data distribution has changed)

MongoDB 2.6 supports a set of commands to view and manipulate the cache: list all known query shapes (planCacheListQueryShapes), display cached plans for a query shape (planCacheListPlans), and remove a query shape from the cache or empty the whole cache (planCacheClear).

Here is an example invocation of the planCacheListQueryShapes command, which lists the shape of the query

db.test.find({first_name: "john", last_name: "galt"},
             {_id: 0, first_name: 1, last_name: 1}).sort({age: 1}):

> db.runCommand({planCacheListQueryShapes: "test"})
{
    "shapes" : [
        {
            "query" : {
                "first_name" : "alex",
                "last_name" : "komyagin"
            },
            "sort" : {
                "age" : 1
            },
            "projection" : {
                "_id" : 0,
                "first_name" : 1,
                "last_name" : 1
            }
        }
    ],
    "ok" : 1
}

The exact values in the query predicate are insignificant in determining the query shape. For example, the query predicate {first_name: "john", last_name: "galt"} is equivalent to the query predicate {first_name: "alex", last_name: "komyagin"}.
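
You can then pass the same shape to planCacheListPlans to see its cached plans, or remove shapes with planCacheClear; a minimal sketch:

db.runCommand({
    planCacheListPlans: "test",
    query: { first_name: "alex", last_name: "komyagin" },
    sort: { age: 1 },
    projection: { _id: 0, first_name: 1, last_name: 1 }
})

db.runCommand({ planCacheClear: "test" })   // empties the whole plan cache for the collection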

Additionally, with log level set to 1 or greater MongoDB will log plan cache changes caused by the events listed above. You can set the log level using the following command:

use admin
db.runCommand( { setParameter: 1, logLevel: 1 } )

Please don’t forget to change it back to 0 afterwards, since log level 1 logs all query operations and it can negatively affect the performance of your system.

Together, the new plan cache and the new query optimizer should make related operations more transparent and help users to have the visibility and control necessary for maintaining predictable response times for their queries.

You can see for yourself how these new features work by trying our latest release, MongoDB 2.6, available for download here. I hope you will find these features helpful. We look forward to hearing your feedback; please post your questions in the mongodb-user Google group.


MongoDB 2.6: Our Biggest Release Ever

Apr 8 • Posted 3 months ago

By Eliot Horowitz, CTO and Co-founder, MongoDB


In the five years since the initial release of MongoDB, and after hundreds of thousands of deployments, we have learned a lot. The time has come to take everything we have learned and create a basis for continued innovation over the next ten years.

Today I’m pleased to announce that, with the release of MongoDB 2.6, we have achieved that goal. With comprehensive core server enhancements, a groundbreaking new automation tool, and critical enterprise features, MongoDB 2.6 is by far our biggest release ever.

You’ll see the benefits in better performance and new innovations. We re-wrote the entire query execution engine to improve scalability, and took our first step in building a sophisticated query planner by introducing index intersection. We’ve made the codebase easier to maintain, and made it easier to implement new features. Finally, MongoDB 2.6 lays the foundation for massive improvements to concurrency in MongoDB 2.8, including document-level locking.

From the very beginning, MongoDB has offered developers a simple and elegant way to manage their data. Now we’re bringing that same simplicity and elegance to managing MongoDB. MongoDB Management Service (MMS), which already provides 35,000 MongoDB customers with monitoring and alerting, now provides backup and point-in-time restore functionality, in the cloud and on-premises.

We are also announcing a game-changing feature coming later this year: automation, also with hosted and on-premises options. Automation will allow you to provision and manage MongoDB replica sets and sharded clusters via a simple yet sophisticated interface.

MongoDB 2.6 brings security, integration and analytics enhancements to ease deployment in enterprise environments. LDAP, x.509 and Kerberos authentication are critical enhancements for organizations that require a single authentication mechanism across their entire infrastructure. To enhance security, MongoDB 2.6 implements TLS encryption, user-defined roles, auditing and field-level redaction, a critical building block for trusted systems. IBM Guardium also now offers integration with MongoDB, providing more extensive auditing abilities.

These are only a few of the key improvements; read the full official release notes for more details.

MongoDB 2.6 was a major endeavor and bringing it to fruition required hard work and coordination across a rapidly growing team. Over the past few years we have built and invested in that team, and I can proudly say we have the experience, drive and determination to deliver on this and future releases. There is much still to be done, and with MongoDB 2.6, we have a foundation for the next decade of database innovation.


MongoDB Innovation Awards: Call for Nominations

Apr 2 • Posted 3 months ago

Fortune 500 enterprises, startups, hospitals, governments and organizations of all kinds use MongoDB because it is the best database for modern applications. The MongoDB Innovation Awards recognize organizations and individuals who are changing the world with MongoDB.

Nominations are open to any individual representing an organization that runs MongoDB, including partner solutions built on or for MongoDB.

The deadline to submit is midnight eastern time on May 30. A panel of judges will review nominations. Winners will be announced at MongoDB World on June 23-25.

Each winner will receive:

  • MongoDB Innovation Award
  • Pass to MongoDB World
  • Invitation to award winners reception at MongoDB World
  • Inclusion in Innovation Awards press release, blog post and email newsletter
  • $2,500 Amazon Gift Card

Submit a nomination by completing this nomination form — you’ll receive a discount code for MongoDB World.

Running MongoDB Queries Concurrently With Go

Mar 24 • Posted 4 months ago

This is a guest post by William Kennedy, managing partner at Ardan Studios in Miami, FL, a mobile and web app development company. Bill is also the author of the blog GoingGo.Net and the organizer for the Go-Miami and Miami MongoDB meetups in Miami. Bill looked for a new language in 2013 that would allow him to develop back end systems in Linux and found Go. He has never looked back.

If you are attending GopherCon 2014 or plan to watch the videos once they are released, this article will prepare you for the talk by Gustavo Niemeyer and Steve Francia. It provides a beginner’s view of using the Go mgo driver against a MongoDB database.

Introduction

MongoDB supports many different programming languages thanks to a great set of drivers. One such driver is the MongoDB Go driver which is called mgo. This driver was developed by Gustavo Niemeyer from Canonical with some assistance from MongoDB Inc. Both Gustavo and Steve Francia, the head of the drivers team, will be talking at GopherCon 2014 in April about “Painless Data Storage With MongoDB and Go”. The talk describes the mgo driver and how MongoDB and Go work well together for building highly scalable and concurrent software.

MongoDB and Go let you build scalable software on many different operating systems and architectures, without the need to install frameworks or runtime environments. Go programs are native binaries and the Go tooling is constantly improving to create binaries that run as fast as equivalent C programs. That wouldn’t mean anything if writing code in Go were as complicated and tedious as writing programs in C. This is where Go really shines, because once you get up to speed, writing programs in Go is fast and fun.

In this post I am going to show you how to write a Go program using the mgo driver to connect and run queries concurrently against a MongoDB database. I will break down the sample code and explain a few things that can be a bit confusing to those new to using MongoDB and Go together.

Sample Program

The sample program connects to a public MongoDB database I have hosted with MongoLab. If you have Go and Bazaar installed on your machine, you can run the program against my database. The program is very simple: it launches ten goroutines that individually query all the records from the buoy_stations collection inside the goinggo database. The records are unmarshaled into native Go types and each goroutine logs the number of documents returned.

Let’s break the program down piece by piece, starting with the type structures that are defined in the beginning:
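
Here is a sketch of those structures; the import paths are the Bazaar-hosted ones of the era, and the field names and bson tags are illustrative stand-ins for the actual buoy_stations fields (the imports also cover the snippets that follow):

package main

import (
    "log"
    "sync"
    "time"

    "labix.org/v2/mgo"      // the mgo driver (later hosted at gopkg.in/mgo.v2)
    "labix.org/v2/mgo/bson"
)

// BuoyCondition holds the readings for an individual station.
type BuoyCondition struct {
    WindSpeed     float64 `bson:"wind_speed_milehour"`
    WindDirection int     `bson:"wind_direction_degnorth"`
    WindGust      float64 `bson:"gust_wind_speed_milehour"`
}

// BuoyLocation holds the buoy's geographic location.
type BuoyLocation struct {
    Type        string    `bson:"type"`
    Coordinates []float64 `bson:"coordinates"`
}

// BuoyStation is the main document; the other two types are embedded.
type BuoyStation struct {
    ID        bson.ObjectId `bson:"_id,omitempty"`
    Name      string        `bson:"name"`
    Condition BuoyCondition `bson:"condition"`
    Location  BuoyLocation  `bson:"location"`
}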

The structures represent the data that we are going to retrieve and unmarshal from our query. BuoyStation represents the main document and BuoyCondition and BuoyLocation are embedded documents. The mgo driver makes it easy to use native types that represent the documents stored in our collections by using tags. With the tags, we can control how the mgo driver unmarshals the returned documents into our native Go structures.

Now let’s look at how we connect to a MongoDB database using mgo:
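
A sketch of the connection code; the host and credentials are placeholders, not the real MongoLab ones:

// Describe how to connect, including authentication credentials.
mongoDialInfo := &mgo.DialInfo{
    Addrs:    []string{"example.mongolab.com:27017"}, // a single host; mgo discovers other replica set members
    Timeout:  60 * time.Second,
    Database: "goinggo",
    Username: "guest",
    Password: "welcome",
}

// Create a session, which maintains a pool of connections to MongoDB.
mongoSession, err := mgo.DialWithInfo(mongoDialInfo)
if err != nil {
    log.Fatalf("CreateSession: %s\n", err)
}

// Reads hit secondaries until the first write, then stick to the primary.
mongoSession.SetMode(mgo.Monotonic, true)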

We start with creating a mgo.DialInfo object. Connecting to a replica set can be accomplished by providing multiple addresses in the Addrs field or with a single address. If we are using a single host address to connect to a replica set, the mgo driver will learn about any remaining hosts from the replica set member we connect to. In our case we are connecting to a single host.

After providing the host, we specify the database, username and password we need for authentication. One thing to note is that the database we authenticate against may not necessarily be the database our application needs to access. Some applications authenticate against the admin database and then use other databases depending on their configuration. The mgo driver supports these types of configurations very well.

Next we use the mgo.DialWithInfo method to create a mgo.Session object. Each session specifies a consistency mode (such as Strong or Monotonic) and other settings such as write concern and read preference. The mgo.Session object maintains a pool of connections to MongoDB. We can create multiple sessions with different modes and settings to support different aspects of our applications.

The next line of code sets the mode for the session. There are three modes that can be set, Strong, Monotonic and Eventual. Each mode sets a specific consistency for how reads and writes are performed. For more information on the differences between each mode, check out the documentation for the mgo driver.

We are using Monotonic mode which provides reads that may not entirely be up to date, but the reads will always see the history of changes moving forward. In this mode reads occur against secondary members of our replica sets until a write happens. Once a write happens, the primary member is used. The benefit is some distribution of the reading load can take place against the secondaries when possible.

With the session set and ready to go, next we execute multiple queries concurrently:
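
A sketch of that section:

// Launch ten goroutines and wait for all of them to finish.
var waitGroup sync.WaitGroup
waitGroup.Add(10)

for query := 0; query < 10; query++ {
    go RunQuery(query, mongoSession, &waitGroup)
}

// Block until every goroutine has called Done.
waitGroup.Wait()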

This code is classic Go concurrency in action. First we create a sync.WaitGroup object so we can keep track of all the goroutines we are going to launch as they complete their work. Then we immediately set the count of the sync.WaitGroup object to ten and use a for loop to launch ten goroutines using the RunQuery function. The keyword go is used to launch a function or method to run concurrently. The final line of code calls the Wait method on the sync.WaitGroup object, which blocks the main goroutine until everything is done processing.

To learn more about Go concurrency and better understand how this particular piece of code works, check out these posts on concurrency and channels.

Now let’s look at the RunQuery function and see how to properly use the mgo.Session object to acquire a connection and execute a query:
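
A sketch of RunQuery, using the database and collection named earlier:

func RunQuery(query int, mongoSession *mgo.Session, waitGroup *sync.WaitGroup) {
    // Decrement the WaitGroup count once the function returns.
    defer waitGroup.Done()

    // Copy the session so this goroutine gets its own socket from the pool.
    sessionCopy := mongoSession.Copy()
    defer sessionCopy.Close()

    // Get a handle to the collection and retrieve every document.
    collection := sessionCopy.DB("goinggo").C("buoy_stations")

    var buoyStations []BuoyStation
    if err := collection.Find(nil).All(&buoyStations); err != nil {
        log.Printf("RunQuery : ERROR : %s\n", err)
        return
    }

    log.Printf("RunQuery : %d : Count[%d]\n", query, len(buoyStations))
}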

The very first thing we do inside of the RunQuery function is to defer the execution of the Done method on the sync.WaitGroup object. The defer keyword will postpone the execution of the Done method, to take place once the RunQuery function returns. This will guarantee that the sync.WaitGroup object’s count will decrement even if an unhandled exception occurs.

Next we make a copy of the session we created in the main goroutine. Each goroutine needs to create a copy of the session so they each obtain their own socket without serializing their calls with the other goroutines. Again, we use the defer keyword to postpone and guarantee the execution of the Close method on the session once the RunQuery function returns. Closing the session returns the socket back to the main pool, so this is very important.

To execute a query we need a mgo.Collection object. We can get a mgo.Collection object through the mgo.Session object by specifying the name of the database and then the collection. Using the mgo.Collection object, we can perform a Find and retrieve all the documents from the collection. The All function will unmarshal the response into our slice of BuoyStation objects. A slice is a dynamic array in Go. Be aware that the All method will load all the data in memory at once. For large collections it is better to use the Iter method instead. Finally, we just log the number of BuoyStation objects that are returned.

Conclusion

The example shows how to use Go concurrency to launch multiple goroutines that can execute queries against a MongoDB database independently. Once a session is established, the mgo driver exposes all of the MongoDB functionality and handles the unmarshaling of BSON documents into Go native types.

MongoDB can handle a large number of concurrent requests when you architect your MongoDB databases and collections with concurrency in mind. Go and the mgo driver are perfectly aligned to push MongoDB to its limits and build software that can take advantage of all the computing power that is available.

The mgo driver provides a safe way to leverage Go’s concurrency support and you have the flexibility to execute queries concurrently and in parallel. It is best to take the time to learn a bit about MongoDB replica sets and load balancer configuration. Then make sure the load balancer is behaving as expected under the different types of load your application can produce.

Now is a great time to see what MongoDB and Go can do for your software applications, web services and service platforms. Both technologies are being battle tested everyday by all types of companies, solving all types of business and computing problems.

Processing 2 Billion Documents A Day And 30TB A Month With MongoDB

Mar 14 • Posted 4 months ago

This is a guest post by David Mytton. He has been programming Python for over 10 years and founded his website and server monitoring company, Server Density, back in 2009.

Server Density processes over 30TB/month of incoming data points from the servers and web checks we monitor for our customers, ranging from simple Linux system load average to website response times from 18 different countries. All of this data goes into MongoDB in real time and is pulled out when customers need to view graphs, update dashboards and generate reports.

We’ve been using MongoDB in production since mid-2009 and have learned a lot over the years about scaling the database. We run multiple MongoDB clusters but the one storing the historical data does the most throughput and is the one I shall focus on in this article, going through some of the things we’ve done to scale it.

1. Use dedicated hardware, and SSDs

All our MongoDB instances run on dedicated servers across two data centers at Softlayer. We’ve had bad experiences with virtualisation because you have no control over the host, and databases need guaranteed performance from disk i/o. When running on shared storage (e.g., a SAN) this is difficult to achieve unless you can get guaranteed throughput from things like AWS’s Provisioned IOPS on EBS (which are backed by SSDs).

MongoDB doesn’t really have many bottlenecks when it comes to CPU because CPU-bound operations are rare (usually things like building indexes), but what really causes problems is CPU steal: other guests on the host competing for the CPU resources.

The way we have combated these problems is to eliminate the possibility of CPU steal and noisy neighbours by moving onto dedicated hardware. And we avoid problems with shared storage by deploying the dbpath onto locally mounted SSDs.

I’ll be speaking in-depth about managing MongoDB deployments in virtualized or dedicated hardware at MongoDB World this June.

2. Use multiple databases to benefit from improved concurrency

Running the dbpath on an SSD is a good first step but you can get better performance by splitting your data across multiple databases, and putting each database on a separate SSD with the journal on another.

Locking in MongoDB is managed at the database level, so moving collections into their own databases helps spread things out; this is most important for scaling writes when you are also trying to read data. If you keep databases on the same disk you’ll start hitting the throughput limitations of the disk itself. This is improved by putting each database on its own SSD using the directoryperdb option. SSDs help by significantly alleviating i/o latency, which is related to the number of IOPS and the latency for each operation, particularly when doing random reads/writes. This is even more visible in Windows environments, where the memory mapped data files are flushed serially and synchronously. Again, SSDs help with this.

The journal is always within a directory so you can mount this onto its own SSD as a first step. All writes go via the journal and are later flushed to disk so if your write concern is configured to return when the write is successfully written to the journal, making those writes faster by using an SSD will improve query times. Even so, enabling the directoryperdb option gives you the flexibility to optimise for different goals (e.g., put some databases on SSDs and some on other types of disk, or EBS PIOPS volumes, if you want to save cost).
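
As an illustration, one possible layout under these assumptions (the paths and database names are hypothetical):

# one subdirectory per database under the dbpath
mongod --dbpath /data/mongodb --directoryperdb --journal

# each piece then mounts on its own device, for example:
#   /data/mongodb/journal  -> dedicated SSD (the journal always lives in its own directory)
#   /data/mongodb/app      -> SSD 1
#   /data/mongodb/metrics  -> SSD 2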

It’s worth noting that filesystem based snapshots where MongoDB is still running are no longer possible if you move the journal to a different disk (and so different filesystem). You would instead need to shut down MongoDB (to prevent further writes) then take the snapshot from all volumes.

3. Use hash-based sharding for uniform distribution

Every item we monitor (e.g., a server) has a unique MongoID and we use this as the shard key for storing the metrics data.

The query index is on the item ID (e.g. the server ID), the metric type (e.g. load average) and the time range; but because every query always includes the item ID, that field makes a good shard key. That said, it is important to ensure that there aren’t large numbers of documents under a single item ID, because this can lead to jumbo chunks which cannot be migrated. Jumbo chunks arise from failed splits where they’re already over the chunk size but cannot be split any further.

To ensure that the shard chunks are always evenly distributed, we’re using the hashed shard key functionality in MongoDB 2.4. Hashed shard keys are often a good choice for ensuring uniform distribution, but if you end up not using the hashed field in your queries, you could actually hurt performance because then a non-targeted scatter/gather query has to be used.
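
In the mongo shell, creating such a key is a one-liner; the namespace and field name below are hypothetical:

// assumes sharding is already enabled on the database, e.g. sh.enableSharding("metrics")
sh.shardCollection("metrics.values", { itemId: "hashed" })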

4. Let MongoDB delete data with TTL indexes

The majority of our users are only interested in the highest resolution data for a short period and more general trends over longer periods, so over time we average the time series data we collect then delete the original values. We actually insert the data twice - once as the actual value and once as part of a sum/count to allow us to calculate the average when we pull the data out later. Depending on the query time range we either read the average or the true values - if the query range is too long then we risk returning too many data points to be plotted. This method also avoids any batch processing so we can provide all the data in real time rather than waiting for a calculation to catch up at some point in the future.

Removal of the data after a period of time is done by using a TTL index. This is set based on surveying our customers to understand how long they want the high resolution data for. Using the TTL index to delete the data is much more efficient than doing our own batch removes and means we can rely on MongoDB to purge the data at the right time.
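
A TTL index is an ordinary single-field index with an expireAfterSeconds option; a sketch with a hypothetical collection and a 30-day retention period:

// documents are removed roughly 30 days after the time stored in createdAt
db.metrics.ensureIndex({ createdAt: 1 }, { expireAfterSeconds: 60 * 60 * 24 * 30 })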

Inserting and deleting a lot of data can have implications for data fragmentation, but using a TTL index helps because it automatically activates PowerOf2Sizes for the collection, making disk usage more efficient. As of MongoDB 2.6, this storage option becomes the default.

5. Take care over query and schema design

The biggest hit on performance I have seen is when documents grow, particularly when you are doing huge numbers of updates. If the document size increases after it has been written then the entire document has to be read and rewritten to another part of the data file with the indexes updated to point to the new location, which takes significantly more time than simply updating the existing document.

As such, it’s important to design your schema and queries to avoid this, and to use the right modifiers to minimise what has to be transmitted over the network and then applied as an update to the document. A good example of what you shouldn’t do when updating documents is to read the document into your application, update the document, then write it back to the database. Instead, use the appropriate commands - such as set, remove, and increment - to modify documents directly.
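
For example (the collection and fields here are hypothetical), update modifiers such as $set, $inc and $unset apply the change in place rather than rewriting the document in the application:

// itemId stands in for the target document's _id
db.items.update(
    { _id: itemId },
    { $set: { status: "processed" }, $inc: { counter: 1 }, $unset: { tempField: "" } }
)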

This also means paying attention to the BSON data types and pre-allocating documents, things I wrote about in MongoDB schema design pitfalls.

6. Consider network throughput & number of packets

Assuming 100Mbps networking is sufficient is likely to cause you problems, perhaps not during normal operations, but probably when you have some unusual event like needing to resync a secondary replica set member.

When cloning the database, MongoDB is going to use as much network capacity as it can to transfer the data over as quickly as possible before the oplog rolls over. If you’re doing 50-60Mbps of normal network traffic, there isn’t much spare capacity on a 100Mbps connection so that resync is going to be held up by hitting the throughput limits.

Also keep an eye on the number of packets being transmitted over the network - it’s not just the raw throughput that is important. A huge number of packets can overwhelm low quality network equipment - a problem we saw several years ago at our previous hosting provider. This will show up as packet loss and be very difficult to diagnose.

Conclusions

Scaling is an incremental process - there’s rarely one thing that will give you a big win. All of these tweaks and optimisations together help us to perform thousands of write operations per second and get response times within 10ms whilst using a write concern of 1.

Ultimately, all this ensures that our customers can load the graphs they want incredibly quickly. Behind the scenes we know that data is being written quickly, safely and that we can scale it as we continue to grow.

Bug Hunt Winners for MongoDB 2.6-rc0

Mar 13 • Posted 4 months ago

We would like to thank everyone who participated in MongoDB’s inaugural Bug Hunt. Your efforts have helped to improve the 2.6 release.

The MongoDB Bug Hunt for 2.6-rc0 uncovered a number of bugs. From this pool, we’ve selected 5 winning Bug Reporters based on three criteria: user impact, severity and prevalence.

The Winners

First Prize: Roman Kuzmin, SERVER-12846
  • 1 ticket to MongoDB World — with a reserved front-row seat for keynote sessions
  • $1000 Amazon Gift Card
  • MongoDB Contributor T-shirt
Second Prize: Mark Callaghan, SERVER-13060 & SERVER-13041
  • 1 ticket to MongoDB World — with a reserved front-row seat for keynote sessions
  • $500 Donation to Khan Academy (at Mark’s request)
  • MongoDB Contributor T-shirt
Honorable Mentions: Tim Callaghan (SERVER-12878), David Glasser (SERVER-12981) and Jeff Lee (SERVER-12925)
  • 1 ticket to MongoDB World — with a reserved front-row seat for keynote sessions
  • $250 Amazon Gift Card
  • MongoDB Contributor T-shirt

Congratulations to the five winners of the first MongoDB Bug Hunt and thanks to everyone who downloaded, tested and gave feedback on the release candidates.

-Eliot, Dan, Alvin and the MongoDB team

Upcoming Changes to the MongoDB C++ Driver

Mar 3 • Posted 4 months ago

By Andrew Morrow, Kernel Engineer at MongoDB

In coordination with the upcoming MongoDB 2.6 release, the Kernel team is changing how the MongoDB C++ driver is developed, licensed, released, and distributed.

What are we doing?

As of MongoDB 2.6.0-rc1, it will no longer be possible to build the C++ driver from the MongoDB server sources. The files required to build the driver have been forked to a new repository on GitHub, and will be maintained independently of the server sources.

Can I start using it now?

Yes. We are still fixing up some post-fork rough edges though, so please consider the current state of the new repository to be unstable. We should be announcing release candidates soon. Release candidates are usable for testing, but should not be used in production.

Why are we doing this?

By separating the development streams for the server and the C++ driver we will be able to respond to C++ driver issues without being tied to the server release schedule, as well as make changes that were previously difficult to apply in the context of the server repository. We want to make developing and using the C++ driver a better experience for everyone involved.

How will the new driver be versioned?

The new repository will initially offer two release streams: the ‘26compat’ release stream, and the ‘legacy’ release stream.

The ‘26compat’ branch and release stream will contain a minimally altered snapshot of the C++ driver files from the server repository as of MongoDB 2.6.0-rc0. It is intended to be a drop-in replacement for existing users of the C++ driver seeking minimal disruption to their current workflow and code. The build system, physical structure, C++ API, and other aspects of the driver will change minimally in this stream. We will backport selected fixes from the v2.6 server repository branch as needed, but conservatively. Enhancements and improvements will be minimal or non-existent.

The ‘legacy’ branch and release stream will more aggressively diverge from the server sources. While this release stream will retain the overall look and feel of the existing driver, we are likely to make small breaking changes to the physical structure, build system, API, and testing model for the driver. New driver features will only be made in this release stream. Our goal in this release stream is to provide meaningful improvements to the C++ driver, while requiring only modest changes on the part of users.

I see a ‘legacy’ branch in the new repository, but no ‘26compat’ branch. Why?

The initial release candidates for the 26compat release stream will be made from the ‘legacy’ branch. When MongoDB 2.6.0 is released, the 26compat branch will be forked from the legacy branch to track the MongoDB 2.6 point releases. Until that split, the legacy branch is effectively the 26compat branch. After the 26compat branch is formed, the legacy branch will begin to receive improvements and changes that are not compatible with the goals of the 26compat branch.

What license will apply to the C++ drivers in this repository?

Unlike the driver embedded within the server repository, which is a mix of Affero GPL and Apache 2.0 licensed files, all files in the new repository will be Apache 2.0 licensed.

How will existing users of the C++ driver be affected?

Our aim is to minimize disruption, but also to enable more rapid evolution and improvement of the driver. Since these goals are in conflict, there will be a period where there are several valid ways to obtain and build the C++ driver, all yielding slightly different results. We will be updating the documentation to reflect these alternatives in more detail, but as an overview, the initial options are to:

  1. Switch to the new C++ driver repository and use the 26compat release stream. This should offer as close to a drop-in replacement for your existing workflow and application code as can be managed. Your help in testing the release candidates of this release stream will help us ensure that our first production release of this stream is an effective replacement for the older mechanisms of obtaining the driver.
  2. Continue to use the server repository build of the driver. This process is valid for releases of the driver up to and including 2.6.0-rc0, but will not advance beyond that release. If you are using this process and wish to continue with minimal change, we encourage you to move to the 26compat stream.
  3. Continue to use the tarball versions of the driver. This process will remain valid for releases of the driver up to and including 2.5.2, but will not advance beyond that release. If you are using the C++ Client Driver tarball, we strongly encourage you to move to any of the other processes listed above.

Later, once the 26compat stream has been stabilized, there will be an additional option to track the legacy release stream. Our goal is that all C++ driver users will eventually be able to migrate to the legacy release stream.

What about JIRA, SERVER tickets, etc.?

There is a new JIRA project for the C++ driver: https://jira.mongodb.org/browse/CXX. We will be evaluating existing SERVER tickets filed under the “C++ Driver” component or label and migrating or copying them to the new project as appropriate.

New C++ driver issues should be filed in the CXX project.

Why is it called the ‘legacy’ driver or the ‘legacy’ branch?

We are saving the master branch of the repository for something new and exciting a bit further down the road. In the meantime, we want to make the experience with the existing C++ driver better, but it isn’t going to diverge radically from the ‘legacy’ of its origins in the server codebase.

Can I help?

Absolutely. We are very interested in feedback on the plan outlined above, and your feedback will help us solidify the details. We will be releasing an RC of the 26compat release stream sometime in the next week or two, and testing of that release candidate will be much appreciated.

Similarly, we will be releasing a series of unstable releases of the legacy driver leading to a stabilized version in the near future. Experimenting with these releases will give us the best opportunity to improve the driver for the community.

I have other questions…

Please post to the mongodb-user or mongodb-dev mailing lists with any questions about this announcement and the C++ Driver team will be happy to answer.

Thank You,

The C++ Driver Team

MongoDB Bug Hunt Extended to March 8

Feb 28 • Posted 4 months ago

On February 21, we launched the first ever MongoDB Bug Hunt. We have been impressed with the community’s enthusiasm during the first week and have decided to extend the hunt until March 8. This will allow more members of the community to get involved and help improve MongoDB for users worldwide.

As a reminder, you can download the latest release at www.MongoDB.org/downloads. If you find a bug, submit the issue to Jira (Core Server project) by March 8 at 12:00AM GMT. Bug reports will be judged on three criteria: user impact, severity and prevalence.

We will review all bugs submitted against 2.6.0-rc0. Winners will be announced on the MongoDB blog and user forum by March 13. There will be one first place winner, one second place winner and at least two honorable mentions.

For more info on the Bug Hunt see our announcement on the MongoDB Blog.

Thanks to everyone who has downloaded and tested the server so far. Keep on hunting!

Announcing the MongoDB Bug Hunt 2.6.0-rc0 

Feb 21 • Posted 5 months ago

The MongoDB team released MongoDB 2.6.0-rc0 today and is proud to announce the MongoDB Bug Hunt. The MongoDB Bug Hunt is a new initiative to reward our community members who contribute to improving this MongoDB release. We’ve put the release through rigorous correctness, performance and usability testing. Now it’s your turn. Over the next 10 days, we challenge you to test and uncover any lingering issues in MongoDB 2.6.0-rc0.

How it works

You can download this release at MongoDB.org/downloads. If you find a bug, submit the issue to Jira (Core Server project) by March 4 at 12:00AM GMT. Bug reports will be judged on three criteria: user impact, severity and prevalence.

We will review all bugs submitted against 2.6.0-rc0. Winners will be announced on the MongoDB blog and user forum by March 8. There will be one first place winner, one second place winner and at least two honorable mentions.

The Rewards
First Prize:
  • 1 ticket to MongoDB World — with a reserved front-row seat for keynote sessions
  • $1000 Amazon Gift Card
  • MongoDB Contributor T-shirt
Second Prize:
  • 1 ticket to MongoDB World — with a reserved front-row seat for keynote sessions
  • $500 Amazon Gift Card
  • MongoDB Contributor T-shirt
Honorable Mentions:
  • 1 ticket to MongoDB World — with a reserved front-row seat for keynote sessions
  • $250 Amazon Gift Card
  • MongoDB Contributor T-shirt

How to get started:

  • Deploy in your test environment: Software is best tested in a realistic environment. Help us see how 2.6 fares with your code and data so that others can build and run applications on MongoDB 2.6 successfully.
  • Test new features and improvements: Dozens of new features were added in 2.6. See the 2.6 Release Notes for a full list.
  • Log a ticket: If you find an issue, create a report in Jira. See the documentation for a guide to submitting well written bug reports.

If you are interested in doing this work full time, consider applying to join our engineering teams in New York City, Palo Alto and Austin, Texas.

Happy hunting!

Eliot, Dan and the MongoDB Team

Crittercism: Scaling To Billions Of Requests Per Day On MongoDB

Feb 20 • Posted 5 months ago

This is a guest post by Mike Chesnut, Director of Operations Engineering at Crittercism. This June, Mike will present at MongoDB World on how Crittercism scaled to 30,000 requests/second (and beyond) on MongoDB.

MongoDB is capable of scaling to meet your business needs — that is why its name is based on the word humongous. This doesn’t mean there aren’t some growing pains you’ll encounter along the way, of course. At Crittercism, we’ve had a huge amount of growth over the past 2 years and have hit some bumps in the road along the way, but we’ve also learned some important lessons that can hopefully be of use to others.

Background

Crittercism provides the world’s first and leading Mobile Application Performance Management (mAPM) solution. Our SDK is embedded in tens of thousands of applications, and used by nearly a billion users worldwide. We collect performance data such as error reporting, crash diagnostics details, network breadcrumbs, device/carrier/OS statistics, and user behavior. This data is largely unstructured and varies greatly among different applications, versions, devices, and usage patterns.

Storing all of this data in MongoDB allows us to collect raw information that may be of use in a variety of ways to our customers, while also providing the analytics necessary to summarize the data down to digestible, actionable metrics.

As our request volume has grown, so too has our data footprint; over the course of 18 months our daily request volume increased over 40x.

Our primary MongoDB cluster now houses over 20TB of data, and getting to this point has helped us learn a few things along the way.

Routing

The MongoDB documentation suggests that the most common topology is to include a router — a mongos process — on each client system. We started doing this and it worked well for quite a while.

As the number of front-end application servers in production grew from the order of tens to several hundreds, we found that we were creating heavy load via hundreds (or sometimes thousands) of connections between our mongos routers and our mongod shard servers. This meant that whenever chunk balancing occurred — something that is an integral part of maintaining a well-balanced, sharded MongoDB cluster — the chunk location information that is stored in the config database took a long time to propagate. This is because every mongos router needs to have a full picture of where in the cluster all of the chunks reside.

So what did we learn? We found that we could alleviate this issue by consolidating mongos routers onto a few hosts. Our production infrastructure is in AWS, so we deployed 2 mongos servers per availability zone. This gave us redundancy per AZ, as well as offered the shortest network path possible from the clients to the mongos routers. We were concerned about adding an additional hop to the request path, but using Chef to configure all of our clients to only talk to the mongos routers in their own AZ helped minimize this issue.

Making this topology change greatly reduced the number of open connections to our mongod shards, which we were able to measure using MMS, without a noticeable reduction in application performance. At the same time, there were several improvements to MongoDB that made both the mongos updates and the internal consistency checks more efficient in general. Combined with the new infrastructure this meant that we could now balance chunks throughout our cluster without causing performance problems while doing so.

Shard Replacement

Another scenario we’ve encountered is the need to dynamically replace mongod servers in order to migrate to larger shards. Again following the recommended best deployment practice, we deploy MongoDB onto server instances utilizing large RAID10 arrays running xfs. We use m2.4xlarge instances in AWS with 16 disks. We’ve used basic Linux mdadm for performance, but at the expense of flexibility in disk configuration. As a result when we are ready to allocate more capacity to our shards, we need to perform a migration procedure that can sometimes take several days. This not only means that we need to plan ahead appropriately, but also that we need to be aware of the full process in order to monitor it and react when things go wrong.

We start with a replica set where all replicas are at approximately the same disk utilization. We first create a new server instance with more disk allocated to it, and add it to this replica set with rs.add().

The new replica will enter the STARTUP2 state and remain there for a long time (in our case, usually 2-3 days) while it first clones data, then catches up via oplog replication and builds indexes. During this time, index builds will often stop the replication process (note that this behavior is set to change in MongoDB 2.6), and so the replication lag time does not strictly decrease the whole time — it will steadily decrease for a while, then an index build will cause it to pause and start to fall further behind. Once the index build completes the replication will resume. It’s worth noting that while index builds occur, mongostat and anything else that requires a read lock will be blocked as well.

Eventually the replica will enter the SECONDARY state and will be fully functional. At this point we can rs.stepDown() one of the old replicas, shut down the mongod process running on it, and then remove it from the replica set via rs.remove(), making the server ready for retirement.

We then repeat the process for each member of the replica set, until all have been replaced with the new instances with larger disks.
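
In mongo shell terms, one round of the procedure looks roughly like this (host names are placeholders):

rs.add("new-member.example.com:27017")     // new member enters STARTUP2 while it clones and builds indexes
// ...wait until rs.status() shows the new member as SECONDARY...
rs.stepDown()                              // only needed if the member being retired is the primary
// ...shut down the mongod process on the old member, then:
rs.remove("old-member.example.com:27017")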

While this process can be time-consuming and somewhat tedious, it allows for a graceful way to grow our database footprint without any customer-facing impact.

Conclusion

Operating MongoDB at scale — as with any other technology — requires some knowledge that you can gain from documentation, and some that you have to learn from experience. By being willing to experiment with different strategies such as those shown above, you can discover flexibility that may not have been previously obvious. Consolidating the mongos router tier was a big win for Crittercism’s Ops team in terms of both performance and manageability, and developing the above described migration procedure has enabled us to continue to grow to meet our needs without affecting our service or our customers.

See how Crittercism, Stripe, Carfax, Intuit, Parse and Sailthru are building the next generation of applications at MongoDB World. Register now and join the MongoDB Community in New York City this June.

Background Indexing on Secondaries and Orphaned Document Cleanup in MongoDB 2.6

Jan 27 • Posted 6 months ago

By Alex Komyagin, Technical Services Engineer in MongoDB’s New York Office

The MongoDB Support Team has broad visibility into the community’s use of MongoDB, issues they encounter, feature requests, bug fixes and the work of the engineering team. This is the first of a series of posts to help explain, from our perspective, what is changing in 2.6 and why.

Many of these changes are available today for testing in the 2.5.4 Development Release, which is available as of November 18, 2013 (2.5.5 release, coming soon, will be feature complete). Development Releases have odd-numbered minor versions (e.g., 2.1, 2.3, 2.5), and Production Releases have even-numbered minor versions (e.g., 2.2, 2.4, 2.6). MongoDB 2.6 will become available a little later this year.

Community testing helps MongoDB improve. You can test the development of MongoDB 2.5.4 today. Downloads are available here, and you can log Server issues in Jira.

Background indexes on secondaries (SERVER-2771)

Suppose you have a production replica set with two secondary servers, and you have a large, 1TB collection. At some point, you may decide that you need a new index to reflect a recent change in your application, so you build one in the background:

db.col.ensureIndex({...}, {background: true})

Let’s also suppose that your application uses secondary reads (users should take special care with this feature, especially in sharded systems; for an example of why, see the next section in this post). After some time you observe that some of your application reads have started to fail, and replication lag on both secondaries has started to grow. While you are searching Google Groups for answers, everything magically goes back to normal by itself. Secondaries have caught up, and application reads on your secondaries are working fine. What happened?

One would expect that building indexes in the background would allow the replica set to continue serving regular operations during the index build. However, in all MongoDB releases before 2.6, background index builds on primaries become foreground index builds on secondaries, as noted in the documentation. Foreground index building is resource intensive and it can also affect replication and read/write operations on the database (see the FAQ on the MongoDB Docs). The good news is that impact can be minimized if your indexed collections are small enough for index builds to be relatively fast (on the order of minutes to complete).

The only way to make sure that indexing operations are not affecting the replica set in earlier versions of MongoDB was to build indexes in a rolling fashion. This works perfectly for most users, but not for everyone. For example, it wouldn’t work well for those who use a write concern “w:all”.

Starting with MongoDB 2.6, a background index build on the primary becomes a background index build on the secondaries. This behavior is much more intuitive and will improve the replica set robustness. We feel this will be a welcome enhancement for many users.

Please note that background index building normally takes longer than foreground building, because it allows other operations on the database to run. Keep in mind that, like most database systems, indexing in MongoDB is resource intensive and will increase the load on your system, whether it is a foreground or background process. Accordingly, it is best to perform these operations during a maintenance window or during off-peak hours.

The actual time needed to build a background index varies with the active load on your system, number of documents, database size and your hardware specs. Therefore, for production systems with large collections users can still take advantage of building indexes in a rolling fashion, or building them in foreground during maintenance windows if they believe a background index build will require more time than is acceptable.

Orphaned documents cleanup command (SERVER-8598)

MongoDB provides horizontal scaling through a feature called sharding. If you’re unfamiliar with sharding and how it works, I encourage you to read the nice new introduction to this feature the documentation team added a few months ago. Let me try and summarize some of the key concepts:

  • MongoDB partitions documents across shards.
  • Periodically the system runs a balancing process to ensure documents are uniformly distributed across the shards.
  • Groups of documents, called chunks, are the unit of a balancing job.
  • In certain failure scenarios stale copies of documents may remain on shards, which we call “orphaned documents.”

Under normal circumstances, there will be no orphaned documents in your system. However, in some production systems, “normal circumstances” are a very rare event, and migrations can fail (e.g., due to network connectivity issues), thus leaving orphaned documents behind.

The presence of orphaned documents can produce incorrect results for some queries. While orphaned documents are safe to delete, in versions prior to 2.6 there was no simple way to do so. In MongoDB 2.6 we implemented a new administrative command for sharded clusters: cleanupOrphaned(). This command removes orphaned documents from the shard in a single range of data.

The scenario where users typically encounter issues related to orphaned documents is when issuing secondary reads. In a sharded cluster, primary replicas for each shard are aware of the chunk placements, while secondaries are not. If you query the primary (which is the default read preference), you will not see any issues as the primary will not return orphaned documents even if it has them. But if you are using secondary reads, the presence of orphaned documents can produce unexpected results, because secondaries are not aware of the chunk ownerships and they can’t filter out orphaned documents. This scenario does not affect targeted queries (those having the shard key included), as mongos automatically routes them to correct shards.

To illustrate this discussion with an example, one of our clients told us that after a series of failed migrations he noticed that his queries were returning duplicate documents. He was using scatter-gather queries, meaning that they did not contain the shard key and were broadcast by mongos to all shards, as well as secondary reads. Shards returned all the documents matching the query (including orphaned documents), which in this situation led to duplicate entries in the final result set.

A short term solution was to remove orphaned documents (we used to have a special script for this). But a long term workaround for this particular client was to make their queries targeted, by including the shard key in each query. This way, mongos could efficiently route each query to the correct shard, not hitting the orphaned data. Routed queries are a best practice in any system as they also scale much better than scatter-gather queries.

Unfortunately, there are a few cases where there is no good way to make queries targeted, and you would need to either switch to primary reads or implement a regular process for removing orphaned documents.

The cleanupOrphaned() command is the first step on the path to automated cleanup of orphaned documents. This command should be run on the primary server and will clean up one unowned range on this shard. The idea is to run the command repeatedly, with a delay between calls to tune the cleanup rate.

In some configurations secondary servers might not be able to keep up with the rate of delete operations, resulting in replication lag. In order to control the lag, cleanupOrphaned() waits for the majority of the replica set members after the range removal is complete. Additionally, you can use the secondaryThrottle option, and each individual delete operation will be made with write concern w:2 (waits for one secondary). This may be useful for reducing the impact of removing orphaned documents on your regular operations.
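
A sketch of that loop in the mongo shell, with a hypothetical namespace (the command reports a stoppedAtKey, which feeds the next call):

// run against the admin database on the shard's primary
var nextKey = {};
while (nextKey != null) {
    var result = db.adminCommand({
        cleanupOrphaned: "test.info",
        startingFromKey: nextKey,
        secondaryThrottle: true
    });
    if (!result.ok) break;
    nextKey = result.stoppedAtKey;  // null once the last range on this shard is cleaned
    sleep(1000);                    // delay between calls to tune the cleanup rate
}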

You can find command usage examples and more information about the command in the 2.6 documentation.

I hope you will find these features helpful. We look forward to hearing your feedback on these features. If you would like to test them out, download MongoDB 2.5.4, the most recent Development Release of MongoDB.

Transitioning from Relational Databases to MongoDB - Data Models

Jan 10 • Posted 6 months ago

This post was written by Bryan Reinero, a Consulting Engineer at MongoDB.

Understanding how to use MongoDB isn’t difficult, but it does require you to change the way you think about databases if you are coming from a traditional relational database management system (RDBMS). This blog post is designed to help you understand data modeling and schema design in MongoDB from the perspective of someone used to programming with an RDBMS. I’ll explore the fundamental differences between RDBMS and MongoDB and highlight the advantages and design compromises of each approach.

Fundamental Differences
The immediate and fundamental difference between MongoDB and an RDBMS is the underlying data model. A relational database structures data into tables and rows, while MongoDB structures data into collections of JSON documents. JSON is a self-describing, human readable data format. Originally designed for lightweight exchanges between browser and server, it has become widely accepted for many types of applications.

JSON documents are particularly useful for data management for several reasons. A JSON document is composed of a set of fields which are themselves key-value pairs. This means each JSON document carries its own human readable schema design with it wherever it goes, allowing the documents to easily move between database and client applications without losing their meaning.

JSON is also a natural data format for use in the application layer. JSON supports a richer and more flexible data structure than tables made up of columns and rows. In addition to supporting field types like number, string, Boolean, etc., JSON fields can be arrays or nested sub-objects. This means we can represent a set of sophisticated relations which are a closer representation of the objects our applications work with. Using JSON documents in our database means we don’t need an object relational mapper between our database and the applications it serves. We can persist our data in the right form for our application.

Let’s dive into an example. Imagine I have an application dealing with information describing vehicles, including make, manufacturer, and category. My documents might look like this:

{
  "_id" : ObjectId("528ba7691738025d11aab772"),
  "manufacturer" : "Porsche",
  "name" : "550 Spyder",
  "category" : [
    "sports",
    "touring",
    "coupé"
  ]
}

It’s pretty clear what this document describes, and I could easily unmarshall it into an object native to my chosen language. Notice also that the “category” field is an array of strings. The ability to support arrays is an especially helpful feature; it simplifies the way my application interfaces with the database and helps me avoid a complicated database schema. Consider the complexity of supporting a repeating group in a properly normalized table structure. Representing the same data object in a single table row would look like this:

PK  | Name         | Manufacturer | Categories
123 | "550 Spyder" | "Porsche"    | "sports,touring,race,coupé"

In the real world, I know that vehicles belong to multiple categories, so I’ve chosen to represent that relation with a comma separated list: “sports,touring,race,coupé”. However, a comma separated list can be difficult to work with because:

- I can’t use equality to match single embedded values
- I must use regular expressions to find data

Which means that:

- Aggregate functions are difficult
- Updating a specific element is difficult

The first normal form was designed to avoid such problems by requiring that each relation contain a single atomic value. In other words: no repeating groups!

MongoDB’s JSON arrays do a far better job of matching the semantics of our categories than a set of rows, since arrays are multi-element types, representing lists and sets by nature. As a document store, MongoDB’s multi-value fields are a natural fit.


The Normal Forms and Why We Don’t Need Them
All of this is nice, but a perfectly valid argument can be made: implement the first normal form properly in my relational database, and all the trouble I have with repeating groups will be solved. To adhere to the first normal form, I could restructure my rows to look like this:

Vehicle_Id | Manufacturer | Name         | Category
----------------------------------------------------
2253       | "Porsche"    | "550 Spyder" | "sports"
2253       | "Porsche"    | "550 Spyder" | "touring"
2253       | "Porsche"    | "550 Spyder" | "coupé"

This looks great, but I’ve clearly created data redundancies that make me vulnerable to anomalies. For example, consider an update to a single row:

UPDATE Vehicles SET Manufacturer = 'Porsche AG' WHERE Vehicle_Id = 2253 AND Category = 'coupé'

This update leads to the following data anomaly.

Vehicle_Id | Manufacturer | Name         | Category
----------------------------------------------------
2253       | "Porsche"    | "550 Spyder" | "sports"
2253       | "Porsche"    | "550 Spyder" | "touring"
2253       | "Porsche AG" | "550 Spyder" | "coupé"

This happened because I failed to adhere to the second normal form. To avoid this problem I’d have to normalize my data into three separate tables.

Vehicle_Id | Name
-------------------------
2253       | "550 Spyder"

Vehicle_Id | Category
-------------------------
2253       | "sports"
2253       | "touring"
2253       | "coupé"

Vehicle_Id | Manufacturer
-------------------------
2253       | "Porsche AG"

This structure prevents the anomaly in the first example, but introduces a new set of problems. The new schema has become much more complex, and I have lost any semantic understanding of my stored objects. By adhering to these two normal forms, I’ve also entered the realm of cross-table joins and the need to enforce ACID transactions across multiple tables. The enforcement of referential integrity in multi-row, multi-table operations will require concurrency controls, increasing overhead and affecting performance.

MongoDB avoids these complications through the use of a document data model. The JSON document is a more natural representation of the data than the normalized schema we explored. It allows the embedding of related data via arrays and sub-documents within a single document, thus eliminating the need for JOINs, referential integrity enforcement, and multi-record ACID transactions.

The Tao of MongoDB
{
  "_id" : ObjectId("528ba7691738025d11aab772"),
  "manufacturer" : "Porsche AG",
  "name" : "550 Spyder",
  "categories" : [
    "sports",
    "touring",
    "coupé"
  ]
}

The data anomalies are avoided since the denormalized JSON structure has a single “manufacturer” field. As an attribute of a single document, the “manufacturer” field can be modified atomically, and consistency is preserved across all relations in the document. Using JSON gives us less reason to adhere to the normal forms. This doesn’t mean that you are prohibited from normalizing data in MongoDB. Of course you can, but the reasons for doing so are going to be different than those of a traditional RDBMS. We’ll look into that in a subsequent post on schema design.
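As a small illustration of that atomicity (the vehicles collection name is assumed), a single in-place update corrects the manufacturer everywhere it appears in the document:

// One atomic update; there are no other rows to drift out of sync
db.vehicles.update(
  { _id: ObjectId("528ba7691738025d11aab772") },
  { $set: { manufacturer: "Porsche AG" } }
)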

For a full-length presentation on moving from Relational Databases to MongoDB, please visit the MongoDB presentations page. You can learn how to use MongoDB with our online education courses here. And please contact MongoDB support with any questions about transitioning to or using the product. You can also download our new whitepaper which provides best practices and considerations for migrating from an RDBMS to MongoDB.

Schema Design for Social Inboxes in MongoDB

Oct 31 • Posted 8 months ago

Designing a schema is a critical part of any application. Like most databases, there are many options for modeling data in MongoDB, and it is important to incorporate the functional requirements and performance goals for your application when determining the best design. In this post, we’ll explore three approaches for using MongoDB when creating social inboxes or message timelines.

If you’re building a social network, like Twitter for example, you need to design a schema that is efficient for users viewing their inbox, as well as users sending messages to all their followers. The whole point of social media, after all, is that you can connect in real time.

There are several design considerations for this kind of application:

  • The application needs to support a potentially large volume of reads and writes.
  • Reads and writes are not uniformly distributed across users. Some users post much more frequently than others, and some users have many, many more followers than others.
  • The application must provide a user experience that is instantaneous.
  • Edit 11/6: The application will have little to no user deletions of data (a follow up blog post will include information about user deletions and historical data)

Because we are designing an application that needs to support a large volume of reads and writes, we will be using a sharded collection for the messages. All three designs include the concept of “fan out,” which refers to distributing the work across the shards in parallel:

  1. Fan out on Read
  2. Fan out on Write
  3. Fan out on Write with Buckets

Each approach presents trade-offs, and you should use the design that is best for your application’s requirements.

The first design you might consider is called Fan Out on Read. When a user sends a message, it is simply saved to the inbox collection. When any user views their own inbox, the application queries for all messages that include the user as a recipient. The messages are returned in descending date order so that users can see the most recent messages.

To implement this design, create a sharded collection called inbox, specifying the from field as the shard key, which represents the address sending the message. You can then add a compound index on the to field and the sent field. Once the document is saved into the inbox, the message is effectively sent to all the recipients. With this approach sending messages is very efficient.
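A minimal sketch of this setup, assuming a social database and the field names used in the text (sharding is already enabled on the database):

// Shard the inbox on the sender and index it for inbox views
sh.shardCollection("social.inbox", { from: 1 })
db.inbox.ensureIndex({ to: 1, sent: 1 })

// Sending a message is a single insert
db.inbox.insert({
  from: "Joe",
  to: ["Bob", "Jane"],
  sent: new Date(),
  message: "Hi!"
})

// Viewing an inbox queries on "to", which is not the shard key
db.inbox.find({ to: "Bob" }).sort({ sent: -1 })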

Viewing an inbox, on the other hand, is less efficient. When a user views their inbox the application issues a find command based on the to field, sorted by sent. Because the inbox collection uses from as its shard key, messages are grouped by sender across the shards. In MongoDB, queries that are not based on the shard key are routed to all shards. Therefore, each inbox view is routed to all shards in the system. As the system scales and many users go to view their inbox, all queries are routed to all shards. This design does not scale as well as one in which each query is routed to a single shard.

With the “Fan Out on Read” method, sending a message is very efficient, but viewing the inbox is less efficient.

Fan out on Read is very efficient for sending messages, but less efficient for reading messages. If the majority of your application consists of users sending messages, but very few go to read what anyone sends them — let’s call it an anti-social app — then this design might work well. However, for most social apps there are more requests by users to view their inbox than there are to send messages.

Fan out on Write takes a different approach that is more optimized for viewing inboxes. This time, instead of sharding our inbox collection on the sender, we shard on the message recipient. In this way, when we go to view an inbox the queries can be routed to a single shard, which will scale very well. Our message document is the same, but now we save a copy of the message for every recipient.
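A sketch under the same assumptions, with a recipient field (name assumed) so the shard key matches the reader:

// Shard on the recipient so each inbox lives on one shard
sh.shardCollection("social.inbox", { recipient: 1, sent: 1 })

// Sending writes one copy per recipient (msg is the message being sent)
msg.to.forEach(function(recipient) {
  db.inbox.insert({
    recipient: recipient,
    from: msg.from,
    sent: msg.sent,
    message: msg.message
  });
});

// Viewing an inbox is now routed to a single shard
db.inbox.find({ recipient: "Bob" }).sort({ sent: -1 })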

With the “Fan Out on Write” method, viewing the inbox is efficient, but sending messages consumes more resources.

In practice we might implement the saving of messages asynchronously. Imagine two celebrities quickly exchange messages at a high-profile event - the system could quickly be saturated with millions of writes. By saving a first copy of their message, then using a pool of background workers to write copies to all followers, we can ensure the two celebrities can exchange messages quickly, and that followers will soon have their own copies. Furthermore, we could maintain a last-viewed date on the user document to ensure they have accessed the system recently - zombie accounts probably shouldn’t get a copy of the message, and for users that haven’t accessed their account recently we could always resort to our first design - Fan out on Read - to repopulate their inbox. Subsequent requests would then be fast again.

At this point we have improved the design for viewing inboxes by routing each inbox view to a single shard. However, each message in the user’s inbox will produce a random read operation. If each inbox view produces 50 random reads, then it only takes a relatively modest number of concurrent users to potentially saturate the disks. Fortunately we can take advantage of the document data model to further optimize this design to be even more efficient.

Fan out on Write with Buckets refines the Fan Out on Write design by “bucketing” messages together into documents of 50 messages ordered by time. When a user views their inbox the request can be fulfilled by reading just a few documents of 50 messages each instead of performing many random reads. Because read time is dominated by seek time, reducing the number of seeks can provide a major performance improvement to the application. Another advantage to this approach is that there are fewer index entries.

To implement this design we create two collections: an inbox collection and a user collection. The inbox collection uses two fields for the shard key, owner and sequence, which hold the owner’s user id and a sequence number (i.e. the id of the 50-message “bucket” document in their inbox). The user collection contains simple user documents that track the total number of messages in each user’s inbox. Since we will probably need to show the total number of messages for a user in a variety of places in our application, this is a nice place to maintain the count instead of calculating it for each request. Our message document is the same as in the prior examples.

To send a message we iterate through the list of recipients as we did in the Fan out on Write example, but we also take another step: incrementing the count of total messages in the recipient’s inbox, which is maintained on the user document. Once we know the count of messages, we know the “bucket” in which to add the latest message. As a bucket reaches the 50-message threshold, the sequence number increments and we begin to add messages to the next “bucket” document. The most recent messages will always be in the “bucket” document with the highest sequence number. Viewing the most recent 50 messages for a user’s inbox is at most two reads; viewing the most recent 100 messages is at most three reads.
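A sketch of that send path, assuming a msg_count field on the user document and a messages array in each bucket (both names hypothetical):

// For each recipient: bump the total count, derive the bucket's
// sequence number, and append the message to that bucket
msg.to.forEach(function(recipient) {
  var count = db.users.findAndModify({
    query: { _id: recipient },
    update: { $inc: { msg_count: 1 } },
    upsert: true,
    new: true
  }).msg_count;

  var sequence = Math.floor(count / 50);   // which 50-message bucket

  db.inbox.update(
    { owner: recipient, sequence: sequence },
    { $push: { messages: msg } },
    { upsert: true }
  );
});

// The most recent 50 messages are in the highest one or two buckets
db.inbox.find({ owner: "Bob" }).sort({ sequence: -1 }).limit(2)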

Normally a user’s entire inbox will exist on a single shard. However, it is possible that a few user inboxes could be spread across two shards. Because our application will probably page through a user’s inbox, it is still likely that every query for these few users will be routed to a single shard.

Fan out on Write with Buckets is generally the most scalable of these three designs for social inbox applications. Every design presents different trade-offs. In this case viewing a user’s inbox is very efficient, but writes are somewhat more complex, and more disk space is consumed. For many applications these are the right trade-offs to make.

The trade-offs of the three designs are summarized in the following table:

                         | Fan out on Read        | Fan out on Write      | Fan out on Write with Buckets
--------------------------------------------------------------------------------------------------------
Send Message Performance | Best:                  | Good:                 | Worst:
                         | single write           | shard per recipient,  | shard per recipient,
                         |                        | multiple writes       | appends (grows)
Read Inbox Performance   | Worst:                 | Good:                 | Best:
                         | broadcast all shards,  | single shard,         | single shard,
                         | random reads           | random reads          | single read
Data Size                | Best:                  | Worst:                | Worst:
                         | message stored once    | copy per recipient    | copy per recipient


Schema design is one of the most important optimizations you can make for your application. We have a number of additional resources available on schema design if you are interested in learning more:

Schema Design for Time Series Data in MongoDB

Oct 30 • Posted 9 months ago

This is a post by Sandeep Parikh, Solutions Architect at MongoDB and Kelly Stirman, Director of Products at MongoDB.

Data as Ticker Tape

New York is famous for a lot of things, including ticker tape parades.

For decades the most popular way to track the price of stocks on Wall Street was through ticker tape, the earliest digital communication medium. Stocks and their values were transmitted via telegraph to a small device called a “ticker” that printed onto a thin roll of paper called “ticker tape.” While out of use for over 50 years, the idea of the ticker lives on in scrolling electronic tickers on brokerage walls and at the bottom of most news networks, sometimes two, three and four levels deep.

Today there are many sources of data that, like ticker tape, represent observations ordered over time. For example:

  • Financial markets generate prices (we still call them “stock ticks”).
  • Sensors measure temperature, barometric pressure, humidity and other environmental variables.
  • Industrial fleets such as ships, aircraft and trucks produce location, velocity, and operational metrics.
  • Status updates on social networks.
  • Calls, SMS messages and other signals from mobile devices.
  • Systems themselves write information to logs.

This data tends to be immutable, large in volume, ordered by time, and is primarily aggregated for access. It represents a history of what happened, and there are a number of use cases that involve analyzing this history to better predict what may happen in the future or to establish operational thresholds for the system.

Time Series Data and MongoDB

Time series data is a great fit for MongoDB. There are many examples of organizations using MongoDB to store and analyze time series data. Here are just a few:

  • Silver Spring Networks, the leading provider of smart grid infrastructure, analyzes utility meter data in MongoDB.
  • EnerNOC analyzes billions of energy data points per month to help utilities and private companies optimize their systems, ensure availability and reduce costs.
  • Square maintains a MongoDB-based open source tool called Cube for collecting timestamped events and deriving metrics.
  • Server Density uses MongoDB to collect server monitoring statistics.
  • Appboy, the leading platform for mobile relationship management, uses MongoDB to track and analyze billions of data points on user behavior.
  • Skyline Innovations, a solar energy company, stores and organizes meteorological data from commercial scale solar projects in MongoDB.
  • One of the world’s largest industrial equipment manufacturers stores sensor data from fleet vehicles to optimize fleet performance and minimize downtime.

In this post, we will take a closer look at how to model time series data in MongoDB by exploring the schema of a tool that has become very popular in the community: MongoDB Management Service (MMS). MMS helps users manage their MongoDB systems by providing monitoring, visualization and alerts on over 100 database metrics. Today the system monitors over 25k MongoDB servers across thousands of deployments. Every minute thousands of local MMS agents collect system metrics and ship the data back to MMS. The system processes over 5B events per day, and over 75,000 writes per second, all on less than 10 physical servers for the MongoDB tier.

Schema Design and Evolution

How do you store time series data in a database? In relational databases the answer is somewhat straightforward; you store each event as a row within a table. Let’s say you were monitoring the amount of system memory used per second. In that example you would have a table and rows that looked like the following:

timestamp                | memory_used
2013-10-10T23:06:37.000Z | 1000000
2013-10-10T23:06:38.000Z | 2000000


If we map that storage approach to MongoDB, we would end up with one document per event:

{
  timestamp: ISODate("2013-10-10T23:06:37.000Z"),
  type: "memory_used",
  value: 1000000
},
{
  timestamp: ISODate("2013-10-10T23:06:38.000Z"),
  type: "memory_used",
  value: 2000000
}

While this approach is valid in MongoDB, it doesn’t take advantage of the expressive nature of the document model. Let’s take a closer look at how we can refine the model to provide better performance for reads and to improve storage efficiency.

The Document-Oriented Design

A better approach looks like the following; it is not the exact schema MMS uses, but it illustrates the key concepts. Let’s call it the document-oriented design:

{
  timestamp_minute: ISODate("2013-10-10T23:06:00.000Z"),
  type: "memory_used",
  values: {
    0: 999999,
    …
    37: 1000000,
    38: 1500000,
    …
    59: 2000000
  }
}

We store multiple readings in a single document: one document per minute. To further improve the efficiency of the schema, we can isolate repeating data structures. In the ```timestamp_minute``` field we capture the minute that identifies the document, and for each memory reading we store a new value in the ```values``` sub-document. Because we are storing one value per second, we can simply represent each second as fields 0 - 59.

More Updates than Inserts

In any system there may be tradeoffs regarding the efficiency of different operations, such as inserts and updates. For example, in some systems updates are implemented as copies of the original record written out to a new location, which requires updating of indexes as well. One of MongoDB’s core capabilities is the in-place update mechanism: field-level updates are managed in place as long as the size of the document does not grow significantly. By avoiding rewriting the entire document and index entries unnecessarily, far less disk I/O is performed. Because field-level updates are efficient, we can design for this advantage in our application: with the document-oriented design there are many more updates (one per second) than inserts (one per minute).

For example, if you wanted to maintain a count in your application, MongoDB provides a handy operator that increments or decrements a field. Instead of reading a value into your application, incrementing, then writing the value back to the database, you can simply increase the field using $inc:

```{ $inc: { pageviews: 1 } }```

This approach has a number of advantages: first, the increment operation is atomic - multiple threads can safely increment a field concurrently using $inc. Furthermore, this approach is more efficient for disk operations, requires less data to be sent over the network and requires fewer round trips by omitting the need for any reads. Those are three big wins that result in a more simple, more efficient and more scalable system. The same advantages apply to the use of the $set operator.

The document-oriented design has several benefits for writing and reading. As previously stated, writes can be much faster as field-level updates, because instead of writing a full document we’re sending a much smaller delta update, which can be modeled like so:

db.metrics.update(
  {
    timestamp_minute: ISODate("2013-10-10T23:06:00.000Z"),
    type: "memory_used"
  },
  { $set: { "values.59": 2000000 } }
)

With the document-oriented design reads are also much faster. If you needed an hour’s worth of measurements using the first approach you would need to read 3600 documents, whereas with this approach you would only need to read 60 documents. Reading fewer documents has the benefit of fewer disk seeks, and with any system fewer disk seeks usually results in significantly better performance.

A natural extension to this approach would be to have documents that span an entire hour, while still keeping the data resolution per second:

{
  timestamp_hour: ISODate("2013-10-10T23:00:00.000Z"),
  type: "memory_used",
  values: {
    0: 999999,
    1: 1000000,
    …,
    3598: 1500000,
    3599: 2000000
  }
}

One benefit to this approach is that we can now access an hour’s worth of data using a single read. However, there is one significant downside: to update the last second of any given hour MongoDB would have to walk the entire length of the “values” object, taking 3600 steps to reach the end. We can further refine the model a bit to make this operation more efficient:

{
  timestamp_hour: ISODate("2013-10-10T23:00:00.000Z"),
  type: "memory_used",
  values: {
    0: { 0: 999999, 1: 999999, …, 59: 1000000 },
    1: { 0: 2000000, 1: 2000000, …, 59: 1000000 },
    …,
    58: { 0: 1600000, 1: 1200000, …, 59: 1100000 },
    59: { 0: 1300000, 1: 1400000, …, 59: 1500000 }
  }
}

db.metrics.update(
  {
    timestamp_hour: ISODate("2013-10-10T23:00:00.000Z"),
    type: "memory_used"
  },
  { $set: { "values.59.59": 2000000 } }
)

MMS Implementation

In MMS users have flexibility to view monitoring data at varying levels of granularity, using controls that appear at the top of the monitoring page.

These controls inform the schema design for MMS, and how the data needs to be displayed. In MMS, different resolutions have corresponding range requirements - for example, if you specify that you want to analyze monitoring data at the granularity of “1 hr” instead of “1 min”, then the ranges also become less granular, changing from hours to days, weeks and months.

To satisfy this approach in a scalable manner and keep data retention easy to manage, MMS organizes monitoring data to be very efficient for reads by maintaining copies at varying degrees of granularity. The document model allows for efficient use of space, so the tradeoff is very reasonable, even for a system as large as MMS. As data ages out, collections that are associated with ranges of time are simply dropped, which is a very efficient operation. Collections are created to represent future ranges of time, and these will eventually be dropped as well. This cycle maintains a rolling window of history associated with the functionality provided by MMS.
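A sketch of that retention cycle (the per-range collection naming is an assumption):

// Create the next period's collection ahead of time...
db.createCollection("metrics_daily_2013_10_11")

// ...and drop the expired range; dropping a whole collection is far
// cheaper than deleting its documents one by one
db.getCollection("metrics_daily_2013_09_11").drop()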

In addition, to support the “avg/sec” display option the system also tracks the number of samples collected and the sum of all readings for each metric similar to the following example:

{
  timestamp_minute: ISODate("2013-10-10T23:06:00.000Z"),
  num_samples: 58,
  total_samples: 108000000,
  type: "memory_used",
  values: {
    0: 999999,
    …
    37: 1000000,
    38: 1500000,
    …
    59: 1800000
  }
}

The fields “num_samples” and “total_samples” are updated as new readings are applied to the document:

db.metrics.update(
  {
    timestamp_minute: ISODate("2013-10-10T23:06:00.000Z"),
    type: "memory_used"
  },
  {
    $set: { "values.59": 2000000 },
    $inc: { num_samples: 1, total_samples: 2000000 }
  }
)

Computing the average/sec is straightforward and requires no counting or processing, just a single read to retrieve the data and a simple application-level operation to compute the average. Note that with this model we assume a consistent cadence of measurements - one per second - that we can simply aggregate at the top of the document to report a rolled-up average for the whole minute. Other models are possible that would support inconsistent measurements and flexible averages over different time frames.
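A sketch of that single read plus the application-level division:

// One read retrieves the minute's document...
var doc = db.metrics.findOne({
  timestamp_minute: ISODate("2013-10-10T23:06:00.000Z"),
  type: "memory_used"
});

// ...and the rolled-up average for the whole minute is a single division
var avgPerSec = doc.total_samples / doc.num_samples;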

Another optimization used in MMS is preallocating all documents for the upcoming time period; MMS never causes an existing document to grow or be moved on disk. A background task within the MMS application inserts empty “shell” documents, complete with the subdocument schema but with all values zeroed, for upcoming time periods before any readings are recorded. With this approach fields are always incremented or set without ever growing the document in size, which eliminates the possibility of moving the document and the associated overhead. This is a major performance win and another example of ensuring in-place updates within the document-oriented design.
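A minimal sketch of such a preallocation task, assuming the minute-level schema above (the helper name is hypothetical):

// Insert an all-zero "shell" document for an upcoming minute so that
// later $set/$inc operations never grow or move the document on disk
function preallocateMinute(type, minute) {
  var values = {};
  for (var second = 0; second < 60; second++) {
    values[second] = 0;
  }
  db.metrics.insert({
    timestamp_minute: minute,
    type: type,
    num_samples: 0,
    total_samples: 0,
    values: values
  });
}

preallocateMinute("memory_used", ISODate("2013-10-10T23:07:00.000Z"));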

Conclusion

MongoDB offers many advantages for storing and analyzing time series data, whether it’s stock ticks, tweets or MongoDB metrics. If you are using MongoDB for time series data analysis, we want to hear about your use case. Please continue the conversation by commenting on this post with your story.


Performance Tuning MongoDB on SolidFire

Oct 24 • Posted 9 months ago

This is a guest post by Chris Merz & Garrett Clark, SolidFire

We recently had a large enterprise customer implement a MongoDB sharded cluster on SolidFire as the backend for a global e-commerce system. By leveraging solid-state drive technology with features like storage virtualization, Quality of Service (guaranteed IOPS per volume), and horizontal scaling, the customer was looking to combine the benefits of dedicated storage performance with the simplicity and scalability of a MongoDB environment.

During the implementation the customer reached out to us with some performance and tuning questions, requesting our assistance with the configuration. After meeting with the team and reviewing the available statistics, we discovered response times that were deemed out of range for the application’s performance requirements. Response times were ~13-20ms (with an average of 15-17 ms). While this is considered acceptable latency in many implementations, the team was targeting < 5ms average query response times.

When troubleshooting any storage latency performance issue it is important to focus on two critical aspects of the end-to-end chain: potential i/o queue depth bottlenecks and the main contributors to the overall latency in the chain. A typical end-to-end sequence with attached storage can be described by:

MongoDB > OS > NIC > Network > Storage > Network > NIC > OS > MongoDB

First off, we looked for any i/o queue depth bottlenecks and found the first one on the operating system layer. MongoDB was periodically sending an i/o queue depth of >100 to the operating system and, by default, iSCSI could only release a queue depth of 32 per iSCSI session. This drop from an i/o queue depth of >100 to 32 caused frames to be stalled on the operating system layer while they were waiting to continue down the chain.

We alleviated the issue by increasing the number of iSCSI sessions to the volume from 1 to 4, which proportionally increased the queue depth exiting the operating system to 128 (32*4). This enabled all frames coming off the application layer to immediately pass through the operating system and NIC, and decreased the overall latency from ~15ms to ~4ms. Despite the 4ms average, latency was still rather variable.

We then turned our focus to pinpointing the sources of the remaining end-to-end latency. We were able to determine the latency factors in the stack through the study of three latency loops:

First, the complete chain of: MongoDB > OS > NIC > Network > Storage > Network > NIC > OS > MongoDB. This loop took an average of 3.9ms to complete.

Second, the subset loop of: OS > NIC > Network > Storage > Network > NIC > OS. This loop took ~1.1ms to complete. We determined the latency of this loop from the output of “iostat -xk 1”, grepping for the corresponding volume.

The last loop segment, latency on the storage subsystem, was 0.7ms and was obtained through a polling API command issued to the SolidFire unit.

Our analysis pointed to the first layers of the stack contributing the most significant percent (>70%) of the end-to-end latency, so we decided to start there and continue downstream.

We reviewed the OS configuration and tuning, with an eye towards both SolidFire/iSCSI best practices and MongoDB performance. Several OS-level tunables were found that could be tweaked to ensure optimal throughput for this type of deployment. Unfortunately, none of these resulted in any major reduction in the end-to-end latency for mongo.

Having eliminated the obvious, we were left with what remained: MongoDB itself. A phrase oft-quoted by the famous fictional detective, Sherlock Holmes came to mind: “when you have eliminated the impossible, whatever remains, however improbable, must be the truth.”

Upon going over the collected statistics runs with a fine-toothed comb, we noticed that the latency spikes had intervals of almost exactly 60 seconds. That’s when the light bulb went off…

The MongoDB flush interval. The architecture of MongoDB was developed in the context of spinning disk, a vastly slower storage technology requiring batched file syncs to minimize query latency. The syncdelay setting defaults to 60 seconds for this very reason. In the documentation, it is clearly stated “In almost every situation you should not set this value and use the default setting”. ‘Almost’ was the key to our solution, in this particular case. It should be noted that changing syncdelay is an advanced tuning, and should be carefully evaluated and tested on a per-deployment basis.

Little’s Law (IOPS = Queue Depth / Latency) indicated that lowering the flush interval would reduce the variance in queue depth thereby smoothing the overall latency. In lab testing, we had found that, under maximum load, decreasing the syncdelay to 1 second would force a ‘continuous flush’ behavior usually repeating every 6-7 seconds, reducing i/o spikes in the storage path. We had seen this as a useful technique for controlling IOPS throughput variability, but had not typically viewed it as a latency reduction technique.
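For reference, a sketch of lowering the flush interval at runtime (again, an advanced tuning that should be evaluated and tested per deployment):

// syncdelay is a runtime server parameter; a value of 1 forces
// near-continuous flushing instead of 60-second batches
db.adminCommand({ setParameter: 1, syncdelay: 1 })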

It worked!

After implementing the change, the customer excitedly reported that they were seeing average end-to-end MongoDB response times of 1.2ms, with a throughput of ~4-5k IOPS per mongod (normal for this application), and NO obvious increase in extraneous i/o.

By increasing the number of iSCSI sessions, normalizing the flush rate and removing the artificial 60s buffer, we reduced average latency more than an order of magnitude, proving out the architecture at scale in a global production environment. Increasing the iSCSI sessions increased parallelism, and decreased the latency by 3.5-4x. The reduction in syncdelay had the effect of smoothing the average queue depth being sent to the storage system, decreasing latency by slightly more than 3x.

This customer’s experience is a good example of how engaging the MongoDB team early on can ensure a smooth product launch. As of today, we’re excited to announce that SolidFire is a MongoDB partner. Learn more about the SolidFire and MongoDB integration on our Database Solutions page. To learn more about performance tuning MongoDB on SolidFire, register for our upcoming webinar on November 6 with MongoDB.

For more information on MongoDB performance, check out Performance Considerations for MongoDB, which covers other topics such as indexes and application patterns and offers insight into achieving MongoDB performance at scale.
