Posts tagged:

mongo

Processing 2 Billion Documents A Day And 30TB A Month With MongoDB

Mar 14 • Posted 6 months ago

This is a guest post by David Mytton. He has been programming Python for over 10 years and founded his website and server monitoring company, Server Density, back in 2009.

Server Density processes over 30TB/month of incoming data points from the servers and web checks we monitor for our customers, ranging from simple Linux system load average to website response times from 18 different countries. All of this data goes into MongoDB in real time and is pulled out when customers need to view graphs, update dashboards and generate reports.

We’ve been using MongoDB in production since mid-2009 and have learned a lot over the years about scaling the database. We run multiple MongoDB clusters but the one storing the historical data does the most throughput and is the one I shall focus on in this article, going through some of the things we’ve done to scale it.

1. Use dedicated hardware, and SSDs

All our MongoDB instances run on dedicated servers across two data centers at Softlayer. We’ve had bad experiences with virtualisation because you have no control over the host, and databases need guaranteed performance from disk i/o. When running on shared storage (e.g., a SAN) this is difficult to achieve unless you can get guaranteed throughput from things like AWS’s Provisioned IOPS on EBS (which are backed by SSDs).

MongoDB doesn’t really have many bottlenecks when it comes to CPU because CPU-bound operations are rare (usually things like building indexes), but what really causes problems is CPU steal - when other guests on the host are competing for CPU resources.

The way we have combated these problems is to eliminate the possibility of CPU steal and noisy neighbours by moving onto dedicated hardware. And we avoid problems with shared storage by deploying the dbpath onto locally mounted SSDs.

I’ll be speaking in-depth about managing MongoDB deployments in virtualized or dedicated hardware at MongoDB World this June.

2. Use multiple databases to benefit from improved concurrency

Running the dbpath on an SSD is a good first step but you can get better performance by splitting your data across multiple databases, and putting each database on a separate SSD with the journal on another.

Locking in MongoDB is managed at the database level, so moving collections into their own databases helps spread things out - most important for scaling writes when you are also trying to read data. If you keep databases on the same disk you’ll start hitting the throughput limitations of the disk itself. This is improved by putting each database on its own SSD using the directoryperdb option. SSDs help by significantly alleviating i/o latency, which is related to the number of IOPS and the latency for each operation, particularly when doing random reads/writes. This is even more visible in Windows environments, where the memory mapped data files are flushed serially and synchronously. Again, SSDs help with this.
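
As a rough sketch of what this can look like from the application side (database and collection names here are made up, not Server Density’s schema), writes for different items can be routed to one of several databases so they contend for different database-level locks:

import zlib

from pymongo import MongoClient

client = MongoClient()  # adjust the connection string for your deployment
NUM_DBS = 4  # e.g. one database per SSD, laid out with the directoryperdb option

def metrics_collection(item_id):
    # crc32 is stable across processes, unlike Python's built-in hash()
    db_name = 'metrics_%d' % (zlib.crc32(item_id.encode('utf-8')) % NUM_DBS)
    return client[db_name]['values']

metrics_collection('server-42').insert_one({'item': 'server-42', 'load': 0.73})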

The journal is always within a directory so you can mount this onto its own SSD as a first step. All writes go via the journal and are later flushed to disk so if your write concern is configured to return when the write is successfully written to the journal, making those writes faster by using an SSD will improve query times. Even so, enabling the directoryperdb option gives you the flexibility to optimise for different goals (e.g., put some databases on SSDs and some on other types of disk, or EBS PIOPS volumes, if you want to save cost).
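
For example, with pymongo 3+ (a sketch, names assumed), a collection handle can be configured so writes are acknowledged only once they reach the journal:

from pymongo import MongoClient, WriteConcern

client = MongoClient()
# j=True: acknowledge the write only after it has been committed to the
# journal, so journal (SSD) latency directly determines how fast this returns.
journaled = client.metrics.get_collection(
    'values', write_concern=WriteConcern(w=1, j=True))
journaled.insert_one({'item': 'server-42', 'load': 0.73})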

It’s worth noting that filesystem-based snapshots taken while MongoDB is still running are no longer possible if you move the journal to a different disk (and so a different filesystem). You would instead need to shut down MongoDB (to prevent further writes) and then take the snapshot from all volumes.

3. Use hash-based sharding for uniform distribution

Every item we monitor (e.g., a server) has a unique MongoID and we use this as the shard key for storing the metrics data.

The query index is on the item ID (e.g. the server ID), the metric type (e.g. load average) and the time range; because every query always includes the item ID, it makes a good shard key. That said, it is important to ensure that there aren’t large numbers of documents under a single item ID, because this can lead to jumbo chunks which cannot be migrated. Jumbo chunks arise from failed splits: they’re already over the chunk size but cannot be split any further.

To ensure that the shard chunks are always evenly distributed, we’re using the hashed shard key functionality in MongoDB 2.4. Hashed shard keys are often a good choice for ensuring uniform distribution, but if you end up not using the hashed field in your queries, you could actually hurt performance because then a non-targeted scatter/gather query has to be used.
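
A sketch of setting this up through pymongo against a mongos (database, collection and host names are hypothetical):

from pymongo import MongoClient, HASHED

client = MongoClient('mongodb://mongos.example.com:27017')
# Enable sharding, build a hashed index on the item id, then shard on the
# hashed key (MongoDB 2.4+) so chunks are distributed uniformly across shards.
client.admin.command('enableSharding', 'metrics')
client.metrics.values.create_index([('item_id', HASHED)])
client.admin.command('shardCollection', 'metrics.values',
                     key={'item_id': 'hashed'})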

4. Let MongoDB delete data with TTL indexes

The majority of our users are only interested in the highest resolution data for a short period and more general trends over longer periods, so over time we average the time series data we collect then delete the original values. We actually insert the data twice - once as the actual value and once as part of a sum/count to allow us to calculate the average when we pull the data out later. Depending on the query time range we either read the average or the true values - if the query range is too long then we risk returning too many data points to be plotted. This method also avoids any batch processing so we can provide all the data in real time rather than waiting for a calculation to catch up at some point in the future.
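
A minimal sketch of this dual-write approach (collection and field names are assumptions, not Server Density’s actual schema):

import datetime

from pymongo import MongoClient

db = MongoClient().metrics

def record(item_id, metric, value):
    now = datetime.datetime.utcnow()
    # The raw, high-resolution value (expired later by a TTL index).
    db.values.insert_one({'item': item_id, 'metric': metric, 't': now, 'v': value})
    # A running sum/count per hour, so the average can be computed at read
    # time without any batch processing.
    hour = now.replace(minute=0, second=0, microsecond=0)
    db.hourly.update_one(
        {'item': item_id, 'metric': metric, 'hour': hour},
        {'$inc': {'sum': value, 'count': 1}},
        upsert=True)

record('server-42', 'load_average', 0.73)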

Removal of the data after a period of time is done by using a TTL index. This is set based on surveying our customers to understand how long they want the high resolution data for. Using the TTL index to delete the data is much more efficient than doing our own batch removes and means we can rely on MongoDB to purge the data at the right time.
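
Continuing the sketch above, creating such an index is a one-liner (the 30-day retention period is an assumed value):

# Documents expire 30 days after their 't' timestamp; a background thread in
# mongod removes them automatically.
db.values.create_index('t', expireAfterSeconds=30 * 24 * 3600)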

Inserting and deleting a lot of data can have implications for data fragmentation, but using a TTL index helps because it automatically activates PowerOf2Sizes for the collection, making disk usage more efficient. As of MongoDB 2.6, this storage option will become the default.

5. Take care over query and schema design

The biggest hit on performance I have seen is when documents grow, particularly when you are doing huge numbers of updates. If the document size increases after it has been written then the entire document has to be read and rewritten to another part of the data file with the indexes updated to point to the new location, which takes significantly more time than simply updating the existing document.

As such, it’s important to design your schema and queries to avoid this, and to use the right update modifiers to minimise what has to be transmitted over the network and then applied to the document. A good example of what you shouldn’t do when updating documents is to read the document into your application, update it there, then write it back to the database. Instead, use the appropriate update operators - such as $set, $unset and $inc - to modify documents directly.
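
A sketch of the difference with pymongo (names are hypothetical):

from pymongo import MongoClient

db = MongoClient().app

# Anti-pattern: read the whole document, change it in the application, then
# write the whole thing back.
doc = db.items.find_one({'_id': 'item-1'})
if doc:
    doc['status'] = 'active'
    doc['hits'] = doc.get('hits', 0) + 1
    db.items.replace_one({'_id': 'item-1'}, doc)

# Better: send only the operators and let the server apply them in place.
db.items.update_one({'_id': 'item-1'},
                    {'$set': {'status': 'active'}, '$inc': {'hits': 1}})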

This also means paying attention to the BSON data types and pre-allocating documents, things I wrote about in MongoDB schema design pitfalls.
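
As one hedged example of pre-allocation (a common time-series pattern, not necessarily Server Density’s exact schema), an hourly document can be written up front with zeroed per-minute slots so later updates never grow it:

import datetime

from pymongo import MongoClient

db = MongoClient().metrics
hour = datetime.datetime.utcnow().replace(minute=0, second=0, microsecond=0)

# Pre-allocate the document at its final size...
db.minutely.insert_one({'item': 'server-42', 'hour': hour,
                        'values': {str(m): 0 for m in range(60)}})
# ...so filling in a minute later is an in-place update, not a document move.
db.minutely.update_one({'item': 'server-42', 'hour': hour},
                       {'$set': {'values.17': 0.42}})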

6. Consider network throughput & number of packets

Assuming 100Mbps networking is sufficient is likely to cause you problems, perhaps not during normal operations, but probably when you have some unusual event like needing to resync a secondary replica set member.

When cloning the database, MongoDB is going to use as much network capacity as it can to transfer the data over as quickly as possible before the oplog rolls over. If you’re doing 50-60Mbps of normal network traffic, there isn’t much spare capacity on a 100Mbps connection so that resync is going to be held up by hitting the throughput limits.

Also keep an eye on the number of packets being transmitted over the network - it’s not just the raw throughput that is important. A huge number of packets can overwhelm low quality network equipment - a problem we saw several years ago at our previous hosting provider. This will show up as packet loss and be very difficult to diagnose.

Conclusions

Scaling is an incremental process - there’s rarely one thing that will give you a big win. All of these tweaks and optimisations together help us to perform thousands of write operations per second and get response times within 10ms whilst using a write concern of 1.

Ultimately, all this ensures that our customers can load the graphs they want incredibly quickly. Behind the scenes we know that data is being written quickly, safely and that we can scale it as we continue to grow.

Announcing the MongoDB Bug Hunt 2.6.0-rc0 

Feb 21 • Posted 6 months ago

The MongoDB team released MongoDB 2.6.0-rc0 today and is proud to announce the MongoDB Bug Hunt. The MongoDB Bug Hunt is a new initiative to reward our community members who contribute to improving this MongoDB release. We’ve put the release through rigorous correctness, performance and usability testing. Now it’s your turn. Over the next 10 days, we challenge you to test and uncover any lingering issues in MongoDB 2.6.0-rc0.

How it works

You can download this release at MongoDB.org/downloads. If you find a bug, submit the issue to Jira (Core Server project) by March 4 at 12:00AM GMT. Bug reports will be judged on three criteria: user impact, severity and prevalence.

We will review all bugs submitted against 2.6.0-rc0. Winners will be announced on the MongoDB blog and user forum by March 8. There will be one first place winner, one second place winner and at least two honorable mentions.

The Rewards
First Prize:
  • 1 ticket to MongoDB World — with a reserved front-row seat for keynote sessions
  • $1000 Amazon Gift Card
  • MongoDB Contributor T-shirt
Second Prize:
  • 1 ticket to MongoDB World — with a reserved front-row seat for keynote sessions
  • $500 Amazon Gift Card
  • MongoDB Contributor T-shirt
Honorable Mentions:
  • 1 ticket to MongoDB World — with a reserved front-row seat for keynote sessions
  • $250 Amazon Gift Card
  • MongoDB Contributor T-shirt

How to get started:

  • Deploy in your test environment: Software is best tested in a realistic environment. Help us see how 2.6 fares with your code and data so that others can build and run applications on MongoDB 2.6 successfully.
  • Test new features and improvements: Dozens of new features were added in 2.6. See the 2.6 Release Notes for a full list.
  • Log a ticket: If you find an issue, create a report in Jira. See the documentation for a guide to submitting well written bug reports.

If you are interested in doing this work full time, consider applying to join our engineering teams in New York City, Palo Alto and Austin, Texas.

Happy hunting!

Eliot, Dan and the MongoDB Team

Managing the web nuggets with MongoDB and MongoKit

Sep 27 • Posted 11 months ago

This is a guest post by Nicolas Clairon, maintainer of MongoKit and founder of Elkorado

MongoKit is a Python ODM for MongoDB. I created it in 2009 (when the ODM acronym wasn’t even in use) for my startup project called Elkorado. Now that the service is live, I realize that I never wrote about MongoKit. I’d like to introduce it to you with this quick tutorial based on real use cases from Elkorado.

Elkorado: a place to store web nuggets

Elkorado is a collaborative, interest-based curation tool. It was born out of the frustration that there is no single place to find quality resources about a particular topic of interest. There are so many blogs, forums, videos and websites out there that it is very difficult to find our way through this massive wealth of information.

Elkorado aims to help people centralize quality content, so they can easily find it again later and discover new resources.

MongoDB to the rescue

Rapid prototyping is one of the most important things in the startup world, and it is an area where MongoDB shines.

The web is changing fast, and so are web resources and their metadata. MongoDB’s schemaless design is a perfect fit for storing this kind of data. After losing hair trying to use polymorphism with SQL databases, I moved to MongoDB… and I fell in love with it.

While playing with the data, I needed a validation layer and wanted to add some methods to my documents. Back then, there was no ODM for Python, so I created MongoKit.

MongoKit: MongoDB ODM for Python

MongoKit is a thin layer on top of Pymongo. It brings field validations, inheritance, polymorphism and a bunch of other features. Let’s see how it is used in Elkorado.

Elkorado is a collection of quality web resources called nuggets. This is how we could fetch a nugget discovered by the user “namlook” with Pymongo:
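
A minimal sketch of that (the discovered_by field name is an assumption for illustration):

from pymongo import MongoClient

connection = MongoClient()
nuggets = connection.elkorado.nuggets.find_one({'discovered_by': 'namlook'})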

nuggets here is a regular Python dict.

Here’s a simple nugget definition with MongoKit:
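
Here is a minimal sketch of what such a definition can look like (MongoKit targets Python 2, and the exact fields and the is_popular threshold are assumptions for illustration, not Elkorado’s real schema):

import datetime

from mongokit import Connection, Document

connection = Connection()  # MongoKit's Connection subclasses Pymongo's

@connection.register
class Nugget(Document):
    __database__ = 'elkorado'
    __collection__ = 'nuggets'
    use_dot_notation = True
    structure = {
        'url': unicode,
        'title': unicode,
        'discovered_by': unicode,
        'popularity': int,
    }
    required_fields = ['url']
    default_values = {'popularity': 0}

    def is_popular(self):
        return self['popularity'] > 100  # arbitrary threshold for illustration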

Fetching a nugget with MongoKit is pretty much the same:
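
Continuing the sketch:

nugget = connection.Nugget.find_one({'discovered_by': u'namlook'})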

However, this time, nugget is a Nugget object and we can call the is_popular method on it:
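
For instance:

if nugget is not None and nugget.is_popular():
    print(nugget.title)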

One of the main advantages of MongoKit is that all your models are registered and accessible via the connection instance. MongoKit looks at the __database__ and __collection__ attributes to know which database and which collection to use. This is useful because we only have to specify those variables in one place.

Inheritance

MongoKit was first built to natively support inheritance:
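
A sketch of such a base class (field names assumed):

import datetime

from mongokit import Document

class Core(Document):
    __database__ = 'elkorado'
    use_dot_notation = True
    structure = {
        'created_at': datetime.datetime,
        'updated_at': datetime.datetime,
    }
    default_values = {'created_at': datetime.datetime.utcnow}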

In this Core object, we are defining the database name and some fields that will be shared by other models.

If one wants a Nugget object to have date metadata, one just has to make it inherit from Core:
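
Continuing the sketch, MongoKit merges the structures of the parent classes:

@connection.register
class Nugget(Core):
    __collection__ = 'nuggets'
    structure = {
        'url': unicode,
        'title': unicode,
        'discovered_by': unicode,
        'popularity': int,
    }
    default_values = {'popularity': 0}
    # created_at and updated_at come from Core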

It’s all about Pymongo

With MongoKit, you are still very close to Pymongo. In fact, MongoKit’s connection, database and collection are subclasses of Pymongo’s. If at some point in an algorithm you need raw performance, you can use Pymongo’s layer directly, which is blazing fast:
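
Something along these lines (field names assumed):

# Raw Pymongo access: returns plain dicts, no validation, no model wrapping.
for doc in connection.elkorado.nuggets.find({'popularity': {'$gte': 100}}):
    print(doc['url'])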

Here, connection is a MongoKit connection, but it can be used like a Pymongo connection. Note that to keep the benefit of DRY, we can call the Pymongo layer from a MongoKit document:
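
A sketch of that, reusing the collection the model is bound to:

nugget = connection.Nugget.find_one({'url': u'http://www.mongodb.org'})
# nugget.collection is the underlying collection, so the database and
# collection names don't have to be repeated:
nugget.collection.update({'_id': nugget['_id']}, {'$inc': {'popularity': 1}})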

A real life “simplified” example

Let’s see an example of CRUD done with MongoKit.

On Elkorado, each nugget is unique, but multiple users can share a nugget, each with their own metadata. Each time a user picks up a nugget, a UserNugget is created with user-specific information. If this is the first time the nugget is discovered, a Nugget object is created; otherwise, it is updated. Here is a simplified UserNugget structure:
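
A simplified sketch of that structure (field names and the exact update are assumptions based on the description above, continuing the Core class already defined):

@connection.register
class UserNugget(Core):
    __collection__ = 'user_nuggets'
    structure = {
        'url': unicode,
        'title': unicode,
        'user_id': unicode,
        'topics': [unicode],
    }
    required_fields = ['url', 'user_id']

    def save(self, *args, **kwargs):
        self['updated_at'] = datetime.datetime.utcnow()
        super(UserNugget, self).save(*args, **kwargs)
        # Create the shared nugget the first time this URL is seen, otherwise
        # bump its popularity and merge in the new topics (Pymongo layer,
        # atomic upsert):
        self.collection.database.nuggets.update(
            {'url': self['url']},
            {'$inc': {'popularity': 1},
             '$addToSet': {'topics': {'$each': self['topics'] or []}}},
            upsert=True)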

This example illustrates well what can be done with MongoKit. Here, the save method has been overridden to check whether the nugget already exists (remember, each nugget is unique by its URL). It creates the nugget if it does not already exist, and updates it otherwise.

Updating data with MongoKit is similar to Pymongo: use save on the object, or use the Pymongo layer directly to make atomic updates. Here, the atomic update is the $addToSet/$inc call inside the save method above, which pushes the new topics and increases the popularity.

Getting live

Let’s play with our model:
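
A quick sketch with sample values, continuing the classes defined above:

user_nugget = connection.UserNugget()
user_nugget['user_id'] = u'namlook'
user_nugget['url'] = u'http://www.mongodb.org'
user_nugget['title'] = u'MongoDB'
user_nugget['topics'] = [u'database', u'nosql']
user_nugget.save()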

When calling the save method, the document is validated against the UserNugget structure and, as expected, the created_at and updated_at fields are filled in.

The related nugget is created (or updated) as well.

Conclusion

MongoKit is a central piece of Elkorado. It has been written to be small and minimalist but powerful. There is so much more to say about features like inherited queries, i18n and gridFS, so take a look at the wiki to read more about how this tool can help you.

Check the documentation for more information about MongoKit. And if you register on Elkorado, check out the nuggets about MongoDB. Don’t hesitate to share your nuggets as well - the more the merrier.

The Most Popular Pub Names

Jul 30 • Posted 1 year ago

By Ross Lawley, MongoEngine maintainer and Scala Engineer at 10gen

Earlier in the year I gave a talk at MongoDB London about the different aggregation options with MongoDB. The topic recently came up again in conversation at a user group, so I thought it deserved a blog post.

Gathering ideas for the talk

I wanted to give a more interesting aggregation talk than the standard “counting words in text”, and as the aggregation framework gained shiny 2dsphere geo support in 2.4, I figured I’d use that. I just needed a topic…

What is top of mind for us Brits?

Two things immediately sprang to mind: weather and beer.

Read more

October MongoDB Blogroll and Releases

Nov 5 • Posted 1 year ago

MongoDB Sharding Visualizer

Sep 14 • Posted 2 years ago

We’re happy to share with you the initial release of the MongoDB sharding visualizer. The visualizer is a Google Chrome app that provides an intuitive overview of a sharded cluster. This project provides an alternative to the printShardingStatus() utility function available in the MongoDB shell.

Features

The visualizer provides two different perspectives of the cluster’s state.

The collections view is a grid where each rectangle represents a collection. Each rectangle’s area is proportional to that collection’s size relative to the other collections in the cluster. Inside each rectangle a pie chart shows the distribution of that collection’s chunks over all the shards in the cluster.

The shards view is a bar graph where each bar represents a shard and each segment inside the shard represents a collection. The size of each segment is relative to the other collections on that shard.

Additionally, the slider underneath each view allows rewinding the state of the cluster, so you can select and view the state of the cluster at a specific time.

Installation

To install the plugin, download and unzip the source code from 10gen labs. In Google Chrome, go to Preferences > Extensions, enable Developer Mode, and click “Load unpacked extension…”. When prompted, select the “plugin” directory. Then, open a new tab in Chrome and navigate to the Apps page and launch the visualizer.

Feedback

We very much look forward to hearing feedback and encourage everyone to look at the source code, which is available at https://github.com/10gen-labs/shard-viz.

MongoDB 2.2 Released

Aug 29 • Posted 2 years ago

We are pleased to announce the release of MongoDB version 2.2.  This release includes over 1,000 new features, bug fixes, and performance enhancements, with a focus on improved flexibility and performance. For additional details on the release:

New Features

Aggregation Framework

The Aggregation Framework is available in its first production-ready release as of 2.2. The aggregation framework makes it easier to manipulate and process documents inside of MongoDB, without needing to use Map/Reduce or separate application processes for data manipulation.

See the aggregation documentation for more information.

Additional “Data Center Awareness” Functionality

2.2 also brings a cluster of features that make it easier to use MongoDB in larger, more geographically distributed contexts. The first change is a standardization of read preferences across all drivers and sharded (i.e. mongos) interfaces. The second is the addition of “tag aware sharding,” which makes it possible to ensure that data in a geographically distributed sharded cluster is always closest to the application that will use that data the most.

Improvements to Concurrency

v2.2 eliminates the global lock in the mongod process. Locking is now per database. In addition, a new subsystem avoids locks under most page-fault events; thus concurrency improves even on systems with a single database. Parallelism in the application of writes on secondaries is also enhanced. See this video for more details.

We’re looking forward to your feedback on 2.2. Keep the Jira Issues, blog posts, user group posts, and tweets coming.

- Eliot and the 10gen/MongoDB team


MongoDB Blogroll: The Best of July 2012 

Aug 2 • Posted 2 years ago

Every month, we’ll be publishing the best community blog posts from the month. Here is the digest for July:

Want your blog post to be included in the next update? Tweet it out with the #mongodb hashtag or send it to us directly.

MongoDB at Craigslist: 1 Year Later

May 3 • Posted 2 years ago

Update: watch the video of Jeremy Zawodny and Chris Mooney’s talk on A Year of MongoDB at Craigslist at MongoSF ‘12

Last year, Craigslist moved their archive to MongoDB from MySQL. After the initial set up, we spoke with Jeremy Zawodny, software engineer at Craigslist and the author of High Performance MySQL (O’Reilly), and asked him some questions about their cluster. In advance of their talk at MongoSF tomorrow, we caught up with Jeremy to get the scoop on what’s happening at Craigslist one year later. 

Last time we spoke you were building a MongoDB store for 5 Billion Documents. What do your numbers look like now?

We’re currently approaching the 3 billion mark. The 5 billion number was our target capacity when building the system. Back then we had about 2.5 billion documents that we migrated into MongoDB, and we’ve continued to add documents ever since then.

Read more

MongoDB: Powering the Magic and the Monsters at Stripe

May 2 • Posted 2 years ago

Update: Watch the video of Greg Brockman’s talk on MongoDB for High Availability at MongoSF ‘12

Stripe offers a simple platform for developers to accept online payments. They are a long-time user of MongoDB and have built a powerful and flexible system for enabling transactions on the web. In advance of their talk at MongoSF on MongoDB for high availability, Stripe engineer Greg Brockman spoke with us about what’s going on with MongoDB at Stripe.

Read more

Revamp of MongoDB’s Documentation

May 1 • Posted 2 years ago

We’re revamping MongoDB’s documentation. The new design in the MongoDB Manual has an improved reference section and an index for simplified search. It will also eventually support multiple MongoDB versions at the same time.

This project is a work in progress, and things are changing quickly. Our goal is to consolidate, sharpen, organize, and continue to improve the documentation in support of MongoDB. For now, the new docs will live alongside the original MongoDB Wiki. But over the next few months, we’ll be transitioning everything to the new manual.

In the spirit of open source, the docs are housed on Github. Feedback is welcome! Feel free to fork the repository and issue pull requests. You can also open tickets in JIRA, and we’ll promptly address any suggestions.

Meet Variety, a Schema Analyzer for MongoDB

Apr 27 • Posted 2 years ago

Variety is a lightweight tool which gives a feel for an application’s schema, as well as any schema outliers. It is particularly useful for

• quickly learning how data is structured, if inheriting a codebase with a production data dump

• finding all rare keys in a given collection

An Easy Example

We’ll make a collection, within the MongoDB shell:

db.users.insert({name: "Tom", bio: "A nice guy.", pets: ["monkey", "fish"], someWeirdLegacyKey: "I like Ike!"});
db.users.insert({name: "Dick", bio: "I swordfight."}); 
db.users.insert({name: "Harry", pets: "egret"});
db.users.insert({name: "Geneviève", bio: "Ça va?"}); 

Read more

Grails in the Land of MongoDB

Feb 29 • Posted 2 years ago

Groovy and Grails’ speed and simplicity are a perfect match to the flexibility and power of MongoDB. Dozens of plugins and libraries connect these two together, making it a breeze to get Grooving with MongoDB.

Using Grails with MongoDB

For the purpose of this post, let’s pretend we’re writing a hospital application that uses the following domain class.

class Doctor { 
  String first 
  String last 
  String degree 
  String specialty 
}

There are a few Grails plugins that help communicate with MongoDB, but one of the easiest to use is the one created by Graeme Rocher himself (Grails project lead). The MongoDB GORM plugin allows you to persist all your domain classes in MongoDB. To use it, first remove any unneeded persistence-related plugins after you’ve executed the ‘grails create-app’ command, and install the MongoDB GORM plugin.

Read more

Operations in the New Aggregation Framework

Jan 17 • Posted 2 years ago

Available in the 2.1 development release; it will be stable for production in the 2.2 release.

Built by Chris Westin (@cwestin63)

MongoDB has built-in MapReduce functionality that can be used for complex analytics tasks. However, we’ve found that most of the time, users need the kind of group-by functionality that SQL implementations have. This can be implemented using map/reduce, but doing so is more work than it was in SQL. In version 2.1, MongoDB is introducing a new aggregation framework that will make it much easier to obtain the kind of results SQL group-by is used for, without having to write custom JavaScript.
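
As a hedged illustration with pymongo (a hypothetical collection of page-view documents), a SQL-style GROUP BY becomes a short pipeline:

from pymongo import MongoClient

db = MongoClient().test

# Equivalent of: SELECT url, COUNT(*) AS views FROM pageviews GROUP BY url
pipeline = [
    {'$group': {'_id': '$url', 'views': {'$sum': 1}}},
    {'$sort': {'views': -1}},
]
for doc in db.pageviews.aggregate(pipeline):
    print(doc)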

Read more