An interview with Jeff Yemin, Director of Content Management Systems at MTV Networks and a presenter at the upcoming MongoNYC conference.
What do you work on at MTV Networks?
I manage the backend development of our next-generation content management system.
Tell us how MTV Networks is using MongoDB. What is the size and scope of the CMS application that you have built?
MongoDB is the repository that powers our CMS, which will eventually be used to manage and serve content for all of MTV Networks’ major websites.
Can you tell us why you chose MongoDB?
I have to save something for my actual talk! But briefly, MongoDB is a good fit for a lot of reasons. The schema-less document structure provides a lot of flexibility, while the search capabilities provide a lot more value than you get with key-value stores. At the same time, from an operational perspective, it feels very much like MySQL, so our systems and DBA groups are quite comfortable managing the deployment.
What were some of the challenges you faced using MongoDB for a large enterprise application?
The biggest was probably convincing people that it was a good idea. Like most enterprises that have been around for a while, we have a long history of using relational databases, so this is a pretty big change for us.
What’s next for MongoDB at MTV?
Right now MongoDB is powering the recently re-launched spike.com, and we are going to be rolling it out on many other major sites within the next year, most likely including gametrailers.com, thedailyshow.com, comedycentral.com, nick.com, and numerous international properties. We’ll also be rolling out new MongoDB-based applications for social activity tracking and online polls.
Learn more at Jeff’s talk at MongoNYC on June 7. Register before May 17 to take advantage of early bird pricing, which is only $50.
Over the last year I have seen a significant rise in the number of questions and interest from both the greater Java community and enterprise Java shops about MongoDB. Coming from the MongoDB and Java worlds (among others), this is something I have watched with great interest and excitement.
As one of the authors and project leads for Morphia (MongoDB Java ORM) I have seen a lot of questions relating to both the core driver and how to build Java applications with MongoDB. A lot of these questions arise from the paradigm shift users experience when moving from the standard SQL/JPA/Hibernate platforms/frameworks to the document oriented world of MongoDB.
For the last year or so (how long that seems, looking back), Morphia has bridged the gap between the document/map/dictionary world of the core MongoDB Java driver and the cleaner POJO (domain object) design pattern, which is much friendlier for business logic.
Over the last few months, the Java MongoDB ecosystem has started to drastically evolve. New features in the core MongoDB server continue to distinguish the product from the crowd and new engineers are joining the MongoDB open source community by contributing Java persistence (mappers) frameworks and libraries.
As one of the main authors of Morphia, I envision many of the projection/mapping features which we have implemented being included in the core MongoDB Java driver, similar to what other drivers support (.Net/C#, and dynamic languages).
Even before I joined 10gen I had been in discussions of how to best integrate the user experience from high-level abstractions such as the JPA EntityManager and DAO pattern into the features supplied with the basic driver. Now that I am a member of the team, and have direct access to the driver (and the great team that has been maintaining it internally, along with the numerous community patches), I am finally in a great position to help make that integration possible. I will in fact be in the best position of all to make this vision come true: I get to do the integration :-)
Major Version Changes
Let’s talk a little about the changes coming in the 3.0* Java driver. First, we have some great ideas of how to improve overall performance for frameworks; these ideas primarily have come from the need to reduce the impedance mismatch of the current driver when being used from frameworks like Morphia, Casbah, and many others.
One of the performance improvements we have in mind will remove the need to decode binary BSON into a temporary Java map (DBObject) before converting it to the final format. In some very high-volume and real-time analytics deployments, this will be an amazing performance improvement. In addition to reducing latency, the change will also reduce resource allocation.
A little code:
With Morphia, and soon the driver, you can do things like this:
@Entity
class Customer {
    @Id ObjectId id;
    String name;
    String[] projects;
}
Save, and find, a single Model/Entity
Datastore ds = …;
ds.save(new Customer("Scott Hernandez",
    new String[]{"java", "morphia", "casbah"}));
// QBE (query by example)
Customer scott = ds.find(Customer.class, new Customer("Scott Hernandez")).get();
// Iterate over all customers
for (Customer customer : ds.find(Customer.class))
    print(customer);
// Use the fluent query interface
Query<Customer> findMe = ds.find(Customer.class).field("projects").equal("java");
scott = findMe.get(); // The query is validated by your Java class
scott.name = "Scott D. Hernandez"; // change a field
ds.save(scott); // save the whole object
// Update one field, without needing the full customer object
UpdateOperations<Customer> updates = ds.createUpdateOperations(Customer.class);
updates.add("projects", "support");
UpdateResults<Customer> res = ds.updateFirst(findMe, updates);
Coming very soon we will have better support for mock frameworks and testing isolation by providing interfaces to the major components in the driver. Many (open source) projects built on the driver already have support for mock and isolated testing so having the driver support is a great benefit for all.
Compatibility
On the path to these changes we have many other milestones to hit as well. Some of these are internal changes while others will have some outwardly facing impact. The premier goal we have is to make sure that the MongoDB Java driver stays backwards compatible; this will reduce the need for existing applications and deployments to make any changes as we make improvements. There will be some changes that might not fit in this goal but we will keep them as small as possible and highlighted in the release notes.
Java Web Apps Webinar, May 12, 1:30p ET/ 10:30a PT
We will be hosting an online webinar in about a week to talk about one way to build web apps using Java and some common open source frameworks (Guice, Stripes, and Morphia).
You can learn more and register here: http://www.10gen.com/webinars/javawebapps
MongoSF is a one day conference dedicated to MongoDB on May 24th. There will be loads of talks, including a bunch about MongoDB with Java and Morphia. Find out more about the event and registration at www.mongosf.com. Register soon, because it’s nearly sold out already.
MongoNYC will be held on June 7th and will contain a very similar number of tracks for those on the other coast.
More Blogs
This is just the start of the blogging about the changes coming to the Java driver. There will be many more in-depth posts about parts of the driver and interface changes.
Feedback
I can be contacted via email as scott -at- 10gen (dotcom).
Please feel free to submit patches or file issues:
http://github.com/mongodb/mongo-java-driver
http://jira.mongodb.org/browse/Java
—Scott
In light of last week’s EBS issues, we wanted to make sure MongoDB users on EBS are configured to be as robust as possible.
A basic setup would consist of a 3 node replica set. The nodes would be roughly laid out like this:
* A: us-east-1a priority 1
* B: us-east-1b priority 1
* C: us-west-1a priority 0

During steady state, either A or B would be primary. If the primary went down for any reason (system crash, or loss of one availability zone) the other node in us-east would take over. This is guaranteed because the west coast node has a priority of 0.
If instead the entire east coast region were lost, then you would still have a full copy of data on C. If you decided that you were going to make the west coast your primary data center for the duration, you would just bring up a couple more nodes there, and make a new replica set with the data from C.
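The failover behavior described above can be sketched in a few lines of Python. This is only a toy model, not driver or server code: the `MEMBERS` list and the `primary_candidates` helper are made up for illustration, and the election rules are reduced to two facts (a primary needs a majority of the set to be reachable, and a priority 0 member never becomes primary).

```python
# Hypothetical model of the three-node layout described above; the host
# names and this helper are illustrative, not part of any MongoDB API.
MEMBERS = [
    {"host": "A", "zone": "us-east-1a", "priority": 1},
    {"host": "B", "zone": "us-east-1b", "priority": 1},
    {"host": "C", "zone": "us-west-1a", "priority": 0},
]


def primary_candidates(members, down):
    """Return hosts that could become primary, given the set of hosts that are down.

    Simplified rules: a primary needs a strict majority of the set to be
    reachable, and a member with priority 0 is never eligible.
    """
    up = [m for m in members if m["host"] not in down]
    if len(up) * 2 <= len(members):
        return []  # no majority, so no primary can be elected
    return [m["host"] for m in up if m["priority"] > 0]


print(primary_candidates(MEMBERS, down=set()))       # steady state: ['A', 'B']
print(primary_candidates(MEMBERS, down={"A"}))       # one east zone lost: ['B']
print(primary_candidates(MEMBERS, down={"A", "B"}))  # whole east region lost: []
```

The last case is why losing the whole east coast means building a new replica set from C's data rather than waiting for an automatic failover.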
More information about running MongoDB on ec2 is available from our recent ec2 webinar. We are big fans of cloud computing in general and want MongoDB to only get more and more cloud-friendly over time.
-Eliot
Last week, VMware launched Cloud Foundry: an open-source platform as a service. It’s pretty radical in that not only can you run your apps on infrastructure operated by VMware, you can also download Cloud Foundry itself and run it on your own machines!
But what’s most awesome about Cloud Foundry is that it supports MongoDB right out of the box! In today’s post, we’re going to walk through the creation of a Rails application using MongoDB and Cloud Foundry.
Here’s what we’re going to need to do:
$ rails new my_app --skip-active-record
source "http://rubygems.org"
gem "rails", "3.0.5"
gem "mongo_mapper"
gem "bson_ext"
I’m assuming you’re using Rails 3 here, but you can easily adapt these instructions for other versions.
$ script/rails generate scaffold messages message:string --orm mongo_mapper
I’m also going to set the root of my Rails app to be our new messages controller and remove our public/index.html. Here’s my routes file:
CloudFoundryRailsTutorial::Application.routes.draw do
  resources :messages
  root :to => "messages#index"
end
Be sure to delete your public/index.html!
The mongo_mapper:config generator creates a config/mongo.yml configuration file for us.
$ script/rails generate mongo_mapper:config
The file looks like this:
defaults: &defaults
  host: 127.0.0.1
  port: 27017

development:
  <<: *defaults
  database: myapp_development

test:
  <<: *defaults
  database: myapp_test

# set these environment variables on your prod server
production:
  <<: *defaults
  database: myapp
  username: <%= ENV['MONGO_USERNAME'] %>
  password: <%= ENV['MONGO_PASSWORD'] %>
We need to modify this so that it can talk to Cloud Foundry’s infrastructure. When Cloud Foundry runs your app, it passes in a bunch of information through an environment variable. We need to pull the host, port, username, and password for MongoDB out of this environment variable. After some modification, the production section of your config/mongo.yml should look something like this:
production:
  host: <%= JSON.parse( ENV['VCAP_SERVICES'] )['mongodb-1.8'].first['credentials']['hostname'] rescue 'localhost' %>
  port: <%= JSON.parse( ENV['VCAP_SERVICES'] )['mongodb-1.8'].first['credentials']['port'] rescue 27017 %>
  database: <%= JSON.parse( ENV['VCAP_SERVICES'] )['mongodb-1.8'].first['credentials']['db'] rescue 'cloud_foundry_mongodb_tutorial' %>
  username: <%= JSON.parse( ENV['VCAP_SERVICES'] )['mongodb-1.8'].first['credentials']['username'] rescue '' %>
  password: <%= JSON.parse( ENV['VCAP_SERVICES'] )['mongodb-1.8'].first['credentials']['password'] rescue '' %>
Note: The “rescue” clauses are so that you can run this app locally. If you don’t include them and you try to run this app outside of Cloud Foundry, you’ll get an exception because there’s no VCAP_SERVICES environment variable passed into your app.
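If it helps to see the same lookup outside of ERB, here is a rough Python sketch of what the production section does: parse the VCAP_SERVICES JSON, pull the credentials out of the first 'mongodb-1.8' entry, and fall back to local defaults when the variable is missing. The `mongo_settings` function name and the sample credential values are made up for illustration; only the key names come from the config above.

```python
import json


def mongo_settings(environ):
    """Extract MongoDB settings from a VCAP_SERVICES-style environment mapping.

    Falls back to local defaults when the variable is missing or malformed,
    playing the same role as the `rescue` clauses in the YAML.
    """
    defaults = {
        "hostname": "localhost",
        "port": 27017,
        "db": "cloud_foundry_mongodb_tutorial",
        "username": "",
        "password": "",
    }
    try:
        services = json.loads(environ["VCAP_SERVICES"])
        credentials = services["mongodb-1.8"][0]["credentials"]
        return {key: credentials.get(key, fallback) for key, fallback in defaults.items()}
    except (KeyError, IndexError, TypeError, ValueError, AttributeError):
        return defaults


# Invented sample payload, shaped like the structure the config above reads.
sample_env = {
    "VCAP_SERVICES": json.dumps({
        "mongodb-1.8": [{"credentials": {
            "hostname": "10.0.0.5", "port": 25001, "db": "db",
            "username": "user", "password": "secret",
        }}]
    })
}

print(mongo_settings(sample_env)["hostname"])  # 10.0.0.5
print(mongo_settings({})["hostname"])          # localhost
```

In a real app you would pass `os.environ` instead of a hand-built dictionary; taking the mapping as an argument just makes the fallback easy to try locally.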
Finally, deploy the app with the vmc command line tool. There’s a getting started with VMC guide here.
Here’s what it looked like when I deployed my app:
redeye:myapp jsr$ vmc push --runtime ruby19
Would you like to deploy from the current directory? [Yn]: y
Application Name: mongodb-on-cf-demo
Application Deployed URL: 'mongodb-on-cf-demo.cloudfoundry.com'?
Detected a Rails Application, is this correct? [Yn]: y
Memory Reservation [Default:256M] (64M, 128M, 256M or 512M) 256M
Creating Application: OK
Would you like to bind any services to 'mongodb-on-cf-demo'? [yN]: y
Would you like to use an existing provisioned service [yN]? n
The following system services are available::
1. mysql
2. mongodb
3. redis
Please select one you wish to provision: 2
Specify the name of the service [mongodb-a8a43]:
Creating Service: OK
Binding Service: OK
Uploading Application:
  Checking for available resources: OK
  Processing resources: OK
  Packing application: OK
  Uploading (5K): OK
Push Status: OK
Starting Application: OK
redeye:myapp jsr$
Now you can point your browser to http://mongodb-on-cf-demo.cloudfoundry.com/ and you should see the list of messages!
— Jared Rosoff
10gen is happy to announce support for the official C# driver for MongoDB. Several preview releases have already been made available, and the latest, Version 0.11, was released January 25, 2011. Version 1.0 has just been released and includes support for the new features in MongoDB 1.8.
The official C# driver is designed to be fast and efficient, is fully supported by 10gen, and will have full support for new MongoDB features as new server versions are released. Version 1.0 is compatible with Visual Studio 2008 and 2010 and .NET 3.5. The driver is well suited for use in high-load environments, as the main classes are all thread safe and connections are efficiently and automatically managed by a connection pool. The driver can connect directly to a particular server, or can connect to a replica set and automatically find the current primary (as well as automatically roll over to a new primary as needed). SafeMode can be enabled to automatically call getLastError after every update operation to check for errors. Full serialization support for C# classes is provided to make integrating your domain model classes with MongoDB easy (including keeping your domain classes persistence ignorant if desired). The serialization support provides several ways for you to customize serialization of particular classes or data types should your classes require special handling. Interoperability with JavaScript and other languages is enhanced by the included JSON reader and writer. GridFS support is provided to deal with large documents (or files) that are too big to store as a single BSON document.
More information about the official C# driver is available at:
http://www.mongodb.org/display/DOCS/CSharp+Language+Center
A webinar, “Introduction to the New Official C# Driver Developed by 10gen”, is available at:
http://www.10gen.com/webinars/csharp
We are excited to be officially supporting the .NET community with this new driver.
-Robert Stam
We are happy to announce that MongoDB v1.8.0 is now available. 1.8 is the stable follow-up release to 1.6, which came out in August of 2010. Version 1.8 introduces many new features, along with bug fixes and other improvements. Some of the highlights:
mongostat --discover
The state of Ruby and MongoDB is strong. In this post, I’d like to describe some of the recent developments in the Ruby driver and provide a few notes on Rails and the object mappers in particular.
We just released v1.2 of the MongoDB Ruby driver. This release is stable and supports all the latest features of MongoDB. If you haven’t been paying attention to the driver’s development, the Cliff’s Notes are below. (Note that if you’re using an older version of the driver, you owe it to your app to upgrade.)
If you’re totally new to the driver, you may want to read Ethan Gunderson’s excellent post introducing it before continuing on.
There are now two connection classes: Connection and ReplSetConnection. The first simply creates a connection to a single node, primary or secondary. But you probably already knew that.
The ReplSetConnection class is brand new. It has a slightly different API and must be used when connecting to a replica set. To connect, initialize the ReplSetConnection with a set of seed nodes followed by any connection options.
You can pass the replica set’s name as a kind of sanity check, ensuring that each node connected to is part of the same replica set.
If you’re running replica sets (and why wouldn’t you be?), then you’ll first want to make sure you connect with the ReplSetConnection class. Why? Because this class facilitates discovery, automatic failover, and read distribution.
Discovery is the process of finding the nodes of a set and determining their roles. When you pass a set of seed nodes to the ReplSetConnection class, you may not know which is the primary node. The driver will find that node and ensure that all writes are sent to it. In addition, the driver will discover any other nodes not specified as seeds and then cache those for failover and, optionally, read distribution.
Failover works like this. Your application is humming along when, for whatever reason, the primary member of the replica set goes down. So subsequent operations will fail, and the driver will raise the Mongo::ConnectionFailure exception until the replica set has successfully elected a new primary.
We’ve decided that connection failures shouldn’t be handled automatically by the driver. However, it’s not hard to achieve the oft-sought seamless failover. You simply need to make sure that 1) all writes use safe mode and 2) that all operations are wrapped in a rescue block. Details on just how to do that can be found in the replica set docs.
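To make the "wrap it in a rescue block" idea concrete, here is a language-neutral sketch in Python rather than Ruby. Nothing here is driver code: `ConnectionFailure` is a local stand-in for the driver's exception, and `with_failover_retry`, its retry count, and its delay are assumptions chosen for illustration.

```python
import time


class ConnectionFailure(Exception):
    """Stand-in for the driver's connection error (Mongo::ConnectionFailure in Ruby)."""


def with_failover_retry(operation, retries=5, delay=0.5, sleep=time.sleep):
    """Run `operation`, retrying while the replica set elects a new primary."""
    for attempt in range(retries + 1):
        try:
            return operation()
        except ConnectionFailure:
            if attempt == retries:
                raise  # give up: the set did not recover in time
            sleep(delay)


# Simulated operation that fails twice (primary down), then succeeds.
calls = {"count": 0}


def flaky_insert():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionFailure("primary is down")
    return "ok"


print(with_failover_retry(flaky_insert, sleep=lambda seconds: None))  # ok
```

Retrying a write blindly is only reasonable if you know whether the first attempt landed, which is why the advice above pairs the rescue block with safe mode.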
Finally, we should mention read distribution. For certain read-heavy applications, it’s useful to distribute the read load to a number of slave nodes, and the driver now facilitates this.
With :read_secondary => true, the connection will send all reads to an arbitrary secondary node. When running Ruby in production, where you’ll have a whole bunch of Thins and Mongrels or forked workers (à la Unicorn and Phusion), you should get a good distribution of reads across secondaries.
Write concern is the term we use to describe safe mode and its options. For instance, you can use safe mode to ensure that a given write blocks until it’s been replicated to three nodes by specifying :safe => {:w => 3}. For example:
That gets verbose after a while, which is why the Ruby driver supports setting a default safe mode on the Connection, DB, and Collection levels as well. For instance:
Now, the insert will still use safe mode with w equal to 3, but it inherits this setting through the @con, @db, and @collection objects. A few more details on this can be found in the write concern docs.
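The inheritance itself is easy to picture with a toy model. The Python below is not the Ruby driver's implementation; the `Node` class is invented for illustration, and the rule that a per-operation option takes precedence over inherited defaults is an assumption of this sketch.

```python
class Node:
    """Minimal model of a Connection, DB, or Collection that may carry a default safe mode."""

    def __init__(self, parent=None, safe=None):
        self.parent = parent
        self.safe = safe

    def effective_safe(self, override=None):
        """Resolve safe mode: per-call override, then this level, then ancestors."""
        if override is not None:
            return override
        node = self
        while node is not None:
            if node.safe is not None:
                return node.safe
            node = node.parent
        return False  # no safe mode set anywhere


con = Node(safe={"w": 3})        # default set once, on the connection
db = Node(parent=con)
collection = Node(parent=db)

print(collection.effective_safe())          # {'w': 3}, inherited from the connection
print(collection.effective_safe({"w": 1}))  # {'w': 1}, per-operation override
```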
One of the most exciting advances in the last few months is the driver’s special support for JRuby. Essentially, when you run the driver on JRuby, the BSON library uses a Java-based serializer, guaranteeing the best performance for the platform.
One of the big advantages to running on JRuby is its support for native threads. So if you’re building multi-threaded apps, you may want to take advantage of the driver’s built-in connection pooling. Whether you’re creating a standard connection or a replica set connection, simply pass in a size and timeout for the thread pool, and you’re good to go.
Another relevant feature that’s slated for the next month is an asynchronous facade for the driver that uses the reactor pattern. (This has been spearheaded, and is in fact used in production, by Chuck Remes. Thanks, Chuck!). You can track progress at the async branch.
Finally, a word about Rails and object mappers. If you’re a Rails user, then there’s a good chance that you don’t use the Ruby driver directly at all. Instead, you probably use one of the available object mappers.
The object mappers can be a great help, but do be careful. We’ve seen a number of users get burned because they don’t understand the data model being created. So the biggest piece of advice is to understand the underlying representation being built out by your object mapper. It’s all too easy to abuse the nice abstractions provided by the OMs to create unwieldy, inefficient mega-documents down below. Caveat programator.
That said, I get a lot of questions about which OM to use. Now, if you understand how the OM actually works, then it really shouldn’t matter which one you use. But not everyone has the time to dig into these code bases. So when I do recommend one, I recommend MongoMapper. This is, admittedly, a bit of an aesthetic judgment, but I like the API and have found the software to be simple and reliable. Long-awaited docs for the projects are imminent, and we’ll tweet about them once they’re available.
If you want to know more about the Ruby driver, tune in to next week’s Ruby driver webcast, where I’ll talk about everything in the post, plus some.
Finally, a big thanks to all those who have contributed to the driver, to the object mapper authors, and to all the users of MongoDB with Ruby.
- Kyle Banker
Here’s a rundown of some of the most useful features added recently. These are all available in 1.7.4 and will, of course, be in 1.8.
Initial sync from a secondary
You can now set an initialSync source for each member, which controls where the new member will sync from. For example, if you wanted to add a new node and force it to sync from a secondary, you could do:
> rs.add({"_id" : num, "host" : hostname, "initialSync" : {"state" : 2}})
You can choose a sync source by its state (primary or secondary), _id, hostname, or up-to-date-ness. For the last, you can specify a date or timestamp and the new member will choose a source that is at least that up-to-date.
By default, a new member will attempt to sync from a random secondary and, if it can’t find one, sync from the primary. If it chooses a secondary, it will only use the secondary for its initial sync. Once it’s ready to sync new data, it will switch over and use the primary for “normal” syncing.
Slave delay
This option makes the slave postpone replaying operations from the master. The delay can be specified in seconds in a member’s configuration:
> rs.add({"_id" : num, "host" : hostname, "slaveDelay" : 3600})
Hidden
Hidden servers won’t appear in isMaster() results. This also means that they will not be used if a driver automatically distributes reads to slaves. A hidden server must have a priority of 0 (you can’t have a hidden primary). To add a hidden member, run:
> rs.add({"_id" : num, "host" : hostname, "priority" : 0, "hidden" : true})
Freeze a member
Replica set members abhor a vacuum and will immediately try to elect themselves if the primary disappears. This can make maintenance or a planned failover difficult. Freezing a member forces it to remain a secondary for a given number of seconds (defaults to 60). This can be useful if you want to do some maintenance on the primary and don’t want any usurpers jumping in, or if you want to force a certain member to become the new primary. To freeze a member, run:
> rs.freeze(3600)
To unfreeze at any time:
> rs.freeze(0)
Fast sync
If you have a backup that’s reasonably up-to-date, you can bring up a new member quickly with a fast sync. Start the new member with --dbpath set to your backup and --fastsync (as well as --replSet, --oplogSize, and whatever else you usually specify). Instead of copying all of the data from the master, it will just replay the latest operations.
You can check if a backup is recent enough to fast sync by connecting to the primary and running:
> use local
> new Date(db.oplog.rs.find().sort({$natural:1}).limit(1).next()["ts"]["t"])
If your backup is from after the date displayed, you can catch up to the master using fast sync. If not, you’ll need to resync from scratch.
There are lots more replica set features coming soon: authentication, syncing from secondaries beyond the initial sync, data center awareness, and more. If there are any features you’d particularly like to see, be sure to vote on the cases you care about.
Someone recently pointed out to me, rather insightfully, that MongoDB is a good fit for archival of relational data.
I had not really considered this before, but it is a good point: flexible schemas are very helpful for archival. How do we keep an archive of, say, 10 years or more of data history, when over that time period the schema will undergo significant changes? It is not so easy.
One approach would be to apply any schema changes from the online / operational database at the archival database too. However, there are some issues. First, the archival database may be huge, making schema migrations impractical. But more importantly, these changes may not be what we want in an archive. Imagine we decide to drop a column in the online db. It may now be deprecated and unneeded. However, a true and complete archive would still have that data. Dropping the column in the archive is not what we want.
Document-oriented databases, with their flexible schemas, provide a nice solution. We can have older documents which vary a bit from the newer ones in the archive. The lack of homogeneity over time may mean that querying the archive is a little harder. However, keeping the data is potentially much easier.
—dm
MongoDB 1.6.0 is the fourth stable major release (even numbers are “stable”: 1.0, 1.2, 1.4, …) and is the culmination of the 1.5 development series.
Scale-out
The focus of the 1.6 release is scale-out. Sharding is now production-ready. The combination of sharding and replica sets allows one to build out horizontally scalable data storage clusters with no single points of failure.
A single instance of mongod can be upgraded to a distributed cluster with zero downtime when the need arises.
A big thanks to all the 1.5.x beta testers of sharding (including foursquare and bit.ly who have been using sharding in production for a while now).
Replica Sets
Replica sets allow you to set up a high-availability cluster with automatic failover and recovery. Replica pair users should, when convenient, migrate to replica sets.
Other Improvements in v1.6
Downloads: http://www.mongodb.org/display/DOCS/Downloads
Release Notes: http://www.mongodb.org/display/DOCS/1.6+Release+Notes
Please report any issues to http://groups.google.com/group/mongodb-user (support forums) or http://jira.mongodb.org/ (bug/feature db).
What’s Next
Now that 1.6 is out, we’re going to be focusing on 1.8. Help us prioritize features for this release by voting for your key needs at jira.mongodb.org. The #1 feature queued for v1.8 is single server durability.
More Information
Please join 10gen CEO and Co-Founder Dwight Merriman for the webinar What’s New in MongoDB v1.6 on Tuesday, August 10 at 12:30pm ET / 9:30am PT.
Node.js is turning out to be a framework of choice for building real-time applications of all kinds, from analytics systems to chat servers to location-based tracking services. If you’re still new to Node, check out Simon Willison’s excellent introductory post. If you’re already using Node, you probably need a database, and you just might have considered using MongoDB.
The rationale is certainly there. Working with Node’s JavaScript means that MongoDB documents get their most natural representation — as JSON — right in the application layer. There’s also significant continuity between your application and the MongoDB shell, since the shell is essentially a JavaScript interpreter, so you don’t have to change languages when moving from application to database.
Node.js MongoDB Driver
Especially impressive to us at 10gen has been the community support for Node.js and MongoDB. First, there’s Christian Kvalheim’s excellent mongodb-node-native project, a non-blocking MongoDB driver implemented entirely in JavaScript using Node.js’s system libraries. The project is a pretty close port of the MongoDB Ruby driver, making for an easy transition for those already used to the 10gen-supported drivers. If you’re just starting, there’s a helpful mongodb-node-native mailing list.
Hummingbird
Need a real-world example? Check out Hummingbird, Michael Nutt’s real-time analytics app. It’s built on top of MongoDB using Node.js and the mongodb-node-native driver. Hummingbird, which is used in production at Gilt Groupe, brings together an impressive array of technologies; it uses the express.js Node.js app framework and sports a responsive interface with the help of web sockets. Definitely worth checking out.
Mongoose
Of course, one of the admitted difficulties in working with Node.js is dealing with deep callback structures. If this poses a problem, or if you happen to want a richer data modeling library, then Mongoose is the answer. Created by Learnboost, Mongoose sits atop mongodb-node-native, providing a nice API for modeling your application.
Node Knockout
All of this just to show that the MongoDB/Node.js ecosystem thrives. If you need a good excuse to jump into Node.js or MongoDB development, be sure to check out next month’s Node Knockout. It’s a weekend app competition for teams up to four, and registration is now open.
We’re pleased to announce the winners of the MongoDB blogging contest!
Grand Prize
Runners Up
The winners should contact meghan@10gen.com to claim their prizes.
You can check out all the awesome entries at mongodb.slinkset.com.
Thanks to everyone who submitted!
On May 21, 10gen organized the second conference dedicated to MongoDB. Like MongoSF, MongoNYC included a great line-up of speakers. One of the more popular talks was Kyle Banker’s Schema Design session, which was so crowded that many attendees sat on the floor! Both the video and slides from the talk are now available.
Also interesting were the many talks on MongoDB production deployments. Kushal Dave, the CTO at Chartbeat, gave an excellent talk on how Chartbeat came to use MongoDB after trying many solutions to store historical data analytics (see slides & video). Jay Ridgeway talked about bit.ly user history, which is auto-sharded using MongoDB (slides & video). Gilt Groupe demoed their real-time analytics tool Hummingbird, which is built with MongoDB and node.js. Avery Rosen, the CTO of ShopWiki, wrote a recap of his talk “Finding a Swiss army data store” on the ShopWiki dev blog.
Another big hit was Harry Heymann’s presentation on MongoDB at foursquare, the video of which is included below.
Videos from all the talks at MongoNYC are available at mongodb.blip.tv.
Thanks for making the event such a success! We’re getting really excited about MongoUK and MongoFR, which are only a few weeks away.

Valentin Kuznetsov just presented a paper at the International Conference on Computational Science on CERN’s use of MongoDB for Large Hadron Collider data. The paper, The CMS Data Aggregation System, is available as a PDF at ScienceDirect.
“CMS” stands for Compact Muon Solenoid, a general-purpose particle physics detector built on the Large Hadron Collider. The CMS project posted a few comics which provide a nice, simple (if somewhat cheesy) explanation of what the CMS/LHC does.
The LHC generates massive amounts of data of all different varieties, which is distributed across a worldwide grid. It sends status messages to some of the computers, job monitoring info to other computers, bookkeeping info still elsewhere, and so on.
This means that each location has specialized queries it can do on the data it has, but up until now it’s been very difficult to query across the whole grid. Enter the Data Aggregation System, designed to allow anything to be queried across all of the machines.
The aggregation system uses MongoDB as a cache. It checks whether Mongo has the aggregation the user is asking for and, if it does, returns it; otherwise, the system does the aggregation and saves it to Mongo.
They query the system using a simple, SQL-like language which they transform into a MongoDB query. So, something like file="abc", run>10 becomes {"file" : "abc", "run" : {"$gt" : 10}}. (It’s not the same as SQL, but the code for this might be interesting to people who want to use SQL queries with MongoDB.)
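For a feel of what such a translation involves, here is a small Python sketch. It is not the DAS code: the `to_mongo_query` function, the supported operator set, and the comma-separated grammar are assumptions based only on the one example above, and the paper's actual query language may differ.

```python
import re

# Operators assumed for this sketch; the real grammar may support more.
OPERATORS = {">=": "$gte", "<=": "$lte", ">": "$gt", "<": "$lt"}
CONDITION = re.compile(r'^\s*(\w+)\s*(>=|<=|=|>|<)\s*(.+?)\s*$')


def parse_value(text):
    """Turn a quoted string into str and a bare number into int or float."""
    if len(text) >= 2 and text[0] == text[-1] == '"':
        return text[1:-1]
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text  # leave anything else as a plain string


def to_mongo_query(expression):
    """Translate 'key=value, key>value' into a MongoDB-style query document.

    Limitation: conditions are split on commas, so quoted values that
    contain commas are not handled.
    """
    query = {}
    for part in expression.split(","):
        match = CONDITION.match(part)
        if match is None:
            raise ValueError("cannot parse condition: %r" % part)
        key, op, raw = match.groups()
        value = parse_value(raw)
        if op == "=":
            query[key] = value
        else:
            query.setdefault(key, {})[OPERATORS[op]] = value
    return query


print(to_mongo_query('file="abc", run>10'))
# {'file': 'abc', 'run': {'$gt': 10}}
```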
If the cache does not contain the requested query, the system iterates over all of the places in the world that could have this information and queries them, gathering their results. It then merges all of the results, doing a sort of “group by” operation based on a predefined identifying key, and inserts the aggregated information into the cache.
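The overall flow (check the cache, otherwise fan out, merge by key, store) can be sketched as follows. This is an illustration only: a plain Python dict stands in for the MongoDB cache, and the `cached_query` and `merge_by_key` helpers and the two sample sources are invented, not taken from the DAS code.

```python
import json


def merge_by_key(result_sets, key):
    """Merge records from several sources, combining those that share `key`."""
    merged = {}
    for records in result_sets:
        for record in records:
            merged.setdefault(record[key], {}).update(record)
    return list(merged.values())


def cached_query(query, cache, sources, key):
    """Return cached results for `query`, or fetch, merge, and cache them."""
    cache_id = json.dumps(query, sort_keys=True)  # stable key for the query document
    if cache_id in cache:
        return cache[cache_id]
    results = merge_by_key([source(query) for source in sources], key)
    cache[cache_id] = results
    return results


# Hypothetical data services, each holding a different slice of the data.
def bookkeeping(query):
    return [{"file": "abc", "size": 10}]


def monitoring(query):
    return [{"file": "abc", "status": "ok"}]


cache = {}
print(cached_query({"file": "abc"}, cache, [bookkeeping, monitoring], key="file"))
# [{'file': 'abc', 'size': 10, 'status': 'ok'}]
```

A second call with the same query is answered from `cache` without touching either source, which is the point of putting MongoDB in front of the grid.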
It was built using the Python driver.
They’re looking forward to field testing it and horizontally scaling the system with sharding. As this is a general grid aggregation/querying tool, they’re also interested in applying it to problems outside of the LHC and CERN.
We wish them luck and hope they’ll keep us informed on future progress!
Edit: the slides from Valentin’s presentation are available at http://www.slideshare.net/vkuznet/das-iccs-2010.
Kristina Chodorow maintains the MongoDB PHP and Perl drivers. She blogs at www.snailinaturtleneck.com and tweets as @kchdorow.
10gen has a ticket to OSCON that we’d like to give to a MongoDB user.
How to Enter
Prizes
Grand prize
There will also be 3 runners up who get MongoDB mugs and stickers, as well as mentions/links on blog.mongodb.org.
Judging
Rules
If you don’t have a blog, you can get one in about 3 seconds from Posterous or tumblr. Make sure you’re identifiable if you go this route!
Good luck everyone!