Operations in the New Aggregation Framework

Jan 17 • Posted 2 years ago

Available in 2.1 development release. Will be stable for production in the 2.2 release

Built by Chris Westin (@cwestin63)

MongoDB has built-in MapReduce functionality that can be used for complex analytics tasks. However, we’ve found that most of the time, users need the kind of group-by functionality that SQL implementations have. This can be implemented using map/reduce, but doing so is more work than it was in SQL. In version 2.1, MongoDB is introducing a new aggregation framework that will make it much easier to obtain the kind of results SQL group-by is used for, without having to write custom JavaScript.

MongoDB provides an easy way to store documents that aren’t just flat records like SQL, but that also contain arrays and nested documents. Another common problem users have told us about is that it can be difficult to manipulate these sub-parts of documents easily. The new aggregation mechanism is also taking on making this easier, by providing operators that can be used to disassemble and re-assemble document sub-parts.

After many requests, we have also included expression evaluation to return computed values—for virtual fields.

The new aggregation framework is declarative. Rather than writing code to aggregate data, you specify a chain of operations, called a “pipeline,” to apply to your documents. This is similar to a pipe in a unix shell.

We’re now going to show you a few short examples (using the mongo shell) of this new functionality to get you started.

$match is a pipeline operator that filters out documents that don’t match the specified condition. For example

db.runCommand({ aggregate : "article", pipeline : [ { $match : { author : "dave" } } ]}); 

This very simple one step pipeline doesn’t introduce any new functionality yet, but shows how to put a query at the beginning of your pipeline. For those of you familiar with MongoDB, this is the equivalient of

> db.article.find({ author : "dave" }); 

$project is a pipeline operator that let’s you select which fields you want to include or exclude from a result. You can also create new virtual fields from computed values. Here is a simple example, building on the previous one:

db.runCommand({ aggregate : "article", pipeline : [ { $match : { author : "dave" } }, { $project : { _id : 0, author : 1, tags : 1 }} ]}); 

As before, this doesn’t introduce anything radically new; this is the equivalent of

> db.article.find({ author : "dave" }, { _id : 0, author : 1, tags : 1); 

While this functionality was available before from find(), what it does demonstrate here is how to put together more than one aggregation pipeline operator. The value of pipeline is an array consisting of the operators to apply. Conceptually, the aggregation framework behaves as if it was feeding the result of scanning the target collection through that pipeline.

Now let’s take a look at some new functionality that wasn’t available before.

$unwind hands out the elements of an array one at a time; each element is surrounded by the document that contains the original array. For this example, we’ll first show what a document from the article collection looks like. Suppose we had saved this:

db.article.save( { title : "this is your title" , author : "dave" , posted : new Date(4121381470000) , pageViews : 7 , tags : [ "fun" , "nasty" ] , comments : [ { author :"barbara" , text : "this is interesting" } , { author :"jenny" , text : "i like to play pinball", votes: 10 } ], other : { bar : 14 } }); 

Now, we can unwind the tags array as follows:

db.runCommand({ aggregate : "article", pipeline : [ { $unwind : "$tags" } ]}); 

The result looks like this:

{ "result" : [ { "_id" : ObjectId("4eeeb5fef09a7c9170df094b"), "title" : "this is your title", "author" : "dave", "posted" : ISODate("2100-08-08T04:11:10Z"), "pageViews" : 7, "tags" : "fun", "comments" : [ { "author" : "barbara", "text" : "this is interesting" }, { "author" : "jenny", "text" : "i like to play pinball", "votes" : 10 } ], "other" : { "bar" : 14 } }, { "_id" : ObjectId("4eeeb5fef09a7c9170df094b"), "title" : "this is your title", "author" : "dave", "posted" : ISODate("2100-08-08T04:11:10Z"), "pageViews" : 7, "tags" : "nasty", "comments" : [ { "author" : "barbara", "text" : "this is interesting" }, { "author" : "jenny", "text" : "i like to play pinball", "votes" : 10 } ], "other" : { "bar" : 14 } } ], "ok" : 1 } 

Notice that the single document has been duplicated (see the identical _id values). However, the tags array in each copy of the document has been replaced with one value from the original tagsarray.

$unwind isn’t useful on its own, but is useful when combined with $match, which can be used to filter out copies of the document without values of interest, and then the array can be reassembled using other operators such as $push or $addToSet. $unwind is also useful when combined with $group.

$group groups elements with a common key together, and allows the application of aggregation functions. For the purposes of a small example, let’s add a second article to our collection:

db.article.save( { title : "this is some other title" , author : "jane" , posted : new Date(978239834000) , pageViews : 6 , tags : [ "nasty" , "filthy" ] , comments : [ { author :"will" , text : "i don't like the color" } , { author :"jenny" , text : "can i get that in green?" } ], other : { bar : 14 } }); 

If these articles represent content on a web site, I might want to know what tags users are applying to these articles. I can use $group to find that out like so:

db.runCommand({ aggregate : "article", pipeline : [ { $unwind : "$tags" }, { $group : { _id : "$tags", count : { $sum : 1 }, authors : { $addToSet : "$author" } }} ]}); 

The grouping key for a $group is defined by specifying a field (or subdocument of multiple fields) for the _id. The grouping key is used to define buckets into which documents are put. Other fields are aggregate functions that can compute values across the entries in each bucket.

In this case, one is added to count for each document that matches the group key. That counts up the number of items in each bucket. authors will be an array, and each unique author in the bucket will be added to it.

The result of all this is

{ "result" : [ { "_id" : "filthy", "count" : 1, "authors" : [ "jane" ] }, { "_id" : "fun", "count" : 1, "authors" : [ "dave" ] }, { "_id" : "nasty", "count" : 2, "authors" : [ "jane", "dave" ] } ], "ok" : 1 } 

One document has the “filthy” tag, and it was written by jane. One document has the “fun” tag, and it was written by dave. Two documents have the “nasty” tag, and both jane and dave wrote documents with that tag.

The new aggregation functionality is available only by compiling from source. It is available in the 2.1 development release but is not meant for production. MongoDB 2.2 will be the first stable release with the aggregation framework. For more examples and information, check out the talk and demo from MongoSV 2011, or the documentation on the MongoDB wiki.

by Chris Westin

blog comments powered by Disqus
blog comments powered by Disqus