Introducing the MongoDB Community Kit

Oct 16 • Posted 9 months ago

Open source projects thrive as a reflection of the participation and enthusiasm of their communities. MongoDB is a great example of a global community, and we have seen a number of people, like MongoDB User Group (MUG) organizers and MongoDB Masters, create lasting impact for MongoDB through their work with the community. To encourage growth in the MongoDB community, we've taken what we've learned and turned it into a new resource: the MongoDB Community Kit.

Sometimes people want to get involved but aren't sure how to get started. This tool can help you, as a user and a community member, contribute to the project in whatever way you like. Based on our experiences with thousands of developers over the past 4 years, the MongoDB community team has developed a number of techniques that will help you make a valuable impact. For instance, you can:

  • Give a talk on how you scaled your MongoDB infrastructure
  • Write a blog post with real advice on how to use MongoDB in production
  • Create an open source tool for MongoDB that enables other users to code faster and create better deployments.
  • Create a User Group in your local area that educates new users and brings a community together
  • Contribute to the community in whichever way you like, even if it isn’t listed in the Community Kit. The invitation is open.

The Kit is available on GitHub as an open source project. This makes it easy to access, fork and update. The package is released under the Creative Commons license CC BY-NC-SA 3.0. Just like any other project, this kit gets better when users contribute their knowledge, so we encourage you to submit a pull request and add your feedback.

Some suggestions:

  • Add some posters you created for a MUG
  • Post your “Intro to MongoDB Slides”
  • Fork the kit and translate it into your own language and share with your community

We’re looking forward to seeing all the great activity coming from the community. Keep the pull requests coming!

MongoDB Engineer Valeri Karpov, who coined the term “MEAN” Stack in a blog post on the MongoDB Blog, joined the Google Developers webcast on the MEAN Stack to give an overview of MEAN and discuss why it is the new stack for the web.

QAing New Code with MMS: Map/Reduce vs. Aggregation Framework

Oct 2 • Posted 9 months ago

This is part three of a three-part guest series by Alex Giamas, Co-Founder and CTO of CareAcross.

When releasing software, most teams focus on correctness, and rightly so. But great teams also QA their code for performance. MMS Monitoring can also be used to quantify the effect of code changes on your MongoDB database. Our staging environment is an exact mirror of our production environment, so we can test code in staging to reveal performance issues that are not evident in development. We take code changes to staging, where we pull data from MMS to determine if feature X will impact performance.

As a working example, we can use MMS to calculate views across a day using both map/reduce and the aggregation framework, and compare their performance and how they affect overall DB performance.

Our test data consists of 10M documents in a collection named views in the CareAcross database, each of the following style:

{
    userId: "userIdName",
    date: ISODate("2013-08-28T00:00:01Z"),
    url: "urlEntry"
}
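
A minimal sketch of how test data of this shape might be generated with pymongo (the user/url distributions, batch size and local connection are assumptions, not the data actually used in this post):

from datetime import datetime, timedelta
import random
from pymongo import MongoClient

views = MongoClient().CareAcross.views   # assumes a local mongod
base = datetime(2013, 8, 28)

batch = []
for i in range(10 * 1000 * 1000):
    batch.append({
        "userId": "user%05d" % random.randint(0, 99999),
        "date": base + timedelta(seconds=random.randint(0, 86400)),
        "url": "/page/%d" % random.randint(0, 999),
    })
    if len(batch) == 1000:          # insert in batches to keep memory bounded
        views.insert_many(batch)
        batch = []
if batch:
    views.insert_many(batch)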

Using a simple map/reduce operation, we can sum over our documents' values and calculate the total per userId:

 db.views.mapReduce(function () {emit(this.userId, 1)}, function (k,v) {return Array.sum(v)}, {out:"result"})

The equivalent operation using the aggregation framework looks like this:

db.views.aggregate({$group: {_id:"$userId", total:{$sum:1}}})

The mapReduce function hits the server at 18:54. The aggregation command hits the server at 19:01.

If we compare these two operations across our data set we will get the following metrics from MMS:

In terms of lock percentage, we see that the map/reduce operation started locking up our test DB, whereas the aggregation framework's impact is insignificant, primarily because aggregation happens in memory.

This can be seen in the next graph, memory consumption:

As is fairly obvious, the first spike, from 70MB to 0.54GB of resident memory usage, occurred when mapReduce was invoked at 18:54; another spike, to 1.07GB, happened at 19:01, when the aggregation framework passed all of our views records through the pipeline, grouped by userId.

In terms of performance, the map/reduce operation took 139 seconds, whereas the equivalent aggregation framework operation completed in 52 seconds, a substantial improvement.

As for cursors, there were 2 open cursors for the mapReduce operation and 1 for the aggregation framework operation. That's something we'd want to monitor in case we ever end up with several open cursors dragging down the database.

From the graphs at the online console, you can see that aggregation is trading off memory for speed, lock percentage time and resource utilization in general. MMS can help us visualize the difference and identify any abnormal behaviors.

Takeaways from the series

  • Set up monitoring and alerting, but also have procedures in place for when those alerts arrive.
  • Carefully calibrate alerts - each email or text message should be actionable.
  • Install MMS on your staging or development environment and use it to test performance of new features.

If all this sounds interesting to you and you want to deal with large-scale MongoDB-powered applications in RoR, we are hiring! We are a stealth mode, digital health oriented company based in the UK with operations in Athens, Greece.

Thanks to Alex for sharing his expertise on the MongoDB blog in this series!

Scaling Advice from MongoHQ

Oct 2 • Posted 9 months ago

With most systems, trying to run a database of any significant size requires specialized knowledge, both to build your app and to manage the database it runs on top of. MongoDB makes your first 100GB simple - from running the database to writing the code. As your database gets larger, though, it helps to understand more about how MongoDB works so you can get the most out of it. MongoHQ has noticed that their customers that reach 100GB are running commercially successful businesses. MongoHQ recommends going through the 100GB Scaling Checklist as you grow. Watch the webinar recording on the subject for the full overview:

  1. Identify your data behavior: Figure out your data patterns and how they work within your application. You will need to link your data to how your application accesses it. Consider both the simple queries and the more complex queries you will need, like multi-range queries.

  2. Refactor your schema to simplify queries

  3. Remove data that does not fit MongoDB: remove “unrefactorable” data

  4. Separate hot and cold data

  5. Don't lean on mongodump: it disrupts RAM and causes performance issues. Consider other backup options instead, like MMS Backup

  6. Check your gauges: Monitor, monitor, monitor. Even if you aren't having performance problems, set this up now so you can keep a history of your metrics.

  7. Avoid queries causing page faults: MongoHQ has run benchmarks to prove this. An in-memory system running at 7,000 operations per second was cut down by 50%, to 3,500 operations per second, when just 1% of queries became table scans churning on disk (see the sketch after this list).

  8. Track and monitor slow queries: use Dex, MongoProfessor, Mongo-QP or MongoHQ’s Slow Query Tracker.

  9. Buying time with hardware: Don’t get addicted to buying hardware. Before making a purchase, always consider optimization and investigate separating and pairing data.
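
To illustrate point 7: page faults usually come from queries that scan the whole collection on disk, and checking the query plan and adding an index is the usual fix. A minimal pymongo sketch, with hypothetical database, collection and field names:

from pymongo import MongoClient

db = MongoClient().mydb   # hypothetical database

# A filter on an unindexed field forces a full collection scan,
# touching cold pages on disk and causing page faults.
print(db.orders.find({"status": "open"}).explain())   # look for a full scan in the plan

# An index on the field lets MongoDB touch only the matching documents.
db.orders.create_index("status")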

Watch the full recording with tips from MongoHQ’s Chris Winslet here.

September MongoDB News

Sep 30 • Posted 9 months ago

September was a big month. MongoDB 2.5.2 was released, MongoDB World was announced, MMS received a large feature update, and a number of MongoDB Drivers were released as well. Here’s a roundup of September news from the MongoDB newsletter:

MongoDB Announcements
  • Faceted Search with MongoDB: Faceted search, or faceted navigation, is a way of browsing and searching for items in a set of data by applying filters on various properties (facets) of the items in the collection. Faceted search makes it easy for users to navigate to the specific item or items they are interested in. It complements more free-form keyword search by facilitating exploration and discovery and is therefore useful when a user may not know the specific keywords they wish to search on. Learn how to implement faceted search with MongoDB
  • From Relational Databases to MongoDB - What You Need to Know: In this webinar on October 17, we’ll take a dive into how MongoDB works to better understand what non-relational design is, why we might use it and what advantages it gives us over relational databases. We’ll develop schema designs from examples, and consider strategies for scaling out.
  • MongoDB World: MongoDB has brought together more than 20,000 developers, IT professionals and executives in communities around the world. Now, MongoDB World will bring together this community in one place. Join us June 23-25 in New York City.
  • MMS Backup Feature Update: MMS Backup now includes the ability to exclude databases and collections, permitting you to adjust the backup service to your needs and tune costs.
MongoDB Releases

Get MongoDB updates straight to your inbox each month. Sign up for the MongoDB Newsletter.

Managing the web nuggets with MongoDB and MongoKit

Sep 27 • Posted 9 months ago

This is a guest post by Nicolas Clairon, maintainer of MongoKit and founder of Elkorado

MongoKit is a Python ODM for MongoDB. I created it in 2009 (when the ODM acronym wasn't even in use) for my startup project, Elkorado. Now that the service is live, I realize that I never wrote about MongoKit. I'd like to introduce it to you with this quick tutorial based on real use cases from Elkorado.

Elkorado: a place to store web nuggets

Elkorado is a collaborative, interest-based curation tool. It was born out of the frustration that there is no single place to find quality resources about a particular topic of interest. There are so many blogs, forums, videos and websites out there that it is very difficult to find our way through this massive wealth of information.

Elkorado aims to help people centralize quality content, so they can easily find it later and discover new resources.

MongoDB to the rescue

Rapid prototyping is one of the most important things in the startup world, and it is an area where MongoDB shines.

The web is changing fast, and so are web resources and their metadata. MongoDB's schemaless design is a perfect fit for storing this kind of data. After losing hair trying to use polymorphism with SQL databases, I moved to MongoDB… and I fell in love with it.

While playing with the data, I needed a validation layer and wanted to add some methods to my documents. Back then, there was no ODM for Python. And so I created MongoKit.

MongoKit: MongoDB ODM for Python

MongoKit is a thin layer on top of Pymongo. It brings field validations, inheritance, polymorphism and a bunch of other features. Let’s see how it is used in Elkorado.

Elkorado is a collection of quality web resources called nuggets. This is how we could fetch a nugget discovered by the user “namlook” with Pymongo:
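A minimal sketch (the database, collection and field names here are assumptions):

from pymongo import MongoClient

con = MongoClient()
# plain Pymongo: returns a regular dict
nugget = con.elkorado.nuggets.find_one({'discovered_by': u'namlook'})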

The nugget returned here is a regular Python dict.

Here’s a simple nugget definition with MongoKit:
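Something along these lines (a sketch; the exact fields are assumptions, the API is MongoKit's):

from mongokit import Connection, Document

connection = Connection()

@connection.register
class Nugget(Document):
    __database__ = 'elkorado'
    __collection__ = 'nuggets'
    structure = {
        'url': unicode,
        'discovered_by': unicode,
        'topics': [unicode],
        'popularity': int,
    }
    required_fields = ['url']
    default_values = {'popularity': 0}

    def is_popular(self):
        # hypothetical business rule
        return self['popularity'] > 100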

Fetching a nugget with MongoKit is pretty much the same:
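A sketch, reusing the registered model and hypothetical field from above:

nugget = connection.Nugget.find_one({'discovered_by': u'namlook'})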

However, this time, nugget is a Nugget object and we can call the is_popular method on it:
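For example (continuing the sketch above):

nugget.is_popular()   # returns a boolean; nugget still behaves like a dict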

One of the main advantages of MongoKit is that all your models are registered and accessible via the connection instance. MongoKit looks at the __database__ and __collection__ fields to know which database and which collection to use. This is useful because those variables are specified in only one place.

Inheritance

MongoKit was first built to natively support inheritance:
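A sketch of such a Core document (field names are assumptions):

import datetime

@connection.register
class Core(Document):
    __database__ = 'elkorado'
    structure = {
        'created_at': datetime.datetime,
        'updated_at': datetime.datetime,
    }
    default_values = {
        'created_at': datetime.datetime.utcnow,
        'updated_at': datetime.datetime.utcnow,
    }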

In this Core object, we are defining the database name and some fields that will be shared by other models.

If one wants a Nugget object to have date metadata, one just has to make it inherit from Core:
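For instance, the Nugget model sketched earlier can drop its own __database__ and date fields and simply inherit them (still a sketch):

@connection.register
class Nugget(Core):
    __collection__ = 'nuggets'
    structure = {
        'url': unicode,
        'discovered_by': unicode,
        'topics': [unicode],
        'popularity': int,
    }
    required_fields = ['url']
    default_values = {'popularity': 0}

    def is_popular(self):
        return self['popularity'] > 100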

It’s all about Pymongo

With MongoKit, you are still very close to Pymongo. In fact, MongoKit's connection, database and collection are subclasses of Pymongo's. If at some point in an algorithm you need raw performance, you can use Pymongo's layer directly, which is blazing fast:
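For instance (a sketch):

# straight to Pymongo: returns plain dicts, skips validation and object wrapping
raw = connection.elkorado.nuggets.find({'popularity': {'$gt': 100}})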

Here, connection is a MongoKit connection but it can be used like a Pymongo connection. Note that to keep things DRY, we can call Pymongo's layer from a MongoKit document:
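Something like this (assuming the registered Nugget model exposes its underlying Pymongo collection as .collection):

# the __database__ / __collection__ declared on the model are reused (DRY)
raw = connection.Nugget.collection.find({'popularity': {'$gt': 100}})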

A real life “simplified” example

Let’s see an example of CRUD done with MongoKit.

On Elkorado, each nugget is unique, but multiple users can share a nugget with different metadata. Each time a user picks up a nugget, a UserNugget is created with specific information. If this is the first time the nugget is discovered, a Nugget object is created; otherwise, it is updated. Here is a simplified UserNugget structure:
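A sketch of what such a model might look like (field names and the save logic are assumptions based on the description below):

@connection.register
class UserNugget(Core):
    __collection__ = 'user_nuggets'
    structure = {
        'url': unicode,
        'user_id': unicode,
        'topics': [unicode],
    }
    required_fields = ['url', 'user_id']

    def save(self, *args, **kwargs):
        super(UserNugget, self).save(*args, **kwargs)
        # each nugget is unique by its url: create it on first discovery,
        # update it otherwise (atomic update shown below)
        self.update_nugget()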

This example illustrates well what can be done with MongoKit. Here, the save method has been overloaded to check if a nugget exists (remember, each nugget is unique by its URL). It creates the nugget if it doesn't exist yet, and updates it otherwise.

Updating data with MongoKit is similar to Pymongo. Use save on the object, or use Pymongo's layer directly to make atomic updates. Here, we use atomic updates to push new topics and increase the popularity:
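One way to write that update through Pymongo's layer, pushing the user's topics and bumping the popularity atomically (a sketch; the update_nugget helper is hypothetical):

# inside UserNugget
def update_nugget(self):
    connection.Nugget.collection.update(
        {'url': self['url']},
        {'$addToSet': {'topics': {'$each': self['topics']}},
         '$inc': {'popularity': 1}},
        upsert=True)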

Getting live

Let’s play with our model:
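Something like this (using the hypothetical fields from the sketches above):

user_nugget = connection.UserNugget()
user_nugget['url'] = u'http://www.mongodb.org/'
user_nugget['user_id'] = u'namlook'
user_nugget['topics'] = [u'mongodb', u'nosql']
user_nugget.save()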

When calling the save method, the document is validated against the UserNugget’s structure. As expected, the fields created_at and updated_at have been added:

and the related nugget has been created:

Conclusion

MongoKit is a central piece of Elkorado. It has been written to be small and minimalist but powerful. There is so much more to say about features like inherited queries, i18n and gridFS, so take a look at the wiki to read more about how this tool can help you.

Check the documentation for more information about MongoKit. And if you register on Elkorado, check out the nuggets about MongoDB. Don't hesitate to share your nuggets as well; the more the merrier.

New MongoDB Desktop Backgrounds

Sep 26 • Posted 10 months ago

New MongoDB Desktop Backgrounds are out, courtesy of MongoDB’s Graphic Design Team. Enjoy

Setting Up Actionable Alerts and Procedures in MMS

Sep 26 • Posted 10 months ago

This is part two of a three-part guest series by Alex Giamas, Co-Founder and CTO of CareAcross.

In my last post, I went over the MMS Monitoring metrics that I find most interesting.

Having the metrics is a useful first step, but it shouldn't be the end goal. Far more important than viewing the metrics in a web page is having clear procedures for how to act upon them. In my case, most of the problems arose because of replication lag and page faults. In the case of high replication lag, our application would automatically fall back to the primary server, which is always up to date. The engineers could then investigate the root cause of the issue and fix it. For page faults, the process was lengthier and most of the time meant going back to the application and improving the queries or design that was causing the page faults.

For every key metric, set sensible alert thresholds that email or text someone, with a clear procedure defined for what to do for each type of alert. Sensible thresholds should be emphasized. An alert should be a real situation waiting for an action. Set the threshold too low and you'll receive alerts all the time and eventually get desensitized to them. Set the threshold too high and by the time you get the alert, you may have already lost data or otherwise be too late to act upon it.

Unfortunately, it takes a bit of time before you can establish what normal is for your system. Once you have a baseline, you can set up the alerts to make sure that you are operating within normal parameters.

An overlooked feature of MMS is that you can get a web view of logs and profile data using a single authentication mechanism across your servers. This is useful for troubleshooting when the production servers are locked up in a room and the janitor has eaten the keys ;)

In my next post, I’ll discuss how you can use MMS to QA new code.

For more on setting alerts in MMS, see Five MMS Monitoring Alerts to Keep Your MongoDB Deployment on Track.

The Top 5 Metrics to Watch in MongoDB

Sep 24 • Posted 10 months ago

This is part one of a three-part guest series by Alex Giamas, Co-Founder and CTO of CareAcross, a stealth mode startup seeking to empower patients. Alex is also a proud Carnegie Mellon alumnus, a graduate of the onsite courses offered at MongoDB University and a Cloudera Certified developer for Apache Hadoop (CDH-410).

At Upstream Systems, Persado, CareAcross and through various consulting roles, I have dealt with all types of MongoDB installations, ranging from single-server instances and medium-size deployments to large cloud-based sharded clusters. Whether large or small, monitoring is essential to assuring performance and reliability. We needed to visualize the health of production environments and maintain a clearly defined procedure for metrics exceeding threshold values, as well as measure the impact of development changes.

MongoDB Management Service (MMS) is rich with metrics, but in my experience, the most valuable metrics in practice are the following:

  • Lock percentage: This was more important in earlier versions, where the global write lock could eat you alive and lock yielding was not yet implemented. While it’s less important with more recent versions (please vote on SERVER-1240!), lock percentage still shows a lot about your database activity. A continuously high lock percentage will affect reads as they will eventually queue up behind writes.
  • Replication lag: Designing your application to read data from a secondary node can sometimes be a good idea, when it reduces latency of the read. But if your application is using the secondary’s data and you have high replication lag, your application will use stale data. In addition, a primary node failure when you have a high replication lag means that a secondary may not be sufficiently up-to-date in a failover scenario.
  • Journal writes: If your writes are overwhelming your journal file this will impact performance and stability of your MongoDB installation.
  • Page faults: Page faults are expensive to process and at sufficiently high rates, it probably means that your working set is not fitting in memory. In complex data driven applications, page faults may indicate a deeper root cause hidden in the implementation of the business logic of the app.
  • Non Mapped Virtual Memory: When this grows without bound, it usually means a memory leak. It's better to monitor it and proactively restart the server or try to hunt down the leak rather than wait for the crash to happen.

There’s a lot of data in MMS Monitoring but I have found that these metrics are the most interesting. In my next post, I will go over how to make this data actionable.

Visualizing Performance Issues with MMS

Sep 20 • Posted 10 months ago

This is a guest post by Albert Engelbrecht, Web App Developer at SuretyBonds.com.

“LMS is down.” The dreaded phrase came back again, meaning the office is twiddling its thumbs as the sales lead system becomes unresponsive. I open my SSH connection, run the MongoDB startup script and say “It’s back up, Brad.”

I get alert emails on IO usage on the servers nightly; the swap file is killing the disk. For some reason, our single MongoDB instance shuts down a few times a day. Logs pointed to the DB running out of memory, which was supported by my slowing database calls and huge IO overhead.

The obvious short-term answer to our out-of-memory errors was to upgrade to a bigger server instance for more swap space, however I knew this wouldn’t improve database performance of any one operation.

There is nothing more painful when trying to debug a hard problem than realizing that you can't see anything - you are entirely blind. The mongotop and mongostat tools are a bit arcane and don't give the big picture view. Ultimately my research into MongoDB monitoring led me to try MMS - MongoDB Management Service.

Suddenly there are graphs and charts galore. Everything's there: non-mapped memory, database size, total memory footprint. I am in hog heaven. Not only can I get DB statistics, but with a Munin agent running, it can report back hardware statistics as well. These aren't your Grandpa's Munin graphs either: MMS's graphs are interactive, allowing me to set custom date ranges.

Even More Data Riding Spinners

Another neat feature that I use is text and email alerts on a number of different events. Not only does MMS have graphs and alerts, but it also allows me to go through our slower queries and order them by response time. Instead of sifting through profiling data using queries inside the mongo shell, now I can easily view it all on one page. You know how many queries I rewrote? Probably a million, maybe two.

Take this for example:

Lead.find( {'$where' : "['fresh', 'new'].indexOf(this.status) === -1"} )
// Runs JavaScript against every document to check that the status isn't "fresh" or "new"

Lead.find( {"status" : { "$nin" : [ "fresh", "new" ] } } )
// Much more performant

This fun little query runs JavaScript against every record in the collection to find the ones that don't have either of the two statuses. It runs every time someone hits the front page, even though the information is only shown to admins. Removing the query made a big difference in the non-mapped memory size. Take that interns, for you shall impact my DB performance no longer!

At the end of the day, upgrading the DB server and doing major refactoring worked out well. Our MongoDB instance hasn't gone down since then, plus I find myself regularly coming back to MMS to see how our newer queries are working out. It's also good to know that when we grow large enough to require sharding, MMS can scale seamlessly. If you happen to be wondering just how healthy your DB is, I encourage you to check out MMS Monitoring. 10gen, er, MongoDB Inc. did a great job, plus it's free!

When not waxing poetic about code smell, Albert works on web applications for SuretyBonds.com, a nationwide surety bonding agency.

Tracking Twitter Followers with MongoDB

Sep 17 • Posted 10 months ago

By André Spiegel, Consulting engineer for MongoDB

As a recently hired engineer at MongoDB, part of my ramping-up training is to create a number of small projects with our software to get a feel for how it works, how it performs, and how to get the most out of it. I decided to try it on Twitter. It’s the age-old question that plagues every Twitter user: who just unfollowed me? Surprising or not, Twitter won’t tell you that. You can see who’s currently following you, and you get notified when somebody new shows up. But when your follower count drops, it takes some investigation to figure out who you just lost.

I'm aware that there are a number of services that will answer that question for you. Well, I wanted to try this myself.

The Idea and Execution

The basic idea is simple: You have to make calls to Twitter’s REST API to retrieve, periodically, the follower lists of the accounts you want to monitor. Find changes in these lists to figure out who started or stopped following the user in question. There are two challenging parts:

  1. When you talk to Twitter, talk slowly, lest you hit the rate limit.
  2. This can get big. Accounts can have millions of followers. If the service is nicely done, millions of users might want to use it.

The second requirement makes this a nice fit for MongoDB.

The program, which I called “followt” and wrote in Java, can be found on github. For this article, let me just summarize the overall structure:

  • The scribe library proved to be a great way to handle Twitter’s OAuth authentication mechanism.

  • Using GET followers/ids, we can retrieve the numeric ids of 5,000 followers of a given account per minute. For large accounts, we need to retrieve the full list in batches, potentially thousands of batches in a row.

  • The numeric ids are fine for determining whether an account started or stopped following another. But if we want to display the actual user names, we need to translate those ids to screen names, using GET users/lookup. We can make 180 of these calls per 15 minute window, and up to 100 numeric ids can be translated in each call. In order to make good use of the 180 calls we’re allowed, we have to make sure not to waste them for individual user ids, but to batch as many requests into each of these as we can. The class net.followt.UserDB in the application implements this mechanism, using a BlockingQueue for user ids.

Storing Twitter Data in MongoDB: A Lesson in Schema Design

So how do we store the information in MongoDB? I came up with the following, simple schema:

{
    "followee": 12345,
    "follower": 54321,
    "start":    ISODate("2013-08-21T12:34:00"),
    "last":     ISODate("2013-08-23T06:00:00"),
    "end":      ISODate("2013-08-23T07:50:23")
}

This document means that the Twitter user with numeric id 12345 has been followed by user 54321 since August 21. The last scan of user 12345, when user 54321 still showed up in the followers list, happened at 6am on August 23rd. At 7:50am that day, user 54321 was no longer following 12345. A document with no “end” field means that the “follower” is still following the “followee”.

This simple schema won over more complex designs, for example the idea to store all of a user’s followers as an array inside a single document. This approach wouldn’t scale to millions of followers, since the size limit for a single document in MongoDB is 16MB. But more importantly, we usually neither need nor want all of a user’s followers in our application’s memory at the same time. This is because a) the memory usage might quickly get out of hand (think millions of users), and b) the algorithms we need to run on these follower lists can indeed work without having it all in memory (more on these algorithms below). Therefore, having it all in a single document would actually be disadvantageous to our application. Two important lessons here:

  1. Just because you can have complex documents in MongoDB, doesn’t mean you actually need them all the time. Sometimes something very simple and “relational” is just fine.
  2. Schema design is influenced, more than anything else, by your data access patterns. Make it so that your typical operations are efficient.

As we retrieve a user’s follower list from Twitter, we update the “last” timestamp in the corresponding documents. If we don’t find a document for that followee/follower pair, or all these documents have an “end” field, then that is a new follower and we create a new document for that pair. It can actually take plenty of time until we are through with all of a user’s followers, since we can only retrieve 5,000 of them per minute. After all the followers have been retrieved, the next step is to figure out which users have unfollowed: in other words, which users that we saw in a previous iteration are no longer in the follower list.
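
In pymongo terms, that update step might look roughly like this (the real followt code is Java and lives on github; the collection and variable names here are illustrative):

def record_follower(db, followee_id, follower_id, now):
    # refresh "last" on the open relationship document, if there is one
    res = db.follows.update_one(
        {"followee": followee_id, "follower": follower_id,
         "end": {"$exists": False}},
        {"$set": {"last": now}})
    if res.matched_count == 0:
        # no open document: this is a new (or returning) follower
        db.follows.insert_one(
            {"followee": followee_id, "follower": follower_id,
             "start": now, "last": now})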

Set Intersection and Adaptive Merging

This is a set intersection problem. We have two potentially very large sets (the previous and the current follower lists), and we want to find out which members are only in one of the sets. (Actually, this is the inverse of computing the intersection and hence the same problem.) There are a number of ways we can do this.

  1. The naive approach is to iterate over one of the sets and search each value in the other set. This is also called multisearch. We could do it with an individual database query for each value, but the performance is obviously unacceptable: Although each query is actually quite fast (100µs on my laptop), that adds up to more than 100 seconds for a million followers.

  2. If we store at least the second set in the application’s main memory, multisearch gets about two orders of magnitude faster. But if we need to do this for many follower lists at the same time, the memory consumption may be prohibitive. And we can also do better algorithmically.

  3. If both sets are sorted (something that MongoDB’s indexes give us essentially for free), then we can perform the equivalent of a merge operation: Iterate over both sets at the same time and advance the iterators according to whether the current set elements are the same or different. This is very fast and uses constant memory.

  4. An even better solution is to exploit the “last” timestamp in our documents. Any follower whose “last” timestamp wasn’t updated in the first phase of the scan must have unfollowed. By simply comparing the time stamps to the start time of the scan we instantly get all unfollowers, and we can actually update their “end” time stamp as part of the same find-and-modify operation. This solution, which we could call “mark-and-sweep”, turns out to be the fastest — although it isn’t any better than the previous ones algorithmically, it wins by offloading all the work into the database.
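
A pymongo sketch of that mark-and-sweep step (again illustrative, not the actual Java implementation):

def sweep_unfollowers(db, followee_id, scan_start, now):
    # any open relationship whose "last" was not touched during this scan
    # belongs to an unfollower: close it by setting "end"
    db.follows.update_many(
        {"followee": followee_id,
         "last": {"$lt": scan_start},
         "end": {"$exists": False}},
        {"$set": {"end": now}})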

There is another approach based on more recent research which would require a little more implementation work, and we therefore couldn't try it as part of this small project. It is a variant of the merge algorithm called adaptive merging. It exploits the idea that the two sets are very similar. So rather than comparing the elements one by one, we take our chances and "leap forward" within the two sets and check whether we are still in sync (i.e. the elements from set A and set B at that index are still the same). If they are, we take an even larger leap forward (this is also called "galloping"). However, if the elements we find at the end of the leap are not the same, we do a binary search in the section we just leapt over to find the exact place where the difference occurred.

If we assume that we cannot keep both sets in memory, then this algorithm requires some careful buffering of the results as they come in from the database, which was a little beyond the scope of this project. It would however be an interesting exercise by itself, not least because the open source nature of MongoDB would allow us to use and modify buffering internals deep down in the driver code to do exactly what we want.

Performance and Results

The table below shows execution times for the algorithms we did implement for follower lists of 100k, 500k, and 1m entries, and 100 missing followers between the two lists. All times are in seconds, and were measured locally on my laptop.


algorithm / list size    100,000    500,000    1,000,000
multisearch/database     12         62         127
multisearch/memory       0.39       2.2        3.8
merge                    0.95       3.3        6.3
mark-and-sweep           0.34       1.7        3.6

We let the application run for a couple of days and tracked the development of the @MongoDB account and the accounts of a few other databases and vendors. The results, with the other vendors anonymized, are shown in the charts below. These charts echo the findings of the current database popularity ranking, which shows MongoDB as the most popular NoSQL database.

But the Twitter numbers add a not-so-obvious twist to it: Although @MongoDB is still a small account as compared to some of the big players, its follower growth rate is unusually high for an account of this size, while the loss rate is just about average. (This can be seen most clearly in the normalized chart of follower gain/loss per day and 1000 followers.)

We’re excited to see those results.

Twitter Memes Dataset Overview with PageRank

Sep 9 • Posted 10 months ago

This is the last of three blog posts from this summer internship project showcasing how to answer questions concerning big datasets stored in MongoDB using MongoDB’s frameworks and connectors.

Once we’d familiarized ourselves with a sizable amount of data from the Flights dataset, the next step was to move towards really BIG data: Twitter Memes. SNAP (Stanford Network Analysis Project) provides free large network data sets. In particular, we were interested in the Twitter Memes dataset from the 2008 presidential election generated by MemeTracker, a combined project between Stanford and Cornell which created news cycle maps from news articles and blog posts. This dataset is a compilation of blogs, news articles, and other media postings that pertain to the presidential election. In particular, the project focused on the time lapse between the first mention of a quote and the time it took for the quote to be picked up by mass media. However, our analysis focuses again on the PageRank and importance of both the individual URLs and websites.

The entire dataset contains over 96 million documents and over 418 million links between them. To begin, we focused solely on the memes during the month of April in 2009, which had over 15 million documents. The goal was to run PageRank on this dataset (where all web pages form a graph). The higher the PageRank of a web page, the more important the page. Web pages with a relatively high PageRank usually have a high ratio of incoming links to outgoing links compared to all other pages in the graph.

Disclaimer: As we’d learn through this project, the pages with the most PageRank do not necessarily have to be related to the 2008 presidential election.

Importing the Data

The source file quotes_2009-04.txt was 11G. It came in this continuous format:

P       http://blogs.abcnews.com/politicalpunch/2008/09/obama-says-mc-1.html
T       2008-09-09 22:35:24
Q       that's not change
Q       you know you can put lipstick on a pig
Q       what's the difference between a hockey mom and a pit bull lipstick
Q       you can wrap an old fish in a piece of paper called change
L       http://reuters.com/article/politicsnews/idusn2944356420080901?pagenumber=1&virtualbrandchannel=10112
L       http://cbn.com/cbnnews/436448.aspx
L       http://voices.washingtonpost.com/thefix/2008/09/bristol_palin_is_pregnant.html?hpid=topnews
  • P denotes the URL of the document.
  • T represents the time of the post.
  • Q is a quote found in the post.
  • L is a link that exists in the post.

This was not an ideal schema for MongoDB. With the use of inputMongo.py, the above input was converted into documents resembling the following:

{
    "_id" : ObjectId("51c8a3f200a7f40aae706e86"),
    "url" : "http://blogs.abcnews.com/politicalpunch/2008/09/obama-says-mc-1.html",
    "quotes" : [
        "that's not change",
                "you know you can put lipstick on a pig", 
                "what's the difference between a hockey mom and a pit bull lipstick", 
                "you can wrap an old fish in a piece of paper called change"
    ],
    "links" : [
        "http://reuters.com/article/politicsnews/idusn2944356420080901?pagenumber=1&virtualbrandchannel=10112", 
                "http://cbn.com/cbnnews/436448.aspx", 
                "http://voices.washingtonpost.com/thefix/2008/09/bristol_palin_is_pregnant.html?hpid=topnews"
    ],
    "time" : ISODate("2008-09-09T22:35:24Z")
}

This resulted in 15,312,738 documents. We also utilized bulk insertion instead of individual document insertions. It took about 1 hour and 48 minutes to insert all of these documents into a collection in the database.

Notice that we still generated an ObjectId as _id. We’d later realize that this is superfluous as the url is unique per document.
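
For reference, a condensed sketch of how such a conversion with bulk insertion can be done in pymongo (this is not the actual inputMongo.py; the names and the batch size are illustrative):

from datetime import datetime
from pymongo import MongoClient

coll = MongoClient().memes.quotes
batch, doc = [], None

def flush(docs):
    if docs:
        coll.insert_many(docs)   # bulk insertion; _id is left to MongoDB

with open('quotes_2009-04.txt') as f:
    for line in f:
        parts = line.rstrip('\n').split(None, 1)
        if len(parts) != 2:
            continue                    # skip blank separator lines
        tag, rest = parts
        if tag == 'P':                  # a new post starts
            if doc is not None:
                batch.append(doc)
            doc = {'url': rest, 'quotes': [], 'links': []}
        elif tag == 'T':
            doc['time'] = datetime.strptime(rest, '%Y-%m-%d %H:%M:%S')
        elif tag == 'Q':
            doc['quotes'].append(rest)
        elif tag == 'L':
            doc['links'].append(rest)
        if len(batch) >= 1000:          # insert in batches of 1000
            flush(batch)
            batch = []
if doc is not None:
    batch.append(doc)
flush(batch)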

Preparing the Dataset for PageRank

Theoretically, for PageRank to produce more accurate and reflective results, there must be no dead ends and the graph must be strongly connected (every node must be able to travel to any other node in the graph and back). Dead ends are nodes in the graph that have incoming links but no outgoing links. The presence of dead ends in a dataset leaks the PageRank in the graph so that the sum of PageRank of nodes in the graph will slowly converge to zero.

There are 2 ways to fix this problem:

  1. Use a taxation parameter, also called the random surfer, to pass a portion of PageRank from every node to every other node. This is a fix more for a graph that is not strongly connected than for graphs that have dead ends.

  2. Recursively remove dead ends.

We decided to use method (2) first because the resulting graph will be much cleaner and will theoretically give more reasonable results. To make the graph even more strongly connected, we also used a taxation parameter after erasing all dead ends.

Removing dead ends from the dataset proved to be more involved than we had initially thought. Here's what we tried:

Attempt #1: Patience, Young Grasshoppers

The first solution created new collections of nodes that were not dead ends in the last collection. If the size of the links array is zero for a doc/node, then do not add this doc to the next iteration. Else, iterate through each link in the links array for each doc. Remove each link which points to a doc that does not exist in the last iteration collection. This is only bounded by the 16MB document limit.

Although we had an index on the url field, after 5 hours of running the script, we realized that the projected elapsed time would be over 20 days. This prompted us to explore other options.

Attempt #2: Further Optimizing URL search

We originally thought the reason for this lag was the script searching for each link in the entire collection, even though we did have an index on it already. Instead, we tried a more optimized approach: create another collection titled char8, which has as its _id the first 8 characters of the url. Each char8 doc then holds an array of the urls that start with those 8 characters. So instead of searching for each link among all urls in the original memes collection, we'd only search through the links of the char8 collection (which is indexed by the first 8 characters).

This was now projected to finish in 5 days, still too inefficient.

Attempt #3: Migrating to Hadoop/Pig

However, we were undeterred; we still wanted to compute PageRank over the entire dataset with no dead ends. The single script approach was extremely inefficient and slow. It wasn’t fit for BIG DATA. Instead, we turned to Hadoop and Amazon Elastic MapReduce.

Hadoop is an open-source MapReduce framework that handles intensive data computations and often operates over large clusters, although it could also be used locally. It supports many languages and tools to ease usage. Amazon Elastic MapReduce is a service provided by Amazon that hosts the Hadoop framework on its EC2 instances that would be created solely for each MapReduce task and connects to S3 Simple Storage for data input and output. Mongo-Hadoop was the necessary library which would allow us to connect MongoDB as the input/output source for these MapReduce jobs.

We submitted jobs to Elastic MapReduce through both the UI and the command line tool elastic-mapreduce, with our access ID and secret key.

First, we wrote a Pig script, explained below, to eliminate dead ends.

-- Registers the mongo and hadoop jars from our bucket
REGISTER s3://memes-bson/mongo-2.10.1.jar
REGISTER s3://memes-bson/mongo-hadoop-pig_1.1.2-1.1.0.jar
REGISTER s3://memes-bson/mongo-hadoop-core_1.1.2-1.1.0.jar
REGISTER s3://memes-bson/mongo-hadoop_1.1.2-1.1.0.jar
-- There's a User Defined Function in myudf.MYBAG which needs to be used
REGISTER s3://memes-bson/myudf.jar

original =  LOAD '$INPUT'
            USING com.mongodb.hadoop.pig.BSONLoader;

outs =      FOREACH original GENERATE $0#'_id' AS url, $0#'value'#'links' AS links;

-- Dead ends will not appear in showing as links is empty. url is the origin of the link while 
-- single is the destination. myudf.MYBAG turns the tuple of links to a bag of tuple of each link.
showing =   FOREACH outs GENERATE url, FLATTEN(myudf.MYBAG(links)) AS single;

-- JOIN will eliminate links that go to a single which doesn't exist in our dataset
joined =    JOIN outs BY url, showing BY single;

project =   FOREACH joined GENERATE showing::url AS url, showing::single AS single;

-- Group together the urls such that they form the original scheme of a url and an array of links 
together =  GROUP project BY url;

result =    FOREACH together GENERATE $0 AS url, $1.single AS links;

STORE result INTO '$OUTPUT'
    USING com.mongodb.hadoop.pig.BSONStorage;

The Pig script above removed all dead ends from the current dataset. The problem with this is that removing dead ends could create new ones. For example, in the simple graph of

A -> B -> C -> D

D is the only dead end in the graph. But when D is removed, we have

A -> B -> C

So we would keep on removing the new dead ends until there were no new dead ends. In this particular “pathological” graph, the entire graph would be removed because all of the nodes in the graph would eventually be dead ends. Fortunately, most datasets aren’t linear graphs.

We had to figure out a way to repeatedly run the Pig script above until there are no more dead ends. Elastic MapReduce only allows for a single Pig script execution, so we wrote Bash script removeAllDeadEnds.sh that kept running the above Pig script until the output filesize stopped decreasing. The script utilized s3cmd to check file size on S3 and elastic-mapreduce-ruby to submit jobs. In theory, the output filesize will decrease if and only if some dead ends have been removed.

After 10 iterations or so, the script would only erase two or three dead ends per iteration. This continued until we stopped the script at 70 iterations, which took over 9 hours. With 8 normalized instance hours on m1.xlarge, the iterative jobs finished, on average, in 8 minutes. The initial BSON size of the dataset with dead ends was 606.9MB and the final BSON size, with only 2 dead ends remaining, was 448.2MB. We decided that running more iterations would change the result only trivially, so we moved ahead. We ended up with 1,113,524 total nodes.


PageRank in Elastic MapReduce on the dataset

At this point, we had two collections:

  1. NO_DEAD_ENDS which had 1,113,524 nodes, 2 of which were dead ends.

  2. ORIGINAL which had 36,814,086 nodes, around 75% of which were dead ends.

Whereas MongoDB's built-in MapReduce framework converged quickly on the Flights dataset, the sheer size of these collections drove us to use Amazon's Elastic MapReduce to compute the PageRank of the nodes in the graph.

NO_DEAD_ENDS
Preformat

First, we had to preformat the graph to suit the PageRank program. This involved changing the schema, using a Hadoop Job written in Java, to:

{
    "_id" : "001rian.blogspot.com/2008_07_01_archive.html",
    "ptr" : 0.2,
    "pg" : 8.98049795e-7,
    "links" : [
        "001rian.blogspot.com/2008/07/sanra-dewi-ikutan-bisnis-interne.html",
        "001rian.blogspot.com/2008/07/peduli-dengan-pagerank-dan-traffic.html",
        "001rian.blogspot.com/2008/08/88808-08-2008.html",
        "001rian.blogspot.com/2008/07/jadwal-puasa-dan-imsak-ramadhan-1429-h.html",
        "001rian.blogspot.com/2008/07/buku-elektronik-untuk-sekolah.html"
    ]
}
  • "pg" : 6.081907480807019e-8 was the initial PageRank of all nodes in a graph, which corresponds to the reciprocal of the total number of nodes in the graph.
  • ptr was the probability of this node going to any other node it links to, which corresponds to the reciprocal of the length of the links array.
PageRank

The structure of the Elastic MapReduce Java program for PageRank was similar to the MongoDB MapReduce program written for the Flights dataset.

However, instead of setting all of the arguments for the Hadoop job in the Bash script, each iteration (one Map and one Reduce) was its own job, so jobs were created continuously and their variables needed to be set dynamically. For example, the output of the previous job was set as the input of the next job.

last = iteration - 1;
FileInputFormat.addInputPath(job, new Path(outputString + last));

// mapred.output.dir is the current configuration variable for the output path in Mongo-Hadoop. 
job.getConfiguration().set("mapred.output.dir", outputString + iteration);

Again, the stopping criterion for PageRank is when the average percentage change of a node, the `residual` or the `diff`, drops below 0.1%. Instead of outputting the individual node diffs in Reduce and aggregating over the entire output to sum up the diff as in the MongoDB MapReduce for Flights, we used the Hadoop counter to sum the residuals per Reduce call. The Hadoop counter is an atomic variable that is accessible by all Mappers and Reducers and it will track any statistic or variable to be read after the job is completed.

// Need the 10E6 factor because a long only holds whole numbers
context.getCounter(PageRank.RanCounters.RESIDUAL).increment((long) (residual * 10E6));

Therefore, after each job was completed, we viewed the total residual to determine whether it was under the threshold. Our stopping criterion is again when the residual drops below .001 * n, where n is the number of nodes. In this case, PageRank finished after 16 total iterations in 2 hours and 30 minutes with 7 m1.xlarge instances.

ORIGINAL

Another implementation of the PageRank algorithm didn’t delete all dead ends from the graph, but instead connected all dead ends to every other node in the graph. This created a strongly connected graph but it also increased the noise in the graph. Since the resulting number of nodes after erasing dead ends is only 1 million out of the original 15,312,738, we wanted to see how the PageRank results would change if all of the dead ends were included.

Notice that the original collection from the text file only has 15,312,738 nodes, whereas we accounted for 36,814,086 nodes in the ORIGINAL collection. The extra 21,501,348 nodes appear as links in the original 15,312,738 documents but were not documents themselves in the imported collection. Rather than shrinking the graph, as erasing dead ends did, making the graph strongly connected grew it by 21,501,348 extra dead-end nodes.

However, there’s no need to actually create edges between all dead end nodes and those that aren’t dead ends (with 27 million dead ends, that would create 27 million * 36 million = 972 billion links). Instead, we simply distributed the summation of the PageRank from all dead ends to every other node. Here are our implementation ideas:

  1. The first (implemented) idea was to add all of the PageRank from nodes with dead ends in the Mapper, and then distribute this PageRank among all nodes in the Reducer when summing up the incoming PageRank; however, this was not feasible as Hadoop counters accessed in the Mapper would be zero in the Reducer. Mappers and Reducers executed simultaneously so the counter values were only finalized after the job was done.

  2. To solve (1), we waited until the end of a job to determine the PageRank of the dead ends. The final PageRank and residual were then calculated in the Mapper of the next iteration.

The main function, which submits jobs and waits for completion, retrieves the job’s total PageRank from dead ends and passes it as a configuration variable to the next job.

long prevDeadEndsPG = prevJob.getCounters().findCounter(PageRank.RanCounters.DEAD_END_PG).getValue();
currentJob.getConfiguration().setLong("deadEndsPG", prevDeadEndsPG);

Then in the Mapper step, we add this deadEndsPG divided by the total number of nodes (the probability of any dead end jumping to this node). We compute the residual using the previous PageRank value, and add it to the counter for residuals. This way, the final PageRank value of a node for that iteration is determined in the Mapper instead of the Reducer.

// fetch the dead ends PG from the configuration (stored as a long, scaled up by 10E10)
long deadEndsPGLong = context.getConfiguration().getLong("deadEndsPG", 0L);

// scale it back down to a double
double deadEndsPG = ((double) deadEndsPGLong) / 10E10;
// the share of dead-end PageRank that every node receives
double distributeDeadEndsPG = deadEndsPG / PageRank.totalNodes;

// PageRank.beta (0.9) is the taxation parameter
currentPG = PageRank.beta * (currentPG + distributeDeadEndsPG) + PageRank.distributedBeta;

double residual = Math.abs(prevpg - currentPG) / prevpg;

context.getCounter(PageRank.RanCounters.RESIDUAL).increment((long) (residual * 10E6));

This original PageRank took 17 iterations in the span of 7 hours and 12 minutes with 8 m1.xlarge instances.

RESULTS

The following is an interpretation of the results obtained after running the PageRank algorithm over the 2 collections above.

NO_DEAD_ENDS

For the dataset with no dead ends, the 10 web pages with the most PageRank are:

1. {"pg":0.020741859913578454
   ,"url":"http://slideshare.net/youtube-in-slideshare"
   ,"quotes":["insert and publish","before slide 1","after slide 1","after slide 2","after slide 3"]}
2. {"pg":0.01490199050574318
   ,"url":"http://slideshare.com/youtube-in-slideshare"
   ,"quotes":["insert and publish","before slide 1","after slide 1","after slide 2","after slide 3"]}
3. {"pg":0.00542114032291505
   ,"url":"http://london.kijiji.ca/f-buy-and-sell-w0qqcatidz10"
   ,"quotes":[]}
4. {"pg":0.005381224128322537
   ,"url":"http://badgerandblade.com/index.php?page=terms"
   ,"quotes":["badger and blade","b amp b"]}
5. {"pg":0.00328534940037117
   ,"url":"http://saintjohn.kijiji.ca/f-buy-and-sell-w0qqcatidz10"
   ,"quotes":[]}
6. {"pg":0.00301961022829243
   ,"url":"http://london.kijiji.ca/c-buy-and-sell-business-industrial-salon-equipment-w0qqadidz115815478"
   ,"quotes":[]}
7. {"pg":0.0028168240288373365
   ,"url":"http://dealsofamerica.com/terms.php"
   ,"quotes":[]}
8. {"pg":0.0025406641926389753
   ,"url":"http://london.kijiji.ca/c-buy-and-sell-cds-dvds-vhs-house-seasons-1-4-w0qqadidz123632361"
   ,"quotes":[]}
9. {"pg":0.0024984791525017504
   ,"url":"http://answerbag.com/c_view/3602"
   ,"quotes":[]}
10. {"pg":0.0021795435717848356
    ,"url":"http://chacha.com/entertainment-arts/music"
    ,"quotes":["up where they play all day in the sun","stay all day in the sun","stay or leave i want you not to go but you did","sometimes goodbye is a second chance"]}

It’s not surprising that http://slideshare.net/youtube-in-slideshare and http://slideshare.com/youtube-in-slideshare have the most PageRank. Around the beginning of 2009, SlideShare released a new feature to enable users to embed youtube videos in their presentations. At the time, this feature was in Beta. FAQs, examples, and other details were posted on both http://www.slideshare.net/youtube-in-slideshare and http://www.slideshare.com/youtube-in-slideshare. Since this was a new feature (that a lot of people were excited about!), lots of reputable bloggers posted a link to these FAQ pages to showcase the power of the new SlideShare feature. As a result, these FAQ pages accumulated most of the PageRank.

Another interesting observation here is that there are 4 web pages with the most PageRank from kijiji, a centralized network of online urban communities for posting local online classified advertisements. The reason kijiji accumulated a lot of PageRank is that there's a lot of intra-domain (but inter-subdomain) linking. For example, lots of pages on london.kijiji.ca link to vacation.kijiji.ca, which links back to london.kijiji.ca. Such intra-domain linking creates a web structure called a spider trap that accumulates a large portion of the PageRank available to the whole system of pages. Furthermore, about 5% of the entire NO_DEAD_ENDS collection consists of kijiji web pages.

ORIGINAL
1. {"_id" : "craigslist.org/about/scams.html"
   , "pg" : 0.018243114523103326
   , "links" : []}
2. {"_id" : "slideshare.net/youtube-in-slideshare"
   , "pg" : 0.003038463243965542
   , "links" : ["slideshare.net/youtube-in-slideshare"]
   , "quotes" : ["insert and publish","before slide 1","after slide 1","after slide 2","after slide 3"]}
3. {"_id" : "slideshare.com/youtube-in-slideshare"
   , "pg" : 0.002161141838313388
   , "links" : ["slideshare.com/youtube-in-slideshare"]
   , "quotes" : ["insert and publish","before slide 1","after slide 1","after slide 2","after slide 3"]}
4. {"_id" : "ad.doubleclick.net/clk;212944738;23941492;n?goto.canon-asia.com/oic"
   , "pg" : 0.0015214745797758247
   , "links" : []
   , "quotes" : []}
5. {"_id" : "mx.answers.yahoo.com/info/disclaimer"
   , "pg" : 0.0013631525163117727
   , "links" : []
   , "quotes" : []}
6. {"_id" : "ar.answers.yahoo.com/info/disclaimer"
   , "pg" : 0.0013542983079855681
   , "links" : []
   , "quotes" : []}
7. {"_id" : "it.answers.yahoo.com/info/disclaimer"
   , "pg" : 0.0011670409020562926
   , "links" : []
   , "quotes" : []}
8. {"_id" : "fr.answers.yahoo.com/info/disclaimer"
   , "pg" : 0.001083113456512683
   , "links" : []
   , "quotes" : []}
9. {"_id" : "seroundtable.com"
   , "pg" : 0.0009033963316740201
   , "links" : []
   , "quotes" : []}
10. {"_id" : "de.answers.yahoo.com/info/disclaimer"
    , "pg" : 0.0006914069352292967
    , "links" : []
    , "quotes" : []}

These results differed significantly from the results of the NO_DEAD_ENDS but for good reasons.

  • The PageRank values for these web pages are significantly lower than those of the pages in NO_DEAD_ENDS because there are about 30x more pages in this graph than in NO_DEAD_ENDS.
  • The PageRank of the Craigslist scams information page is 6x as much as the next highest PageRank value. This is because every Craigslist page links to the scam page, creating a spider trap. Similarly, the Yahoo! Answers disclaimer pages accumulated a lot of PageRank because most Yahoo! forum pages link to one of the disclaimer pages.
  • The curious entry above is the ad.doubleclick.net URL. This URL is no longer available. But since doubleclick was an online ad platform, it’s possible that this particular ad was either served the most or gathered the most attention.
SubDomains

Looking at subdomains, we see a lot of entries for websites that have no URLs in the top 10 list. Sites like Twitter, Blogspot, and English Wikipedia are ranked among the top 25. It’s reasonable to assume that links to those websites aren’t purely spam or disclaimers. The D3 bubble chart below comprises the subdomains with the top 25 PageRanks.

Domains

By summing up the PageRank by domains instead of URLs, Yahoo! surpasses SlideShare. This makes sense as Yahoo! is quite spread out among its subdomains.

Lessons Learned

Bulk Insertion Speeds

The Flights dataset was inserted a single document at a time, as we didn't know about bulk insertions yet. However, pymongo allows for bulk insertions. This is a much preferred and faster method for creating large collections. We utilized bulk insertions on the Twitter Memes dataset with a default batch size of 1000 docs. Here is the amount of time it took for inputMongo.py to completely finish inserting the original 15,312,738 docs into Twitter Memes:

Bulk insert array size    Elapsed time (seconds)    Elapsed time (hr:min:sec)
1                         15007.014883              4:10:07
10                        7781.75180                2:09:41
100                       7332.346791               1:52:12
1000                      6493.35044885             1:48:13
10000                     Error: size too big for insert

The elapsed time for inputting the data dropped significantly between single insertions and insertions of size 10 each, but interestingly, the gains tapered off as the batch size increased further. The maxMessageSizeBytes value is 48 million bytes, and a bulk insertion of 10,000 exceeded this limit. This occurred in the Python driver, but some of the other drivers will split the array into 16MB chunks, which would avoid this error.

PageRank != Relevance

The results above show that the pages with the most PageRank are often the disclaimers and information pages, which probably isn’t what interests people most of the time. It turns out that Google’s Search algorithm is more complex than our simplistic version of PageRank. For instance, Google takes into account which links people click on for a certain query, thus boosting those URLs’ relevance for that query. Google also filters out some spam links and ad sites. In addition, some websites made with many internal links to intentionally boost their PageRank will have their PageRank values docked as part of Google’s campaign to disincentivize the intentional manipulation of PageRank.

Indexing URLs

Since each document has a unique URL, we created a unique index on this field. Fortunately, all the URLs in this dataset are well within the 1024-byte limit for index entries. An alternative and more optimal way to index URLs is to use hashing; conveniently, MongoDB supports hashed indexes.

Going with Go

Sep 5 • Posted 10 months ago

As you may have read previously on the blog, the MongoDB team is adopting the Go language for a variety of projects. At last month’s New York MongoDB User Group, Sam Helman presented on how MongoDB is using Go for new and existing cloud tools. In Sam’s talk, you’ll learn how MongoDB is using Go for the backup capabilities in MongoDB Management Service and a new continuous integration tool.

Why go with Go? Between the lightweight syntax, first-class concurrency, and well-documented, idiomatic libraries such as mgo, Go is a great choice for writing anything from small scripts to large distributed applications.

Thanks to g33ktalk for recording Sam’s talk.

PageRank on Flights Dataset

Sep 3 • Posted 10 months ago

By Sweet Song and Daniel Alabi, MongoDB Summer Interns for 2013

This is the second of three blog posts from this summer internship project showing how to answer questions concerning big datasets stored in MongoDB using MongoDB’s frameworks and connectors.

Having done some basic analysis on the Flights dataset (mostly using the MongoDB aggregation framework), we moved on to do some more advanced analysis on this dataset. We settled on computing the PageRank of all airports in the Flights dataset. The PageRank of nodes in a network is often computed iteratively. This process can easily be parallelized and often is. We can utilize MongoDB to compute the PageRank of nodes in a network in several ways. Here are the two options we considered:

  1. We can use the MongoDB MapReduce framework, which since version 2.4 uses the V8 JavaScript engine. Furthermore, since MongoDB is known for its robust sharding capabilities, we can increase the performance of query operations by setting up a MongoDB sharded cluster for our dataset. This is essential for really large working datasets.

  2. The Hadoop open-source framework is well known for its robust distributed data-processing features. MongoDB interfaces with Hadoop via the Mongo-Hadoop connector.

For this particular dataset, we opted for (1), since the Flights dataset has only 319 airports. Still, there were 4,601 weighted edges among US commercial airports. The weight of an edge between any two airports was calculated from the 6,155,752 recorded trips in the flights collection.

Making a Graph of Airports

The airports dataset is fairly well connected, with only one airport that received flights in the past year without having any domestic departures. Most flights out of Puerto Rico are considered international flights; as a result, our dataset had no recorded domestic departures from that airport for the year. It would act as a black hole for any PageRank that flows into it. A more thorough explanation can be found here. Therefore, we removed that single Puerto Rico airport from our airports graph.

From the previous analysis, we had put the Flights dataset in a flights collection in the flying database. An entry looks like this:

For each document, we create (or modify) at least one node that keeps track of this “edge”:
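(The original snippet isn’t reproduced here; the sketch below is a reconstruction based on the explanation that follows, with the ALL_CAPS names standing in for the counts it describes.)

// Hedged reconstruction of a node document, not the exact output of preformat.py.
{
    "_id": "12478",                      // airportId of the departure airport
    "pg": 1 / NUM_OF_AIRPORTS_IN_DB,     // initial PageRank, the same for every node
    "prs": {
        "12892": NUM_OF_FLIGHTS_FROM_12478_TO_12892 / NUM_OF_FLIGHTS_FROM_12478
        // ... one entry per airport reachable from 12478
    }
}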

where NUM_OF_AIRPORTS_IN_DB is the total number of airports in the Flights dataset, which corresponds to the number of nodes in the network. NUM_OF_FLIGHTS_FROM_12478 is the total number of flights leaving from the airport with airportId=12478, and NUM_OF_FLIGHTS_FROM_12478_TO_12892 is the number of flights that leave the airport with airportId=12478 and arrive at the airport with airportId=12892. pg is the current PageRank of an airport; prs is a map of <aId, pr> pairs, where pr is the probability of a flight going from the airport specified by _id to the airport identified by aId. For example, NUM_OF_FLIGHTS_FROM_12478_TO_12892/NUM_OF_FLIGHTS_FROM_12478 is the probability of transitioning from the airport with airportId=12478 to the airport with airportId=12892.

We wrote preformat.py to create the graph that contains information about the probability of every node in the graph transitioning to another. The resulting graph was stored in an fpg_0 collection (Flights PageRank 0) with 318 nodes.

MongoDB MapReduce

Next, we wrote some JavaScript code to calculate PageRank on the graph stored in the database. The goal was to create a new collection fpg_i for the ith iteration of PageRank. Every iteration is a call to oneiteration() in iteration.js, which consists of a map and a reduce function. The PageRank algorithm stops once the average percentage change of the PageRank values across all nodes drops below 0.1%. The map function looks like this:
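(The original listing from iteration.js isn’t reproduced here; the sketch below is reconstructed from the description that follows, ignores any damping factor, and papers over the fact that mapReduce nests its output fields under a value subdocument.)

var map = function () {
    var node = this.value || this;   // later iterations nest fields under "value"

    // Carry this node's bookkeeping forward so the next collection keeps its shape.
    emit(this._id, { prs: node.prs, prevpg: node.pg, pg: 0 });

    // Hand out this node's PageRank to every airport it links to, in proportion
    // to the transition probabilities stored in prs.
    for (var aId in node.prs) {
        emit(aId, { prs: null, prevpg: null, pg: node.pg * node.prs[aId] });
    }
};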

The map function considers every document (corresponding to an airport) in the current fpg_i. For each airport (call this x), it emits its airport ID (stored in _id) and passes the prs and prevpg (previous pg) information, for use in the next iteration of PageRank. Then, it passes a portion of x's PageRank to every airport that x links to.

The reduce function looks like this:
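(Again, a sketch rather than the original listing, written to match the two duties described just below.)

var reduce = function (key, values) {
    var result = { prs: null, prevpg: null, pg: 0 };
    values.forEach(function (v) {
        // Duty 1: keep the node's own bookkeeping (prs and prevpg).
        if (v.prs !== null)    { result.prs = v.prs; }
        if (v.prevpg !== null) { result.prevpg = v.prevpg; }
        // Duty 2: accumulate the PageRank sent to this node.
        result.pg += v.pg;
    });
    return result;
};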

The reduce function has two duties:

  1. Collect the prs and prevpg information for each node;

  2. Accumulate the total PageRank score sent to each node.

Finally, db["fpg_"+i].mapReduce(map, reduce, {out: "fpg_"+(i+1)}); runs MapReduce on the fpg_i collection using the map and reduce functions defined above and stores the result (in the same format as fpg_i) into fpg_(i+1).

We keep applying the MapReduce operation until the PageRank of the nodes converges, which happens when the average percentage change of pg across all nodes falls below a certain threshold (0.1% in our case). Our implementation of the PageRank algorithm took 6.203 seconds to run, converging after 20 iterations.
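The driving loop, sketched here rather than copied from iteration.js, looks roughly like this:

var i = 0;
var converged = false;

while (!converged) {
    // Produce fpg_(i+1) from fpg_i using the map and reduce functions above.
    db["fpg_" + i].mapReduce(map, reduce, { out: "fpg_" + (i + 1) });
    i++;

    // Average relative change of pg across all nodes in the new collection.
    var totalChange = 0;
    var count = 0;
    db["fpg_" + i].find().forEach(function (doc) {
        var v = doc.value || doc;              // mapReduce nests output under "value"
        totalChange += Math.abs(v.pg - v.prevpg) / v.prevpg;
        count++;
    });
    converged = (totalChange / count) < 0.001; // stop below 0.1% average change
}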

PageRank Result and Interpretation

The 10 airports with the most PageRank are:

The outcome matches our intuition that the airports with the most flights would accumulate most of the PageRank. In general, the nodes in a weighted graph with the most PageRank will be the ones with a greater ratio of incoming weight to outgoing weight.

Below is a map of the USA that illustrates the PageRank of all airports in the Flights dataset. Click on the image below to see the interactive map. The bigger the circle on the airport, the larger its PageRank. Hover around a circle to see the full name of an airport, its airport code, and the percentage of the total PageRank the airport accumulated.

Challenges/Lessons Learned

Insert vs. Update

Initially, we envisioned iteration.js merely updating the pg and prevpg fields of a single PageRank collection instead of outputting to a new collection. However, updates were significantly slower than inserts into a new collection, even though we already had indexes on the pg and prevpg fields. We learned that, in general, updates in really large collections are significantly slower than insertions into a new collection. This preference for inserts over updates recurred in our other attempts.
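Concretely, the two approaches look roughly like this (illustrative collection names; newPageRank() is a hypothetical stand-in for the per-node computation):

// Slow: update every node in place in one large, indexed collection.
db.fpg.find().forEach(function (doc) {
    db.fpg.update(
        { _id: doc._id },
        { $set: { prevpg: doc.pg, pg: newPageRank(doc) } }
    );
});

// Faster: write the new values into a fresh collection instead.
db.fpg_current.find().forEach(function (doc) {
    db.fpg_next.insert({ _id: doc._id, prevpg: doc.pg, pg: newPageRank(doc), prs: doc.prs });
});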

Flights Dataset has no information on International Flights

Only domestic flights are present in our Flights dataset. Perhaps if international flights were included, JFK, O’Hare, and San Francisco would have the most PageRank. Also, our map does not show the states of Alaska and Hawaii or the territory of Guam. If they were included, the continental USA would have been too small to distinguish individual airports.

Relatively small number of nodes in graph

Even though our initial Flights dataset contained 6,155,748 documents (corresponding to domestic flights), the resulting airports graph had only 318 documents (corresponding to airports/nodes). This is why the MongoDB MapReduce framework was very fast, converging after a few seconds and fewer than 20 iterations. It would likely take longer to converge on a dataset with more nodes (more airports).

The next dataset we’ll use is the Twitter Memes dataset. This dataset will have at least 1 million nodes (after pre-processing) that correspond to web pages on the Internet. Performance analysis based on the PageRank algorithm is more easily done on datasets with more nodes.

The MongoDB Java Driver 3.0: What’s Changing

Aug 30 • Posted 10 months ago

By Trisha Gee, MongoDB Java Engineer and Evangelist

In the last post, we covered the design goals for the new MongoDB Java Driver. In this one, we’re going to go into a bit more detail on the changes you can expect to see, and how to start playing with an alpha version of the driver. Please note, however, that the driver is still a work in progress, and not ready for production.

New features

Other than the overall changes to design detailed above, the 3.0 driver has the following new features:

  • Pluggable Codecs: This means you can make simple changes to serialisation/deserialisation, like telling the driver to use Joda Time instead of java.util.Date, or you can take almost complete control of how to turn your Java objects into BSON. This should be particularly useful for ODMs or other libraries, as they can write their own codecs to convert Java objects to BSON bytes.
  • Predictable cluster management: We’ve done quite a lot of work around discovering the servers in your cluster and determining which ones to talk to. In particular, the driver doesn’t have to wait for all servers to become available before it can start using the ones that are definitely there. The design is event-based, so as soon as a server notifies the driver of its state, the driver can take appropriate action: use it if it’s active, or start ignoring it if it’s no longer available.
  • Additional Connection Pool features: We’ve added support for additional connection pool settings, and a number of other improvements around connection management. Here’s the full list.
  • Deprecated methods/classes will be removed: In the next 2.x release a number of methods and classes will be deprecated. These, along with existing deprecated methods, will be removed in the 3.0 driver. This should point you in the right direction to help you migrate from 2.x to 3.x.

Speaking of Migration…

We’ve worked hard to maintain backwards compatibility whilst moving forwards with the architecture of the Java driver for MongoDB. We want to make migration as painless as possible; in many cases it should be a simple drop-in replacement if you want to keep using the existing API. We hope to provide a step-by-step guide to migrating from 2.x to 3.0 in the very near future. For now, it’s worth mentioning that upgrading will be easiest if you update to 2.12 (to be released soon), migrate any code that uses deprecated features, and then move to the compatible mode of the new driver.

Awesome! Can I try it?

Yes, you can! You can try out an alpha of the new driver right now, but as you’d expect there are CAVEATS: this is an alpha, and it does not support all current features (notably aggregation); although it has been tested, it is still in development and we can’t guarantee everything will work as you expect. Features which have been or will be deprecated in the 2.x driver are missing completely from the 3.0 driver. Please don’t use it in production. However, if you do want to play with it in a development environment, or want to run your existing test suite against it, please do send us any feedback you have.

If you want to use the compatible mode, with the old API (minus deprecations) and new architecture:

Maven

Gradle

You should be able to do a drop-in replacement with this dependency: use it instead of your existing MongoDB driver, run it in your test environment, and see how ready you are to use the new driver.

If you want to play with the new, ever-changing, not-at-all-final API, then you can use the new driver with the new API. Because we wanted to be able to support both APIs and not have a big-bang switchover, there’s a subtle difference in the location of the driver with the updated API; see if you can spot it:

Maven

Gradle

Note that if you use the new API version, you don’t have access to the old compatible API.

Of course, the code is on GitHub

In Summary

For 3.0, we will deliver the updated, simplified architecture with the same API as the existing driver, as well as working towards a more fluent style of API. This means that although in the future you will have the option of using the new API, you should also be able to do a simple drop-in replacement of your driver jar file and have the application work as before.

A release date for the 3.0 driver has not been finalized, but keep your eyes open for it.

All Hail the new Java driver!
