Tracking Twitter Followers with MongoDB

Sep 17 • Posted 11 months ago

By André Spiegel, Consulting engineer for MongoDB

As a recently hired engineer at MongoDB, part of my ramping-up training is to create a number of small projects with our software to get a feel for how it works, how it performs, and how to get the most out of it. I decided to try it on Twitter. It’s the age-old question that plagues every Twitter user: who just unfollowed me? Surprising or not, Twitter won’t tell you that. You can see who’s currently following you, and you get notified when somebody new shows up. But when your follower count drops, it takes some investigation to figure out who you just lost.

I’m aware there are a number of services that will answer that question for you, but I wanted to try this myself.

The Idea and Execution

The basic idea is simple: You have to make calls to Twitter’s REST API to retrieve, periodically, the follower lists of the accounts you want to monitor. Find changes in these lists to figure out who started or stopped following the user in question. There are two challenging parts:

  1. When you talk to Twitter, talk slowly, lest you hit the rate limit.
  2. This can get big. Accounts can have millions of followers. If the service is nicely done, millions of users might want to use it.

The second requirement makes this a nice fit for MongoDB.

The program, which I called “followt” and wrote in Java, can be found on GitHub. For this article, let me just summarize the overall structure:

  • The scribe library proved to be a great way to handle Twitter’s OAuth authentication mechanism.

  • Using GET followers/ids, we can retrieve the numeric ids of 5,000 followers of a given account per minute. For large accounts, we need to retrieve the full list in batches, potentially thousands of batches in a row.

  • The numeric ids are fine for determining whether an account started or stopped following another. But if we want to display the actual user names, we need to translate those ids to screen names, using GET users/lookup. We can make 180 of these calls per 15-minute window, and up to 100 numeric ids can be translated in each call. To make good use of the 180 calls we’re allowed, we have to make sure not to waste them on individual user ids, but to batch as many ids into each call as we can. The class net.followt.UserDB in the application implements this mechanism, using a BlockingQueue for user ids.

Storing Twitter Data in MongoDB: A Lesson in Schema Design

So how do we store the information in MongoDB? I came up with the following, simple schema:

{
    "followee": 12345,
    "follower": 54321,
    "start":    ISODate("2013-08-21T12:34:00"),
    "last":     ISODate("2013-08-23T06:00:00"),
    "end":      ISODate("2013-08-23T07:50:23")
}

This document means that the Twitter user with numeric id 12345 has been followed by user 54321 since August 21. The last scan of user 12345, when user 54321 still showed up in the followers list, happened at 6am on August 23rd. At 7:50am that day, user 54321 was no longer following 12345. A document with no “end” field means that the “follower” is still following the “followee”.
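
To make this concrete, here is a minimal shell sketch (the collection name follows is my assumption, not taken from the followt code). A compound index on the followee/follower pair keeps per-pair lookups fast and lets us read follower lists back in sorted order:

    // hypothetical collection name "follows"
    db.follows.ensureIndex({ followee: 1, follower: 1 })

    // current followers of account 12345: documents without an "end" field
    db.follows.find({ followee: 12345, end: { $exists: false } }, { follower: 1 })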

This simple schema won over more complex designs, for example the idea to store all of a user’s followers as an array inside a single document. This approach wouldn’t scale to millions of followers, since the size limit for a single document in MongoDB is 16MB. But more importantly, we usually neither need nor want all of a user’s followers in our application’s memory at the same time. This is because a) the memory usage might quickly get out of hand (think millions of users), and b) the algorithms we need to run on these follower lists can indeed work without having it all in memory (more on these algorithms below). Therefore, having it all in a single document would actually be disadvantageous to our application. Two important lessons here:

  1. Just because you can have complex documents in MongoDB, doesn’t mean you actually need them all the time. Sometimes something very simple and “relational” is just fine.
  2. Schema design is influenced, more than anything else, by your data access patterns. Make it so that your typical operations are efficient.

As we retrieve a user’s follower list from Twitter, we update the “last” timestamp in the corresponding documents. If we don’t find a document for that followee/follower pair, or all these documents have an “end” field, then that is a new follower and we create a new document for that pair. It can actually take plenty of time until we are through with all of a user’s followers, since we can only retrieve 5,000 of them per minute. After all the followers have been retrieved, the next step is to figure out which users have unfollowed: in other words, which users that we saw in a previous iteration are no longer in the follower list.
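
As a rough sketch of that scan step in shell syntax (again assuming the hypothetical follows collection, and MongoDB 2.4+ for $setOnInsert): update the open document for the followee/follower pair, or insert one if this is a new follower.

    var now = new Date();
    db.follows.update(
        { followee: 12345, follower: 54321, end: { $exists: false } },
        { $set: { last: now }, $setOnInsert: { start: now } },
        true,   // upsert: create the document if this is a new follower
        false   // multi: one open document per pair is enough
    );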

Set Intersection and Adaptive Merging

This is a set intersection problem. We have two potentially very large sets (the previous and the current follower lists), and we want to find out which members are only in one of the sets. (Actually, this is the inverse of computing the intersection and hence the same problem.) There are a number of ways we can do this.

  1. The naive approach is to iterate over one of the sets and search each value in the other set. This is also called multisearch. We could do it with an individual database query for each value, but the performance is obviously unacceptable: Although each query is actually quite fast (100µs on my laptop), that adds up to more than 100 seconds for a million followers.

  2. If we store at least the second set in the application’s main memory, multisearch gets about two orders of magnitude faster. But if we need to do this for many follower lists at the same time, the memory consumption may be prohibitive. And we can also do better algorithmically.

  3. If both sets are sorted (something that MongoDB’s indexes give us essentially for free), then we can perform the equivalent of a merge operation: Iterate over both sets at the same time and advance the iterators according to whether the current set elements are the same or different. This is very fast and uses constant memory.

  4. An even better solution is to exploit the “last” timestamp in our documents. Any follower whose “last” timestamp wasn’t updated in the first phase of the scan must have unfollowed. By simply comparing the time stamps to the start time of the scan we instantly get all unfollowers, and we can actually update their “end” time stamp as part of the same find-and-modify operation. This solution, which we could call “mark-and-sweep”, turns out to be the fastest — although it isn’t any better than the previous ones algorithmically, it wins by offloading all the work into the database.
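
The original implementation uses a find-and-modify for this; a multi-update in shell syntax expresses the same sweep (the follows collection and the scanStart variable, the timestamp taken when the scan began, are assumptions):

    // any still-open document whose "last" was not touched during this scan
    // belongs to an unfollower, so close it by setting "end"
    db.follows.update(
        { followee: 12345, last: { $lt: scanStart }, end: { $exists: false } },
        { $set: { end: new Date() } },
        false,  // upsert
        true    // multi: close every unfollower document in one operation
    );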

There is another approach, based on more recent research, which would require a little more implementation work, and we therefore couldn’t try it as part of this small project. It is a variant of the merge algorithm called adaptive merging. It exploits the idea that the two sets are very similar. So rather than comparing the elements one by one, we take our chances and “leap forward” within the two sets and check whether we are still in sync (i.e. the elements from set A and set B at that index are still the same). If they are, we take an even larger leap forward (this is also called “galloping”). However, if the elements we find at the end of the leap are not the same, we do a binary search in the section we just leapt over to find the exact place where the difference occurred.

If we assume that we cannot keep both sets in memory, then this algorithm requires some careful buffering of the results as they come in from the database, which was a little beyond the scope of this project. It would however be an interesting exercise by itself, not least because the open source nature of MongoDB would allow us to use and modify buffering internals deep down in the driver code to do exactly what we want.

Performance and Results

The table below shows execution times for the algorithms we did implement for follower lists of 100k, 500k, and 1m entries, and 100 missing followers between the two lists. All times are in seconds, and were measured locally on my laptop.


algorithm / list size    100,000    500,000    1,000,000
multisearch/database     12         62         127
multisearch/memory       0.39       2.2        3.8
merge                    0.95       3.3        6.3
mark-and-sweep           0.34       1.7        3.6

We let the application run for a couple of days and tracked the development of the @MongoDB account and the accounts of a few other databases and vendors. The results, with the other vendors anonymized, are shown in the charts below. These charts echo the findings of the current database popularity ranking, which shows MongoDB as the most popular NoSQL database.

But the Twitter numbers add a not-so-obvious twist to it: Although @MongoDB is still a small account as compared to some of the big players, its follower growth rate is unusually high for an account of this size, while the loss rate is just about average. (This can be seen most clearly in the normalized chart of follower gain/loss per day and 1000 followers.)

We’re excited to see those results.

Twitter Memes Dataset Overview with PageRank

Sep 9 • Posted 11 months ago

This is the last of three blog posts from this summer internship project showcasing how to answer questions concerning big datasets stored in MongoDB using MongoDB’s frameworks and connectors.

Once we’d familiarized ourselves with a sizable amount of data from the Flights dataset, the next step was to move towards really BIG data: Twitter Memes. SNAP (Stanford Network Analysis Project) provides free large network data sets. In particular, we were interested in the Twitter Memes dataset from the 2008 presidential election generated by MemeTracker, a combined project between Stanford and Cornell which created news cycle maps from news articles and blog posts. This dataset is a compilation of blogs, news articles, and other media postings that pertain to the presidential election. In particular, the project focused on the time lapse between the first mention of a quote and the time it took for the quote to be picked up by mass media. However, our analysis focuses again on the PageRank and importance of both the individual URLs and websites.

The entire dataset contains over 96 million documents and over 418 million links between them. To begin, we focused solely on the memes during the month of April in 2009, which had over 15 million documents. The goal was to run PageRank on this dataset (where all web pages form a graph). The higher the PageRank of a web page, the more important the page. Web pages with a relatively high PageRank usually have a high ratio of incoming links to outgoing links compared to all other pages in the graph.

Disclaimer: As we’d learn through this project, the pages with the most PageRank do not necessarily have to be related to the 2008 presidential election.

Importing the Data

The source file quotes_2009-04.txt was 11G. It came in this continuous format:

P       http://blogs.abcnews.com/politicalpunch/2008/09/obama-says-mc-1.html
T       2008-09-09 22:35:24
Q       that's not change
Q       you know you can put lipstick on a pig
Q       what's the difference between a hockey mom and a pit bull lipstick
Q       you can wrap an old fish in a piece of paper called change
L       http://reuters.com/article/politicsnews/idusn2944356420080901?pagenumber=1&virtualbrandchannel=10112
L       http://cbn.com/cbnnews/436448.aspx
L       http://voices.washingtonpost.com/thefix/2008/09/bristol_palin_is_pregnant.html?hpid=topnews
  • P denotes the URL of the document.
  • T represents the time of the post.
  • Q is a quote found in the post.
  • L is a link that exists in the post.

This flat format is not ideal for MongoDB, so we used inputMongo.py to convert the input into documents resembling the following:

{
    "_id" : ObjectId("51c8a3f200a7f40aae706e86"),
    "url" : "http://blogs.abcnews.com/politicalpunch/2008/09/obama-says-mc-1.html",
    "quotes" : [
        "that's not change",
        "you know you can put lipstick on a pig",
        "what's the difference between a hockey mom and a pit bull lipstick",
        "you can wrap an old fish in a piece of paper called change"
    ],
    "links" : [
        "http://reuters.com/article/politicsnews/idusn2944356420080901?pagenumber=1&virtualbrandchannel=10112",
        "http://cbn.com/cbnnews/436448.aspx",
        "http://voices.washingtonpost.com/thefix/2008/09/bristol_palin_is_pregnant.html?hpid=topnews"
    ],
    "time" : ISODate("2008-09-09T22:35:24Z")
}

This resulted in 15,312,738 documents. We also utilized bulk insertion instead of individual document insertions. It took about 1 hour and 48 minutes to insert all of these documents into a collection in the database.

Notice that we still generated an ObjectId as _id. We’d later realize that this is superfluous as the url is unique per document.

Preparing the Dataset for PageRank

Theoretically, for PageRank to produce accurate and meaningful results, there must be no dead ends and the graph must be strongly connected (every node must be able to reach every other node and be reachable from it). Dead ends are nodes in the graph that have incoming links but no outgoing links. The presence of dead ends leaks PageRank out of the graph, so that the total PageRank of all nodes slowly converges to zero.

There are 2 ways to fix this problem:

  1. Use a taxation parameter, also called the random surfer, to pass a portion of PageRank from every node to every other node. This is a fix more for a graph that is not strongly connected than for graphs that have dead ends.

  2. Recursively remove dead ends.

We decided to use method (2) first because the resulting graph will be much cleaner and will theoretically give more reasonable results. To make the graph even more strongly connected, we also used a taxation parameter after erasing all dead ends.
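
For reference, in my own notation (not from the original post), the taxed PageRank update that both approaches approximate is

    PR(v) = \beta \sum_{u \to v} \frac{PR(u)}{\mathrm{outdeg}(u)} + \frac{1 - \beta}{N}

where \beta is the taxation parameter (0.9 in the runs described below), outdeg(u) is the number of outgoing links of node u, and N is the total number of nodes. The (1 - \beta)/N term is the PageRank handed out by the "random surfer", which keeps rank from getting trapped in one part of the graph.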

Removing dead ends from the dataset proved to be more involved than we had initially thought. Here’s what we tried:

Attempt #1: Patience, Young Grasshoppers

The first solution iteratively built a new collection containing only the nodes that were not dead ends in the previous collection: if the links array of a doc/node is empty, the doc is not added to the next iteration's collection; otherwise, we iterate through the links array of each doc and remove every link that points to a doc that does not exist in the previous iteration's collection. The size of the links array is bounded only by the 16MB document limit.

Although we had an index on the url field, after 5 hours of running the script, we realized that the projected elapsed time would be over 20 days. This prompted us to explore other options.

Attempt #2: Further Optimizing URL search

We originally suspected the lag came from searching for each link across the entire collection, even though the url field was indexed. So we created a more optimized solution: another collection, titled char8, whose _id is the first 8 characters of a url and which holds an array of all urls starting with those 8 characters. Instead of searching for each link among all the urls of the original memes collection, we only had to search through the url array of the matching char8 document (looked up by the link's first 8 characters).

This was now projected to finish in 5 days, still too inefficient.

Attempt #3: Migrating to Hadoop/Pig

However, we were undeterred; we still wanted to compute PageRank over the entire dataset with no dead ends. The single script approach was extremely inefficient and slow. It wasn’t fit for BIG DATA. Instead, we turned to Hadoop and Amazon Elastic MapReduce.

Hadoop is an open-source MapReduce framework that handles intensive data computations, typically across large clusters, although it can also run locally. It supports many languages and tools to ease usage. Amazon Elastic MapReduce is an Amazon service that hosts the Hadoop framework on EC2 instances created solely for each MapReduce job and uses S3 (Simple Storage Service) for data input and output. Mongo-Hadoop is the library that let us use MongoDB (and BSON dumps) as the input/output source for these MapReduce jobs.

We submitted jobs to Elastic MapReduce through both the UI and the command line tool elastic-mapreduce, with our access ID and secret key.

First, we wrote a Pig script, explained below, to eliminate dead ends.

-- Registers the mongo and hadoop jars from our bucket
REGISTER s3://memes-bson/mongo-2.10.1.jar
REGISTER s3://memes-bson/mongo-hadoop-pig_1.1.2-1.1.0.jar
REGISTER s3://memes-bson/mongo-hadoop-core_1.1.2-1.1.0.jar
REGISTER s3://memes-bson/mongo-hadoop_1.1.2-1.1.0.jar
-- There's a User Defined Function in myudf.MYBAG which needs to be used
REGISTER s3://memes-bson/myudf.jar

original =  LOAD '$INPUT'
            USING com.mongodb.hadoop.pig.BSONLoader;

outs =      FOREACH original GENERATE $0#'_id' AS url, $0#'value'#'links' AS links;

-- Dead ends will not appear in showing as links is empty. url is the origin of the link while 
-- single is the destination. myudf.MYBAG turns the tuple of links to a bag of tuple of each link.
showing =   FOREACH outs GENERATE url, FLATTEN(myudf.MYBAG(links)) AS single;

-- JOIN will eliminate links that go to a single which doesn't exist in our dataset
joined =    JOIN outs BY url, showing BY single;

project =   FOREACH joined GENERATE showing::url AS url, showing::single AS single;

-- Group together the urls such that they form the original scheme of a url and an array of links 
together =  GROUP project BY url;

result =    FOREACH together GENERATE $0 AS url, $1.single AS links;

STORE result INTO '$OUTPUT'
    USING com.mongodb.hadoop.pig.BSONStorage;

The Pig script above removes all current dead ends from the dataset. The problem is that removing dead ends can create new ones. For example, in the simple graph of

A -> B -> C -> D

D is the only dead end in the graph. But when D is removed, we have

A -> B -> C

So we would keep on removing the new dead ends until there were no new dead ends. In this particular “pathological” graph, the entire graph would be removed because all of the nodes in the graph would eventually be dead ends. Fortunately, most datasets aren’t linear graphs.

We had to figure out a way to run the Pig script above repeatedly until no more dead ends remained. Elastic MapReduce only allows for a single Pig script execution, so we wrote a Bash script, removeAllDeadEnds.sh, that kept resubmitting the Pig script until the output file size stopped decreasing. The script used s3cmd to check file sizes on S3 and elastic-mapreduce-ruby to submit jobs. In theory, the output file size decreases if and only if some dead ends have been removed.

After about 10 iterations, the script was only erasing two or three dead ends per iteration. We let it continue until we stopped it at 70 iterations, which took over 9 hours; at 8 normalized instance hours on m1.xlarge, each iterative job finished in about 8 minutes on average. The initial BSON size of the dataset with dead ends was 606.9MB; the final BSON size, with only 2 dead ends remaining, was 448.2MB. We decided that further iterations would change the result only trivially, so we simply moved ahead. We ended up with 1,113,524 total nodes.


PageRank in Elastic MapReduce on the dataset

At this point, we had two collections:

  1. NO_DEAD_ENDS which had 1,113,524 nodes, 2 of which were dead ends.

  2. ORIGINAL which had 36,814,086 nodes, around 75% of which were dead ends.

Whereas MongoDB's built-in MapReduce converged quickly on the Flights dataset, the sheer size of these collections drove us to use Amazon's Elastic MapReduce to compute the PageRank of the nodes in the graph.

NO_DEAD_ENDS
Preformat

First, we had to preformat the graph to suit the PageRank program. This involved changing the schema, using a Hadoop Job written in Java, to:

{
    "_id" : "001rian.blogspot.com/2008_07_01_archive.html",
    "ptr" : 0.2,
    "pg" : 8.98049795e-7,
    "links" : [
        "001rian.blogspot.com/2008/07/sanra-dewi-ikutan-bisnis-interne.html",
        "001rian.blogspot.com/2008/07/peduli-dengan-pagerank-dan-traffic.html",
        "001rian.blogspot.com/2008/08/88808-08-2008.html",
        "001rian.blogspot.com/2008/07/jadwal-puasa-dan-imsak-ramadhan-1429-h.html",
        "001rian.blogspot.com/2008/07/buku-elektronik-untuk-sekolah.html"
    ]
}
  • "pg" : 6.081907480807019e-8 was the initial PageRank of all nodes in a graph, which corresponds to the reciprocal of the total number of nodes in the graph.
  • ptr was the probability of this node going to any other node it links to, which corresponds to the reciprocal of the length of the links array.
PageRank

The structure of the Elastic MapReduce Java program for PageRank was similar to the MongoDB MapReduce program written for the Flights dataset.

However, instead of setting all of the arguments for the Hadoop job in a Bash script, and because each iteration (one Map and one Reduce) was its own job, the jobs had to be created continuously and their variables set dynamically. For example, the output of the previous job is set as the input of the job for the current iteration number:

int last = iteration - 1;
FileInputFormat.addInputPath(job, new Path(outputString + last));

// mapred.output.dir is the current configuration variable for the output path in Mongo-Hadoop.
job.getConfiguration().set("mapred.output.dir", outputString + iteration);

Again, the stopping criterion for PageRank is when the average percentage change of a node, the `residual` or the `diff`, drops below 0.1%. Instead of outputting the individual node diffs in Reduce and aggregating over the entire output to sum up the diff as in the MongoDB MapReduce for Flights, we used the Hadoop counter to sum the residuals per Reduce call. The Hadoop counter is an atomic variable that is accessible by all Mappers and Reducers and it will track any statistic or variable to be read after the job is completed.

// Need this 10E6 because long only takes whole numbers
context.getCounter(PageRank.RanCounters.RESIDUAL).increment((long) (residual * 10E6));

Therefore, after each job completed, we checked the total residual to determine whether it was under the threshold. Our stopping criterion is again that the summed residual drops below 0.001 * n, where n is the number of nodes. In this case, PageRank finished after 16 iterations, in 2 hours and 30 minutes with 7 m1.xlarge instances.

ORIGINAL

Another implementation of the PageRank algorithm didn’t delete all dead ends from the graph, but instead connected all dead ends to every other node in the graph. This created a strongly connected graph but it also increased the noise in the graph. Since the resulting number of nodes after erasing dead ends is only 1 million out of the original 15,312,738, we wanted to see how the PageRank results would change if all of the dead ends were included.

Notice that the collection imported from the text file has only 15,312,738 nodes, whereas we accounted for 36,814,086 nodes in the ORIGINAL collection. The extra 21,501,348 nodes appear as links inside the original 15,312,738 documents but were never documents themselves in the imported collection. Rather than shrinking the graph, as erasing dead ends did, making the graph strongly connected grew it by 21,501,348 extra dead-end nodes.

However, there’s no need to actually create edges between every dead-end node and every other node (with 27 million dead ends, that would create roughly 27 million * 36 million ≈ 972 trillion links). Instead, we simply distributed the sum of the PageRank held by all dead ends to every other node. Here are our implementation ideas:

  1. The first (implemented) idea was to sum the PageRank of all dead-end nodes in the Mapper and then distribute it among all nodes in the Reducer while summing up the incoming PageRank. This was not feasible, however, because Hadoop counters incremented in the Mapper still read zero in the Reducer: Mappers and Reducers execute simultaneously, so counter values are only finalized after the job is done.

  2. To solve this, we waited until the end of a job to determine the total PageRank held by dead ends; the final PageRank and residual were then calculated in the Mapper of the next iteration.

The main function, which submits jobs and waits for completion, retrieves the job’s total PageRank from dead ends and passes it as a configuration variable to the next job.

long prevDeadEndsPG = prevJob.getCounters().findCounter(PageRank.RanCounters.DEAD_END_PG).getValue();
currentJob.getConfiguration().setLong("deadEndsPG", prevDeadEndsPG);

Then, in the Mapper step, we add this deadEndsPG divided by the total number of nodes (the probability of any dead end jumping to this node). We compute the residual using the previous PageRank value and add it to the residual counter. This way, the final PageRank value of a node for that iteration is determined in the Mapper instead of the Reducer.

// fetch the dead ends PG from the configuration (stored as a long, scaled by 10E10
// so that it survives as a whole number)
long deadEndsPG = context.getConfiguration().getLong("deadEndsPG", 0L);

// convert back to a double and spread it over all nodes: this is the probability
// of any dead end jumping to this particular node
double doubleDeadEndsPG = ((double) deadEndsPG) / 10E10;
double distributeDeadEndsPG = doubleDeadEndsPG / PageRank.totalNodes;

// PageRank.beta is the taxation parameter (0.9); PageRank.distributedBeta is the
// taxed share that every node receives regardless of its incoming links
currentPG = PageRank.beta * (currentPG + distributeDeadEndsPG) + PageRank.distributedBeta;

double residual = Math.abs(prevpg - currentPG) / prevpg;

context.getCounter(PageRank.RanCounters.RESIDUAL).increment((long) (residual * 10E6));

This PageRank computation on ORIGINAL took 17 iterations over 7 hours and 12 minutes with 8 m1.xlarge instances.

RESULTS

The following is an interpretation of the results obtained after running the PageRank algorithm over the 2 collections above.

NO_DEAD_ENDS

For the dataset with no dead ends, the 10 web pages with the most PageRank are:

1. {"pg":0.020741859913578454
   ,"url":"http://slideshare.net/youtube-in-slideshare"
   ,"quotes":["insert and publish","before slide 1","after slide 1","after slide 2","after slide 3"]}
2. {"pg":0.01490199050574318
   ,"url":"http://slideshare.com/youtube-in-slideshare"
   ,"quotes":["insert and publish","before slide 1","after slide 1","after slide 2","after slide 3"]}
3. {"pg":0.00542114032291505
   ,"url":"http://london.kijiji.ca/f-buy-and-sell-w0qqcatidz10"
   ,"quotes":[]}
4. {"pg":0.005381224128322537
   ,"url":"http://badgerandblade.com/index.php?page=terms"
   ,"quotes":["badger and blade","b amp b"]}
5. {"pg":0.00328534940037117
   ,"url":"http://saintjohn.kijiji.ca/f-buy-and-sell-w0qqcatidz10"
   ,"quotes":[]}
6. {"pg":0.00301961022829243
   ,"url":"http://london.kijiji.ca/c-buy-and-sell-business-industrial-salon-equipment-w0qqadidz115815478"
   ,"quotes":[]}
7. {"pg":0.0028168240288373365
   ,"url":"http://dealsofamerica.com/terms.php"
   ,"quotes":[]}
8. {"pg":0.0025406641926389753
   ,"url":"http://london.kijiji.ca/c-buy-and-sell-cds-dvds-vhs-house-seasons-1-4-w0qqadidz123632361"
   ,"quotes":[]}
9. {"pg":0.0024984791525017504
   ,"url":"http://answerbag.com/c_view/3602"
   ,"quotes":[]}
10. {"pg":0.0021795435717848356
    ,"url":"http://chacha.com/entertainment-arts/music"
    ,"quotes":["up where they play all day in the sun","stay all day in the sun","stay or leave i want you not to go but you did","sometimes goodbye is a second chance"]}

It’s not surprising that http://slideshare.net/youtube-in-slideshare and http://slideshare.com/youtube-in-slideshare have the most PageRank. Around the beginning of 2009, SlideShare released a new feature to enable users to embed YouTube videos in their presentations. At the time, this feature was in Beta. FAQs, examples, and other details were posted on both http://www.slideshare.net/youtube-in-slideshare and http://www.slideshare.com/youtube-in-slideshare. Since this was a new feature (that a lot of people were excited about!), lots of reputable bloggers posted a link to these FAQ pages to showcase the power of the new SlideShare feature. As a result, these FAQ pages accumulated most of the PageRank.

Another interesting observation is that 4 of the top-PageRank web pages come from kijiji, a centralized network of online urban communities for posting local online classified advertisements. The reason kijiji accumulated a lot of PageRank is that there’s a lot of intra-domain (but inter-subdomain) linking. For example, lots of pages on london.kijiji.ca link to vacation.kijiji.ca, which links back to london.kijiji.ca. Such intra-domain linking creates a web structure called a spider trap that accumulates a large portion of the PageRank available to the whole system of pages. Furthermore, about 5% of the entire NO_DEAD_ENDS collection consists of kijiji web pages.

ORIGINAL
1. {"_id" : "craigslist.org/about/scams.html"
   , "pg" : 0.018243114523103326
   , "links" : []}
2. {"_id" : "slideshare.net/youtube-in-slideshare"
   , "pg" : 0.003038463243965542
   , "links" : ["slideshare.net/youtube-in-slideshare"]
   , "quotes" : ["insert and publish","before slide 1","after slide 1","after slide 2","after slide 3"]}
3. {"_id" : "slideshare.com/youtube-in-slideshare"
   , "pg" : 0.002161141838313388
   , "links" : ["slideshare.com/youtube-in-slideshare"]
   , "quotes" : ["insert and publish","before slide 1","after slide 1","after slide 2","after slide 3"]}
4. {"_id" : "ad.doubleclick.net/clk;212944738;23941492;n?goto.canon-asia.com/oic"
   , "pg" : 0.0015214745797758247
   , "links" : []
   , "quotes" : []}
5. {"_id" : "mx.answers.yahoo.com/info/disclaimer"
   , "pg" : 0.0013631525163117727
   , "links" : []
   , "quotes" : []}
6. {"_id" : "ar.answers.yahoo.com/info/disclaimer"
   , "pg" : 0.0013542983079855681
   , "links" : []
   , "quotes" : []}
7. {"_id" : "it.answers.yahoo.com/info/disclaimer"
   , "pg" : 0.0011670409020562926
   , "links" : []
   , "quotes" : []}
8. {"_id" : "fr.answers.yahoo.com/info/disclaimer"
   , "pg" : 0.001083113456512683
   , "links" : []
   , "quotes" : []}
9. {"_id" : "seroundtable.com"
   , "pg" : 0.0009033963316740201
   , "links" : []
   , "quotes" : []}
10. {"_id" : "de.answers.yahoo.com/info/disclaimer"
    , "pg" : 0.0006914069352292967
    , "links" : []
    , "quotes" : []}

These results differed significantly from the results of the NO_DEAD_ENDS but for good reasons.

  • The PageRank values for these web pages are significantly lower than those of the top pages in NO_DEAD_ENDS because this graph has roughly 30x more pages.
  • The PageRank of the Craigslist scams information page is 6x as much as the next highest PageRank value. This is because every Craigslist page links to the scam page, creating a spider trap. Similarly, the Yahoo! Answers disclaimer pages accumulated a lot of PageRank because most Yahoo! forum pages link to one of the disclaimer pages.
  • The curious entry above is the ad.doubleclick.net URL. This URL is no longer available. But since doubleclick was an online ad platform, it’s possible that this particular ad was either served the most or gathered the most attention.
SubDomains

Looking at subdomains, we see a lot of entries for websites that have no URLs in the top 10 list. Sites like Twitter, Blogspot, and English Wikipedia are ranked among the top 25. It’s reasonable to assume that links to those websites aren’t purely spam or disclaimers. The D3 bubble chart below comprises the subdomains with the top 25 PageRanks.

Domains

By summing up the PageRank by domains instead of URLs, Yahoo! surpasses SlideShare. This makes sense as Yahoo! is quite spread out among its subdomains.

Lessons Learned

Bulk Insertion Speeds

The Flights dataset was inserted a single document at a time, as we didn’t know about bulk insertions yet. However, pymongo allows for bulk insertions, which are a much preferred and faster method for building large collections. We used bulk insertions for the Twitter Memes dataset with a default batch size of 1000 docs. Here is how long it took inputMongo.py to finish inserting the original 15,312,738 docs into the Twitter Memes collection:

Bulk insert array size    Elapsed time (seconds)    Elapsed time (hr:min:sec)
1                         15007.014883              4:10:07
10                        7781.75180                2:09:41
100                       7332.346791               1:52:12
1000                      6493.35044885             1:48:13
10000                     Error: size too big for insert

The elapsed time dropped significantly between single insertions and batches of 10, but interestingly, the gains tapered off as the batch size increased further. The maxMessageSizeBytes value is 48 million bytes, and a bulk insertion of 10,000 documents exceeded this limit. This error occurred in the Python driver; some of the other drivers split the array into 16MB chunks, which avoids it.
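
For illustration, here is the same batching idea in mongo shell syntax (the collection name memes and the docs array of parsed documents are my assumptions; the shell's insert() accepts an array of documents for a bulk insert):

    var BATCH = 1000;
    for (var i = 0; i < docs.length; i += BATCH) {
        // one insert call per batch of up to 1,000 documents
        db.memes.insert(docs.slice(i, i + BATCH));
    }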

PageRank != Relevance

The results above show that the pages with the most PageRank are often the disclaimers and information pages, which probably isn’t what interests people most of the time. It turns out that Google’s Search algorithm is more complex than our simplistic version of PageRank. For instance, Google takes into account which links people click on for a certain query, thus boosting those URLs’ relevance for that query. Google also filters out some spam links and ad sites. In addition, some websites made with many internal links to intentionally boost their PageRank will have their PageRank values docked as part of Google’s campaign to disincentivize the intentional manipulation of PageRank.

Indexing URLs

Since each document has a unique URL, we created a unique index on this field. Fortunately, all the URLs in this dataset are well within the 1024-byte limit for index entries. An alternative, more compact way to index URLs is hashing, and MongoDB supports hashed indexes.
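
For example (collection name assumed), a hashed index stores a fixed-size hash of each URL, so even very long URLs stay far below the index entry limit:

    // requires MongoDB 2.4+; note that hashed indexes cannot enforce uniqueness,
    // so the unique constraint would still live on the plain url index
    db.memes.ensureIndex({ url: "hashed" })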

Going with Go

Sep 5 • Posted 12 months ago

As you may have read previously on the blog, the MongoDB team is adopting the Go language for a variety of projects. At last month’s New York MongoDB User Group, Sam Helman presented on how MongoDB is using Go for new and existing cloud tools. In Sam’s talk, you’ll learn how MongoDB is using Go for the backup capabilities in MongoDB Management Service and a new continuous integration tool.

Why go with Go? Between the lightweight syntax, the first-class concurrency and the well documented, idiomatic libraries such as mgo, Go is a great choice for writing anything from small scripts to large distributed applications.

Thanks to g33ktalk for recording Sam’s talk.

PageRank on Flights Dataset

Sep 3 • Posted 12 months ago

By Sweet Song and Daniel Alabi, MongoDB Summer Interns for 2013

This is the second of three blog posts from this summer internship project showing how to answer questions concerning big datasets stored in MongoDB using MongoDB’s frameworks and connectors.

Having done some basic analysis on the Flights dataset (mostly using the MongoDB aggregation framework), we moved on to do some more advanced analysis on this dataset. We settled on computing the PageRank of all airports in the Flights dataset. The PageRank of nodes in a network is often computed iteratively. This process can easily be parallelized and often is. We can utilize MongoDB to compute the PageRank of nodes in a network in several ways. Here are the two options we considered:

  1. We can use the MongoDB MapReduce framework, which since version 2.4 uses the V8 JavaScript engine. Furthermore, since MongoDB is known for its robust sharding capabilities, we can increase the performance of query operations by setting up a MongoDB sharded cluster for our dataset. This is essential for really large working datasets.

  2. The Hadoop open-source framework is well-known for its robust distributed data processing features. MongoDB interfaces with Hadoop via the Mongo-Hadoop connector.

For this particular dataset, we opted for (1) since the Flights dataset has only 319 airports. Regardless, there were 4,601 total weighted edges among USA commercial airports. The weight of an edge between any two airports was calculated using the 6,155,752 recorded trips in the flights collection.

Making a Graph of Airports

The airports graph is fairly well connected: only one airport received flights in the past year without having any domestic departures. Most flights out of Puerto Rico are considered international flights; as a result, our dataset didn’t record any domestic flights out of that airport in that year. It would be a black hole for any PageRank that reaches it (a more thorough explanation can be found here), so we removed that single Puerto Rico airport from our airports graph.

From the previous analysis, we had put the Flights dataset in a flights collection in the flying database, with one document per recorded domestic flight.

For each document, we create (or modify) at least one node that keeps track of this “edge”.

In these node documents, NUM_OF_AIRPORTS_IN_DB is the total number of airports in the Flights dataset, which corresponds to the number of nodes in the network. NUM_OF_FLIGHTS_FROM_12478 is the total number of flights leaving from the airport with airportId=12478. NUM_OF_FLIGHTS_FROM_12478_TO_12892 is the number of flights that leave the airport with airportId=12478 and arrive at the airport with airportId=12892. pg is the current PageRank of an airport; prs is a Map of <aId, pr> where pr is the probability of a flight going from the airport specified by _id to the airport identified by aId. For example, NUM_OF_FLIGHTS_FROM_12478_TO_12892/NUM_OF_FLIGHTS_FROM_12478 is the probability of transitioning from the airport with airportId=12478 to the airport with airportId=12892.
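
Based on that description, a node document in the graph collection might look roughly like the following sketch (the numbers are illustrative, not actual data; the initial pg is the reciprocal of the number of nodes, 1/318 here):

    {
        "_id" : 12478,
        "pg" : 0.0031446540880503146,
        "prs" : {
            "12892" : 0.25,
            "10397" : 0.75
        }
    }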

We wrote preformat.py to create the graph that contains information about the probability of every node in the graph transitioning to another. The resulting graph was stored in an fpg_0 collection (Flights PageRank 0) with 318 nodes.

MongoDB MapReduce

Next, we wrote some JavaScript code to calculate PageRank on the graph stored in the database. The goal was to create a new collection fpg_i for every ith iteration of PageRank. Every iteration is a call to oneiteration() in iteration.js and consists of a map and a reduce function. The PageRank algorithm stops once the average percentage change of the PageRank values across all nodes drops below 0.1%. The map function works as follows:

The map function considers every document (corresponding to an airport) in the current fpg_i. For each airport (call this x), it emits its airport ID (stored in _id) and passes the prs and prevpg (previous pg) information, for use in the next iteration of PageRank. Then, it passes a portion of x's PageRank to every airport that x links to.
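
The original listing from iteration.js is not reproduced here; a minimal sketch of a map function along those lines (field names taken from the node format above, with later fpg_i collections wrapping the fields in a "value" subdocument as mapReduce output does) might be:

    var map = function () {
        var node = this.value || this;   // fpg_0 stores fields at the top level
        // carry the node's own structure forward, and remember its previous PageRank;
        // keys are emitted as strings so they match the string keys of prs
        emit(String(this._id), { prs: node.prs, prevpg: node.pg, pg: 0 });
        // hand a share of this node's PageRank to every airport it links to
        for (var aId in node.prs) {
            emit(aId, { prs: null, prevpg: null, pg: node.pg * node.prs[aId] });
        }
    };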

The reduce function has two duties:

  1. Collect the prs and prevpg information for each node;

  2. Accumulate the total PageRank score sent to each node.
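
Again only as a sketch, a reduce function fulfilling those two duties could look like this; a real implementation would also fold in the taxation term and the convergence bookkeeping, both omitted here for brevity:

    var reduce = function (key, values) {
        var result = { prs: null, prevpg: null, pg: 0 };
        values.forEach(function (v) {
            if (v.prs !== null) {          // the record emitted by the node itself
                result.prs = v.prs;
                result.prevpg = v.prevpg;
            }
            result.pg += v.pg;             // accumulate the PageRank sent by other airports
        });
        return result;
    };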

Finally, db["fpg_"+i].mapReduce(map, reduce, {out: "fpg_"+(i+1)}); runs MapReduce on the fpg_i collection using the map and reduce functions described above and stores the result (in the same format as fpg_i) into fpg_(i+1).

We keep applying the MapReduce operations until the PageRank of the nodes eventually converges. This happens when the average percentage change of pg for each node is less than a certain threshold (0.1% in our case). The execution of our implementation of the PageRank algorithm took 6.203 seconds, having converged after 20 iterations.
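
Tying it together, the outer loop might look like the sketch below, where avgPctChange stands for a helper (not shown) that compares the pg values in two consecutive collections:

    var i = 0;
    do {
        db["fpg_" + i].mapReduce(map, reduce, { out: "fpg_" + (i + 1) });
        i++;
    } while (avgPctChange("fpg_" + (i - 1), "fpg_" + i) > 0.001);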

PageRank Result and Interpretation

The 10 airports with the most PageRank are:

The outcome matches our intuition that the airports with the most flights would accumulate most of the PageRank. In general, the nodes in a weighted graph with the most PageRank will be the ones with a greater ratio of incoming weight to outgoing weight.

Below is a map of the USA that illustrates the PageRank of all airports in the Flights dataset. Click on the image below to see the interactive map. The bigger the circle on the airport, the larger its PageRank. Hover around a circle to see the full name of an airport, its airport code, and the percentage of the total PageRank the airport accumulated.

Challenges/Lessons Learned

Insert vs. Update

Initially, we envisioned iteration.js merely updating the pg and prevpg fields in a single PageRank collection instead of outputting to a new collection. However, updates were significantly slower than inserts into a new collection, even though we already had indexes on the pg and prevpg fields. We learned that, in general, updates in really large collections are significantly slower than insertions into a new collection. This preference for inserts over updates recurred in our other attempts.

Flights Dataset has no information on International Flights

Only domestic flights are present in our Flights dataset. If international flights were included, perhaps JFK, O’Hare, and San Francisco would have accumulated the most PageRank. Also, our map does not show the US states and territories of Alaska, Hawaii, and Guam; if they were included, the continental USA would have been too small to distinguish between individual airports.

Relatively small number of nodes in graph

Even though our initial Flights dataset contained 6,155,748 documents (corresponding to domestic flights), the resulting airports graph had only 318 documents (corresponding to airports/nodes). This is why the MongoDB MapReduce framework was very fast, converging in a few seconds and only 20 iterations. It would likely take longer to converge on a dataset with more nodes (more airports).

The next dataset we’ll use is the Twitter Memes dataset. This dataset will have at least 1 million nodes (after pre-processing) that correspond to web pages on the Internet. Performance analysis based on the PageRank algorithm is more easily done on datasets with more nodes.

The MongoDB Java Driver 3.0: What’s Changing

Aug 30 • Posted 1 year ago

By Trisha Gee, MongoDB Java Engineer and Evangelist

In the last post, we covered the design goals for the new MongoDB Java Driver. In this one, we’re going to go into a bit more detail on the changes you can expect to see, and how to start playing with an alpha version of the driver. Please note, however, that the driver is still a work in progress, and not ready for production.

New features

Other than the overall changes to design detailed above, the 3.0 driver has the following new features:

  • Pluggable Codecs: This means you can do simple changes to serialisation/deserialisation, like tell the driver to use Joda Time instead of java.util.Date, or you can take almost complete control of how to turn your Java objects into BSON. This should be particularly useful for ODMs or other libraries, as they can write their own codecs to convert Java objects to BSON bytes.
  • Predictable cluster management: We’ve done quite a lot of work around discovering the servers in your cluster and determining which ones to talk to. In particular, the driver doesn’t have to wait for all servers to become available before it can start using the ones that are definitely there - the design is event-based so as soon as a server notifies the driver of its state the driver can take appropriate action - use it if it’s active, or start ignoring it if it’s no longer available.
  • Additional Connection Pool features: We’ve added support for additional connection pool settings, and a number of other improvements around connection management. Here’s the full list.
  • Deprecated methods/classes will be removed: In the next 2.x release a number of methods and classes will be deprecated. These, along with existing deprecated methods, will be removed in the 3.0 driver. This should point you in the right direction to help you migrate from 2.x to 3.x.

Speaking of Migration…

We’ve worked hard to maintain backwards compatibility whilst moving forwards with the architecture of the Java driver for MongoDB. We want to make migration as painless as possible; in many cases it should be a simple drop-in replacement if you want to keep using the existing API. We hope to provide a step-by-step guide to migrating from 2.x to 3.0 in the very near future. For now, it’s worth mentioning that upgrading will be easiest if you update to 2.12 (to be released soon), migrate any code that uses deprecated features, and then move to the compatible mode of the new driver.

Awesome! Can I try it?

Yes you can! You can try out an alpha of the new driver right now, but as you’d expect there are CAVEATS: this is an alpha, it does not support all current features (notably aggregation); although it has been tested it is still in development and we can’t guarantee everything will work as you expect. Features which have been or will be deprecated in the 2.x driver are missing completely from the 3.0 driver. Please don’t use it in production. However, if you do want to play with it in a development environment, or want to run your existing test suite against it, please do send us any feedback you have.

If you want to use the compatible mode, with the old API (minus deprecations) and new architecture:

Maven

Gradle

You should be able to do a drop-in replacement with this dependency - use this instead of your existing MongoDB driver, run it in your test environment and see how ready you are to use the new driver.

If you want to play with the new, ever-changing, not-at-all-final API, then you can use the new driver with the new API. Because we wanted to be able to support both APIs and not have a big-bang switchover, there’s a subtle difference to the location of the driver with the updated API, see if you can spot it:

Maven

Gradle

Note that if you use the new API version, you don’t have access to the old compatible API.

Of course, the code is on GitHub

In Summary

For 3.0, we will deliver the updated, simplified architecture with the same API as the existing driver, as well as working towards a more fluent style of API. This means that although in future you have the option of using the new API, you should also be able to do a simple drop-in replacement of your driver jar file and have the application work as before.

A release date for the 3.0 driver has not been finalized, but keep your eyes open for it.

All Hail the new Java driver!

Faceted Search with MongoDB

Aug 30 • Posted 1 year ago

By Jon Rangel, MongoDB Consulting Engineer

Introduction

Faceted search, or faceted navigation, is a way of browsing and searching for items in a set of data by applying filters on various properties (facets) of the items in the collection. It is increasingly seen as an important part of the UI for many search platforms, and indeed nowadays is pretty much expected in places such as e-commerce websites.

Faceted search makes it easy for users to navigate to the specific item or items they are interested in. It complements more free-form keyword search by facilitating exploration and discovery and is therefore useful when a user may not know the specific keywords they wish to search on.

Some core functionality that a faceted search feature should provide might include:

  • finding the items that match a particular value of a certain facet (e.g. colour:blue)
  • finding the items in the intersection of multiple facet values (e.g. colour:blue AND size:medium)
  • finding the items in the union of multiple facet values (e.g. colour:blue OR colour:red OR size:large)
  • for each possible facet filter combination, display to the user the possible facet values on which it is possible to filter further (“drill down”)
  • for each facet value on which it is possible to drill down, display to the user the count of items matching that filter.

In this article, we’ll look at implementing the above faceted search functionality using a pure MongoDB solution. We’ll examine a number of approaches to solving this problem, and discuss their relative performance characteristics and any other pros/cons. We will also introduce some third party tools that, alternatively, can integrate with MongoDB to provide faceted search functionality.

Navigating a Book Store

Suppose we want to build faceted search functionality for a product catalog for a book store. A typical document representing a publication in the catalog might look something like the following:

  {
        _id : 123,
        title : "MongoDB: The Definitive Guide",
        authors : [ "Kristina Chodorow" ],
        publication_date : ISODate("2013-05-23"),
        pages : 432,
        edition : 2,
        isbn_10 : 1449344682,
        isbn_13 : "978-1449344689",
        language : "English",
        publisher : {
            name: "O’Reilly Media",
            ...
        },
        last_updated : ISODate("2013-05-16"),
        ...
    } 

First off, let’s state some reasonable assumptions about the facets for this (or indeed any other) catalog:

  • The total number of facets will be small.
  • The total number of possible facet values for each facet may be large, but will typically be small.
  • Each item in the catalog may have zero or more facet values (“tags”) for each facet (but typically one).
  • The facets are well-known up front, and change rarely if at all. The set of facet values may change frequently i.e. any time the product catalog is updated to add/remove items, or change the tags on existing items.
  • The application has knowledge of the facets being used, but not the set of all possible facet values that exist in the catalog for each of those facets.

For this example, let’s say we have three facets on which we wish to search — Subject, Publisher and Language — and consider how to search efficiently, and how to generate the faceted navigation meta-data to present to the user. We will test on some pre-generated test data based on a real-world product catalog.

Searching

The first part of the problem to solve is how to efficiently search for items in the product catalog. A few schema and indexing approaches are presented below.

Solution #1

One way to define the facet tags for a publication would be to store all facet types and values in subdocuments in an array, as follows:

  {
        _id: 123,
        ...
        facets1 : [
            {
                type : "subject",
                val : "MongoDB"
            },
            {
                type : "subject",
                val : "Databases"
            },
            {
                type : "publisher",
                val : "O'Reilly Media"
            },
            {
                type : "language",
                val : "English"
            }
        ]
    } 

A single ‘generic’ compound index can then be created containing all the facets and facet values:

 > db.books.ensureIndex({"facets1.type" : 1, "facets1.val" : 1})
    > db.books.stats()
    {
        "ns" : "test.books",
        "count" : 105280,
        "size" : 109597152,
        "avgObjSize" : 1041.0063829787234,
        ...
        "totalIndexSize" : 29891456,
        "indexSizes" : {
            "_id_" : 3433920,
            "facets1.type_1_facets1.val_1" : 26457536
        },
        "ok" : 1
    }

See this blog post for a good treatment on these kinds of generic indexes.

Let’s see how this performs for some faceted searches, using explain(). We’ll look at queries on a single facet tag to start with.

Find all books about databases:

  > db.books.find(
    ...     { "facets1" : { $elemMatch : { "type" : "subject", "val" : "Databases" } } }
    ... ).explain()
    {
        "cursor" : "BtreeCursor facets1.type_1_facets1.val_1",
        "isMultiKey" : true,
        "n" : 7315,
        "nscannedObjects" : 7315,
        "nscanned" : 7315,
        "nscannedObjectsAllPlans" : 7315,
        "nscannedAllPlans" : 7315,
        "scanAndOrder" : false,
        "indexOnly" : false,
        "nYields" : 0,
        "nChunkSkips" : 0,
        "millis" : 27,
        "indexBounds" : {
            "facets1.type" : [
                [
                    "subject",
                    "subject"
                ]
            ],
            "facets1.val" : [
                [
                    "Databases",
                    "Databases"
                ]
            ]
        },
        "server" : "rangel.lan:27017"
    }

Find all books by a specific publisher:

  > db.books.find(
    ...     { "facets1" : { $elemMatch : { "type" : "publisher", "val" : "O'Reilly Media" } } }
    ... ).explain()
    {
        "cursor" : "BtreeCursor facets1.type_1_facets1.val_1",
        "isMultiKey" : true,
        "n" : 39960,
        "nscannedObjects" : 39960,
        "nscanned" : 39960,
        "nscannedObjectsAllPlans" : 39960,
        "nscannedAllPlans" : 39960,
        "scanAndOrder" : false,
        "indexOnly" : false,
        "nYields" : 0,
        "nChunkSkips" : 0,
        "millis" : 133,
        "indexBounds" : {
            "facets1.type" : [
                [
                    "publisher",
                    "publisher"
                ]
            ],
            "facets1.val" : [
                [
                    "O'Reilly Media",
                    "O'Reilly Media"
                ]
            ]
        },
        "server" : "rangel.lan:27017"
    }

Both of these queries use the index optimally as the number of documents returned is the same as the number of documents scanned (nscanned is the same as n).

How about queries for documents matching the union or intersection of multiple facet values? To do these “and”/”or” queries we use the $all/$in operators respectively.

Find all books about databases OR published by O’Reilly Media:

  > db.books.find(
    ...     { "facets1" :
    ...         { "$in" : [
    ...             { $elemMatch : { "type" : "publisher", "val" : "O'Reilly Media" } },
    ...             { $elemMatch : { "type" : "subject", "val" : "Databases" } }
    ...         ]}
    ...     }
    ... ).explain()
    Fri Aug 16 15:59:04.989 JavaScript execution failed: error: 
    { "$err" : "$elemMatch not allowed within $in", "code" : 15881 } at src/mongo/shell/query.js:L128

Oops! This type of search doesn’t work using $in to construct the query as we cannot use the $elemMatch operator within a $in clause. This query can instead be constructed using the $or operator:

  > db.books.find(
    ...     { "$or" : [
    ...             { "facets1" : { $elemMatch : { "type" : "publisher", "val" : "O'Reilly Media" } } },
    ...             { "facets1" : { $elemMatch : { "type" : "subject", "val" : "Databases" } } }
    ...         ]
    ...     }
    ... ).explain()
    {
        "clauses" : [
            {
                "cursor" : "BtreeCursor facets1.type_1_facets1.val_1",
                "isMultiKey" : true,
                "n" : 40019,
                "nscannedObjects" : 40019,
                "nscanned" : 40019,
                "nscannedObjectsAllPlans" : 40019,
                "nscannedAllPlans" : 40019,
                "scanAndOrder" : false,
                "indexOnly" : false,
                "nYields" : 0,
                "nChunkSkips" : 0,
                "millis" : 118,
                "indexBounds" : {
                    "facets1.type" : [
                        [
                            "publisher",
                            "publisher"
                        ]
                    ],
                    "facets1.val" : [
                        [
                            "O'Reilly Media",
                            "O'Reilly Media"
                        ]
                    ]
                }
            },
            {
                "cursor" : "BtreeCursor facets1.type_1_facets1.val_1",
                "isMultiKey" : true,
                "n" : 6640,
                "nscannedObjects" : 7374,
                "nscanned" : 7374,
                "nscannedObjectsAllPlans" : 7374,
                "nscannedAllPlans" : 7374,
                "scanAndOrder" : false,
                "indexOnly" : false,
                "nYields" : 1,
                "nChunkSkips" : 0,
                "millis" : 123,
                "indexBounds" : {
                    "facets1.type" : [
                        [
                            "subject",
                            "subject"
                        ]
                    ],
                    "facets1.val" : [
                        [
                            "Databases",
                            "Databases"
                        ]
                    ]
                }
            }
        ],
        "n" : 46659,
        "nscannedObjects" : 47393,
        "nscanned" : 47393,
        "nscannedObjectsAllPlans" : 47393,
        "nscannedAllPlans" : 47393,
        "millis" : 242,
        "server" : "rangel.lan:27017"
    }

This query is pretty optimal: the number of documents scanned is only slightly more than the number returned, and the index is used for both parts of the “or” statement.

Next, find all books about databases AND published by O’Reilly Media:

 > db.books.find(
    ...     { "facets1" :
    ...         { "$all" : [
    ...             { $elemMatch : { "type" : "publisher", "val" : "O'Reilly Media" } },
    ...             { $elemMatch : { "type" : "subject", "val" : "Databases" } }
    ...         ]}
    ...     }
    ... ).explain()
    {
        "cursor" : "BtreeCursor facets1.type_1_facets1.val_1",
        "isMultiKey" : true,
        "n" : 675,
        "nscannedObjects" : 39960,
        "nscanned" : 39960,
        "nscannedObjectsAllPlans" : 39960,
        "nscannedAllPlans" : 39960,
        "scanAndOrder" : false,
        "indexOnly" : false,
        "nYields" : 0,
        "nChunkSkips" : 0,
        "millis" : 118,
        "indexBounds" : {
            "facets1.type" : [
                [
                    "publisher",
                    "publisher"
                ]
            ],
            "facets1.val" : [
                [
                    "O'Reilly Media",
                    "O'Reilly Media"
                ]
            ]
        },
        "server" : "rangel.lan:27017"
    }

This query uses the index, but is not optimal as many more documents are scanned than returned. Note that the number of documents scanned is the same as the number of books by this publisher (as seen from the previous query) — this is because at present $all only uses the index for the first element in the query array.

The performance of these kinds of queries will improve significantly once MongoDB supports index intersection, a feature that is coming soon (see SERVER-3071). With single-index intersection, queries like the above will not need to scan more documents than they return. In the meantime, to optimize these kinds of queries, put the most selective filter criterion first in the $all array if possible, to minimize scanning:

 > db.books.find(
    ...     { "facets1" :
    ...         { "$all" : [
    ...             { $elemMatch : { "type" : "subject", "val" : "Databases" } },
    ...             { $elemMatch : { "type" : "publisher", "val" : "O'Reilly Media" } }
    ...         ]}
    ...     }
    ... ).explain()
    {
        "cursor" : "BtreeCursor facets1.type_1_facets1.val_1",
        "isMultiKey" : true,
        "n" : 675,
        "nscannedObjects" : 7315,
        "nscanned" : 7315,
        "nscannedObjectsAllPlans" : 7315,
        "nscannedAllPlans" : 7315,
        "scanAndOrder" : false,
        "indexOnly" : false,
        "nYields" : 0,
        "nChunkSkips" : 0,
        "millis" : 20,
        "indexBounds" : {
            "facets1.type" : [
                [
                    "subject",
                    "subject"
                ]
            ],
            "facets1.val" : [
                [
                    "Databases",
                    "Databases"
                ]
            ]
        },
        "server" : "rangel.lan:27017"
    }

Solution #2

Store all facet types and values in an array, but instead of each element of the array being a subdocument, concatenate the facet type name and value into a single string value:

 {
        _id: 123,
        ...
        facets2 : [
            "subject:MongoDB",
            "subject:Databases",
            "publisher:O'Reilly Media",
            "language:English"
        ]
    }
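
Since the facet type and value are packed into a single string, the application has to build these strings itself when documents are written. A minimal helper for that might look like the following (a hypothetical sketch, not part of the original code):

    function buildFacets2(facets) {
        // facets is a plain object such as
        // { subject: ["MongoDB", "Databases"], publisher: ["O'Reilly Media"] }
        var result = [];
        for (var type in facets) {
            facets[type].forEach(function (val) {
                result.push(type + ":" + val);
            });
        }
        return result;
    }

    // buildFacets2({ subject: ["MongoDB", "Databases"], publisher: ["O'Reilly Media"] })
    // returns [ "subject:MongoDB", "subject:Databases", "publisher:O'Reilly Media" ]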

Create an index on the facets field:

  > db.books.ensureIndex({"facets2" : 1})
    > db.books.stats()
    {
        "ns" : "test.books",
        "count" : 105280,
        "size" : 109597152,
        "avgObjSize" : 1041.0063829787234,
        ...
        "totalIndexSize" : 55694912,
        "indexSizes" : {
            "_id_" : 3433920,
            "facets1.type_1_facets1.val_1" : 26457536,
            "facets2_1" : 25803456
        },
        "ok" : 1
    }

Now let’s try some of the same queries as before. First, a simple query on a single facet value (all books about databases):

   > db.books.find(
    ...     { "facets2" : "subject"+":"+"Databases" }
    ... ).explain()
    {
        "cursor" : "BtreeCursor facets2_1",
        "isMultiKey" : true,
        "n" : 7315,
        "nscannedObjects" : 7315,
        "nscanned" : 7315,
        "nscannedObjectsAllPlans" : 7315,
        "nscannedAllPlans" : 7315,
        "scanAndOrder" : false,
        "indexOnly" : false,
        "nYields" : 0,
        "nChunkSkips" : 0,
        "millis" : 28,
        "indexBounds" : {
            "facets2" : [
                [
                    "subject:Databases",
                    "subject:Databases"
                ]
            ]
        },
        "server" : "rangel.lan:27017"
    }

This works exactly as expected.

Now, let’s try an “or” query (all books about databases OR published by O’Reilly Media):

    > db.books.find(
    ...     { "facets2" :
    ...         { "$in" : [
    ...             "publisher"+":"+"O'Reilly Media",
    ...             "subject"+":"+"Databases"
    ...         ]}
    ...     }
    ... ).explain()
    {
        "cursor" : "BtreeCursor facets2_1 multi",
        "isMultiKey" : true,
        "n" : 46600,
        "nscannedObjects" : 47275,
        "nscanned" : 47276,
        "nscannedObjectsAllPlans" : 47275,
        "nscannedAllPlans" : 47276,
        "scanAndOrder" : false,
        "indexOnly" : false,
        "nYields" : 0,
        "nChunkSkips" : 0,
        "millis" : 117,
        "indexBounds" : {
            "facets2" : [
                [
                    "publisher:O'Reilly Media",
                    "publisher:O'Reilly Media"
                ],
                [
                    "subject:Databases",
                    "subject:Databases"
                ]
            ]
        },
        "server" : "rangel.lan:27017"
    }

This query is pretty optimal: the number of documents scanned is only slightly more than the number returned, and the index bounds look sensible, showing that the index is used for both elements of the $in array. Note that $in may be used to construct this type of query since we don’t need to use the $elemMatch operator with this schema.

Finally, an “and” query (all books about databases that are published by O’Reilly Media):

  > db.books.find(
    ...     { "facets2" :
    ...         { "$all" : [
    ...             "subject"+":"+"Databases",
    ...             "publisher"+":"+"O'Reilly Media"
    ...         ]}
    ...     }
    ... ).explain()
    {
        "cursor" : "BtreeCursor facets2_1",
        "isMultiKey" : true,
        "n" : 675,
        "nscannedObjects" : 7315,
        "nscanned" : 7315,
        "nscannedObjectsAllPlans" : 7315,
        "nscannedAllPlans" : 7315,
        "scanAndOrder" : false,
        "indexOnly" : false,
        "nYields" : 0,
        "nChunkSkips" : 0,
        "millis" : 20,
        "indexBounds" : {
            "facets2" : [
                [
                    "subject:Databases",
                    "subject:Databases"
                ]
            ]
        },
        "server" : "rangel.lan:27017"
    }

If you’ve been following so far, you won’t be too surprised to see that, unfortunately, this performs exactly the same as in solution #1, for the same reasons described there. Index intersection is coming soon though!

Solution #3

Consider the following schema, where each facet is a field in a subdocument, associated with an array of the tags for that facet:

 {
        _id: 123,
        ...
        facets3 : {
            subject : [ "MongoDB", "Databases" ],
            publisher : [ "O'Reilly Media" ],
            language : [ "English" ]
        }
    }

Add an index on each facet individually:

  > db.books.ensureIndex({"facets3.subject" : 1})
    > db.books.ensureIndex({"facets3.publisher" : 1})
    > db.books.ensureIndex({"facets3.language" : 1})
    > db.books.stats()
    {
        "ns" : "test.books",
        "count" : 105280,
        "size" : 109597152,
        "avgObjSize" : 1041.0063829787234,
        ...
        "totalIndexSize" : 75464480,
        "indexSizes" : {
            "_id_" : 3433920,
            "facets1.type_1_facets1.val_1" : 26457536,
            "facets2_1" : 25803456,
            "facets3.subject_1" : 12084128,
            "facets3.publisher_1" : 2321984,
            "facets3.language_1" : 5363456
        },
        "ok" : 1
    }

This solution has the same performance characteristics as the first two solutions, with the additional benefit that the total size of the indexes required is significantly smaller. This is because we are not storing the facet names in the index for each indexed value.

Once index intersection using multiple indexes is supported (which is also coming under SERVER-3071), this approach will also perform well for “and” queries.
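
For completeness, here is roughly what the single-facet, “or”, and “and” queries from above look like under schema #3 (a sketch; explain output omitted):

    // Single facet value: all books about databases
    db.books.find({ "facets3.subject" : "Databases" })

    // "or" across facets still requires $or, since each facet is a separate field
    db.books.find({ "$or" : [
        { "facets3.subject" : "Databases" },
        { "facets3.publisher" : "O'Reilly Media" }
    ]})

    // "and" across facets is simply a multi-field query; once index intersection
    // lands, each predicate can use its own single-field index
    db.books.find({
        "facets3.subject" : "Databases",
        "facets3.publisher" : "O'Reilly Media"
    })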

Generating the Faceted Navigation Information

The other part of the faceted search problem is how to most efficiently generate and return the faceted search meta-data. One way to do this would be to use the Aggregation Framework to calculate this information on-the-fly.

For example, to get all the facet values for the collection and the count of documents associated with each one, we could perform the following aggregation query (assuming schema #2 as above):

 > db.books.aggregate([{ "$unwind" : "$facets2" },
                          { "$group" : { "_id" : "$facets2", count : { "$sum" : 1 } } },
                          { "$sort" : { "_id" : 1 } }
                         ])
    {
        "result" : [
            ...
            {
                "_id" : "publisher:O'Reilly Media",
                "count" : 39960
            },
            ...
            {
                "_id" : "subject:Databases",
                "count" : 7315
            },
            ...
        ],
        "ok" : 1
    }
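
The result comes back as a flat list of “type:value” keys, so the application would typically reshape it into a per-facet structure before rendering the navigation. A small client-side helper (hypothetical, not part of the original code) could do this:

    function groupFacetCounts(aggResult) {
        var facets = {};
        aggResult.forEach(function (doc) {
            var idx = doc._id.indexOf(":");
            var type = doc._id.substring(0, idx);
            var val = doc._id.substring(idx + 1);
            if (!facets[type]) {
                facets[type] = {};
            }
            facets[type][val] = doc.count;
        });
        return facets;
    }

    // In the 2.4 shell, aggregate() returns { "result" : [...], "ok" : 1 }, so:
    // groupFacetCounts(db.books.aggregate([ ... ]).result)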

Then, as the user drills down using the facets, we need to add the filter predicates to the aggregation query. For instance, if the user clicks on the “Databases” subject facet, we can obtain the facet values and counts for documents matching this filter as follows:

 > db.books.aggregate([{ "$match" : { "facets2" : "subject"+":"+"Databases" } },
                          { "$unwind" : "$facets2" },
                          { "$group" : { "_id" : "$facets2", "count" : { "$sum" : 1 } } },
                          { "$sort" : { "_id" : 1 } }
                         ])
    {
        "result" : [
            ...
            {
                "_id" : "publisher:O'Reilly Media",
                "count" : 675
            },
            ...
            {
                "_id" : "subject:Databases",
                "count" : 7315
            },
            ...
        ],
        "ok" : 1
    }

The downside to this approach is that it incurs the overhead of an additional aggregation query each time the user queries the product catalog. Furthermore, for certain choices of schema (e.g. solution #3 above) we actually need to run one aggregation query per distinct facet.
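
For example, under schema #3 computing the subject counts alone needs its own pipeline (a sketch), and similar pipelines would be needed for publisher and language:

    db.books.aggregate([{ "$unwind" : "$facets3.subject" },
                        { "$group" : { "_id" : "$facets3.subject", "count" : { "$sum" : 1 } } },
                        { "$sort" : { "_id" : 1 } }
                       ])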

It’s reasonable to assume that the product catalog will be updated much less frequently than it is queried, so it may well make sense to pre-compute the faceted navigation meta-data and store it in a separate collection. Consider the following schema for a collection of faceted navigation documents:

 {
        _id : "'facet_filter_string",
        value : {
            count : 12,
            facets : {
                facet1_name : {
                    facet1_val1 : 8,
                    facet1_val2 : 12,
                    ...
                },
                facet2_name : {
                    facet2_val1 : 5,
                    ...
                },
                ...
            }
        }
    }

where <facet_filter_string> is either the empty string (for the document representing the root of the faceted navigation) or one or more of "|<facet_name>:<facet_filter_val>|" concatenated together.

Then, to find the faceted navigation information pertaining to all books about databases, the following simple query on _id will do the job:

 > db.facetnav.find({_id:"|subject:Databases|"}).pretty()
    {
        "_id" : "|subject:Databases|",
        "value" : {
            "count" : 7315,
            "facets" : {
                "publisher" : {
                    "O'Reilly Media" : 675,
                    "Pub2" : 3605,
                    "Pub3" : 185,
                    "Pub4" : 305,
                    "Pub5" : 2505,
                    "Pub6" : 15,
                    "Pub7" : 25
                },
                "language" : {
                    "English" : 7250,
                    "French" : 1095,
                    "German" : 1290
                }
            }
        }
    }

Note that it’s not necessary to generate a document like the above for every single permutation of facet filters, only for each unique combination of filters according to some predetermined canonical ordering of facets (e.g. Subject, Publisher, Language). We can then ensure that the application always builds the _id string with which to query using this canonical ordering.
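
For instance, a hypothetical helper (not from the original code) that builds the _id string from the user's current selections, always in the canonical order, might look like this:

    var CANONICAL_ORDER = ["subject", "publisher", "language"];

    function facetNavId(selections) {
        // selections is e.g. { publisher: "O'Reilly Media", subject: "Databases" }
        var id = "";
        CANONICAL_ORDER.forEach(function (facet) {
            if (selections[facet] !== undefined) {
                id += "|" + facet + ":" + selections[facet] + "|";
            }
        });
        return id;   // "" when nothing is selected, i.e. the root document
    }

    // facetNavId({ subject: "Databases" }) returns "|subject:Databases|"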

The faceted navigation meta-data collection can be generated quite easily using a Map-Reduce job. For some example code that does this, take a look at my GitHub repo. With the map and reduce functions defined there, the facetnav info for the entire product catalog can be generated as follows:

 > db.books.mapReduce(mapFn, reduceFn, { "out" : "facetnav" })
    {
        "result" : "facetnav",
        "timeMillis" : 117529,
        "counts" : {
            "input" : 105280,
            "emit" : 2423080,
            "reduce" : 63850,
            "output" : 1599
        },
        "ok" : 1,
    }
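
To give a flavor of what those map and reduce functions do, here is a drastically simplified sketch (assuming schema #2) that only emits the root key; the actual functions in the repo also emit one key per facet filter combination:

    var mapFn = function () {
        var facets = {};
        this.facets2.forEach(function (f) {
            var idx = f.indexOf(":");
            var type = f.substring(0, idx);
            var val = f.substring(idx + 1);
            if (!facets[type]) facets[type] = {};
            facets[type][val] = 1;
        });
        // emit under the root key ""; the real job also emits filter-combination keys
        emit("", { count : 1, facets : facets });
    };

    var reduceFn = function (key, values) {
        var result = { count : 0, facets : {} };
        values.forEach(function (v) {
            result.count += v.count;
            for (var type in v.facets) {
                if (!result.facets[type]) result.facets[type] = {};
                for (var val in v.facets[type]) {
                    result.facets[type][val] = (result.facets[type][val] || 0) + v.facets[type][val];
                }
            }
        });
        return result;
    };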

Subsequently, whenever the product catalog is updated, the facetnav collection can be quickly updated by specifying that the map-reduce job operate only on the recently updated items and fold those changes in to the existing facetnav collection. For example:

  > db.books.ensureIndex({"last_updated : 1"})
    > db.books.mapReduce(mapFn, reduceFn,
    ...                  { "query" : { "last_updated" : { "$gt" : new Date(2013,7,1) } },
    ...                    "out" : { "reduce" : "facetnav" } })
    {
        "result" : "facetnav",
        "timeMillis" : 724,
        "counts" : {
            "input" : 1000,
            "emit" : 13484,
            "reduce" : 198,
            "output" : 1599
        },
        "ok" : 1,
    }

Third-Party Tools

There are a number of search engine software packages that provide faceted search capabilities. These typically provide the core functionality we have described above, plus more advanced features such as more convenient searching on ranges of facet values (e.g. finding documents that fall within a certain date or price range) or auto-completion (i.e. displaying relevant suggestions, grouped by facet, as a user types in a search query).

The trade-offs with using an additional search engine are:

  • Extra complexity due to adding another ‘moving part’ to your deployment
  • Your application must deal with the fact that the system as a whole is now eventually consistent with respect to the data stored in MongoDB versus the data stored in the external search engine. This may be undesirable, particularly for a product catalog that changes very frequently.

Two of the most popular search engines are Solr and ElasticSearch, which, like MongoDB, are also free and open-source products.

Solr and ElasticSearch can be easily integrated with MongoDB using Mongo Connector, which comes bundled with plugins for interfacing with each of them. Using the appropriate plugin, Mongo Connector can integrate data from MongoDB into the desired target system and keep the two systems in sync.

Conclusion

Faceted search functionality can be implemented in MongoDB, without requiring the use of external search engines. When index intersection arrives, all the types of queries we have examined here will perform optimally. Integrating with an external search engine to provide faceted search is also a good option, and something to consider depending on the specific requirements of your application.

Enhancing the F# developer experience with MongoDB

Aug 28 • Posted 1 year ago

This is a guest post by Max Hirschhorn, who is currently an intern at MongoDB.

About the F# programming language

F# is a multi-paradigm language built on the .NET framework. It is functional-first and prefers immutability, but also supports object-oriented and imperative programming styles.

Also, F# is a statically-typed language with a type inference system. It has a syntax similar to OCaml and draws upon ideas from other functional programming languages such as Erlang and Haskell.

Using the existing .NET driver

The existing .NET driver is compatible with F#, but is not necessarily written in a way that is idiomatic to use from F#.

Part of the reason behind this is that everything in F# is explicit. For example, consider the following interface and implementing class.

[<Interface>]
type I =
    abstract Foo : unit -> string

type C() =
    interface I with
        member __.Foo () = "bar"

// example usage
let c = C()
(c :> I).Foo()

So in order to use any of the interface members, the class must be upcast using the :> operator. Note that this cast is still checked at compile-time.

In a similar vein, C# supports implicit operators, which the BSON library uses for converting between a primitive value and its BsonValue equivalent, e.g.

new BsonDocument {
    { "price", 1.99 },
    { "$or", new BsonDocument {
        { "qty", new BsonDocument { { "$lt", 20 } } },
        { "sale", true }
    } }
};

whereas F# does not. This requires the developer to explicitly construct the appropriate type of BsonValue, e.g.

BsonDocument([ BsonElement("price", BsonDouble(1.99))
               BsonElement("$or", BsonArray([ BsonDocument("qty", BsonDocument("$lt", BsonInt32(20)))
                                              BsonDocument("sale", BsonBoolean(true)) ])) ])

With the query builder, we can hide the construction of BsonDocument instances, e.g.

Query.And([ Query.EQ("price", BsonDouble(1.99))
            Query.Or([ Query.LT("qty", BsonInt32(20))
                       Query.EQ("sale", BsonBoolean(true)) ]) ])

It is worth noting that the need to construct the BsonValue instances is completely avoided when using a typed QueryBuilder.

type Item = {
    Price : float
    Quantity : int
    Sale : bool
}

let query = QueryBuilder<Item>()

query.And([ query.EQ((fun item -> item.Price), 1.99)
            query.Or([ query.LT((fun item -> item.Quantity), 20)
                       query.EQ((fun item -> item.Sale), true) ]) ])

What we are looking for is a solution that matches the brevity of F# code, offers type-safety if desired, and is easy to use from the language.

New features

The main focus of this project is to make writing queries against MongoDB as natural from the F# language as possible.

bson quotations

We strive to make writing predicates as natural as possible by reusing as many of the existing operators as possible.

A taste

Consider the following query

{ price: 1.99, $or: [ { qty: { $lt: 20 } }, { sale: true } ] }

we could express this with a code quotation

bson <@ fun (x : BsonDocument) -> x?price = 1.99 && (x?qty < 20 || x?sale = true) @>

or with type safety

bson <@ fun (x : Item) -> x.Price = 1.99 && (x.Quantity < 20 || x.Sale = true) @>

Breaking it down

The quotations are not actually executed, but instead are presented as an abstract syntax tree (AST), from which an equivalent BsonDocument instance is constructed.

The ? operator

The ? operator is defined to allow for an unchecked comparison. The F# language supports dynamic lookup (get) and assignment (set) via the ? and ?<- operators, respectively, but does not actually provide an implementation.

So, the F# driver defines the ? operator as the value associated with a field in a document, cast to a fresh generic type.

// type signature: BsonDocument -> string -> 'a
let (?) (doc : BsonDocument) (field : string) =
    unbox doc.[field]

and similarly defines the ?<- operator as the coerced assignment of a generically typed value to the associated field in the document.

// type signature: BsonDocument -> string -> 'a -> unit
let (?<-) (doc : BsonDocument) (field : string) value =
    doc.[field] <- unbox value

Queries

Unchecked expressions have the type signature Expr<BsonDocument -> bool>.

// $mod
bson <@ fun (x : BsonDocument) -> x?qty % 4 = 0 @>

Checked expressions have the type signature Expr<'DocType -> bool>.

// $mod
bson <@ fun (x : Item) -> x.Quantity % 4 = 0 @>

Updates

Unchecked expressions have the type signature Expr<BsonDocument -> unit list>. The reason for the list in the return type is to perform multiple update operations.

// $set
bson <@ fun (x : BsonDocument) -> [ x?qty <- 20 ] @>

// $inc
bson <@ fun (x : BsonDocument) -> [ x?qty <- (+) 1 ] @>

Mmm… sugar

A keen observer will notice that (+) 1 is not an int, but actually a function int -> int. We are abusing the fact that type safety is not enforced here by assigning the quantity field of the document to a lambda expression that takes the current value as its single parameter.

Note that

// $inc
bson <@ fun (x : BsonDocument) -> [ x?qty <- x?qty + 1 ] @>

is also valid.

Checked expressions either have the type signature Expr<'DocType -> unit list> or Expr<'DocType -> 'DocType>, depending on whether the document type has mutable fields (only matters for record types).

// $set
bson <@ fun (x : Item) -> [ x.Quantity <- 20 ] @>

// $inc
bson <@ fun (x : Item) -> [ x.Quantity <- x.Quantity + 1 ] @>

mongo expressions

mongo expressions use a monadic structure (an F# computation expression) to define a pipeline of operations that are executed on each document in the collection.

Queries

let collection : IMongoCollection<BsonDocument> = ...

mongo {
    for x in collection do
    where (x?price = 1.99 && (x?qty < 20 || x?sale = true))
}

or with a typed collection

let collection : IMongoCollection<Item> = ...

mongo {
    for x in collection do
    where (x.Price = 1.99 && (x.Quantity < 20 || x.Sale = true))
}

Updates

let collection : IMongoCollection<BsonDocument> = ...

mongo {
    for x in collection do
    update
    set x?price 0.99
    inc x?qty 1
}

or with a typed collection

let collection : IMongoCollection<Item> = ...

mongo {
    for x in collection do
    update
    set x.Price 0.99
    inc x.Quantity 1
}

Serialization of F# data types

Now supports

Conclusion

Resources

The source code is available at GitHub. We absolutely encourage you to experiment with it and provide us feedback on the API, design, and implementation. Bug reports and suggestions for improvements are welcomed, as are pull requests.

Disclaimer. The API and implementation are currently subject to change at any time. You must not use this driver in production, as it is still under development and is in no way supported by MongoDB, Inc.

Acknowledgments

Many thanks to the guidance from the F# community on Twitter, and my mentors: Sridhar Nanjundeswaran, Craig Wilson, and Robert Stam. Also, a special thanks to Stacy Ferranti and Ian Whalen for overseeing the internship program.

Today’s News

Aug 27 • Posted 1 year ago

A: I met someone from 10gen the other day…

B: From where?

A: 10gen. The company that makes MongoDB.

B: Ohhh.

As of today, the above conversation will never happen again, because we are now called “MongoDB, Inc.”

MongoDB CEO Max Schireson published a post that details why we made the decision to rebrand. If you have any questions or concerns, please let us know.

-Eliot and the MongoDB Team

Surviving Success at Matchbook: Using MMS To Track Down Performance Issues

Aug 22 • Posted 1 year ago

This is a guest post from Jared Wyatt, CTO of Matchbook, an app for remembering the places you love and want to try.

I joined Matchbook as CTO in January with the goal of breathing new life into an iOS app that had a small, but very devoted following. For various reasons, we decided to start fresh and rebuild everything from the ground up—this included completely revamping the app itself and totally redesigning our API and backend infrastructure. The old system was using MySQL as a datastore, but MongoDB seemed like a better fit for our needs because of its excellent geospatial support and the flexibility offered by its document-oriented data model.

We submitted Matchbook 2.0 to the App Store at the end of June and within a few days received an email from Apple requesting design assets because they wanted to feature our app. So, of course we were all, like, “OMG OMG OMG.”

Read more

Aggregation Options on Big Data Sets Part 1: Basic Analysis using a Flights Data Set

Aug 21 • Posted 1 year ago

By Daniel Alabi and Sweet Song, MongoDB Summer Interns

Flights Dataset Overview

This is the first of three blog posts from this summer internship project showing how to answer questions concerning big datasets stored in MongoDB using MongoDB’s frameworks and connectors.

The first dataset explored was a domestic flights dataset. The Bureau of Transportation Statistics provides information for every commercial flight since 1987, but we narrowed our project's focus to the most recent year of available data (April 2012 to March 2013).

We were particularly attracted to this dataset because it contains a lot of fields that are well suited for manipulation using the MongoDB aggregation framework.

Read more

$push to sorted array

Aug 20 • Posted 1 year ago

By Sam Weaver, MongoDB Solutions Architect and Alberto Lerner, MongoDB Kernel Lead

MongoDB 2.4 introduced a feature that many have requested for some time: the ability to create a capped array.

Capped arrays are great for any application that needs a fixed-size list. For example, if you’re designing an ecommerce application with MongoDB and want to include a listing of the last 5 products viewed, you previously had to issue a $push request for each new item viewed, and then a $pop to kick the oldest item out of the array. Whilst this method was effective, it wasn’t necessarily efficient. Let’s take an example of the old way to do this:

First, we would need to create a document to represent a user, containing an array to hold the last products viewed:

db.products.insert({last_viewed:["bike","cd","game","bike","book"]})
db.products.findOne()
{
    "_id" : ObjectId("51ff97d233c4f2089347cab6"),
    "last_viewed" : [
        "bike",
        "cd",
        "game",
        "bike",
        "book"
    ]
}
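
The pre-2.4 workaround described above would then look something like this (a sketch using the document shown; "toaster" is just an illustrative new item):

db.products.update(
    { "_id" : ObjectId("51ff97d233c4f2089347cab6") },
    { "$push" : { "last_viewed" : "toaster" } }
)
db.products.update(
    { "_id" : ObjectId("51ff97d233c4f2089347cab6") },
    { "$pop" : { "last_viewed" : -1 } }   // -1 removes the oldest item, at the front
)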

Read more

Mongoose 3.7.0 (Unstable) Released

Aug 15 • Posted 1 year ago

By EJ Bensing, MongoDB intern for Summer 2013

I’ve spent the last 2 months interning at 10gen, the MongoDB company, working on Mongoose. It has been a lot of fun and I’ve learned a ton about Node.js, MongoDB, and building open source libraries. I’m going to save all of that for a different post though, and instead talk about the newest release of Mongoose.

Unstable

To start things off, this is an unstable release. This means that it contains potentially breaking changes or other major updates, and thus should probably not be used in production. You can tell this is an unstable release because of the middle digit. Starting from 3.7, odd middle digits mean unstable, even mean stable. This is identical to the Node.js and MongoDB versioning schemes.

Read more

The MongoDB Web Shell

Aug 14 • Posted 1 year ago

About

The MongoDB Web Shell is a web application designed to emulate some of the features of the mongo terminal shell. This project has three main uses: try.mongodb.org, 10gen Education online classes, and the MongoDB API documentation.

In these three different contexts, users can familiarize themselves with the MongoDB interface and basic commands, both independently and as part of a 10gen Education homework assignment.

See a screenshot of the state of the browser shell prior to this summer below:

Read more

The MongoDB Java Driver 3.0

Aug 13 • Posted 1 year ago

By Trisha Gee, MongoDB Java Engineer and Evangelist

You may have heard that the JVM team at 10gen is working on a 3.0 version of the Java driver. We’ve actually been working on it since the end of last year, and it’s probably as surprising to you as it is to me that we still haven’t finished it yet. But this is a bigger project than it might seem, and we’re working hard to get it right.

So why update the driver? What are we trying to achieve?

Well, the requirements are:

  • More maintainable
  • More extensible
  • Better support for ODMs, third party libraries and other JVM languages
  • More idiomatic for Java developers
Read more

Improving Driver Documentation: The MongoDB Meta Driver

Aug 8 • Posted 1 year ago

This is a guest post, written by Mario Alvarez, a MongoDB intern for Summer 2013

This summer, I worked on developing the Meta Driver project, an effort to re-work the driver documentation by creating an implementable specification of driver functionality that could be applied to different drivers in different languages.

Read more