Posts tagged:

mongodb

Schema Design for Time Series Data in MongoDB

Oct 30 • Posted 9 months ago

This is a post by Sandeep Parikh, Solutions Architect at MongoDB and Kelly Stirman, Director of Products at MongoDB.

Data as Ticker Tape

New York is famous for a lot of things, including ticker tape parades.

For decades the most popular way to track the price of stocks on Wall Street was through ticker tape, the earliest digital communication medium. Stocks and their values were transmitted via telegraph to a small device called a “ticker” that printed onto a thin roll of paper called “ticker tape.” While out of use for over 50 years, the idea of the ticker lives on in scrolling electronic tickers at brokerage walls and at the bottom of most news networks, sometimes two, three and four levels deep.

Today there are many sources of data that, like ticker tape, represent observations ordered over time. For example:

  • Financial markets generate prices (we still call them “stock ticks”).
  • Sensors measure temperature, barometric pressure, humidity and other environmental variables.
  • Industrial fleets such as ships, aircraft and trucks produce location, velocity, and operational metrics.
  • Status updates on social networks.
  • Calls, SMS messages and other signals from mobile devices.
  • Systems themselves write information to logs.

This data tends to be immutable, large in volume, ordered by time, and is primarily aggregated for access. It represents a history of what happened, and there are a number of use cases that involve analyzing this history to better predict what may happen in the future or to establish operational thresholds for the system.

Time Series Data and MongoDB

Time series data is a great fit for MongoDB. There are many examples of organizations using MongoDB to store and analyze time series data. Here are just a few:

  • Silver Spring Networks, the leading provider of smart grid infrastructure, analyzes utility meter data in MongoDB.
  • EnerNOC analyzes billions of energy data points per month to help utilities and private companies optimize their systems, ensure availability and reduce costs.
  • Square maintains a MongoDB-based open source tool called Cube for collecting timestamped events and deriving metrics.
  • Server Density uses MongoDB to collect server monitoring statistics.
  • Appboy, the leading platform for mobile relationship management, uses MongoDB to track and analyze billions of data points on user behavior.
  • Skyline Innovations, a solar energy company, stores and organizes meteorological data from commercial scale solar projects in MongoDB.
  • One of the world’s largest industrial equipment manufacturers stores sensor data from fleet vehicles to optimize fleet performance and minimize downtime.

In this post, we will take a closer look at how to model time series data in MongoDB by exploring the schema of a tool that has become very popular in the community: MongoDB Management Service (MMS). MMS helps users manage their MongoDB systems by providing monitoring, visualization and alerts on over 100 database metrics. Today the system monitors over 25k MongoDB servers across thousands of deployments. Every minute thousands of local MMS agents collect system metrics and ship the data back to MMS. The system processes over 5B events per day, and over 75,000 writes per second, all on less than 10 physical servers for the MongoDB tier.

Schema Design and Evolution

How do you store time series data in a database? In relational databases the answer is somewhat straightforward; you store each event as a row within a table. Let’s say you were monitoring the amount of system memory used per second. In that example you would have a table and rows that looked like the following:

timestamp                   memory_used
2013-10-10T23:06:37.000Z    1000000
2013-10-10T23:06:38.000Z    2000000


If we map that storage approach to MongoDB, we would end up with one document per event:

{
  timestamp: ISODate("2013-10-10T23:06:37.000Z"),
  type: "memory_used",
  value: 1000000
},
{
  timestamp: ISODate("2013-10-10T23:06:38.000Z"),
  type: "memory_used",
  value: 15000000
}

While this approach is valid in MongoDB, it doesn’t take advantage of the expressive nature of the document model. Let’s take a closer look at how we can refine the model to provide better performance for reads and to improve storage efficiency.

The Document-Oriented Design

A better schema approach looks like the following. It is not exactly what MMS uses, but it will help illustrate the key concepts. Let’s call it the document-oriented design:

{
  timestamp_minute: ISODate("2013-10-10T23:06:00.000Z"),
  type: "memory_used",
  values: {
    0: 999999,
    …
    37: 1000000,
    38: 1500000,
    …
    59: 2000000
  }
}

We store multiple readings in a single document: one document per minute. To further improve the efficiency of the schema, we can isolate repeating data structures. In the ```timestamp_minute``` field we capture the minute that identifies the document, and for each memory reading we store a new value in the ```values``` sub-document. Because we are storing one value per second, we can simply represent each second as fields 0 - 59.

More Updates than Inserts

In any system there may be tradeoffs regarding the efficiency of different operations, such as inserts and updates. For example, in some systems updates are implemented as copies of the original record written out to a new location, which requires updating of indexes as well. One of MongoDB’s core capabilities is the in-place update mechanism: field-level updates are managed in place as long as the size of the document does not grow significantly. By avoiding rewriting the entire document and index entries unnecessarily, far less disk I/O is performed. Because field-level updates are efficient, we can design for this advantage in our application: with the document-oriented design there are many more updates (one per second) than inserts (one per minute).

For example, if you wanted to maintain a count in your application, MongoDB provides a handy operator that increments or decrements a field. Instead of reading a value into your application, incrementing, then writing the value back to the database, you can simply increase the field using $inc:

```{ $inc: { pageviews: 1 } }```

This approach has a number of advantages: first, the increment operation is atomic - multiple threads can safely increment a field concurrently using $inc. Furthermore, this approach is more efficient for disk operations, requires less data to be sent over the network and requires fewer round trips by omitting the need for any reads. Those are three big wins that result in a more simple, more efficient and more scalable system. The same advantages apply to the use of the $set operator.
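
To make the contrast concrete, here is a minimal PyMongo sketch of the same idea (the stats collection, the "homepage" document and the pageviews field are invented for illustration, and a local mongod is assumed):

from pymongo import MongoClient

stats = MongoClient().test.stats  # assumes a local mongod on the default port

# Read-modify-write: two round trips, and unsafe with concurrent writers.
doc = stats.find_one({"_id": "homepage"})
stats.update_one({"_id": "homepage"},
                 {"$set": {"pageviews": doc["pageviews"] + 1}})

# Atomic in-place increment: one round trip, safe under concurrency.
stats.update_one({"_id": "homepage"}, {"$inc": {"pageviews": 1}})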

The document-oriented design has several benefits for writing and reading. As previously stated, writes can be much faster as field-level updates because instead of writing a full document we’re sending a much smaller delta update that can be modeled like so:

db.metrics.update(
  {
    timestamp_minute: ISODate("2013-10-10T23:06:00.000Z"),
    type: "memory_used"
  },
  {$set: {"values.59": 2000000 } }
)

With the document-oriented design reads are also much faster. If you needed an hour’s worth of measurements using the first approach you would need to read 3600 documents, whereas with this approach you would only need to read 60 documents. Reading fewer documents has the benefit of fewer disk seeks, and with any system fewer disk seeks usually results in significantly better performance.
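
For a rough sense of the difference, here is a PyMongo sketch of the two reads (the events and metrics collection names are assumptions, matching the field names used above):

from datetime import datetime, timedelta
from pymongo import MongoClient

db = MongoClient().test
start = datetime(2013, 10, 10, 23, 0, 0)
end = start + timedelta(hours=1)

# Event-per-document design: roughly 3600 documents for one hour of data.
events = list(db.events.find({
    "type": "memory_used",
    "timestamp": {"$gte": start, "$lt": end},
}))

# Document-per-minute design: only 60 documents for the same hour.
minutes = list(db.metrics.find({
    "type": "memory_used",
    "timestamp_minute": {"$gte": start, "$lt": end},
}))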

A natural extension to this approach would be to have documents that span an entire hour, while still keeping the data resolution per second:

{
  timestamp_hour: ISODate("2013-10-10T23:00:00.000Z"),
  type: "memory_used",
  values: {
    0: 999999,
    1: 1000000,
    …,
    3598: 1500000,
    3599: 2000000
  }
}

One benefit to this approach is that we can now access an hour’s worth of data using a single read. However, there is one significant downside: to update the last second of any given hour MongoDB would have to walk the entire length of the “values” object, taking 3600 steps to reach the end. We can further refine the model a bit to make this operation more efficient:

{
  timestamp_hour: ISODate("2013-10-10T23:00:00.000Z"),
  type: "memory_used",
  values: {
    0: { 0: 999999, 1: 999999, …, 59: 1000000 },
    1: { 0: 2000000, 1: 2000000, …, 59: 1000000 },
    …,
    58: { 0: 1600000, 1: 1200000, …, 59: 1100000 },
    59: { 0: 1300000, 1: 1400000, …, 59: 1500000 }
  }
}
db.metrics.update(
  {
    timestamp_hour: ISODate("2013-10-10T23:00:00.000Z"),
    type: "memory_used"
  },
  {$set: {"values.59.59": 2000000 } }
)

MMS Implementation

In MMS users have flexibility to view monitoring data at varying levels of granularity. These controls appear at the top of the monitoring page.

These controls inform the schema design for MMS, and how the data needs to be displayed. In MMS, different resolutions have corresponding range requirements - for example, if you specify that you want to analyze monitoring data at the granularity of “1 hr” instead of “1 min” then the ranges also become less granular, changing from hours to days, weeks and months.

To satisfy this approach in a scalable manner and keep data retention easy to manage, MMS organizes monitoring data to be very efficient for reads by maintaining copies at varying degrees of granularity. The document model allows for efficient use of space, so the tradeoff is very reasonable, even for a system as large as MMS. As data ages out, collections that are associated with ranges of time are simply dropped, which is a very efficient operation. Collections are created to represent future ranges of time, and these will eventually be dropped as well. This cycle maintains a rolling window of history associated with the functionality provided by MMS.

In addition, to support the “avg/sec” display option the system also tracks the number of samples collected and the sum of all readings for each metric similar to the following example:

{
  timestamp_minute: ISODate("2013-10-10T23:06:00.000Z"),
  num_samples: 58,
  total_samples: 108000000,
  type: "memory_used",
  values: {
    0: 999999,
    …
    37: 1000000,
    38: 1500000,
    …
    59: 1800000
  }
}

The fields “num_samples” and “total_samples” are updated as new readings are applied to the document:

db.metrics.update(
  {
    timestamp_minute: ISODate("2013-10-10T23:06:00.000Z"),
    type: "memory_used"
  },
  {
    $set: {"values.59": 2000000 },
    $inc: {num_samples: 1, total_samples: 2000000 }
  }
)

Computing the average/sec is straightforward and requires no counting or processing, just a single read to retrieve the data and a simple application-level operation to compute the average. Note that with this model we assume a consistent cadence of measurements - one per second - that we can simply aggregate at the top of the document to report a rolled-up average for the whole minute. Other models are possible that would support inconsistent measurements and flexible averages over different time frames.
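
For illustration, that application-level computation might look like this in PyMongo (a sketch; the metrics collection name is an assumption):

from datetime import datetime
from pymongo import MongoClient

db = MongoClient().test

doc = db.metrics.find_one({
    "timestamp_minute": datetime(2013, 10, 10, 23, 6, 0),
    "type": "memory_used",
})

# One read, then a single division in the application.
avg_per_sec = doc["total_samples"] / doc["num_samples"]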

Another optimization used in MMS is preallocating all documents for the upcoming time period; MMS never causes an existing document to grow or be moved on disk. A background task within the MMS application performs inserts of empty “shell” documents including the subdocument schema but with all zeroes for the upcoming time periods before they are recorded. With this approach fields are always incremented or set without ever growing the document in size, which eliminates the possibility of moving the document and the associated overhead. This is a major performance win and another example of ensuring in-place updates within the document-oriented design.
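
The MMS preallocation code itself isn’t shown here, but a background task along these lines could insert zero-filled shell documents ahead of time (a sketch with hypothetical names, following the minute-level schema above):

from datetime import datetime, timedelta
from pymongo import MongoClient

db = MongoClient().test

def preallocate_minute(metric_type, minute):
    """Insert a zero-filled 'shell' document so that later $set/$inc updates
    never grow or move the document on disk."""
    db.metrics.insert_one({
        "timestamp_minute": minute,
        "type": metric_type,
        "num_samples": 0,
        "total_samples": 0,
        "values": {str(second): 0 for second in range(60)},
    })

# Preallocate the next hour of per-minute documents for one metric.
start = datetime(2013, 10, 11, 0, 0)
for i in range(60):
    preallocate_minute("memory_used", start + timedelta(minutes=i))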

Conclusion

MongoDB offers many advantages for storing and analyzing time series data, whether it’s stock ticks, tweets or MongoDB metrics. If you are using MongoDB for time series data analysis, we want to hear about your use case. Please continue the conversation by commenting on this post with your story.

More Information

Like what you see? Get MongoDB updates straight to your inbox

Performance Tuning MongoDB on Solidfire

Oct 24 • Posted 9 months ago

This is a guest post by Chris Merz & Garrett Clark, SolidFire

We recently had a large enterprise customer implement a MongoDB sharded cluster on SolidFire as the backend for a global e-commerce system. By leveraging solid-state drive technology with features like storage virtualization, Quality of Service (guaranteed IOPS per volume), and horizontal scaling, the customer was looking to combine the benefits of dedicated storage performance with the simplicity and scalability of a MongoDB environment.

During the implementation the customer reached out to us with some performance and tuning questions, requesting our assistance with the configuration. After meeting with the team and reviewing the available statistics, we discovered response times that were deemed out of range for the application’s performance requirements. Response times were ~13-20ms (with an average of 15-17 ms). While this is considered acceptable latency in many implementations, the team was targeting < 5ms average query response times.

When troubleshooting any storage latency performance issue it is important to focus on two critical aspects of the end-to-end chain: potential i/o queue depth bottlenecks and the main contributors to the overall latency in the chain. A typical end-to-end sequence with attached storage can be described by:

MongoDB > OS > NIC > Network > Storage > Network > NIC > OS > MongoDB

First off, we looked for any i/o queue depth bottlenecks and found the first one at the operating system layer. MongoDB was periodically sending an i/o queue depth of >100 to the operating system while, by default, iSCSI could only release a queue depth of 32 per iSCSI session. This drop from a queue depth of >100 to 32 caused frames to stall at the operating system layer while they waited to continue down the chain.

We alleviated the issue by increasing the number of iSCSI sessions to the volume from 1 to 4, which proportionally increased the queue depth exiting the operating system to 128 (32*4). This enabled all frames coming off the application layer to immediately pass through the operating system and NIC, and decreased the overall latency from ~15ms to ~4ms. Despite the latency average being 4ms, performance was still rather variable.

We then turned our focus to pinpointing the sources of the remaining end-to-end latency. We were able to determine the latency factors in the stack through the study of three latency loops:

First, the complete chain of: MongoDB > OS > NIC > Network > Storage > Network > NIC > OS > MongoDB. This loop took an average of 3.9ms to complete.

Secondly, the subset loop of: OS > NIC > Network > Storage > Network > NIC > OS. This loop took ~1.1ms to complete. We determined the latency of this loop from the output of “iostat -xk 1”, grepping for the corresponding volume.

The last loop segment, latency on the storage subsystem, was 0.7ms and was obtained through a polling API command issued to the SolidFire unit.

Our analysis pointed to the first layers of the stack contributing the most significant percent (>70%) of the end-to-end latency, so we decided to start there and continue downstream.

We reviewed the OS configuration and tuning, with an eye towards both SolidFire/iSCSI best practices and MongoDB performance. Several OS-level tunables were found that could be tweaked to ensure optimal throughput for this type of deployment. Unfortunately, none of these resulted in any major reduction in the end-to-end latency for mongo.

Having eliminated the obvious, we were left with what remained: MongoDB itself. A phrase oft-quoted by the famous fictional detective, Sherlock Holmes came to mind: “when you have eliminated the impossible, whatever remains, however improbable, must be the truth.”

Upon going over the collected statistics runs with a fine-toothed comb, we noticed that the latency spikes had intervals of almost exactly 60 seconds. That’s when the light bulb went off…

The MongoDB flush interval. The architecture of MongoDB was developed in the context of spinning disk, a vastly slower storage technology requiring batched file syncs to minimize query latency. The syncdelay setting defaults to 60 seconds for this very reason. In the documentation, it is clearly stated “In almost every situation you should not set this value and use the default setting”. ‘Almost’ was the key to our solution, in this particular case. It should be noted that changing syncdelay is an advanced tuning, and should be carefully evaluated and tested on a per-deployment basis.

Little’s Law (IOPS = Queue Depth / Latency) indicated that lowering the flush interval would reduce the variance in queue depth thereby smoothing the overall latency. In lab testing, we had found that, under maximum load, decreasing the syncdelay to 1 second would force a ‘continuous flush’ behavior usually repeating every 6-7 seconds, reducing i/o spikes in the storage path. We had seen this as a useful technique for controlling IOPS throughput variability, but had not typically viewed it as a latency reduction technique.
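
For reference, syncdelay can be set at mongod startup (--syncdelay) or adjusted at runtime via setParameter; a PyMongo sketch, to be used only after the kind of careful per-deployment testing described above:

from pymongo import MongoClient

client = MongoClient()  # assumes a direct connection to the mongod being tuned

# Lower the flush interval from the default 60 seconds to 1 second.
client.admin.command("setParameter", 1, syncdelay=1)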

It worked!

After implementing the change, the customer excitedly reported that they were seeing average end-to-end MongoDB response times of 1.2ms, with a throughput of ~4-5k IOPS per mongod (normal for this application), and NO obvious increase in extraneous i/o.

By increasing the number of iSCSI sessions, normalizing the flush rate and removing the artificial 60s buffer, we reduced average latency more than an order of magnitude, proving out the architecture at scale in a global production environment. Increasing the iSCSI sessions increased parallelism, and decreased the latency by 3.5-4x. The reduction in syncdelay had the effect of smoothing the average queue depth being sent to the storage system, decreasing latency by slightly more than 3x.

This customer’s experience is a good example of how engaging the MongoDB team early on can ensure a smooth product launch. As of today, we’re excited to announce that SolidFire is a MongoDB partner. Learn more about the SolidFire and MongoDB integration on our Database Solutions page. To learn more about performance tuning MongoDB on SolidFire, register for our upcoming webinar on November 6 with MongoDB.

For more information on MongoDB performance, check out Performance Considerations for MongoDB, which covers other topics such as indexes and application patterns and offers insight into achieving MongoDB performance at scale.

MongoDB Engineer Valeri Karpov, who coined the term “MEAN” Stack in a blog post on the MongoDB Blog, joined the Google Developers webcast on the MEAN Stack to give an overview of MEAN and discuss why it is the new stack for the web.

Scaling Advice from MongoHQ

Oct 2 • Posted 10 months ago

With most systems, trying to run a database of any significant size requires specialized knowledge, both to build your app and to manage the database it runs on top of. MongoDB makes your first 100GB simple - from running the database to writing the code. As your database gets larger, though, it helps to understand more about how MongoDB works so you can get the most out of it. MongoHQ has noticed that their customers that reach 100GB are running commercially successful businesses. MongoHQ recommends going through the 100GB Scaling Checklist as you grow. Watch the webinar recording on the subject for the full overview:

  1. Identify your data behavior: Figure out your data patterns and how they work within your application. You will need to link your data to how your application accesses it. Consider both the simple queries and the more complex queries you will need, like multi-range queries.

  2. Refactor your schema to simplify queries

  3. Remove data that does not fit MongoDB: remove “unrefactorable” data

  4. Separate hot and cold data

  5. Don’t lean on mongodump: it disrupts RAM and causes performance issues. Consider other backup options instead, like MMS Backup

  6. Check your gauges: Monitor, monitor, monitor. Even if you aren’t having performance problems, set this up now so you can keep a history of your metrics.

  7. Avoid queries causing page faults: MongoHQ has run benchmarks to prove this. A system running fully in memory at 7,000 operations per second was cut down by 50%, to 3,500 operations per second, when 1% of operations became table scans churning on disk.

  8. Track and monitor slow queries: use Dex, MongoProfessor, Mongo-QP or MongoHQ’s Slow Query Tracker.

  9. Buying time with hardware: Don’t get addicted to buying hardware. Before making a purchase, always consider optimization and investigate separating and pairing data.

Watch the full recording with tips from MongoHQ’s Chris Winslet here.

September MongoDB News

Sep 30 • Posted 10 months ago

September was a big month. MongoDB 2.5.2 was released, MongoDB World was announced, MMS received a large feature update, and a number of MongoDB Drivers were released as well. Here’s a roundup of September news from the MongoDB newsletter:

MongoDB Announcements
  • Faceted Search with MongoDB: Faceted search, or faceted navigation, is a way of browsing and searching for items in a set of data by applying filters on various properties (facets) of the items in the collection. Faceted search makes it easy for users to navigate to the specific item or items they are interested in. It complements more free-form keyword search by facilitating exploration and discovery and is therefore useful when a user may not know the specific keywords they wish to search on. Learn how to implement faceted search with MongoDB
  • From Relational Databases to MongoDB - What You Need to Know: In this webinar on October 17, we’ll take a dive into how MongoDB works to better understand what non-relational design is, why we might use it and what advantages it gives us over relational databases. We’ll develop schema designs from examples, and consider strategies for scaling out.
  • MongoDB World: MongoDB has brought together more than 20,000 developers, IT professionals and executives in communities around the world. Now, MongoDB World will bring together this community in one place. Join us June 23-25 in New York City.
  • MMS Backup Feature Update: MMS Backup now includes the ability to exclude databases and collections, permitting you to adjust the backup service to your needs and tune costs.
MongoDB Releases

Get MongoDB updates straight to your inbox each month. Sign up for the MongoDB Newsletter.

Managing the web nuggets with MongoDB and MongoKit

Sep 27 • Posted 10 months ago

This is a guest post by Nicolas Clairon, maintainer of MongoKit and founder of Elkorado

MongoKit is a Python ODM for MongoDB. I created it in 2009 (when the ODM acronym wasn’t even in use) for my startup project called Elkorado. Now that the service is live, I realize that I never wrote about MongoKit. I’d like to introduce it to you with this quick tutorial based on real use cases from Elkorado.

Elkorado: a place to store web nuggets

Elkorado is a collaborative, interest-based curation tool. It was born out of the frustration that there is no single place to find quality resources about a particular topic of interest. There are so many blogs, forums, videos and websites out there that it is very difficult to find our way through this massive wealth of information.

Elkorado aims to help people centralize quality content, so they can easily find it later and discover new resources.

MongoDB to the rescue

Rapid prototyping is one of the most important things in the startup world, and it is an area where MongoDB shines.

The web is changing fast, and so are web resources and their metadata. MongoDB’s schemaless design is a perfect fit for storing this kind of data. After losing hair trying to use polymorphism with SQL databases, I moved to MongoDB… and I fell in love with it.

While playing with the data, I needed a validation layer and wanted to add some methods to my documents. Back then, there was no ODM for Python. And so I created MongoKit.

MongoKit: MongoDB ODM for Python

MongoKit is a thin layer on top of Pymongo. It brings field validations, inheritance, polymorphism and a bunch of other features. Let’s see how it is used in Elkorado.

Elkorado is a collection of quality web resources called nuggets. This is how we could fetch a nugget discovered by the user “namlook” with Pymongo:
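
As a rough sketch (the database, collection and field names here are assumptions, not Elkorado’s actual schema):

from pymongo import MongoClient

con = MongoClient()
nuggets = con.elkorado.nuggets.find_one({"discoverer": "namlook"})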

nuggets here is a regular Python dict.

Here’s a simple nugget definition with MongoKit:
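
A sketch of what such a model could look like (the fields, types and is_popular logic are illustrative only; MongoKit’s authorized field types differ between the Python 2 era and later forks):

from mongokit import Connection, Document

connection = Connection()

@connection.register
class Nugget(Document):
    __database__ = "elkorado"
    __collection__ = "nuggets"
    structure = {
        "url": str,          # unicode on Python 2 / older MongoKit
        "discoverer": str,
        "popularity": int,
    }

    def is_popular(self):
        return self["popularity"] > 100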

Fetching a nugget with MongoKit is pretty much the same:
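
Continuing the sketch above:

nugget = connection.Nugget.find_one({"discoverer": "namlook"})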

However, this time, nugget is a Nugget object and we can call the is_popular method on it:
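
For example:

if nugget.is_popular():
    print(nugget["url"])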

One of the main advantages of MongoKit is that all your models are registered and accessible via the connection instance. MongoKit looks at the __database__ and __collection__ attributes to know which database and which collection to use. This is useful because we have only one place to specify those variables.

Inheritance

MongoKit was first built to natively support inheritance:
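
A sketch of what such a Core model could look like (field names and defaults are assumptions):

import datetime

from mongokit import Connection, Document

connection = Connection()

@connection.register
class Core(Document):
    __database__ = "elkorado"
    structure = {
        "created_at": datetime.datetime,
        "updated_at": datetime.datetime,
    }
    default_values = {"created_at": datetime.datetime.utcnow}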

In this Core object, we are defining the database name and some fields that will be shared by other models.

If one wants a Nugget object to have date metadata, one just has to make it inherit from Core:
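
Again as a sketch, building on the Core model above:

@connection.register
class Nugget(Core):
    __collection__ = "nuggets"
    structure = {
        "url": str,
        "popularity": int,
    }

MongoKit merges the parent structure into the child, so Nugget documents also carry the created_at and updated_at fields defined on Core.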

It’s all about Pymongo

With MongoKit, you are still very close to Pymongo. In fact, MongoKit’s connection, database and collection are subclasses of Pymongo’s. If at some point in an algorithm you need pure performance, you can use Pymongo’s layer directly, which is blazing fast:

Here, connection is a MongoKit connection but it can be used like a Pymongo connection. Note that to keep the benefit of DRY, we can call the Pymongo layer from a MongoKit document:

A real life “simplified” example

Let’s see an example of CRUD done with MongoKit.

On Elkorado, each nugget is unique, but multiple users can share a nugget, each with different metadata. Each time a user picks up a nugget, a UserNugget is created with specific information. If this is the first time the nugget has been discovered, a Nugget object is created; otherwise, it is updated. Here is a simplified UserNugget structure:

This example illustrates well what can be done with MongoKit. Here, the save method has been overloaded to check if a nugget exists (remember, each nugget is unique by its URL). It will create the nugget if it does not already exist, and update it otherwise.

Updating data with MongoKit is similar to Pymongo. Use save on the object, or use the Pymongo layer directly to make atomic updates. Here, we use atomic updates to push new topics and increase the popularity:
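
A sketch of that kind of atomic update at the Pymongo level (the collection, field names and values are assumptions; recent PyMongo versions would use update_one instead of update):

from mongokit import Connection

connection = Connection()
url = "http://example.com/some-resource"  # hypothetical nugget URL
topic = "mongodb"                         # hypothetical topic picked by the user

# MongoKit's connection behaves like a Pymongo connection, so we can issue a
# raw atomic update directly against the collection.
connection.elkorado.nuggets.update(
    {"url": url},
    {"$push": {"topics": topic}, "$inc": {"popularity": 1}},
)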

Getting live

Let’s play with our model:

When calling the save method, the document is validated against the UserNugget’s structure. As expected, the fields created_at and updated_at have been added:

and the related nugget has been created:

Conclusion

MongoKit is a central piece of Elkorado. It has been written to be small and minimalist but powerful. There is so much more to say about features like inherited queries, i18n and gridFS, so take a look at the wiki to read more about how this tool can help you.

Check the documentation for more information about MongoKit. And if you register on Elkorado, check out the nuggets about MongoDB. Don’t hesitate to share your nuggets as well, the more the merrier.

New MongoDB Desktop Backgrounds

Sep 26 • Posted 10 months ago

New MongoDB Desktop Backgrounds are out, courtesy of MongoDB’s Graphic Design Team. Enjoy

The Top 5 Metrics to Watch in MongoDB

Sep 24 • Posted 10 months ago

This is part one of a three-part guest series by Alex Giamas, Co-Founder and CTO of CareAcross, a stealth mode startup seeking to empower patients. Alex is also a proud Carnegie Mellon alumnus, a graduate of the onsite courses offered at MongoDB University and a Cloudera Certified developer for Apache Hadoop (CDH-410).

At Upstream Systems, Persado, Care Across and through various consulting roles, I have dealt with all types of MongoDB installations, ranging from single-server instances and medium-sized deployments to large cloud-based sharded clusters. Whether large or small, monitoring is essential to assuring performance and reliability. We needed to visualize the health of production environments and maintain a clearly defined procedure for metrics exceeding threshold values, as well as measure the impact of development changes.

MongoDB Management Service (MMS) is rich with metrics, but in my experience, the most valuable metrics in practice are the following:

  • Lock percentage: This was more important in earlier versions, where the global write lock could eat you alive and lock yielding was not yet implemented. While it’s less important with more recent versions (please vote on SERVER-1240!), lock percentage still shows a lot about your database activity. A continuously high lock percentage will affect reads as they will eventually queue up behind writes.
  • Replication lag: Designing your application to read data from a secondary node can sometimes be a good idea, when it reduces latency of the read. But if your application is using the secondary’s data and you have high replication lag, your application will use stale data. In addition, a primary node failure when you have a high replication lag means that a secondary may not be sufficiently up-to-date in a failover scenario.
  • Journal writes: If your writes are overwhelming your journal file this will impact performance and stability of your MongoDB installation.
  • Page faults: Page faults are expensive to process and at sufficiently high rates, it probably means that your working set is not fitting in memory. In complex data driven applications, page faults may indicate a deeper root cause hidden in the implementation of the business logic of the app.
  • Non Mapped Virtual Memory: When this grows without end, it usually means a memory leak. It’s better to monitor it and proactively restart the server or try to hunt down the leak rather than wait for the crash to happen.

There’s a lot of data in MMS Monitoring but I have found that these metrics are the most interesting. In my next post, I will go over how to make this data actionable.
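
Outside of MMS, a few of these metrics can also be sampled directly from serverStatus; a rough sketch (the field names below reflect MMAPv1-era serverStatus output and may differ across MongoDB versions):

from pymongo import MongoClient

status = MongoClient().admin.command("serverStatus")

mem = status.get("mem", {})
non_mapped_virtual_mb = mem.get("virtual", 0) - mem.get("mapped", 0)
page_faults = status.get("extra_info", {}).get("page_faults")

print("page faults:", page_faults)
print("non-mapped virtual memory (MB):", non_mapped_virtual_mb)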

Twitter Memes Dataset Overview with PageRank

Sep 9 • Posted 10 months ago

This is the last of three blog posts from this summer internship project showcasing how to answer questions concerning big datasets stored in MongoDB using MongoDB’s frameworks and connectors.

Once we’d familiarized ourselves with a sizable amount of data from the Flights dataset, the next step was to move towards really BIG data: Twitter Memes. SNAP (Stanford Network Analysis Project) provides free large network data sets. In particular, we were interested in the Twitter Memes dataset from the 2008 presidential election generated by MemeTracker, a combined project between Stanford and Cornell which created news cycle maps from news articles and blog posts. This dataset is a compilation of blogs, news articles, and other media postings that pertain to the presidential election. In particular, the project focused on the time lapse between the first mention of a quote and the time it took for the quote to be picked up by mass media. However, our analysis focuses again on the PageRank and importance of both the individual URLs and websites.

The entire dataset contains over 96 million documents and over 418 million links between them. To begin, we focused solely on the memes during the month of April in 2009, which had over 15 million documents. The goal was to run PageRank on this dataset (where all web pages form a graph). The higher the PageRank of a web page, the more important the page. Web pages with a relatively high PageRank usually have a high ratio of incoming links to outgoing links compared to all other pages in the graph.

Disclaimer: As we’d learn through this project, the pages with the most PageRank do not necessarily have to be related to the 2008 presidential election.

Importing the Data

The source file quotes_2009-04.txt was 11G. It came in this continuous format:

P       http://blogs.abcnews.com/politicalpunch/2008/09/obama-says-mc-1.html
T       2008-09-09 22:35:24
Q       that's not change
Q       you know you can put lipstick on a pig
Q       what's the difference between a hockey mom and a pit bull lipstick
Q       you can wrap an old fish in a piece of paper called change
L       http://reuters.com/article/politicsnews/idusn2944356420080901?pagenumber=1&virtualbrandchannel=10112
L       http://cbn.com/cbnnews/436448.aspx
L       http://voices.washingtonpost.com/thefix/2008/09/bristol_palin_is_pregnant.html?hpid=topnews
  • P denotes the URL of the document.
  • T represents the time of the post.
  • Q is a quote found in the post.
  • L is a link that exists in the post.

This was not an ideal schema for MongoDB. With the use of inputMongo.py, the above input was converted into documents resembling the following:

{
    "_id" : ObjectId("51c8a3f200a7f40aae706e86"),
    "url" : "http://blogs.abcnews.com/politicalpunch/2008/09/obama-says-mc-1.html",
    "quotes" : [
        "that's not change",
        "you know you can put lipstick on a pig",
        "what's the difference between a hockey mom and a pit bull lipstick",
        "you can wrap an old fish in a piece of paper called change"
    ],
    "links" : [
        "http://reuters.com/article/politicsnews/idusn2944356420080901?pagenumber=1&virtualbrandchannel=10112",
        "http://cbn.com/cbnnews/436448.aspx",
        "http://voices.washingtonpost.com/thefix/2008/09/bristol_palin_is_pregnant.html?hpid=topnews"
    ],
    "time" : ISODate("2008-09-09T22:35:24Z")
}

This resulted in 15,312,738 documents. We also utilized bulk insertion instead of individual document insertions. It took about 1 hour and 48 minutes to insert all of these documents into a collection in the database.

Notice that we still generated an ObjectId as _id. We’d later realize that this is superfluous as the url is unique per document.
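
The inputMongo.py script itself isn’t reproduced in this post; a simplified sketch of the conversion (assuming one tag and one value per line, separated by whitespace) could look like this:

from datetime import datetime

def parse_memes(path):
    """Yield one document per 'P' block of the quotes_2009-04.txt format."""
    doc = None
    with open(path) as f:
        for line in f:
            parts = line.strip().split(None, 1)  # tag, then the rest of the line
            if len(parts) != 2:
                continue
            tag, value = parts
            if tag == "P":
                if doc is not None:
                    yield doc
                doc = {"url": value, "quotes": [], "links": []}
            elif doc is None:
                continue
            elif tag == "T":
                doc["time"] = datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
            elif tag == "Q":
                doc["quotes"].append(value)
            elif tag == "L":
                doc["links"].append(value)
    if doc is not None:
        yield doc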

Preparing the Dataset for PageRank

Theoretically, for PageRank to produce more accurate and reflective results, there must be no dead ends and the graph must be strongly connected (every node must be able to travel to any other node in the graph and back). Dead ends are nodes in the graph that have incoming links but no outgoing links. The presence of dead ends in a dataset leaks the PageRank in the graph so that the sum of PageRank of nodes in the graph will slowly converge to zero.

There are 2 ways to fix this problem:

  1. Use a taxation parameter, also called the random surfer, to pass a portion of PageRank from every node to every other node. This is a fix more for a graph that is not strongly connected than for graphs that have dead ends.

  2. Recursively remove dead ends.

We decided to use method (2) first because the resulting graph will be much cleaner and will theoretically give more reasonable results. To make the graph even more strongly connected, we also used a taxation parameter after erasing all dead ends.

Removing dead ends from the dataset proved to be more involved than we had initially thought. Here’s what we tried:

Attempt #1: Patience, Young Grasshoppers

The first solution created new collections of nodes that were not dead ends in the last collection. If the size of the links array is zero for a doc/node, then do not add this doc to the next iteration. Else, iterate through each link in the links array for each doc. Remove each link which points to a doc that does not exist in the last iteration collection. This is only bounded by the 16MB document limit.

Although we had an index on the url field, after 5 hours of running the script, we realized that the projected elapsed time would be over 20 days. This prompted us to explore other options.

Attempt #2: Further Optimizing URL search

We originally thought the reason for this lag was due to the script searching for a link in the entire collection, although we did have an index on it already. Instead, we created a more optimized solution to create another collection titled char8 which has as its _id, the first 8 characters of the url. Then in each char8 doc, there is a url array that starts with those 8 characters. So instead of searching for link in each url in the original memes collection, we’d only search through the links of the char8 collection (which is indexed by the first 8 characters).

This was now projected to finish in 5 days, still too inefficient.

Attempt #3: Migrating to Hadoop/Pig

However, we were undeterred; we still wanted to compute PageRank over the entire dataset with no dead ends. The single script approach was extremely inefficient and slow. It wasn’t fit for BIG DATA. Instead, we turned to Hadoop and Amazon Elastic MapReduce.

Hadoop is an open-source MapReduce framework that handles intensive data computations and often operates over large clusters, although it could also be used locally. It supports many languages and tools to ease usage. Amazon Elastic MapReduce is a service provided by Amazon that hosts the Hadoop framework on its EC2 instances that would be created solely for each MapReduce task and connects to S3 Simple Storage for data input and output. Mongo-Hadoop was the necessary library which would allow us to connect MongoDB as the input/output source for these MapReduce jobs.

We submitted jobs to Elastic MapReduce through both the UI and the command line tool elastic-mapreduce, with our access ID and secret key.

First, we wrote a Pig script, explained below, to eliminate dead ends.

-- Registers the mongo and hadoop jars from our bucket
REGISTER s3://memes-bson/mongo-2.10.1.jar
REGISTER s3://memes-bson/mongo-hadoop-pig_1.1.2-1.1.0.jar
REGISTER s3://memes-bson/mongo-hadoop-core_1.1.2-1.1.0.jar
REGISTER s3://memes-bson/mongo-hadoop_1.1.2-1.1.0.jar
-- There's a User Defined Function in myudf.MYBAG which needs to be used
REGISTER s3://memes-bson/myudf.jar

original =  LOAD '$INPUT'
            USING com.mongodb.hadoop.pig.BSONLoader;

outs =      FOREACH original GENERATE $0#'_id' AS url, $0#'value'#'links' AS links;

-- Dead ends will not appear in showing as links is empty. url is the origin of the link while
-- single is the destination. myudf.MYBAG turns the tuple of links to a bag of tuple of each link.
showing =   FOREACH outs GENERATE url, FLATTEN(myudf.MYBAG(links)) AS single;

-- JOIN will eliminate links that go to a single which doesn't exist in our dataset
joined =    JOIN outs BY url, showing BY single;

project =   FOREACH joined GENERATE showing::url AS url, showing::single AS single;

-- Group together the urls such that they form the original scheme of a url and an array of links
together =  GROUP project BY url;

result =    FOREACH together GENERATE $0 AS url, $1.single AS links;

STORE result INTO '$OUTPUT'
    USING com.mongodb.hadoop.pig.BSONStorage;

The Pig script above removed all dead ends from the current dataset. The problem with this is that removing dead ends could create new ones. For example, in the simple graph of

A -> B -> C -> D

D is the only dead end in the graph. But when D is removed, we have

A -> B -> C

So we would keep on removing the new dead ends until there were no new dead ends. In this particular “pathological” graph, the entire graph would be removed because all of the nodes in the graph would eventually be dead ends. Fortunately, most datasets aren’t linear graphs.

We had to figure out a way to repeatedly run the Pig script above until there were no more dead ends. Elastic MapReduce only allows for a single Pig script execution, so we wrote a Bash script, removeAllDeadEnds.sh, that kept running the above Pig script until the output file size stopped decreasing. The script utilized s3cmd to check file size on S3 and elastic-mapreduce-ruby to submit jobs. In theory, the output file size will decrease if and only if some dead ends have been removed.

After 10 iterations or so, the script would only erase two or three dead ends in each iteration. This continued until we stopped the script at 70 iterations, which took over 9 hours. With 8 normalized instance hours on m1.xlarge, the iterative jobs finished, on average, in 8 minutes. The initial BSON size of the dataset with dead ends was 606.9MB and the final BSON size with only 2 dead ends was 448.2MB. We decided that running more iterations would make only a trivial difference, so we moved ahead. We ended up with 1,113,524 total nodes.


PageRank in Elastic MapReduce on the dataset

At this point, we had two collections:

  1. NO_DEAD_ENDS which had 1,113,524 nodes, 2 of which were dead ends.

  2. ORIGINAL which had 36,814,086 nodes, around 75% of which were dead ends.

Whereas MongoDB’s built-in MapReduce converged quickly on the Flights dataset, the sheer size of these collections drove us to use Amazon’s Elastic MapReduce to compute the PageRank of the nodes in the graph.

NO_DEAD_ENDS
Preformat

First, we had to preformat the graph to suit the PageRank program. This involved changing the schema, using a Hadoop Job written in Java, to:

{
    "_id" : "001rian.blogspot.com/2008_07_01_archive.html",
    "ptr" : 0.2,
    "pg" : 8.98049795e-7,
    "links" : [
        "001rian.blogspot.com/2008/07/sanra-dewi-ikutan-bisnis-interne.html",
        "001rian.blogspot.com/2008/07/peduli-dengan-pagerank-dan-traffic.html",
        "001rian.blogspot.com/2008/08/88808-08-2008.html",
        "001rian.blogspot.com/2008/07/jadwal-puasa-dan-imsak-ramadhan-1429-h.html",
        "001rian.blogspot.com/2008/07/buku-elektronik-untuk-sekolah.html"
    ]
}
  • "pg" : 6.081907480807019e-8 was the initial PageRank of all nodes in a graph, which corresponds to the reciprocal of the total number of nodes in the graph.
  • ptr was the probability of this node going to any other node it links to, which corresponds to the reciprocal of the length of the links array.
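
Here is a minimal Python sketch of that preformatting step (the real job was written as a Hadoop Job in Java; the numbers match the NO_DEAD_ENDS collection):

TOTAL_NODES = 1113524            # nodes in NO_DEAD_ENDS
INITIAL_PG = 1.0 / TOTAL_NODES   # ~8.98e-7, the starting PageRank of every node

def preformat(node):
    """node is a dict with '_id' and 'links', as in the imported collection."""
    links = node.get("links", [])
    return {
        "_id": node["_id"],
        "pg": INITIAL_PG,
        "ptr": 1.0 / len(links) if links else 0.0,
        "links": links,
    }
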
PageRank

The structure of the Elastic MapReduce Java program for PageRank was similar to the MongoDB MapReduce program written for the Flights dataset.

However, instead of setting all of the arguments for the Hadoop job in a Bash script, each iteration (one Map and one Reduce) was its own job, so jobs were created continuously and their variables needed to be set dynamically. For example, the output of the previous job was set as the input of the next job for each iteration.

last = iteration - 1;
FileInputFormat.addInputPath(job, new Path(outputString + last));

// mapred.output.dir is the current configuration variable for the output path in Mongo-Hadoop.
job.getConfiguration().set("mapred.output.dir", outputString + iteration);

Again, the stopping criterion for PageRank is when the average percentage change of a node, the `residual` or the `diff`, drops below 0.1%. Instead of outputting the individual node diffs in Reduce and aggregating over the entire output to sum up the diff as in the MongoDB MapReduce for Flights, we used the Hadoop counter to sum the residuals per Reduce call. The Hadoop counter is an atomic variable that is accessible by all Mappers and Reducers and it will track any statistic or variable to be read after the job is completed.

// Need this 10E6 because long only takes whole numbers
context.getCounter(PageRank.RanCounters.RESIDUAL).increment((long) (residual * 10E6));

Therefore, after each job was completed, we viewed the total residual to determine whether it was under the threshold. Our stopping criterion was again that the residual converged below .001 * n, where n is the number of elements. In this case, PageRank finished after 16 total iterations in 2 hours and 30 minutes with 7 m1.xlarge instances.

ORIGINAL

Another implementation of the PageRank algorithm didn’t delete all dead ends from the graph, but instead connected all dead ends to every other node in the graph. This created a strongly connected graph but it also increased the noise in the graph. Since the resulting number of nodes after erasing dead ends is only 1 million out of the original 15,312,738, we wanted to see how the PageRank results would change if all of the dead ends were included.

Notice that the original collection from the text file only has 15,312,738 nodes, whereas we accounted for 36,814,086 nodes in the ORIGINAL collection. The extra 21,501,348 nodes are links in the original 15,312,738 nodes but were not documents in the imported collection. Rather than shrinking the graph, as erasing dead ends did, making the graph strongly connected increased its size by 21,501,348 extra dead-end nodes.

However, there’s no need to actually create edges between all dead end nodes and those that aren’t dead ends (with 27 million dead ends, that would create roughly 27 million * 36 million ≈ 972 trillion links). Instead, we simply distributed the summation of the PageRank from all dead ends to every other node. Here are our implementation ideas:

  1. The first (implemented) idea was to add up all of the PageRank from dead-end nodes in the Mapper, and then distribute this PageRank among all nodes in the Reducer when summing up the incoming PageRank; however, this was not feasible, as Hadoop counters incremented in the Mapper would read zero in the Reducer. Mappers and Reducers execute simultaneously, so the counter values are only finalized after the job is done.

  2. To solve (1), we waited until the end of a job to determine the PageRank of the dead ends. The final PageRank and residual were then calculated in the Mapper of the next iteration.

The main function, which submits jobs and waits for completion, retrieves the job’s total PageRank from dead ends and passes it as a configuration variable to the next job.

long prevDeadEndsPG = prevJob.getCounters().findCounter(PageRank.RanCounters.DEAD_END_PG).getValue();
currentJob.getConfiguration().setLong("deadEndsPG", prevDeadEndsPG);

Then in the Mapper step, we add this deadEndsPG divided by the total number of nodes (the probability of any dead end jumping to this node). We compute the residual using the previous PageRank value, and add to the counter for residuals. This way, the final PageRank value of a node for that iteration is determined in the Mapper instead of the Reducer.

// fetch the dead ends PG from the configuration (stored as a scaled long)
long deadEndsPG = context.getConfiguration().getLong("deadEndsPG", 0L);

// 10E10 keeps the values large enough to store as a long
double doubleDeadEndsPG = ((double) deadEndsPG) / 10E10;
double distributeDeadEndsPG = doubleDeadEndsPG / PageRank.totalNodes;

// PageRank.beta is the taxation parameter (0.9)
currentPG = PageRank.beta * (currentPG + distributeDeadEndsPG) + PageRank.distributedBeta;

double residual = Math.abs(prevpg - currentPG) / prevpg;

context.getCounter(PageRank.RanCounters.RESIDUAL).increment((long) (residual * 10E6));

This original PageRank took 17 iterations in the span of 7 hours and 12 minutes with 8 m1.xlarge instances.

RESULTS

The following is an interpretation of the results obtained after running the PageRank algorithm over the 2 collections above.

NO_DEAD_ENDS

For the dataset with no dead ends, the 10 web pages with the most PageRank are:

1. {"pg":0.020741859913578454
   ,"url":"http://slideshare.net/youtube-in-slideshare"
   ,"quotes":["insert and publish","before slide 1","after slide 1","after slide 2","after slide 3"]}
2. {"pg":0.01490199050574318
   ,"url":"http://slideshare.com/youtube-in-slideshare"
   ,"quotes":["insert and publish","before slide 1","after slide 1","after slide 2","after slide 3"]}
3. {"pg":0.00542114032291505
   ,"url":"http://london.kijiji.ca/f-buy-and-sell-w0qqcatidz10"
   ,"quotes":[]}
4. {"pg":0.005381224128322537
   ,"url":"http://badgerandblade.com/index.php?page=terms"
   ,"quotes":["badger and blade","b amp b"]}
5. {"pg":0.00328534940037117
   ,"url":"http://saintjohn.kijiji.ca/f-buy-and-sell-w0qqcatidz10"
   ,"quotes":[]}
6. {"pg":0.00301961022829243
   ,"url":"http://london.kijiji.ca/c-buy-and-sell-business-industrial-salon-equipment-w0qqadidz115815478"
   ,"quotes":[]}
7. {"pg":0.0028168240288373365
   ,"url":"http://dealsofamerica.com/terms.php"
   ,"quotes":[]}
8. {"pg":0.0025406641926389753
   ,"url":"http://london.kijiji.ca/c-buy-and-sell-cds-dvds-vhs-house-seasons-1-4-w0qqadidz123632361"
   ,"quotes":[]}
9. {"pg":0.0024984791525017504
   ,"url":"http://answerbag.com/c_view/3602"
   ,"quotes":[]}
10. {"pg":0.0021795435717848356
    ,"url":"http://chacha.com/entertainment-arts/music"
    ,"quotes":["up where they play all day in the sun","stay all day in the sun","stay or leave i want you not to go but you did","sometimes goodbye is a second chance"]}

It’s not surprising that http://slideshare.net/youtube-in-slideshare and http://slideshare.com/youtube-in-slideshare have the most PageRank. Around the beginning of 2009, SlideShare released a new feature to enable users to embed youtube videos in their presentations. At the time, this feature was in Beta. FAQs, examples, and other details were posted on both http://www.slideshare.net/youtube-in-slideshare and http://www.slideshare.com/youtube-in-slideshare. Since this was a new feature (that a lot of people were excited about!), lots of reputable bloggers posted a link to these FAQ pages to showcase the power of the new SlideShare feature. As a result, these FAQ pages accumulated most of the PageRank.

Another interesting observation is that 4 of the top 10 pages by PageRank are from kijiji, a centralized network of online urban communities for posting local online classified advertisements. The reason kijiji accumulated a lot of PageRank is that there is a lot of intra-domain (but inter-sub-domain) linking. For example, lots of pages on london.kijiji.ca link to vacation.kijiji.ca, which links back to london.kijiji.ca. Such intra-domain linking creates a web structure called a spider trap that accumulates a large portion of the PageRank available to the whole system of pages. Furthermore, about 5% of the entire NO_DEAD_ENDS collection consists of kijiji web pages.

ORIGINAL
1. {"_id" : "craigslist.org/about/scams.html"
   , "pg" : 0.018243114523103326
   , "links" : []}
2. {"_id" : "slideshare.net/youtube-in-slideshare"
   , "pg" : 0.003038463243965542
   , "links" : ["slideshare.net/youtube-in-slideshare"]
   , "quotes" : ["insert and publish","before slide 1","after slide 1","after slide 2","after slide 3"]}
3. {"_id" : "slideshare.com/youtube-in-slideshare"
   , "pg" : 0.002161141838313388
   , "links" : ["slideshare.com/youtube-in-slideshare"]
   , "quotes" : ["insert and publish","before slide 1","after slide 1","after slide 2","after slide 3"]}
4. {"_id" : "ad.doubleclick.net/clk;212944738;23941492;n?goto.canon-asia.com/oic"
   , "pg" : 0.0015214745797758247
   , "links" : []
   , "quotes" : []}
5. {"_id" : "mx.answers.yahoo.com/info/disclaimer"
   , "pg" : 0.0013631525163117727
   , "links" : []
   , "quotes" : []}
6. {"_id" : "ar.answers.yahoo.com/info/disclaimer"
   , "pg" : 0.0013542983079855681
   , "links" : []
   , "quotes" : []}
7. {"_id" : "it.answers.yahoo.com/info/disclaimer"
   , "pg" : 0.0011670409020562926
   , "links" : []
   , "quotes" : []}
8. {"_id" : "fr.answers.yahoo.com/info/disclaimer"
   , "pg" : 0.001083113456512683
   , "links" : []
   , "quotes" : []}
9. {"_id" : "seroundtable.com"
   , "pg" : 0.0009033963316740201
   , "links" : []
   , "quotes" : []}
10. {"_id" : "de.answers.yahoo.com/info/disclaimer"
    , "pg" : 0.0006914069352292967
    , "links" : []
    , "quotes" : []}

These results differed significantly from the results of the NO_DEAD_ENDS but for good reasons.

  • The PageRank values for these web pages are significantly lower than those of the top pages in NO_DEAD_ENDS because this graph has roughly 30x more pages.
  • The PageRank of the Craigslist scams information page is 6x as much as the next highest PageRank value. This is because every Craigslist page links to the scam page, creating a spider trap. Similarly, the Yahoo! Answers disclaimer pages accumulated a lot of PageRank because most Yahoo! forum pages link to one of the disclaimer pages.
  • The curious entry above is the ad.doubleclick.net URL. This URL is no longer available. But since doubleclick was an online ad platform, it’s possible that this particular ad was either served the most or gathered the most attention.
SubDomains

Looking at subdomains, we see a lot of entries for websites that have no URLs in the top 10 list. Sites like Twitter, Blogspot, and English Wikipedia are ranked among the top 25. It’s reasonable to assume that links to those websites aren’t purely spam or disclaimers. The D3 bubble chart below comprises the subdomains with the top 25 PageRanks.

Domains

By summing up the PageRank by domains instead of URLs, Yahoo! surpasses SlideShare. This makes sense as Yahoo! is quite spread out among its subdomains.

Lessons Learned

Bulk Insertion Speeds

The Flights dataset was inserted a single document at a time, as we didn’t know about bulk insertions yet. However, pymongo allows for bulk insertions, which are a much preferred and faster method for creating large collections. We utilized bulk insertions on the Twitter Memes dataset with a default batch size of 1000 docs. Here is the amount of time it took for inputMongo.py to completely finish inserting the original 15,312,738 docs into Twitter Memes:

Bulk Insert Array Size    Elapsed Time in Seconds    Elapsed Time (hr:min:sec)
1                         15007.014883               4:10:07
10                        7781.75180                 2:09:41
100                       7332.346791                1:52:12
1000                      6493.35044885              1:48:13
10000                     Error: Size too big for insert

The elapsed time for inputting the data dropped significantly between single insertions and insertions of size 10 each, but interestingly, the gains tapered off as the size of the insertions increased. The maxMessageSizeBytes value is 48 million bytes, and a bulk insertion of 10,000 exceeded this limit. This occurred in the Python driver, but some of the other drivers will split the array into 16MB chunks, which would avoid this error.
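
As a rough PyMongo sketch of the batching approach (the collection name and parser are hypothetical; insert_many in recent PyMongo versions also splits oversized batches for you):

from pymongo import MongoClient

BATCH_SIZE = 1000  # the batch size that performed best in the table above

def bulk_insert(docs, collection, batch_size=BATCH_SIZE):
    """Insert documents in fixed-size batches instead of one at a time."""
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == batch_size:
            collection.insert_many(batch)
            batch = []
    if batch:
        collection.insert_many(batch)

memes = MongoClient().test.memes
# bulk_insert(parse_memes("quotes_2009-04.txt"), memes)  # parser from the earlier sketch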

PageRank != Relevance

The results above show that the pages with the most PageRank are often the disclaimers and information pages, which probably isn’t what interests people most of the time. It turns out that Google’s Search algorithm is more complex than our simplistic version of PageRank. For instance, Google takes into account which links people click on for a certain query, thus boosting those URLs’ relevance for that query. Google also filters out some spam links and ad sites. In addition, some websites made with many internal links to intentionally boost their PageRank will have their PageRank values docked as part of Google’s campaign to disincentivize the intentional manipulation of PageRank.

Indexing URLs

Since each document has a unique URL, we created a unique index on this field. Fortunately, all the URLs in this dataset are well within the 1024-byte limit for index entries. An alternative and more optimal way to index URLs is to use hashing. In addition, MongoDB supports hashed indices.

Going with Go

Sep 5 • Posted 11 months ago

As you may have read previously on the blog, the MongoDB team is adopting the Go language for a variety of projects. At last month’s New York MongoDB User Group, Sam Helman presented on how MongoDB is using Go for new and existing cloud tools. In Sam’s talk, you’ll learn how MongoDB is using Go for the backup capabilities in MongoDB Management Service and a new continuous integration tool.

Why go with Go? Between the lightweight syntax, the first-class concurrency and the well documented, idiomatic libraries such as mgo, Go is a great choice for writing anything from small scripts to large distributed applications.

Thanks to g33ktalk for recording Sam’s talk.

The MongoDB Java Driver 3.0: What’s Changing

Aug 30 • Posted 11 months ago

By Trisha Gee, MongoDB Java Engineer and Evangelist

In the last post, we covered the design goals for the new MongoDB Java Driver. In this one, we’re going to go into a bit more detail on the changes you can expect to see, and how to start playing with an alpha version of the driver. Please note, however, that the driver is still a work in progress, and not ready for production.

New features

Other than the overall changes to design detailed above, the 3.0 driver has the following new features:

  • Pluggable Codecs: This means you can do simple changes to serialisation/deserialisation, like tell the driver to use Joda Time instead of java.util.Date, or you can take almost complete control of how to turn your Java objects into BSON. This should be particularly useful for ODMs or other libraries, as they can write their own codecs to convert Java objects to BSON bytes.
  • Predictable cluster management: We’ve done quite a lot of work around discovering the servers in your cluster and determining which ones to talk to. In particular, the driver doesn’t have to wait for all servers to become available before it can start using the ones that are definitely there. The design is event-based, so as soon as a server notifies the driver of its state, the driver can take appropriate action: use it if it’s active, or start ignoring it if it’s no longer available.
  • Additional Connection Pool features: We’ve added support for additional connection pool settings, and a number of other improvements around connection management. Here’s the full list.
  • Deprecated methods/classes will be removed: In the next 2.x release a number of methods and classes will be deprecated. These, along with existing deprecated methods, will be removed in the 3.0 driver. This should point you in the right direction to help you migrate from 2.x to 3.x.

Speaking of Migration…

We’ve worked hard to maintain backwards compatibility whilst moving forwards with the architecture of the Java driver for MongoDB. We want to make migration as painless as possible; in many cases it should be a simple drop-in replacement if you want to keep using the existing API. We hope to provide a step-by-step guide to migrating from 2.x to 3.0 in the very near future. For now, it’s worth mentioning that upgrading will be easiest if you update to 2.12 (to be released soon), migrate any code that uses deprecated features, and then move to the compatible mode of the new driver.

Awesome! Can I try it?

Yes you can! You can try out an alpha of the new driver right now, but as you’d expect there are CAVEATS: this is an alpha, and it does not support all current features (notably aggregation); although it has been tested, it is still in development and we can’t guarantee everything will work as you expect. Features which have been or will be deprecated in the 2.x driver are missing completely from the 3.0 driver. Please don’t use it in production. However, if you do want to play with it in a development environment, or want to run your existing test suite against it, please do send us any feedback you have.

If you want to use the compatible mode, with the old API (minus deprecations) and new architecture:

Maven

Gradle

You should be able to do a drop-in replacement with this dependency - use this instead of your existing MongoDB driver, run it in your test environment and see how ready you are to use the new driver.

If you want to play with the new, ever-changing, not-at-all-final API, then you can use the new driver with the new API. Because we wanted to be able to support both APIs and not have a big-bang switchover, there’s a subtle difference in the location of the driver with the updated API; see if you can spot it:

Maven

Gradle

Note that if you use the new API version, you don’t have access to the old compatible API.

Of course, the code is on GitHub

In Summary

For 3.0, we will deliver the updated, simplified architecture with the same API as the existing driver, as well as working towards a more fluent style of API. This means that although in the future you will have the option of using the new API, you should also be able to do a simple drop-in replacement of your driver jar file and have the application work as before.

A release date for the 3.0 driver has not been finalized, but keep your eyes open for it.

All Hail the new Java driver!

Faceted Search with MongoDB

Aug 30 • Posted 11 months ago

By Jon Rangel, MongoDB Consulting Engineer

Introduction

Faceted search, or faceted navigation, is a way of browsing and searching for items in a set of data by applying filters on various properties (facets) of the items in the collection. It is increasingly seen as an important part of the UI for many search platforms, and indeed nowadays is pretty much expected in places such as e-commerce websites.

Faceted search makes it easy for users to navigate to the specific item or items they are interested in. It complements more free-form keyword search by facilitating exploration and discovery and is therefore useful when a user may not know the specific keywords they wish to search on.

Some core functionality that a faceted search feature should provide might include:

  • finding the items that match a particular value of a certain facet (e.g. colour:blue)
  • finding the items in the intersection of multiple facet values (e.g. colour:blue AND size:medium)
  • finding the items in the union of multiple facet values (e.g. colour:blue OR colour:red OR size:large)
  • for each facet filter combination, display to the user the facet values on which it is possible to filter further (“drill down”)
  • for each facet value on which it is possible to drill down, display to the user the count of items matching that filter.

In this article, we’ll look at implementing the above faceted search functionality using a pure MongoDB solution. We’ll examine a number of approaches to solving this problem, and discuss their relative performance characteristics and any other pros/cons. We will also introduce some third party tools that, alternatively, can integrate with MongoDB to provide faceted search functionality.

Navigating a Book Store

Suppose we want to build faceted search functionality for a product catalog for a book store. A typical document representing a publication in the catalog might look something like the following:

    {
        _id : 123,
        title : "MongoDB: The Definitive Guide",
        authors : [ "Kristina Chodorow" ],
        publication_date : ISODate("2013-05-23"),
        pages : 432,
        edition : 2,
        isbn_10 : "1449344682",
        isbn_13 : "978-1449344689",
        language : "English",
        publisher : {
            name : "O'Reilly Media",
            ...
        },
        last_updated : ISODate("2013-05-16"),
        ...
    }

First off, let’s state some reasonable assumptions about the facets for this (or indeed any other) catalog:

  • The total number of facets will be small.
  • The total number of possible facet values for each facet may be large, but will typically be small.
  • Each item in the catalog may have zero or more facet values (“tags”) for each facet (but typically one).
  • The facets are well-known up front, and change rarely if at all. The set of facet values may change frequently i.e. any time the product catalog is updated to add/remove items, or change the tags on existing items.
  • The application has knowledge of the facets being used, but not the set of all possible facet values that exist in the catalog for each of those facets.

For this example, let’s say we have three facets on which we wish to search — Subject, Publisher and Language — and consider how to search efficiently, and how to generate the faceted navigation meta-data to present to the user. We will test on some pre-generated test data based on a real-world product catalog.

Searching

The first part of the problem to solve is how to efficiently search for items in the product catalog. A few schema and indexing approaches are presented below.

Solution #1

One way to define the facet tags for a publication would be to store all facet types and values in subdocuments in an array, as follows:

  {
        _id: 123,
        ...
        facets1 : [
            {
                type : "subject",
                val : "MongoDB"
            },
            {
                type : "subject",
                val : "Databases"
            },
            {
                type : "publisher",
                val : "O'Reilly Media"
            },
            {
                type : "language",
                val : "English"
            }
        ]
    } 

A single ‘generic’ compound index can then be created containing all the facets and facet values:

 > db.books.ensureIndex({"facets1.type" : 1, "facets1.val" : 1})
    > db.books.stats()
    {
        "ns" : "test.books",
        "count" : 105280,
        "size" : 109597152,
        "avgObjSize" : 1041.0063829787234,
        ...
        "totalIndexSize" : 29891456,
        "indexSizes" : {
            "_id_" : 3433920,
            "facets1.type_1_facets1.val_1" : 26457536
        },
        "ok" : 1
    }

See this blog post for a good treatment on these kinds of generic indexes.

Let’s see how this performs for some faceted searches, using explain(). We’ll look at queries on a single facet tag to start with.

Find all books about databases:

  > db.books.find(
    ...     { "facets1" : { $elemMatch : { "type" : "subject", "val" : "Databases" } } }
    ... ).explain()
    {
        "cursor" : "BtreeCursor facets1.type_1_facets1.val_1",
        "isMultiKey" : true,
        "n" : 7315,
        "nscannedObjects" : 7315,
        "nscanned" : 7315,
        "nscannedObjectsAllPlans" : 7315,
        "nscannedAllPlans" : 7315,
        "scanAndOrder" : false,
        "indexOnly" : false,
        "nYields" : 0,
        "nChunkSkips" : 0,
        "millis" : 27,
        "indexBounds" : {
            "facets1.type" : [
                [
                    "subject",
                    "subject"
                ]
            ],
            "facets1.val" : [
                [
                    "Databases",
                    "Databases"
                ]
            ]
        },
        "server" : "rangel.lan:27017"
    }

Find all books by a specific publisher:

  > db.books.find(
    ...     { "facets1" : { $elemMatch : { "type" : "publisher", "val" : "O'Reilly Media" } } }
    ... ).explain()
    {
        "cursor" : "BtreeCursor facets1.type_1_facets1.val_1",
        "isMultiKey" : true,
        "n" : 39960,
        "nscannedObjects" : 39960,
        "nscanned" : 39960,
        "nscannedObjectsAllPlans" : 39960,
        "nscannedAllPlans" : 39960,
        "scanAndOrder" : false,
        "indexOnly" : false,
        "nYields" : 0,
        "nChunkSkips" : 0,
        "millis" : 133,
        "indexBounds" : {
            "facets1.type" : [
                [
                    "publisher",
                    "publisher"
                ]
            ],
            "facets1.val" : [
                [
                    "O'Reilly Media",
                    "O'Reilly Media"
                ]
            ]
        },
        "server" : "rangel.lan:27017"
    }

Both of these queries use the index optimally as the number of documents returned is the same as the number of documents scanned (nscanned is the same as n).

How about queries for documents matching the union or intersection of multiple facet values? To do these “and”/”or” queries we use the $all/$in operators respectively.

Find all books about databases OR published by O’Reilly Media:

  > db.books.find(
    ...     { "facets1" :
    ...         { "$in" : [
    ...             { $elemMatch : { "type" : "publisher", "val" : "O'Reilly Media" } },
    ...             { $elemMatch : { "type" : "subject", "val" : "Databases" } }
    ...         ]}
    ...     }
    ... ).explain()
    Fri Aug 16 15:59:04.989 JavaScript execution failed: error: 
    { "$err" : "$elemMatch not allowed within $in", "code" : 15881 } at src/mongo/shell/query.js:L128

Oops! This type of search doesn’t work using $in to construct the query, as we cannot use the $elemMatch operator within an $in clause. This query can instead be constructed using the $or operator:

  > db.books.find(
    ...     { "$or" : [
    ...             { "facets1" : { $elemMatch : { "type" : "publisher", "val" : "O'Reilly Media" } } },
    ...             { "facets1" : { $elemMatch : { "type" : "subject", "val" : "Databases" } } }
    ...         ]
    ...     }
    ... ).explain()
    {
        "clauses" : [
            {
                "cursor" : "BtreeCursor facets1.type_1_facets1.val_1",
                "isMultiKey" : true,
                "n" : 40019,
                "nscannedObjects" : 40019,
                "nscanned" : 40019,
                "nscannedObjectsAllPlans" : 40019,
                "nscannedAllPlans" : 40019,
                "scanAndOrder" : false,
                "indexOnly" : false,
                "nYields" : 0,
                "nChunkSkips" : 0,
                "millis" : 118,
                "indexBounds" : {
                    "facets1.type" : [
                        [
                            "publisher",
                            "publisher"
                        ]
                    ],
                    "facets1.val" : [
                        [
                            "O'Reilly Media",
                            "O'Reilly Media"
                        ]
                    ]
                }
            },
            {
                "cursor" : "BtreeCursor facets1.type_1_facets1.val_1",
                "isMultiKey" : true,
                "n" : 6640,
                "nscannedObjects" : 7374,
                "nscanned" : 7374,
                "nscannedObjectsAllPlans" : 7374,
                "nscannedAllPlans" : 7374,
                "scanAndOrder" : false,
                "indexOnly" : false,
                "nYields" : 1,
                "nChunkSkips" : 0,
                "millis" : 123,
                "indexBounds" : {
                    "facets1.type" : [
                        [
                            "subject",
                            "subject"
                        ]
                    ],
                    "facets1.val" : [
                        [
                            "Databases",
                            "Databases"
                        ]
                    ]
                }
            }
        ],
        "n" : 46659,
        "nscannedObjects" : 47393,
        "nscanned" : 47393,
        "nscannedObjectsAllPlans" : 47393,
        "nscannedAllPlans" : 47393,
        "millis" : 242,
        "server" : "rangel.lan:27017"
    }

This query is pretty optimal: the number of documents scanned is only slightly more than the number returned, and the index is used for both parts of the “or” statement.

Next, find all books about databases AND published by O’Reilly Media:

 > db.books.find(
    ...     { "facets1" :
    ...         { "$all" : [
    ...             { $elemMatch : { "type" : "publisher", "val" : "O'Reilly Media" } },
    ...             { $elemMatch : { "type" : "subject", "val" : "Databases" } }
    ...         ]}
    ...     }
    ... ).explain()
    {
        "cursor" : "BtreeCursor facets1.type_1_facets1.val_1",
        "isMultiKey" : true,
        "n" : 675,
        "nscannedObjects" : 39960,
        "nscanned" : 39960,
        "nscannedObjectsAllPlans" : 39960,
        "nscannedAllPlans" : 39960,
        "scanAndOrder" : false,
        "indexOnly" : false,
        "nYields" : 0,
        "nChunkSkips" : 0,
        "millis" : 118,
        "indexBounds" : {
            "facets1.type" : [
                [
                    "publisher",
                    "publisher"
                ]
            ],
            "facets1.val" : [
                [
                    "O'Reilly Media",
                    "O'Reilly Media"
                ]
            ]
        },
        "server" : "rangel.lan:27017"
    }

This query uses the index, but is not optimal as many more documents are scanned than returned. Note that the number of documents scanned is the same as the number of books by this publisher (as seen from the previous query) — this is because at present $all only uses the index for the first element in the query array.

The performance of these kinds of queries will improve significantly once MongoDB supports index intersection, a feature that is coming soon (see SERVER-3071). With single-index intersection, queries like the one above will not need to scan more documents than they return. In the meantime, to optimize these kinds of queries, put the most selective filter criterion as the first element of the $all array where possible, to minimize scanning:

 > db.books.find(
    ...     { "facets1" :
    ...         { "$all" : [
    ...             { $elemMatch : { "type" : "subject", "val" : "Databases" } },
    ...             { $elemMatch : { "type" : "publisher", "val" : "O'Reilly Media" } }
    ...         ]}
    ...     }
    ... ).explain()
    {
        "cursor" : "BtreeCursor facets1.type_1_facets1.val_1",
        "isMultiKey" : true,
        "n" : 675,
        "nscannedObjects" : 7315,
        "nscanned" : 7315,
        "nscannedObjectsAllPlans" : 7315,
        "nscannedAllPlans" : 7315,
        "scanAndOrder" : false,
        "indexOnly" : false,
        "nYields" : 0,
        "nChunkSkips" : 0,
        "millis" : 20,
        "indexBounds" : {
            "facets1.type" : [
                [
                    "subject",
                    "subject"
                ]
            ],
            "facets1.val" : [
                [
                    "Databases",
                    "Databases"
                ]
            ]
        },
        "server" : "rangel.lan:27017"
    }

Solution #2

Store all facet types and values in an array, but instead of each element of the array being a subdocument, concatenate the facet type name and value into a single string value:

 {
        _id: 123,
        ...
        facets2 : [
            "subject:MongoDB",
            "subject:Databases",
            "publisher:O'Reilly Media",
            "language:English"
        ]
    }

Create an index on the facets field:

  > db.books.ensureIndex({"facets2" : 1})
    > db.books.stats()
    {
        "ns" : "test.books",
        "count" : 105280,
        "size" : 109597152,
        "avgObjSize" : 1041.0063829787234,
        ...
        "totalIndexSize" : 55694912,
        "indexSizes" : {
            "_id_" : 3433920,
            "facets1.type_1_facets1.val_1" : 26457536,
            "facets2_1" : 25803456
        },
        "ok" : 1
    }

Now let’s try some of the same queries as before. First, a simple query on a single facet value (all books about databases):

   > db.books.find(
    ...     { "facets2" : "subject"+":"+"Databases" }
    ... ).explain()
    {
        "cursor" : "BtreeCursor facets2_1",
        "isMultiKey" : true,
        "n" : 7315,
        "nscannedObjects" : 7315,
        "nscanned" : 7315,
        "nscannedObjectsAllPlans" : 7315,
        "nscannedAllPlans" : 7315,
        "scanAndOrder" : false,
        "indexOnly" : false,
        "nYields" : 0,
        "nChunkSkips" : 0,
        "millis" : 28,
        "indexBounds" : {
            "facets2" : [
                [
                    "subject:Databases",
                    "subject:Databases"
                ]
            ]
        },
        "server" : "rangel.lan:27017"
    }

This works exactly as expected.

Now, let’s try an “or” query (all books about databases OR published by O’Reilly Media):

    > db.books.find(
    ...     { "facets2" :
    ...         { "$in" : [
    ...             "publisher"+":"+"O'Reilly Media",
    ...             "subject"+":"+"Databases"
    ...         ]}
    ...     }
    ... ).explain()
    {
        "cursor" : "BtreeCursor facets2_1 multi",
        "isMultiKey" : true,
        "n" : 46600,
        "nscannedObjects" : 47275,
        "nscanned" : 47276,
        "nscannedObjectsAllPlans" : 47275,
        "nscannedAllPlans" : 47276,
        "scanAndOrder" : false,
        "indexOnly" : false,
        "nYields" : 0,
        "nChunkSkips" : 0,
        "millis" : 117,
        "indexBounds" : {
            "facets2" : [
                [
                    "publisher:O'Reilly Media",
                    "publisher:O'Reilly Media"
                ],
                [
                    "subject:Databases",
                    "subject:Databases"
                ]
            ]
        },
        "server" : "rangel.lan:27017"
    }

This query is pretty optimal: the number of documents scanned is only slightly more than the number returned, and the index bounds look sensible, showing that the index is used for both elements of the $in array. Note that $in may be used to construct this type of query since we don’t need to use the $elemMatch operator with this schema.

Finally, an “and” query (all books about databases that are published by O’Reilly Media):

  > db.books.find(
    ...     { "facets2" :
    ...         { "$all" : [
    ...             "subject"+":"+"Databases",
    ...             "publisher"+":"+"O'Reilly Media"
    ...         ]}
    ...     }
    ... ).explain()
    {
        "cursor" : "BtreeCursor facets2_1",
        "isMultiKey" : true,
        "n" : 675,
        "nscannedObjects" : 7315,
        "nscanned" : 7315,
        "nscannedObjectsAllPlans" : 7315,
        "nscannedAllPlans" : 7315,
        "scanAndOrder" : false,
        "indexOnly" : false,
        "nYields" : 0,
        "nChunkSkips" : 0,
        "millis" : 20,
        "indexBounds" : {
            "facets2" : [
                [
                    "subject:Databases",
                    "subject:Databases"
                ]
            ]
        },
        "server" : "rangel.lan:27017"
    }

If you’ve been following so far, you won’t be too surprised to see that, unfortunately, this performs exactly the same as in solution #1, for the same reasons described there. Index intersection is coming soon though!

Solution #3

Consider the following schema, where each facet is a field in a subdocument, associated with an array of the tags for that facet:

 {
        _id: 123,
        ...
        facets3 : {
            subject : [ "MongoDB", "Databases" ],
            publisher : [ "O'Reilly Media" ],
            language : [ "English" ]
        }
    }

Add an index on each facet individually:

  > db.books.ensureIndex({"facets3.subject" : 1})
    > db.books.ensureIndex({"facets3.publisher" : 1})
    > db.books.ensureIndex({"facets3.language" : 1})
    > db.books.stats()
    {
        "ns" : "test.books",
        "count" : 105280,
        "size" : 109597152,
        "avgObjSize" : 1041.0063829787234,
        ...
        "totalIndexSize" : 75464480,
        "indexSizes" : {
            "_id_" : 3433920,
            "facets1.type_1_facets1.val_1" : 26457536,
            "facets2_1" : 25803456,
            "facets3.subject_1" : 12084128,
            "facets3.publisher_1" : 2321984,
            "facets3.language_1" : 5363456
        },
        "ok" : 1
    }

This solution has the same performance characteristics as the first two solutions, with the additional benefit that the total size of the indexes required is significantly smaller. This is because we are not storing the facet names in the index for each indexed value.

Once index intersection using multiple indexes is supported (which is also coming under SERVER-3071), this approach will also perform well for “and” queries.
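
As a rough sketch (in pymongo rather than the mongo shell used above), queries against this schema simply filter on the per-facet fields; the database and collection names follow the shell examples, and the rest is an assumption for illustration.

    import pymongo

    client = pymongo.MongoClient()
    books = client.test.books

    # All books about databases (single facet value).
    db_books = books.find({"facets3.subject": "Databases"})

    # "or" query: about databases OR published by O'Reilly Media.
    or_books = books.find({"$or": [
        {"facets3.subject": "Databases"},
        {"facets3.publisher": "O'Reilly Media"},
    ]})

    # "and" query: about databases AND published by O'Reilly Media. Until index
    # intersection arrives, only one of the single-field indexes is used here.
    and_books = books.find({
        "facets3.subject": "Databases",
        "facets3.publisher": "O'Reilly Media",
    })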

Generating the Faceted Navigation Information

The other part of the faceted search problem is how to most efficiently generate and return the faceted search meta-data. One way to do this would be to use the Aggregation Framework to calculate this information on-the-fly.

For example, to get all the facet values for the collection and the count of documents associated with each one, we could perform the following aggregation query (assuming schema #2 as above):

 > db.books.aggregate([{ "$unwind" : "$facets2" },
                          { "$group" : { "_id" : "$facets2", count : { "$sum" : 1 } } },
                          { "$sort" : { "_id" : 1 } }
                         ])
    {
        "result" : [
            ...
            {
                "_id" : "publisher:O'Reilly Media",
                "count" : 39960
            },
            ...
            {
                "_id" : "subject:Databases",
                "count" : 7315
            },
            ...
        ],
        "ok" : 1
    }

Then, as the user drills down using the facets, we need to add the filter predicates to the aggregation query. For instance, if the user clicks on the “Databases” subject facet, we can obtain the facet values and counts for documents matching this filter as follows:

 > db.books.aggregate([{ "$match" : { "facets2" : "subject"+":"+"Databases" } },
                          { "$unwind" : "$facets2" },
                          { "$group" : { "_id" : "$facets2", "count" : { "$sum" : 1 } } },
                          { "$sort" : { "_id" : 1 } }
                         ])
    {
        "result" : [
            ...
            {
                "_id" : "publisher:O'Reilly Media",
                "count" : 675
            },
            ...
            {
                "_id" : "subject:Databases",
                "count" : 7315
            },
            ...
        ],
        "ok" : 1
    }

The downside to this approach is that it incurs the overhead of an additional aggregation query each time the user queries the product catalog. Furthermore, for certain choices of schema (e.g. solution #3 above) we actually need to do one aggregation query per distinct facet.
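
For example, with schema #3 the facet counts could be computed with one aggregation pipeline per facet, along the lines of this hedged pymongo sketch (facet names as in solution #3 above; everything else is an assumption):

    import pymongo

    client = pymongo.MongoClient()
    books = client.test.books

    facet_counts = {}
    for facet in ["subject", "publisher", "language"]:
        field = "$facets3." + facet
        pipeline = [
            {"$unwind": field},                                # one document per tag
            {"$group": {"_id": field, "count": {"$sum": 1}}},  # count each tag
            {"$sort": {"_id": 1}},
        ]
        facet_counts[facet] = list(books.aggregate(pipeline))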

It’s reasonable to assume that the product catalog will be updated much less frequently than it is queried, therefore it may well make sense to pre-compute the faceted navigation meta-data and store it in a separate collection. Consider the following schema for a collection of faceted navigation documents:

 {
        _id : "<facet_filter_string>",
        value : {
            count : 12,
            facets : {
                facet1_name : {
                    facet1_val1 : 8,
                    facet1_val2 : 12,
                    ...
                },
                facet2_name : {
                    facet2_val1 : 5,
                    ...
                },
                ...
            }
        }
    }

where <facet_filter_string> is either the empty string (for the document representing the root of the faceted navigation) or one or more "|<facet_name>:<facet_filter_val>|" fragments concatenated together.

Then, to find the faceted navigation information pertaining to all books about databases, the following simple query on _id will do the job:

 > db.facetnav.find({_id:"|subject:Databases|"}).pretty()
    {
        "_id" : "|subject:Databases|",
        "value" : {
            "count" : 7315,
            "facets" : {
                "publisher" : {
                    "O'Reilly Media" : 675,
                    "Pub2" : 3605,
                    "Pub3" : 185,
                    "Pub4" : 305,
                    "Pub5" : 2505,
                    "Pub6" : 15,
                    "Pub7" : 25
                },
                "language" : {
                    "English" : 7250,
                    "French" : 1095,
                    "German" : 1290
                }
            }
        }
    }

Note that it’s not necessary to generate a document like the above for every single permutation of facet filters, only for each unique combination of filters according to some predetermined canonical ordering of facets (e.g. Subject, Publisher, Language). We can then ensure that the application always builds the _id string with which to query using this canonical ordering.
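
To make that concrete, here is a small, hedged Python helper that builds the _id string in the canonical Subject, Publisher, Language order; the function name and the exact concatenation format are assumptions based on the description above.

    # Canonical facet ordering used when building facetnav _id strings.
    FACET_ORDER = ["subject", "publisher", "language"]

    def facet_filter_id(filters):
        """Build a facetnav _id from a dict of {facet_name: facet_value}.

        Returns the empty string when no filters are applied (the root
        document), otherwise one "|<facet_name>:<facet_value>|" fragment per
        selected facet, concatenated in canonical order.
        """
        return "".join("|%s:%s|" % (name, filters[name])
                       for name in FACET_ORDER if name in filters)

    # Example: the _id for "all books about databases".
    # facet_filter_id({"subject": "Databases"}) == "|subject:Databases|"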

The faceted navigation meta-data collection can be generated quite easily using a Map-Reduce job. For some example code that does this, take a look at my GitHub repo. With the map and reduce functions defined there, the facetnav info for the entire product catalog can be generated as follows:

 > db.books.mapReduce(mapFn, reduceFn, { "out" : "facetnav" })
    {
        "result" : "facetnav",
        "timeMillis" : 117529,
        "counts" : {
            "input" : 105280,
            "emit" : 2423080,
            "reduce" : 63850,
            "output" : 1599
        },
        "ok" : 1,
    }

Subsequently, whenever the product catalog is updated, the facetnav collection can be quickly updated by specifying that the map-reduce job operate only on the recently updated items and fold those changes in to the existing facetnav collection. For example:

  > db.books.ensureIndex({"last_updated : 1"})
    > db.books.mapReduce(mapFn, reduceFn,
    ...                  { "query" : { "last_updated" : { "$gt" : new Date(2013,7,1) } },
    ...                    "out" : { "reduce" : "facetnav" } })
    {
        "result" : "facetnav",
        "timeMillis" : 724,
        "counts" : {
            "input" : 1000,
            "emit" : 13484,
            "reduce" : 198,
            "output" : 1599
        },
        "ok" : 1,
    }

Third-Party Tools

There are a number of search engine software packages that provide faceted search capabilities. These typically provide the core functionality we have described above, plus more advanced features such as more convenient searching on ranges of facet values (e.g. finding documents that fall within a certain date or price range) or auto-completion (i.e. displaying relevant suggestions, grouped by facet, as a user types in a search query).

The trade-offs with using an additional search engine are:

  • Extra complexity due to adding another ‘moving part’ to your deployment
  • Your application must deal with the fact that the system as a whole is now eventually consistent with respect to the data stored in MongoDB versus the data stored in the external search engine. This may be undesirable, particularly for a product catalog that changes very frequently.

Two of the most popular search engines are Solr and ElasticSearch, which, like MongoDB, are free and open-source products.

Solr and ElasticSearch can be easily integrated with MongoDB using Mongo Connector, which comes bundled with plugins for interfacing with each of them. Using the appropriate plugin, Mongo Connector can integrate data from MongoDB into the desired target system and keep the two systems in sync.

Conclusion

Faceted search functionality can be implemented in MongoDB, without requiring the use of external search engines. When index intersection arrives, all the types of queries we have examined here will perform optimally. Integrating with an external search engine to provide faceted search is also a good option, and something to consider depending on the specific requirements of your application.

Enhancing the F# developer experience with MongoDB

Aug 28 • Posted 11 months ago

This is a guest post by Max Hirschhorn, who is currently an intern at MongoDB.

About the F# programming language

F# is a multi-paradigm language built on the .NET framework. It is functional-first and prefers immutability, but also supports object-oriented and imperative programming styles.

Also, F# is a statically-typed language with a type inference system. It has a syntax similar to OCaml, and draws upon ideas from other functional programming languages such as Erlang and Haskell.

Using the existing .NET driver

The existing .NET driver is compatible with F#, but is not necessarily written in a way that is idiomatic to use from F#.

Part of the reason behind this is that everything in F# is explicit. For example, consider the following interface and implementing class.

[<Interface>]
type I =
    abstract Foo : unit -> string

type C() =
    interface I with
        member __.Foo () = "bar"

// example usage
let c = C()
(c :> I).Foo()

So in order to use any of the interface members, the class must be upcast using the :> operator. Note that this cast is still checked at compile-time.

In a similar vein, C# supports implicit operators, which the BSON library uses for converting between a primitive value and its BsonValue equivalent, e.g.

new BsonDocument {
    { "price", 1.99 },
    { "$or", new BsonDocument {
        { "qty", new BsonDocument { { "$lt", 20 } } },
        { "sale", true }
    } }
};

whereas F# does not. This requires the developer to explicitly construct the appropriate type of BsonValue, e.g.

BsonDocument([ BsonElement("price", BsonDouble(1.99))
               BsonElement("$or", BsonArray([ BsonDocument("qty", BsonDocument("$lt", BsonInt32(20)))
                                              BsonDocument("sale", BsonBoolean(true)) ])) ])

With the query builder, we can hide the construction of BsonDocument instances, e.g.

Query.And([ Query.EQ("price", BsonDouble(1.99))
            Query.Or([ Query.LT("qty", BsonInt32(20))
                       Query.EQ("sale", BsonBoolean(true)) ]) ])

It is worth noting that the need to construct the BsonValue instances is completely avoided when using a typed QueryBuilder.

type Item = {
    Price : float
    Quantity : int
    Sale : bool
}

let query = QueryBuilder<Item>()

query.And([ query.EQ((fun item -> item.Price), 1.99)
            query.Or([ query.LT((fun item -> item.Quantity), 20)
                       query.EQ((fun item -> item.Sale), true) ]) ])

What we are looking for is a solution that matches the brevity of F# code, offers type-safety if desired, and is easy to use from the language.

New features

The main focus of this project is to make writing queries against MongoDB as natural from the F# language as possible.

bson quotations

We strive to make writing predicates as natural as possible by reusing as many of the language’s existing operators as we can.

A taste

Consider the following query

{ price: 1.99, $or: [ { qty: { $lt: 20 } }, { sale: true } ] }

we could express this with a code quotation

bson <@ fun (x : BsonDocument) -> x?price = 1.99 && (x?qty < 20 || x?sale = true) @>

or with type safety

bson <@ fun (x : Item) -> x.Price = 1.99 && (x.Quantity < 20 || x.Sale = true) @>

Breaking it down

The quotations are not actually executed, but instead are presented as an abstract syntax tree (AST), from which an equivalent BsonDocument instance is constructed.

The ? operator

The ? operator is defined to allow for an unchecked comparison. The F# language supports the ability to do a dynamic lookup (get) and assignment (set) via the ? and ?<- operators respectively, but does not actually provide an implementation.

So, the F# driver defines the ? operator as the value associated with a field in a document, cast to a fresh generic type.

// type signature: BsonDocument -> string -> 'a
let (?) (doc : BsonDocument) (field : string) =
    unbox doc.[field]

and similarly defines the ?<- operator as the coerced assignment of a generically typed value to the associated field in the document.

// type signature: BsonDocument -> string -> 'a -> unit
let (?<-) (doc : BsonDocument) (field : string) value =
    doc.[field] <- unbox value

Queries

Unchecked expressions have the type signature Expr<BsonDocument -> bool>.

// $mod
bson <@ fun (x : BsonDocument) -> x?qty % 4 = 0 @>

Checked expressions have the type signature Expr<'DocType -> bool>.

// $mod
bson <@ fun (x : Item) -> x.Quantity % 4 = 0 @>

Updates

Unchecked expressions have the type signature Expr<BsonDocument -> unit list>. The reason for the list in the return type is to allow multiple update operations to be expressed.

// $set
bson <@ fun (x : BsonDocument) -> [ x?qty <- 20 ] @>

// $inc
bson <@ fun (x : BsonDocument) -> [ x?qty <- (+) 1 ] @>

Mmm… sugar

A keen observer would notice that (+) 1 is not an int, but actually a function int -> int. We are abusing the fact that type safety is not enforced here by assigning the quantity field of the document to a lambda expression that takes the current value as its single parameter.

Note that

// $inc
bson <@ fun (x : BsonDocument) -> [ x?qty <- x?qty + 1 ] @>

is also valid.

Checked expressions either have the type signature Expr<'DocType -> unit list> or Expr<'DocType -> 'DocType>, depending on whether the document type has mutable fields (only matters for record types).

// $set
bson <@ fun (x : Item) -> [ x.Quantity <- 20 ] @>

// $inc
bson <@ fun (x : Item) -> [ x.Quantity <- x.Quantity + 1 ] @>

mongo expressions

mongo expressions use a computation expression (a monadic structure) to define a pipeline of operations that is executed against each document in the collection.

Queries

let collection : IMongoCollection<BsonDocument> = ...

mongo {
    for x in collection do
    where (x?price = 1.99 && (x?qty < 20 || x?sale = true))
}

or with a typed collection

let collection : IMongoCollection<Item> = ...

mongo {
    for x in collection do
    where (x.Price = 1.99 && (x.Quantity < 20 || x.Sale = true))
}

Updates

let collection : IMongoCollection<BsonDocument> = ...

mongo {
    for x in collection do
    update
    set x?price 0.99
    inc x?qty 1
}

or with a typed collection

let collection : IMongoCollection<Item> = ...

mongo {
    for x in collection do
    update
    set x.Price 0.99
    inc x.Quantity 1
}

Serialization of F# data types

Now supports

Conclusion

Resources

The source code is available at GitHub. We absolutely encourage you to experiment with it and provide us feedback on the API, design, and implementation. Bug reports and suggestions for improvements are welcomed, as are pull requests.

Disclaimer. The API and implementation are currently subject to change at any time. You must not use this driver in production, as it is still under development and is in no way supported by MongoDB, Inc.

Acknowledgments

Many thanks to the guidance from the F# community on Twitter, and my mentors: Sridhar Nanjundeswaran, Craig Wilson, and Robert Stam. Also, a special thanks to Stacy Ferranti and Ian Whalen for overseeing the internship program.

Today’s News

Aug 27 • Posted 11 months ago

A: I met someone from 10gen the other day…

B: From where?

A: 10gen. The company that makes MongoDB.

B: Ohhh.

As of today, the above conversation will never happen again, because we are now called “MongoDB, Inc.”

MongoDB CEO Max Schireson published a post that details why we made the decision to rebrand. If you have any questions or concerns, please let us know.

-Eliot and the MongoDB Team

Surviving Success at Matchbook: Using MMS To Track Down Performance Issues

Aug 22 • Posted 11 months ago

This is a guest post from Jared Wyatt, CTO of Matchbook, an app for remembering the places you love and want to try.

I joined Matchbook as CTO in January with the goal of breathing new life into an iOS app that had a small, but very devoted following. For various reasons, we decided to start fresh and rebuild everything from the ground up—this included completely revamping the app itself and totally redesigning our API and backend infrastructure. The old system was using MySQL as a datastore, but MongoDB seemed like a better fit for our needs because of its excellent geospatial support and the flexibility offered by its document-oriented data model.

We submitted Matchbook 2.0 to the App Store at the end of June and within a few days received an email from Apple requesting design assets because they wanted to feature our app. So, of course we were all, like, “OMG OMG OMG.”
