Posts tagged: indexing

MongoDB & The Soundwave Music Map

Aug 5 • Posted 2 weeks ago

By David Lynch: Principal Engineer at Soundwave

Soundwave is a smartphone app that tracks music as soon as it is played. It records each user’s music-listening habits across multiple platforms and streaming services [1], creating a listening profile. Users follow each other, facilitating listening, sharing, discovery, and discussion of music old and new, popular and niche.

MongoDB is our database of choice. We track around 250,000 plays per day from a user base that has grown to over 1 million in the past 13 months. Soundwave has users in all time zones, making availability critical. MongoDB replica sets provide us fault tolerance and redundancy, allowing us to perform scheduled maintenance without affecting users.

We consider responsiveness to be a key part of the Soundwave user experience so we use a sharded configuration that allows us to add compute and storage resources as needed.

We use indexes liberally, and app usage patterns require maintaining a fairly large working set in memory. Sharding allows us to deploy a larger number of relatively smaller, and disproportionately cheaper, AWS instances [2] while maintaining the snappy user experience we require.

The volume of data we process today is significant, but not huge. It was important for us to architect a system from the outset that would scale easily and economically.

The complexity of data retrieval is moderate but well suited to MongoDB. Our app has been featured a couple of times by both Apple and Google. At our peak rate, we handle 30,000 new sign-ups per day. Our MongoDB configuration handles this load comfortably. In fact, with respect to scale and schema, our deployment is pretty boring and by the book.

Music Map

One of Soundwave’s most compelling features is the Music Map. Our users can draw a free-form enclosing polygon on a Google Map of any area, at any zoom level, and see what music was recently played in that area. This is captured in the screenshot below. These constraints required some interesting engineering outside the scope of the MongoDB playbook.

We’re kind of proud of our implementation. Many location-aware apps with similar features reduce the user experience to some form of circular $geoNear around a point. In our minds, this is not as interesting as a randomly drawn polygon - nor is it a great user experience to promise polygons and return some other approximation of what the user wanted.

Constraints

We set a target maximum 95th percentile latency of 1.5 seconds for the map feature. From a client perspective, we decided anything over this wait time is just boring for the user. We discovered that users perform searches that are either pretty-zoomed-out - like Ireland - or pretty-zoomed-in - like Lansdowne Road.

For lots of reasons, we have clusters of data around certain areas. Big cities typically have lots of dense data points, but some cities are better represented than others. This makes pretty-zoomed-out and pretty-zoomed-in hard to quantify. It also follows that the zoom level of the map is of little use as an optimization datum. We wanted all our location data to be available to the user - there are a few plays in Antarctica, but they happened a while ago - it would be a shame to cut the search space by time and miss them.

Schema Design

Our deployment contains around 90 million play actions, each one represented by a single document. For any interaction with the app, for example capturing a play, there is an associated geo-coordinate and document. Our first implementation of this feature leveraged the 2d index type with a legacy location data point associated with each action document.

Our geo-query therefore took the form of a $within statement using a polygon built from the points of the free-form finger-drag.
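A minimal sketch of that query - the field name and the coordinates are illustrative, not our exact schema; the points come from the user’s finger-drag:

db.action.find({
    loc: {
        $within: {
            $polygon: [ [ -6.27, 53.34 ], [ -6.25, 53.35 ], [ -6.23, 53.33 ], [ -6.26, 53.32 ] ]
        }
    }
})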

The supporting index was constructed as follows.
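Something along these lines, again with an assumed field name:

db.action.ensureIndex({ loc: "2d" })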

This actually worked quite nicely out of the box. For legacy queries, MongoDB’s $within is not too fussy about self-intersection, nor is it concerned about the uniqueness of each point or the completeness of the polygon.

The query, however, was dog-slow. For dense areas at city-sized zoom levels, we could not get response times under multiple seconds. For areas where significant numbers of plays were happening, hinting an index on action time performed much better than the 2d index in all cases.

For areas of low activity, this strategy didn’t work particularly well. Searches at the country or continent scale didn’t complete in any reasonable time - tens of seconds - and began to affect the latency of other parts of the app. MongoDB’s explain() showed scans over half, if not more, of the records in the collection for city-sized searches, yet also confirmed that the 2d index was used. This led us to conclude that the 2d index may not be powerful enough for our needs.

Migration to GeoJSON & 2dsphere

After some hints from the MongoDB community and migration to MongoDB 2.4, we decided to move to the 2dsphere index. This required a rework of our schema and a rework of our query engine. Firstly, we needed GeoJSON points.
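Roughly, each located play now carries a GeoJSON Point rather than a legacy coordinate pair (field names and values here are illustrative):

{
    "_id" : ObjectId("..."),
    "action" : ObjectId("..."),
    "time" : ISODate("2014-03-10T21:17:04Z"),
    "loc" : {
        "type" : "Point",
        "coordinates" : [ -6.2297, 53.3353 ]    // [ longitude, latitude ]
    }
}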

Next, the index:
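A sketch of the index creation, with collection and field names as above:

db.action_loc.ensureIndex({ loc: "2dsphere" })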

Alert readers will notice the emergence of a new collection, action_loc, in our schema. Version 2.4 2dsphere indexes do not respect the sparse property, but this was changed in 2.6. Roughly 65% of our Action collection consists of documents that do not have an associated location.

Creating the 2dsphere index on the action collection resulted in 57 million documents indexed on location with no benefit. In practice this meant an index roughly 10 GB in size. Moving to a collection where all records had locations resulted in a 7.3 GB decrease in the size of the index. Given that reads dominate writes in our application, we were happy to incur the latency of one additional write per play to the action_loc collection when location is available for a specific action.

GeoJSON support has some tougher constraints for queries. There can be no duplicate points. Polygons must form a closed linear ring. In practice, our client application sometimes produced duplicate points - for example when the polygon self-intersected - and never provided a closed linear ring. Finally, self-intersecting polygons, which are a common side effect of finger-led free-form drawing and interesting random searches, are not acceptable. Technically, a self-intersecting polygon could be treated as a geometry containing n distinct non-self-intersecting polygons, but multi-polygon geometries are not supported in MongoDB 2.4; they were added in 2.5.1.
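For illustration, a well-formed 2dsphere search is a $geoWithin over a GeoJSON Polygon whose linear ring is closed by repeating the first point (coordinates are placeholders):

db.action_loc.find({
    loc: {
        $geoWithin: {
            $geometry: {
                type: "Polygon",
                coordinates: [[
                    [ -6.27, 53.34 ],
                    [ -6.25, 53.35 ],
                    [ -6.23, 53.33 ],
                    [ -6.27, 53.34 ]    // first point repeated to close the ring
                ]]
            }
        }
    }
})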

The JTS Java Library helped us fix all this at the application tier. De-duplication and closing of linear rings was trivial; supporting self-intersecting polygons was a little trickier. The final solution involved calculating the convex hull of the search polygon. This guarantees a closed linear ring around any polygon geometry, and therefore removes the chance of self-intersection. As illustrated in the graphic above, it does, however, result in a larger result set than intended.

Rather than show these extra points and confuse the user, we cull them from the response. This preserves the bounding-box user experience we were after. In other words, users get what they would expect. With regard to wasted resources, this is not ideal in theory, but works well in practice. Users typically draw self-intersecting polygons by accident [3]. The second polygon is usually an order of magnitude smaller than the larger of the two polygons that result, so the wasted search space is typically small.

Once migrated, city- and street-level searches were an order of magnitude faster in generating results. We attribute this to the 2dsphere index and the S2 cursor. It still remained, though, that low-zoom-level searches, at state and country level, pushed the 95th percentile latency beyond 1.5 seconds on average and were often a painfully slow user experience.

Optimization

We limit results of music map searches to 100 plays. If there are more than 100, we show the 100 most recent.

Queries at a low zoom level are usually dominated by plays that have happened recently. Tests showed that for low-zoom-level, wide-area searches an inverse-time index was much faster at finding records than the 2dsphere index. Hence we maintained two indexes: an inverse-time index and the 2dsphere index on location.
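Roughly, the two candidate indexes (field names assumed as before):

db.action_loc.ensureIndex({ time: -1 })
db.action_loc.ensureIndex({ loc: "2dsphere" })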

In our experience, however, MongoDB’s query planner always chose the 2dsphere index over the time-based index for $geoWithin queries, even when tests showed that the time-based index would provide faster results in a specific case.

MongoDB’s query planner works by periodically using multiple indexes in parallel and seeing which one returns faster. The problem is that most of the time, the right answer is to use the 2dsphere index and MongoDB only periodically re-evaluates this choice. If the best choice of index varies for two queries that have the same signature, MongoDB will make the wrong choice some percentage of the time. MongoDB makes an index choice that works more often than not, but leaves the remaining cases unacceptably slow.

We explored some combination of polygon area and pre-generated data-point density metrics for the area, but quickly abandoned it due to the difficulty of implementation.

Finally, a simple strategy of parallel execution was used. We run the query two ways, in parallel: one query hinting the 2dsphere index, the other hinting the time index. The winner is the query that returns results fastest. If the loser runs for longer than 10 seconds, it is forcibly killed to save wasted resources.
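In mongo-shell terms, the two speculative queries look roughly like this, where geoQuery stands for the $geoWithin predicate shown earlier; in production both are issued concurrently from the application tier:

// Geo-first: find everything in the polygon, then keep the 100 most recent.
db.action_loc.find(geoQuery).hint({ loc: "2dsphere" }).sort({ time: -1 }).limit(100)

// Time-first: walk plays from newest to oldest, filtering each against the
// polygon, until 100 matches are found.
db.action_loc.find(geoQuery).hint({ time: -1 }).limit(100)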

In production, we found that if we did not kill long-running speculative queries, the resulting load on the database adversely affected the read latency for other queries, and thus user experience.

With MongoDB 2.6 it is possible to set a user-defined time limit on queries (maxTimeMS), so this step could be avoided. At the time of writing, we were not using this feature in production, so we needed a JavaScript worker that periodically killed queries on the collection that had been running for more than 10 seconds. The function is below.
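A minimal sketch of such a worker - the namespace is hypothetical and the production version differs - built on currentOp() and killOp():

function killLongRunningGeoQueries(thresholdSecs) {
    // Inspect the operations currently in progress.
    db.currentOp().inprog.forEach(function (op) {
        if (op.op === "query" &&
            op.ns === "soundwave.action_loc" &&        // hypothetical namespace
            op.secs_running > thresholdSecs &&
            !(op.query && op.query.moveChunk)) {       // leave chunk migrations alone
            db.killOp(op.opid);
        }
    });
}

// Run periodically, e.g. every few seconds:
// killLongRunningGeoQueries(10);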

Some care is needed to ensure that the long-running operations being killed on this collection are not related to the redistribution of chunks across shards.

Explaining Results

In production, our inverse time index beats our 2dsphere index roughly one third of the time, typically for pretty zoomed-out searches. Intuitively, this makes sense, because a continent is large and encompasses a significant portion of the geographic space. For example, if you evaluate the 1000 most recent items, 100 of them are likely to also satisfy your geographical constraint if your geographic constraint is the size of a continent.

Using $geoWithin when very zoomed out performs poorly. Again looking at the example of a continent, at least 10% of all plays are probably within it, but since we only want to return the most recent plays, we wind up discarding nearly all the data we pull from the database as we search for the most recent plays from the continent.

In the first example [Tbl. 1], the 2dsphere index wins by a huge margin. The cells covered by the areas considered in the query do not have very many recent plays relative to the global set of recent plays. Therefore the thread using the time index has to work much harder to fill the 100-pin bucket from a global perspective, scanning a factor of 40 more records to do the same work. Meanwhile, the query using the 2dsphere index finds that a relatively high proportion of the plays that satisfy the geographic constraint also satisfy the time constraint.

In the second example [Tbl. 2], the probability that a recent play is also geographically compatible is high, causing the time-based index to win comfortably, scanning half the records of the 2dsphere index and bringing us back under our desired latency limit.

To reiterate, we always use the result from the faster query. The result from the slower query is discarded if it returns prior to being killed. By speculatively going down the path of both queries, we are able to get a result back to the user within our requirements without analyzing the query and its parameters. We don’t need to understand whether it’s zoomed out or zoomed in.

Parallel Index Usage

The graphic below shows a few days’ worth of queries and the respective index used. The data points have been plotted for both winning and losing queries. For many of the points shown, results were not used in user responses, but latency was measured anyway.

The graph serves to show that it is the combination of indexes - that is, both the red and the green points - that is responsible for the p95 latency shown in the graphic below. Moreover, it illustrates roughly how the p95 might be affected should we use only one of the indexes in question. The proportion of 2dsphere searches to time searches can also be roughly deduced.

We process, on average, roughly 0.4 map searches per second. The p95 response time is well under 1.5 seconds. We are able to achieve these results even though we are performing every query two ways, for up to 10 seconds. The graph below shows these results over a week in March.

Conclusion

MongoDB’s geo-search features enable one of Soundwave’s most unique features: the Music Map. The Music Map is one of the main reasons we chose MongoDB as a primary data store.

Compared to older features like automatic sharded cluster rebalancing and aggregation, geo-search and supporting documentation are still somewhat immature. That said, with some extra work we managed to build a reasonably complex search feature without having to deploy a separate GIS stack.

Furthermore, MongoDB’s strategy of incorporating the S2 library and more GeoJSON geometries leaves us confident that we can easily build upon our map feature, now and in the future. This allows us to keep location awareness front and center as a feature of our app. Hopefully this article will clear up a few pain points for users implementing similar features.

Footnotes

1. Soundwave is available for iOS and Android. Installation details at http://www.soundwave.com

2. We use r3.4xlarge instances.

3. We observed queries on self-intersecting polygons about half the time. This is mostly an accident; it’s hard to draw an exact linear-ring with your finger. However, sometimes people just want to draw figures of 8.

Efficient Indexing in MongoDB 2.6

Jun 4 • Posted 2 months ago

By Osmar Olivo, Product Manager at MongoDB

One of the most powerful features of MongoDB is its rich indexing functionality. Users can specify secondary indexes on any field, compound indexes, geospatial, text, sparse, TTL, and others. Having extensive indexing functionality makes it easier for developers to build apps that provide rich functionality and low latency.

MongoDB 2.6 introduces a new query planner, including the ability to perform index intersection. Prior to 2.6 the query planner could only make use of a single index for most queries. That meant that if you wanted to query on multiple fields together, you needed to create a compound index. It also meant that if there were several different combinations of fields you wanted to query on, you might need several different compound indexes.

Each index adds overhead to your deployment - indexes consume space, on disk and in RAM, and indexes are maintained during updates, which adds disk IO. In other words, indexes improve the efficiency of many operations, but they also come at a cost. For many applications, index intersection will allow users to reduce the number of indexes they need while still providing rich features and low latency.

In the following sections we will take a deep dive into index intersection and how it can be applied to applications.

An Example - The Phone Book

Let’s take the example of a phone book with the following schema.

{
    FirstName
    LastName
    Phone_Number
    Address
}

If I were to search for “Smith, John” how would I index the following query to be as efficient as possible?

db.phonebook.find({ FirstName : "John", LastName : "Smith" })

I could use an individual index on FirstName and search for all of the “Johns”.

This would look something like ensureIndex( { FirstName : 1 } )

We run this query and we get back 200,000 John Smiths. Looking at the explain() output below however, we see that we scanned 1,000,000 “Johns” in the process of finding 200,000 “John Smiths”.

> db.phonebook.find({ FirstName : "John", LastName : "Smith"}).explain()
{
    "cursor" : "BtreeCursor FirstName_1",
    "isMultiKey" : false,
    "n" : 200000,
    "nscannedObjects" : 1000000,
    "nscanned" : 1000000,
    "nscannedObjectsAllPlans" : 1000101,
    "nscannedAllPlans" : 1000101,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 2,
    "nChunkSkips" : 0,
    "millis" : 2043,
    "indexBounds" : {
        "FirstName" : [
            [
                "John",
                "John"
            ]
        ]
    },
    "server" : "Oz-Olivo-MacBook-Pro.local:27017"
}

How about creating an individual index on LastName?

This would look something like ensureIndex( { LastName : 1 } )

Running this query we get back 200,000 “John Smiths” but our explain output says that we now scanned 400,000 “Smiths”. How can we make this better?

db.phonebook.find({ FirstName : "John", LastName : "Smith"}).explain()
{
    "cursor" : "BtreeCursor LastName_1",
    "isMultiKey" : false,
    "n" : 200000,
    "nscannedObjects" : 400000,
    "nscanned" : 400000,
    "nscannedObjectsAllPlans" : 400101,
    "nscannedAllPlans" : 400101,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 1,
    "nChunkSkips" : 0,
    "millis" : 852,
    "indexBounds" : {
        "LastName" : [
            [
                "Smith",
                "Smith"
            ]
        ]
    },
    "server" : "Oz-Olivo-MacBook-Pro.local:27017"
}

So we know that there are 1,000,000 “John” entries, 400,000 “Smith” entries, and 200,000 “John Smith” entries in our phonebook. Is there a way that we can scan just the 200,000 we need?

In the case of a phone book this is somewhat simple; since we know that we want it to be sorted by LastName, FirstName, we can create a compound index on them, like the below.

ensureIndex( { LastName : 1, FirstName : 1 } )

db.phonebook.find({ FirstName : "John", LastName : "Smith"}).explain()
{
    "cursor" : "BtreeCursor LastName_1_FirstName_1",
    "isMultiKey" : false,
    "n" : 200000,
    "nscannedObjects" : 200000,
    "nscanned" : 200000,
    "nscannedObjectsAllPlans" : 200000,
    "nscannedAllPlans" : 200000,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "millis" : 370,
    "indexBounds" : {
        "LastName" : [
            [
                "Smith",
                "Smith"
            ]
        ],
        "FirstName" : [
            [
                "John",
                "John"
            ]
        ]
    },
    "server" : "Oz-Olivo-MacBook-Pro.local:27017"
}

Looking at the explain on this, we see that the index only scanned the 200,000 documents that matched, so we got a perfect hit.

Beyond Compound Indexes

The compound index is a great solution in the case of a phonebook in which we always know how we are going to be querying our data. Now what if we have an application in which users can arbitrarily query for different fields together? We can’t possibly create a compound index for every possible combination because of the overhead imposed by indexes, as we discussed above, and because MongoDB limits you to 64 indexes per collection. Index intersection can really help.

Imagine the case of a medical application which doctors use to filter through patients. At a high level, omitting several details, a basic schema may look something like the below.

{
      Fname
      LName
      SSN
      Age
      Blood_Type
      Conditions : [] 
      Medications : [ ]
      ...
      ...
}

Some sample searches that a doctor/nurse may run on this system would look something like the below.

Find me a Patient with Blood_Type = O under the age of 50

db.patients.find( { Blood_Type : "O", Age : { $lt : 50 } } )

Find me all patients over the age of 60 on Medication X

db.patients.find( { Medications : "X", Age : { $gt : 60 } } )

Find me all Diabetic patients on medication Y

db.patients.find( { Conditions : "Diabetes", Medications : "Y" } )

With all of the unstructured data in modern applications, along with the desire to be able to search for things as needed in an ad-hoc way, it can become very difficult to predict usage patterns. Since we can’t possibly create compound indexes for every combination of fields, because we don’t necessarily know what those will be ahead of time, we can try indexing individual fields to try to salvage some performance. But as shown above in our phone book application, this can lead to performance issues in which we pull documents into memory that are not matches.

To avoid paging in unnecessary data, the new index intersection feature in 2.6 increases the overall efficiency of these types of ad-hoc queries by processing the indexes involved individually and then intersecting the result sets to find the matching documents. This means you only pull the final matching documents into memory; everything else is processed using the indexes. This uses more CPU, but should greatly reduce the amount of IO done for queries where not all of the data is in memory, as well as allow you to use your memory more efficiently.

For example, looking at the earlier example:

db.patients.find( { Blood_Type : "O", Age : { $lt : 50 } } )

It is inefficient to find all patients with Blood_Type O (which could be millions) and then pull each document into memory to find the ones with Age < 50, or vice versa.

Instead, the query planner finds all patients with Blood_Type O using the index on Blood_Type, and all patients with Age < 50 using the index on Age, and then pulls only the intersection of these two result sets into memory. The query planner only needs to fit the relevant subsets of the indexes in memory, instead of pulling in all of the documents. This in turn causes less paging and less thrashing of the contents of memory, which yields better overall performance.
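As a sketch, using the field names from the example schema, two single-field indexes are enough for the planner to consider an intersection plan:

db.patients.ensureIndex( { Blood_Type : 1 } )
db.patients.ensureIndex( { Age : 1 } )

// With both indexes in place, the 2.6 planner can scan each index for its own
// predicate and intersect the matching keys. When an intersection plan is
// chosen, explain() reports a "Complex Plan" cursor rather than a single
// BtreeCursor.
db.patients.find( { Blood_Type : "O", Age : { $lt : 50 } } ).explain()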

Index intersection allows for much more efficient use of existing RAM, so less total memory will usually be required to fit the working set than previously. Also, if you had several compound indexes made up of different combinations of fields, you may be able to reduce the total number of indexes on the system. This means storing fewer indexes in memory as well as achieving better insert/update performance, since fewer indexes must be maintained.

As of version 2.6.0, you cannot intersect with geo or text indexes, and you can intersect at most two separate indexes with each other. These limitations are likely to change in a future release.

Optimizing Multi-key Indexes

It is also possible to intersect an index with itself in the case of multi-key indexes. Consider the below query:

Find me all patients with Diabetes & High Blood Pressure

db.patients.find( { Conditions : { $all : [ "Diabetes", "High Blood Pressure" ] } } )

In this case we will find the result set of all patients with Diabetes, and the result set of all patients with High Blood Pressure, and intersect the two to get all patients with both. Again, this requires less memory and disk IO, yielding better overall performance. As of the 2.6.0 release, an index can intersect with itself up to 10 times.
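A single multikey index on the array field covers this case; for example:

db.patients.ensureIndex( { Conditions : 1 } )

// The planner can use the index bounds for "Diabetes" and for
// "High Blood Pressure" separately and intersect the two result sets,
// all from this one index.
db.patients.find( { Conditions : { $all : [ "Diabetes", "High Blood Pressure" ] } } )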

Do We Still Need Compound Indexes?

To be clear, compound indexing will ALWAYS be more performant IF you know what you are going to be querying on and can create one ahead of time. Furthermore, if your working set is entirely in memory, then you will not reap any of the benefits of Index Intersection as it is primarily based on reducing IO. But in a more ad-hoc case where one cannot predict the shape of the queries and the working set is much larger than available memory, index intersection will automatically take over and choose the most performant path.

Integrating MongoDB Text Search with a Python App

Jun 4 • Posted 1 year ago

By Mike O’Brien, 10gen software engineer and maintainer of Mongo-Hadoop

With the release of MongoDB 2.4, it’s now pretty simple to take an existing application that already uses MongoDB and add new features that take advantage of text search. Prior to 2.4, adding text search to a MongoDB app would have required writing code to interface with another system like Solr, Lucene, ElasticSearch, or something else. Now that it’s integrated with the database we are already using, we can accomplish the same result with reduced complexity, and fewer moving parts in the deployment.

Here we’ll go through a practical example of adding text search to Planet MongoDB, our blog aggregator site.


MongoDB Schema Design: Insights and Tradeoffs from Jetlore


MongoDB’s flexible schema is a powerful feature, and to build a successful first application you need to know how to leverage this feature to its full extent. In this presentation, Montse Medina outlines lessons learned from building Jetlore, a social content marketing platform. Some performance tips from this video:


  • Sometimes it’s ok to randomize your sharding key. When you have lots of users that want to read from other users, you’ll need to randomize it in order to have fewer disk seeks per shard.
  • Reduce collection size by always using short field names as a convention. This will help you save memory over time.
  • Always test your queries with .explain() to check that you’re hitting the right index.


October MongoDB Blogroll and Releases

Nov 5 • Posted 1 year ago
