Aug 5 • Posted 2 months ago
By David Lynch: Principal Engineer at Soundwave
Soundwave is a smartphone app that tracks music as quickly as it is played. Soundwave tracks each user’s music-listening habits across multiple platforms and streaming services1, creating a listening profile. Users follow each other, facilitating listening, sharing, discovery, and discussion of music old and new, popular and niche.
MongoDB is our database of choice. We track around 250,000 plays per day from a user base that has grown to over 1 million in the past 13 months. Soundwave has users in all time zones, making availability critical. MongoDB replica sets provide us fault tolerance and redundancy, allowing us to perform scheduled maintenance without affecting users.
We consider responsiveness to be a key part of the Soundwave user experience so we use a sharded configuration that allows us to add compute and storage resources as needed.
We use indexes liberally, and app usage patterns require maintaining a fairly large working set in memory. Sharding allows us to deploy a larger number of relatively smaller, and disproportionately cheaper, AWS instances2 while maintaining the snappy user experience we require.
The volume of data we process today is significant, but not huge. It was important for us to architect a system from the outset that would scale easily and economically.
The complexity of data retrieval is moderate but well suited to MongoDB. Our App has been featured a couple of times by both Apple and Google. At our peak rate, we handle 30,000 new sign-ups per day. Our MongoDB configuration handles this load comfortably. In fact, with respect to scale and schema, our deployment is pretty boring and by the book.
One of Soundwaves’ most compelling features is the Music Map. Our users can draw a free-form enclosing polygon on a Google Map of any area, at any zoom level, and see what music was recently played in that area. This is captured in the screenshot below. These constraints required some interesting engineering outside the scope of the MongoDB playbook.
We’re kind of proud of our implementation. Many location aware apps with similar features reduce the user experience to some form of circular $geoNear around a point. In our minds, this is not as interesting as a randomly drawn polygon - nor is it a great user experience to promise polygons and return some other approximation of what a user wanted.
We set a target maximum 95th percentile latency of 1.5 seconds for the map feature. From a client perspective, we decided anything over this wait time is just boring for the user. We discovered that users perform searches that are either pretty-zoomed-out - like Ireland - or-pretty-zoomed in - like Landsdowne Road.
For lots of reasons, we have clusters of data around certain areas. Big cities typically have lots of dense data points, but some cities are better represented than others. This makes pretty-zoomed-out and pretty-zoomed-in hard to quantify. It follows also that zoom-level of the map is somewhat useless as an optimization datum. We wanted all our location data to be available to the user - there are a few plays in Antarctica, but they happened a while ago - it would be a shame to cut the search space by time and miss this.
Our deployment contains around 90 million play actions, each one represented by a single document. For any interaction with the app, for example capturing a play, there is an associated geo-coordinate and document. Our first implementation of this feature leveraged the 2d index type with a legacy location data point associated with each action document.
Our geo-query therefore took the form of a $within statement using a polygon built of points on the free form finger-drag.
The supporting index was constructed as follows.
This actually worked quite nicely out of the box. For legacy queries, MongoDB $within is not too fussy about self-intersecting nor is it concerned about the uniqueness of each point, or the completeness of the polygon.
The query, however was dog-slow. For dense areas at city sized zoom levels, we could not get response times under multiple seconds for our queries. For areas where significant numbers of plays were happening, hinting an index on action-time performed much better than 2d in all cases.
For areas of low activity, this strategy didn’t work particularly well. Searches at the country or continent scale didn’t complete in any reasonable time - 10s of seconds - and began to affect the latency of other parts of the app. MongoDB’s Explain showed scans over half, if not more, of the records in the collection for city sized searches, yet also confirmed that the 2d index was used. This led us to conclude that the 2d index may not be powerful enough for our needs.
Migration to GeoJSON & 2dsphere
After some hints from the MongoDB community and migration to MongoDB 2.4, we decided to move to the 2dsphere index. This required a rework of our schema and a rework of our query engine. Firstly, we needed GeoJSON points.
Next the index
Alert readers will notice the emergence of a new collection,
action_loc, in our schema. Version 2.4 2dsphere indexes do not respect the sparse property, but this was changed in 2.6. Roughly 65% of our Action collection consists of documents that do not have an associated location.
Creating the 2dsphere index on the
action collection results in 57 million documents indexed on location with no benefit. In practice this resulted in an index roughly 10GB in size. Moving to a collection where all records had locations resulted in a 7.3Gb decrease in the size of the index. Given that reads dominate writes in our application, we were happy to incur the latency of one additional write per play to the
action_loc collection when location is available for specific actions.
GeoJSON support has some tougher constraints for queries. There can be no duplicate points. Polygons must form a closed linear ring. In practice, our client application sometimes produced multiple points i.e. when the polygon self-intersected, and didn’t provide a closed linear ring ever. Finally, self-intersecting polygons, which are a common side effect of finger-led free-form drawing and interesting random searches, are not acceptable. Technically, a self-intersecting polygon could be considered as a geometry containing n distinct non self-intersecting polygons, multiple polygon geometries are not supported on MongoDB 2.4 but were added in 2.5.1.
The JTS Java Library helped us fix all this at the application tier. De-duplication and closing of linear rings was trivial however, supporting self-intersecting polygons was a little trickier. The final solution involved calculation of the convex hull of the search polygon. This guaranteed a closed linear ring around any polygon geometry, therefore removing the chance of self-intersection. As illustrated in the graphic above it does, however, result in a larger result set than intended.
Rather than show these points and confuse the user, we cull them from the response. This preserves the bounding-box user experience we were after. In other words, users get what they would expect. With regard to wasted resources, this is not ideal in theory, but works well in practice. Users typically draw self-intersecting polygons by accident3. The 2nd polygon is usually an order of magnitude smaller than the larger of two polygons that result. Therefore the wasted search space is typically small.
Once migrated, city and street level searches were an order of magnitude faster in generating results. We attribute this to the 2dsphere index and the s2 cursor. It still remained, though, that low zoom level searches, at state and country level, pushed the 95th percentile latency beyond 1.5 seconds on average and were often a painfully slow user experience.
We limit results of music map searches to 100 plays. If there are more than 100, we show the 100 most recent.
Queries at a low zoom level are usually dominated by plays that have happened recently. Tests showed that for low-zoom level wide-area searches an inverse time index was much faster at finding records than the 2dsphere index. Hence we created two indexes on location, one based on time, the other one 2dsphere.
However, In our experience, MongoDB’s query planner always chose the 2dsphere index over the time-based index for $geoWithin queries, even when tests showed that the time-based index would provide faster results in a specific case.
MongoDB’s query planner works by periodically using multiple indexes in parallel and seeing which one returns faster. The problem is that most of the time, the right answer is to use the 2dsphere index and MongoDB only periodically re-evaluates this choice. If the best choice of index varies for two queries that have the same signature, MongoDB will make the wrong choice some percentage of the time. MongoDB makes an index choice that works more often than not, but leaves the remaining cases unacceptably slow.
Some conjunction of polygon-area and pre-generated metrics of data-point density for the area was explored but quickly abandoned due to difficulty of implementation.
Finally a simple strategy of parallel execution was used. We run the query two ways, in parallel, one query hinting the 2dsphere index, the other hinting the time index. The winner is the query that returns results the fastest. If the loser runs for longer than 10 seconds it is forcibly killed to save wasted resources.
In production, we found that if we did not kill long-running speculative queries, the resulting load on the database adversely affected the read latency for other queries, and thus user experience.
Some care is needed to ensure the operations on the collection that are long running are not related to the redistribution of chunks across shards.
In production, our inverse time index beats our 2dsphere index roughly one third of the time, typically for pretty zoomed-out searches. Intuitively, this makes sense, because a continent is large and encompasses a significant portion of the geographic space. For example, if you evaluate the 1000 most recent items, 100 of them are likely to also satisfy your geographical constraint if your geographic constraint is the size of a continent.
Using $geoWithin when very zoomed out performs poorly. Again looking at the example of a continent at least 10% of all plays are probably within the continent, but as we only want to return the most recent data, we will wind up discarding nearly all the data we pull from the database as we search for the most recent data from the continent.
In the first example [Tbl. 1], the 2dsphere index wins by a huge margin. The cells covered by the areas that are considered in the query do not have very many recent plays in relation to the global set of recent plays. Therefore, the thread using time index needs to work much harder to fill the 100 pin bucket from a global perspective, scanning a factor of 40 more records to do the same work. Meanwhile, the query that uses the 2dsphere index finds that a relatively high proportion of the plays that satisfy the geographic constraint also satisfy the time constraint.
In the second example [Tbl. 2], the probability that a recent play is also geographically compatible is high, causing the time based index to win comfortably, scanning half the records of 2dsphere and bringing us back under our desired latency limit.
To reiterate, we always use the result from the faster query. The result from the slower query is discarded if it returns prior to being killed. By speculatively going down the path of both queries, we are able to get a result back to the user within our requirements without analyzing the query and its parameters. We don’t need to understand whether it’s zoomed out or zoomed in.
Parallel Index Usage
The graphic below shows a few days’ worth of queries and the respective index used. The data-points have been plotted for both winning and losing queries. For many of the of the points shown, results were not used in user responses, but latency was measured anyway.
The graph serves to show that it is the combination of indexes, that is both the red and the green points, are responsible for the p95 latency shown in the graphic below. Moreover, it illustrates roughly how the p95 might be affected should we use only one of the indexes in question. Also, the proportion of 2dsphere searches to time searches can be more roughly deduced.
We process roughly 0.4 map searches on average per second. The p95 response time is well under 1.5 seconds. We are able to achieve these results even though we are performing every query two ways, for up to 10 seconds. The graph shown below shows these results over a week in March.
MongoDB geo-search feature enables one of Soundwave’s most unique features: Music Maps. The Music Map is one of the main reasons we chose MongoDB as a primary data store.
Compared to older features like automatic sharded cluster rebalancing and aggregation, geo-search and supporting documentation are still somewhat immature. That said, with some extra work we managed to build a reasonably complex search feature without having to deploy a separate GIS stack.
Furthermore, MongoDB’s strategy of incorporating the S2 library and more GeoJSON geometries leaves us confident that we can easily build upon our map feature, now and in the future. This allows us to keep location awareness front and center as a feature of our app. Hopefully this article will clear up a few pain points for users implementing similar features.
1. Soundwave is available for iOS and Android. Installation details at http://www.soundwave.com
3. We observed queries on self-intersecting polygons about half the time. This is mostly an accident; it’s hard to draw an exact linear-ring with your finger. However, sometimes people just want to draw figures of 8.