Posts tagged: sharding

Tiered Storage Models in MongoDB: Optimizing Latency and Cost

May 14 • Posted 2 months ago

By Rohit Nijhawan, Senior Consulting Engineer at MongoDB with André Spiegel and Chad Tindel

For a user-facing application, speed and uptime are critical to success. There are a number of ways you can tune your application and hardware setup to provide the best experience for your customers — the trick is doing so at optimal cost. Here we provide an example of improving performance and lowering costs with MongoDB using tiered storage, a method of prioritizing data storage based on latency requirements.

In this example, we will be segmenting data by date: recent data is more frequently accessed and should exhibit lower latency than less recent data. However, the idea applies to other ways of segmenting data, such as location, user, source, size, or other criteria. This approach takes advantage of a powerful feature in MongoDB called tag-aware sharding that has been available since MongoDB 2.2.

Example Application: Insurance Claims

In many applications, low-latency access to data becomes less important as data ages. For example, an insurance company might prioritize access to claims from the last 12 months. Users should be able to view recent claims quickly, but once claims are more than a year old they tend to be accessed much less frequently, and the latency requirements tend to become less demanding.

By creating tiers of storage with different performance and cost profiles, the insurance company can provide a better experience for users while optimizing their costs. Older claims can be stored in a storage tier with more cost-effective hardware such as commodity hard drives. More recent data can be stored in a high-performance storage tier that provides lower latency such as SSD. Because the majority of the claims are more than a year old, storing older data in the lower-cost tier can provide significant cost advantages. The insurance company can optimize their hardware spread across the two tiers, providing a great user experience at an optimized cost point.

The requirements for this application can be summarized as:

  • The trailing 12 months of claims should reside on the faster storage tier
  • Claims over a year old should move to the slower storage tier
  • Over time new claims arrive, and older claims need to move from the faster tier to the slower tier

For simplicity, throughout this overview, we’ll distinguish the claims data by “current” and “tier-2” data.

Building Your Own Process: An Operational Headache

One approach to these requirements is to use periodic batch jobs: selecting data, loading it into the archive, and erasing it from the faster storage. However, this is inherently complex:

  • The move process must be carefully coded to fail gracefully. In the event that a load fails, you don’t want to delete the original data!
  • If the data to be moved is large, you may wish to throttle the operations.
  • If moves succeed partially, you have to retry the unfinished data.
  • Unless you plan on halting your application during the move (generally unacceptable), your application needs custom code to find the data before, during, and after the move.
  • Your application needs to understand the physical location of the data, which unnecessarily couples your code to the partitioning logic.

Furthermore, introducing another custom component to your operations requires additional maintenance and monitoring.

It’s an operational headache that many teams are forced to endure, but there is a simpler way: have MongoDB handle the load of migrating documents from the recent storage machines to the tier 2 storage machines, transparently. As it turns out, you can easily implement this approach with a feature called Tag-Aware Sharding.

The MongoDB Way: Tag-aware Sharding

MongoDB provides a feature called sharding to scale systems horizontally across multiple machines. Sharding is transparent to your application - whether you have 1 or 100 shards, your application code is the same. For a comprehensive description of sharding please see the Sharding Guide.

A key component of sharding is a process called the balancer. As collections grow, the balancer operates in the background to carefully move documents between shards. Normally the balancer works to achieve a uniform distribution of documents across shards. However, with tag-aware sharding we can create policies that affect where documents are stored. This feature can be applied in many use cases. One example is to keep user data in data centers that are near the user. In our application, we can use this feature to keep current data on our fast servers, and tier 2 data on cheaper, slower servers.

Here’s how it works:

  • Shards are assigned tags. A tag is an alphanumeric alias like “London-DC”.
  • Unique shard key ranges are ‘pinned’ to tags.
  • During normal balancing operations, chunks migrate only to shards whose tag is associated with a key range that contains the chunk’s key range.
  • There are a few subtleties regarding what happens when a chunk’s key range overlaps more than one tag range; please read the documentation carefully regarding this particular case.

This means that we can assign the “tier-2” tag to shards running on slow servers and “current” tags to shards running on fast servers, and the balancer will handle migrating the data between tiers automatically. What’s great is that we can keep all the data in one database, so our application code doesn’t need to change as data moves between storage tiers.

Determining the shard key

When you query a sharded collection, the query router will do its best to only inspect the shards holding your data, but it can only do this if you provide the shard key as part of your query. (See Sharded Cluster Query Routing for more information.)

So we need to make sure that we look up documents by the shard key. We also know that time is the basis for determining the location of documents in our two storage tiers. Accordingly, the shard key must contain an explicit timestamp. In our example, we’ll be using Enron’s email dataset, and we’ll set the top-level “date” field as the shard key. Here’s a sample document:
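
The layout below is representative of the corpus; apart from the top-level “date” field that we shard on, the field names and values are illustrative:

    {
        "_id" : ObjectId("4f16fc97d1e2d32371003e27"),
        "date" : ISODate("2001-04-25T13:01:31Z"),
        "from" : "jane.doe@enron.com",
        "to" : [ "john.smith@enron.com" ],
        "subject" : "RE: forecast numbers",
        "body" : "Please see the attached spreadsheet ..."
    }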

Because the time is stored in the most significant digits of the date, messages from any given day will numerically precede messages from subsequent days.

Implementation

Here are the steps to set up this system:

1. Set up an empty, sharded MongoDB cluster
2. Create a target database to host the sharded collection
3. Assign tags to different shards corresponding to the storage tiers
4. Assign tag ranges to the shards
5. Load data into the MongoDB cluster

Set up the cluster

The first thing you will want to do is set up your sharded cluster. You can see more information on how to set this up here.

In this case we will have a database called “enron” and a collection called “messages” which holds part of the Enron email corpus. In this example, we’ve set up a cluster with three shards. The first, shard0000, is optimized for low-latency access to data. The other two, shard0001 and shard0002, use more cost effective hardware for data that is older than the identified cutoff date.

Here’s our sharded cluster. These are empty machines with no data:
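
A trimmed sh.status() at this point would show the three shards and nothing else; the host names and ports below are placeholders:

    sh.status()
    // --- Sharding Status ---
    //   shards:
    //     { "_id" : "shard0000", "host" : "ssd-1.example.net:27017" }
    //     { "_id" : "shard0001", "host" : "hdd-1.example.net:27017" }
    //     { "_id" : "shard0002", "host" : "hdd-2.example.net:27017" }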

Adding the tags

We can “tag” each of these shards to associate them with documents that should belong to our “current” tier or those that should belong to “tier-2.” In the absence of tags and tag ranges, the balancer simply tries to keep the number of chunks on each shard equal, without regard to any other data in the fields. Before we add the data to our collection, let’s tag shard0000 as “current” and the other two as “tier-2”:
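
Using the sh.addShardTag() helper:

    sh.addShardTag("shard0000", "current")
    sh.addShardTag("shard0001", "tier-2")
    sh.addShardTag("shard0002", "tier-2")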

Now we can verify our tags by calling sh.status():
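
The shards section of the output should now list each shard with its tag, roughly like this (hosts abridged):

    sh.status()
    //   shards:
    //     { "_id" : "shard0000", "host" : "...", "tags" : [ "current" ] }
    //     { "_id" : "shard0001", "host" : "...", "tags" : [ "tier-2" ] }
    //     { "_id" : "shard0002", "host" : "...", "tags" : [ "tier-2" ] }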

Next, we need to set up a database and collection for the Enron emails. We’ll set up a new database ‘enron’ with a collection called ‘messages’ and enable sharding for that database:
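
Sharding is enabled at the database level first; a minimal sketch from the mongos shell:

    db = db.getSiblingDB("enron")   // switch to (and implicitly create) the enron database
    sh.enableSharding("enron")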

Since we’re going to shard the collection, we’ll need to set up a shard key. We will use the ‘date’ field as our shard key since this is the field that will define how the documents are distributed across shards:
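
With the top-level “date” field as the key, that is:

    sh.shardCollection("enron.messages", { "date" : 1 })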

Defining the cutoff date between tiers

The cutoff point between “current” data and “tier-2” data is a point in time that we will update periodically to keep the most recent documents in our “current” shard. We will start with a cutoff of July 1, 2001, saved as an ISODate: ISODate(“2001-07-01”). Once we add the data to our collection, we will set this as the tag range. Going forward, when we add documents to the “messages” collection, any documents newer than July 1, 2001 will end up on the “current” shard, and documents older than that will end up on the “tier-2” shards.

It’s important that the two ranges share exactly the same boundary point in time. The lower bound of a tag range is inclusive, and the upper bound is exclusive. This means a document with a date of exactly ISODate(“2001-07-01”) will go on the “current” shard, not the “tier-2” shard.

Below you will see each of the shard’s new tag ranges:
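
A sketch of how those ranges would be assigned with the sh.addTagRange() helper, using MinKey and MaxKey as the open ends:

    // everything strictly before the cutoff belongs to the slow tier
    sh.addTagRange("enron.messages", { "date" : MinKey }, { "date" : ISODate("2001-07-01") }, "tier-2")
    // the cutoff itself and everything after it belongs to the fast tier
    sh.addTagRange("enron.messages", { "date" : ISODate("2001-07-01") }, { "date" : MaxKey }, "current")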

As a final check, look in the config database for the tag range definitions.
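
For example (output abridged; there is one document per tag range, with ns, min, max and tag fields):

    db.getSiblingDB("config").tags.find()
    // { "ns" : "enron.messages", "min" : { "date" : { "$minKey" : 1 } },
    //   "max" : { "date" : ISODate("2001-07-01T00:00:00Z") }, "tag" : "tier-2" }
    // { "ns" : "enron.messages", "min" : { "date" : ISODate("2001-07-01T00:00:00Z") },
    //   "max" : { "date" : { "$maxKey" : 1 } }, "tag" : "current" }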

Now that all the shards and ranges are defined, we are ready to load the message data into the server. The collection will follow the instructions given by the tag ranges and land on the correct machines.
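
For example, loading a JSON export of the corpus through the mongos with mongoimport (the host and file name are placeholders):

    mongoimport --host mongos.example.net --db enron --collection messages --file messages.json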

Now, let’s check the sharding status to see where the documents reside:
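
An abridged sh.status() should show the chunks of enron.messages distributed according to the tag ranges, along these lines (chunk counts elided):

    sh.status()
    //   enron.messages
    //     shard key: { "date" : 1 }
    //     chunks:
    //       shard0000   <n>
    //       shard0001   <n>
    //       shard0002   <n>
    //     tag: tier-2   { "date" : { "$minKey" : 1 } } -->> { "date" : ISODate("2001-07-01T00:00:00Z") }
    //     tag: current  { "date" : ISODate("2001-07-01T00:00:00Z") } -->> { "date" : { "$maxKey" : 1 } }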

That’s it! The mongos process automatically moves documents to comply with the tag ranges. In this example, it took all documents still on the “current” shard with a date older than ISODate(“2001-07-01T00:00:00Z”) and moved them to the “tier-2” shards.

The tag ranges must be updated on a regular basis to keep the cutoff point at the correct interval of time in the past (1 year, in our case). In order to do this, both ranges need to be updated. To perform this change, the balancer should temporarily be disabled so that there is no point where the ranges overlap. Stopping the balancer temporarily is a safe operation - it will not affect the application or the experience of users.

If you wanted to move the cutoff forward another month, to August 1, 2001, you just need to follow these steps:

1. Stop the balancer:
   sh.setBalancerState(false)
2. Create a chunk split at August 1:
   sh.splitAt('enron.messages', { "date" : ISODate("2001-08-01") })
3. Move the cutoff date to ISODate("2001-08-01T00:00:00Z"):
   var configdb = db.getSiblingDB("config");
   configdb.tags.update({ tag : "tier-2" }, { $set : { 'max.date' : ISODate("2001-08-01") } })
   configdb.tags.update({ tag : "current" }, { $set : { 'min.date' : ISODate("2001-08-01") } })
4. Re-start the balancer:
   sh.setBalancerState(true)
5. Verify the sharding status

By updating the chunk split to August 1, we have migrated all the documents with a date after July 1 but before August 1 from the “current” shard to the “tier-2” shards. The good news is that we were able to perform this operation without changing our application code and with no database downtime. We can also see that it would be simple to schedule this process to run automatically through an external process.
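
As a sketch of what that external process could look like, the hypothetical script below (run against a mongos with the mongo shell, for example from cron) recomputes a trailing 12-month cutoff and applies the same steps as above:

    // roll_cutoff.js (hypothetical) — run with: mongo <mongos-host>/admin roll_cutoff.js
    // Keep the tier cutoff trailing the current date by 12 months.
    var cutoff = new Date();
    cutoff.setUTCFullYear(cutoff.getUTCFullYear() - 1);

    sh.setBalancerState(false);                          // stop the balancer
    sh.splitAt("enron.messages", { "date" : cutoff });   // split at the new cutoff
    var configdb = db.getSiblingDB("config");
    configdb.tags.update({ tag : "tier-2" },  { $set : { "max.date" : cutoff } });
    configdb.tags.update({ tag : "current" }, { $set : { "min.date" : cutoff } });
    sh.setBalancerState(true);                           // re-start the balancer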

From Operational Headache to Simplicity

The end result is one collection spread across three shards and two different storage systems. This solution allows you to lower your storage costs without adding complexity to the architecture of your system. Instead of a complex setup with different databases on different machines we have one database to query, and instead of a data migration we update some simple rules to control the location of data in the system.


Background Indexing on Secondaries and Orphaned Document Cleanup in MongoDB 2.6

Jan 27 • Posted 6 months ago

By Alex Komyagin, Technical Services Engineer in MongoDB’s New York Office

The MongoDB Support Team has broad visibility into the community’s use of MongoDB, issues they encounter, feature requests, bug fixes and the work of the engineering team. This is the first of a series of posts to help explain, from our perspective, what is changing in 2.6 and why.

Many of these changes are available today for testing in the 2.5.4 Development Release, which is available as of November 18, 2013 (2.5.5 release, coming soon, will be feature complete). Development Releases have odd-numbered minor versions (e.g., 2.1, 2.3, 2.5), and Production Releases have even-numbered minor versions (e.g., 2.2, 2.4, 2.6). MongoDB 2.6 will become available a little later this year.

Community testing helps MongoDB improve. You can test the MongoDB 2.5.4 development release today. Downloads are available here, and you can log Server issues in Jira.

Background indexes on secondaries (SERVER-2771)

Suppose you have a production replica set with two secondary servers, and you have a large, 1TB collection. At some point, you may decide that you need a new index to reflect a recent change in your application, so you build one in the background:

// the field name here is illustrative; pass whatever index specification your application needs
db.col.ensureIndex({ newField : 1 }, { background : true })

Let’s also suppose that your application uses secondary reads (users should take special care with this feature, especially in sharded systems; for an example of why, see the next section in this post). After some time you observe that some of your application reads have started to fail, and replication lag on both secondaries has started to grow. While you are searching Google Groups for answers, everything magically goes back to normal by itself. Secondaries have caught up, and application reads on your secondaries are working fine. What happened?

One would expect that building indexes in the background would allow the replica set to continue serving regular operations during the index build. However, in all MongoDB releases before 2.6, background index builds on primaries become foreground index builds on secondaries, as noted in the documentation. Foreground index building is resource intensive and it can also affect replication and read/write operations on the database (see the FAQ on the MongoDB Docs). The good news is that impact can be minimized if your indexed collections are small enough for index builds to be relatively fast (on the order of minutes to complete).

The only way to make sure that indexing operations are not affecting the replica set in earlier versions of MongoDB was to build indexes in a rolling fashion. This works perfectly for most users, but not for everyone. For example, it wouldn’t work well for those who use a write concern “w:all”.

Starting with MongoDB 2.6, a background index build on the primary becomes a background index build on the secondaries. This behavior is much more intuitive and will improve the replica set robustness. We feel this will be a welcome enhancement for many users.

Please note that background index building normally takes longer than foreground building, because it allows other operations on the database to run. Keep in mind that, like most database systems, indexing in MongoDB is resource intensive and will increase the load on your system, whether it is a foreground or background process. Accordingly, it is best to perform these operations during a maintenance window or during off-peak hours.

The actual time needed to build a background index varies with the active load on your system, the number of documents, database size and your hardware specs. Therefore, for production systems with large collections, users can still take advantage of building indexes in a rolling fashion, or of building them in the foreground during maintenance windows, if they believe a background index build will require more time than is acceptable.

Orphaned documents cleanup command (SERVER-8598)

MongoDB provides horizontal scaling through a feature called sharding. If you’re unfamiliar with sharding and how it works, I encourage you to read the nice new introduction to this feature the documentation team added a few months ago. Let me try and summarize some of the key concepts:

  • MongoDB partitions documents across shards.
  • Periodically the system runs a balancing process to ensure documents are uniformly distributed across the shards.
  • Groups of documents, called chunks, are the unit of a balancing job.
  • In certain failure scenarios stale copies of documents may remain on shards, which we call “orphaned documents.”

Under normal circumstances, there will be no orphaned documents in your system. However, in some production systems, “normal circumstances” are a very rare event, and migrations can fail (e.g., due to network connectivity issues), thus leaving orphaned documents behind.

The presence of orphaned documents can produce incorrect results for some queries. While orphaned documents are safe to delete, in versions prior to 2.6 there was no simple way to do so. In MongoDB 2.6 we implemented a new administrative command for sharded clusters: cleanupOrphaned(). This command removes orphaned documents from the shard in a single range of data.

The scenario where users typically encounter issues related to orphaned documents is when issuing secondary reads. In a sharded cluster, primary replicas for each shard are aware of the chunk placements, while secondaries are not. If you query the primary (which is the default read preference), you will not see any issues as the primary will not return orphaned documents even if it has them. But if you are using secondary reads, the presence of orphaned documents can produce unexpected results, because secondaries are not aware of the chunk ownerships and they can’t filter out orphaned documents. This scenario does not affect targeted queries (those having the shard key included), as mongos automatically routes them to correct shards.

To illustrate this discussion with an example, one of our clients told us that after a series of failed migrations he noticed that his queries were returning duplicate documents. He was using scatter-gather queries (meaning that they did not contain the shard key and were broadcast by mongos to all shards) as well as secondary reads. Shards returned all the documents matching the query (including orphaned documents), which in this situation led to duplicate entries in the final result set.

A short term solution was to remove orphaned documents (we used to have a special script for this). But a long term workaround for this particular client was to make their queries targeted, by including the shard key in each query. This way, mongos could efficiently route each query to the correct shard, not hitting the orphaned data. Routed queries are a best practice in any system as they also scale much better than scatter-gather queries.
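
As an illustration (the collection name and shard key here are hypothetical), the difference is simply whether the shard key appears in the query filter:

    // scatter-gather: no shard key in the filter, so mongos broadcasts to every shard
    db.claims.find({ status : "open" })
    // targeted: shard key included, so mongos routes to the shard(s) owning that key
    db.claims.find({ customerId : 12345, status : "open" })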

Unfortunately, there are a few cases where there is no good way to make queries targeted, and you would need to either switch to primary reads or implement a regular process for removing orphaned documents.

The cleanupOrphaned() command is the first step on the path to automated cleanup of orphaned documents. This command should be run on the primary server and will clean up one unowned range on this shard. The idea is to run the command repeatedly, with a delay between calls to tune the cleanup rate.

In some configurations secondary servers might not be able to keep up with the rate of delete operations, resulting in replication lag. In order to control the lag, cleanupOrphaned() waits for the majority of the replica set members after the range removal is complete. Additionally, you can use the secondaryThrottle option, and each individual delete operation will be made with write concern w:2 (waits for one secondary). This may be useful for reducing the impact of removing orphaned documents on your regular operations.
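
A minimal sketch of that loop, run in the mongo shell against the shard’s primary (the namespace and the one-second delay are placeholders):

    // sweep the unowned ranges of one sharded collection on this shard
    var nextKey = {};
    while (nextKey != null) {
        var result = db.adminCommand({
            cleanupOrphaned: "mydb.mycoll",   // namespace is a placeholder
            startingFromKey: nextKey,
            secondaryThrottle: true           // wait for a secondary on each delete
        });
        if (result.ok != 1) {
            print("cleanupOrphaned failed: " + tojson(result));
            break;
        }
        nextKey = result.stoppedAtKey;        // absent once the last range has been cleaned
        sleep(1000);                          // pause between ranges to tune the cleanup rate
    }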

You can find command usage examples and more information about the command in the 2.6 documentation.

I hope you will find these features helpful. We look forward to hearing your feedback. If you would like to test them out, download MongoDB 2.5.4, the most recent Development Release of MongoDB.

MongoDB Sharding Visualizer

Sep 14 • Posted 1 year ago

We’re happy to share with you the initial release of the MongoDB sharding visualizer. The visualizer is a Google Chrome app that provides an intuitive overview of a sharded cluster. This project provides an alternative to the printShardingStatus() utility function available in the MongoDB shell.

Features

The visualizer provides two different perspectives of the cluster’s state.

The collections view is a grid where each rectangle represents a collection. Each rectangle’s area is proportional to that collection’s size relative to the other collections in the cluster. Inside each rectangle a pie chart shows the distribution of that collection’s chunks over all the shards in the cluster.

The shards view is a bar graph where each bar represents a shard and each segment inside the shard represents a collection. The size of each segment is relative to the other collections on that shard.

Additionally, the slider underneath each view allows you to rewind the state of the cluster and view it at a specific point in time.

Installation

To install the plugin, download and unzip the source code from 10gen labs. In Google Chrome, go to Preferences > Extensions, enable Developer Mode, and click “Load unpacked extension…”. When prompted, select the “plugin” directory. Then, open a new tab in Chrome and navigate to the Apps page and launch the visualizer.

Feedback

We very much look forward to hearing feedback and encourage everyone to look at the source code, which is available at https://github.com/10gen-labs/shard-viz.
