Posts tagged:

nosql

Appboy’s co-founder and CIO Jon Hyman discusses how the leading platform for app marketing automation leverages MongoDB and ObjectRocket for real-time data aggregation and scale, and gives a preview of his talk with Kenny Gorman of ObjectRocket at MongoDB World.

Want to see more? MongoDB World will feature over 80 MongoDB experts from around the world. Early bird ticket prices for the event end May 23. Register now to grab your seat.

maxTimeMS() and Query Optimizer Introspection in MongoDB 2.6

Apr 23 • Posted 3 months ago

By Alex Komyagin, Technical Service Engineer in MongoDB’s New York Offices

In this series of posts we cover some of the enhancements in MongoDB 2.6 and how they will be valuable for users. You can find a more comprehensive list of changes in our 2.6 release notes. These are changes I believe you will find helpful, especially for our advanced users.

This post introduces two new features: maxTimeMS and query optimizer introspection. We will also look at a specific support case where these features would have been helpful.

maxTimeMS: SERVER-2212

The socket timeout option (e.g. MongoOptions.socketTimeout in the Java driver) specifies how long to wait for responses from the server. This timeout governs all types of requests (queries, writes, commands, authentication, etc.). The default value is “no timeout” but users tend to set the socket timeout to a relatively small value with the intention of limiting how long the database will service a single request. It is thus surprising to many users that while a socket timeout will close the connection, it does not affect the underlying database operation. The server will continue to process the request and consume resources even after the connection has closed.

Many applications are configured to retry queries when the driver reports a failed connection. If a query whose connection is killed by a socket timeout continues to consume resources on the server, a retry can give rise to a cascading effect: the application may retry the same resource-intensive query multiple times, increasing the load on the system and leading to severe performance degradation. Before 2.6 the only way to cancel these queries was to issue db.killOp, which requires manual intervention.

MongoDB 2.6 introduces a new query parameter, maxTimeMS, which limits how long an operation can run in the database. When the time limit is reached, MongoDB automatically cancels the query. maxTimeMS can be used to efficiently prevent unexpectedly slow queries from overloading MongoDB. Following the same logic, maxTimeMS can also be used to identify slow or unoptimized queries.

There are a few important notes about using maxTimeMS:

  • maxTimeMS can be set on a per-operation basis, so long-running aggregation jobs can have a different limit than simple CRUD operations.
  • Requests can block behind locking operations on the server, and that blocking time is not counted: the timer starts only once the operation actually begins, after it acquires the appropriate lock.
  • Operations are interrupted only at interrupt points, where an operation can be safely aborted, so the total execution time may exceed the specified maxTimeMS value.
  • maxTimeMS can be applied to both CRUD operations and commands, but not all commands are interruptible.
  • If a query opens a cursor, subsequent getmore requests are included in the total time (but network round trips are not counted).
  • maxTimeMS does not override the inactive cursor timeout for idle cursors (the default is 10 minutes).
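
With those caveats in mind, here is a minimal shell sketch of setting a time limit on both a query and a command; the collection name is illustrative:

// Abort the query if it runs in the database for more than 100 ms
db.user_comments.find({ status: "active" }).maxTimeMS(100)

// The same limit can be passed to a command
db.runCommand({ count: "user_comments", query: { status: "active" }, maxTimeMS: 100 })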

All MongoDB drivers support maxTimeMS in their releases for MongoDB 2.6. For example, the Java driver will start supporting it in 2.12.0 and 3.0.0, and the Python driver in version 2.7.

To illustrate how maxTimeMS can come in handy, let’s talk about a recent case that the MongoDB support team was working on. The case involved performance degradation of a MongoDB cluster powering a popular site with millions of users and heavy traffic. The degradation was caused by a huge number of read queries performing full scans of the collection. The collection in question stores user comment data, so it has billions of records. Some of these collection scans were running for 15 minutes or so, and they were slowing down the whole system.

Normally, these queries would use an index, but in this case the query optimizer was choosing an unindexed plan. While the solution to this kind of problem is to hint the appropriate index, maxTimeMS can be used to prevent unintentional runaway queries by controlling the maximum execution time. Queries that exceed the maxTimeMS threshold will error out on the application side (in the Java driver it’s a MongoExecutionTimeoutException). maxTimeMS will help users prevent unexpected performance degradation and gain better visibility into the operation of their systems.
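
As a rough sketch of the eventual fix in that case, hinting the index from the shell looks like this (the collection and index names are illustrative, and assume an index on user_id already exists):

// Force the planner to use the user_id index instead of a collection scan
db.user_comments.find({ user_id: 12345, created: { $gt: ISODate("2014-01-01") } }).hint({ user_id: 1 })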

In the next section we’ll take a look at another feature which would help in diagnosing the case we just discussed.

Query Optimizer Introspection: SERVER-666

To troubleshoot the support case described above and to understand why the query optimizer was not picking the correct index for queries, it would have been very helpful to know whether the query plan had changed and to be able to see the query plan cache.

MongoDB determines the best execution plan for a query and stores this plan in a cache for reuse. The cached plan is refreshed periodically and under certain operational conditions (which are discussed in detail below). Prior to 2.6 this caching behavior was opaque to the client. Version 2.6 provides a new query subsystem and query plan cache. Now users have visibility and control over the plan cache using a set of new methods for query plan introspection.

Each collection contains at most one plan cache. The cache is cleared every time a change is made to the indexes for the collection. To determine the best plan, multiple plans are executed in parallel and the winner is selected based on the number of results retrieved within a fixed number of steps, where each step is basically one “advance” of the query cursor. When a query has successfully passed the planning process, it is added to the cache along with the related index information.

A very interesting feature of the new query execution framework is the plan cache feedback mechanism. MongoDB stores runtime statistics for each cached plan in order to remove plans that are determined to be the cause of performance degradation. In practice, we don’t see these degradations often, but if they happen it is usually a consequence of a change in the composition of the data. For example, with new records being inserted, an indexed field may become less selective, leading to slower index performance. These degradations are extremely hard to manually diagnose, and we expect the feedback mechanism to automatically handle this change if a better alternative index is present.

The following events will result in a cached plan removal:

  • Performance degradation
  • Index add/drop
  • No more space in cache (the total number of plans stored in the collection cache is currently limited to 200; this is subject to change in future releases)
  • Explicit commands to mutate cache
  • After the number of write operations on a collection exceeds the built-in limit, the whole collection plan cache is dropped (data distribution has changed)

MongoDB 2.6 supports a set of commands to view and manipulate the cache: you can list all known query shapes (planCacheListQueryShapes), display the cached plans for a query shape (planCacheListPlans), and remove a query shape from the cache or empty the whole cache (planCacheClear).

Here is an example invocation of the planCacheListQueryShapes command, which lists the shape of the following query:

db.test.find({first_name: "john", last_name: "galt"},
             {_id: 0, first_name: 1, last_name: 1}).sort({age: 1})

> db.runCommand({planCacheListQueryShapes: "test"})
{
    "shapes" : [
        {
            "query" : {
                "first_name" : "alex",
                "last_name" : "komyagin"
            },
            "sort" : {
                "age" : 1
            },
            "projection" : {
                "_id" : 0,
                "first_name" : 1,
                "last_name" : 1
            }
        }
    ],
    "ok" : 1
}

The exact values in the query predicate are insignificant in determining the query shape. For example, the query predicate {first_name:"john", last_name:"galt"} is equivalent to the query predicate {first_name:"alex", last_name:"komyagin"}.
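
To dig further into a shape like this one, you can list its cached plans, evict just that shape, or empty the cache entirely; a minimal sketch against the same test collection:

// Show the cached plans for this query shape
db.runCommand({
    planCacheListPlans: "test",
    query: { first_name: "alex", last_name: "komyagin" },
    sort: { age: 1 },
    projection: { _id: 0, first_name: 1, last_name: 1 }
})

// Remove this particular shape from the cache...
db.runCommand({
    planCacheClear: "test",
    query: { first_name: "alex", last_name: "komyagin" },
    sort: { age: 1 },
    projection: { _id: 0, first_name: 1, last_name: 1 }
})

// ...or empty the whole plan cache for the collection
db.runCommand({ planCacheClear: "test" })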

Additionally, with log level set to 1 or greater MongoDB will log plan cache changes caused by the events listed above. You can set the log level using the following command:

use admin
db.runCommand( { setParameter: 1, logLevel: 1 } )

Please don’t forget to change it back to 0 afterwards, since log level 1 logs all query operations, which can negatively affect the performance of your system.

Together, the new plan cache and the new query optimizer should make related operations more transparent and help users to have the visibility and control necessary for maintaining predictable response times for their queries.

You can see how these new features work for yourself by trying our latest release, MongoDB 2.6, available for download here. I hope you will find these features helpful. We look forward to hearing your feedback; please post your questions in the mongodb-user Google Group.

Like what you see? Get MongoDB updates straight to your inbox

Processing 2 Billion Documents A Day And 30TB A Month With MongoDB

Mar 14 • Posted 4 months ago

This is a guest post by David Mytton. He has been programming Python for over 10 years and founded his server and website monitoring company, Server Density, back in 2009.

Server Density processes over 30TB/month of incoming data points from the servers and web checks we monitor for our customers, ranging from simple Linux system load average to website response times from 18 different countries. All of this data goes into MongoDB in real time and is pulled out when customers need to view graphs, update dashboards and generate reports.

We’ve been using MongoDB in production since mid-2009 and have learned a lot over the years about scaling the database. We run multiple MongoDB clusters but the one storing the historical data does the most throughput and is the one I shall focus on in this article, going through some of the things we’ve done to scale it.

1. Use dedicated hardware, and SSDs

All our MongoDB instances run on dedicated servers across two data centers at Softlayer. We’ve had bad experiences with virtualisation because you have no control over the host, and databases need guaranteed performance from disk i/o. When running on shared storage (e.g., a SAN) this is difficult to achieve unless you can get guaranteed throughput from things like AWS’s Provisioned IOPS on EBS (which are backed by SSDs).

MongoDB doesn’t really have many bottlenecks when it comes to CPU, because CPU-bound operations are rare (usually things like building indexes). What really causes problems is CPU steal, when other guests on the host are competing for the CPU resources.

The way we have combated these problems is to eliminate the possibility of CPU steal and noisy neighbours by moving onto dedicated hardware. And we avoid problems with shared storage by deploying the dbpath onto locally mounted SSDs.

I’ll be speaking in-depth about managing MongoDB deployments in virtualized or dedicated hardware at MongoDB World this June.

2. Use multiple databases to benefit from improved concurrency

Running the dbpath on an SSD is a good first step but you can get better performance by splitting your data across multiple databases, and putting each database on a separate SSD with the journal on another.

Locking in MongoDB is managed at the database level, so moving collections into their own databases helps spread things out; this is most important for scaling writes when you are also trying to read data. If you keep databases on the same disk you’ll start hitting the throughput limitations of the disk itself. This is improved by putting each database on its own SSD using the directoryperdb option. SSDs help by significantly alleviating i/o latency, which is related to the number of IOPS and the latency for each operation, particularly when doing random reads/writes. This is even more visible in Windows environments, where the memory mapped data files are flushed serially and synchronously. Again, SSDs help with this.
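
As a sketch, enabling this looks something like the following (the path is illustrative, and turning the option on for an existing deployment means migrating the existing data files):

# Each database gets its own subdirectory under the dbpath,
# so each one can be mounted on its own SSD
mongod --dbpath /data/mongodb --directoryperdb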

The journal is always within a directory so you can mount this onto its own SSD as a first step. All writes go via the journal and are later flushed to disk so if your write concern is configured to return when the write is successfully written to the journal, making those writes faster by using an SSD will improve query times. Even so, enabling the directoryperdb option gives you the flexibility to optimise for different goals (e.g., put some databases on SSDs and some on other types of disk, or EBS PIOPS volumes, if you want to save cost).

It’s worth noting that filesystem based snapshots where MongoDB is still running are no longer possible if you move the journal to a different disk (and so different filesystem). You would instead need to shut down MongoDB (to prevent further writes) then take the snapshot from all volumes.

3. Use hash-based sharding for uniform distribution

Every item we monitor (e.g., a server) has a unique MongoID and we use this as the shard key for storing the metrics data.

The query index is on the item ID (e.g. the server ID), the metric type (e.g. load average) and the time range; because every query always includes the item ID, it makes a good shard key. That said, it is important to ensure that there aren’t large numbers of documents under a single item ID, because this can lead to jumbo chunks which cannot be migrated. Jumbo chunks arise from failed splits: they’re already over the chunk size but cannot be split any further.

To ensure that the shard chunks are always evenly distributed, we’re using the hashed shard key functionality in MongoDB 2.4. Hashed shard keys are often a good choice for ensuring uniform distribution, but if you end up not using the hashed field in your queries, you could actually hurt performance because then a non-targeted scatter/gather query has to be used.
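
A minimal sketch of setting this up from the shell; the database, collection and field names here are illustrative, not our actual schema:

sh.enableSharding("metrics")

// Shard on a hash of the item's MongoID so chunks distribute evenly
sh.shardCollection("metrics.values", { itemId: "hashed" })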

4. Let MongoDB delete data with TTL indexes

The majority of our users are only interested in the highest resolution data for a short period and more general trends over longer periods, so over time we average the time series data we collect then delete the original values. We actually insert the data twice - once as the actual value and once as part of a sum/count to allow us to calculate the average when we pull the data out later. Depending on the query time range we either read the average or the true values - if the query range is too long then we risk returning too many data points to be plotted. This method also avoids any batch processing so we can provide all the data in real time rather than waiting for a calculation to catch up at some point in the future.
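
A minimal sketch of that double write, assuming hypothetical metrics and metrics_hourly collections:

var itemId = ObjectId()                              // the monitored item's ID
var hourStart = ISODate("2014-03-14T10:00:00Z")      // start of the current hour

// Store the raw value...
db.metrics.insert({ item: itemId, metric: "loadAvg", value: 1.24, created: new Date() })

// ...and fold it into a running sum/count so the average can be computed on read
db.metrics_hourly.update(
    { item: itemId, metric: "loadAvg", hour: hourStart },
    { $inc: { sum: 1.24, count: 1 } },
    { upsert: true }
)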

Removal of the data after a period of time is done by using a TTL index. This is set based on surveying our customers to understand how long they want the high resolution data for. Using the TTL index to delete the data is much more efficient than doing our own batch removes and means we can rely on MongoDB to purge the data at the right time.
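
Creating the TTL index itself is a one-liner; a sketch, assuming the raw values carry a created timestamp and should expire after 30 days:

// MongoDB removes documents once "created" is older than expireAfterSeconds
db.metrics.ensureIndex({ created: 1 }, { expireAfterSeconds: 60 * 60 * 24 * 30 })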

Inserting and deleting a lot of data can have implications for data fragmentation, but using a TTL index helps because it automatically activates PowerOf2Sizes for the collection, making disk usage more efficient. As of MongoDB 2.6, this storage option becomes the default.

5. Take care over query and schema design

The biggest hit on performance I have seen is when documents grow, particularly when you are doing huge numbers of updates. If the document size increases after it has been written then the entire document has to be read and rewritten to another part of the data file with the indexes updated to point to the new location, which takes significantly more time than simply updating the existing document.

As such, it’s important to design your schema and queries to avoid this, and to use the right modifiers to minimise what has to be transmitted over the network and then applied as an update to the document. A good example of what you shouldn’t do when updating documents is to read the document into your application, update it, then write it back to the database. Instead, use the appropriate update operators, such as $set, $unset and $inc, to modify documents directly.
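
A minimal sketch of the difference, using a hypothetical products collection:

// Avoid: read the whole document, change it in the application, write it all back
var doc = db.products.findOne({ _id: 123 })
doc.stock = doc.stock - 1
db.products.update({ _id: 123 }, doc)

// Prefer: send only the change, as an update operator
db.products.update({ _id: 123 }, { $inc: { stock: -1 } })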

This also means paying attention to the BSON data types and pre-allocating documents, things I wrote about in MongoDB schema design pitfalls.

6. Consider network throughput & number of packets

Assuming 100Mbps networking is sufficient is likely to cause you problems, perhaps not during normal operations, but probably when you have some unusual event like needing to resync a secondary replica set member.

When cloning the database, MongoDB is going to use as much network capacity as it can to transfer the data over as quickly as possible before the oplog rolls over. If you’re doing 50-60Mbps of normal network traffic, there isn’t much spare capacity on a 100Mbps connection so that resync is going to be held up by hitting the throughput limits.

Also keep an eye on the number of packets being transmitted over the network - it’s not just the raw throughput that is important. A huge number of packets can overwhelm low quality network equipment - a problem we saw several years ago at our previous hosting provider. This will show up as packet loss and be very difficult to diagnose.

Conclusions

Scaling is an incremental process - there’s rarely one thing that will give you a big win. All of these tweaks and optimisations together help us to perform thousands of write operations per second and get response times within 10ms whilst using a write concern of 1.

Ultimately, all this ensures that our customers can load the graphs they want incredibly quickly. Behind the scenes we know that data is being written quickly, safely and that we can scale it as we continue to grow.

MongoDB Bug Hunt Extended to March 8

Feb 28 • Posted 4 months ago

On February 21, we launched the first ever MongoDB Bug Hunt. We have been impressed with the community’s enthusiasm during the first week and have decided to extend the hunt until March 8. This will allow more members of the community to get involved and help improve MongoDB for users worldwide.

As a reminder, you can download the latest release at www.MongoDB.org/downloads. If you find a bug, submit the issue to Jira (Core Server project) by March 8 at 12:00AM GMT. Bug reports will be judged on three criteria: user impact, severity and prevalence.

We will review all bugs submitted against 2.6.0-rc0. Winners will be announced on the MongoDB blog and user forum by March 13. There will be one first place winner, one second place winner and at least two honorable mentions.

For more info on the Bug Hunt see our announcement on the MongoDB Blog.

Thanks to everyone who has downloaded and tested the server so far. Keep on hunting!

Announcing the MongoDB Bug Hunt 2.6.0-rc0 

Feb 21 • Posted 5 months ago

The MongoDB team released MongoDB 2.6.0-rc0 today and is proud to announce the MongoDB Bug Hunt. The MongoDB Bug Hunt is a new initiative to reward our community members who contribute to improving this MongoDB release. We’ve put the release through rigorous correctness, performance and usability testing. Now it’s your turn. Over the next 10 days, we challenge you to test and uncover any lingering issues in MongoDB 2.6.0-rc0.

How it works

You can download this release at MongoDB.org/downloads. If you find a bug, submit the issue to Jira (Core Server project) by March 4 at 12:00AM GMT. Bug reports will be judged on three criteria: user impact, severity and prevalence.

We will review all bugs submitted against 2.6.0-rc0. Winners will be announced on the MongoDB blog and user forum by March 8. There will be one first place winner, one second place winner and at least two honorable mentions.

The Rewards
First Prize:
  • 1 ticket to MongoDB World — with a reserved front-row seat for keynote sessions
  • $1000 Amazon Gift Card
  • MongoDB Contributor T-shirt
Second Prize:
  • 1 ticket to MongoDB World — with a reserved front-row seat for keynote sessions
  • $500 Amazon Gift Card
  • MongoDB Contributor T-shirt
Honorable Mentions:
  • 1 ticket to MongoDB World — with a reserved front-row seat for keynote sessions
  • $250 Amazon Gift Card
  • MongoDB Contributor T-shirt

How to get started:

  • Deploy in your test environment: Software is best tested in a realistic environment. Help us see how 2.6 fares with your code and data so that others can build and run applications on MongoDB 2.6 successfully.
  • Test new features and improvements: Dozens of new features were added in 2.6. See the 2.6 Release Notes for a full list.
  • Log a ticket: If you find an issue, create a report in Jira. See the documentation for a guide to submitting well written bug reports.

If you are interested in doing this work full time, consider applying to join our engineering teams in New York City, Palo Alto and Austin, Texas.

Happy hunting!

Eliot, Dan and the MongoDB Team

The MongoDB Java Driver 3.0: What’s Changing

Aug 30 • Posted 11 months ago

By Trisha Gee, MongoDB Java Engineer and Evangelist

In the last post, we covered the design goals for the new MongoDB Java Driver. In this one, we’re going to go into a bit more detail on the changes you can expect to see, and how to start playing with an alpha version of the driver. Please note, however, that the driver is still a work in progress, and not ready for production.

New features

Other than the overall changes to design detailed in the previous post, the 3.0 driver has the following new features:

  • Pluggable Codecs: This means you can make simple changes to serialisation/deserialisation, such as telling the driver to use Joda Time instead of java.util.Date, or you can take almost complete control of how to turn your Java objects into BSON. This should be particularly useful for ODMs or other libraries, as they can write their own codecs to convert Java objects to BSON bytes.
  • Predictable cluster management: We’ve done quite a lot of work around discovering the servers in your cluster and determining which ones to talk to. In particular, the driver doesn’t have to wait for all servers to become available before it can start using the ones that are definitely there. The design is event-based, so as soon as a server notifies the driver of its state, the driver can take appropriate action: use it if it’s active, or start ignoring it if it’s no longer available.
  • Additional Connection Pool features: We’ve added support for additional connection pool settings, and a number of other improvements around connection management. Here’s the full list.
  • Deprecated methods/classes will be removed: In the next 2.x release a number of methods and classes will be deprecated. These, along with existing deprecated methods, will be removed in the 3.0 driver. This should point you in the right direction to help you migrate from 2.x to 3.x.

Speaking of Migration…

We’ve worked hard to maintain backwards compatibility whilst moving forwards with the architecture of the Java driver for MongoDB. We want to make migration as painless as possible; in many cases it should be a simple drop-in replacement if you want to keep using the existing API. We hope to provide a step-by-step guide to migrating from 2.x to 3.0 in the very near future. For now, it’s worth mentioning that upgrading will be easiest if you update to 2.12 (to be released soon), migrate any code that uses deprecated features, and then move to the compatible mode of the new driver.

Awesome! Can I try it?

Yes you can! You can try out an alpha of the new driver right now, but as you’d expect there are CAVEATS: this is an alpha, it does not support all current features (notably aggregation); although it has been tested it is still in development and we can’t guarantee everything will work as you expect. Features which have been or will be deprecated in the 2.x driver are missing completely from the 3.0 driver. Please don’t use it in production. However, if you do want to play with it in a development environment, or want to run your existing test suite against it, please do send us any feedback you have.

If you want to use the compatible mode, with the old API (minus deprecations) and new architecture:

Maven

Gradle

You should be able to do a drop-in replacement with this dependency - use this instead of your existing MongoDB driver, run it in your test environment and see how ready you are to use the new driver.

If you want to play with the new, ever-changing, not-at-all-final API, then you can use the new driver with the new API. Because we wanted to be able to support both APIs and not have a big-bang switchover, there’s a subtle difference to the location of the driver with the updated API, see if you can spot it:

Maven

Gradle

Note that if you use the new API version, you don’t have access to the old compatible API.

Of course, the code is in GitHub

In Summary

For 3.0, we will deliver the updated, simplified architecture with the same API as the existing driver, as well as working towards a more fluent style of API. This means that although in future you have the option of using the new API, you should also be able to do a simple drop-in replacement of your driver jar file and have the application work as before.

A release date for the 3.0 driver has not been finalized, but keep your eyes open for it.

All Hail the new Java driver!

The MongoDB Java Driver 3.0

Aug 13 • Posted 11 months ago

By Trisha Gee, MongoDB Java Engineer and Evangelist

You may have heard that the JVM team at 10gen is working on a 3.0 version of the Java driver. We’ve actually been working on it since the end of last year, and it’s probably as surprising to you as it is to me that we still haven’t finished it yet. But this is a bigger project than it might seem, and we’re working hard to get it right.

So why update the driver? What are we trying to achieve?

Well, the requirements are:

  • More maintainable
  • More extensible
  • Better support for ODMs, third party libraries and other JVM languages
  • More idiomatic for Java developers
Read more

November Driver Releases

Dec 10 • Posted 1 year ago

On November 27, all 10gen supported drivers were updated with new error checking and reporting defaults. Each driver now has a MongoClient connection class to handle the error checking. On the same day there was also a server release with fixes for 2.2.

September Blog, Release and 2.2 Roundup

Oct 2 • Posted 1 year ago

Fast datetimes in MongoDB

Oct 1 • Posted 1 year ago

This was originally posted to Mike Friedman’s blog. Mike is a Perl Evangelist at 10gen, working on the Perl Driver for MongoDB.

One of the most common complaints about the Perl MongoDB driver is that it tries to be a little too clever. In the current production release of MongoDB.pm (version 0.46.2 as of this writing), all datetime values retrieved by a query are automatically instantiated as DateTime objects. DateTime is a remarkable CPAN distribution. In fact, I would say that DateTime and its related distributions on CPAN comprise one of the best date and time manipulation libraries in any programming language. But that power comes with a cost. The DateTime codebase is large, and instantiating DateTime objects is expensive. The constructor performs a great deal of validation and creates a large amount of metadata which is stored inside the object. Upcoming changes to the Perl MongoDB driver solve this problem; read more below.

If you need to perform a series of complex arithmetic operations with dates, then the cost of DateTime is justified. But frequently, all you want is a simple read-only value that is sufficient for displaying to a user or saving elsewhere. If you are running queries involving a large number of documents, the automatic instantiation of thousands of complex objects becomes a barrier to performance.

Read more

How MongoDB makes custom e-commerce easy

Sep 17 • Posted 1 year ago

The market for open source e-commerce software has gone through a lot of stages already, as you might know from popular platforms like osCommerce, Magento, Zen Cart, PrestaShop and Spree, just to name a few. These platforms are frequently used as a basis for custom e-commerce apps, and they all require a SQL database. Given the inherent challenge in adapting open source software to custom features, it would seem that MongoDB is poised to play an important role in the next wave of e-commerce innovation.

Kyle Banker was one of the first to blog about MongoDB and e-commerce in April 2010, and there’s been surprisingly little written about it since then. In his blog, Kyle writes about Magento and other SQL based platforms: “What you’ll see is a flurry of tables working together to provide a flexible schema on top of a fundamentally inflexible style of database system.”

To this we must ask, why is a flexible schema so important in e-commerce?

Open source platforms are meant to be adapted to many different designs, conversion flows, and business processes. A flexible schema helps by giving developers a way to relate custom data structures to the platform’s existing model. Without a flexible schema, the developer has to get over high hurdles to make a particular feature possible. When the cost of creating and maintaining a custom feature is too high, the options are: give up the feature, start over with a different platform, or build a platform from scratch. That’s an expensive proposition.

There is a better way

For the past year we’ve been developing Forward, a new open source e-commerce platform combined with MongoDB. It’s been in production use since March 2012, and finally reached a point where we can demonstrate the benefits that MongoDB’s schema-less design brings to custom feature development.

The following examples demonstrate Forward’s REST-like ORM conventions, which are only available in the platform itself, but the underlying concepts map directly to MongoDB’s document structure. In this case, think of get() as db.collection.find() — put() as insert/update() — post() as insert() — and delete() as… delete().

Prototype faster

The majority of e-commerce sites represent small businesses, where moving fast can be the most important aspect of a web platform. When the flexible document structure of MongoDB is carried through the platform’s model interface, adding custom fields becomes easier than ever.

For example, let’s say you need a simple administrative view for adding a couple custom attributes to a product. Here’s a basic example for that purpose, written in Forward’s template syntax:

{args $product_id}

{if $request.post}
    {$product = put("/products/$product_id", [
        spec => $params.spec,
        usage => $params.usage
    ])}
    {flash notice="Saved" refresh=true}
{else}
    {$product = get("/products/$product_id")}
{/if}

<form method="post">
    <div class="field">
        <label>Product specification</label>
        <textarea name="spec">{$product.spec|escape}</textarea>
    </div>
    <div class="field">
        <label>Product usage instructions</label>
        <textarea name="usage">{$product.usage|escape}</textarea>
    </div>
    <button type="submit">Save product</button>
</form>

It might be obvious what this template does, but what might be less obvious is that the platform knows nothing about the “spec” or “usage” fields, and yet they are treated as if the e-commerce data model was designed for them. No database migration necessary, just code.

You may argue this can be accomplished with a fuzzy SQL database structure, and you would be correct, but it’s not pretty, or readable with standard database tools. Ad-hoc queries on custom fields would become difficult.

Query on custom fields

If all we needed were custom key/value storage, you might not benefit that much from a flexible schema. Where MongoDB really shines is in its ability to query on any document field, even embedded documents.

{get $oversized_products from "/products" [
    oversized => true,
    active => true
]}

There are {$oversized_products.count} active oversized products

These fields may or may not be known by the e-commerce API, but in this case MongoDB’s query syntax finds only the documents with matching fields.
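
Under the hood this maps to an ordinary MongoDB query on whatever fields happen to exist; a rough shell equivalent, assuming a products collection:

// Only documents that actually have these fields set to true will match
db.products.find({ oversized: true, active: true }).count()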

No more relational complexity

For those who spent years writing relational SQL queries, this is a big change. How do we create data relationships without joins? There are many different strategies, but Forward defines a field as either a static value or a callback method. This allows a field to return another document or collection based on a query. The result is a data model that can walk through relationships without joins. For example (PHP):

// class Accounts extends AppModel
...
$this->fields = array(
    ...
    // Callback field: resolves to the orders collection for this account
    'orders' => function ($account) {
        return get("/orders", array('account_id' => $account['id']));
    }
);

This relationship can be used in a template like this:

{get $account from "/accounts/$session.account_id"}

You’ve placed

<table>
    {foreach $account.orders as $order}
        <tr>
            <td>#{$order.id}</td>
            <td>${$order.sub_total}</td>
            <td>${$order.grand_total}</td>
            <td>{$order.items|count} item(s)</td>
        </tr>
    {/foreach}
</table>

Relationships can be defined by simple or complex queries. Results are lazy-loaded, making this example possible:

{get $order from "/orders/123"}

{$order.account.name} placed {$order.account.orders.count} orders since {$order.account.orders.first.date_created|date_format}

// Output: John Smith placed 3 orders since Jun 14, 2012

What about transactions?

Many people bring up MongoDB’s lack of atomic transactions across collections as evidence that it’s not suitable for e-commerce applications. This has not been a significant barrier in our experience so far.

There are other ways to approach data integrity. In systems with low to moderate data contention, optimistic locking is sufficient. We’ll share more details about these strategies as things progress.
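
As an illustration of the optimistic locking idea (a generic sketch, not Forward’s actual implementation), a version field can guard against concurrent updates:

// Read the document, remembering its current version
var order = db.orders.findOne({ _id: 123 })

// Apply the change only if nobody else has bumped the version in the meantime
db.orders.update(
    { _id: 123, version: order.version },
    { $set: { status: "paid" }, $inc: { version: 1 } }
)

// If the update matched no document, another writer got there first:
// re-read the order and retry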

In conclusion

The future of e-commerce software looks bright with MongoDB. It’s time to blaze new trails where convoluted schemas, complex relational queries, and hair raising database migrations are a thing of the past. If you’re interested in working with Forward before public release, please consider joining the private beta and help us reinvent open source e-commerce as the world knows it.

A guest post from Eric Ingram, developer/founder @getfwd

Perl Driver 0.46.1 Released

Sep 5 • Posted 1 year ago

This was originally posted to Mike Friedman’s personal blog

I’m happy to announce that after a long delay, version 0.46.1 of the Perl MongoDB driver has now been uploaded to CPAN, and should be available on your friendly local CPAN mirror soon.

This release is mostly a series of minor fixes and housekeeping, in preparation for developing a more detailed roadmap for more frequent releases down the line. Here’s what’s new so far:

Most of the distribution has been successfully transitioned to Dist::Zilla for automated building, tagging, and releasing to CPAN. This has vastly reduced the amount of effort needed to get releases out the door.

The behind-the-scenes algorithm for validating UTF-8 strings has been replaced with a more compliant and much faster version. Thanks to Jan Anderssen for contributing the fix.

Serialization of regexes has been improved and now supports proper stripping of unsupported regex flags across all recent Perl versions. Thanks to Arkadiy Kukarkin for reporting the bug and @ikegami for help with figuring out how to serialize regexes properly via the Perl API.

The driver will now reject document key names with NULL bytes, a possible source of serious bugs. Additionally, much of the distribution metadata has been cleaned up, thanks to the automation provided by Dzil. In particular, the official distribution repository and bug-tracker links now point to our GitHub and JIRA sites. Hopefully more bugs will now come in via those channels instead of RT.

Looking ahead, there is a lot of work yet to be done. I have prioritized the following tasks for version 0.47, which should help us move forward to an eventual 1.0 release.

  • Eliminating the dependency on Module::Install
  • Significantly re-working the documentation to include better organization and more examples.
  • Additionally, much of the current documentation will be refactored via Pod::Weaver.
  • Replacing AUTOLOADed database and collection methods with safer generated symbols upon connection. Beginning with 0.48, these will have a deprecation warning added and will be removed entirely before the 1.0 release in favor of the get_database and get_collection methods. The docs will be updated to reflect this change.

I’m very excited about the future of MongoDB support for Perl, and looking forward to improving the CPAN distribution in concert with the Perl community!

Mike Friedman is the Perl Engineer and Evangelist at 10gen, working on the Perl Driver for MongoDB. You can follow his blog at friedo.com

Motor: Asynchronous Driver for MongoDB and Python

Sep 5 • Posted 1 year ago

Tornado is a popular asynchronous Python web server. Alas, connecting to MongoDB from a Tornado app requires a tradeoff: you can either use PyMongo and give up the advantages of an async web server, or use AsyncMongo, which is non-blocking but lacks key features.

I decided to fill the gap by writing a new async driver called Motor (for “MOngo + TORnado”), and it’s reached the public alpha stage. Please try it out and tell me what you think. I’ll maintain a homepage for it here, including basic documentation.

Status

Motor is alpha. It is certainly buggy. Its implementation and possibly its API will change in the coming months. I hope you’ll help me by reporting bugs, requesting features, and pointing out how it could be better.

Advantages

Two good projects, AsyncMongo and APyMongo, took the straightforward approach to implementing an async MongoDB driver: they forked PyMongo and rewrote it to use callbacks. But this approach creates a maintenance headache: now every improvement to PyMongo must be manually ported over. Motor sidesteps the problem. It uses a Gevent-like technique to wrap PyMongo and run it asynchronously, while presenting a classic callback interface to Tornado applications. This wrapping means Motor reuses all of PyMongo’s code and, aside from GridFS support, Motor is already feature-complete. Motor can easily keep up with PyMongo development in the future.

Installation

Motor depends on greenlet and, of course, Tornado. It is compatible with CPython 2.5, 2.6, 2.7, and 3.2; and PyPy 1.9. You can get the code from my fork of the PyMongo repo, on the motor branch:

pip install tornado greenlet
pip install git+https://github.com/ajdavis/mongo-python-driver.git@motor

To keep up with development, watch my repo and do

pip install -U git+https://github.com/ajdavis/mongo-python-driver.git@motor

when you want to upgrade.

Example

Here’s an example of an application that can create and display short messages:

Other examples are Chirp, a Twitter-like demo app, and Motor-Blog, which runs this site.

Support

For now, email me directly if you have any questions or feedback.

Roadmap

In the next week I’ll implement the PyMongo feature I’m missing, GridFS. Once the public alpha and beta stages have shaken out the bugs and revealed missing features, Motor will be included as a module in the official PyMongo distribution.

A. Jesse Jiryu Davis

August MongoDB Releases and Blogroll

Sep 2 • Posted 1 year ago

This August saw a number of new MongoDB releases, including MongoDB 2.2 and compatible driver releases.

Blog posts on MongoDB 2.2

Noteworthy Blog Posts of the Month

Have a blog post you’d like to be included in our next update? Send us a note.

Designing MongoDB Schemas with Embedded, Non-Embedded and Bucket Structures

Aug 10 • Posted 1 year ago

This was originally posted to the Red Hat OpenShift blog

With the rapid adoption of schema-less, NoSQL data stores like MongoDB, Cassandra and Riak in the last few years, developers now have the ability to enjoy greater agility when it comes to their application’s persistence model. However, just because a datastore is schema-less doesn’t mean the structure of the stored documents won’t play an important role in the overall performance and resilience of the application. In this first of a four-part blog series about MongoDB, we’ll explore a few strategies you should consider when designing your document structure.

Application requirements should drive schema design

If you ask a dozen experienced developers to design the relational database structure of an application, such as a book review site, it’s likely that each of the structures will be very similar. You’ll likely see tables for authors, books, commenters and comments, and so on. The likelihood of having varied relational structures is small because relational database structures are generally well understood. However, if you ask a dozen experienced NoSQL developers to create a similar structure, you’re likely to get a dozen different answers.

Why is there so much variability when it comes to designing a NoSQL schema? To optimize application performance and reliability, a NoSQL schema must be driven by the application’s use case. It’s a novel idea, but it works. Luckily, there are only a few key factors you need to understand when deriving your schema from application requirements. These factors include:

  • How your documents reference child collections
  • The structure and the use of indexes
  • How your data will be sharded

Elements of MongoDB Schemas

Of these factors, how your documents reference child collections, or embedding, is the most important decision you need to make. This point is best demonstrated with an example.

Suppose we’re building the book review site as we mentioned in the introduction. Our application will have authors and books, as well as reviews with threaded comments. How should we structure the collections? Unfortunately, the answers depend on the number of comments we’re expecting per book and how frequently comments are read vs. written. Let’s look at our possible use cases.

The first possibility is where we’re only going to have a few dozen reviews per book, and each review is likely to have a few hundred comments. In this case, embedding the reviews and comments with the book is a viable possibility. Here’s what that might look like:

Listing 1 – Embedded

// Books
{
    "_id": ObjectId("500c680c1fe9193b67b898a3"),
    "publisher": "O'Reilly Media",
    "isbn": "978-1-4493-8156-1",
    "description": "How does MongoDB help you…",
    "title": "MongoDB: The Definitive Guide",
    "formats": ["Print", "Ebook", "Safari Books Online"],
    "authors": [
        { "lastName": "Chodorow", "firstName": "Kristina" },
        { "lastName": "Dirolf", "firstName": "Michael" }
    ],
    "pages": "210"
}

// Reviews
{
    "_id": ObjectId("500c680c1fe9193b67b898a4"),
    "rating": 5,
    "description": "The Authors made an excellent work…",
    "title": "One of O'Reilly excellent books",
    "created": ISODate("2012-07-04T09:48:17Z"),
    "book_id": { "$ref": "books", "$id": ObjectId("500c680c1fe9193b67b898a3") },
    "reviewer": "Giuseppe"
}

// Comments
{
    "_id": ObjectId("500c680c1fe9193b67b898a5"),
    "comment": "This review helped me choose the correct book.",
    "commenter": "Nick",
    "review_id": { "$ref": "reviews", "$id": ObjectId("500c680c1fe9193b67b898a4") },
    "created": ISODate("2012-07-20T13:15:37Z")
}

While simple, this method does have some trade-offs. First, our reviews and comments are strewn throughout the disk. We’re potentially loading thousands of documents to display a page. This leads us to another common embedding strategy – “buckets”.

By bucketing review comments, we can maintain the benefit of fewer reads to display substantial amounts of content, while at the same time maintaining fast writes to smaller documents. An example of a bucketed structure is presented below:

Figure 1 – Hybrid Structure

In this example, the bucket, or hybrid, structure breaks the comments into chunks of roughly 100 comments. Each comment bucket maintains a reference to the parent review, as well as its page and current number of contained comments.
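
To make the hybrid structure concrete, a minimal sketch of what a single comment bucket document might look like (the field names are illustrative):

// One bucket holding up to ~100 comments for a review
{
    "_id": ObjectId("500c680c1fe9193b67b898a6"),
    "review_id": ObjectId("500c680c1fe9193b67b898a4"),
    "page": 1,
    "count": 42,
    "comments": [
        { "commenter": "Nick", "comment": "This review helped me choose the correct book." },
        { "commenter": "Giuseppe", "comment": "Agreed, the examples are excellent." }
    ]
}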

Of course, as software developers, we’re painfully aware there’s no free lunch. The downside to buckets is the increased complexity your application has to deal with. The previous strategies were trivial to implement from an application perspective, but suffered from inefficiencies at scale. Buckets address these inefficiencies, but your application has to do a bit more bookkeeping, such as keeping track of the number of comment buckets for a given review.

Conclusion

My own personal projects with MongoDB have used each one of these strategies at one point or another, but I’ve always grown into more complicated strategies from the most basic, as the application requirements changed. One of the benefits of MongoDB is the ability to change your storage strategy at will and you shouldn’t be afraid to take advantage of this flexibility. By starting simple, you can maintain development velocity early and migrate to a more scalable strategy as the need arises. Stay tuned for additional blogs in this series covering the use of MongoDB indexes, sharding and replica sets.

If you are interested in experimenting with a few of the concepts without having to download and install MongoDB, try it on Red Hat’s OpenShift. It’s FREE to sign up, and all it takes is an email; you’re minutes from having a MongoDB instance running in the cloud.

References

http://www.mongodb.org/display/DOCS/Schema+Design
http://www.10gen.com/presentations/mongosf2011/schemascale
