The MongoDB NoSQL Database Blog

Month

March 2010

12 posts

On Distributed Consistency -- Part 1

See also:

  • Part 2 - Eventual Consistency
  • Part 3 - Network Partitions
  • Part 4 - Multi Data Center
  • Part 5 - Multi Writer Eventual Consistency
  • Part 6 - Consistency Chart

For distributed databases, consistency models are a topic of huge importance. We’d like to delve a bit deeper on this topic with a series of articles, discussing subjects such as what model is right for a particular use case. Please jump in and help us in the comments.

We’ll start here with a basic introduction to the subject.

CAP

The CAP theorem states that one can only have two of consistency, availability, and tolerance to network partitions at the same time. In distributed systems, network partitioning is inevitable and must be tolerated, so essentially CAP means that we cannot have both consistency and 100% availability.

Informally, I would summarize the CAP theorem as:

  • If the network is broken, your database won’t work.

However, we do get to pick the definition of “won’t work”.  It can either mean down (unavailable) or inconsistent (stale data).

More precisely, what do we mean by “consistency”? The academic work in this area refers to “one-copy serializability” or “linearizability”: if a series of operations or transactions is performed, they are applied in a consistent order. A less formal way of thinking about the trade-off is to ask: “Could I be reading and manipulating stale or dirty data? Can I always write?”
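As a rough illustration of the two meanings of “won’t work”, here is a toy Python model of one primary and one asynchronous replica. It is only a sketch: the names ToyCluster, read_remote, and Unavailable are invented for this example and do not correspond to any real database API.

```python
class Unavailable(Exception):
    """Raised when a node refuses to answer rather than risk stale data."""


class ToyCluster:
    """Hypothetical two-node cluster: one primary, one asynchronous replica."""

    def __init__(self, mode):
        if mode not in ("C", "A"):
            raise ValueError("mode must be 'C' or 'A'")
        self.mode = mode
        self.primary = {}
        self.replica = {}
        self.partitioned = False

    def write(self, key, value):
        # Writes always land on the primary; they reach the replica
        # only while the network between the two nodes is healthy.
        self.primary[key] = value
        if not self.partitioned:
            self.replica[key] = value

    def read_remote(self, key):
        # A read from a client that can reach only the replica.
        if self.partitioned and self.mode == "C":
            # C class: refuse to answer (unavailable).
            raise Unavailable("replica cannot confirm it is up to date")
        # A class: answer with whatever the replica has (possibly stale).
        return self.replica.get(key)


def demo(mode):
    cluster = ToyCluster(mode)
    cluster.write("x", 1)
    cluster.partitioned = True   # the network breaks
    cluster.write("x", 2)        # this write never reaches the replica
    try:
        return cluster.read_remote("x")
    except Unavailable:
        return "unavailable"


print(demo("A"))
print(demo("C"))
```

In the A configuration the remote client still gets an answer, but it is the old value; in the C configuration it gets an error instead.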

Embodiments

We have two classes of architectures: a C class (strongly consistent) and an A class (higher availability, looser consistency). Let’s consider some real-world distributed systems and where they are classified.

Amazon Dynamo is a distributed data store which implements consistent hashing and is in the A camp. It provides eventual consistency. One may read old data.

CouchDB is typically used with asynchronous master-master replication and is in the A camp. It provides eventual consistency.

A MongoDB auto-sharding+replication cluster has a master server at a given point in time for each shard. This is in the C camp.  Traditional RDBMS systems are also strongly consistent (as typically used) - a synchronous RDBMS cluster for example.

It’s worth noting that alternate configurations of these products sometimes alter their consistency (and performance) properties. For our discussion here, we’ll assume these products are configured in their common case setup, unless otherwise specified.

Write Availability, not Read Availability, is the Main Question

With most databases today, it’s easy to have any number of asynchronous slave replicas distributed around the world. If the network partitions, we still have access to local slave data. Because the replication is asynchronous, this data is only eventually consistent, so the result is not surprising: we are now in the A class of systems. However, almost all designs, even those in the C class, can easily add asynchronous read capabilities. Thus, the critical design decisions are around write availability.

The Trade-offs

  • even load distribution is easier in eventually consistent systems
  • multi-data center support is easier in eventually consistent systems
  • some problems are not solvable with eventually consistent systems
  • code is sometimes simpler to write in strongly consistent systems

We will discuss these pros and cons in more detail in subsequent articles.

—dwight

Mar 26, 2010 (4 notes)
MongoDB 1.4 Ready for Production

The MongoDB team is very excited to announce the release of MongoDB 1.4.0.  This is the culmination of 3 months of work in the 1.3 branch and has a large number of very important changes.

Many users have been running 1.3 in production, so this release is already very thoroughly vetted both by our regression systems and by real users.

Some highlights:

Core server enhancements

  • concurrency improvements
  • indexing memory improvements
  • background index creation
  • better detection of regular expressions so the index can be used in more cases
  • performance numbers

Replication & Sharding

  • better handling for restarting slaves that have been offline for a while
  • fast new slaves from snapshots
  • configurable slave delay
  • replication handles clock skew on master
  • $inc replication fixes
  • sharding alpha 3 - notably 2 phase commit on config servers

Deployment & production

  • configure “slow” for profiling
  • ability to do fsync + lock for backing up raw files
  • option for separate directory per db
  • http://localhost:28017/_status to get serverStatus via http
  • REST interface is off by default for security (--rest to enable)
  • can rotate logs with a db command, logRotate
  • enhancements to serverStatus - counters/replication lag
  • new mongostat tool and db.serverStatus() enhancements

Query language improvements

  • $all with regex
  • $not
  • partial matching of array elements $elemMatch
  • $ operator for updating arrays
  • $addToSet
  • $unset
  • $pull supports object matching
  • $set with array indices
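To give a feel for what a few of these operators do, here is a small pure-Python sketch that applies $set (including array indices), $unset, $addToSet, and $pull to an in-memory dict. It only approximates the semantics for simple cases and is not how the server implements them; apply_update and _resolve are names made up for this example, and $pull here matches by plain equality only.

```python
def _resolve(doc, path):
    """Walk a dotted path and return (container, final_key)."""
    parts = path.split(".")
    node = doc
    for part in parts[:-1]:
        node = node[int(part)] if isinstance(node, list) else node[part]
    last = parts[-1]
    return node, (int(last) if isinstance(node, list) else last)


def apply_update(doc, update):
    """Apply a small subset of update operators to a dict, in place."""
    for op, fields in update.items():
        for path, arg in fields.items():
            parent, key = _resolve(doc, path)
            if op == "$set":
                # Works for dict fields and for existing array positions.
                parent[key] = arg
            elif op == "$unset":
                if isinstance(parent, dict):
                    parent.pop(key, None)
            elif op == "$addToSet":
                # Assumes the array field lives directly in a dict.
                values = parent.setdefault(key, [])
                if arg not in values:
                    values.append(arg)
            elif op == "$pull":
                if isinstance(parent, dict) and key in parent:
                    parent[key] = [v for v in parent[key] if v != arg]
            else:
                raise ValueError(f"unsupported operator: {op}")
    return doc


post = {"title": "hello", "tags": ["a", "b"], "draft": True}
apply_update(post, {"$addToSet": {"tags": "b"}})  # already present, no change
apply_update(post, {"$addToSet": {"tags": "c"}})  # appended
apply_update(post, {"$set": {"tags.0": "z"}})     # array index in the path
apply_update(post, {"$pull": {"tags": "b"}})      # remove matching elements
apply_update(post, {"$unset": {"draft": 1}})      # drop the field
print(post)
```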

Geo

  • 2d geospatial search
  • geo $center and $box searches
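Conceptually, $box and $center select points inside a rectangle or a circle on a flat 2d plane. The Python sketch below shows that idea with a plain linear scan; a real geospatial index exists precisely to avoid scanning every document. The function names and the sample places are hypothetical.

```python
import math


def in_box(point, lower_left, upper_right):
    """True if point lies inside the axis-aligned rectangle (edges included)."""
    x, y = point
    return (lower_left[0] <= x <= upper_right[0]
            and lower_left[1] <= y <= upper_right[1])


def in_center(point, center, radius):
    """True if point lies within radius of center, using flat 2d distance."""
    return math.hypot(point[0] - center[0], point[1] - center[1]) <= radius


# Hypothetical documents with a two-element location field.
places = [
    {"name": "a", "loc": [0, 0]},
    {"name": "b", "loc": [3, 4]},
    {"name": "c", "loc": [10, 10]},
]

near_origin = [p["name"] for p in places if in_center(p["loc"], [0, 0], 5)]
in_rect = [p["name"] for p in places if in_box(p["loc"], [-1, -1], [4, 4])]
print(near_origin, in_rect)
```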

Downloads: www.mongodb.org/display/DOCS/Downloads

Full Change Log: jira

Release Notes: http://www.mongodb.org/display/DOCS/1.4+Release+Notes

Thanks for all your continued support, and we hope MongoDB 1.4 works great for you.

As always, please let us know of any issues,

-Eliot and the MongoDB Team

Mar 25, 2010 (9 notes)
MongoDB 1.4 Performance

We generally avoid posting benchmarks and suggest people create their own targeting their use cases. However, we have decided to publish a few of our internal micro-benchmarks comparing 1.2 with 1.4RC2 (aka 1.3.5) to show that in almost all cases performance is the same or better (sometimes significantly so), even though we’ve added many new features.

The test works by spawning N threads and having them hammer the DB with one operation as fast as they can. It is probably best to ignore the raw numbers and only look at the relative performance. In particular, these numbers shouldn’t be compared against other databases. 
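The general shape of such a test can be sketched in a few lines of Python. This is not the benchmark code used for the results here, only an illustration of the "N threads hammering one operation" idea; hammer and its parameters are invented names, and operation stands in for whatever database call you want to measure.

```python
import threading
import time


def hammer(operation, num_threads, duration_s):
    """Run operation() in a tight loop on num_threads threads.

    Returns total operations per second across all threads.
    """
    counts = [0] * num_threads
    stop = threading.Event()

    def worker(index):
        while not stop.is_set():
            operation()
            counts[index] += 1

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(num_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    time.sleep(duration_s)
    stop.set()
    for t in threads:
        t.join()
    elapsed = time.perf_counter() - start
    return sum(counts) / elapsed


# Placeholder operation; in a real test this would be an insert, query, etc.
# Note that CPU-bound Python threads do not run in parallel, so this harness
# is only meaningful when operation() spends its time waiting on a server.
for n in (1, 2, 4):
    rate = hammer(lambda: None, num_threads=n, duration_s=0.05)
    print(n, "threads:", int(rate), "ops/sec")
```

As the post suggests, compare the numbers across thread counts or server versions rather than reading much into the absolute values.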

Results

Code for benchmarks

A few highlights (and one lowlight):

  • Single threaded query performance increased slightly
  • Query performance increases linearly or super-linearly as more cores are used
  • Insert performance increases by 10-30% vs 1.2
  • It is usually faster to create an index after importing your data than before. More so in 1.4.
  • Not shown, but performance for both 1.2 and 1.4 held steady from 10 threads up to at least 500 threads. Even where it looks like the lines will cross, they do not.
  • Update and Upsert performance is the same or higher until using more than 4 hammering threads. In practice, this shouldn’t affect users unless they are already doing more updates per second than the server can handle. Even so, we will look into solutions to this in the 1.5.x series.

This test was run on an Intel Core i7 Quad 860 2.8GHz with 8GB of RAM and an Intel G2 SSD. Even though they improved performance, HyperThreading and TurboBoost were disabled as they can skew results in nondeterministic ways.

Mar 25, 2010 (2 notes)
MongoDB Day Austin coming up on Saturday, March 27

If you’re in the Austin area on March 27, you won’t want to miss MongoDB Day at Cospace in Austin. MongoDB Day Austin will be hosted by GeekAustin, a Slashdot-style news site, and sponsored by 10gen.

This conference will have something for anyone interested in using MongoDB, from introductory sessions to more advanced discussions on sharding and MapReduce. Presenters at the conference will include Four Kitchens co-founder David Strauss and 10gen’s Mathias Stearn. 

Tickets are running out fast! Visit http://mongodbday.eventbrite.com/ to learn more and register.

Mar 23, 2010 (2 notes)
Are you going to Structure?

GigaOm’s Structure is one of the most interesting conferences in the Bay Area this year. We’re excited that Eliot Horowitz from 10gen / MongoDB will be speaking at the 2010 event. GigaOm, a media sponsor of the recently completed NoSQL Live event in Boston, has provided a special discount code for friends of MongoDB to register for the conference at a $100 savings. Hope to see you there!

Details, including the discount code from our partner GigaOm:

GigaOM’s Structure conference is back for 2010! Get your ticket now!

Your $100 discount on this year’s conference is available now!

http://structure2010.eventbrite.com/?discount=NOSQL100

GigaOM’s flagship conference, Structure, returns on June 23rd and 24th for two days of deep insight on the Cloud Computing industry. Taking place at the Mission Bay Conference Center in San Francisco, Structure 2010 promises to be our best Structure conference yet.

Save the dates

June 23rd and 24th

Mission Bay Conference Center, San Francisco 
http://events.gigaom.com/structure/10/

The $1.4 trillion IT market is undergoing a massive shake-up due to Cloud Computing. From the chips that power the compute clouds to the broadband that transports the computation and the software that ties it all together, Cloud Computing is creating a fundamental shift in how we think about and buy computing services. And at each point in the chain, such disruption creates another opportunity. At Structure 2010 you will learn about those opportunities and how to profit from them.

 Our speaker list is growing every day. Confirmed speakers include:

Erich Clementi -  VP, Strategy & General Manager, Enterprise Initiatives, IBM

Marc Benioff - Chairman and CEO, salesforce.com

Werner Vogels - CTO, Amazon.com

Dr. Amr Awadallah - CTO and co-founder, Cloudera

Eliot Horowitz – CTO and co-founder, 10gen / MongoDB

Nick McKeown - Professor, Stanford University

William Forrest - Principal, McKinsey

For the most up-to-date list, see our web site:

http://events.gigaom.com/structure/10/

When you attend Structure 2010 you will learn:

Why Cloud Computing is important — The scenarios in which it reduces cost, improves collaboration, speeds the real-time enterprise and increases enterprise agility.

Why new computing architectures are needed to support Cloud Computing and what they are — Hint: It’s not what’s currently in your data center.

Why Big Data means Big Problems — How do you make sense of exascale data in a timely and cost-effective manner? What new opportunities exist to improve this?

Why we might need to re-invent Internet technologies — The Internet is now asked to transport vast chunks of computation rather than small pieces of text, as it was designed to do.

What impacts “Real Time” has on the cloud — What extra considerations does real-time business infrastructure require?

…and much, much more. Check out the full schedule on our web site. 

So join us on June 23rd and June 24th in San Francisco to be part of the discussion at the Cloud Computing industry’s premier conference: Structure 2010.

Register now and save $100 off the early-bird ticket price! 

http://structure2010.eventbrite.com/?discount=NOSQL100

Structure 2010 also represents a great way to directly address one of the most influential tech audiences anywhere. Call Mike Sly at (415) 235-0358 to find out how your company can exhibit.

Mar 19, 2010 (2 notes)
Announcing MongoSF

Please join us for MongoSF, a full-day conference on Friday, April 30 at Bently Reserve in San Francisco.

MongoSF will include sessions on database features, development with MongoDB in a wide range of dynamic languages, in-depth examples of production deployments, and development workshops. In addition to several of the MongoDB developers from 10gen, confirmed speakers include John Nunemaker and Steve Smith (Ordered List), Emmett Shear (CTO, Justin.tv), Les Hill (Hashrocket), James Williams (BT/Ribbit), David Mytton (Boxed Ice), Jason McCay (MongoHQ), and Ryan Angilly (Punchbowl Software), among others. 

MongoSF is sponsored by 10gen and Hashrocket. 10gen sponsors the open source MongoDB project, and provides commercial support for MongoDB. Hashrocket is an expert web design and development group.

Early bird registration for MongoSF costs $50 and runs through April 9. For additional agenda details or to register for MongoSF, visit the 10gen website. 

Mar 18, 2010 (3 notes)
NoSQL Live Boston Recap

Check out the 10gen blog for a good recap of NoSQL Live Boston.

Mar 16, 2010 (1 note)
Should MongoDB Use SQL as a Query Language?

MongoDB does not use SQL as a query language.  Why not?  This is a very good question and we have discussed it on the project for a long time.  There are a few reasons for this.

Given the document-oriented nature of the storage, if we were to do SQL, it really would be a variant, not true SQL.  There would be no joins, and we would need extensions to handle the nested constructs involved in JSON storage elegantly.  The extensions wouldn’t amount to much, but we would need something like the current MongoDB dot notation to reach into objects, perhaps something like this:

SELECT * FROM users WHERE addr.state = 'NY'

Reaching into arrays would need something too:

SELECT * FROM posts WHERE comments[].author = 'fred'

In Mongo’s JSON query syntax the above would be:

{ "comments.author" : "fred" }
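To make the dot notation concrete, here is a small Python sketch of how a dotted path can reach into nested objects and fan out across arrays. It handles only simple equality matches and is not the server’s matcher; values_at and matches are names made up for this example.

```python
def values_at(doc, path):
    """Return every value reachable via a dotted path, fanning out over arrays."""
    nodes = [doc]
    for part in path.split("."):
        next_nodes = []
        for node in nodes:
            if isinstance(node, dict) and part in node:
                next_nodes.append(node[part])
            elif isinstance(node, list):
                # Reach into each embedded object of the array.
                for item in node:
                    if isinstance(item, dict) and part in item:
                        next_nodes.append(item[part])
        nodes = next_nodes
    return nodes


def matches(doc, query):
    """True if every path in the query has at least one value equal to the target."""
    for path, wanted in query.items():
        candidates = []
        for value in values_at(doc, path):
            # An array-valued field matches if any of its elements matches.
            candidates.extend(value if isinstance(value, list) else [value])
        if wanted not in candidates:
            return False
    return True


post = {"title": "t", "comments": [{"author": "fred"}, {"author": "bob"}]}
user = {"name": "x", "addr": {"state": "NY"}}

print(matches(post, {"comments.author": "fred"}))
print(matches(user, {"addr.state": "NY"}))
```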

The term NoSQL is a bit inaccurate - we are really talking about horizontally scalable postrelational stores, not about the query language.  I would consider the Google App Engine Data Store NoSQL, and it uses a SQL-like query language GQL.

The main reason we went the way we did with the query language - representing queries as JSON - was to normalize the data we are storing with the query mechanism.  If we are storing JSON in the database, can we not represent the queries that way too?  We thought that made sense.

I’ve never been a fan of embedding one programming language in another.  Building up strings of SQL dynamically has always seemed a bit strange to me.  I much prefer representing the queries in a data-driven way instead.
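One practical consequence is that a query can be assembled with ordinary data-structure operations rather than string concatenation. The Python sketch below builds a query document from optional filters; build_user_query and the age field are hypothetical, while $gte is MongoDB’s greater-than-or-equal operator.

```python
def build_user_query(state=None, min_age=None):
    """Compose a query document from whichever filters were supplied."""
    query = {}
    if state is not None:
        query["addr.state"] = state
    if min_age is not None:
        query["age"] = {"$gte": min_age}
    return query


print(build_user_query(state="NY"))
print(build_user_query(state="NY", min_age=21))
```

Because the query is just a dict, there is no quoting or escaping step to get wrong when a filter value comes from user input.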

Mar 14, 2010 (5 notes)
State of MongoDB March, 2010

Every once in a while, I think it’s important for us (the core MongoDB team) to give a broad picture of where we think MongoDB is and where we’re hoping to take it.  This is useful as a gut check for us, as a way to give the community some insight into what we’re thinking, and as a way to make sure we’re all on the same page.

MongoDB has made great strides in the last year.  The first public release was just over a year ago (2/11/2009) and since then we’ve seen tremendous support and interest from the developer community.  We’ve made a lot of great progress on the core database, drivers and tools.  The community has contributed a large and growing number of great drivers and tools, as well as invaluable testing and feedback.  We’ve also seen a great number of production MongoDB installations come online in the last year.

MongoDB has had two stable releases (1.0 and 1.2), and a third (1.4) is coming with many things we’re very proud of:  better concurrency, geospatial indexing, “usability” enhancements and speed enhancements, to name a few.  We’re planning on a stable release every 3 months as a way to balance speed, carefulness, and practicality.

So, where are we in our grand view of the world?  About half done.  MongoDB is in a great place today, but we have a long way to go.  MongoDB was never designed nor intended to be a niche database for a small subset of problems, but a new type of database that solves lots of real-world problems for a large subset of the developer community.  We’re getting there, and we’re suitable for a lot of problems, but there are lots of things we still need to do.  So if you’re looking at MongoDB and saying “it’s not good enough for this” or “it would be great if only it had X,” it’s very likely we agree with you.

Some major things we’re thinking about for the next 6-12 months

  • better replication: real time, replica sets, more options for data durability
  • production ready sharding
  • more features for working with embedded documents
  • fleshing out more atomic update operators
  • single server durability
  • full text search

Talking about embedded objects as one example, we added support for this very early on because we think it’s often a better way to model your data in the database.  Being able to store addresses for a user, or tags for a blog post, inside the main document is great for many reasons, particularly speed and manageability.  This is a very different paradigm from relational, and we’ve had to add a lot of features to make it work nicely: indexed embedded fields, in-place incremental updates, etc.  We still have many features we want to add that people have asked for to make it even nicer to program with embedded documents.

I’ve recently changed the way I describe MongoDB when I first talk to people, in a way that I think sheds some light on how we’re thinking.  MongoDB wasn’t designed in a lab.  We built MongoDB from our own experiences building large-scale, high-availability, robust systems.  We didn’t start from scratch; we really tried to figure out what was broken and tackle that.  So the way I think about MongoDB is that if you take MySQL and change the data model from relational to document-based, you get a lot of great features: embedded docs for speed, manageability, agile development with schema-less databases, and easier horizontal scalability because joins aren’t as important.  There are lots of things that work great in relational databases, such as indexes, dynamic queries and updates, and we haven’t changed much there.  For example, the way you design your indexes in MongoDB should be exactly the way you do it in MySQL or Oracle; you just have the option of indexing an embedded field.

So in conclusion, I hope you find MongoDB useful and productive now, we hope to make great strides in the next year, and are grateful for the community’s support, advice, debugging and interest.

-Eliot and the core MongoDB Team

Mar 8, 2010 (5 notes)
You need to learn MongoDB

You need to learn MongoDB. We’re offering an informative, hands-on training session to help you do just that. From document-based data modeling to high-performance optimizations, we’ll answer your questions and prepare you for the move to Mongo.
Among the topics we’ll cover:

  • How to use the language drivers, and how they work
  • How to make the most of atomic updates
  • Data-modeling with documents
  • Administration with the JavaScript shell
  • Scaling out using master/slave configurations and auto-sharding
  • Backups and recovery

The first sessions will be held in San Francisco and New York City. Discounts are available for startups. If you need to learn MongoDB, a session like this will take you a long way. Sign up now! If you have any questions, send us an email at info@10gen.com.

Mar 5, 2010 (2 notes)
2d geospatial indexing

We have now added geospatial indexing to the product.  Our approach has been to make something simple but fast: 2d only, and effective for common real world use cases such as lat/long location searches.

We would love to get some feedback on features people would like to see, how it’s working, etc.

  • Geospatial docs
  • More notes on 1.3.3 release
Mar 3, 2010 (5 notes)
MongoDB March Events and NYC Office Hours

Upcoming MongoDB Events

MongoDB will be featured at several events, conferences, and meetups in March, including a webinar on MongoDB internals, Mountain West Ruby Conference in Salt Lake City, NoSQL Live Boston, QCon London, and Cloud Connect in Santa Clara. There’s a MongoDB training session in San Francisco and there will even be a MongoDB Day in Austin! Check the Events page for a complete listing.

In addition, 10gen will be hosting MongoDB Office Hours on Wednesdays in New York City, starting on March 17. Stop by to meet the MongoDB team, ask questions, or have a beer.

MongoDB Office Hours

Wednesdays (Starting March 17)
4pm - 6pm
17 West 18th Street - 8th Floor
Between 5th & 6th Avenues
New York, NY

February Highlights

The best of the MongoDB presentations in February:

  • MongoDB Isn’t Water - Kyle Banker at Chicago Ruby on February 2
  • Introduction to MongoDB - Kristina Chodorow at FOSDEM (Brussels, Belgium) on February 7
  • How Python, TurboGears, and MongoDB are Transforming SourceForge.net - Rick Copeland at PyCon (Atlanta, GA) on February 21
  • Intro to MongoDB - Alex Sharp at LA WebDev Meetup on February 23
Mar 1, 2010