The MongoDB NoSQL Database Blog

April 2010

6 posts

On Distributed Consistency - Part 4 - Multi Data Center

See also:

  • Part 1 - Introduction and CAP
  • Part 2 - Eventual Consistency
  • Part 3 - Network Partitions
  • Part 5 - Multi Writer Eventual Consistency
  • Part 6 - Consistency Chart

Eventual consistency makes multi-data center data storage easier.  There are reasons eventual consistency is helpful for multi-data center deployments that are unrelated to availability and CAP.  And as mentioned in Part 3, some common types of network partitions, such as loss of an entire data center, are actually trivial network partitions and may not even affect availability anyway.

Here are a few architectures for multi-data center data storage:

  • DR
  • Single Region
  • Local reads, remote writes
  • Intelligent Homing
  • Eventual consistency

DR

By DR we mean a traditional disaster recovery / business continuity architecture.  It’s pretty simple: we serve everything from one data center, with replication to a secondary facility that is offline.  In a failure we cut over.

Availability can be quite high in this model.  On any issue with the first data center, including internal network partitions, we cut over; with the whole first data center disabled, the partition is trivial.

This model works fine with strong consistency.

Multi Data Center, Single Region

This option uses multiple data centers within a single region.  Amazon and DoubleClick have used this scheme in the past.  We have multiple data centers, physically separated, but all within one region (such as the Northwest).  The latency between data centers is then reasonable: if we stay within a 150 mile radius, we can have round trip times of around 5ms.  We might have a fiber ring among, say, 3 or 4 data centers.  As the latency is reasonable, a WAN operation here is fine for many problems.  With a ring topology, a non-trivial network partition is unlikely.

Single region is useful for both strongly consistent and eventually consistent architectures.  With a Dynamo-style product, when N=W or N=R, this is a good option: with data centers spread across distant regions, we would otherwise face a long wait to confirm remote writes.

Local Reads, Remote Writes

For read-heavy use cases, this is a good option.  Here we read eventually consistent data (easy with most database products including RDBMS systems) but do all writes back to the master facility over the WAN.  A dynamo style system in multiple data centers with a very high W value and low R value can be thought of this way also.

This pattern would work great for traditional content management: publishing is infrequent and reading is very frequent.

Using a Content Delivery Network (CDN), with a centralized origin web site serving dynamic content, is another example.

Intelligent Homing

We discussed “Intelligent Homing” a bit in Part 3.  The idea is to store the master copy of a given data entity near its user.

This model works quite well if data correlates with the user, such as the user’s profile, inbox, etc.

We have fast locally confirmed writes.  If a data center goes completely down, we could still fail over master status to somewhere else which has a replica.
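
The homing and failover rule described above can be sketched in a few lines of Python.  This is only an illustration under assumed names: `choose_master_site`, the site labels, and the idea of a per-entity replica list are hypothetical and not part of any product discussed here.

```python
from typing import Iterable, Optional, Set


def choose_master_site(home_site: str,
                       replica_sites: Iterable[str],
                       live_sites: Set[str]) -> Optional[str]:
    """Pick the site that should hold the master copy of one entity.

    home_site     -- the site nearest the entity's user (hypothetical label)
    replica_sites -- other sites holding a replica, in failover order
    live_sites    -- sites currently reachable
    """
    # Normal case: the master lives next to its user, so writes confirm locally.
    if home_site in live_sites:
        return home_site
    # Home site is down: fail master status over to the first live replica.
    for site in replica_sites:
        if site in live_sites:
            return site
    # No live copy anywhere: the entity is unavailable for writes.
    return None


if __name__ == "__main__":
    print(choose_master_site("us-east", ["us-west", "eu"], {"us-west", "eu"}))
```

The replica order doubles as the failover preference; a real system would also need agreement among the surviving sites on who the new master is.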

Eventual consistency

Many-writer eventual consistency gives us two benefits with multiple data centers:

  • higher availability in the face of network outages;
  • fast locally confirmed writes

In the diagram below, a client of a dynamo-style system writes the data to four servers (N=4).  However, it only awaits confirmation of the writes from two servers in its local data center, to keep write confirmation latency low.

Note however that if R+W > N, we can’t have both fast local reads and writes at the same time if all the data centers are equal peers.
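
The arithmetic behind that note can be checked with a small Python sketch.  It assumes two equal peer data centers that each hold half of the N replicas, and it treats an operation as "fast" when its quorum fits entirely inside the client's own data center; the function names are made up for illustration.

```python
def quorums_overlap(n: int, r: int, w: int) -> bool:
    """True when every read quorum must intersect every write quorum."""
    return r + w > n


def served_locally(quorum: int, local_replicas: int) -> bool:
    """True when the quorum can be met using only local replicas."""
    return quorum <= local_replicas


def describe(n: int, r: int, w: int, local_replicas: int) -> dict:
    """Summarize the trade-off for one (N, R, W) choice."""
    return {
        "overlap": quorums_overlap(n, r, w),
        "fast_local_reads": served_locally(r, local_replicas),
        "fast_local_writes": served_locally(w, local_replicas),
    }


if __name__ == "__main__":
    # N=4 split evenly over two data centers, so 2 replicas are local.
    print(describe(n=4, r=2, w=2, local_replicas=2))  # fast both ways, no overlap
    print(describe(n=4, r=3, w=2, local_replicas=2))  # overlap, but reads go remote
```

With at most N/2 replicas local, R and W can both fit locally only when R+W is at most N, which is exactly the case where the quorums are not guaranteed to overlap.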

Combinations

Combinations often make sense.  For example, it’s common to mix DR and Read Local Write Remote.

Apr 12, 2010 (1 note)
#datacenter #nosql
On Distributed Consistency - Part 3 - Network Partitions

See also:

  • Part 1 - Introduction and CAP
  • Part 2 - Eventual Consistency
  • Part 4 - Multi Data Center
  • Part 5 - Multi Writer Eventual Consistency
  • Part 6 - Consistency Chart

It’s fascinating that the formal theorem statement for CAP, in the first proof (that I know of), doesn’t use the word partition!

Theorem 1 It is impossible in the asynchronous network model to implement a read/write data object that guarantees the following properties:
• Availability
• Atomic consistency in all fair executions (including those in which messages are lost).

That said, let’s talk about partitions, as “messages lost…in the asynchronous network model” is directly analogous.

Let’s look at an example:

In our diagram above, the network is partitioned.  The left and right halves (perhaps these correspond say to two continents) cannot communicate at all.  Four clients and four data server nodes are shown in the diagram.  So what are our options?

  1. Deny all writes.  If we deny all writes when the network is partitioned, we can still read fully consistent data on both sides.  So this is one option.  We give up write availability, and keep consistency.
  2. Allow writes on one side.  Via some sort of consensus mechanism, we could let one side of the partition “win” and have a master (as shown by the “M” in the diagram).  In this case, reads and writes could occur on that side.  On the other non-master partitions, we could either (a) be strict and allow no operations, or (b) allow eventually consistent reads, but no writes.  So in this situation we have full consistency in one partition, and partial operation in all others.
  3. Allow reads and writes in all partitions.  Here, we keep availability, but we must sacrifice strong consistency.  One partition will not see the operations and state from the other until the network is restored.  Once restored, we will need a method to merge operations that occurred while disconnected.
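
The three options can be written down as a small decision table.  The Python below is a sketch only: the `Policy` names and the `allowed_operations` helper are invented here to restate the list, not taken from any real system.

```python
from enum import Enum


class Policy(Enum):
    DENY_WRITES = "option 1"
    ONE_SIDE_WINS_STRICT = "option 2(a)"
    ONE_SIDE_WINS_STALE_READS = "option 2(b)"
    ALLOW_ALL = "option 3"


def allowed_operations(policy: Policy, on_master_side: bool) -> set:
    """Operations a client may perform during a partition."""
    if policy is Policy.DENY_WRITES:
        # Consistent reads everywhere, no writes anywhere.
        return {"read"}
    if policy is Policy.ALLOW_ALL:
        # Fully available, but the sides diverge until merged.
        return {"read", "write"}
    if on_master_side:
        # The winning side keeps full, consistent operation.
        return {"read", "write"}
    if policy is Policy.ONE_SIDE_WINS_STALE_READS:
        # Losing side: eventually consistent reads only.
        return {"read"}
    # Losing side under the strict variant: nothing is allowed.
    return set()


if __name__ == "__main__":
    for policy in Policy:
        print(policy.value, allowed_operations(policy, on_master_side=False))
```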

A mitigation technique also comes to mind.  Suppose a particular client C has a much higher probability of needing an entity X than other clients.  If we store the master copy of X on a server close to C, we increase the probability that C can read and write X in option (2) above.  Let’s call this “intelligent homing”.  A real world example of this would be to “store master copies of data for east coast users on servers on the east coast”.  Intelligent homing doesn’t solve our problems, but would likely significantly decrease their frequency — that’s good, we just want more nines anyway.

Hopefully the above is a good informal “proof” of CAP.  It really is pretty simple.

Trivial Network Partitions

Many common network partitions are what we might term trivial.  Let’s consider them from the perspective of option (2) above.  We define a trivial network partition as one such that on all non-master partitions, there are either

  • no live clients at all, or
  • no servers at all

For example, if we have many data centers and our clients are Internet web browsers, and one of our data centers goes completely dark (and we have more left), that is a trivial network partition (we assume here that we can fail over master status in such a situation).  Likewise, losing a single rack in its entirety is often a trivial network partition.

In these situations, we can still be consistent and available.  (Well, for the partitioned client, we are unavailable, but that is of course a certainty if it cannot reach any servers anywhere.)
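
As a rough illustration, the definition of a trivial partition can be expressed as a predicate.  The `Side` record and its fields are assumptions made for this sketch; counting "live clients" and "servers" per side is a simplification of a real topology.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Side:
    """One side of a partitioned network (hypothetical model)."""
    live_clients: int
    servers: int
    has_master: bool = False


def is_trivial_partition(sides: Iterable[Side]) -> bool:
    """True when every non-master side has no live clients or no servers."""
    return all(
        side.live_clients == 0 or side.servers == 0
        for side in sides
        if not side.has_master
    )


if __name__ == "__main__":
    # One data center goes completely dark; browsers still reach the rest.
    dark_dc = [Side(live_clients=1000, servers=8, has_master=True),
               Side(live_clients=0, servers=0)]
    # Two continents cut off from each other, both with clients and servers.
    split = [Side(live_clients=500, servers=4, has_master=True),
             Side(live_clients=500, servers=4)]
    print(is_trivial_partition(dark_dc), is_trivial_partition(split))
```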

Apr 8, 2010
On Distributed Consistency - Part 2 - Some Eventual Consistency Forms

See also:

  • Part 1 - Introduction and CAP
  • Part 3 - Network Partitions
  • Part 4 - Multi Data Center
  • Part 5 - Multi Writer Eventual Consistency
  • Part 6 - Consistency Chart

In Part 1 we discussed C-class and A-class behaviors.  For A-class, we need to weaken consistency constraints.  This does not mean the system need be completely inconsistent, but it does mean we will need to relax the consistency model to some extent.

Amazon popularized the concept of “Eventual Consistency”.  Their definition is: 

the storage system guarantees that if no new updates are made to the object, eventually all accesses will return the last updated value.

This is not new, but it is great to have the concept formalized/popularized.  A few examples of eventually consistent systems:

  1. DNS (mentioned in the above paper)
  2. Asynchronous master/slave replication on an RDBMS (also on MongoDB)
  3. memcached in front of mysql, caching reads

Many (not all) traditional examples that come to mind have eventually consistent reads, but a single writer (by “single writer”, we mean a data server, not the clients).  Things get more interesting, and more complex, when there are many writers.  Amazon Dynamo is an example of a “many writer eventually consistent” system.  All of the above are perhaps “single writer eventually consistent”.

Message queues are one other traditional technology worth noting.  They have properties reminiscent of eventual consistency.

Forms of Consistency

Let’s look at a particular example.  Consider a system using MongoDB in the following configuration:

“master”, “slave”, and “slave” could be mongod instances for example — or other databases with asynchronous replication.  Clients randomly read from any slave for a given query, and always write to the master.  Two slaves and two clients are shown, but let’s assume each of those scale out.

This sort of system we term “single writer eventual consistency”.  So what are its properties?  (1) A client could read stale data. (2) The client could see out-of-order write operations.

Let’s suppose we are storing some entity x in the datastore.  Let’s assume entities have an initial value of zero.  There are a series of writes to x by clients:

  W(x=3), W(x=7), W(x=5)

Because the system is eventually consistent, if writes to x stop at some point, we know we will eventually read 5, that is, R(x==5).  However, in the short term a client might, for example, see:

  R(x==7), R(x==0), R(x==5), R(x==3)

(Note more nodes than 2 slaves are needed for this example behavior.)
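
One way to reproduce that read sequence is to freeze a snapshot in which each slave has applied a different prefix of the write log.  The Python below is a toy model, not MongoDB code: the slave names, the lag values, and the order in which the client happens to hit the slaves are all assumptions chosen to match the example.

```python
WRITES = [3, 7, 5]  # W(x=3), W(x=7), W(x=5), in master order
INITIAL = 0         # entities start at zero


def replica_value(applied: int) -> int:
    """Value of x on a replica that has applied the first `applied` writes."""
    return INITIAL if applied == 0 else WRITES[applied - 1]


# Hypothetical snapshot: how many writes each slave has applied so far.
applied_per_slave = {"s1": 2, "s2": 0, "s3": 3, "s4": 1}

# The client reads from whichever slave it is routed to.
read_order = ["s1", "s2", "s3", "s4"]
observed = [replica_value(applied_per_slave[name]) for name in read_order]

# Once replication catches up everywhere, every slave agrees on the last write.
caught_up = {replica_value(len(WRITES)) for _ in applied_per_slave}

if __name__ == "__main__":
    print(observed)   # [7, 0, 5, 3]
    print(caught_up)  # {5}
```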

So this is our weakest form of consistency - eventually consistent with out of order reads in the short term. 

We can make this stronger.  Consider the SourceForge mongodb configuration (larger diagram here).  This configuration is eventually consistent, but we will not see the result of writes out of order.  It provides monotonic read consistency.

One possible eventual consistency property is read-your-own-writes consistency, meaning a process is guaranteed to see the writes it has made when it does reads.  This is a very useful property that makes programming easier.  Note that neither of the above examples provides read-your-own-writes consistency.  Also worth considering with this model is the definition of “your”.  On a web application, that might be the user.  If the system’s load balancer sends requests to different app servers, having read-your-own-writes consistency for a single app server might not solve the real world consistency need.
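
Monotonic reads and read-your-own-writes can both be pictured as a floor on how far behind a replica is allowed to be for a given session.  The sketch below assumes every write carries an increasing sequence number and that each replica reports the highest number it has applied; `Session` and `pick_replica` are hypothetical helpers, not a MongoDB API.

```python
from typing import Dict, Optional


class Session:
    """Tracks what one 'you' (user, process, ...) has written and seen."""

    def __init__(self) -> None:
        self.last_written = 0  # highest sequence number this session wrote
        self.last_read = 0     # highest sequence number this session observed

    def note_write(self, seq: int) -> None:
        self.last_written = max(self.last_written, seq)

    def note_read(self, seq: int) -> None:
        self.last_read = max(self.last_read, seq)

    def acceptable(self, replica_seq: int) -> bool:
        # Read-your-own-writes: replica must include our latest write.
        # Monotonic reads: replica must not be older than what we already saw.
        return replica_seq >= max(self.last_written, self.last_read)


def pick_replica(session: Session, replicas: Dict[str, int]) -> Optional[str]:
    """Return the first replica fresh enough for this session, else None."""
    for name, applied_seq in replicas.items():
        if session.acceptable(applied_seq):
            return name
    return None  # caller could fall back to reading from the master


if __name__ == "__main__":
    replicas = {"slave_a": 3, "slave_b": 5}
    me = Session()
    me.note_write(5)
    print(pick_replica(me, replicas))  # slave_b
```

Where the session state lives decides who "you" is: kept per app server it protects only that server, kept with the user (for example in a cookie) it follows the user across the load balancer.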

EC Use Case Checklist

Thus when using eventual consistency, it is good for the architect to ask:

  • can my use case tolerate stale reads?
  • can it tolerate reading values out of order?  if not, is my configuration monotonic read consistent?
  • can it tolerate not reading my own writes?  if not, is my configuration read-your-own-write consistent?
Apr 5, 2010 (4 notes)

March 2010

12 posts

On Distributed Consistency -- Part 1

See also:

  • Part 2 - Eventual Consistency
  • Part 3 - Network Partitions
  • Part 4 - Multi Data Center
  • Part 5 - Multi Writer Eventual Consistency
  • Part 6 - Consistency Chart

For distributed databases, consistency models are a topic of huge importance. We’d like to delve a bit deeper on this topic with a series of articles, discussing subjects such as what model is right for a particular use case. Please jump in and help us in the comments.

We’ll start here with a basic introduction to the subject.

CAP

The CAP theorem states that one can only have two of consistency, availability, and tolerance to network partitions at the same time. In distributed systems, network partitioning is inevitable and must be tolerated, so essentially CAP means that we cannot have both consistency and 100% availability.

Informally, I would summarize the CAP theorem as:

  • If the network is broken, your database won’t work.

However, we do get to pick the definition of “won’t work”.  It can either mean down (unavailable) or inconsistent (stale data).

More precisely what do we mean by “consistency”? The academic work in this area is referring to “one copy serializability” or “linearizability”. If a series of operations or transactions are performed, they are applied in a consistent order. One less formal way of thinking about the trade-off is “Could I be reading and manipulating stale/dirty data? Can I always write?”

Embodiments

We have two classes of architectures: a C class (strongly consistent) and an A class (higher availability, looser consistency). Let’s consider some real-world distributed systems and where they are classified.

Amazon Dynamo is a distributed data store which implements consistent hashing and is in the A camp. It provides eventual consistency. One may read old data.

CouchDB is typically used with asynchronous master-master replication and is in the A camp. It provides eventual consistency.

A MongoDB auto-sharding+replication cluster has a master server at a given point in time for each shard. This is in the C camp.  Traditional RDBMS systems are also strongly consistent (as typically used) - a synchronous RDBMS cluster for example.

It’s worth noting that alternate configurations of these products sometimes alter their consistency (and performance) properties. For our discussion here, we’ll assume these products are configured in their common case setup, unless otherwise specified.

Write Availability, not Read Availability, is the Main Question

With most databases today, it’s easy to have any number of asynchronous slave replicas distributed about the world. If networks partition, we would then still have access to local slave data. As the replication is asynchronous, this data is eventually consistent, so this result is not surprising: we are now in the A class of systems. However, almost all designs, even from the C class, can easily add on asynchronous read capabilities. Thus, the critical design decisions are around write availability.

The Trade-offs

  • even load distribution is easier in eventually consistent systems
  • multi-data center support is easier in eventually consistent systems
  • some problems are not solvable with eventually consistent systems
  • code is sometimes simpler to write in strongly consistent systems

We will discuss these pros and cons in more detail in subsequent articles.

—dwight

Mar 26, 2010 (4 notes)
MongoDB 1.4 Ready for Production

The MongoDB team is very excited to announce the release of MongoDB 1.4.0.  This is the culmination of 3 months of work in the 1.3 branch and has a large number of very important changes.

Many users have been running 1.3 in production, so this release is already very thoroughly vetted both by our regression systems and by real users.

Some highlights:

Core server enhancements

  • concurrency improvements
  • indexing memory improvements
  • background index creation
  • better detection of regular expressions so the index can be used in more cases
  • performance numbers

Replication & Sharding

  • better handling for restarting slaves that have been offline for a while
  • fast new slaves from snapshots
  • configurable slave delay
  • replication handles clock skew on master
  • $inc replication fixes
  • sharding alpha 3 - notably 2 phase commit on config servers

Deployment & production

  • configure “slow” for profiling
  • ability to do fsync + lock for backing up raw files
  • option for separate directory per db
  • http://localhost:28017/_status to get serverStatus via http
  • REST interface is off by default for security (--rest to enable)
  • can rotate logs with a db command, logRotate
  • enhancements to serverStatus - counters/replication lag
  • new mongostat tool and db.serverStatus() enhancements

Query language improvements

  • $all with regex
  • $not
  • partial matching of array elements $elemMatch
  • $ operator for updating arrays
  • $addToSet
  • $unset
  • $pull supports object matching
  • $set with array indices

Geo

  • 2d geospatial search
  • geo $center and $box searches

Downloads: www.mongodb.org/display/DOCS/Downloads

Full Change Log: jira

Release Notes: http://www.mongodb.org/display/DOCS/1.4+Release+Notes

Thanks for all your continued support, and we hope MongoDB 1.4 works great for you.

As always, please let us know of any issues,

-Eliot and the MongoDB Team

Mar 25, 2010 (9 notes)
MongoDB 1.4 Performance

We generally avoid posting benchmarks and suggest people create their own targeting their use cases. However, we have decided to publish a few of our internal micro-benchmarks comparing 1.2 with 1.4RC2 (aka 1.3.5) to show that in almost all cases performance is the same or better (sometimes significantly so), even though we’ve added many new features.

The test works by spawning N threads and having them hammer the DB with one operation as fast as they can. It is probably best to ignore the raw numbers and only look at the relative performance. In particular, these numbers shouldn’t be compared against other databases. 
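
For readers who want to build their own test along these lines, here is a minimal Python sketch of the "N threads hammering one operation" shape.  It is not the benchmark code used for these results: `hammer` is a hypothetical helper, the operation is an in-memory stand-in rather than a database call, and a fixed number of operations per thread is assumed so that runs are comparable.

```python
import threading
import time
from typing import Callable, Tuple


def hammer(operation: Callable[[], None],
           n_threads: int,
           ops_per_thread: int) -> Tuple[int, float]:
    """Run `operation` from n_threads at once; return (total ops, ops/sec)."""
    counts = [0] * n_threads
    # Barrier makes all workers start together; +1 is for the timing thread.
    start_line = threading.Barrier(n_threads + 1)

    def worker(index: int) -> None:
        start_line.wait()
        for _ in range(ops_per_thread):
            operation()
            counts[index] += 1  # each worker touches only its own slot

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    start_line.wait()
    started = time.perf_counter()
    for t in threads:
        t.join()
    elapsed = time.perf_counter() - started
    total = sum(counts)
    rate = total / elapsed if elapsed > 0 else float("inf")
    return total, rate


if __name__ == "__main__":
    store = {}
    lock = threading.Lock()

    def fake_insert() -> None:
        # Stand-in for a single database operation.
        with lock:
            store[len(store)] = True

    for n in (1, 2, 4):
        total, rate = hammer(fake_insert, n_threads=n, ops_per_thread=10_000)
        print(n, "threads:", total, "ops,", round(rate), "ops/sec")
```

As the post advises, only the relative numbers between runs of the same harness are meaningful.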

Results

Code for benchmarks

A few highlights (and one lowlight):

  • Single threaded query performance increased slightly
  • Query performance increases linearly or super-linearly as more cores are used
  • Insert performance increases by 10-30% vs 1.2
  • It is usually faster to create an index after importing your data than before. More so in 1.4.
  • Not shown, but performance for both 1.2 and 1.4 held steady from 10 threads up to at least 500 threads. Even where it looks like the lines will cross, they do not.
  • Update and Upsert performance is the same or higher until using more than 4 hammering threads. In practice, this shouldn’t affect users unless they are already doing more updates per second than the server can handle. Even so, we will look into solutions to this in the 1.5.x series.

This test was run on an Intel Core i7 Quad 860 2.8GHz with 8GB of RAM and an Intel G2 SSD. Even though they improved performance, HyperThreading and TurboBoost were disabled as they can skew results in nondeterministic ways.

Mar 25, 2010 (2 notes)
MongoDB Day Austin coming up on Saturday, March 27

If you’re in the Austin area on March 27, you won’t want to miss MongoDB Day at Cospace in Austin. MongoDB Day Austin will be hosted by GeekAustin, a Slashdot style news site, and sponsored by 10gen. 

This conference will have something for anyone interested in using MongoDB, from introductory sessions to more advanced discussions on sharding and MapReduce. Presenters at the conference will include Four Kitchens co-founder David Strauss and 10gen’s Mathias Stearn. 

Tickets are running out fast! Visit http://mongodbday.eventbrite.com/ to learn more and register.

Mar 23, 2010 (2 notes)
Are you going to Structure?

GigaOm’s Structure is one of the most interesting conferences in the Bay Area this year. In 2010, we’re excited that Eliot Horowitz from 10gen / MongoDB will be speaking. GigaOm, who were media sponsors at the recently completed NoSQL Live event in Boston, has provided a special discount code for friends of MongoDB to register for the conference at a $100 savings. Hope to see you there!

Details, including the discount code from our partner GigaOm:

GigaOM’s Structure conference is back for 2010! Get your ticket now!

Your $100 discount on this year’s conference is available now!

http://structure2010.eventbrite.com/?discount=NOSQL100

GigaOM’s flagship conference, Structure, returns on June 23rd and 24th for two days of deep insight on the Cloud Computing industry. Taking place at the Mission Bay Conference Center in San Francisco, Structure 2010 promises to be our best Structure conference yet.

**Save the dates**

June 23rd and 24th

Mission Bay Conference Center, San Francisco 
http://events.gigaom.com/structure/10/

The $1.4 trillion IT market is undergoing a massive shake-up due to Cloud Computing. From the chips that power the compute clouds to the broadband that transports the computation and the software that ties it all together, Cloud Computing is creating a fundamental shift in how we think about and buy computing services. And at each point in the chain, such disruption creates another opportunity. At Structure 2010 you will learn about those opportunities and how to profit from them.

 Our speaker list is growing every day. Confirmed speakers include:

Erich Clementi -  VP, Strategy & General Manager, Enterprise Initiatives, IBM

Marc Benioff - Chairman and CEO, salesforce.com

Werner Vogels - CTO, Amazon.com

Dr. Amr Awadallah - CTO and co-founder, Cloudera

Eliot Horowitz - CTO and co-founder, 10gen / MongoDB

Nick McKeown - Professor, Stanford University

William Forrest - Principal, McKinsey

For the most up-to-date list, see our web site:

http://events.gigaom.com/structure/10/

When you attend Structure 2010 you will learn:

Why Cloud Computing is important — The scenarios in which it reduces cost, improves collaboration, speeds the real-time enterprise and increases enterprise agility.

Why new computing architectures are needed to support Cloud Computing and what they are — Hint: It’s not what’s currently in your data center.

Why Big Data means Big Problems — How do you make sense of exascale data in a timely and cost-effective manner? What new opportunities exist to improve this?

Why we might need to re-invent Internet technologies — The Internet is now asked to transport vast chunks of computation rather than small pieces of text, as it was designed to do.

What impacts “Real Time” has on the cloud — What extra considerations does real-time business infrastructure require?

…and much, much more. Check out the full schedule on our web site. 

So join us on June 23rd and June 24th in San Francisco to be part of the discussion at the Cloud Computing industry’s premier conference: Structure 2010.

Register now and save $100 off the early-bird ticket price! 

http://structure2010.eventbrite.com/?discount=NOSQL100

Structure 2010 also represents a great way to directly address one of the most influential tech audiences anywhere. Call Mike Sly at (415) 235-0358 to find out how your company can exhibit.

Mar 19, 2010 (2 notes)
Announcing MongoSF

Please join us for MongoSF, a full-day conference on Friday, April 30 at Bently Reserve in San Francisco.

MongoSF will include sessions on database features, development with MongoDB in a wide range of dynamic languages, in-depth examples of production deployments, and development workshops. In addition to several of the MongoDB developers from 10gen, confirmed speakers include John Nunemaker and Steve Smith (Ordered List), Emmett Shear (CTO, Justin.tv), Les Hill (Hashrocket), James Williams (BT/Ribbit), David Mytton (Boxed Ice), Jason McCay (MongoHQ), and Ryan Angilly (Punchbowl Software), among others. 

MongoSF is sponsored by 10gen and Hashrocket. 10gen sponsors the open source MongoDB project, and provides commercial support for MongoDB. Hashrocket is an expert web design and development group.

Early bird registration for MongoSF costs $50 and runs through April 9. For additional agenda details or to register for MongoSF, visit the 10gen website. 

Mar 18, 2010 (3 notes)
NoSQL Live Boston Recap

Check out the 10gen blog for a good recap of NoSQL Live Boston.

Mar 16, 2010 (1 note)
Should MongoDB Use SQL as a Query Language?

MongoDB does not use SQL as a query language.  Why not?  This is a very good question and we have discussed it on the project for a long time.  There are a few reasons for this.

Given the document-oriented nature of the storage, if we were to do SQL, it really would be a variant, not true SQL.  There would be no joins, and we would need extensions to handle the nested constructs involved in JSON storage elegantly.  The extensions wouldn’t be that much, but we would need something like the current MongoDB dot notation to reach into objects, perhaps something like this:

SELECT * FROM users WHERE addr.state = 'NY'

Reaching into arrays would need something too:

SELECT * FROM posts WHERE comments[].author = 'fred'

In Mongo’s JSON query syntax the above would be:

{ "comments.author" : "fred" }
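
To make the dot notation concrete, here is a toy matcher in Python that evaluates such a query against a nested document.  It is a deliberately simplified sketch, not how the server evaluates queries: it handles only equality, has no operators or numeric array indices, and the helper names `values_at` and `matches` are invented for this example.

```python
from typing import Any, Dict, List


def values_at(doc: Dict[str, Any], path: str) -> List[Any]:
    """Collect every value reachable at a dotted path, fanning out over arrays."""
    current: List[Any] = [doc]
    for part in path.split("."):
        found: List[Any] = []
        for node in current:
            # An array along the path means "any element of the array".
            candidates = node if isinstance(node, list) else [node]
            for candidate in candidates:
                if isinstance(candidate, dict) and part in candidate:
                    found.append(candidate[part])
        current = found
    return current


def matches(doc: Dict[str, Any], query: Dict[str, Any]) -> bool:
    """True when every path in the query has at least one equal value."""
    for path, expected in query.items():
        values = values_at(doc, path)
        hit = any(
            value == expected or (isinstance(value, list) and expected in value)
            for value in values
        )
        if not hit:
            return False
    return True


if __name__ == "__main__":
    post = {"title": "hello",
            "comments": [{"author": "bob"}, {"author": "fred"}]}
    user = {"name": "ann", "addr": {"state": "NY"}}
    print(matches(post, {"comments.author": "fred"}))  # True
    print(matches(user, {"addr.state": "NY"}))         # True
```

The same path syntax covers both the embedded-object case and the array case, which is why no separate `comments[]` form is needed.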

The term NoSQL is a bit inaccurate - we are really talking about horizontally scalable postrelational stores, not about the query language.  I would consider the Google App Engine Data Store NoSQL, and it uses a SQL-like query language GQL.

The main reason we went the way we did with the query language - representing queries as JSON - was to normalize the data we are storing with the query mechanism.  If we are storing JSON in the database, can we not represent the queries that way too?  We thought that made sense.

I’ve never been a fan of embedding one programming language in another.  Building up strings of SQL dynamically has always seemed a bit strange to me.  I much prefer representing the queries in a data-driven way instead.

Mar 14, 2010 (5 notes)
State of MongoDB March, 2010

Every once in a while, I think it’s important for us (the core MongoDB team) to give a broad picture of where we think MongoDB is and where we’re hoping to take it.  This is useful as a gut check for us, as a way to give the community some insight into what we’re thinking, and as a way to make sure we’re all on the same page.

MongoDB has made great strides in the last year.  The first public release was just over a year ago (2/11/2009) and since then we’ve seen tremendous support and interest from the developer community.  We’ve made a lot of great progress on the core database, drivers and tools.  The community has contributed a large and growing number of great drivers and tools, as well as invaluable testing and feedback.  We’ve also seen a great number of production MongoDB installations coming online in the last year.

MongoDB got 2 stable releases (1.0 and 1.2) and there is a third coming (1.4) which has many things we’re very proud of:  better concurrency, geospatial indexing, “usability” enhancements and speed enhancements to name a few.  We’re planning on a stable release every 3 months as a way to balance speed, carefulness, and practicality.

So, where are we in our grand view of the world: about half done.  MongoDB is in a great place today, but we have a long way to go.  MongoDB was never designed nor intended to be a niche database for a small subset of problems, but a new type of database that solves lots of real world problems for a large subset of the developer community.  We’re getting there, and we’re suitable for a lot of problems, but there are lots of things we still need to do.  So if you’re looking at MongoDB and saying “it’s not good enough for this” or “it would be great if only it had X,” it’s very likely we agree with you.

Some major things we’re thinking about for the next 6-12 months

  • better replication: real time, replica sets, more options for data durability
  • production ready sharding
  • more features for working with embedded documents
  • fleshing out more atomic update operators
  • single server durability
  • full text search

Talking about embedded objects as one example, we added support for this very early on because we think it’s often a better way to model your data in the database.  Being able to store addresses for a user, or tags for a blog post, inside of the main document is great for many reasons, particularly speed and manageability.  This is a very different paradigm than relational, and we’ve had to add a lot of features to make it work nicely: indexed embedded fields, in-place incremental updates, etc…  We still have many features we want to add that people have asked for to make it even nicer to program with embedded documents.

I’ve recently changed the way I describe MongoDB when I first talk to people, in a way that I think sheds some light on how we’re thinking.  MongoDB wasn’t designed in a lab.  We built MongoDB from our own experiences building large scale, high availability, robust systems.  We didn’t start from scratch; we really tried to figure out what was broken, and tackle that.  So the way I think about MongoDB is that if you take MySQL, and change the data model from relational to document based, you get a lot of great features: embedded docs for speed, manageability, agile development with schema-less databases, easier horizontal scalability because joins aren’t as important.  There are lots of things that work great in relational databases: indexes, dynamic queries and updates to name a few, and we haven’t changed much there.  For example, the way you design your indexes in MongoDB should be exactly the way you do it in MySQL or Oracle; you just have the option of indexing an embedded field.

So in conclusion, I hope you find MongoDB useful and productive now, we hope to make great strides in the next year, and are grateful for the community’s support, advice, debugging and interest.

-Eliot and the core MongoDB Team

Mar 8, 2010 (5 notes)
You need to learn MongoDB

You need to learn MongoDB. We’re offering an informative, hands-on training session to help you do just that. From document-based data modeling to high-performance optimizations, we’ll answer your questions and prepare you for the move to Mongo.
Among the topics we’ll cover:

  • How to use the language drivers, and how they work
  • How to make the most of atomic updates
  • Data-modeling with documents
  • Administration with the JavaScript shell
  • Scaling out using master/slave configurations and auto-sharding
  • Backups and recovery

The first sessions will be held in San Francisco and New York City. Discounts are available for startups. If you need to learn MongoDB, a session like this will take you a long way. Sign up now! If you have any questions, send us an email at info@10gen.com.

Mar 5, 2010 (2 notes)
2d geospatial indexing

We have now added geospatial indexing to the product.  Our approach has been to make something simple but fast: 2d only, and effective for common real world use cases such as lat/long location searches.

Would love to get some feedback on features people would like to see, how it’s working, etc…

  • Geospatial docs
  • More notes on 1.3.3 release
Mar 3, 2010 (5 notes)
MongoDB March Events and NYC Office Hours

Upcoming MongoDB Events

MongoDB will be featured at several events, conferences, and meetups in March, including a webinar on MongoDB internals, Mountain West Ruby Conference in Salt Lake City, NoSQL Live Boston, QCon London, and Cloud Connect in Santa Clara. There’s a MongoDB training session in San Francisco and there will even be a MongoDB Day in Austin! Check the Events page for a complete listing.

In addition, 10gen will be hosting MongoDB Office Hours on Wednesdays in New York City, starting on March 17. Stop by to meet the MongoDB team, ask questions, or have a beer.

MongoDB Office Hours

Wednesdays (Starting March 17)
4pm - 6pm
17 West 18th Street - 8th Floor
Between 5th & 6th Avenues
New York, NY

February Highlights

The best of the MongoDB presentations in February:

  • MongoDB Isn’t Water - Kyle Banker at Chicago Ruby on February 2
  • Introduction to MongoDB - Kristina Chodorow at FOSDEM (Brussels, Belgium) on February 7
  • How Python, TurboGears, and MongoDB are Transforming SourceForge.net - Rick Copeland at PyCon (Atlanta, GA) on February 21
  • Intro to MongoDB - Alex Sharp at LA WebDev Meetup on February 23
Mar 1, 2010

February 2010

6 posts

Announcing Speakers for NoSQL Live

It’s not too late to register for NoSQL Live in Boston on March 11th. We have an exciting lineup of speakers and panelists who will discuss real use cases for NoSQL in production systems.

Session topics at NoSQL Live will include scaling with NoSQL, NoSQL in the cloud, schema design with document-oriented databases, the evolution of graph data structure from research to production, the enterprise adoption of NoSQL, and toward web standards for NoSQL. In addition to speakers and panels, the conference will also include lightning talks and a NoSQL Lab for practical exploration of working with specific NoSQL products.

Here’s the confirmed list of speakers, panelists, and moderators:

— Dwight Merriman, CEO, 10gen

— Eliot Horowitz, CTO, 10gen

— Adam Kocoloski, CTO, Cloudant

— Alan Hoffman, CEO, Cloudant

— Durran Jordan, Senior Developer, Hashrocket

— Les Hill, Software Adventurer, Hashrocket

— Marko Rodriguez, Graph Systems Architect, AT&T Interactive

— Ryan King, Technical Lead, Storage Team, Twitter

— Alex Feinberg, Senior Software Engineer, LinkedIn

— Jonathan Ellis, Systems Architect, The Rackspace Cloud

— Sandro Hawke, Software Developer and Systems Architect, W3C

— Benjamin Day, Microsoft MVP

— Ryan Rawson, Systems Architect, StumbleUpon

— Bryan Fink, Senior Software Developer, Basho Technologies

— Rusty Klophaus, Senior Software Engineer, Basho Technologies

— Adam Wiggins, Co-founder, Heroku

— Mark Atwood, Director of Community Development, Gear6

— Sourav Mazumder, Principal Technology Architect, Infosys Technologies Limited

— Tim Anglade, CTO, GemKitty

— Bradford Stephens, Founder, Drawn to Scale

— Doug Judd, CEO, Hypertable

— Daniel Rinehart, Chief Software Architect, Allurent

— Emil Eifrém, CEO, Neo Technology

— Paul Davis, Research Assistant, New England Biolabs

— Borislav Iordanov, Software Architecture Consultant, Kobrix

— Jim Wilson, Lead Software Engineer, Vistaprint

— Ryan Angilly, Senior Developer, Punchbowl Software

Sign ups for the $40 early bird registration end today. If you’re interested in presenting or sponsoring, contact Meghan Gill, event coordinator, at meghan@10gen.com. Check out http://nosqlboston.eventbrite.com for more information and to register.

Feb 23, 2010
MongoDB: How it Works Webinar

In October, 10gen hosted a webinar where we heard from 10gen CEO Dwight Merriman and The Business Insider Lead Developer Ian White about the basics of developing applications with MongoDB and about how MongoDB is used in production at TBI.

We’d like to follow up with a webinar focused on how MongoDB works “under the hood.” Please join us on March 8 at 12:30 PM Eastern Time. 10gen software engineer Mike Dirolf will lead the session. Registration is free but limited to 125 attendees.

Register now.

Feb 22, 2010
MongoDB Survey Results

A couple weeks ago we asked people on Twitter, IRC, and the mailing list to fill out a survey on how they were using MongoDB.  About 120 people responded (thanks guys!).

Here is what we gleaned:

Everyone’s a noob

How long people have been using Mongo:


Most people haven’t been using Mongo for very long.  Exactly 0% said they’d been using Mongo for a year or more (which makes sense, given our first official release was ~12 months ago).

Interesting things being stored in Mongo

Lots of people are storing log data, analytics, user info… the usual.  Some less usual stuff:

  • Game title development info
  • Patents
  • Crime reports and warrants

And quite a few people said: ”Everything.”

So, how big is it?

One person said they were testing up to 40 billion documents, but I wasn’t clear on whether they had actually inserted 40 billion or were planning to.  So, we’ll ignore the outlier, but we can pretty safely say people are storing ~70 million documents.

On a scale of 1-10, would you recommend Mongo to a friend?

Happily, the average was 9.64!  If you are happy with MongoDB, please consider tweeting, writing a blog post, or giving a talk at a conference or meetup… the biggest obstacle we’re facing right now is letting people know we exist!

If you were below average (haha), I’d encourage you to hit the list or IRC.  We’d love to help out (or at least find out why you’re unhappy).

And, finally, most importantly, religious wars:

Kyle has the most users

Ruby wins handily with over 40% of users.

“Other” contains mainly C#, Perl, and Groovy users.

OS X: the universal dev environment

OS people are using for development:

OS people are using for production:

Go Linux go!

If you feel left out, feel free to fill out the survey now.  Thanks to everyone for your input!

Feb 18, 2010
What about Durability?
Note: Journaling is available in MongoDB v1.8+ and is the default in v2.0. This post is pretty old.

Update (3/16/2011): MongoDB v1.8 includes journaling

Update (4/9/2010): real time replication and blocking for replication are implemented in v1.6

We get lots of questions about why MongoDB doesn’t have full single server durability, and many people think this is a major problem.  We wanted to shed some light on why we haven’t done single server durability, what our suggestions are, and our future plans.

To start, there are some very practical reasons why we think single server durability is overvalued.  First, there are many scenarios in which that server loses all its data no matter what.  If there is water damage, a fire, or certain hardware failures, data can be lost no matter how durable the software is.  Yes, there are ways to mitigate some of these, but they add another layer of complexity that has to be tested and proven, and that introduces more variables which can fail.

In the real world, traditional durability often isn’t even done correctly.  If you are using a DBMS that uses a transaction log for durability, you either have to turn off hardware buffering or have a battery-backed RAID controller.  Without hardware buffering, transaction logs are very slow.  Battery-backed RAID controllers work well, but you actually have to have one.  With the move towards the cloud and outsourced hosting, custom hardware is not always an option.

Requirements for web applications are also changing.  99.99% uptime is no longer the goal; people want as close to 100% uptime as possible.  If you have durability through a transaction log, then you have to replay it to come back up.  If you have a master and slave in the same data center and you lose power, both will have to recover, which could take 5-30 minutes.[1]
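To put those figures side by side, here is a small arithmetic sketch. The 99.99% target and the 30-minute recovery time come from the paragraph above; the function name and the choice of a 365-day year are assumptions made for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(uptime_fraction):
    """Minutes of downtime per year allowed by a given uptime target."""
    return MINUTES_PER_YEAR * (1 - uptime_fraction)


# A 99.99% target allows roughly 52.6 minutes of downtime per year.
budget = downtime_budget_minutes(0.9999)

# Fraction of that yearly budget used by a single 30-minute log replay.
share_of_budget = 30 / budget
```

Under these assumptions, a single recovery at the slow end of the 5-30 minute range uses more than half of a year’s downtime allowance at 99.99%.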

Another feature of new non-relational databases is horizontal scalability.  While MongoDB’s auto-sharding is still in Alpha, we still feel this is a core component. With horizontal scalability comes many servers.  If you have a 100 node cluster, worrying about every machine is a liability.  If a machine goes down in the middle of the night, you want the system to recover as fast as possible, without human intervention.  Given that, and that a high percentage of failures are hardware, the best thing is to just mark that server as inactive, and ignore it until someone can look at it easily (could be hours or days).

Given all this, we’re not saying durability isn’t important, we just think that single server durability isn’t the best way to get true durability.  We think the right path to durability is replication (local and remote) and snapshotting.  That’s why we’ve spent so much time making replication fast and easy and work over wide area networks in MongoDB.

We are currently planning many more enhancements to replication to make it better.

  • pseudo real-time replication, with optional blocking of writes until they are on multiple servers
  • replica sets instead of replica pairs
  • easier to create new slaves with large data sets
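The first item in the list, blocking a write until it is on multiple servers, can be sketched in a few lines of Python. This is an in-memory toy under stated assumptions, not MongoDB’s implementation: the `Server` class, the `write` function, and the `w` parameter (the number of servers that must hold the document before the write returns) are all hypothetical names.

```python
class Server:
    """Hypothetical in-memory stand-in for a database server."""

    def __init__(self, name, up=True):
        self.name = name
        self.up = up
        self.docs = []

    def apply(self, doc):
        # A server that is down stores nothing and acknowledges nothing.
        if not self.up:
            return False
        self.docs.append(doc)
        return True


def write(master, slaves, doc, w=1):
    """Write to the master, then block until `w` servers in total hold `doc`."""
    if not master.apply(doc):
        raise RuntimeError("master unavailable")
    acks = 1
    for slave in slaves:
        if acks >= w:
            break  # enough copies; later slaves are not modeled in this sketch
        if slave.apply(doc):
            acks += 1
    if acks < w:
        raise RuntimeError(f"only {acks} of {w} servers acknowledged")
    return acks


master = Server("a")
slaves = [Server("b", up=False), Server("c")]
acks = write(master, slaves, {"_id": 1}, w=2)
```

Here the write succeeds even though one slave is down, because the master and the remaining slave together satisfy `w=2`. Asking for `w=3` in the same setup would raise an error instead of silently accepting fewer copies.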

Now - there are definitely some cases where single server durability is the best option.  It is on our road map; it’s just not on the short list right now.  We know what we want to do and how we want to do it; it’s just a matter of code :)

[1] Some databases such as CouchDB use an append only model that allows for instantaneous restarts. However, this type of design usually requires compaction routines to be run periodically, so it can be costly in high update scenarios.

Feb 10, 2010
Practical MongoDB Training with Kyle Banker

10gen is offering day-long MongoDB training sessions in San Francisco and New York City! Kyle Banker, a software engineer at 10gen, will be leading both sessions. Kyle has presented MongoDB in numerous forums, most recently at Chicago Ruby, and is excited to share his expertise. Kyle is preparing several interesting and challenging projects so that attendees can really get their hands dirty. Whether you are brand new to MongoDB or you’ve played with it already, you will leave this course with a comprehensive understanding of how to build applications with MongoDB.

More details available on the 10gen website.

Feb 8, 2010