Planet MongoDB

Oct 17 • Posted 1 year ago

To catch up on the latest blog posts about MongoDB, check out Planet MongoDB, a newly released aggregator of the best MongoDB blogs. Community members regularly share tips, tricks and best practices for using MongoDB, and Planet MongoDB brings that expertise together into a single feed.

If you have a blog you would like to see added to the aggregator, let us know and we will add you.

Fast Updates with MongoDB (update-in-place)

Nov 18 • Posted 4 years ago

One nice feature with MongoDB is that updates can happen “in place” — the database does not have to allocate and write a full new copy of the object.

This can be very fast for update-heavy use cases.  For example, incrementing a counter is a highly efficient operation: we need not fetch the document from the server; we can simply send an increment operation over:

db.my_collection.update( { _id : ... }, { $inc : { y : 2 } } ); // increment y by 2

MongoDB disk writes are lazy.  If we receive 1,000 increments in one second for the object, it will only be written once.  Physical writes occur a couple of seconds after the operation.

One question is what happens when an object grows.  If the object still fits in its previously allocated space, it is updated in place.  If it does not, it is moved to a new location in the datafile and its index keys must be updated, which is slower.  Because of this, Mongo uses an adaptive algorithm to minimize moves on update: the database computes a padding factor for each collection based on how often its objects grow and move.  The more often objects grow, the larger the padding factor will be; the less often, the smaller.
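
If you are curious, the padding factor the database has computed for a collection is visible in its stats; a quick check from the mongo shell (the paddingFactor field is reported by older, memory-mapped storage releases, and the exact name may vary by version) looks like this:

db.my_collection.stats().paddingFactor; // e.g. 1.0 = no extra room, 1.5 = 50% extra room per document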

See also:

http://www.mongodb.org/display/DOCS/Updating

http://blog.mongodb.org/post/171353301/using-mongodb-for-real-time-analytics

Storing Large Objects and Files in MongoDB

Sep 9 • Posted 4 years ago

Large objects, or “files”, are easily stored in MongoDB.  It is no problem to store 100MB videos in the database.  For example, MusicNation uses MongoDB to store its videos.

This has a number of advantages over files stored in a file system.  Unlike a file system, the database will have no problem dealing with millions of objects.  Additionally, we get the power of the database when dealing with this data: we can do advanced queries to find a file, using indexes; we can also do neat things like replication of the entire file set.

MongoDB stores objects in a binary format called BSON.  BinData is a BSON data type for a binary byte array.  However, MongoDB objects are typically limited to 4MB in size.  To deal with this, files are “chunked” into multiple objects that are less than 4MB each.  This has the added advantage of letting us efficiently retrieve a specific range of the given file.

While we could write our own chunking code, a standard format for this chunking is already defined, called GridFS.  GridFS support is included in many MongoDB drivers and also in the mongofiles command line utility.

A good way to do a quick test of this facility is to try out the mongofiles utility.  See the MongoDB documentation for more information on GridFS.
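
For example, a quick session with mongofiles might look like the following (the filename is just an illustration; by default the tool connects to a local server):

$ ./mongofiles put concert.mp4   # store a local file in GridFS
$ ./mongofiles list              # list the files stored in the database
$ ./mongofiles get concert.mp4   # write the file back out to the local directory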

1.0 GA Released

Aug 27 • Posted 4 years ago

The MongoDB team is very happy to announce that we have released MongoDB version 1.0.0.

MongoDB 1.0.0 is production ready for single master, master/slave and replica pair environments.  While there are many more features that people want and that we are working on, 1.0 is very stable and the code base has been used in production for over 18 months.

As usual, you can get it from here: http://www.mongodb.org/display/DOCS/Downloads

Note: No changes have been made between 0.9.10 and 1.0.0.  There is a v1.0 branch on github for the 1.0.x releases.  See http://www.mongodb.org/display/DOCS/Version+Numbers for more notes about version numbers.

MongoDB is Fantastic for Logging

Aug 26 • Posted 4 years ago

We’re all quite used to having log files on lots of servers, in disparate places.  Wouldn’t it be nice to have centralized logs for a production system?  Logs that can be queried?

I would encourage everyone to consider using MongoDB for log centralization.  It’s a very good fit for this problem for several reasons:

  1. MongoDB inserts can be done asynchronously.  One wouldn’t want a user’s experience to grind to a halt if logging were slow, stalled or down.  MongoDB provides the ability to fire off an insert into a log collection and not wait for a response code.  (If one wants a response, one calls getLastError() — we would skip that here.)
  2. Old log data automatically LRU’s out.  By using capped collections, we preallocate space for logs, and once it is full, the log wraps and reuses the space specified.  No risk of filling up a disk with excessive log information, and no need to write log archival / deletion scripts.
  3. It’s fast enough for the problem.  First, MongoDB is very fast in general, fast enough for problems like this.  Second, when using a capped collection, insertion order is automatically preserved: we don’t need to create an index on timestamp.  This makes things even faster, and is important given that the logging use case has a very high number of writes compared to reads (opposite of most database problems).
  4. Document-oriented / JSON is a great format for log information.  Very flexible and “schemaless” in the sense we can throw in an extra field any time we want.
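
Putting the points above together, a minimal sketch in the mongo shell might look like this (the collection name, size and log fields are illustrative):

// preallocate a 100MB capped collection; the oldest entries are overwritten once it fills
db.createCollection("applog", { capped : true, size : 100 * 1024 * 1024 });

// fire-and-forget insert: we do not wait for a response code
db.applog.insert({ ts : new Date(), level : "info", host : "web1", msg : "user login" });

// only when we do care about the outcome would we ask for it explicitly
db.getLastError();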

The MongoDB profiler works very much in the way outlined above, storing profile timings in a collection that is very log-like.  We have been very happy with that implementation to date.

Using MongoDB for Real-time Analytics

Aug 25 • Posted 4 years ago

Some MongoDB developers use the database as a way to track real-time performance metrics for their websites (page views, uniques, etc.)  Tools like Google Analytics are great but not real-time — sometimes it is useful to build a secondary system that provides basic realtime stats.

Using the Mongo upsert and $inc features, we can efficiently solve the problem.  When an app server renders a page, the app server can send one or more updates to the database to update statistics.

We can do this efficiently for a few reasons.  First, we send a single message to the server for the update.  The message is an “upsert” — if the object exists, we increment the counters; if it does not, the object is created.  Second, we do not wait for a response — we simply send the operation, and immediately return to other work at hand.  As the data is simply page counters, we do not need to wait and see if the operation completes (we wouldn’t report such an error to our web site user anyway).  Third, the special $inc operator lets us efficiently update an existing object without requiring a much more expensive query/modify/update sequence.

The example below demonstrates this using the mongo shell syntax (analogous steps can be done in any programming language for which one has a Mongo driver).
$ ./mongo
> c = db.uniques_by_hour;
> c.find();
> cur_hour = new Date("Mar 05 2009 10:00:00")
> c.ensureIndex( { hour : 1, site : 1 } );
> c.update( { hour : cur_hour, site : "abc" },
            { $inc : { uniques : 1, pageviews : 1 } },
            { upsert : true } )
> c.find();
{ "_id" : "49aff5c62f47a38ee77aa5cf",
  "hour" : "Thu Mar 05 2009 10:00:00 GMT-0500 (EST)",
  "site" : "abc", "uniques" : 1, "pageviews" : 1 }
> c.update( { hour : cur_hour, site : "abc" },
            { $inc : { uniques : 1, pageviews : 1 } },
            { upsert : true } )
> c.find();
{ "_id" : "49aff5c62f47a38ee77aa5cf",
  "hour" : "Thu Mar 05 2009 10:00:00 GMT-0500 (EST)",
  "site" : "abc", "uniques" : 2, "pageviews" : 2 }
> c.update( { hour : cur_hour, site : "abc" },
            { $inc : { uniques : 0, pageviews : 1 } },
            { upsert : true } )
> c.find();
{ "_id" : "49aff5c62f47a38ee77aa5cf",
  "hour" : "Thu Mar 05 2009 10:00:00 GMT-0500 (EST)",
  "site" : "abc", "uniques" : 2, "pageviews" : 3 }

What is the Right Data Model?

Jul 16 • Posted 4 years ago

There is certainly plenty of activity in the nonrelational (“NOSQL”) db space right now.  We know for these projects the data model is not relational.  But what is the data model?  What is the right model?

There are many possibilities, the most popular of which are:

Key/Value. Pure key/value stores are blobs stored by key.

Tabular. Some projects use a Google BigTable-like data model which we call “tabular” here — or one can think of it as “multidimensional tabular”.

Document-Oriented. Typical of these are JSON-style data stores.

We think this is a very important topic.  What is the right data model?  Should there be standardization?

Below are some thoughts on the approaches above.  Of course, as MongoDB committers, we are biased — you know which one we’re going to like.

Key/value has the advantage of being simple.  It is easy to make such systems fast and scalable.  The con is that it is too simple to easily model some real-world problems.  We’d like to see something more general purpose.

The tabular space brings more flexibility.  But why are we sticking to tables?  Shouldn’t we do something closer to the data model of our programming languages?  Tabular jettisons the theoretical underpinnings of relational algebra, yet we still have significant mapping work from program objects to “tables”.  If I were going to work with tables, I’d really like to have full relational power.

We really like the document-oriented approach.  The programming languages we use today, not to mention web services, map very nicely to, say, JSON.  A JSON store gives us an object-like representation, yet one that is not tied too tightly to any single language (being tied to one language would seem wrong for a database).

Would love to hear the thoughts of others.

See also: the BSON blog post

Reaching into Objects

Jul 4 • Posted 4 years ago

MongoDB is a JSON-style store.  Just like JSON, we can nest objects within other objects, and also arrays of data within objects.

This raises an obvious question: how does one query on nested objects, or index keys within nested objects?  This is very important, of course.  The following doc page explains the method.

http://www.mongodb.org/display/DOCS/Dot+Notation
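
As a quick illustration (the collection and field names are hypothetical), dot notation lets us query and index fields nested inside objects and arrays:

// a blog post with a nested author object and an array of comment objects
db.posts.insert({ title : "Hello", author : { name : "Joe" }, comments : [ { by : "Kate", text : "Nice!" } ] });

db.posts.find( { "author.name" : "Joe" } );       // query on a field of a nested object
db.posts.ensureIndex( { "comments.by" : 1 } );    // index a field inside array elements
db.posts.find( { "comments.by" : "Kate" } );      // this query can now use that index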

Databases and Predictability of Performance

Jul 1 • Posted 4 years ago

A subject which perhaps doesn’t get enough attention is whether the performance of a database is predictable. What we are asking is: are there ever any surprises or gotchas in the time it takes for a db operation to execute?  For traditional database management systems, the answer is yes.

For example, statistical query optimizers can be unpredictable: if the statistics for a table change in production, the query plan may change.  This could result in a big change in performance — perhaps better, perhaps worse — but it certainly wasn’t an expected change.  Query plans and performance profiles that were never tested in QA may go into effect.

Another potential issue is locking.  A lock from one transaction may cause another operation that is normally very fast to be slow.

If a system is simple enough, it is predictable.  memcached is very predictable in performance: perhaps that is one reason it is so widely used.  Yet we also need more sophisticated tools, and as they become more advanced, predictability is hard.  A goal of the MongoDB project is to be reasonably predictable in performance.  Note this is a goal: the database is far from perfect in this regard today, but we think it certainly moves things in the right direction.

For example, the MongoDB query optimizer utilizes concurrent query plan evaluation to ensure good worst-case performance on queries, at a slight expense to average query time.  Further, the lockless design eliminates unpredictability from locking.  Other areas of the system could still use improvement: particularly concurrent query execution.  That said, this is certainly considered an important area for the project and will only get better over time.

Why Schemaless?

Jun 8 • Posted 4 years ago

MongoDB is a JSON-style data store.  The documents stored in the database can have varying sets of fields, with different types for each field.  One could have the following objects in a single collection:

{ name : "Joe", x : 3.3, y : [1,2,3] }

{ name : "Kate", x : "abc" }

{ q : 456 }

Of course, when using the database for real problems, the data does have a fairly consistent structure.  Something like the following would be more common:

{ name : "Joe", age : 30, interests : "football" }

{ name : "Kate", age : 25 }

Generally, there is a direct analogy between this “schemaless” style and dynamically typed languages.  Constructs such as those above are easy to represent in PHP, Python and Ruby.  What we are trying to do here is make this mapping to the database natural.

Note the database does have some structure.  The system namespace contains explicit lists of our collections and indexes.  Collections may be implicitly or explicitly created, while indexes are explicitly declared (except for the predefined _id index).
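
For example, one can inspect that metadata directly from the mongo shell (these system collections are how older MongoDB releases expose it; newer versions surface the same information through commands such as listCollections):

db.system.namespaces.find();   // one entry per collection
db.system.indexes.find();      // one entry per declared index, including the _id index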

One of the great benefits of these dynamic objects is that schema migrations become very easy.  With a traditional RDBMS, releases of code might contain data migration scripts.  Further, each release should have a reverse migration script in case a rollback is necessary.  ALTER TABLE operations can be very slow and result in scheduled downtime.

With a schemaless database, 90% of the time adjustments to the database become transparent and automatic.  For example, if we wish to add GPA to the student objects, we add the attribute, resave, and all is well — if we look up an existing student and reference GPA, we just get back null.  Further, if we roll back our code, the new GPA fields in the existing objects are unlikely to cause problems if our code was well written.
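
A minimal sketch of that GPA scenario in the mongo shell (the collection and field names are illustrative):

// a student document written before GPA existed
db.students.insert({ name : "Kate", age : 25 });

// new code simply starts setting the extra attribute when it saves a student
db.students.update( { name : "Joe" }, { $set : { gpa : 3.7 } }, true ); // upsert

// referencing gpa on an older document just yields nothing; no migration script needed
db.students.findOne({ name : "Kate" }).gpa; // undefined in the shell, null in most drivers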

Capped Collections

Jun 1 • Posted 4 years ago

With MongoDB one may create collections of a predefined size, where old data automatically ages out on a least recently inserted basis.  This can be quite handy.  In the mongo JavaScript shell, it is as simple as this:

db.createCollection("mycoll", {capped: true, size:100000})

When capped, a MongoDB collection has a couple of interesting properties.  First, the data automatically ages out when the collection is full on a least recently inserted basis.

Second, for capped collections, MongoDB automatically keeps the objects in the collection in their insertion order.  This is great for logging-types of problems where order should be preserved.  To retrieve items in their insertion order:

db.mycoll.find().sort( {$natural:1} ); // oldest to newest
db.mycoll.find().sort( {$natural:-1} ); // newest to oldest

The implementation of the above two properties in the database is done at a low level and is very fast and efficient.  We could simulate this behavior by using a timestamp column and index, but with a significant speed penalty.

In fact, the capped collection performance is so good that MongoDB uses capped collections as the storage mechanism for its own replication logs. One can inspect these logs with standard MongoDB commands.  For example, if you have a master MongoDB database running, try this from the mongo shell:

use local
db.oplog.$main.find(); // show some replication log data
db.getReplicationInfo(); // distill from it some summary statistics
db.getReplicationInfo; // shows the js code for the getReplicationInfo function

See the documentation for more information.

BSON

May 28 • Posted 4 years ago

MongoDB stores documents (objects) in a format called BSON.  BSON is a binary serialization of JSON-like documents. BSON stands for “Binary JSON”, but also contains extensions that allow representation of data types that are not part of JSON.  For example, BSON has a Date data type and BinData type.
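
For instance, from the mongo shell (the collection, field names and base64 payload are made up for illustration):

db.stuff.insert({
    when    : new Date(),             // stored as a native BSON Date, not a string
    payload : BinData(0, "SGVsbG8=")  // BinData: a subtype plus base64-encoded bytes
});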

The MongoDB client drivers perform the serialization and unserialization.  For a given language, the driver performs translation from the language’s “object” (ordered associative array) data representation to BSON, and back. While the client performs this work, the database understands the internals of the format and can “reach into” BSON objects when appropriate: for example to build index keys, or to match an object against a query expression.  That is to say, MongoDB is not just a blob store.

Thus, BSON is a language independent data interchange format.

The BSON serialization code from any MongoDB driver can be used to serialize and unserialize BSON, even for applications where the Mongo database proper is completely uninvolved.  This usage is encouraged and we would be happy to work with others on making the format as generically useful as possible.

Other Formats

The key advantage over XML and JSON is efficiency (both in space and compute time), as it is a binary format.

BSON can be compared to binary interchange formats, such as Protocol Buffers.  BSON is more “schemaless” than Protocol Buffers — this being both an advantage in flexibility, and a slight disadvantage in space as BSON has a little overhead for fieldnames within the serialized BSON data.

See Also

BSON Specification