Update: You can view a video of Jeremy Zawodny’s talk at MongoSF on 10gen.com.
MongoDB is now live at Craigslist, where it is being used to archive billions of records.
Craigslist has kept every post anyone has ever made in a large MySQL cluster. A few months ago, they began looking for alternatives: schema changes were taking forever (Craigslist’s schema has changed a couple of times since 1995) and the data wasn’t really relational. They wanted to be able to add new machines without downtime (which sharding provides) and route around dead machines without clients failing (which replica sets provide), so MongoDB was a very strong candidate. After looking into a few of the most popular non-relational database systems, they decided to go with MongoDB.
Jeremy Zawodny is a software engineer at Craigslist and an author of High Performance MySQL (O’Reilly). He kindly agreed to answer some questions about their MongoDB cluster (editor’s comments in italics).
Any numbers you can give us?
We’re sizing the install for around 5 billion documents. That’s from the initial 2 billion document import we need to do plus room to grow for a few years to come. Average document size is right around 2KB. (Five billion 2KB documents is 10TB of data.) We’re getting our feet wet with MongoDB so this particular task isn’t high throughput or growing in unpredictable ways.
We can put data into MongoDB faster than we can get it out of MySQL during the migration.
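The sizing figures above can be checked with a few lines of arithmetic. This is only a rough sketch in Python: the function name is made up for illustration, and it counts raw document bytes only, ignoring indexes, replication copies, and storage overhead.

```python
# Rough capacity estimate from the figures quoted in the interview.
# Assumption: "2KB" is treated as 2,000 bytes (decimal units) by default.

TARGET_DOCUMENTS = 5_000_000_000   # planned capacity
INITIAL_DOCUMENTS = 2_000_000_000  # initial import
AVG_DOC_BYTES = 2_000              # about 2KB per document


def total_terabytes(documents, avg_doc_bytes, bytes_per_tb=1000**4):
    """Return the raw data size in terabytes for a given document count."""
    return documents * avg_doc_bytes / bytes_per_tb


print(total_terabytes(TARGET_DOCUMENTS, AVG_DOC_BYTES))   # 10.0
print(total_terabytes(INITIAL_DOCUMENTS, AVG_DOC_BYTES))  # 4.0

# The same target in binary units (2,048-byte documents, TiB),
# which comes out a little lower, at roughly 9.3.
print(round(total_terabytes(TARGET_DOCUMENTS, 2048, 1024**4), 2))
```

On these numbers, the initial 2 billion document import accounts for about 4TB of the 10TB target, with the remainder left as room to grow.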
What does your cluster topology look like?
We have several three-machine replica sets, each set serving a shard of our “archive” database cluster. The configuration is three replica sets in each colo (two colos total) to handle our initial build out. Obviously there will be a set of config servers and routing processes in the mix as well.
Craigslist is using the MongoDB Perl driver.
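To make the layout described above easier to picture, here is a small sketch that tallies it. It is written in Python purely for illustration (not the Perl driver mentioned above), all colo, replica set, and machine names are hypothetical, and it assumes that “two total” refers to the number of colos. Config servers and routing processes are left out because the interview gives no counts for them.

```python
# Tally of the described topology: two colos, three replica sets per colo,
# three machines per replica set. All names below are invented placeholders.

COLOS = ["colo-a", "colo-b"]
REPLICA_SETS_PER_COLO = 3
MEMBERS_PER_SET = 3


def build_topology(colos, sets_per_colo, members_per_set):
    """Map each replica set (one per shard) to its list of member machines."""
    topology = {}
    for colo in colos:
        for set_index in range(1, sets_per_colo + 1):
            set_name = f"{colo}-rs{set_index}"
            topology[set_name] = [
                f"{set_name}-m{member_index}"
                for member_index in range(1, members_per_set + 1)
            ]
    return topology


topology = build_topology(COLOS, REPLICA_SETS_PER_COLO, MEMBERS_PER_SET)
shard_count = len(topology)
machine_count = sum(len(members) for members in topology.values())

print(shard_count)    # 6
print(machine_count)  # 18
```

Under that reading, the initial build out comes to six replica sets on 18 machines, before adding config servers and routers.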
Did you find any stumbling blocks relational database developers should watch out for?
Oh, yeah. :-) I’m planning to write up a blog post on that and I talked about it a bit at MongoSV (watch the video), too. The short version is that you have to think differently about indexing and do a bit more bookkeeping of your own. But on the plus side, you don’t have to pay the join penalty, so you can get your data back a lot faster.
Character set issues come up as well, since we’re a Latin-1 or Windows-1252 shop currently (but really need to go UTF-8 across the board). That means some upfront work, but it’s good that MongoDB is UTF-8 end-to-end already.
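The kind of upfront work this implies can be sketched as a small normalization step run on legacy text before it is stored. The following is a minimal Python sketch, not Craigslist’s actual migration code: the function names are hypothetical, and the fallback order (UTF-8, then Windows-1252, then Latin-1) is an assumption about how one might handle mixed input.

```python
# Normalize legacy byte strings to UTF-8 before storing them.
# Assumption: input is either valid UTF-8 already, or Windows-1252 / Latin-1.


def to_text(raw: bytes) -> str:
    """Decode bytes, preferring UTF-8 and falling back to legacy encodings."""
    for encoding in ("utf-8", "cp1252"):
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Latin-1 maps every byte to a code point, so this decode never fails.
    return raw.decode("latin-1")


def to_utf8_bytes(raw: bytes) -> bytes:
    """Return the UTF-8 encoding of a legacy byte string."""
    return to_text(raw).encode("utf-8")


print(to_text(b"caf\xe9"))        # Latin-1 / Windows-1252 input
print(to_text(b"\x93hi\x94"))     # Windows-1252 curly quotes
print(to_utf8_bytes(b"caf\xe9"))  # b'caf\xc3\xa9'
```

Trying UTF-8 first is a heuristic: a few legacy byte sequences happen to be valid UTF-8 as well, so a real migration would probably want to spot-check the results.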
Got future plans for your MongoDB cluster?
Too soon to tell! But I have a few ideas about ways we can use MongoDB to supplement other needs and possibly replace other data stores. But I really need to think more about them before I go spouting off.
Looking into the future, Zawodny hopes for more polish on sharding and replica sets and that some good front-end admin tools will be developed. “But that will come in time, I’m sure.”