Hosting and Developing the HTML5 Game Cobalt Calibur with MongoDB, Node.js and OpenShift

This was originally posted on the OpenShift blog by Thomas Hunter.

So, you’re interested in getting the HTML5 game Cobalt Calibur hosted for free? Look no further: Red Hat’s OpenShift can do that for you. Follow this guide and you’ll be up and running in no time. Cobalt Calibur is a multiplayer browser-based game which uses a bunch of HTML5 features to run on the frontend, and requires a Node.js and MongoDB server on the backend. Luckily, OpenShift will satisfy these requirements for you.

The first thing you’ll want to do is create an OpenShift account. It’s quite easy and painless, I promise. Once you’re done getting it set up, be sure to click any email validation links and then log in to the website.

Once you’ve got your account set up, you’re going to want to create an SSH key for your computer (if you haven’t done so previously). To create your SSH key, you will want to open up a terminal emulator and run some commands. These commands should work fine for both OS X and Linux computers. If you’ve already got an SSH key (which you should if you’re a GitHub user), you can skip these steps.

If you’re on a Mac, you’ll want to go to your list of applications and run Terminal. You can get to this app quickly by pressing Cmd+Space, typing in Terminal, and pressing enter.

Below is what your terminal window will end up looking like. You’ll want to type the command ssh-keygen -t rsa, and press enter. You will then be asked a few questions; just leave everything blank and keep hitting enter.

$ ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/home/USERNAME/.ssh/id_rsa): <press enter>
Created directory '/home/USERNAME/.ssh'.
Enter passphrase (empty for no passphrase): <press enter>
Enter same passphrase again: <press enter>
Your identification has been saved in /home/USERNAME/.ssh/id_rsa.
Your public key has been saved in /home/USERNAME/.ssh/id_rsa.pub.

Congrats, you’ve now got an SSH public/private key. This is a file which can be used to prove to a remote computer that you are who you say you are. We need to give a copy of this file to OpenShift so that you can use git to push changes to your code to them.

To get a copy of your key file, you’ll want to copy the text from ~/.ssh/id_rsa.pub. You can run the command cat ~/.ssh/id_rsa.pub, which will display the contents of that file on your screen. Select the output and copy it to your clipboard (everything from ssh-rsa to the username@hostname part):

$ cat ~/.ssh/id_rsa.pub 
ssh-rsa AAAAB3NzaC1yc2BLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAHBLAH== USERNAME@HOSTNAME

Once you’ve copied that text, on the OpenShift website, click My Account > Add a new key to visit the Add a Public Key page, and paste the contents of the output into the big text box. In the small text box above it you can name your key (such as Living Room Desktop or Developer MacBook). You’ll want to use a descriptive name, because if your key is ever compromised, you’ll want to know which one to disable.

Now, click the Create button. OpenShift is now aware of your SSH public key, and you can interact with the git server they provide without problems. Feel free to repeat this process from other machines you plan on working from.

If you get an error when you save the key, you might not have copied the whole thing. If so, you might need to open it in an editor. On a Mac try open ~/.ssh/id_rsa.pub, and on Linux, you might try gedit ~/.ssh/id_rsa.pub.

Now it is time to create our OpenShift application. To do this, visit the Create Application page from the main OpenShift navigation. On this page, you will see a big list of all the types of applications supported by OpenShift. Scroll down until you see the Node.js option and select that.

On the next screen you will be prompted for some very basic information. Specifically, you will be asked to name your application. Since we are uploading the Cobalt Calibur engine, it makes sense to name it something like cobaltcalibur.

You will also be prompted to create a “namespace” for your account. This is basically a way to associate all of your app URLs with your account, so that multiple people can have apps named “cobaltcalibur” without stepping on each other’s toes. I already entered a namespace name before, so I didn’t need to this time.

After you click Create Application, OpenShift will work its cloud magic behind the scenes. During this time it is probably creating some DNS entries, copying some skeleton files, creating a git repository, the works. After the process is done, you will be taken to a new screen:

If you like, you can click the blue link to see the skeleton application OpenShift has created for you. It will be a pretty boring, static page which is displayed by a very simple Node.js app.

What you will want to do though is copy the commands in green and paste them into your terminal. This will pull the skeleton code from your application’s git repository and make a local copy. There are some (probably) important files in here that we will want to keep.

If you see the same listing of files, then congratulations, you’ve checked out your application from OpenShift.

Now that you’ve got your application created and checked out, we want to add MongoDB support to the application. OpenShift calls these Cartridges.

To add MongoDB, first browse to the All Applications page, and then click the title of the application you created:

On this screen you can see the information for accessing your git repository again, but more importantly, there is a big Add Cartridge button.

Click that big blue button, and the next screen will prompt you for the type of cartridge to be added. Click the MongoDB option:

Once you do this, it will prompt you to make sure you want to add MongoDB. Click the Add Cartridge button again, and after some processing happens in the background it will be added to your application. You will want to copy all of the information you are provided with on this screen, notably the user, password, database, and connection URL which contains the IP address and port number for the database. We’ll give this information to the Cobalt Calibur game later on.

Now that we’ve got the MongoDB cartridge added to our application, we want to actually start the MongoDB server. To do this, you will first need to install the rhc command line utility. You’ll want to follow steps 1 and 2 on that page; you can ignore the other steps. The rhc utility gives you more control over your OpenShift applications than the website does, and is needed to start up the MongoDB server. Run the command rhc app cartridge start -a APPNAME -c mongodb-2.0 and this will start the server for you:

You are now ready to download the Cobalt Calibur source code, configure it to work with your OpenShift account, and upload it to the server. To do this, browse to the Cobalt Calibur GitHub page and simply download the ZIP file.

Extract it to the same folder that the Node.js application was checked out into. This will overwrite the index.html page, the server.js file, and the node_modules/ folder; that is all fine.

Now, it’s time to update the server.js file so that it is able to connect to your MongoDB daemon, as well as bind to the proper ip address and port number that OpenShift requires. You can open up server.js in whatever your favorite editor is. Here is what the old code looks like:

// Web Server Configuration
var server_port = 80; // most OS's will require sudo to listen on 80
var server_address = '127.0.0.1';

// MongoDB Configuration
var mongo_host = '127.0.0.1';
var mongo_port = 27017;
var mongo_req_auth = false; // Does your MongoDB require authentication?
var mongo_user = 'admin';
var mongo_pass = 'password';
var mongo_collection = 'terraformia';

And here is what you will want to change it to:

// Web Server Configuration
var server_port = process.env.OPENSHIFT_INTERNAL_PORT; // port assigned by OpenShift
var server_address = process.env.OPENSHIFT_INTERNAL_IP;

// MongoDB Configuration
var mongo_host = 'MONGO IP ADDRESS';
var mongo_port = 27017;
var mongo_req_auth = true; // Does your MongoDB require authentication?
var mongo_user = 'admin';
var mongo_pass = 'MONGO PASSWORD';
var mongo_collection = 'MONGO DATABASE NAME';

Notice how OpenShift provides some environment variables for the web server port and ip address. It might also provide these same variables for the mongo connection, but I didn’t see this information.

The application is now configured properly. You’ll want to now add your files to git, commit the files into git, and push your changes to the server.

git add -A .
git commit -m "Adding Cobalt Calibur files"
git push

You’ll see a bunch of messages from all of the git hooks performing various actions; this is probably a good thing.

Now, if you browse to the URL for your game instance and refresh the page, it should load for you. If not, you might need to run the following command, replacing game with the name of your application (I needed to for some reason):

rhc app restart -a game

Congratulations, you’ve now got your own personal instance of Cobalt Calibur running on OpenShift for free!

There is one big bug with OpenShift though: they don’t support websockets yet. My guess is that the different apps are hosted in a shared environment, and each application gets one port number to the outside world. Websockets require a bunch of random high ports for different clients, so this doesn’t really work with the shared host environment. Luckily, socket.io will fall back to using long-polling AJAX. The game doesn’t always run perfectly under these conditions, e.g. the monsters or corruption might not load. OpenShift is planning on adding this feature sooner or later; you can vote on it in the meantime.

Thomas Hunter is an evented Node.js hacker transitioning from the world of request/response PHP web development, building everything from hardware control software to traditional web apps. Follow him on Twitter at @tlhunter.

Interview with David Mytton, organiser of the London MongoDB User Group

The London MongoDB User Group was founded in March 2011, and since then has grown to approximately 650 members. The group meets the last Tuesday of every month at 10gen’s new London office in Shoreditch.

A very short interview with David Mytton, organiser of the London MongoDB User Group and founder of Server Density.

What are the biggest challenges you have running the London MongoDB User Group? How do you find your speakers for the group?

Finding speakers is the hardest part. It’s like running a conference every month because you have to find several people to provide talks on interesting topics to encourage members to come each month. There’s only so many people using MongoDB in each meetup area so it’s not like a yearly conference which allows time between each event for new people to start up and existing users to create new projects or learn new things.

How have you helped and encouraged the user group to grow? What advice would you give to someone who was starting their own user group?

Making sure we have interesting speakers is the best way to do it. Then using your own promotional channels (Twitter, Blog, telling friends) and connecting with companies using the project. 10gen help with this as well because they’re doing this kind of activity on a full time basis.

Aside from your work and MongoDB, tell me about something you are passionate about?

I particularly enjoy cycling and just returned from a 3 week cycling trip in Japan.

The London MongoDB User Group was recently featured on the Meetup HQ blog, and next meets on Aug 28.


Getting going quickly with Python, MongoDB, and Spatial data on OpenShift: Part II

This post originally appeared on the OpenShift blog

As a follow up to my last post about getting spatial going in MongoDB on OpenShift, today we are going to put a web service in front of it using Python. There are several goals for this article:

  • Learn a little bit about Flask - a Python web framework
  • Learn about how to connect to MongoDB from Python
  • Create a REST Style web service to use in our SoLoMo application

I hope by the end you can see how using a Platform as a Service can get you going with Python, MongoDB, and Spatial faster than you can say…“Awesome Sauce”. We have a lot of ground to cover so let’s dig right in.

Creating the Python application

Here is the OpenShift command line to create the Python app:

rhc app create -t python-2.6 -a pythonws 

Using the flask quickstart from GitHub

We have already put together a flask quickstart in the openshift github space. To get the framework into your application all you have to do is (from the README.md):

cd pythonws
git remote add upstream -m master git://github.com/openshift/openshift-mongo-flask-example.git
git pull -s recursive -X theirs upstream master

There, we now have a Flask app whose source code we can modify.

If you want to just check out the source code I used in the app, you can see it on GitHub and follow the README.md instructions to clone it into your OpenShift account.

Adding MongoDB and importing data

Time to add MongoDB to our application:

 rhc app cartridge add -a pythonws -t mongodb-2.0 

The previous post in this series covers how to import the data from a JSON file of the national parks into your MongoDB database and prepare it for spatial queries. Please follow those instructions to import the data into the pythonws DB, into a collection called parkpoints.

Quick digression to explain Flask

Before we get into our specific application I am going to take a moment to explain the Python framework for this demo. Flask basically allows you to map URL patterns to methods (it also does a lot more, like templating, but this is the only part we are using today). For example, in the mybottleapp.py file that is now in your project you can find the lines:

@route('/')
def index():
    return 'Hello World!'

This says that when a request comes in for the base URL, the function named index gets executed. In this case the function just returns the string “Hello World!” and returning has the effect of sending the string to the requestor.

@route('/name/<name>')
def nameindex(name='Stranger'):
    return 'Hello, %s!' % name

We can also grab pieces of the requested URL and pass them into the function. Enclosing a part of the URL pattern in < > indicates that we want to access it within our function. For example, if the URL looks like:

http://www.mysite.com/name/steve

then the response will be Hello, steve!

If the URL is http://www.mysite.com/name instead, the response will be:

Hello, Stranger!
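To make the idea of mapping URL patterns to functions concrete, here is a minimal sketch of a toy router written with only the standard library. It is not how Flask is implemented; the names ROUTES, route, and dispatch are made up for this illustration, and the < > placeholder handling is a simplified assumption.

```python
import re

# Toy router illustrating the idea only; this is not Flask's implementation.
ROUTES = []  # list of (compiled pattern, handler) pairs


def route(pattern):
    """Register a handler for a URL pattern such as '/name/<name>'."""
    # Turn each <placeholder> into a named group matching one path segment.
    regex = re.sub(r"<(\w+)>", r"(?P<\1>[^/]+)", pattern)
    compiled = re.compile("^" + regex + "$")

    def decorator(func):
        ROUTES.append((compiled, func))
        return func

    return decorator


def dispatch(path):
    """Call the first handler whose pattern matches the requested path."""
    for compiled, func in ROUTES:
        match = compiled.match(path)
        if match:
            # Captured URL pieces become keyword arguments of the handler.
            return func(**match.groupdict())
    return "404 Not Found"


@route("/")
def index():
    return "Hello World!"


@route("/name")
@route("/name/<name>")
def nameindex(name="Stranger"):
    return "Hello, %s!" % name


print(dispatch("/name/steve"))
print(dispatch("/name"))
```

Registering the same function under two patterns is what lets the default argument take over when the URL carries no name.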

We are going to define URL mappings for some basic REST like functionality to interact with our spatial MongoDB data store.

Modify the source code

The first function we are going to write will be to just simply return all the records in the database. In a more full featured app you would probably want to add pagination and other features to this query but we won’t be doing that today.

@app.route("/ws/parks")
def parks():
    # set up the connection
    conn = pymongo.Connection(os.environ['OPENSHIFT_NOSQL_DB_URL'])
    db = conn.parks

    # query the DB for all the parkpoints
    result = db.parkpoints.find()

    # now turn the results into valid JSON
    return str(json.dumps({'results': list(result)}, default=json_util.default))

I chose to put the web services under the url /ws/parks so that we could use other parts of the URL namespace for other functionality. You can now go to your application URL (http://pythonws-.rhcloud.com/ws/parks) and you should be able to see all the documents in the DB.

Using MongoDB in Python

In the code above we simply make a connection to the MongoDB instance for this application and then execute a query. The pymongo package provides all the functionality to interact with the MongoDB instance from our Python code. The pymongo commands are very similar to the MongoDB command line interaction except two word commands like db.collection.findOne are split with a _, such as db.collection.find_one. Please go to the pymongo site to read more about the documentation.

Notice we use the environment variables to specify the connection URL. While not hard coding database connection parameters is good practice in non-cloud apps, in our case you MUST use the environment variables. Since your app can be idled and then spun up or it could be autoscaled, the IP and ports are not always guaranteed. By using the environment variables we make our code portable.
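As a small sketch of that practice, the helper below reads the connection URL from the environment and falls back to a local address when the variable is missing. The helper name mongo_url and the fallback URL are assumptions for this example, meant for running the same code on a development machine.

```python
import os


def mongo_url(environ=os.environ, default="mongodb://127.0.0.1:27017/"):
    """Return the MongoDB URL from the environment, or a local default."""
    # OPENSHIFT_NOSQL_DB_URL is the variable used in this article; the
    # fallback is an assumption for local development only.
    return environ.get("OPENSHIFT_NOSQL_DB_URL", default)


print(mongo_url({}))
```

Passing the environment in as a parameter keeps the function easy to exercise without touching the real process environment.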

We turn the result set (a cursor, which we convert into a list of Python dictionaries) into a string with json.dumps so we can return JSON straight to the client. Since the documents contain BSON types such as ObjectId that the json module cannot serialize on its own, we need to pass json_util.default from the bson library into the json.dumps call.
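The role of the default argument can be seen without pymongo installed. The sketch below uses a stand-in class, FakeObjectId, in place of the real ObjectId, and a hand-written fallback function, to_jsonable, in place of json_util.default; both names are hypothetical and only illustrate the mechanism.

```python
import json
from datetime import datetime


class FakeObjectId(object):
    """Stand-in for bson's ObjectId so this sketch runs without pymongo."""

    def __init__(self, hex_string):
        self.hex_string = hex_string

    def __str__(self):
        return self.hex_string


def to_jsonable(value):
    """Fallback called by json.dumps for values it cannot serialize itself."""
    if isinstance(value, FakeObjectId):
        return {"$oid": str(value)}
    if isinstance(value, datetime):
        return value.isoformat()
    raise TypeError("Cannot serialize %r" % (value,))


docs = [{"_id": FakeObjectId("500c680c1fe9193b67b898a3"), "Name": "Example Park"}]
payload = json.dumps({"results": docs}, default=to_jsonable, sort_keys=True)
print(payload)
```

Without the default argument, json.dumps would raise a TypeError as soon as it reached the _id value.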

This is probably the easiest experience I have ever had writing a web service. I love Flask, Pymongo, and Python for the simplicity of “Just Getting Stuff Done”.

Grab just one park

Next we will implement the code to get back a park given a park’s unique ID. For the ID we will just use the ID generated by MongoDB on document insertion (_id). The ID looks like a long random sequence and that is what we will pass into the URL.

# return a specific park given its mongo _id
@app.route("/ws/parks/park/<parkId>")
def onePark(parkId):
    # set up the connection
    conn = pymongo.Connection(os.environ['OPENSHIFT_NOSQL_DB_URL'])
    db = conn.parks

    # query based on the objectid
    result = db.parkpoints.find({'_id': objectid.ObjectId(parkId)})

    # turn the results into valid JSON
    return str(json.dumps({'results': list(result)}, default=json_util.default))

Here you have to use another class from the bson library: ObjectId. The actual _id in MongoDB is an object, so we have to take the ID passed in on the URL and create an object from it. The ObjectId class allows us to create one of these objects to pass into the query. Other than that the code is the same as above.

This little snippet also shows an example of grabbing part of the URL and passing it to a function. I explained this concept above but here we can see it in practice.

Time for the spatial query

Here we do a query to find national parks near a latitude/longitude pair.

# find parks near a lat and long passed in as query parameters (near?lat=45.5&lon=-82)
@app.route("/ws/parks/near")
def near():
    # set up the connection
    conn = pymongo.Connection(os.environ['OPENSHIFT_NOSQL_DB_URL'])
    db = conn.parks

    # get the request parameters
    lat = float(request.args.get('lat'))
    lon = float(request.args.get('lon'))

    # use the request parameters in the query
    result = db.parkpoints.find({"pos": {"$near": [lon, lat]}})

    # turn the results into valid JSON
    return str(json.dumps({'results': list(result)}, default=json_util.default))

This piece of code shows how to get request parameters from the URL. We capture the lat and lon from the request url and then cast them to floats to use in our query. Remember, everything in a URL comes across as a string so it needs to be converted before being used in the query. In a production app you would need to make sure that you were actually passed strings that could be parsed as floating point numbers. But since this app is just for demo purposes I am not going to show that here.
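For readers who do want that check, here is one possible sketch of the validation. The function name parse_coordinates and the range limits are assumptions for this example; it accepts any mapping with a get method, which is how the request arguments are read in the code above.

```python
def parse_coordinates(args):
    """Return (lat, lon) as floats, or raise ValueError with a readable message."""
    try:
        lat = float(args.get("lat"))
        lon = float(args.get("lon"))
    except (TypeError, ValueError):
        # TypeError: the parameter is missing (None); ValueError: not a number.
        raise ValueError("lat and lon must both be numbers")
    # Comparisons with NaN are always False, so NaN is rejected here as well.
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        raise ValueError("lat or lon is out of range")
    return lat, lon


print(parse_coordinates({"lat": "45.5", "lon": "-82"}))
```

A route could catch the ValueError and answer with an HTTP 400 response instead of letting the request fail with a server error.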

Once we have the coordinates, we pass them into the query just like we did from the command line MongoDB client. The results come back in distance order from the point passed into the query. Remember, the ordering of the coordinates passed into the query needs to match the ordering of the coordinates in your MongoDB collection.
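To illustrate why the coordinate order matters, the following sketch sorts a few made-up documents by plain straight-line distance on their [lon, lat] values. This is only a rough stand-in for what $near returns, not MongoDB's implementation, and the park names and positions are hypothetical.

```python
import math


def nearest_first(docs, lon, lat):
    """Sort documents by straight-line distance from the point (lon, lat)."""
    # Each document stores its position as [lon, lat]; the query point must
    # use the same order, otherwise distances are computed on swapped axes.
    def distance(doc):
        doc_lon, doc_lat = doc["pos"]
        return math.hypot(doc_lon - lon, doc_lat - lat)

    return sorted(docs, key=distance)


parks = [
    {"Name": "A", "pos": [-110.0, 44.0]},
    {"Name": "B", "pos": [-83.0, 45.0]},
    {"Name": "C", "pos": [-68.0, 44.0]},
]
ordered = [p["Name"] for p in nearest_first(parks, -82.0, 45.5)]
print(ordered)
```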

Finish it off with a Regex query with spatial goodness

The final piece of code we are going to write allows for a query based both on the name and the location of interest.

# find parks with a certain name (using regex) near a lat long pair such as above
@app.route("/ws/parks/name/near/<name>")
def nameNear(name):
    # set up the connection
    conn = pymongo.Connection(os.environ['OPENSHIFT_NOSQL_DB_URL'])
    db = conn.parks

    # get the request parameters
    lat = float(request.args.get('lat'))
    lon = float(request.args.get('lon'))

    # compile the regex we want to search for and make it case insensitive
    myregex = re.compile(name, re.I)

    # use the request parameters in the query along with the regex
    result = db.parkpoints.find({"Name": myregex, "pos": {"$near": [lon, lat]}})

    # turn the results into valid JSON
    return str(json.dumps({'results': list(result)}, default=json_util.default))

Just like the example above we parse out the lat and lon from the URL query parameters. In looking at my architecture I do think it might have been better to add the name as a query parameter as well, but this will still work for this article. We grab the name from the end of the URL path and then compile it into a standard Python regular expression (regex). I added the re.I to make the regex case-insensitive. I then use the regex to search against the Name field in the document collection and do a geo search against the pos field. Again, the results will come back in distance order from the point passed into the query.
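The case-insensitive name match can be tried on its own with plain Python data. In this sketch the function parks_matching and the sample documents are hypothetical. Unlike the route above, it escapes the input with re.escape so that user-supplied text is matched literally, which is a deliberate change for illustration.

```python
import re


def parks_matching(docs, name):
    """Return documents whose Name field contains `name`, ignoring case."""
    # re.escape treats the input as literal text; drop it if callers are
    # meant to send real regular expressions, as the article's code allows.
    pattern = re.compile(re.escape(name), re.I)
    return [doc for doc in docs if pattern.search(doc["Name"])]


sample = [
    {"Name": "Yellowstone National Park"},
    {"Name": "Acadia National Park"},
]
print(parks_matching(sample, "yellow"))
```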

Conclusion

And with that we have wrapped up our little web service code - simple and easy using Python and MongoDB. Again, there are some further changes required for going to production, such as request parameter checking, maybe better URL patterns, exception catching, and perhaps a checkin URL - but overall this should put you well on your way. There are examples of:

  • Using Flask to write some nice REST style services in Python
  • Various methods to get URL information so you can use it in your code
  • How to interact with your MongoDB in Python using PyMongo and BSON libraries
  • Getting spatial data out of your application

Give it all a try on OpenShift and drop me a line to show me what you built. I can’t wait to see all the interesting spatial apps built by shifters.

Pub/sub with MongoDB

There are plenty of existing messaging systems out there (Redis, AMQP, ØMQ, etc.) but I’ve recently found MongoDB to be a very compelling alternative, especially if you’re already running MongoDB somewhere in your setup. Using MongoDB’s capped collections and tailable cursors we can build a simple pub/sub system to communicate messages (documents) between processes.

Tailable Cursors

When retrieving records from a tailable cursor we’re able to instruct the MongoDB server to block until some data becomes available (at which point it will be returned by the cursor). It’s worth noting here that the server will time out after a few seconds of waiting for data and return nothing. In this case the driver you’re using will most likely initiate another blocking call behind the scenes, giving us the impression that the cursor is “listening” for data. This process may sound reminiscent of HTTP long polling in the way that data can be “pushed” to the listener. While we could achieve something similar by constantly re-querying for new data, using tailable cursors like this offers a much nicer solution.

Example

I put together a very basic example to demonstrate this functionality using Node.js. You can grab it here if you want to follow along: https://gist.github.com/3210919 It assumes that you already have MongoDB installed and running locally.

First we need to create the capped collection in which messages will be stored. Unfortunately, it turns out that MongoDB won’t keep a tailable cursor open if the collection is empty, so let’s also create a blank document to “prime” the collection. We’ll fire up the Mongo shell to do this:

$ mongo
use pubsub
db.messages.insert({ message: 'Hello world', time: Date.now() })

Without anyone listening for these message inserts, though, we haven’t accomplished anything terribly exciting.

Subscribe

When subscribing to newly inserted messages we first need to find the last document currently in the messages collection. We’ll then use the _id of that document to ensure that our tailable cursor only returns messages created in the future. Beware that since a capped collection does not have a unique index on _id by default, this initial query requires scanning the entire collection. Depending on the size of your capped collection it may be wise to create an index on _id.

 var query = { _id: { $gt: doc._id }, message: { $regex: /foo/i }}; 

I find the ability to perform complex queries like this an incredibly powerful feature and big selling point of using this setup.
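The subscribe logic can be sketched in memory to show how the pieces fit together. The CappedCollection class below is a hypothetical stand-in, not a MongoDB API: a bounded deque plays the capped collection, increasing integers play the _id values, and find_after mimics the query shown above. It does not block the way a real tailable cursor does.

```python
import re
from collections import deque
from itertools import count


class CappedCollection(object):
    """In-memory stand-in for a capped collection: oldest documents fall off."""

    def __init__(self, max_docs):
        self._docs = deque(maxlen=max_docs)
        self._ids = count(1)  # increasing integers play the role of _id

    def insert(self, doc):
        stored = dict(doc, _id=next(self._ids))
        self._docs.append(stored)
        return stored["_id"]

    def last_id(self):
        return self._docs[-1]["_id"] if self._docs else 0

    def find_after(self, last_id, pattern):
        """Mimic { _id: { $gt: last_id }, message: { $regex: pattern } }."""
        return [d for d in self._docs
                if d["_id"] > last_id and pattern.search(d.get("message", ""))]


messages = CappedCollection(max_docs=3)
messages.insert({"message": "old foo"})       # published before we subscribe
cursor_position = messages.last_id()          # subscribe from this point on
messages.insert({"message": "Foo happened"})
messages.insert({"message": "bar happened"})
new_docs = messages.find_after(cursor_position, re.compile("foo", re.I))
print([d["message"] for d in new_docs])
```

Only the message inserted after the subscription point and matching the pattern comes back, which is the behaviour the _id filter is there to guarantee.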

With our tailable cursor created, we can then repeatedly “poll” the cursor for any new messages, keeping in mind that the callback passed to nextObject will not be called until data is available.

The example uses the node-mongodb-native module to connect with MongoDB. Install it and then start up the subscriber. See also Mubsub, a module that packages this pattern for Node.js.

Honestly, I’d love to see this sort of functionality baked right into MongoDB itself. Until then, though, I think the amount of effort required is pretty minimal for what we get. If you’re using MongoDB for messaging like this I’d be curious to hear about it. Hit me up on Twitter (@scttnlsn) or discuss it in the comment section below.

Scott Nelson is a JavaScript developer from Ithaca, NY. He is an open source enthusiast, freelancer, and fervent practitioner of Node.js and MongoDB!

Designing MongoDB Schemas with Embedded, Non-Embedded and Bucket Structures

This was originally posted to the Red Hat OpenShift blog

With the rapid adoption of schema-less, NoSQL data stores like MongoDB, Cassandra and Riak in the last few years, developers now have the ability to enjoy greater agility when it comes to their application’s persistence model. However, just because a datastore is schema-less doesn’t mean the structure of the stored documents won’t play an important role in the overall performance and resilience of the application. In this first post of a four-part blog series about MongoDB, we’ll explore a few strategies you should consider when designing your document structure.

Application requirements should drive schema design

If you ask a dozen experienced developers to design the relational database structure of an application, such as a book review site, it’s likely that each of the structures will be very similar. You’ll likely see tables for authors, books, commenters, comments and so on. The likelihood of having varied relational structures is small because relational database structures are generally well understood. However, if you ask a dozen experienced NoSQL developers to create a similar structure, you’re likely to get a dozen different answers.

Why is there so much variability when it comes to designing a NoSQL schema? To optimize application performance and reliability, a NoSQL schema must be driven by the application’s use case. It’s a novel idea, but it works. Luckily, there are only a few key factors you need to understand when deriving your schema from application requirements. These factors include:

  • How your documents reference child collections
  • The structure and the use of indexes
  • How your data will be sharded

Elements of MongoDB Schemas

Of these factors, how your documents reference child collections, or embedding, is the most important decision you need to make. This point is best demonstrated with an example.

Suppose we’re building the book review site as we mentioned in the introduction. Our application will have authors and books, as well as reviews with threaded comments. How should we structure the collections? Unfortunately, the answers depend on the number of comments we’re expecting per book and how frequently comments are read vs. written. Let’s look at our possible use cases.

The first possibility is where we’re only going to have a few dozen reviews per book, and each review is likely to have a few hundred comments. In this case, embedding the reviews and comments with the book is a viable possibility. Here’s what that might look like:

Listing 1 – Embedded

// Books
{
  "_id": ObjectId("500c680c1fe9193b67b898a3"),
  "publisher": "O'Reilly Media",
  "isbn": "978-1-4493-8156-1",
  "description": "How does MongoDB help you…",
  "title": "MongoDB: The Definitive Guide",
  "formats": ["Print", "Ebook", "Safari Books Online"],
  "authors": [
    { "lastName": "Chodorow", "firstName": "Kristina" },
    { "lastName": "Dirolf", "firstName": "Michael" }
  ],
  "pages": "210"
}

// Reviews
{
  "_id": ObjectId("500c680c1fe9193b67b898a4"),
  "rating": 5,
  "description": "The Authors made an excellent work…",
  "title": "One of O'Reilly excellent books",
  "created": ISODate("2012-07-04T09:48:17Z"),
  "book_id": { "$ref": "books", "$id": ObjectId("500c680c1fe9193b67b898a3") },
  "reviewer": "Giuseppe"
}

// Comments
{
  "_id": ObjectId("500c680c1fe9193b67b898a5"),
  "comment": "This review helped me choose the correct book.",
  "commenter": "Nick",
  "review_id": { "$ref": "reviews", "$id": ObjectId("500c680c1fe9193b67b898a4") },
  "created": ISODate("2012-07-20T13:15:37Z")
}

While simple, this method does have some trade-offs. First, our reviews and comments are strewn throughout the disk. We’re potentially loading thousands of documents to display a page. This leads us to another common embedding strategy – “buckets”.

By bucketing review comments, we can maintain the benefit of fewer reads to display substantial amounts of content, while at the same time maintaining fast writes to smaller documents. An example of a bucketed structure is presented below:

Figure 1 – Hybrid Structure

In this example, the bucket, or hybrid, structure breaks the comments into chunks of roughly 100 comments. Each comment collection maintains a reference to the parent review, as well as its page and current number of contained comments.
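A bucketed layout along these lines might be produced as in the sketch below. The field names review_id, page, count, and comments are assumptions chosen to match the description above, not a schema taken from the original figure.

```python
def bucket_comments(review_id, comments, bucket_size=100):
    """Group a review's comments into bucket documents of at most bucket_size."""
    buckets = []
    for start in range(0, len(comments), bucket_size):
        chunk = comments[start:start + bucket_size]
        buckets.append({
            "review_id": review_id,      # reference to the parent review
            "page": len(buckets) + 1,    # position of this bucket
            "count": len(chunk),         # comments currently in the bucket
            "comments": chunk,
        })
    return buckets


example = bucket_comments("review-1", [{"comment": str(i)} for i in range(250)])
print([(b["page"], b["count"]) for b in example])
```

Displaying one page of comments then means reading a single bucket document, while a new comment only touches the last, partially filled bucket.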

Of course, as software developers, we’re painfully aware there’s no free lunch. The downside to buckets is the increased complexity your application has to deal with. The previous strategies were trivial to implement from an application perspective, but suffered from inefficiencies at scale. Buckets address these inefficiencies, but your application has to do a bit more bookkeeping, such as keeping track of the number of comment buckets for a given review.

Conclusion

My own personal projects with MongoDB have used each one of these strategies at one point or another, but I’ve always grown into more complicated strategies from the most basic, as the application requirements changed. One of the benefits of MongoDB is the ability to change your storage strategy at will and you shouldn’t be afraid to take advantage of this flexibility. By starting simple, you can maintain development velocity early and migrate to a more scalable strategy as the need arises. Stay tuned for additional blogs in this series covering the use of MongoDB indexes, sharding and replica sets.

If you are interested in experimenting with a few of the concepts without having to download and install MongoDB, try it on Red Hat’s OpenShift. It’s FREE to sign up, and all it takes is an email and you’re minutes from having a MongoDB instance running in the cloud.

References

http://www.mongodb.org/display/DOCS/Schema+Design
http://www.10gen.com/presentations/mongosf2011/schemascale

Introducing Mongo Connector

MongoDB is a great general purpose data store, but for some workflows, you may want to use another tool or integrate data from MongoDB into another system. To address this common interest, we built Mongo Connector, which is a generic connection system that you can use to integrate MongoDB with another system with simple CRUD operational semantics (i.e. insert, update, delete, and search operations.)

Possible use cases for this system include:

  • Connecting MongoDB to search engines for more advanced search.
  • Creating a secondary, backup MongoDB cluster that uses Mongo Connector to keep both clusters in sync.
  • Storing specific collections or specific information in other, possibly relational, database systems.
  • Connecting MongoDB to integration platforms such as Mule.
  • Dumping your data from MongoDB to any other storage systems, with support to stop and restart the dump at any point.

On startup, Mongo Connector copies your documents from MongoDB to your target system. Afterwards, it constantly performs updates on the target system to keep MongoDB and the target in sync. The connector supports both Sharded Clusters and standalone Replica Sets, hiding the internal complexities such as rollbacks and chunk migrations. Mongo Connector abstracts the MongoDB internals so you only have to implement one class: the DocManager.

The DocManager is a lightweight and, most importantly, simple-to-write class that defines a limited number of CRUD operations for the target system. The DocManager API explains which functions must be implemented, and Mongo Connector uses those functions to link up MongoDB and the target system.
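As a rough illustration of the idea, here is a toy target system written as a single Python class. The method names and signatures below (upsert, remove, search) are assumptions chosen for this sketch, not the actual DocManager API; consult the DocManager API documentation for the functions Mongo Connector really requires.

```python
class InMemoryDocManager:
    """Toy target system: keeps documents in a dict keyed by _id.

    Hypothetical interface for illustration only.
    """

    def __init__(self):
        self.docs = {}

    def upsert(self, doc):
        # Insert the document, or overwrite the stored one with the same _id.
        self.docs[doc["_id"]] = dict(doc)

    def remove(self, doc):
        # Delete the document if present; unknown ids are ignored.
        self.docs.pop(doc["_id"], None)

    def search(self, predicate):
        # Return all stored documents for which predicate(doc) is true.
        return [d for d in self.docs.values() if predicate(d)]


manager = InMemoryDocManager()
manager.upsert({"_id": 1, "title": "first"})
manager.upsert({"_id": 1, "title": "first (edited)"})
manager.upsert({"_id": 2, "title": "second"})
manager.remove({"_id": 2})
print(manager.search(lambda d: "edited" in d["title"]))
```

The point of the design is that the connector can replay inserts, updates and deletes against any system that offers these few operations, without knowing anything else about it.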

For the first release, we have implementations of the DocManager for Solr, ElasticSearch and, of course, MongoDB (if you want to connect your MongoDB to another MongoDB instance).

To install Mongo Connector, issue the following command at your system’s shell:

pip install mongo-connector

After that, start the Mongo Connector. For example, suppose there is a Sharded Cluster with a mongos running on localhost:27217, a Solr search server running on localhost:8080, and the Solr access URL being http://localhost:8080/solr. Then, use the following command to have Mongo Connector sync the MongoDB cluster with Solr:

python mongo_connector.py -m localhost:27217 -t http://localhost:8080/solr

The connector will start syncing the data to the Solr instance at http://localhost:8080/solr.

Check out our GitHub repo for requests for new doc managers, bug reports, and documentation on Mongo Connector: https://github.com/10gen-labs/mongo-connector

About us: Mongo Connector was designed, coded, tested, packaged, and released by Leonardo Stedile and Aayush Upadhyay, two of 10gen’s summer interns. Special thanks to Spencer Brody and Randolph Tan, our two mentors. We hope you find Mongo Connector useful, and that it helps you build awesome things with MongoDB.

MacOSX Preferences Pane for MongoDB

This is a guest post from Rémy SAISSY of OCTO Technology.

In my work as a developer, I keep a full development environment with several MongoDB instances and data sets on my laptop. As an OS X user, I love having beautiful and efficient applications to do everything.

Today, I have the pleasure of announcing the release of the MacOSX Preferences Pane for MongoDB.


What is it for?

The MacOSX preferences pane for MongoDB aims to provide a simple and efficient user interface to control the status of a local MongoDB server, just like the MySQL Preferences Pane.

My focus has been on simplicity, and it has the following features:

  • It runs on MacOSX Snow Leopard, Lion and Mountain Lion.
  • You can manually start and stop the MongoDB server from your system control panel.
  • You can configure MongoDB to start and stop automatically with your system.

If you use Homebrew and have customized your system’s launchd plist, the MacOSX Preferences Pane for MongoDB will:

  • migrate your existing launchd configuration for use with the preferences pane
  • keep all of your launchd customizations through all enable/disable cycles

To prevent upgrade issues from taking time and attention, the preferences pane comes with an automatic update mechanism. Once a new version has been installed, the preferences pane will simply ask you to restart it to start using the new version.

 

Sounds good but I am not an English speaker

The preferences pane for MongoDB comes in several languages:

  • English
  • French
  • Simplified Chinese
  • Spanish
  • Brazilian Portuguese

Feel free to contribute by adding a new language!

Prerequisites 

Since it is only a preferences pane, it does not embed a MongoDB server. Therefore, the first thing you have to do is install MongoDB.

A simple way to accomplish this is to use Homebrew:

$ brew install mongodb

Installation 

The MongoDB Preferences Pane is available on GitHub:

https://github.com/remysaissy/mongodb-macosx-prefspane.

    1. Download the latest version: https://github.com/remysaissy/mongodb-macosx-prefspane/raw/master/download/MongoDB.prefPane.zip
    2. Unzip MongoDB.prefPane.zip
    3. Double click on MongoDB.prefPane

That’s all.

I hope this will be useful. Do not hesitate to contribute and send me your feedback!

July 2012 Release Summary

At the same time the drivers team has been hard at work improving the drivers and adding support for new features in the upcoming 2.2 release. These releases are:

For up-to-date information on new MongoDB releases join the MongoDB announcements mailing list.

MongoDB Blogroll: The Best of July 2012 

Every month, we’ll be publishing the best community blog posts from the month. Here is the digest for July:

Want your blog post to be included in the next update? Tweet it out with the #mongodb hashtag or send it to us directly.

Edda: a log visualizer for MongoDB

We are pleased to announce the initial release of Edda. Edda is a tool for MongoDB that takes mongod log files and generates easy-to-parse pictures of the represented servers.

Edda showing a five-member set with replication paths and member states.

MongoDB servers generate some pretty substantial log files. These lengthy logs are one of the more important tools we have for diagnosing issues with MongoDB servers. However, correlating logs from multiple servers can be time-consuming. Enter Edda, a log visualizer for MongoDB. We hope that this tool will be helpful to MongoDB administrators.

Possible states represented.

For its first release, we focused on visualizing replica sets with Edda. We plan to support visualizing logs from sharded clusters in the future.

A three-member set with one primary, one secondary, and one down node.

Want to try Edda? Install it with pip!

$ pip install edda

Then run Edda from the command line, giving one or more log files for it to parse:

$ edda server1.log server2.log server3.log

Edda requires a mongod to be running. Once Edda has parsed the logs, it will pop up a browser window with a timeline of the events.

You can run Edda on any subset of log files available. This is an example of running Edda on one log file from a seven-member replica set.

Check out our GitHub repo for feature requests, bug reports, and further documentation on Edda: https://github.com/kchodorow/edda

A bit about the team: Edda was designed, coded, tested, packaged, and released by Samantha Ritter and Kaushal Parikh, two of 10gen’s summer interns. We are so happy to have the chance to build a tool for MongoDB and see it through its first release.

Using the Python toolkit Ming to accelerate your MongoDB development

This is a guest post from Rick Copeland of Arborian.

Ming is a Python toolkit providing schema enforcement, an object/document mapper, an in-memory database, and various other goodies developed at SourceForge during our rewrite of the site from a PHP/Postgres stack to a Python/MongoDB one.

Why Ming?

If you’ve come to MongoDB from the world of relational databases, you have probably been struck by just how easy everything is: no big object/relational mapper needed, no new query language to learn (well, maybe a little, but we’ll gloss over that for now), everything is just Python dictionaries, and it’s so, so fast! While this is all true to some extent, one of the big things you give up with MongoDB is structure.

MongoDB is sometimes referred to as a schema-free database. (This is not technically true; I find it more useful to think of MongoDB as having dynamically typed documents. The collection doesn’t tell you anything about the type of documents it contains, but each individual document can be inspected.) While this can be nice, as it’s easy to iterate on your schema quickly in development, it’s also easy to get yourself in trouble the first time your application tries to query by a field that only exists in some of your documents.

The fact of the matter is that even if the database cares nothing about your schema, your application does, and if you play too fast and loose with document structure, it will come back to haunt you in the end. At SourceForge, we created Ming (as in “…the Merciless”, the villain who ruled the planet Mongo in Flash Gordon) to deal with precisely this problem. We wanted a (thin) layer on top of PyMongo that would do a couple of things for you:

  • Make sure that we don’t put malformed data into the database
  • Try to ‘fix’ malformed data coming back from the database
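To give a feel for what those two goals mean in practice, here is a small sketch in plain Python. It is not Ming’s implementation: the schema format (a dict mapping field names to a type and an optional default factory) and the validate() helper are assumptions invented for this illustration.

```python
from datetime import datetime

# Hypothetical schema format: field name -> (type, factory for a missing value or None).
POST_SCHEMA = {
    "title": (str, None),
    "text": (str, None),
    "posted": (datetime, datetime.now),
}


def validate(doc, schema):
    """Return a cleaned copy of doc, or raise ValueError if it cannot be fixed."""
    unknown = set(doc) - set(schema)
    if unknown:
        raise ValueError("unexpected fields: %s" % sorted(unknown))
    clean = {}
    for name, (expected, if_missing) in schema.items():
        if name in doc:
            value = doc[name]
        elif if_missing is not None:
            value = if_missing()  # 'fix' the document by filling in a default
        else:
            raise ValueError("missing field: %s" % name)
        if not isinstance(value, expected):
            raise ValueError("%s should be %s" % (name, expected.__name__))
        clean[name] = value
    return clean


post = validate({"title": "Hello", "text": "First post"}, POST_SCHEMA)
print(sorted(post))
```

A document with a missing posted field is repaired with a default, while one with a non-string title is rejected before it can reach the database.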

Ming’s Architecture

Ming’s architecture is based on the excellent SQL toolkit SQLAlchemy. While much younger than SQLAlchemy and not including any of its code, Ming takes its design inspiration from there.

Ming actually consists of a number of components, including:

  • The schema enforcement layer - This is ‘basic’ Ming, providing validation and conversion of documents on their way in and out of MongoDB. There are actually two APIs at this layer, the imperative syntax and a more declarative syntax.
  • The object/document mapper - The ODM layer extends the schema enforcement layer by providing a unit of work, identity map, and pseudo-relational concepts (one-to-many joins, for instance).
  • MongoDB-in-Memory - This layer is designed to be a drop-in replacement for the native pymongo driver, used for testing your application without needing access to a MongoDB server.

Let’s take a look at each of these components in turn…

Ming Schema Enforcement

A Ming schema is fairly straightforward. Below is an example containing the schema for a blog post in both the imperative and declarative syntaxes:

from datetime import datetime

from ming import collection, Field, Session
from ming import schema as S

session = Session() # ming abstraction for database

# Set up the User schema ahead-of-time
User = dict(username=str, display_name=str)

# "Imperative" style
BlogPost = collection(
   'blog.posts', session, 
   Field('_id', S.ObjectId),
   Field('posted', datetime, if_missing=datetime.utcnow),
   Field('title', str),
   Field('author', User),
   Field('text', str),
   Field('comments', [ 
       dict(author=User,
            posted=S.DateTime(if_missing=datetime.utcnow),
            text=str) ]))

# "Declarative" style
from ming.declarative import Document

class BlogPost(Document):
    class __mongometa__:
        session=session
        name='blog.posts'
        indexes=['author.username', 'comments.author.username']
    _id=Field(str)
    title=Field(str)
    posted=Field(datetime, if_missing=datetime.utcnow)
    author=Field(User)
    text=Field(str)
    comments=Field([
        dict(author=User, 
             posted=datetime,
             text=str) ])

Once you have your schema set up, you can use it to perform all the same operations you can do in pymongo using the manager object attached to the attribute m:

# Bind the session to the database
from ming.datastore import DataStore 
session.bind = DataStore(
    'mongodb://localhost:27017', database='test')

# Queries
BlogPost.m.find(...) # equiv. to db.blog.posts.find(...)

# Inserts
post0 = BlogPost(dict(... fields here ... ))
post0.m.insert()

# Updates using save()
post1 = BlogPost.m.find({'author.username': 'rick446'}).first()
post1.author.username = 'rick447'
post1.m.save()

# Updates using update_partial()
BlogPost.m.update_partial(
  { '_id': ... },
  { '$push': { 'comments': {... comment data...} } })

# Deletes
post1.m.delete() # single document
BlogPost.m.remove({...query...}) # delete by query

The Object-Document Mapper

Building on the schema enforcement layer is the object-document mapper, which provides two useful patterns:

  • Unit of Work - This pattern collects the changes to your objects in memory until a point at which you flush() them all to the database at once.
  • Identity Map - This guarantees that if you load the same database document twice, you’ll get the same object in memory. This keeps you from accidentally loading the object twice, modifying it twice, and having your two sets of changes overwrite one another.
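The two patterns can be sketched in a few lines of plain Python. This is a toy illustration and not Ming’s ODMSession: the SketchSession class, its get() method and the dict used as a stand-in for the database are all assumptions made for the example.

```python
class SketchSession:
    """Toy session combining an identity map with a unit of work."""

    def __init__(self, storage):
        self.storage = storage      # stands in for the database: {_id: document}
        self.identity_map = {}      # _id -> the one in-memory object for that document
        self.writes = 0             # counts writes to storage, for illustration

    def get(self, _id):
        # Identity map: a second load of the same _id returns the same object.
        if _id not in self.identity_map:
            self.identity_map[_id] = dict(self.storage[_id])
        return self.identity_map[_id]

    def flush(self):
        # Unit of work: push every changed object to storage in one pass.
        for _id, obj in self.identity_map.items():
            if self.storage.get(_id) != obj:
                self.storage[_id] = dict(obj)
                self.writes += 1


storage = {1: {"_id": 1, "title": "draft"}}
session = SketchSession(storage)
a = session.get(1)
b = session.get(1)
a["title"] = "The cool post"
# Nothing has been written yet; the change lives only in memory.
pending = storage[1]["title"]
session.flush()
print(a is b, pending, storage[1]["title"])
```

Because both loads return the same object, there is only one set of changes to write, and nothing reaches storage until flush() is called.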

Ming also allows you to model relationships between your documents via ForeignIdProperty and RelationProperty. Here is an example schema for a blog hosting site with multiple blogs:

from ming import schema as S
from ming.odm.declarative import MappedClass
from ming.odm.property import FieldProperty, RelationProperty
from ming.odm.property import ForeignIdProperty
from ming.odm import ODMSession

# wrap the session from the schema layer
odm_session = ODMSession(session)

class Blog(MappedClass):
    class __mongometa__:
        session = odm_session
        name = 'blog.blog'

    _id = FieldProperty(S.ObjectId)
    name = FieldProperty(str)
    posts = RelationProperty('Post')

class Post(MappedClass):
    class __mongometa__:
        session = odm_session
        name = 'blog.posts'

    _id = FieldProperty(S.ObjectId)
    title = FieldProperty(str)
    text = FieldProperty(str)
    blog_id = ForeignIdProperty(Blog)
    blog = RelationProperty(Blog)

Once you have the classes defined, you can load and modify the objects, using the odm_session to save your changes to MongoDB:

# Queries
Blog.query.find(...) # equiv. to db.blog.blog.find(...)
blog = Blog.query.get(name='MongoDB Blog')
blog.posts # returns a list of post objects for the blog
blog.posts[0].blog # returns the blog object

# Inserts
post = Post(blog=blog, ...) # automatically sets blog_id

# Updates 
post.title = 'The cool post'

# Save your changes
odm_session.flush()

# Mark post for deletion
post.delete()

# Actually delete
odm_session.flush()

MongoDB-in-Memory

The third main component of Ming is an implementation of the pymongo API that allows you to perform testing of your application without having a dependency on a MongoDB server. To use MIM, you can swap out the creation of your pymongo connection:

from ming import mim
import unittest

class TestCase(unittest.TestCase):

    def setUp(self):
        # self.connection = Connection()
        self.connection = mim.Connection()

MIM’s support of the pymongo api and MongoDB query syntax has largely been driven by the various APIs and queries used internal to SourceForge, so there are some gaps, but these are rapidly filled when reported. For instance, MIM does provide support for gridfs and mapreduce already (mapreduce Javascript support provided by python-spidermonkey). And of course MIM integrates well with the rest of Ming, allowing you to substitute a mim:// URL for the normal mongodb:// url in your datastore:

from ming import mim
from ming.datastore import DataStore
import unittest

class TestCase(unittest.TestCase):

    def setUp(self):
        # mim:// replaces mongodb://, so no MongoDB server is needed
        self.ds = DataStore(
            'mim://localhost:27017', database='test')

Conclusion

There are other good bits in Ming, including lazy and eager migrations, support for the MongoDB filesystem gridfs, WSGI auto-flushing middleware for the ODMSession, and more. We’re also experimenting with support for GQL, Google’s query language for the Google App Engine (GAE), to facilitate porting apps from GAE to MongoDB. Ming is actively maintained and is a mission-critical part of the SourceForge application stack, where it’s been in production use for over 2 years.

So what do you think? Is Ming something that you would use for your projects? Have you chosen one of the other MongoDB mappers? Please let us know in the comments below!

To learn more about development with Ming, check out Rick’s ebook MongoDB with Python and Ming or visit the Atlanta MongoDB User Group on Wednesday, where Rick is presenting.

MongoDB on Windows Azure

This post originally appeared on the Microsoft Interoperability Blog.  

Do you need to build a high-availability web application or service? One that can scale out quickly in response to fluctuating demand? Need to do complex queries against schema-free collections of rich objects? If you answer yes to any of those questions, MongoDB on Windows Azure is an approach you’ll want to look at closely.

People have been using MongoDB on Windows Azure for some time, but recently the setup, deployment, and development experience has been streamlined by the release of the MongoDB Installer for Windows Azure. It’s now easier than ever to get started with MongoDB on Windows Azure!

MongoDB

MongoDB is a very popular NoSQL database that stores data in collections of BSON (binary JSON) objects. It is very easy to learn if you have JavaScript (or Node.js) experience, featuring a JavaScript interpreter shell for administrating databases, JSON syntax for data updates and queries, and JavaScript-based map/reduce operations on the server. It is also known for a simple but flexible replication architecture based on replica sets, as well as sharding capabilities for load balancing and high availability. MongoDB is used in many high-volume web sites including Craigslist, FourSquare, Shutterfly, The New York Times, MTV, and others.

If you’re new to MongoDB, the best way to get started is to jump right in and start playing with it. Follow the instructions for your operating system from the list of Quickstart guides on MongoDB.org, and within a couple of minutes you’ll have a live MongoDB installation ready to use on your local machine. Then you can go through the MongoDB.org tutorial to learn the basics of creating databases and collections, inserting and updating documents, querying your data, and other common operations.

MongoDB Installer for Windows Azure

The MongoDB Installer for Windows Azure is a command-line tool (Windows PowerShell script) that automates the provisioning and deployment of MongoDB replica sets on Windows Azure virtual machines. You just need to specify a few options such as the number of nodes and the DNS prefix, and the installer will provision virtual machines, deploy MongoDB to them, and configure a replica set.

Once you have a replica set deployed, you’re ready to build your application or service. The tutorial How to deploy a PHP application using MongoDB on Windows Azure takes you through the steps involved for a simple demo app, including the details of configuring and deploying your application as a cloud service in Windows Azure. If you’re a PHP developer who is new to MongoDB, you may want to also check out the MongoDB tutorial on php.net.

Developer Choice

MongoDB is also supported by a wide array of programming languages, as you can see on the Drivers page of MongoDB.org. The example above is PHP-based, but if you’re a Node.js developer you can find the tutorial Node.js Web Application with Storage on MongoDB over on the Developer Center. If you’re a .NET developer looking to take advantage of MongoDB (either on Windows Azure or Windows), be sure to register for the free July 19 webinar that will cover the latest features of the MongoDB .NET driver in detail.

The team at Microsoft Open Technologies is looking forward to working closely with 10gen to continue to improve the MongoDB developer experience on Windows Azure going forward. We’ll keep you updated here as that collaboration continues!

mongo, the MongoDB Shell

The MongoDB shell (mongo) is an extended SpiderMonkey (JavaScript) shell, so you can use it to execute JavaScript code just like you’re used to writing.

The shell is best at things like testing out queries, examining specific records, configuring replica sets and sharding, and administrative tasks like creating indexes.

Many objects in the shell have help functions in case you forget how to do things. When in doubt:

    > help
    > db.help()
    DB methods:
        db.addUser(username, password[, readOnly=false])
        db.auth(username, password)
        ...

    > db.demo.help()
    DBCollection help
        db.demo.find().help() - show DBCursor help
        ...

Some of the most common tasks the shell is needed for involve sending commands to the MongoDB server, like changing profiler or log settings. These kinds of functions are performed by using the shell to send database commands, for example db.runCommand("shutdown") or db.runCommand({profile:-1}). You can get info on these commands and how to use them from the shell like this:

    > db.listCommands() // print a listing of all the available commands
    > db.commandHelp("compact") // show details about how to use the "compact" database command.

Sometimes, it’s useful to see how a particular shell function works - you can do this by leaving off the ( ), which forces the shell to print the source code of the function instead of just executing it. For example:

    > db.printReplicationInfo() // print info about current replication status
       ...
    > db.printReplicationInfo // print the source code of the printReplicationInfo function
    function () {
        ...
    }

In the shell, the default behavior when executing a query is to print the output, unformatted.

    db.posts.find()
    { "_id" : ObjectId("4e697832c67f0623d40000ad"), "content" : "lorem ipsum" }
    . . .
    { "_id" : ObjectId("4e697832c67f0623d40000f0"), "content" : "four score and 7 years ago" }
    has more

Try adding .pretty() to the function to format the output in a more readable manner, like this:

    db.posts.find().pretty()

By default, queries with lots of results only print 20 results at a time - type “it” at the shell to show the next batch of 20. You can adjust this size limit by setting the value of DBQuery.shellBatchSize.

When the mongo shell starts, it will read and execute any JavaScript code in the file .mongorc.js in your home directory. By adding your settings, utility functions, or tweaks into this file you can make them available in every shell session. For example, add this snippet of code to the file, and it will allow you to run the inspect() function on any JavaScript object in the shell, which will print information about its properties:

function inspect(o, i) {
    if (typeof i == "undefined") {
        i = "";
    }
    if (i.length > 50) {
        return "[MAX ITERATIONS]";
    }
    var r = [];
    for (var p in o) {
        var t = typeof o[p];
        r.push(i + "\"" + p + "\" (" + t + ") => " + (t == "object" ? "object:" + inspect(o[p], i + "  ") : o[p] + ""));
    }
    return r.join(i + "\n");
}

You can also add snippets of code in your .mongorc.js file to do other cool stuff, like customize your shell prompt. In addition, you can execute any file containing JavaScript code from within the shell by calling load(filename).

Frequently it’s useful to execute some mongo shell commands without leaving your operating system shell - for example, to pipe the output to another process or redirect to a file. This can be done easily by just calling the shell command with the --eval option followed by the JavaScript you want to execute. Just be aware that the output isn’t automatically printed by the shell process with this approach, so if you need to print the JSON representation of documents in your queries, you will need to explicitly use the printjson() function to generate correct output. For example:

    $ mongo --eval "db.posts.find().forEach(function(x){printjson(x)})"
    { "_id" : ObjectId("4fbec0b9f3ecac6f43bc1c13"), "x" : 10 }
    ...

When writing statements in the shell that leave an open bracket, parenthesis, or quote, hitting enter will prompt you with “…” for more input. So if you need to write a long block of code, you can let it span multiple lines:

    > for(var i=0;i<100;i++){
    …

If you screw up or want to cancel it and start over, just hit enter twice - the entire block of code will be aborted.

A new shell feature available in versions 2.1.x and later is the ability to edit blocks of code using a text editor. Use the “edit” keyword with the name of a function, and it will invoke your editor with the block of source code for that function:

    > edit testfunc 
    // now we get dropped into an editor where we can edit code for the function
   
    > testfunc // show the source of the function we just wrote
    function testfunc() {
        print("hello world!");
    }
    > testfunc()
    hello world!

MongoDB for the PHP Mind, Part 2

This is part 2 of a series, with part 1 covering the bare essentials to get you going. In this post we are going to take a closer look at queries and how indexes work in MongoDB.

Introduction

I’d like to kick off this post with a thanks to the folks behind the PHP extension for MongoDB, who have done a fantastic job of matching the functionality of the Mongo shell client. This is important when you start to see how similarly the two function, and you might find that you can tweak your logic using the shell and quickly implement the same logic from within PHP.

The PHP extension supports something that is rather new to a lot of folks in the PHP world, a feature called method chaining: The ability to run several methods at the same time on one object. For example, you might want to run a query and then apply a limit to it. Most folks would think that this is two operations, and they are correct, however with method chaining, you can do both in one shot, like this:

$result = $songs->find()->limit(2);

Of course this works in the Mongo shell too. You would basically do the same thing:

result = db.songs.find().limit(2);

For more reading on method chaining, there’s an excellent blog post about method chaining in PHP 5.
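If you are wondering how chaining works under the hood, the trick is simply that each method returns the object it was called on. Here is a minimal sketch, written in Python for brevity; the SketchCursor class and its to_list() method are hypothetical and are not part of the PHP extension or the shell.

```python
class SketchCursor:
    """Toy cursor: each method returns the cursor itself so calls can be chained."""

    def __init__(self, docs):
        self._docs = list(docs)
        self._limit = None
        self._skip = 0

    def skip(self, n):
        self._skip = n
        return self  # returning self is what makes chaining possible

    def limit(self, n):
        self._limit = n
        return self

    def to_list(self):
        # Options are only applied when the results are actually requested.
        docs = self._docs[self._skip:]
        return docs if self._limit is None else docs[:self._limit]


songs = [{"title": t} for t in ("a", "b", "c", "d")]
result = SketchCursor(songs).skip(1).limit(2).to_list()
print(result)
```

Note that skip() and limit() only record options; nothing is evaluated until the results are requested, which is why the chained calls can be combined into a single operation.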

Before we dig deep into finding and manipulating your data, let’s discuss the different data types that MongoDB supports.

MongoDB Data Types

All databases have their own data types, and MongoDB is no different. A summary of MongoDB’s available data types is as follows:

  • ObjectId: Also known within PHP as MongoId, is a unique object usually provided as a primary key by default in the property _id. It is 12 bytes long, and is automatically created by the database when you insert a document without an _id property set. You can also set your own values, but remember that this is used as the primary key and so must be unique.
  • String: Just like strings in PHP, however all strings must be UTF-8. You need to convert non-UTF-8 strings before inserting into your database.
  • Binary: Used for non-UTF-8 strings and other binary data.
  • Boolean: You can use the familiar TRUE and FALSE right from PHP.
  • Numbers: This is a bit more complex, based on whether you are running on a 32 or 64 bit system. On a 32 bit system, numbers are generally stored as 32 bit integers, and 64 bit systems default to 64 bit integers. You can read more detail on MongoInt32 and MongoInt64.
  • Dates: Known as the MongoDate class, which are based on milliseconds since the epoch.
  • Null values: You can assign NULL values from PHP as well.

A Special Word About MongoId

Thoughtful consideration needs to be given to the MongoId data type, as it is used as a primary key for most documents. It is recommended that you allow this feature to run automatically unless you have very specific needs and your own naturally unique primary keys.

A common mistake is the assumption by PHP engineers that MongoIds are strings. They are not. A MongoId is stored as an object. So if you are working with a document whose _id property is set as a MongoId instance with the value of 4cb4ab6d7addf98506010000, you will need to search for that document with an instance of the MongoId class with that value. For instance, imagine you are looking for the document with the previously mentioned MongoId as the _id property:

// This is only a string, this is NOT a MongoId
$mongoid = '4cb4ab6d7addf98506010000';

// You will not find anything by searching by string alone
$nothing = $collection->find(array('_id' => $mongoid));
echo $nothing->count(); // This should echo 0

// THIS is how you find something by MongoId
$realmongoid = new MongoId($mongoid);

// Pass the actual instance of the MongoId object to the query
$something = $collection->find(array('_id' => $realmongoid));
echo $something->count(); // This should echo 1

Always keep this in mind when working with MongoDB. Types are important here, just like PostgreSQL, which will punish you if you attempt to join using columns with slightly different data types.

Another note on the previous example: I assigned the result of the find() to variables $nothing and $something, and then called methods on them. That is because the find() method returns a recordset called a MongoCursor, which provides its own methods. You can get a count on the number of documents returned by a query, as well as iterate through them, and even get an explain plan to see how the query is being executed.

Here is a very common question: So what should I store for _id values in other collections such as user_id or article_id? The solution is simple: Always use MongoId instances. I’ve made the mistake of storing _id values as strings in another collection, and was rewarded by having to always instantiate a new MongoId object for every query. Apt punishment for not thinking things all the way through.

Simply put, if you have a users collection where a given user has an _id property, and they need to store that same value in the posts collection as author_id, then make sure you save author_id as a MongoId object and not a string. Otherwise, every time you wish to display the details of an author to a post, you have to manually instantiate author_id as a MongoId object so you can find the user document by _id primary key.

Oh yeah, MongoDate

MongoDate is also stored as an object, as opposed to an integer or string. Like MongoIds, you need to treat MongoDate objects with additional care; however it’s then possible to do some neat things like find a document that has a MongoDate between 1971 and 1999 for instance:

// Instantiate dates for the range of the query
$start = new MongoDate(strtotime('1971-01-01 00:00:00'));
$end = new MongoDate(strtotime('1999-12-31 23:59:59'));

// Now find documents with create_date between 1971 and 1999
$collection->find(array("create_date" => array('$gte' => $start, '$lte' => $end)));

Queries from PHP

Now it is time to construct some more complex documents to demonstrate how you can find and manipulate your data in MongoDB. Let’s create several documents with a few properties, including a nested array, nested document and a variety of data types discussed earlier in this post. Note the deliberate difference between strings and numbers, as I spell out numbers as strings for simplicity. I’m using the shell to insert these documents quickly and easily, and suggest you follow along:

one =   {
  "string" : "This is not my beautiful house",
  "number" : 42,
  "boolean" : true,
  "list" : ["one", "two", "three"],
  "doc" : {"one" : 1, "two" : 2}
};
db.things.save(one);
two = {
  "string" : "This is not my beautiful wife",
  "number" : 666,
  "boolean" : false,
  "list" : [1, 2, 3],
  "doc" : {"1" : "one", "2" : "two"}
};
db.things.save(two);
three = {
  "string" : "Same as it ever was",
  "number" : 117,
  "boolean" : true,
  "list" : ["one", "two", "four"],
  "doc" : {"one" : 1, "four" : 4}
};
db.things.save(three);

You probably want to see how this went, so you can get a nicely formatted list of what is in your things collection thusly:

> db.things.find().pretty()
{
  "_id" : ObjectId("4fdc77f74e300a45bea9897a"),
  "string" : "This is not my beautiful house",
  "number" : 42,
  "boolean" : true,
  "list" : ["one", "two", "three"],
  "doc" : {"one" : 1, "two" : 2}
}
{
  "_id" : ObjectId("4fdc77f74e300a45bea9897b"),
  "string" : "This is not my beautiful wife",
  "number" : 666,
  "boolean" : false,
  "list" : [1, 2, 3],
  "doc" : {"1" : "one", "2" : "two"}
}
{
  "_id" : ObjectId("4fdc77f94e300a45bea9897c"),
  "string" : "Same as it ever was",
  "number" : 117,
  "boolean" : true,
  "list" : ["one", "two", "four"],
  "doc" : {"one" : 1, "four" : 4}
}

Notice that each of your new documents has an _id property. Chances are your _id values are different than mine, as they have been designed to be unique based on hardware, time and other aspects. This is greatly useful when you are running hundreds (thousands!) of MongoDB servers and need a single value to be unique across all of them.
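
The "time" part is easy to see for yourself. The first 4 bytes of an ObjectId (the first 8 hex characters) are a Unix timestamp in seconds, so a rough sketch like the one below can recover when a document was created. The objectIdTimestamp() helper is my own name for illustration, not part of the shell or any driver.

```javascript
// The first 4 bytes (8 hex characters) of an ObjectId hold a big-endian
// Unix timestamp in seconds. objectIdTimestamp() is a hypothetical helper.
function objectIdTimestamp(hex) {
  if (!/^[0-9a-f]{24}$/i.test(hex)) {
    throw new Error("expected a 24-character hex string");
  }
  return parseInt(hex.slice(0, 8), 16);
}

// The _id values of the first and third documents shown above.
const first = objectIdTimestamp("4fdc77f74e300a45bea9897a");
const third = objectIdTimestamp("4fdc77f94e300a45bea9897c");

console.log(new Date(first * 1000).toISOString()); // creation time of the first document
console.log(third - first); // seconds between the first and third insert
```

You can also see that the last hex digits of the three _id values above simply count up (7a, 7b, 7c), which is the counter portion at work.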

You can now do some interesting things both from the shell and PHP. I’m hopping back to PHP as, um, this is a series on PHP…

// Connect to MongoDB on localhost and select the test database
$mongo = new Mongo('mongodb://localhost');
$db = $mongo->test;

// Get the things collection
$c_things = $db->things;

// Get a count of documents in the things collection
$count_things = $c_things->count();
echo "There are $count_things documents in the things collection.\n";

// How many have the boolean property set to true?
$count_things = $c_things->count(array('boolean' => true));
echo "There are $count_things true documents in the things collection.\n";

// How many have a string property set, regardless of value?
$count_things = $c_things->count(array('string' => array('$exists' => true)));
echo "There are $count_things documents with strings in the things collection.\n";

// How many have a list property whose array values include "one" or "two"?
// ($in matches any of the listed values; use $all to require all of them)
$count_things = $c_things->count(array('list' => array('$in' => array('one','two'))));
echo "There are $count_things documents with 'one' and 'two' as list array values in the things collection.\n";

// How many have a list property with array values not including 'three'?
$count_things = $c_things->count(array('list' => array('$nin' => array('three'))));
echo "There are $count_things documents not including the string 'three' in list array values in the things collection.\n";

// How many include 'ever was' in the string property? Using a regular expression:
$regex = new MongoRegex("/ever was/");
$count_things = $c_things->count(array('string' => $regex));
echo "There are $count_things documents including the string 'ever was' in string property in the things collection.\n";

This is what you should see when running this script on your machine:

$ php -f example.php 
There are 3 documents in the things collection.
There are 2 true documents in the things collection.
There are 3 documents with strings in the things collection.
There are 2 documents with 'one' and 'two' as list array values in the things collection.
There are 2 documents not including the string 'three' in list array values in the things collection.
There are 1 documents including the string 'ever was' in string property in the things collection.

Most importantly, notice that we searched inside an embedded array. We were able to search by the existence of a property, by the value set for a property, and by passing an array to see if any of those values were present in an embedded array. Embedded objects can be queried in much the same way using dot notation, for example doc.one.

That last example illustrates how MongoDB can search with regular expressions, which are case sensitive by default. There are a great many more query options available; the MongoDB documentation on advanced queries is your best place to start.
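
If you would like to poke at these matching rules without a server, here is a rough model in plain JavaScript, run against the same three sample documents. The helper names (exists, inAny, inAll, notIn, matchesRegex) are my own inventions, and this only mimics the operators for this data set; it is not how MongoDB evaluates queries internally.

```javascript
// A plain-JavaScript model of the query operators used above, run against
// the three sample documents. Not how the server evaluates queries.
const things = [
  { string: "This is not my beautiful house", number: 42, boolean: true,
    list: ["one", "two", "three"], doc: { one: 1, two: 2 } },
  { string: "This is not my beautiful wife", number: 666, boolean: false,
    list: [1, 2, 3], doc: { "1": "one", "2": "two" } },
  { string: "Same as it ever was", number: 117, boolean: true,
    list: ["one", "two", "four"], doc: { one: 1, four: 4 } }
];

// Hypothetical helpers, one per operator.
const exists = field => d => field in d;
const inAny = (field, values) => d => d[field].some(v => values.includes(v));
const inAll = (field, values) => d => values.every(v => d[field].includes(v));
const notIn = (field, values) => d => !d[field].some(v => values.includes(v));
const matchesRegex = (field, re) => d => re.test(d[field]);

const count = predicate => things.filter(predicate).length;

const counts = {
  total: things.length,
  isTrue: count(d => d.boolean === true),
  hasString: count(exists("string")),                     // $exists
  oneOrTwo: count(inAny("list", ["one", "two"])),         // $in: any value matches
  oneAndTwo: count(inAll("list", ["one", "two"])),        // $all: every value must be present
  noThree: count(notIn("list", ["three"])),               // $nin
  everWas: count(matchesRegex("string", /ever was/)),
  everWasUpper: count(matchesRegex("string", /EVER WAS/)) // case sensitive, so no match
};
console.log(counts);
```

With this particular data, $in and $all happen to give the same count, because both documents that contain "one" also contain "two". On your own data the two can differ considerably.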

Returning Documents to PHP

So far we’ve stuck to the command line and simple counts as our results. Now we will take a look at how MongoDB returns documents to your PHP applications. The last item in the previous example used a regular expression, but returned just the count. What if you wanted the document instead?

// Find a document that includes 'ever was' in the string property using a regular expression:
$regex = new MongoRegex("/ever was/");
$ever_was = $c_things->findOne(array('string' => $regex));
var_dump($ever_was);

Running this script should look like this, which will probably look very familiar to many of you who have been working in PHP with other databases:

$ php -f example.php 
array(6) {
  '_id' =>
    class MongoId#7 (1) {
      public $$id =>
      string(24) "4fdc77f94e300a45bea9897c"
    }
  'string' =>
    string(19) "Same as it ever was"
  'number' =>
    double(117)
  'boolean' =>
    bool(true)
  'list' =>
    array(3) {
      [0] =>
        string(3) "one"
      [1] =>
        string(3) "two"
      [2] =>
        string(4) "four"
    }
  'doc' =>
    array(2) {
      'one' =>
        double(1)
      'four' =>
        double(4)
    }
}

The result was an array containing all the elements of the document returned by the query. Calling findOne() ensured that only one document would be returned; for multiple documents, call find() instead, which returns a cursor you can iterate over like an ordinary database result set.

Indexing your MongoDB Data

This section is not PHP specific, but critical if you want your PHP apps to perform adequately.

Indexes in MongoDB are similar to what you are familiar with from other databases. When you reach a certain number of documents (and data set size), indexes become necessary to ensure your queries execute quickly and efficiently. Being a document database, however, means that you can index array values and even embedded objects.

Creating an index is simple, as the following shell example illustrates:

> db.things.ensureIndex({"string":1});
> db.things.ensureIndex({"number":1});
> db.things.ensureIndex({"boolean":1});

We just plopped indexes on the string, number and boolean properties for all documents in this collection. What is interesting about this is that there could be documents in this collection that do not have any of these properties set. With a relational database you would be forced to allow NULL values for those columns, which is not always accurate.

What about those nested arrays and objects? We can index the properties at the top level, and we can even index an embedded property if we wanted to. Look at the following example:

> db.things.ensureIndex({"list":1});
> db.things.ensureIndex({"docs.two":1});

We just indexed a property that holds an embedded array. MongoDB is smart enough to create an index entry for each element in that array, per document. So if you search for all documents that have the string one in their list property, MongoDB will use the index. Mixed types are handled as well, so you can use the same index to search for all documents that have the number 2 instead. MongoDB refers to indexes on arrays as multikey indexes.
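
To picture what an index on an array looks like, here is a toy model in plain JavaScript: one entry per array element, each pointing back at the documents that contain it. The buildMultikeyIndex() helper and the short _id values are made up for this sketch, and a real index is a B-tree rather than a Map, so treat it as an illustration only.

```javascript
// Toy model of an index on an array field: one index entry per array element,
// each pointing back at the documents that contain it.
// buildMultikeyIndex() and the short _id values are hypothetical.
const docs = [
  { _id: "a", list: ["one", "two", "three"] },
  { _id: "b", list: [1, 2, 3] },
  { _id: "c", list: ["one", "two", "four"] }
];

function buildMultikeyIndex(documents, field) {
  const index = new Map();
  for (const d of documents) {
    for (const value of d[field] || []) {
      if (!index.has(value)) index.set(value, []);
      index.get(value).push(d._id);
    }
  }
  return index;
}

const listIndex = buildMultikeyIndex(docs, "list");
console.log(listIndex.get("one")); // ids of documents whose list contains the string "one"
console.log(listIndex.get(2));     // ids of documents whose list contains the number 2
console.log(listIndex.size);       // number of distinct indexed values
```

Strings and numbers live side by side in the same index, which is why one index can serve both kinds of lookup.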

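The second example demonstrates one of the many powers of BSON: reaching inside a document for embedded information using dot notation. We just created an index on docs.two, that is, the two property of an embedded docs object. The index applies to all documents in the collection, including those that do not have that property set. (Note that the sample documents above name their embedded object doc, so an index meant to serve them would be on "doc.two".)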
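
Dot notation is simple enough to model in a few lines of plain JavaScript. The getPath() helper below is hypothetical, written only to show how a dotted key walks into an embedded object, and what happens when the path does not exist in a given document.

```javascript
// Hypothetical helper that resolves a dotted path the way a key such as
// "doc.two" addresses a property inside an embedded object.
function getPath(document, path) {
  return path.split(".").reduce(
    (value, key) => (value !== null && typeof value === "object" ? value[key] : undefined),
    document
  );
}

// Trimmed-down versions of the first and third sample documents.
const one = { doc: { one: 1, two: 2 } };
const three = { doc: { one: 1, four: 4 } };

console.log(getPath(one, "doc.two"));   // the embedded value
console.log(getPath(three, "doc.two")); // undefined: the property is not set
console.log(getPath(one, "docs.two"));  // undefined: there is no "docs" property
```

A misspelled path does not raise an error; it simply never matches anything, so it pays to double-check the key when creating an index on an embedded property.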

This can be extremely useful when embedding objects or arrays in your documents. For example, I have an application where multiple third parties have users with special privileges that apply only to their own applications running within the main website. I can now store an embedded object called partners that holds each partner name and a value based on the user’s access level for that partner’s applications. All of this can live happily in the users collection, making maintenance and reporting a breeze!

But what about compound indexes? If you are doing a ton of queries based on the values of two properties, you can create a single index that includes both:

> db.things.ensureIndex({"string":1,"boolean":1});

Of course, you can search on the first property alone and still use the index, so you don’t have to create separate indexes for each property in the compound index. That said, this only works left-to-right: using the above index, you can search on string alone, or on string and boolean together, but if you search solely on boolean you will not use the index.
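
The left-to-right rule can be sketched as a small function in plain JavaScript. This is a deliberate simplification for illustration, not the actual query planner, and usablePrefix() is a name I made up.

```javascript
// Simplified model of the left-to-right rule for compound indexes.
// usablePrefix() is a hypothetical helper, not the query planner.
function usablePrefix(indexFields, queryFields) {
  const prefix = [];
  for (const field of indexFields) {
    // Stop at the first index field the query does not filter on.
    if (!queryFields.includes(field)) break;
    prefix.push(field);
  }
  return prefix;
}

const index = ["string", "boolean"]; // from ensureIndex({"string":1,"boolean":1})

console.log(usablePrefix(index, ["string"]));            // leading field only
console.log(usablePrefix(index, ["boolean", "string"])); // both fields, order in the query is irrelevant
console.log(usablePrefix(index, ["boolean"]));           // empty: the index cannot help
```

An empty result means the query skips the leading field, which is exactly the case where you would need a separate index on boolean.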

You’re probably wondering what the numbers after the property names mean in the index creation statements. Those numbers (1 and -1) tell MongoDB whether this is an ascending or descending index, respectively. Note that direction is irrelevant for single-key indexes, because MongoDB can walk such an index in either direction; it mainly comes into play when sorting on compound indexes.

Let’s take a quick look at what we’ve done to the things collection today, using the shell:

> db.things.stats()
{
  "ns" : "test.things",
  "count" : 3,
  "size" : 656,
  "avgObjSize" : 218.66666666666666,
  "storageSize" : 12288,
  "numExtents" : 1,
  "nindexes" : 7,
  "lastExtentSize" : 12288,
  "paddingFactor" : 1,
  "flags" : 1,
  "totalIndexSize" : 57232,
  "indexSizes" : {
    "_id_" : 8176,
    "string_1" : 8176,
    "number_1" : 8176,
    "boolean_1" : 8176,
    "list_1" : 8176,
    "docs.two_1" : 8176,
    "string_1_boolean_1" : 8176
  },
  "ok" : 1
}

This is the collStats feature in the shell, which gives you statistics about the mentioned collection. This can be useful if you are experiencing unexpected behavior with your collections or indexes.
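
The numbers in that output hang together in a simple way, which you can check with a few lines of plain JavaScript. The figures below are copied from the output above; the variable names are my own.

```javascript
// Figures copied from the stats() output above.
const stats = {
  count: 3,
  size: 656,
  avgObjSize: 218.66666666666666,
  nindexes: 7,
  totalIndexSize: 57232,
  indexSizes: {
    "_id_": 8176, "string_1": 8176, "number_1": 8176, "boolean_1": 8176,
    "list_1": 8176, "docs.two_1": 8176, "string_1_boolean_1": 8176
  }
};

const sizes = Object.values(stats.indexSizes);
const summedIndexSize = sizes.reduce((total, bytes) => total + bytes, 0);
const computedAvg = stats.size / stats.count;

// nindexes counts the entries in indexSizes (including the automatic _id index).
console.log(sizes.length === stats.nindexes);
// totalIndexSize is the sum of the individual index sizes.
console.log(summedIndexSize === stats.totalIndexSize);
// avgObjSize is size divided by count.
console.log(Math.abs(computedAvg - stats.avgObjSize) < 1e-9);
```

Notice that the seven indexes take up far more space than the three tiny documents themselves, a reminder that every index has a storage cost.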

A new index feature in MongoDB is the sparse index. Imagine having a users collection with around 300 million documents, with only 35 of them also having a specific property set. Do you really want to have an index that includes entries for all 300 million documents when searching for those with that property? A sparse index basically only includes documents that have that indexed property. So if you only have 35 documents in your users collection with that property set, you will want to use a sparse index.
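
To get a feel for the difference, here is a toy comparison in plain JavaScript, scaled down to 1,000 users. The users array, the beta_tester property and the countIndexEntries() helper are all made up for this sketch; in the shell, a sparse index is requested by passing {sparse: true} as the second argument to ensureIndex().

```javascript
// Toy comparison of a regular and a sparse index on an optional field.
// The users array, the "beta_tester" field and countIndexEntries() are made up.
// In the shell: db.users.ensureIndex({"beta_tester": 1}, {sparse: true});
const users = [];
for (let i = 0; i < 1000; i++) {
  const user = { _id: i, name: "user" + i };
  if (i % 250 === 0) user.beta_tester = true; // only a handful of users have the field
  users.push(user);
}

function countIndexEntries(documents, field, sparse) {
  let entries = 0;
  for (const d of documents) {
    // A regular index stores an entry (with a null key) even when the field is
    // missing; a sparse index skips those documents entirely.
    if (field in d || !sparse) entries++;
  }
  return entries;
}

const regularEntries = countIndexEntries(users, "beta_tester", false);
const sparseEntries = countIndexEntries(users, "beta_tester", true);
console.log(regularEntries); // one entry per document
console.log(sparseEntries);  // one entry per document that has the field
```

The trade-off is that a sparse index knows nothing about documents lacking the field, so it is best suited to queries that only care about documents where the property is set.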

One final word on indexes in MongoDB: you can do many more things, such as creating unique indexes, dropping duplicates, and indexing geospatial data. A great place to read more is the MongoDB documentation’s indexing advice and FAQ, where a lot of common questions are answered.

Outro, or What’s Coming Next

There are a few more posts coming in this series, including detailed coverage on document data design, a comparison between ODM/ORM/driver approaches and frameworks, advanced queries and how they relate to PHP, taking advantage of map reduce, and a few sample applications demonstrating a few common use cases that I’ll be sharing on GitHub. It is safe to say that there is a lot more coming for the PHP universe on this blog!