How Journaling and Replication Interact

Jun 6 • Posted 2 years ago

Version 1.8 of MongoDB supports journaling in the storage engine for crash safety and fast recovery. An interesting question arises then regarding how journaling interacts with replication. A traditional approach might be to wait for the commit (i.e., journal physical write confirmed) before replicating any data. MongoDB does not do this. Instead, it allows data to replicate even if the journal write has yet to occur or be confirmed. We then must ask “but what happens if we crash before journaling but the data replicated out?” With replica sets, it turns out this is ok. In a replica set the rule is that the freshest node will be elected primary. Thus if the crashed node comes back up, but the node which received the unjournaled data is ahead, it will be primary. We might then ask about a cascade of failures. This is ok too as replica sets have a notion of rolling back to a consistent point of view. How do we know our data won’t be rolled back? The answer is that a write is truly committed in a replica set when it has been written at a majority of set members. We can confirm this with the getLastError command. For example, if our write has made it to the journal on two out of three total set members, we know the data is committed even if nodes fail in a cascading sequence, and even if a minority of nodes are permanently lost. Why bother replicating so quickly? It lets us minimize latency between secondaries and primaries. In addition more writes will be successful than traditionally when a crash occurs. Yet the latency reduction is the biggest advantage: fsyncing to disks can be slow — replication lag (if on a LAN) can be less than the time to fsync to disk. Plus both can then be underway concurrently.