Checking Disk Performance with the mongoperf Utility

Jan 17

Note: while this blog post uses some Linux commands in its examples, mongoperf runs and is useful on just about all operating systems.

mongoperf is a utility for checking the disk I/O performance of a server, independent of MongoDB. It performs simple timed random disk I/Os.

mongoperf has a couple of modes: mmf:false and mmf:true  

mmf:false mode is a completely generic random physical I/O test — there is effectively no MongoDB code involved.


With mmf:true mode, the test is a benchmark of memory-mapped file based I/O.  The code is not the MongoDB code but the actions are analogous.  Thus this is a good baseline test of a system including the operating system virtual memory manager’s behavior.

To build the mongoperf tool:

scons mongoperf

(Or, “scons mongoperf.exe” on Windows.)

or grab a prebuilt binary.


Then try the tests below.

mmf:false mode

Here’s an example of a test run with 32 threads performing random physical reads. Note that mongoperf gradually adds more threads so that you can see the difference in performance with more concurrency.

 $ echo "{nThreads:32,fileSizeMB:1000,r:true}" | mongoperf 
mongoperf
use -h for help
parsed options:
{ nThreads: 32, fileSizeMB: 1000, r: true }
creating test file size:1000MB ...
testing...
options:{ nThreads: 32, fileSizeMB: 1000, r: true }
wthr 32
new thread, total running : 1
read:1 write:0
4759 ops/sec 18 MB/sec
4752 ops/sec 18 MB/sec
4760 ops/sec 18 MB/sec
4758 ops/sec 18 MB/sec
4752 ops/sec 18 MB/sec
4754 ops/sec 18 MB/sec
4758 ops/sec 18 MB/sec
4755 ops/sec 18 MB/sec
new thread, total running : 2
9048 ops/sec 35 MB/sec
9039 ops/sec 35 MB/sec
9056 ops/sec 35 MB/sec
9029 ops/sec 35 MB/sec
9047 ops/sec 35 MB/sec
9072 ops/sec 35 MB/sec
9040 ops/sec 35 MB/sec
9042 ops/sec 35 MB/sec
new thread, total running : 4
15116 ops/sec 59 MB/sec
15346 ops/sec 59 MB/sec
15401 ops/sec 60 MB/sec
15448 ops/sec 60 MB/sec
15450 ops/sec 60 MB/sec
15502 ops/sec 60 MB/sec
15474 ops/sec 60 MB/sec
15480 ops/sec 60 MB/sec
read:1 write:0
read:1 write:0
new thread, total running : 8
read:1 write:0
read:1 write:0
15999 ops/sec 62 MB/sec
21811 ops/sec 85 MB/sec
21888 ops/sec 85 MB/sec
21964 ops/sec 85 MB/sec
21876 ops/sec 85 MB/sec
22058 ops/sec 86 MB/sec
21966 ops/sec 85 MB/sec
21976 ops/sec 85 MB/sec
new thread, total running : 16
24316 ops/sec 94 MB/sec
24949 ops/sec 97 MB/sec
25239 ops/sec 98 MB/sec
25032 ops/sec 97 MB/sec
25020 ops/sec 97 MB/sec
25331 ops/sec 98 MB/sec
25175 ops/sec 98 MB/sec
25081 ops/sec 97 MB/sec
new thread, total running : 32
24314 ops/sec 94 MB/sec
24991 ops/sec 97 MB/sec
24779 ops/sec 96 MB/sec
24743 ops/sec 96 MB/sec
24932 ops/sec 97 MB/sec
24947 ops/sec 97 MB/sec
24831 ops/sec 96 MB/sec
24750 ops/sec 96 MB/sec
24843 ops/sec 97 MB/sec

The above test was run on an SSD volume in a 64-bit Red Hat Enterprise Linux server. Notice how the ops/sec increase as we add more threads (up to a point). It’s interesting to look at the output of iostat while this was running:

iostat -xm 2

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s   avgrq-sz avgqu-sz   await  svctm  %util
dm-0              0.00     0.00  1532.00  4104.00     5.98    16.03     8.00  2354.34  517.87   0.17  96.30
dm-0              0.00     0.00  4755.00     0.00    18.57     0.00     8.00     0.93    0.19   0.19  92.55
dm-0              0.00     0.00  4755.50     0.00    18.58     0.00     8.00     0.93    0.20   0.20  93.20
dm-0              0.00     0.00  4753.50     0.00    18.57     0.00     8.00     0.93    0.20   0.20  93.30
dm-0              0.00     0.00  6130.50     1.50    23.95     0.01     8.00     1.23    0.20   0.16  95.15
dm-0              0.00     0.00  9047.50     0.00    35.34     0.00     8.00     1.84    0.20   0.11 100.05
dm-0              0.00     0.00  9033.50     0.00    35.29     0.00     8.00     1.84    0.20   0.11  99.95
dm-0              0.00     0.00  9053.50     9.50    35.37     0.04     8.00     2.00    0.22   0.11 100.00
dm-0              0.00     0.00 10901.00     0.00    42.58     0.00     8.00     2.43    0.22   0.09 100.05
dm-0              0.00     0.00 15404.50     0.00    60.17     0.00     8.00     3.56    0.23   0.06 100.05
dm-0              0.00     0.00 15441.50     0.00    60.32     0.00     8.00     3.58    0.23   0.06 100.20
dm-0              0.00     0.00 15476.50     0.00    60.46     0.00     8.00     3.56    0.23   0.06 100.00
dm-0              0.00     0.00 15433.00     0.00    60.29     0.00     8.00     4.87    0.23   0.06 100.05
dm-0              0.00     0.00 21024.00     0.00    82.12     0.00     8.00     7.06    0.39   0.05 100.40
dm-0              0.00     0.00 21917.00     0.00    85.62     0.00     8.00     6.91    0.31   0.05 100.35
dm-0              0.00     0.00 21964.00     0.00    85.80     0.00     8.00     6.96    0.32   0.05 100.30
dm-0              0.00     0.00 22738.00     0.00    88.82     0.00     8.00     8.07    0.34   0.04 100.25
dm-0              0.00     0.00 24893.00     0.00    97.24     0.00     8.00    10.05    0.41   0.04 100.60
dm-0              0.00     0.00 25060.00     0.00    97.89     0.00     8.00    10.21    0.40   0.04 100.20
dm-0              0.00     0.00 25236.50     0.00    98.58     0.00     8.00    10.34    0.40   0.04 100.70
dm-0              0.00     0.00 24802.00     0.00    96.88     0.00     8.00    11.28    0.40   0.04 100.60
dm-0              0.00     0.00 24859.00     0.00    97.11     0.00     8.00    10.08    0.45   0.04 100.70
dm-0              0.00     0.00 24793.50     0.00    96.85     0.00     8.00     9.89    0.39   0.04 101.10
dm-0              0.00     0.00 24881.00     0.00    97.19     0.00     8.00     9.93    0.39   0.04 100.90
dm-0              0.00     0.00 24823.00     0.00    96.96     0.00     8.00     9.79    0.39   0.04 100.50
dm-0              0.00     0.00 24805.00     0.00    96.89     0.00     8.00     9.92    0.40   0.04 100.40
dm-0              0.00     0.00 24901.00     0.00    97.27     0.00     8.00     9.97    0.39   0.04 100.90

A few things stand out.

  • First, the reads-per-second (“r/s”) numbers match our mongoperf results. 
  • Second, it’s clear that the “%util” column is fairly meaningless in this particular case: we were able to keep increasing r/s even after %util hit 100. I assume this is because %util is a modeled value whose underlying assumptions don’t hold for this device.
  • Third, note that if you multiply the r/s value by 4KB you get the rMB/s value (for example, 4753.5 r/s × 4KB ≈ 18.6 MB/s, matching the 18.57 rMB/s reported), so we are really doing 4KB reads in this case. 

We can now try some writes:

$ echo "{nThreads:32,fileSizeMB:1000,w:true}|mongoperf
new thread, total running : 1
549 ops/sec 2 MB/sec
439 ops/sec 1 MB/sec
270 ops/sec 1 MB/sec
295 ops/sec 1 MB/sec
281 ops/sec 1 MB/sec
371 ops/sec 1 MB/sec
235 ops/sec 0 MB/sec
379 ops/sec 1 MB/sec
new thread, total running : 2
243 ops/sec 0 MB/sec
354 ops/sec 1 MB/sec
310 ops/sec 1 MB/sec
2491 ops/sec 9 MB/sec
2293 ops/sec 8 MB/sec
2077 ops/sec 8 MB/sec
2559 ops/sec 9 MB/sec
1099 ops/sec 4 MB/sec
new thread, total running : 4
2676 ops/sec 10 MB/sec
2667 ops/sec 10 MB/sec
2536 ops/sec 9 MB/sec
2600 ops/sec 10 MB/sec
2612 ops/sec 10 MB/sec
2498 ops/sec 9 MB/sec
2506 ops/sec 9 MB/sec
2492 ops/sec 9 MB/sec
new thread, total running : 8
2463 ops/sec 9 MB/sec
2439 ops/sec 9 MB/sec
2445 ops/sec 9 MB/sec
2401 ops/sec 9 MB/sec
2271 ops/sec 8 MB/sec
2202 ops/sec 8 MB/sec
2206 ops/sec 8 MB/sec
2181 ops/sec 8 MB/sec
new thread, total running : 16
2105 ops/sec 8 MB/sec
2263 ops/sec 8 MB/sec
2305 ops/sec 9 MB/sec
2408 ops/sec 9 MB/sec
2324 ops/sec 9 MB/sec
2244 ops/sec 8 MB/sec
2013 ops/sec 7 MB/sec
2004 ops/sec 7 MB/sec
new thread, total running : 32
read:0 write:1
2088 ops/sec 8 MB/sec
2091 ops/sec 8 MB/sec
2365 ops/sec 9 MB/sec
2278 ops/sec 8 MB/sec
2322 ops/sec 9 MB/sec
2241 ops/sec 8 MB/sec
2105 ops/sec 8 MB/sec
2241 ops/sec 8 MB/sec
2040 ops/sec 7 MB/sec
1997 ops/sec 7 MB/sec
2062 ops/sec 8 MB/sec
2111 ops/sec 8 MB/sec
2150 ops/sec 8 MB/sec
2253 ops/sec 8 MB/sec
2246 ops/sec 8 MB/sec
2188 ops/sec 8 MB/sec

This relatively old SSD drive can only do about 2,000 random writes per second. It also appears that more than one thread is needed to saturate it; we could run with nThreads:1 for a long time to verify that is true (a sketch of such a run follows the table below). Here are some mongoperf statistics from a test run on an Amazon EC2 machine with internal SSD storage:

             iops, thousands
threads    read test    write test
-------    ---------    ----------
   1           4             8
   2           8             8
   4          16             8
   8          32             8
  16          64             8
  32          70             8
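
To double-check the single-thread saturation point mentioned above, one could let a one-thread write test run for a while. A minimal sketch, reusing only options already shown in this post:

$ echo "{nThreads:1,fileSizeMB:1000,w:true}" | mongoperf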

Here’s a read test on a RAID-10 volume comprised of four spinning disks (SATA):

parsed options:
{ nThreads: 32, fileSizeMB: 1000, r: true }
creating test file size:1000MB ...
new thread, total running : 1
150 ops/sec 0 MB/sec
174 ops/sec 0 MB/sec
169 ops/sec 0 MB/sec
new thread, total running : 2
351 ops/sec 1 MB/sec
333 ops/sec 1 MB/sec
347 ops/sec 1 MB/sec
new thread, total running : 4
652 ops/sec 2 MB/sec
578 ops/sec 2 MB/sec
715 ops/sec 2 MB/sec
new thread, total running : 16
719 ops/sec 2 MB/sec
722 ops/sec 2 MB/sec
493 ops/sec 1 MB/sec
new thread, total running : 32
990 ops/sec 3 MB/sec
955 ops/sec 3 MB/sec
842 ops/sec 3 MB/sec

Note that when testing a volume that uses spinning disks it is important to make your test file large, much larger than the 1GB test file in the examples above. Otherwise the test will only hit a few adjacent cylinders on the disk and report results faster than you would achieve if the disk were used in its entirety. Let’s try a larger file:
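
Such a run can be requested by reusing the earlier invocation with a larger fileSizeMB, along these lines:

$ echo "{nThreads:32,fileSizeMB:20000,r:true}" | mongoperf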

{ nThreads: 32, fileSizeMB: 20000, r: true }
new thread, total running : 1
86 ops/sec 0 MB/sec
98 ops/sec 0 MB/sec
91 ops/sec 0 MB/sec
new thread, total running : 2
187 ops/sec 0 MB/sec
188 ops/sec 0 MB/sec
192 ops/sec 0 MB/sec
new thread, total running : 4
295 ops/sec 1 MB/sec
296 ops/sec 1 MB/sec
233 ops/sec 0 MB/sec
new thread, total running : 8
307 ops/sec 1 MB/sec
429 ops/sec 1 MB/sec
414 ops/sec 1 MB/sec
new thread, total running : 16
554 ops/sec 2 MB/sec
501 ops/sec 1 MB/sec
455 ops/sec 1 MB/sec
new thread, total running : 32
893 ops/sec 3 MB/sec
603 ops/sec 2 MB/sec
814 ops/sec 3 MB/sec

Let’s now try a write test on the RAID-10 spinning disks:

parsed options:
{ nThreads: 32, fileSizeMB: 1000, w: true }
creating test file size:1000MB ...
new thread, total running : 1
113 ops/sec 0 MB/sec
117 ops/sec 0 MB/sec
113 ops/sec 0 MB/sec
new thread, total running : 2
120 ops/sec 0 MB/sec
113 ops/sec 0 MB/sec
115 ops/sec 0 MB/sec
new thread, total running : 4
115 ops/sec 0 MB/sec
115 ops/sec 0 MB/sec
112 ops/sec 0 MB/sec
new thread, total running : 8
111 ops/sec 0 MB/sec
110 ops/sec 0 MB/sec
111 ops/sec 0 MB/sec
new thread, total running : 16
116 ops/sec 0 MB/sec
110 ops/sec 0 MB/sec
105 ops/sec 0 MB/sec
new thread, total running : 32
115 ops/sec 0 MB/sec
111 ops/sec 0 MB/sec
114 ops/sec 0 MB/sec

The write result above is slower than one would expect; this is a case where more investigation and analysis would be appropriate, and an example of where running mongoperf can prove useful.

mmf:true mode

mongoperf has another test mode where, instead of using direct (physical) I/O, it tests random reads and writes via memory-mapped file regions. In this case caching comes into effect: you should see very high read speeds if the datafile is small, and speeds that begin to approach physical random I/O speed as the datafile becomes larger than RAM. For example:
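
The run below was presumably started with an invocation along these lines, reconstructed from the parsed options that follow:

$ echo "{recSizeKB:8,nThreads:8,fileSizeMB:1000,r:true,mmf:true}" | mongoperf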

parsed options:
{ recSizeKB: 8, nThreads: 8, fileSizeMB: 1000, r: true, mmf: true }
creating test file size:1000MB ...
new thread, total running : 1
read:1 write:0
65 ops/sec
79 ops/sec
92 ops/sec
107 ops/sec
111 ops/sec
87 ops/sec
125 ops/sec
141 ops/sec
new thread, total running : 2
273 ops/sec
383 ops/sec
422 ops/sec
594 ops/sec
1220 ops/sec
2598 ops/sec
36578 ops/sec
489132 ops/sec
new thread, total running : 4
183926 ops/sec
171128 ops/sec
173286 ops/sec
172908 ops/sec
173187 ops/sec
173322 ops/sec
173961 ops/sec
175195 ops/sec
new thread, total running : 8
389256 ops/sec
396595 ops/sec
398382 ops/sec
402393 ops/sec
400701 ops/sec
404904 ops/sec
400571 ops/sec

The numbers start low because at the beginning of the run the test file is not yet in the file system cache (in the Linux version of mongoperf, anyway). Data faults into the cache quickly because the readahead setting for the volume is quite large. Once the entire file is in RAM, the number of accesses per second is very high.

We can look at the readahead settings for the device with “sudo blockdev --report”. Note that the value reported by this utility in the “RA” field is a number of 512-byte sectors.

During the above test, if we look at iostat, we see large reads occurring because of the readahead setting in use (the avgrq-sz column shows the average request size in sectors):

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sdc             148.33     0.00  116.00    0.00    22.30     0.00   393.63     1.68   14.48   7.19  83.40
sdd             130.67     0.00  113.00    0.00    20.38     0.00   369.35     1.54   13.58   7.19  81.23
sde             154.00     0.00  113.67    0.00    21.85     0.00   393.64     1.84   16.23   7.38  83.87
sdb             140.00     0.00  107.00    0.00    20.27     0.00   387.91     1.88   17.58   7.84  83.87
md0               0.00     0.00 1025.33    0.00    85.34     0.00   170.45     0.00    0.00   0.00   0.00

Thus we are apparently reading ahead approximately 200KB from each spindle on each physical random read I/O.

Note that if your database is much larger than RAM and you expect cache misses on a regular basis, this readahead setting might be too large: if the object to be fetched from disk is only 8KB, another ~200KB in this case is being read ahead with it. This is good for cache preheating, but that readahead could also evict other data from the file system cache; if the data read ahead is “cold” and unlikely to be used, that is bad. In that situation, make the readahead setting for your volume smaller. 32KB might be a good setting, perhaps 16KB on a solid state disk. (It is likely never helpful to go below 8KB, i.e. sixteen 512-byte sectors, as MongoDB b-tree buckets are 8KB.)
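
On Linux, the readahead for a block device can be lowered and checked with blockdev; the device name below is just a placeholder, and the value is given in 512-byte sectors:

$ sudo blockdev --setra 64 /dev/sdb      # 64 sectors x 512 bytes = 32KB of readahead
$ sudo blockdev --report /dev/sdb        # confirm; the RA column is in 512-byte sectors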

One of the trade-offs with readahead is that cache preheating will take a long time if the readahead setting is tiny. Consider the following run, where there was no readahead at all (just 4KB reads on faults):

parsed options:
{ nThreads: 32, fileSizeMB: 1000, r: true, mmf: true }
creating test file size:1000MB ...
testing...
new thread, total running : 1
67 ops/sec
110 ops/sec
184 ops/sec
167 ops/sec
174 ops/sec
159 ops/sec
189 ops/sec
190 ops/sec
new thread, total running : 2
362 ops/sec
393 ops/sec
371 ops/sec
354 ops/sec
374 ops/sec
388 ops/sec
384 ops/sec
394 ops/sec
new thread, total running : 4
486 ops/sec
400 ops/sec
570 ops/sec
589 ops/sec
567 ops/sec
545 ops/sec
576 ops/sec
412 ops/sec
new thread, total running : 8
666 ops/sec
601 ops/sec
499 ops/sec
731 ops/sec
618 ops/sec
448 ops/sec
508 ops/sec
547 ops/sec
new thread, total running : 16
815 ops/sec
802 ops/sec
917 ops/sec
580 ops/sec
955 ops/sec
1006 ops/sec
1048 ops/sec
938 ops/sec
new thread, total running : 32
1993 ops/sec
1186 ops/sec
1331 ops/sec
1317 ops/sec
1298 ops/sec
991 ops/sec
1431 ops/sec
1406 ops/sec
1395 ops/sec
1099 ops/sec
1265 ops/sec
1400 ops/sec
1484 ops/sec
1436 ops/sec
1352 ops/sec
1438 ops/sec
1380 ops/sec
1350 ops/sec
1565 ops/sec
1440 ops/sec
1015 ops/sec
1253 ops/sec
1414 ops/sec
1443 ops/sec
1478 ops/sec
1405 ops/sec
1305 ops/sec
1518 ops/sec
1217 ops/sec
1573 ops/sec
1605 ops/sec
1476 ops/sec
1130 ops/sec
1362 ops/sec
1463 ops/sec
1740 ops/sec
1682 ops/sec
1653 ops/sec
1135 ops/sec
1521 ops/sec
1821 ops/sec
1708 ops/sec
1701 ops/sec
1631 ops/sec
1195 ops/sec
1752 ops/sec
1701 ops/sec

... time passes ...

353038 ops/sec
353508 ops/sec
353159 ops/sec

Near the end of the run, the entire test file is in the file system cache:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
28564 dwight    20   0 1292m 1.0g 1.0g S 609.1  3.3   0:47.11 mongoperf

Note, though, that if we fetch only 4KB at a time at 400 physical random reads per second, we will need up to 1GB / 4KB per page / 400 pages fetched per second = 655 seconds to heat up the cache. (And 1GB is a small cache; imagine a machine with 128GB of RAM and a database that large or larger.) Note that there are ways to preheat a cache other than readahead; for more info see: http://blog.mongodb.org/post/10407828262/cache-reheating-not-to-be-ignored. Suggestion: on Linux, use a recSizeKB of 8 or larger when using mmf:true; it seems that when only a single 4KB page is touched, certain kernel versions may not perform readahead (at least the way mongoperf is coded).
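
A sketch of that suggestion, adding recSizeKB:8 to the options used for the run above:

$ echo "{recSizeKB:8,nThreads:32,fileSizeMB:1000,r:true,mmf:true}" | mongoperf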

Writes with mmf:true

We can also do load testing and simulations of writes via memory-mapped files (analogous to what MongoDB does in its storage engine) with mongoperf. Use mmf:true and w:true for this.

MongoDB writes are written to the crash recovery log (journal) by mongod almost immediately; however, the datafile writes can be deferred for up to a minute. mongoperf simulates this behavior by fsync’ing its test datafile once a minute. Since writes are only allowed to be lazy by that amount, even data that fits in RAM will be written to disk fairly soon (within a minute); thus you may see a good amount of random write I/O while mongoperf is running, even if the test datafile fits in RAM. This is one reason SSDs are often popular in MongoDB deployments.

For example, consider a scenario where we run the following:

$ echo "{recSizeKB:8,nThreads:32,fileSizeMB:1000,w:true,mmf:true}" | mongoperf

If our drive can write 1GB (the test datafile size) sequentially in less than a minute (not unusual), the test will likely report a very high sustained write rate, even after running for more than a minute. However, if we make the file far larger than 1GB, you will likely see a significant slowdown in write speed, as the background flushing of data that is at least a minute old becomes a factor (at least on spinning disks).
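
For example, the write test could be repeated with a much larger file; the 100000 below (about 100GB, so be sure the volume has that much free space) is only an illustration:

$ echo "{recSizeKB:8,nThreads:32,fileSizeMB:100000,w:true,mmf:true}" | mongoperf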

Mixed mode

Note that mongoperf has some other options; see the --help option for more info. In particular, you can run a test with concurrent reads and writes in the same test, and you can also specify read and write rates explicitly to simulate a particular scenario you would like to test.
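
As a minimal sketch of a mixed test, using only options already shown above (see --help for the rate-limiting options):

$ echo "{nThreads:16,fileSizeMB:10000,r:true,w:true}" | mongoperf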

Conclusions and Caveats

Note that mongoperf is not MongoDB. mmf:false mode tests physical disk I/O with no caching; because of caching, MongoDB will usually perform vastly better than that. Additionally, mmf:true is *not* a perfect simulation of MongoDB. You might get better performance from MongoDB than mongoperf indicates.

P.S. The mongoperf utility is very simple (a couple hundred lines of code), so you may wish to take a look at its source code.
