Wednesday, May 22, 2013

MongoDb vs. MS SQL: how to write to journal without additional seeks

In my previous post I figured out why the single-threaded MongoDb benchmark is limited to 1000 inserts per second. But actually only an SSD can reach this limit: 980 inserts per second on SSD versus only 700 on HDD. A modified multi-threaded (8 threads) benchmark gives 7900 inserts on SSD and 5200 on HDD. Why is there such a big difference if the journal is append-only storage? An HDD should perform quite well in a sequential-write scenario. Can we close the gap?

If you read the first article in the series, you remember I was surprised that MS SQL performs almost exactly as many writes as there are test iterations. But MongoDb doubles this number, accessing not only the journal file but also the NTFS metadata file.

Look at how the journal file is created:
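
In essence it is a single CreateFile call. A minimal sketch of it (paraphrased rather than the verbatim MongoDb source, which lives in mongo/util/logfile.cpp; error handling omitted):

    #include <windows.h>

    // Sketch: how mongod opens its journal file on Windows.
    HANDLE openJournal(const wchar_t* path) {
        return CreateFileW(
            path,
            GENERIC_WRITE,
            FILE_SHARE_READ,
            NULL,            // default security attributes
            OPEN_ALWAYS,     // create the file on first use
            FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH,
            NULL);           // no template file
    }
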
The purpose of FILE_FLAG_NO_BUFFERING is obvious:
In these situations, caching can be turned off. This is done at the time the file is opened by passing FILE_FLAG_NO_BUFFERING as a value for the dwFlagsAndAttributes parameter of CreateFile. When caching is disabled, all read and write operations directly access the physical disk. However, the file metadata may still be cached. 
And what does FILE_FLAG_WRITE_THROUGH do?
A write-through request via FILE_FLAG_WRITE_THROUGH also causes NTFS to flush any metadata changes, such as a time stamp update or a rename operation, that result from processing the request.
Windows provides this ability through write-through caching. A process enables write-through caching for a specific I/O operation by passing the FILE_FLAG_WRITE_THROUGH flag into its call to CreateFile. With write-through caching enabled, data is still written into the cache, but the cache manager writes the data immediately to disk rather than incurring a delay by using the lazy writer.  
In other words, this flag a) flushes NTFS metadata and b) keeps written data available in the cache manager's memory.
Do we really need to update file attributes on every write operation, paying for an additional HDD seek?! Do we really need this data to be cached? I expect the journal is read only during recovery.

I removed FILE_FLAG_WRITE_THROUGH and here are the numbers: 7900 inserts per second on SSD and 7800 on HDD! The gap is closed!
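
For clarity, the whole change amounts to dropping one flag from the open call above; a sketch of the patched version (again, not the exact patch):

    // Writes still bypass the cache manager, but NTFS metadata is no
    // longer flushed on every WriteFile, saving the extra seek on HDD.
    HANDLE openJournalPatched(const wchar_t* path) {
        return CreateFileW(
            path,
            GENERIC_WRITE,
            FILE_SHARE_READ,
            NULL,
            OPEN_ALWAYS,
            FILE_FLAG_NO_BUFFERING,   // FILE_FLAG_WRITE_THROUGH removed
            NULL);
    }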

I expect pre-allocating the journal file may give an additional improvement, but right now the artificially introduced 1 ms delay is the real bottleneck and should be addressed first. Maybe once journal throughput is improved there will be no need to keep the default journalCommitInterval value at 100 ms?
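
If you want to experiment with pre-allocation yourself, here is a minimal sketch of a hypothetical helper (not MongoDb code): reserve the journal's full size up front, so appends never extend the file and never touch NTFS metadata for that reason.

    #include <windows.h>

    // Hypothetical helper: set the file size once so later appends don't
    // grow the file. Growing a file updates NTFS metadata as well.
    bool preallocate(HANDLE file, LONGLONG bytes) {
        LARGE_INTEGER size;
        size.QuadPart = bytes;
        return SetFilePointerEx(file, size, NULL, FILE_BEGIN)
            && SetEndOfFile(file);
    }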

Let me submit an issue to JIRA, accompanied by a pull request, and see what the 10gen folks say about it.

UPDATED: BTW, after the modification the disk load drops from 55% to 7-10% during the benchmark. As I already said, this guy is our current bottleneck:
sleepmillis(oneThird);
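
For context, the durability thread looks roughly like this (a schematic based on my reading of dur.cpp, not the exact source):

    #include <atomic>
    #include <chrono>
    #include <thread>

    static std::atomic<bool> running{true};

    static void sleepmillis(unsigned ms) {
        std::this_thread::sleep_for(std::chrono::milliseconds(ms));
    }

    // Placeholder for the real work: append accumulated write intents to
    // the journal, then wake up clients waiting with j:true.
    static void groupCommit() {}

    // oneThird is derived from journalCommitInterval; the sleep granularity
    // here caps how quickly a j:true insert can be acknowledged.
    static void durThread(unsigned oneThird) {
        while (running) {
            sleepmillis(oneThird);   // the line quoted above
            groupCommit();
        }
    }
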
UPDATED: [proof] after the modification, mongod.exe no longer doubles the number of writes.

Stay tuned!
