RAID Parity and Checksum Vectorization
In April Rick Wagner profiled SDSC's Comet system, and found that RAID parity and checksum calculations were the main bottleneck slowing down ZFS's read and write bandwidth. Since then, people have rewritten these algorithms to take advantage of SSE and AVX vector instructions. The purpose of my testing has been to measure how performance has changed with these vectorization patches.
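To give a sense of what the parity rewrite involves: for single-parity raidz, the P column is essentially an XOR across the data columns, and the vectorized code processes 16 bytes per instruction instead of 8. The following is a rough, hypothetical sketch of that idea, not the actual ZFS vdev_raidz code, and it omits the Galois-field math that Raidz2 and Raidz3 need for their extra parity columns.

```c
/* Illustrative sketch only -- not the actual ZFS vdev_raidz code. */
#include <stdint.h>
#include <stddef.h>
#include <emmintrin.h>          /* SSE2 intrinsics */

/* Scalar baseline: XOR one data column into the P parity column, 8 bytes at a time. */
void parity_xor_scalar(uint64_t *p, const uint64_t *d, size_t nwords)
{
	for (size_t i = 0; i < nwords; i++)
		p[i] ^= d[i];
}

/* SSE2 version of the same operation, 128 bits per iteration. */
void parity_xor_sse(uint8_t *p, const uint8_t *d, size_t nbytes)
{
	for (size_t i = 0; i + 16 <= nbytes; i += 16) {
		__m128i vp = _mm_loadu_si128((const __m128i *)(p + i));
		__m128i vd = _mm_loadu_si128((const __m128i *)(d + i));
		_mm_storeu_si128((__m128i *)(p + i), _mm_xor_si128(vp, vd));
	}
	/* A real implementation also handles any tail bytes, plus the
	 * Galois-field multiplies needed for the Q and R parity columns
	 * of Raidz2 and Raidz3. */
}
```

The real patches apply this kind of change inside the vdev_raidz_generate_parity functions that show up in the profiles below.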
Procedure
Testing was performed using two of SDSC's Wombat servers, each with two 8-core Ivy Bridge Xeon E5-2650 v2 processors and 126GB of RAM at 1867 MHz; Ivy Bridge let us test SSE and AVX128, but not AVX2. For the zpools, we used eight data drives and 0-3 parity drives, depending on the parity level (0 for striped). Each pool had a total capacity of 29.6TB, not including parity data.
Testing consists of two stages, writing and reading. During the writing stage, 128GB of random data (generated with lrand48) is written to each of 4 files on a pool, concurrently. After a ten-second pause we begin the reading stage, where dd is used to copy each file to /dev/null, once again concurrently. During each stage, zpool iostat data is recorded every 60 seconds, and the entire process is recorded with perf.
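To make the write stage concrete, here is a minimal sketch of what it amounts to: four concurrent writers streaming lrand48-generated data into their own files. The pool path, buffering, and error handling are placeholders, not the actual harness.

```c
/* Minimal sketch of the write stage; paths and buffer sizes are placeholders. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

#define NFILES     4
#define FILE_BYTES (128ULL * 1024 * 1024 * 1024)  /* 128 GB per file */
#define BUF_LONGS  (1 << 20)                      /* 8 MiB write buffer */

static void write_one(const char *path)
{
	static long buf[BUF_LONGS];
	FILE *f = fopen(path, "wb");
	if (f == NULL) { perror(path); _exit(1); }
	for (unsigned long long written = 0; written < FILE_BYTES;
	    written += sizeof (buf)) {
		for (size_t i = 0; i < BUF_LONGS; i++)
			buf[i] = lrand48();   /* 31 random bits per call */
		fwrite(buf, sizeof (buf), 1, f);
	}
	fclose(f);
	_exit(0);
}

int main(void)
{
	char path[64];
	for (int n = 0; n < NFILES; n++) {
		/* placeholder path, not the real pool layout */
		snprintf(path, sizeof (path), "/testpool/file%d", n);
		if (fork() == 0)
			write_one(path);
	}
	while (wait(NULL) > 0)    /* wait for all four writers */
		;
	/* The read stage then pauses ~10 s and runs dd on each file,
	 * e.g. dd if=/testpool/fileN of=/dev/null, again concurrently. */
	return (0);
}
```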
Results
Multiple Pool Tests
My first set of tests was with 6 pools per machine, to simulate how these servers are actually used. I did not test Raidz3 in this stage because there were not enough drives. For the control I used commit #5475aad on the master branch. For testing the vectorization, I applied Dolbeau's patches to the control code, along with some of my own fixes. That code can be found here.
Version | vdev_raidz_generate_parity CPU % | kthreadd and z_wr_iss CPU % | Average Read Bandwidth (GB/s) | Average Write Bandwidth (GB/s) |
---|---|---|---|---|
Raidz1 Control | 3.72 | 14.17 | 7.192 | 5.870 |
Raidz1 SSE | 3.21 | 14.18 | 7.241 | 6.119 |
Raidz1 AVX128 | 3.25 | 14.31 | 7.243 | 5.904 |
Raidz2 Control | 11.25 | 21.03 | 7.280 | 3.852 |
Raidz2 SSE | 5.74 | 15.90 | 7.301 | 3.915 |
Raidz2 AVX128 | 5.75 | 15.99 | 7.309 | 3.920 |
From these tests we see that while Raidz1 shows minimal change, Raidz2 spends noticeably less time doing parity calculations. That change is not mirrored in bandwidth, though: while time spent in parity calculations drops by roughly five and a half percentage points (from 11.25% to about 5.75%), write bandwidth increases by less than 2% (from 3.85 to about 3.92 GB/s), meaning there is some other bottleneck. My hypothesis was that running multiple pools at once was the cause, so I repeated these tests with only a single pool running.
Single Pool Tests
For this round of tests, I used the same code as before, but only set up a single zpool and used numactl to limit the processes to a single CPU. Because fewer disks were used in these tests, I was also able to profile Raidz3.
Version | vdev_raidz_generate_parity CPU % | kthreadd and z_wr_iss CPU % | Average Read Bandwidth (GB/s) | Average Write Bandwidth (GB/s) |
---|---|---|---|---|
Raidz1 Control | 3.38 | 13.03 | 1.25 | 1.27 |
Raidz1 SSE | 3.10 | 13.94 | 1.265 | 1.277 |
Raidz1 AVX128 | 2.77 | 12.53 | 1.27 | 1.283 |
Raidz2 Control | 10.11 | 19.20 | 1.23 | 0.745 |
Raidz2 SSE | 5.44 | 15.28 | 1.238 | 0.752 |
Raidz2 AVX128 | 5.40 | 15.24 | 1.233 | 0.750 |
Raidz3 Control | 21.50 | 28.68 | 1.58 | 1.223 |
Raidz3 SSE | 8.90 | 18.45 | 1.606 | 1.227 |
Raidz3 AVX128 | 8.22 | 17.21 | 1.61 | 1.228 |
Once again Raidz1 shows minimal change, but both Raidz2 and Raidz3 spend significantly less time in parity calculation. Yet, once again, similar changes are not seen in bandwidth: even though vectorization cuts Raidz3's parity calculation time by more than 10 percentage points, bandwidth changes by less than 2%. Because reducing the number of pools had no effect, I decided to run my tests against the ABD branch, to see whether those changes would fix the bottleneck.
ABD Tests
For the control I used the code from commit #98ba1b9, while for vectorization I included Dolbeau's and my patches, which can be found here.
Version | vdev_raidz_generate_parity CPU % | z_wr_iss CPU % | Average Read Bandwidth (GB/s) | Average Write Bandwidth (GB/s) |
---|---|---|---|---|
Raidz1 Control | 4.79 | 10.86 | 1.247 | 1.275 |
Raidz1 SSE | 2.68 | 9.45 | 1.263 | 1.282 |
Raidz1 AVX128 | 2.67 | 9.39 | 1.262 | 1.277 |
Raidz2 Control | 8.21 | 13.46 | 1.228 | 0.744 |
Raidz2 SSE | 3.75 | 9.58 | 1.238 | 0.751 |
Raidz2 AVX128 | 3.39 | 9.72 | 1.235 | 0.752 |
Raidz3 Control | 18.20 | 22.90 | 1.528 | 1.237 |
Raidz3 SSE | 6.05 | 12.04 | 1.497 | 1.243 |
Raidz3 AVX128 | 5.92 | 11.76 | 1.544 | 1.242 |
Here we see similar results (large changes in parity calculation time, but not in bandwidth), which means that ABD did not solve the problem. This was the last of the parity tests I ran, though, so finding the real bottleneck must wait for a later time.
Striped Tests
I also tested some striped pools, in order to compare their bandwidth against the raidz pools.
Version | Average Read Bandwidth (GB/s) | Average Write Bandwidth (GB/s) |
---|---|---|
Striped, Multiple Pools | 7.260 | 6.640 |
Striped, Single Pool | 1.260 | 1.307 |
Striped, Single Pool w/ ABD | 1.285 | 1.308 |
We can see that striped pools have a much higher write bandwidth than the raidz pools (as expected), but, strangely, Raidz3 pools actually have higher read bandwidth than the striped pools, while the other raidz levels have comparable or slightly lower read bandwidth.
Checksum Tests
Aside from testing the vectorized parity functions, I also tested Tuxoco's vectorized SHA256 checksums. For my control I used the same code as in the parity tests, while for vectorization I added Tuxoco's patches; this code can be found under my vectorized-checksum branch. For these tests I also turned deduplication on for the pools, because SHA256 is only significantly used during deduplication; the main checksum used by ZFS is Fletcher's.
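For context on why dedup is needed to exercise SHA256: ZFS's default fletcher_4 checksum is just four running sums over 32-bit words, which is very cheap per byte (and also vectorizes well). A simplified sketch of that loop, not the exact zfs_fletcher code:

```c
/* Rough sketch of the fletcher_4 checksum loop (native byte order),
 * simplified from the real ZFS code. */
#include <stdint.h>
#include <stddef.h>

void fletcher_4_sketch(const void *buf, size_t size, uint64_t cksum[4])
{
	const uint32_t *ip = buf;
	const uint32_t *end = ip + size / sizeof (uint32_t);
	uint64_t a = 0, b = 0, c = 0, d = 0;

	for (; ip < end; ip++) {
		a += *ip;   /* running sum of 32-bit words */
		b += a;     /* sum of sums                 */
		c += b;     /* third-order sum             */
		d += c;     /* fourth-order sum            */
	}
	cksum[0] = a; cksum[1] = b; cksum[2] = c; cksum[3] = d;
}
```

With deduplication on, blocks are instead checksummed with SHA256, which does far more work per byte, so checksum time becomes a much larger share of the CPU profile.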
Version | kthreadd, z_rd_int, z_wr_iss CPU % | Average Read Bandwidth (GB/s) | Average Write Bandwidth (GB/s) |
---|---|---|---|
Raidz1 Control | 77.17 | 0.496 | 0.333 |
Raidz1 SSE | 72.74 | 0.660 | 0.427 |
Raidz1 AVX | 71.51 | 0.660 | 0.428 |
Raidz2 Control | 78.68 | 0.556 | 0.300 |
Raidz2 SSE | 74.67 | 1.054 | 0.499 |
Raidz2 AVX | 72.23 | 0.695 | 0.385 |
Raidz3 Control | 76.01 | 0.598 | 0.315 |
Raidz3 SSE | 72.45 | 0.750 | 0.398 |
Raidz3 AVX | 71.60 | 0.765 | 0.411 |
Striped Control | 79.81 | 0.680 | 0.325 |
Striped SSE | 73.26 | 0.695 | 0.429 |
Striped AVX | 72.15 | 0.690 | 0.432 |
Here we see major changes in both checksum calculation time and bandwidth: write bandwidth increases by roughly 30% regardless of which raid level is in use (with the Raidz2 SSE run improving by even more).
Full Results
These tables only contain the most important information gathered during these tests; if you want to see the actual flamegraph for each test, or the raw iostat output, go here. I also have results from ABD tests with multiple pools and more striped tests, which were excluded from my report because they were mostly redundant.
Conclusion
From these tests we can see that the time needed to calculate parity is significantly reduced with vectorization, but there is little bandwidth increase, meaning there must be some other bottleneck. One way to find this bottleneck is to compare flamegraphs:
Looking at the ABD Raidz3 control and ABD Raidz3 AVX128 graphs, we can see that a much higher percentage of the time is spent generating random data, which could itself be the bottleneck. The swapper process also takes a higher percentage of CPU time, suggesting that hardware may be the problem. For further investigation, I would suggest pre-generating the data to be written, and repeating the tests on other systems, to see whether our hardware is the limiting factor.
Vectorized checksum calculations, on the other hand, do result in a large increase in bandwidth, showing that checksums are the main bottleneck for deduplication. Using deduplication still costs a significant amount of bandwidth, but vectorization has reduced that cost, and is therefore a successful optimization.