RAID Parity and Checksum Vectorization
In April Rick Wagner profiled SDSC's Comet system, and found that RAID parity and checksum calculations were the main bottleneck slowing down ZFS's read and write bandwidth. Since then, people have rewritten these algorithms to take advantage of SSE and AVX vector instructions. The purpose of my testing has been to measure how performance has changed with these vectorization patches.
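To give a sense of what the parity rewrite involves: for single-parity raidz, the P column is essentially an XOR across the data columns, and the vectorized code processes 16 bytes per instruction instead of 8. The following is a rough, hypothetical sketch of that idea, not the actual ZFS vdev_raidz code, and it omits the Galois-field math that Raidz2 and Raidz3 need for their extra parity columns.

```c
/* Illustrative sketch only -- not the actual ZFS vdev_raidz code. */
#include <stdint.h>
#include <stddef.h>
#include <emmintrin.h>          /* SSE2 intrinsics */

/* Scalar baseline: XOR one data column into the P parity column, 8 bytes at a time. */
void parity_xor_scalar(uint64_t *p, const uint64_t *d, size_t nwords)
{
	for (size_t i = 0; i < nwords; i++)
		p[i] ^= d[i];
}

/* SSE2 version of the same operation, 128 bits per iteration. */
void parity_xor_sse(uint8_t *p, const uint8_t *d, size_t nbytes)
{
	for (size_t i = 0; i + 16 <= nbytes; i += 16) {
		__m128i vp = _mm_loadu_si128((const __m128i *)(p + i));
		__m128i vd = _mm_loadu_si128((const __m128i *)(d + i));
		_mm_storeu_si128((__m128i *)(p + i), _mm_xor_si128(vp, vd));
	}
	/* A real implementation also handles any tail bytes, plus the
	 * Galois-field multiplies needed for the Q and R parity columns
	 * of Raidz2 and Raidz3. */
}
```

The real patches apply this kind of change inside the vdev_raidz_generate_parity functions that show up in the profiles below.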
Procedure
Testing was performed using two of SDSC's Wombat servers, each with two 8-core Ivy Bridge Xeon E5-2650 v2 processors and 126GB of RAM at 1867 MHz; Ivy Bridge let us test SSE and AVX128, but not AVX2. For the zpools, we used eight data drives and 0-3 parity drives, depending on the parity level (0 for striped). Each pool had a total capacity of 29.6TB, not including parity data.
Testing consists of two stages, writing and reading. During the writing stage, 128GB of random data (generated with lrand48) is written to each of 4 files on a pool, concurrently. After a ten-second pause we begin the reading stage, where dd is used to copy each file to /dev/null, once again concurrently. During each stage, zpool iostat data is recorded every 60 seconds, and the entire process is recorded with perf.
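To make the write stage concrete, here is a minimal sketch of what it amounts to: four concurrent writers streaming lrand48-generated data into their own files. The pool path, buffering, and error handling are placeholders, not the actual harness.

```c
/* Minimal sketch of the write stage; paths and buffer sizes are placeholders. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

#define NFILES     4
#define FILE_BYTES (128ULL * 1024 * 1024 * 1024)  /* 128 GB per file */
#define BUF_LONGS  (1 << 20)                      /* 8 MiB write buffer */

static void write_one(const char *path)
{
	static long buf[BUF_LONGS];
	FILE *f = fopen(path, "wb");
	if (f == NULL) { perror(path); _exit(1); }
	for (unsigned long long written = 0; written < FILE_BYTES;
	    written += sizeof (buf)) {
		for (size_t i = 0; i < BUF_LONGS; i++)
			buf[i] = lrand48();   /* 31 random bits per call */
		fwrite(buf, sizeof (buf), 1, f);
	}
	fclose(f);
	_exit(0);
}

int main(void)
{
	char path[64];
	for (int n = 0; n < NFILES; n++) {
		/* placeholder path, not the real pool layout */
		snprintf(path, sizeof (path), "/testpool/file%d", n);
		if (fork() == 0)
			write_one(path);
	}
	while (wait(NULL) > 0)    /* wait for all four writers */
		;
	/* The read stage then pauses ~10 s and runs dd on each file,
	 * e.g. dd if=/testpool/fileN of=/dev/null, again concurrently. */
	return (0);
}
```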
Results
Multiple Pool Tests
My first set of tests was with 6 pools per machine, to simulate how these servers are actually used. I did not test Raidz3 in this stage because there were not enough drives. For the control I used commit #5475aad on the master branch. For testing the vectorization, I applied Dolbeau's patches to the control code, along with some of my own fixes. That code can be found here.
Version | vdev_raidz_generate_parity CPU % | kthreadd and z_wr_iss CPU % | Average Read Bandwidth (GB/s) | Average Write Bandwidth (GB/s) |
---|---|---|---|---|
Raidz1 Control | 3.72 | 14.17 | 7.192 | 5.870 |
Raidz1 SSE | 3.21 | 14.18 | 7.241 | 6.119 |
Raidz1 AVX128 | 3.25 | 14.31 | 7.243 | 5.904 |
Raidz2 Control | 11.25 | 21.03 | 7.280 | 3.852 |
Raidz2 SSE | 5.74 | 15.90 | 7.301 | 3.915 |
Raidz2 AVX128 | 5.75 | 15.99 | 7.309 | 3.920 |
From these tests we see that while Raidz1 shows minimal change, Raidz2 spends noticeably less time doing parity calculations. That change is not mirrored in bandwidth, though: while time spent in parity calculations drops by roughly five and a half percentage points (from 11.25% to about 5.75%), write bandwidth increases by less than 2% (from 3.85 to about 3.92 GB/s), meaning there is some other bottleneck. My hypothesis was that running multiple pools at once was the cause, so I repeated these tests with only a single pool running.
Single Pool Tests
For this round of tests, I used the same code as before, but only set up a single zpool and used numactl to limit the processes to a single CPU. Because fewer disks were used in these tests, I was also able to profile Raidz3.
Version | vdev_raidz_generate_parity CPU % | kthreadd and z_wr_iss CPU % | Average Read Bandwidth (GB/s) | Average Write Bandwidth (GB/s) |
---|---|---|---|---|
Raidz1 Control | 3.38 | 13.03 | 1.25 | 1.27 |
Raidz1 SSE | 3.10 | 13.94 | 1.265 | 1.277 |
Raidz1 AVX128 | 2.77 | 12.53 | 1.27 | 1.283 |
Raidz2 Control | 10.11 | 19.20 | 1.23 | 0.745 |
Raidz2 SSE | 5.44 | 15.28 | 1.238 | 0.752 |
Raidz2 AVX128 | 5.40 | 15.24 | 1.233 | 0.750 |
Raidz3 Control | 21.50 | 28.68 | 1.58 | 1.223 |
Raidz3 SSE | 8.90 | 18.45 | 1.606 | 1.227 |
Raidz3 AVX128 | 8.22 | 17.21 | 1.61 | 1.228 |
Once again Raidz1 shows minimal change, but both Raidz2 and Raidz3 spend significantly less time in parity calculation. Yet, once again, similar changes are not seen in bandwidth: even though vectorization cuts Raidz3's parity calculation time by more than 10 percentage points, bandwidth changes by less than 2%. Because reducing the number of pools had no effect, I decided to run my tests against the ABD branch, to see whether those changes would fix the bottleneck.
ABD Tests
For the control I used the code from commit #98ba1b9, while for vectorization I included Dolbeau's and my patches, which can be found here.
Version | vdev_raidz_generate_parity CPU % | z_wr_iss CPU % | Average Read Bandwidth (GB/s) | Average Write Bandwidth (GB/s) |
---|---|---|---|---|
Raidz1 Control | 4.79 | 10.86 | 1.247 | 1.275 |
Raidz1 SSE | 2.68 | 9.45 | 1.263 | 1.282 |
Raidz1 AVX128 | 2.67 | 9.39 | 1.262 | 1.277 |
Raidz2 Control | 8.21 | 13.46 | 1.228 | 0.744 |
Raidz2 SSE | 3.75 | 9.58 | 1.238 | 0.751 |
Raidz2 AVX128 | 3.39 | 9.72 | 1.235 | 0.752 |
Raidz3 Control | 18.20 | 22.90 | 1.528 | 1.237 |
Raidz3 SSE | 6.05 | 12.04 | 1.497 | 1.243 |
Raidz3 AVX128 | 5.92 | 11.76 | 1.544 | 1.242 |
Here we see similar results (large changes in parity calculation time, but not in bandwidth), which means that ABD did not solve the problem. This was the last of the parity tests I ran, though, so finding the real bottleneck must wait for a later time.
Striped Tests
I also tested some striped pools, in order to compare their bandwidth against the raidz pools.
Version | Average Read Bandwidth (GB/s) | Average Write Bandwidth (GB/s) |
---|---|---|
Striped, Multiple Pools | 7.260 | 6.640 |
Striped, Single Pool | 1.260 | 1.307 |
Striped, Single Pool w/ ABD | 1.285 | 1.308 |
We can see that striped pools have a much higher write bandwidth than the raidz pools (as expected), but, strangely, Raidz3 pools actually have higher read bandwidth than the striped pools, while the other raidz levels have comparable or slightly lower read bandwidth.
Checksum Tests
Aside from testing the vectorized parity functions, I also tested Tuxoco's vectorized SHA256 checksums. For my control I used the same code as in the parity tests, while for vectorization I added Tuxoco's patches; this code can be found under my vectorized-checksum branch. For these tests I also turned deduplication on for the pools, because SHA256 is only significantly used during deduplication; the main checksum used by ZFS is Fletcher's.
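For context on why dedup is needed to exercise SHA256: ZFS's default fletcher_4 checksum is just four running sums over 32-bit words, which is very cheap per byte (and also vectorizes well). A simplified sketch of that loop, not the exact zfs_fletcher code:

```c
/* Rough sketch of the fletcher_4 checksum loop (native byte order),
 * simplified from the real ZFS code. */
#include <stdint.h>
#include <stddef.h>

void fletcher_4_sketch(const void *buf, size_t size, uint64_t cksum[4])
{
	const uint32_t *ip = buf;
	const uint32_t *end = ip + size / sizeof (uint32_t);
	uint64_t a = 0, b = 0, c = 0, d = 0;

	for (; ip < end; ip++) {
		a += *ip;   /* running sum of 32-bit words */
		b += a;     /* sum of sums                 */
		c += b;     /* third-order sum             */
		d += c;     /* fourth-order sum            */
	}
	cksum[0] = a; cksum[1] = b; cksum[2] = c; cksum[3] = d;
}
```

With deduplication on, blocks are instead checksummed with SHA256, which does far more work per byte, so checksum time becomes a much larger share of the CPU profile.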
Version | kthreadd, z_rd_int, z_wr_iss CPU % | Average Read Bandwidth (GB/s) | Average Write Bandwidth (GB/s) |
---|---|---|---|
Raidz1 Control | 77.17 | 0.496 | 0.333 |
Raidz1 SSE | 72.74 | 0.660 | 0.427 |
Raidz1 AVX | 71.51 | 0.660 | 0.428 |
Raidz2 Control | 78.68 | 0.556 | 0.300 |
Raidz2 SSE | 74.67 | 1.054 | 0.499 |
Raidz2 AVX | 72.23 | 0.695 | 0.385 |
Raidz3 Control | 76.01 | 0.598 | 0.315 |
Raidz3 SSE | 72.45 | 0.750 | 0.398 |
Raidz3 AVX | 71.60 | 0.765 | 0.411 |
Striped Control | 79.81 | 0.680 | 0.325 |
Striped SSE | 73.26 | 0.695 | 0.429 |
Striped AVX | 72.15 | 0.690 | 0.432 |
Here we see major changes in both checksum calculation time and bandwidth: write bandwidth increases by roughly 30% regardless of which raid level is in use (with the Raidz2 SSE run improving by even more).
Full Results
These tables only contain the most important information gathered during these tests; if you want to see the actual flamegraph for each test, or the raw iostat output, go here. I also have results from ABD tests with multiple pools and more striped tests, which were excluded from my report because they were mostly redundant.
Conclusion
From these tests we can see that the time needed to calculate parity is significantly reduced with vectorization, but there is little bandwidth increase, meaning there must be some other bottleneck. One way to find this bottleneck is to compare flamegraphs:
Looking at the ABD Raidz3 control and ABD Raidz3 AVX128 graphs, we can see that a much higher percentage of the time is spent generating random data, which could itself be the bottleneck. The swapper process also takes a higher percentage of CPU time, suggesting that hardware may be the problem. For further investigation, I would suggest pre-generating the data to be written, and repeating the tests on other systems, to see whether our hardware is the limiting factor.
Vectorized checksum calculations, on the other hand, do result in a large increase in bandwidth, showing that checksums are the main bottleneck for deduplication. Using deduplication still costs a significant amount of bandwidth, but vectorization has reduced that cost, and is therefore a successful optimization.