Conversation

@senhuang42 commented Feb 11, 2021

Follow-ups after this PR:

  • Paramgrill to re-tune the compression levels.
  • Decide what to do with DMS/DDS

This PR introduces the row-based lazy matchfinder, iterating on the prototype from @terrelln, and presents some initial results.

The comment above ZSTD_RowFindBestMatch_generic() explains at a high level how the matchfinder works.
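As background for the numbers below: each hash-table row stores a small per-entry "tag" (extra hash bits), and candidate positions are found by comparing the current position's tag against the whole row at once and collapsing the result into a bitmask. A minimal SSE2 sketch of that tag scan (the function and variable names here are illustrative, not the PR's actual helpers):

#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Returns a 16-bit mask with bit i set when tagRow[i] == tag.
 * Each set bit marks a row entry worth a full match check. */
static uint16_t row_tag_mask_sse2(const uint8_t* tagRow, uint8_t tag)
{
    const __m128i tags = _mm_loadu_si128((const __m128i*)tagRow);
    const __m128i dup  = _mm_set1_epi8((char)tag);
    const __m128i eq   = _mm_cmpeq_epi8(tags, dup);  /* 0xFF in lanes where the tag matches */
    return (uint16_t)_mm_movemask_epi8(eq);          /* one bit per byte lane */
}

The mask is then walked with the matches &= (matches - 1) loop visible in the diff further down, so only tag hits pay for a full match comparison.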

Preliminary benchmarks: parameters/compression levels are just approximations, so there will be some compression ratio differences. We see some pretty huge speed gains, though.

gcc: silesia.tar, -b5e12:

| clevel | dev ratio | row_hash ratio | dev speed | row_hash speed | speed delta |
|--------|-----------|----------------|-----------|----------------|-------------|
| 5  | 3.313 | 3.322 | 101.3 MB/s | 125.2 MB/s | +23.8% |
| 6  | 3.371 | 3.366 |  80.4 MB/s | 119.3 MB/s | +48.5% |
| 7  | 3.456 | 3.448 |  59.7 MB/s |  85.9 MB/s | +43.9% |
| 8  | 3.489 | 3.480 |  47.0 MB/s |  68.5 MB/s | +45.7% |
| 9  | 3.522 | 3.537 |  35.5 MB/s |  56.2 MB/s | +58.3% |
| 10 | 3.561 | 3.575 |  30.5 MB/s |  51.8 MB/s | +69.8% |
| 11 | 3.578 | 3.584 |  25.0 MB/s |  47.4 MB/s | +89.6% |
| 12 | 3.607 | 3.615 |  17.2 MB/s |  36.3 MB/s | +111.0% 🚀 |

gcc: enwik7, -b5e12:

| clevel | dev ratio | row_hash ratio | dev speed | row_hash speed | speed delta |
|--------|-----------|----------------|-----------|----------------|-------------|
| 5  | 2.896 | 2.907 | 83.8 MB/s | 102.0 MB/s | +21.7% |
| 6  | 2.958 | 2.951 | 62.3 MB/s |  95.3 MB/s | +52.9% |
| 7  | 3.060 | 3.048 | 45.4 MB/s |  65.9 MB/s | +45.1% |
| 8  | 3.099 | 3.087 | 34.7 MB/s |  51.5 MB/s | +48.4% |
| 9  | 3.126 | 3.145 | 24.3 MB/s |  41.4 MB/s | +70.4% |
| 10 | 3.171 | 3.186 | 20.6 MB/s |  37.8 MB/s | +83.5% |
| 11 | 3.194 | 3.197 | 16.8 MB/s |  34.9 MB/s | +107.7% |
| 12 | 3.219 | 3.226 | 10.9 MB/s |  26.9 MB/s | +146.7% 🚀 |

On small sources we are roughly neutral compared to zstd 1.4.8 if we remove prefetching. Currently the templating on the prefetching is probably not done properly and slows everything down in general. The "no prefetch" column in the following benchmark shows the speed if the prefetching code is simply removed.

gcc: enwik7, -b4e8 -B1KB

| clevel | dev ratio | row_hash ratio | dev speed | row_hash speed | row_hash speed (no prefetch) |
|--------|-----------|----------------|-----------|----------------|------------------------------|
| 4 | 1.765 | 1.761 | 60.1 MB/s | 57.7 MB/s | 58.8 MB/s |
| 5 | 1.785 | 1.782 | 52.8 MB/s | 49.7 MB/s | 50.4 MB/s |
| 6 | 1.787 | 1.784 | 47.5 MB/s | 44.6 MB/s | 46.7 MB/s |
| 7 | 1.787 | 1.784 | 47.4 MB/s | 46.0 MB/s | 49.9 MB/s |
| 8 | 1.787 | 1.784 | 47.3 MB/s | 46.0 MB/s | 49.9 MB/s |

@senhuang42 changed the title from "Row hash2" to "SIMD Row-based Matchfinder" on Feb 11, 2021
@senhuang42 changed the title from "SIMD Row-based Matchfinder" to "SIMD Row Based Matchfinder" on Feb 11, 2021
@senhuang42 changed the title from "SIMD Row Based Matchfinder" to "[rfc] SIMD Row Based Matchfinder" on Feb 11, 2021
@terrelln:

Run-time detection of SSE support.

If we just need SSE, then we can probably do a compile-time check. It's only for AVX that we need runtime checks.
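For what it's worth, a compile-time gate for the SSE2 path could look roughly like this (a sketch; the macro name is made up):

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && (_M_IX86_FP >= 2))
#  define ROW_MF_HAS_SSE2 1   /* SSE2 is baseline on x86-64, so no cpuid check needed */
#else
#  define ROW_MF_HAS_SSE2 0   /* fall back to the scalar tag scan */
#endif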

Comment on lines 976 to 1348
for (; (matches > 0) && (nbAttempts > 0); --nbAttempts, matches &= (matches - 1)) {
    U32 const matchPos = (head + ZSTD_VecMask_next(matches)) & rowMask;
    U32 const matchIndex = row[matchPos];
    if (matchIndex < lowLimit)
        break;
    if (shouldPrefetch) {
        if ((dictMode != ZSTD_extDict) || matchIndex >= dictLimit) {
            PREFETCH_L1(base + matchIndex);
        } else {
            PREFETCH_L1(dictBase + matchIndex);
        }
    }
    matchBuffer[numMatches++] = matchIndex;
Contributor:

Have you tried inserting the scalar fallback here instead?

In the SSE version we iterate over the matches mask.

In the scalar version, we could iterate over the tags.

for (U32 pos = 0; pos < 16; ++pos) {
  U32 const matchPos = (head + pos) & rowMask;
  U32 const matchIndex = row[matchPos];
  if (tagRow[matchPos] != tag) continue;
  if (matchIndex < lowLimit) break;
  if (nbAttempts == 0) break;
  --nbAttempts;

  // Rest is the same

You could also try a cmov-style version to try to avoid branches. Though, who knows if the architecture supports cmov or not, since this code won't be executing on x86.

for (U32 pos = 0; pos < 16; ++pos) {
  U32 const matchPos = (head + pos) & rowMask;
  U32 const matchIndex = row[matchPos];
  int const tagMatches = (tagRow[matchPos] == tag) ? 1 : 0;
  if (matchIndex < lowLimit) break;
  if (nbAttempts == 0) break;
  nbAttempts -= tagMatches;

  // Rest is the same except don't ++ numMatches & don't prefetch b/c it is probably disabled anyways because not x86

  numMatches += tagMatches;

It's probably not right to be measuring performance on x86-64, since we know for a fact that the scalar code won't be executing on x86-64.

You could try executing in 32-bit mode (-m32). That would give a slightly more interesting comparison. And/or you could measure on ARM, though that is harder to set up. Either of those would be a more interesting comparison than x86-64 speed.

Contributor Author:

Definitely an interesting idea, and a more natural way to go about it - I'll give it a shot!

Contributor Author:

After some more measurements, it actually doesn't seem like either of these offers a speed improvement compared to 32-bit with no SIMD. On very small files it seems to provide some benefit, but we don't use the row-based matchfinder for such small files anyway.

@senhuang42 force-pushed the row_hash2 branch 2 times, most recently from f5a867f to 1852ef5 on February 23, 2021 17:17
@senhuang42 changed the title from "[rfc] SIMD Row Based Matchfinder" to "SIMD Row Based Matchfinder 🚀" on Feb 25, 2021
@terrelln left a comment:

I've looked at the integration portion, but not zstd_lazy.c this pass.

tests/fuzzer.c Outdated
Comment on lines 1745 to 1711
size_t const target_nodict_cSize[22+1] = { 3840, 3770, 3870, 3830, 3770+1,
                                           3770+5, 3770+5, 3770+5, 3750+5, 3750+5,
Contributor:

It's fine to change these, but I don't know if they were a tight bound before, or fairly loose. So make sure there isn't a large regression.

@terrelln:

The appveyor + travis tests all seem like legitimate failures

@senhuang42 commented Mar 2, 2021

Advanced param gating

This is now gated behind an advanced parameter, useRowMatchfinder, which toggles a new set of CParams to select from. The new row-based matchfinder uses three new compression strategies: ZSTD_{greedy, lazy, lazy2}_row. This presents some issues:

  1. Some public APIs (like ZSTD_getCParams() and ZSTD_getParams()) will still always return the default set of CParams to avoid having to change their interface.
  2. Though zstd.h says that we're allowed to add new strategies, this introduces some integration difficulties with certain parts of the code that assume that the strategies are mapped to the previous numerical values.
  3. The CI in general can't really run with this advanced param enabled. I've tested that the tests pass if I hardcode useRowMatchfinder = 1, but this is not ideal.

These ongoing changes have introduced about a 1-2% decrease in speed compared to the initial PR.
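For illustration, opting in from user code would look roughly like this (a sketch: the exact parameter name/enum and its value convention are whatever this PR ends up exposing in zstd.h; 1 is used here only because the comment above mentions hardcoding useRowMatchfinder = 1):

#define ZSTD_STATIC_LINKING_ONLY  /* assumed: the new param sits behind the experimental gate */
#include <zstd.h>

static size_t compress_with_row_mf(void* dst, size_t dstCap, const void* src, size_t srcSize)
{
    ZSTD_CCtx* const cctx = ZSTD_createCCtx();
    size_t ret;
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_compressionLevel, 7);
    /* Illustrative parameter name; forces the row matchfinder on in this sketch. */
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_useRowMatchFinder, 1);
    ret = ZSTD_compress2(cctx, dst, dstCap, src, srcSize);
    ZSTD_freeCCtx(cctx);
    return ret;
}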

Memory usage comparison

Parameters have been manually tuned: we're roughly neutral on memory. We can optimize further by making chainLog = 0 a valid value in zstd.
Measured as maximum resident set size ("maxresident" from /usr/bin/time) during CLI compression.
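(For example, an illustrative invocation, not necessarily the exact one used: /usr/bin/time zstd -10 -f silesia.tar -o /tmp/silesia.tar.zst; the "maxresident" figure in time's default output is the number tabulated.)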

silesia.tar: entire file

| level | row_hash5 (mem in KB) | dev (mem in KB) |
|-------|-----------------------|-----------------|
| 5  |  48132 |  48100 |
| 6  |  48100 |  48840 |
| 7  |  48028 |  48972 |
| 8  |  47824 |  48944 |
| 9  |  50968 |  50956 |
| 10 |  94512 |  94520 |
| 11 | 106772 | 106764 |
| 12 | 106660 | 106660 |

enwik7: first 250K

| level | row_hash5 (mem in KB) | dev (mem in KB) |
|-------|-----------------------|-----------------|
| 5  | 3812 | 3812 |
| 6  | 5088 | 4576 |
| 7  | 6108 | 6108 |
| 8  | 6128 | 6128 |
| 9  | 6108 | 6108 |
| 10 | 6108 | 6104 |

enwik7: first 64K

| level | row_hash5 (mem in KB) | dev (mem in KB) |
|-------|-----------------------|-----------------|
| 5  | 3424 | 3428 |
| 6  | 3416 | 3420 |
| 7  | 3416 | 3420 |
| 8  | 3416 | 3420 |
| 9  | 3412 | 3420 |
| 10 | 3412 | 3412 |

32-bit vs. 64-bit comparisons

A comparison of the row hash (useRowMatchfinder = 1) with no SIMD, in 32-bit mode, vs. the default params (useRowMatchfinder = 0). I validated the default params (useRowMatchfinder = 0) against a 1.4.8 checkout to make sure they're still the same.

The regressions don't look too bad. Also, 32-bit + SIMD does perform noticeably better than without. Worst case, 32-bit users can just not enable this advanced parameter. But when we get rid of the advanced parameter, we'll at least want to pay some attention to the impact here.

(Note: no comparisons under 16 KB, since at that point both configurations use the same parameters, i.e. the old lazy matchfinder.)
useRowMatchfinder=1 parameters, 32-bit, no SIMD
1MB:
 5#enwik7            :  10000000 ->   3440416 (2.907),  62.3 MB/s , 459.2 MB/s 
 6#enwik7            :  10000000 ->   3363547 (2.973),  56.0 MB/s , 477.7 MB/s 
 7#enwik7            :  10000000 ->   3280554 (3.048),  42.1 MB/s , 499.5 MB/s 
 8#enwik7            :  10000000 ->   3239530 (3.087),  33.7 MB/s , 519.7 MB/s 
 9#enwik7            :  10000000 ->   3179704 (3.145),  28.8 MB/s , 527.2 MB/s 
10#enwik7            :  10000000 ->   3138517 (3.186),  27.3 MB/s , 529.3 MB/s 
11#enwik7            :  10000000 ->   3127634 (3.197),  25.9 MB/s , 531.2 MB/s 
12#enwik7            :  10000000 ->   3100250 (3.226),  19.2 MB/s , 539.8 MB/s

250K:
 5#enwik7            :  10000000 ->   3655059 (2.736),  56.0 MB/s , 487.0 MB/s 
 6#enwik7            :  10000000 ->   3584634 (2.790),  41.6 MB/s , 505.9 MB/s 
 7#enwik7            :  10000000 ->   3563594 (2.806),  37.6 MB/s , 461.0 MB/s 
 8#enwik7            :  10000000 ->   3535039 (2.829),  29.1 MB/s , 479.7 MB/s 
 9#enwik7            :  10000000 ->   3521692 (2.840),  23.1 MB/s , 485.3 MB/s 
10#enwik7            :  10000000 ->   3512850 (2.847),  14.1 MB/s , 489.6 MB/s

120K:
 5#enwik7            :  10000000 ->   3808402 (2.626),  58.7 MB/s , 423.0 MB/s 
 6#enwik7            :  10000000 ->   3716544 (2.691),  41.0 MB/s , 445.7 MB/s 
 7#enwik7            :  10000000 ->   3688586 (2.711),  33.0 MB/s , 463.3 MB/s 
 8#enwik7            :  10000000 ->   3674326 (2.722),  30.0 MB/s , 468.5 MB/s 
 9#enwik7            :  10000000 ->   3665281 (2.728),  24.6 MB/s , 472.2 MB/s 
10#enwik7            :  10000000 ->   3658267 (2.734),  19.6 MB/s , 476.0 MB/s

64K:
 5#enwik7            :  10000000 ->   3912484 (2.556),  56.8 MB/s , 410.7 MB/s 
 6#enwik7            :  10000000 ->   3832542 (2.609),  40.6 MB/s , 429.7 MB/s 
 7#enwik7            :  10000000 ->   3808642 (2.626),  33.0 MB/s , 442.7 MB/s 
 8#enwik7            :  10000000 ->   3796982 (2.634),  30.6 MB/s , 446.6 MB/s 
 9#enwik7            :  10000000 ->   3790430 (2.638),  25.1 MB/s , 449.4 MB/s 
10#enwik7            :  10000000 ->   3786226 (2.641),  26.4 MB/s , 451.2 MB/s

Default params, 32-bit
1MB:
 5#enwik7            :  10000000 ->   3452564 (2.896),  62.9 MB/s , 459.0 MB/s 
 6#enwik7            :  10000000 ->   3380432 (2.958),  49.6 MB/s , 471.1 MB/s 
 7#enwik7            :  10000000 ->   3267861 (3.060),  34.9 MB/s , 503.0 MB/s 
 8#enwik7            :  10000000 ->   3226874 (3.099),  27.8 MB/s , 523.3 MB/s 
 9#enwik7            :  10000000 ->   3198994 (3.126),  20.8 MB/s , 529.7 MB/s 
10#enwik7            :  10000000 ->   3153880 (3.171),  17.7 MB/s , 528.2 MB/s

250K:
 5#enwik7            :  10000000 ->   3650936 (2.739),  54.2 MB/s , 506.3 MB/s 
 6#enwik7            :  10000000 ->   3583359 (2.791),  41.4 MB/s , 524.8 MB/s 
 7#enwik7            :  10000000 ->   3559814 (2.809),  31.0 MB/s , 480.1 MB/s 
 8#enwik7            :  10000000 ->   3531452 (2.832),  24.3 MB/s , 498.6 MB/s 
 9#enwik7            :  10000000 ->   3518956 (2.842),  18.3 MB/s , 504.3 MB/s 
10#enwik7            :  10000000 ->   3512850 (2.847),  14.2 MB/s , 507.9 MB/s

120K:
 5#enwik7            :  10000000 ->   3808702 (2.626),  62.5 MB/s , 441.5 MB/s 
 6#enwik7            :  10000000 ->   3713482 (2.693),  43.9 MB/s , 466.0 MB/s 
 7#enwik7            :  10000000 ->   3685429 (2.713),  35.5 MB/s , 483.8 MB/s 
 8#enwik7            :  10000000 ->   3669211 (2.725),  29.0 MB/s , 490.3 MB/s 
 9#enwik7            :  10000000 ->   3661579 (2.731),  23.9 MB/s , 494.1 MB/s 
10#enwik7            :  10000000 ->   3658267 (2.734),  20.1 MB/s , 496.7 MB/s

64K:
 5#enwik7            :  10000000 ->   3911244 (2.557),  64.6 MB/s , 429.4 MB/s 
 6#enwik7            :  10000000 ->   3831278 (2.610),  47.6 MB/s , 448.6 MB/s 
 7#enwik7            :  10000000 ->   3807315 (2.627),  39.0 MB/s , 461.8 MB/s 
 8#enwik7            :  10000000 ->   3794626 (2.635),  34.0 MB/s , 466.3 MB/s 
 9#enwik7            :  10000000 ->   3788754 (2.639),  29.8 MB/s , 469.1 MB/s 
10#enwik7            :  10000000 ->   3786226 (2.641),  26.9 MB/s , 470.8 MB/s

Still to-do:

  • Any test failures that show up here.
  • Investigate why, for 32-entry rows, compression doesn't benefit from sizing the rows up to the correct amount: a row actually uses 33 entries including the "head" entry, so jumping between rows in increments of 32 should theoretically be bad, since we lose one entry per row. On the other hand, the current layout does save some memory, since we don't need to waste an additional 31 bytes per row.
  • Clean up git history
  • Remove prefetch toggle code in zstd_lazy.c if the speed impact of the extra code can't be minimized.

@terrelln commented Mar 2, 2021

Investigate why, for 32-entry rows, compression doesn't benefit from sizing the rows up to the correct amount: a row actually uses 33 entries including the "head" entry, so jumping between rows in increments of 32 should theoretically be bad, since we lose one entry per row. On the other hand, the current layout does save some memory, since we don't need to waste an additional 31 bytes per row.

What do you mean by that?

@senhuang42:

Investigate why, for 32-entry rows, compression doesn't benefit from sizing the rows up to the correct amount: a row actually uses 33 entries including the "head" entry, so jumping between rows in increments of 32 should theoretically be bad, since we lose one entry per row. On the other hand, the current layout does save some memory, since we don't need to waste an additional 31 bytes per row.

What do you mean by that?

So for 17-entry tagTable rows, we actually allocate 32 entries per row (wasting 15). When kRowEntries == 32, we actually have 33-entry tagTable rows, so in theory we should be allocating 64 entries per row (wasting 31).

But for some reason, the code seems to work as-is. Changing the logic so that we allocate a bigger tagTable and multiply the tagTable row index we end up on by 2 (or just changing tagTable to be a table of U32 instead of U16) doesn't change anything.

@senhuang42 force-pushed the row_hash2 branch 2 times, most recently from d1e5be0 to e6a5cda on March 3, 2021 22:37
@senhuang42 force-pushed the row_hash2 branch 2 times, most recently from 3df4848 to d20f313 on March 8, 2021 20:45
@senhuang42 force-pushed the row_hash2 branch 4 times, most recently from 1533090 to 19b6704 on March 16, 2021 20:58
@terrelln left a comment:

This is a review of everything except zstd_lazy.c.

@terrelln left a comment:

Review of zstd_lazy.c

@senhuang42 force-pushed the row_hash2 branch 4 times, most recently from 9a53feb to a79f52d on March 31, 2021 15:57
@senhuang42:

Some benchmarks regarding NEON performance (Samsung Galaxy S20):

silesia.tar - NEON row matchfinder:
 5#silesia.tar       : 211950592 ->  63794305 (3.322), 113.6 MB/s , 834.5 MB/s
 6#silesia.tar       : 211950592 ->  62967928 (3.366), 109.6 MB/s , 850.1 MB/s
 7#silesia.tar       : 211950592 ->  61471202 (3.448),  80.5 MB/s , 909.6 MB/s
 8#silesia.tar       : 211950592 ->  60901887 (3.480),  64.7 MB/s , 936.1 MB/s
 9#silesia.tar       : 211950592 ->  59917881 (3.537),  49.7 MB/s , 932.2 MB/s
10#silesia.tar       : 211950592 ->  59283096 (3.575),  40.6 MB/s , 898.5 MB/s
11#silesia.tar       : 211950592 ->  59140552 (3.584),  37.2 MB/s , 891.0 MB/s
12#silesia.tar       : 211950592 ->  58630838 (3.615),  25.9 MB/s , 899.2 MB/s

silesia.tar - scalar row matchfinder:
 5#silesia.tar       : 211950592 ->  63794305 (3.322), 105.2 MB/s , 821.9 MB/s
 6#silesia.tar       : 211950592 ->  62967928 (3.366), 102.1 MB/s , 837.8 MB/s
 7#silesia.tar       : 211950592 ->  61471202 (3.448),  75.3 MB/s , 894.5 MB/s
 8#silesia.tar       : 211950592 ->  60901887 (3.480),  61.0 MB/s , 921.8 MB/s
 9#silesia.tar       : 211950592 ->  59917881 (3.537),  47.5 MB/s , 918.8 MB/s
10#silesia.tar       : 211950592 ->  59283096 (3.575),  38.8 MB/s , 882.4 MB/s
11#silesia.tar       : 211950592 ->  59140552 (3.584),  35.2 MB/s , 873.6 MB/s
12#silesia.tar       : 211950592 ->  58630838 (3.615),  24.2 MB/s , 884.5 MB/s

silesia.tar - no row matchfinder
 5#silesia.tar       : 211950592 ->  63981847 (3.313),  95.4 MB/s , 825.2 MB/s
 6#silesia.tar       : 211950592 ->  62882463 (3.371),  63.9 MB/s , 834.3 MB/s
 7#silesia.tar       : 211950592 ->  61327719 (3.456),  49.3 MB/s , 898.0 MB/s
 8#silesia.tar       : 211950592 ->  60745717 (3.489),  41.9 MB/s , 921.8 MB/s
 9#silesia.tar       : 211950592 ->  60174265 (3.522),  28.4 MB/s , 935.6 MB/s
10#silesia.tar       : 211950592 ->  59525124 (3.561),  18.5 MB/s , 903.0 MB/s
11#silesia.tar       : 211950592 ->  59230515 (3.578),  13.9 MB/s , 894.8 MB/s
12#silesia.tar       : 211950592 ->  58769028 (3.607),  9.26 MB/s , 897.8 MB/s

A mid-to-high single-digit percentage speed improvement. I'd be fine either keeping or removing NEON support, though I don't see much harm in leaving it in, since it's well encapsulated behind a single #ifdef block.

Regarding dictionaries (on devserver, github-users corpus):

--no-row-match-finder: DDS
 5# 9114 files       :   7484607 ->    715168 (10.47), 112.3 MB/s ,1247.2 MB/s 
 6# 9114 files       :   7484607 ->    713140 (10.50),  71.4 MB/s ,1281.5 MB/s 
 7# 9114 files       :   7484607 ->    717198 (10.44),  61.0 MB/s ,1157.1 MB/s 
 8# 9114 files       :   7484607 ->    718963 (10.41),  50.8 MB/s ,1137.3 MB/s 
 9# 9114 files       :   7484607 ->    727015 (10.29),  41.1 MB/s ,1102.1 MB/s 
10# 9114 files       :   7484607 ->    734618 (10.19),  30.2 MB/s ,1052.0 MB/s 
11# 9114 files       :   7484607 ->    738799 (10.13),  42.1 MB/s ,1061.5 MB/s 
12# 9114 files       :   7484607 ->    738201 (10.14),  40.7 MB/s ,1027.4 MB/s

--no-row-match-finder: DMS
 5# 9114 files       :   7484607 ->    719205 (10.41), 108.8 MB/s ,1227.3 MB/s 
 6# 9114 files       :   7484607 ->    713140 (10.50),  70.6 MB/s ,1262.1 MB/s 
 7# 9114 files       :   7484607 ->    717198 (10.44),  61.1 MB/s ,1155.9 MB/s 
 8# 9114 files       :   7484607 ->    718963 (10.41),  49.1 MB/s ,1109.9 MB/s 
 9# 9114 files       :   7484607 ->    727015 (10.29),  39.7 MB/s ,1128.2 MB/s 
10# 9114 files       :   7484607 ->    734618 (10.19),  27.4 MB/s ,1087.0 MB/s 
11# 9114 files       :   7484607 ->    738799 (10.13),  41.4 MB/s ,1044.3 MB/s 
12# 9114 files       :   7484607 ->    738201 (10.14),  41.9 MB/s ,1037.3 MB/s 


--row-match-finder: DDS
 5# 9114 files       :   7484607 ->    714943 (10.47),  92.2 MB/s ,1270.3 MB/s 
 6# 9114 files       :   7484607 ->    713036 (10.50),  61.5 MB/s ,1312.3 MB/s 
 7# 9114 files       :   7484607 ->    717014 (10.44),  50.8 MB/s ,1165.9 MB/s 
 8# 9114 files       :   7484607 ->    719434 (10.40),  43.2 MB/s ,1121.0 MB/s 
 9# 9114 files       :   7484607 ->    727215 (10.29),  35.9 MB/s ,1095.5 MB/s 
10# 9114 files       :   7484607 ->    727215 (10.29),  37.7 MB/s ,1135.3 MB/s 
11# 9114 files       :   7484607 ->    738799 (10.13),  40.4 MB/s ,1064.4 MB/s 
12# 9114 files       :   7484607 ->    738201 (10.14),  40.6 MB/s ,1053.3 MB/s


--row-match-finder: DMS
 5# 9114 files       :   7484607 ->    715591 (10.46),  85.6 MB/s ,1252.9 MB/s 
 6# 9114 files       :   7484607 ->    713574 (10.49),  60.1 MB/s ,1280.3 MB/s 
 7# 9114 files       :   7484607 ->    717583 (10.43),  49.1 MB/s ,1135.5 MB/s 
 8# 9114 files       :   7484607 ->    720191 (10.39),  42.6 MB/s ,1135.5 MB/s 
 9# 9114 files       :   7484607 ->    728295 (10.28),  37.3 MB/s ,1123.1 MB/s 
10# 9114 files       :   7484607 ->    728295 (10.28),  34.8 MB/s ,1066.1 MB/s 
11# 9114 files       :   7484607 ->    738799 (10.13),  41.0 MB/s , 994.5 MB/s 
12# 9114 files       :   7484607 ->    738201 (10.14),  39.7 MB/s ,1016.4 MB/s


--no-row-match-finder: no dict
 5# 9114 files       :   7484607 ->   2471555 (3.028),  64.6 MB/s , 278.6 MB/s 
 6# 9114 files       :   7484607 ->   2471608 (3.028),  61.4 MB/s , 276.5 MB/s 
 7# 9114 files       :   7484607 ->   2471609 (3.028),  60.9 MB/s , 276.8 MB/s 
 8# 9114 files       :   7484607 ->   2471609 (3.028),  60.4 MB/s , 278.1 MB/s 
 9# 9114 files       :   7484607 ->   2471608 (3.028),  40.8 MB/s , 279.9 MB/s 
10# 9114 files       :   7484607 ->   2471608 (3.028),  40.7 MB/s , 281.0 MB/s 
11# 9114 files       :   7484607 ->   2471638 (3.028),  23.2 MB/s , 283.3 MB/s 
12# 9114 files       :   7484607 ->   2454019 (3.050),  19.1 MB/s , 283.0 MB/s 

--row-match-finder: no dict
 5# 9114 files       :   7484607 ->   2462015 (3.040),  52.5 MB/s , 277.3 MB/s 
 6# 9114 files       :   7484607 ->   2462012 (3.040),  50.1 MB/s , 282.6 MB/s 
 7# 9114 files       :   7484607 ->   2462013 (3.040),  51.8 MB/s , 283.7 MB/s 
 8# 9114 files       :   7484607 ->   2462013 (3.040),  49.3 MB/s , 261.5 MB/s 
 9# 9114 files       :   7484607 ->   2471608 (3.028),  40.8 MB/s , 289.7 MB/s 
10# 9114 files       :   7484607 ->   2471608 (3.028),  41.8 MB/s , 283.9 MB/s 
11# 9114 files       :   7484607 ->   2471638 (3.028),  22.6 MB/s , 279.9 MB/s 
12# 9114 files       :   7484607 ->   2454019 (3.050),  19.0 MB/s , 281.5 MB/s

Performance with dictionaries seems reasonable. DMS is slower than DDS. And since these files are small, row hash isn't very advantageous anyway (and for this dataset, it would not be activated by default).

@senhuang42 force-pushed the row_hash2 branch 3 times, most recently from 21c6908 to 8c10476 on March 31, 2021 16:55
@terrelln left a comment:

This is looking really good! Just a few mostly minor comments. We should be ready to land this once they are fixed and #2546 is merged.

Can you also add force-enabled and force-disabled row-based hash tables to the regression test, with both 16 & 32 entry rows, and with/without a dictionary in DMS mode?

if (dictMode == ZSTD_dedicatedDictSearch) {
    ml = ZSTD_dedicatedDictSearch_lazy_search(offsetPtr, ml, nbAttempts, dms,
                                              ip, iLimit, prefixStart, curr, dictLimit, ddsIdx);
} else if (dictMode == ZSTD_dictMatchState) {
Contributor:

I didn't include prefetching at all since it seemed like it was not good on smaller files/smaller hash tables, and DMSes typically aren't huge - but in either case, yeah, we should measure that, and I'll add it to the to-do list.

Prefetching in the DMS is very different than prefetching for the file's own hash table. We're inserting into the hash table as we compress, so those writes mean it is hot and likely already in L1.

However, the dictionary is read-only, and likely starts out in RAM (cold), so we'll see lots of cache misses. @felixhandte found that prefetching was very beneficial for DDS.

@senhuang42 force-pushed the row_hash2 branch 3 times, most recently from f08e097 to 69a6516 on April 2, 2021 16:02
@senhuang42 commented Apr 2, 2021

I've collected a bunch of benchmarks comparing DDS vs. DMS+attach vs. DMS+copy vs. forceLoad, in addition to row hash vs. non-row-hash methods, and I'm currently parsing through the data (as well as the updated results.csv) to make sense of it and see if there's anything particularly worrying, or if it reveals any other bugs in how dictionaries are handled.

Still trying to figure out how best to present this info, but the raw data is here if you're curious (some of the levels aren't using row hash for smaller inputs, though it's reported anyway):
DDS with row: https://pastebin.com/wLuHbkG2
DMS+attach with row: https://pastebin.com/iqC0fMRU
DMS+copy with row: https://pastebin.com/5QBr1iaa
Force Load with row: https://pastebin.com/HpW4Pnn6

DDS no row: https://pastebin.com/wm6d5m92
DMS+attach no row: https://pastebin.com/QFp9a9fW
DMS+copy no row: https://pastebin.com/hqdVKhjc
Force Load with no row: https://pastebin.com/bwqqGNCj

@terrelln left a comment:

Looks good to me!

Please just look into the "with dict load" regression test before landing. It's always giving exactly the same numbers, no matter what level, or whether row is enabled/disabled. That seems very suspicious.

If it is broken, you can fix it in a separate PR, as long as it isn't hiding a bug (unlikely).

github, level 5 row 2 with dict dms, advanced one pass, 38758
github, level 5 row 2 with dict dds, advanced one pass, 38737
github, level 5 row 2 with dict copy, advanced one pass, 38759
github, level 5 row 2 with dict load, advanced one pass, 42252
Contributor:

I'm suspicious of this result. Why do we get exactly the same "with dict load" number for row = 1 and row = 2? The rest are all different.

@senhuang42 commented Apr 2, 2021

I was actually just looking into this, since I noticed it initially in my separate benchmarking as well. It appears to be an existing bug here:

if (cctx->cdict)
    params.compressionLevel = cctx->cdict->compressionLevel; /* let cdict take priority in terms of compression level */

If the cdict exists and a compression level isn't set, and we reload the dictionary, we end up taking the cdict's level (which is fixed) rather than the cParams. I'll put up a separate PR for this, since it affects non-row matchfinder results too.

Contributor Author:

You can replicate this behavior if you hardcode the attach dict pref to forceLoad and run benchmarks like -b1e19. Everything is just compressed with default compression level.

Contributor:

Yeah, I figured it was something like that, nice find!

I'm fine with fixing it in a separate PR; we'll just need to double-check the row-based matchfinder results after that PR to make sure they make sense.

github, level 7 row 1 with dict dms, advanced one pass, 38771
github, level 7 row 1 with dict dds, advanced one pass, 38771
github, level 7 row 1 with dict copy, advanced one pass, 38745
github, level 7 row 1 with dict load, advanced one pass, 42252
Contributor:

This is the same number too.

github, level 9 with dict, advanced one pass, 39437
github, level 9 with dict dms, advanced one pass, 39437
github, level 9 with dict dds, advanced one pass, 39338
github, level 9 with dict copy, advanced one pass, 39398
github, level 9 with dict load, advanced one pass, 42252
Contributor:

And this is the same number.

github.tar, level 5 row 2 with dict dms, advanced one pass, 39024
github.tar, level 5 row 2 with dict dds, advanced one pass, 39023
github.tar, level 5 row 2 with dict copy, advanced one pass, 39040
github.tar, level 5 row 2 with dict load, advanced one pass, 37956
Contributor:

This too.

@senhuang42 commented Apr 7, 2021

Some comparisons of DDS vs. DMS attach vs. DMS copy to help inform a decision on what to do about the three methods. Results for force dict reloading are included too, though those are separate from the main topic here.

It actually seems like one of DMS+attach or DMS+copy is typically faster than DDS, except on the smaller srcSizes (where non-row has the advantage). So if the auto-selection configuration as it stands is reasonably accurate, these results actually seem to indicate that we don't need DDS.

For the case that we typically care about for dictionary compression (smaller srcSize), DMS+attach actually seems to be faster than DDS, and DMS+copy is still pretty good for the larger srcSizes. For reference, the attachDictCutoff is 32KB for all of the affected strategies.

All of the results below are for level 7, with silesia.tar cut up into blocks of srcSize bytes. The levels for which row hash is used tell a pretty similar story across each set of srcSize params.

Speed:
[speed comparison charts not reproduced in this capture]

Ratio:
1K srcsize:

| dictsize | 8000 | 64000 | 112640 | 250000 | 500000 |
|----------|------|-------|--------|--------|--------|
| DDS | 2.093 | 2.220 | 2.314 | 2.399 | 2.428 |
| DMSCopy | 2.093 | 2.219 | 2.309 | 2.394 | 2.421 |
| DMSAttach | 2.093 | 2.219 | 2.309 | 2.394 | 2.421 |
| DDS_norow | 2.100 | 2.228 | 2.324 | 2.411 | 2.439 |
| DMSCopy_norow | 2.100 | 2.228 | 2.324 | 2.411 | 2.439 |
| DMSAttach_norow | 2.100 | 2.228 | 2.324 | 2.411 | 2.439 |
| load | 2.029 | 2.021 | 2.043 | 2.036 | 2.035 |
| load_norow | 2.076 | 2.044 | 2.062 | 2.050 | 2.046 |

10K:

| dictsize | 8000 | 64000 | 112640 | 250000 | 500000 |
|----------|------|-------|--------|--------|--------|
| DDS | 2.719 | 2.813 | 2.866 | 2.935 | 2.934 |
| DMSCopy | 2.719 | 2.813 | 2.863 | 2.933 | 2.930 |
| DMSAttach | 2.719 | 2.813 | 2.863 | 2.933 | 2.930 |
| DDS_norow | 2.723 | 2.815 | 2.867 | 2.937 | 2.936 |
| DMSCopy_norow | 2.723 | 2.815 | 2.867 | 2.937 | 2.936 |
| DMSAttach_norow | 2.723 | 2.815 | 2.867 | 2.937 | 2.936 |
| load | 2.719 | 2.732 | 2.733 | 2.728 | 2.720 |
| load_norow | 2.723 | 2.803 | 2.818 | 2.791 | 2.766 |

100K:

| dictsize | 8000 | 64000 | 112640 | 250000 | 500000 |
|----------|------|-------|--------|--------|--------|
| DDS | 3.086 | 3.167 | 3.181 | 3.205 | 3.191 |
| DMSCopy | 3.088 | 3.167 | 3.180 | 3.206 | 3.190 |
| DMSAttach | 3.087 | 3.167 | 3.180 | 3.205 | 3.190 |
| DDS_norow | 3.096 | 3.168 | 3.183 | 3.207 | 3.191 |
| DMSCopy_norow | 3.127 | 3.167 | 3.182 | 3.207 | 3.191 |
| DMSAttach_norow | 3.127 | 3.168 | 3.183 | 3.207 | 3.191 |
| load | 3.132 | 3.165 | 3.178 | 3.166 | 3.190 |
| load_norow | 3.133 | 3.166 | 3.179 | 3.166 | 3.191 |

200K:

| dictsize | 8000 | 64000 | 112640 | 250000 | 500000 |
|----------|------|-------|--------|--------|--------|
| DDS | 3.119 | 3.237 | 3.245 | 3.269 | 3.254 |
| DMSCopy | 3.122 | 3.236 | 3.244 | 3.269 | 3.253 |
| DMSAttach | 3.119 | 3.236 | 3.245 | 3.269 | 3.254 |
| DDS_norow | 3.143 | 3.234 | 3.249 | 3.271 | 3.255 |
| DMSCopy_norow | 3.180 | 3.232 | 3.248 | 3.271 | 3.255 |
| DMSAttach_norow | 3.179 | 3.234 | 3.249 | 3.271 | 3.255 |
| load | 3.217 | 3.203 | 3.212 | 3.235 | 3.253 |
| load_norow | 3.218 | 3.203 | 3.213 | 3.235 | 3.255 |

1M:

| dictsize | 8000 | 64000 | 112640 | 250000 | 500000 |
|----------|------|-------|--------|--------|--------|
| DDS | 3.145 | 3.332 | 3.334 | 3.383 | 3.381 |
| DMSCopy | 3.149 | 3.331 | 3.332 | 3.383 | 3.379 |
| DMSAttach | 3.145 | 3.332 | 3.334 | 3.383 | 3.381 |
| DDS_norow | 3.189 | 3.323 | 3.344 | 3.379 | 3.386 |
| DMSCopy_norow | 3.234 | 3.321 | 3.343 | 3.376 | 3.384 |
| DMSAttach_norow | 3.231 | 3.323 | 3.344 | 3.379 | 3.386 |
| load | 3.346 | 3.354 | 3.358 | 3.368 | 3.379 |
| load_norow | 3.351 | 3.359 | 3.362 | 3.373 | 3.384 |

@senhuang42 force-pushed the row_hash2 branch 2 times, most recently from 42f78ce to 4d63d6e on April 7, 2021 17:26
@terrelln commented Apr 7, 2021

I'm suspicious of those results, because I don't see a large difference between DDS & DMSAttach_norow, where we know that DDS is significantly faster than DMSAttach_norow.

What is your dictionary? Is it possible that you get about the same compression ratio with or without a dictionary? In that case the matchfinder would basically never find matches within the dictionary, so all methods would appear to be about the same speed.

@terrelln commented Apr 7, 2021

Silesia isn't a good corpus for dictionary benchmarking, because it contains several files, and each file is completely independent. I recommend running this on one of the dictionary compression corpora.

@senhuang42:

I'm suspicious of those results, because I don't see a large difference between DDS & DMSAttach_norow, where we know that DDS is significantly faster than DMSAttach_norow.

What is your dictionary? Is it possible that you get about the same compression ratio with or without a dictionary? In that case the matchfinder would basically never find matches within the dictionary, so all methods would appear to be about the same speed.

The dictionary I used is just a dictionary trained on silesia.tar split up into 128K blocks. Though that's a good point: I suspect that a dictionary of size 8K-500K for silesia.tar can't possibly be very good, since the files are so different. A quick spot check of the dict vs. no-dict sizes confirms that: only a few percent better compression.

@senhuang42 merged commit 56421f3 into facebook:dev on Apr 8, 2021
@aqrit commented Apr 29, 2021

Clang has a codegen issue that costs ZSTD_vmovmaskq_u8 3 instructions.

If ZSTD_row_getMatchMask is passed rowEntries == 32, then ZSTD_vmovmaskq_u8 is called twice (once for every 16 bytes). This is unnecessary; combining the 128-bit halves together would save several operations.

The following snippet avoids clang's vsra issue (it still uses a vsra, but the optimizer is not smart enough to realize it could be lowered to a shift + or):

#include <arm_neon.h>
#include <stdint.h>

uint32_t NEON_i8x16_MatchMask (const uint8_t* ptr, uint8_t match_byte)
{
    uint8x16_t src = vld1q_u8(ptr);
    uint8x16_t dup = vdupq_n_u8(match_byte);
    uint16x8_t cmp = vreinterpretq_u16_u8(vceqq_u8(src, dup)); // #1

    uint16x8_t t0 = vshlq_n_u16(cmp, 7); // #2
    uint32x4_t t1 = vreinterpretq_u32_u16(vsriq_n_u16(t0, t0, 14)); // #3
    uint64x2_t t2 = vreinterpretq_u64_u32(vshrq_n_u32(t1, 14)); // #4
    uint8x16_t t3 = vreinterpretq_u8_u64(vsraq_n_u64(t2, t2, 28));
    return vgetq_lane_u8(t3, 0) | (vgetq_lane_u8(t3, 8) << 8);

/*
1) ...|77777777|66666666|55555555|44444444|33333333|22222222|11111111|00000000
2) ...|76666666|6xxxxxxx|54444444|4xxxxxxx|32222222|2xxxxxxx|10000000|0xxxxxxx
3) ...|76666666|6xxxxx76|54444444|4xxxxx54|32222222|2xxxxx32|10000000|0xxxxx10
4) ...|xxxxxxxx|xxxxxx76|6666666x|xxxx7654|xxxxxxxx|xxxxxx32|2222222x|xxxx3210
*/
}

A 256-bit version could look like this:

uint32_t NEON_i8x32_MatchMask (const unsigned char* ptr, unsigned char val)
{
    uint16x8x2_t src = vld2q_u16((unsigned short*)ptr);
    uint8x16_t dup_val = vdupq_n_u8(val);
    uint8x16_t cmp0 = vceqq_u8(vreinterpretq_u8_u16(src.val[0]), dup_val);
    uint8x16_t cmp1 = vceqq_u8(vreinterpretq_u8_u16(src.val[1]), dup_val);

    uint8x8_t t0 = vreinterpret_u8_s8(vqmovn_s16(vreinterpretq_s16_u8(cmp0)));
    uint8x8_t t1 = vreinterpret_u8_s8(vqmovn_s16(vreinterpretq_s16_u8(cmp1)));
    uint8x8_t t2 = vsri_n_u8(t1, t0, 2); // ...|FEDCxxxx|BA98xxxx|7654xxxx|3210xxxx
    uint8x8x2_t t3 = vuzp_u8(t2, t0); // **see note
    uint8x8_t t4 = vsri_n_u8(t3.val[1], t3.val[0], 4);
    return vget_lane_u32(vreinterpret_u32_u8(t4), 0);

/* 	vuzp_u8 splits the input into odd and even lanes
	we don't care about the hi 32-bits; just pass old garbage as the hi arg.
	(note: don't pass `t2` twice as the cause an extra mov)
*/

/*
pack with signed saturation (vqmovn_s16):
0x0000 -> 0x00
0x00FF -> 0x7F
0xFF00 -> 0x80
0xFFFF -> 0xFF
*/
}

@terrelln:

Thanks for the feedback @aqrit!

Definitely most of our time was spent optimizing the performance on x86. We'd love to take a PR that improves NEON performance!

@aqrit commented May 20, 2021

The scalar version of ZSTD_Vec128_cmpMask8 may not be compatible with big-endian memory byte order?
Is big-endian support desired? If not, can the other big-endian paths be ripped out?

I'm considering creating a PR. I believe a SWAR approach for scalar ZSTD_Vec128_cmpMask8 would use one-fourth as many instructions.

#include <stddef.h>
#include <string.h>

size_t SWAR_MatchMask(const unsigned char* s, unsigned char c, size_t n) {
//	assert((n == 16) || (n == 32));
//	assert((sizeof(size_t) == 4) || (sizeof(size_t) == 8));

	size_t shift = (sizeof(size_t) * 8) - sizeof(size_t);
	size_t mul = 0x0204081;
	size_t cmp = c * 0x01010101;
	size_t x7F = 0x7F7F7F7F;
	size_t matches = 0;
	
 	if (8 == sizeof(size_t)) {
		mul |= mul << 28;
		cmp |= cmp << 32;
		x7F |= x7F << 32;
	}
	
	for (size_t i = 0; i < n; i += sizeof(size_t)) {
		size_t v;
		memcpy(&v, &s[i], sizeof(size_t));
		v ^= cmp;
		v = (~(((v & x7F) + x7F) | v | x7F) * mul) >> shift;
		matches |= v << i;
	}
	return matches;
}

Also, maybe set ZSTD_ROW_HASH_TAG_OFFSET to 16 and use aligned reads?
Each row in the tagTable has (at least) 15 bytes of padding and is always aligned to (at least) a 32 byte boundary..?

@terrelln:

Is big-endian support desired? If not, can the other big-endian paths be ripped out?

Yes, it is desired. We do have big-endian tests, but I suspect this bug wasn't caught because it would only cause a ratio regression, not a crash / bad output. And we only run our ratio regression tests on little endian platforms.

@senhuang42 can you add a test that would expose this? We'd have to add it to one of the tests we run on big-endian platforms, and make sure we run it with vectorization disabled.

I'm considering creating a PR. I believe a SWAR approach for scalar ZSTD_Vec128_cmpMask8 would use one-fourth as many instructions.

Yeah, we'd welcome a PR that improves the scalar version. We've spent most of our time looking at optimizing the vectorized code, so there are probably wins here.

Each row in the tagTable has (at least) 15 bytes of padding and is always aligned to (at least) a 32 byte boundary..?

Yeah, we align the tagTable so the whole row fits in 1 cache line.
