Conversation

@senhuang42 commented Feb 11, 2021

Follow-ups after this PR:

  • Paramgrill to re-tune the compression levels.
  • Decide what to do with DMS/DDS

This PR introduces the row-based lazy matchfinder, iterating on the prototype from @terrelln, and presents some initial results.

The comment above ZSTD_RowFindBestMatch_generic() explains at a high level how the matchfinder works.
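As background for the numbers below: each hash-table row stores a small per-entry "tag" (extra hash bits), and candidate positions are found by comparing the current position's tag against the whole row at once and collapsing the result into a bitmask. A minimal SSE2 sketch of that tag scan (the function and variable names here are illustrative, not the PR's actual helpers):

#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Returns a 16-bit mask with bit i set when tagRow[i] == tag.
 * Each set bit marks a row entry worth a full match check. */
static uint16_t row_tag_mask_sse2(const uint8_t* tagRow, uint8_t tag)
{
    const __m128i tags = _mm_loadu_si128((const __m128i*)tagRow);
    const __m128i dup  = _mm_set1_epi8((char)tag);
    const __m128i eq   = _mm_cmpeq_epi8(tags, dup);  /* 0xFF in lanes where the tag matches */
    return (uint16_t)_mm_movemask_epi8(eq);          /* one bit per byte lane */
}

The mask is then walked with the matches &= (matches - 1) loop visible in the diff further down, so only tag hits pay for a full match comparison.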

Preliminary benchmarks: parameters/compression levels are just approximations, so there will be some compression ratio differences. We see some pretty huge speed gains, though.

gcc: silesia.tar, -b5e12:

| clevel | dev ratio | row_hash ratio | dev speed | row_hash speed | speed delta |
|--------|-----------|----------------|-----------|----------------|-------------|
| 5  | 3.313 | 3.322 | 101.3 MB/s | 125.2 MB/s | +23.8% |
| 6  | 3.371 | 3.366 |  80.4 MB/s | 119.3 MB/s | +48.5% |
| 7  | 3.456 | 3.448 |  59.7 MB/s |  85.9 MB/s | +43.9% |
| 8  | 3.489 | 3.480 |  47.0 MB/s |  68.5 MB/s | +45.7% |
| 9  | 3.522 | 3.537 |  35.5 MB/s |  56.2 MB/s | +58.3% |
| 10 | 3.561 | 3.575 |  30.5 MB/s |  51.8 MB/s | +69.8% |
| 11 | 3.578 | 3.584 |  25.0 MB/s |  47.4 MB/s | +89.6% |
| 12 | 3.607 | 3.615 |  17.2 MB/s |  36.3 MB/s | +111.0% 🚀 |

gcc: enwik7, -b5e12:

| clevel | dev ratio | row_hash ratio | dev speed | row_hash speed | speed delta |
|--------|-----------|----------------|-----------|----------------|-------------|
| 5  | 2.896 | 2.907 | 83.8 MB/s | 102.0 MB/s | +21.7% |
| 6  | 2.958 | 2.951 | 62.3 MB/s |  95.3 MB/s | +52.9% |
| 7  | 3.060 | 3.048 | 45.4 MB/s |  65.9 MB/s | +45.1% |
| 8  | 3.099 | 3.087 | 34.7 MB/s |  51.5 MB/s | +48.4% |
| 9  | 3.126 | 3.145 | 24.3 MB/s |  41.4 MB/s | +70.4% |
| 10 | 3.171 | 3.186 | 20.6 MB/s |  37.8 MB/s | +83.5% |
| 11 | 3.194 | 3.197 | 16.8 MB/s |  34.9 MB/s | +107.7% |
| 12 | 3.219 | 3.226 | 10.9 MB/s |  26.9 MB/s | +146.7% 🚀 |

On small sources we are roughly neutral compared to zstd 1.4.8 if we remove prefetching. Currently the templating on the prefetching is probably not done properly and slows everything down in general. The "no prefetch" column in the following benchmark shows the speed if the prefetching code is simply removed.

gcc: enwik7, -b4e8 -B1KB

| clevel | dev ratio | row_hash ratio | dev speed | row_hash speed | row_hash speed (no prefetch) |
|--------|-----------|----------------|-----------|----------------|------------------------------|
| 4 | 1.765 | 1.761 | 60.1 MB/s | 57.7 MB/s | 58.8 MB/s |
| 5 | 1.785 | 1.782 | 52.8 MB/s | 49.7 MB/s | 50.4 MB/s |
| 6 | 1.787 | 1.784 | 47.5 MB/s | 44.6 MB/s | 46.7 MB/s |
| 7 | 1.787 | 1.784 | 47.4 MB/s | 46.0 MB/s | 49.9 MB/s |
| 8 | 1.787 | 1.784 | 47.3 MB/s | 46.0 MB/s | 49.9 MB/s |

@senhuang42 changed the title from "Row hash2" to "SIMD Row-based Matchfinder" on Feb 11, 2021
@senhuang42 changed the title from "SIMD Row-based Matchfinder" to "SIMD Row Based Matchfinder" on Feb 11, 2021
@senhuang42 changed the title from "SIMD Row Based Matchfinder" to "[rfc] SIMD Row Based Matchfinder" on Feb 11, 2021
@terrelln:

Run-time detection of SSE support.

If we just need SSE, then we can probably do a compile-time check. It's only for AVX that we need runtime checks.
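For what it's worth, a compile-time gate for the SSE2 path could look roughly like this (a sketch; the macro name is made up):

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && (_M_IX86_FP >= 2))
#  define ROW_MF_HAS_SSE2 1   /* SSE2 is baseline on x86-64, so no cpuid check needed */
#else
#  define ROW_MF_HAS_SSE2 0   /* fall back to the scalar tag scan */
#endif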

Comment on lines 976 to 1348
for (; (matches > 0) && (nbAttempts > 0); --nbAttempts, matches &= (matches - 1)) {
    U32 const matchPos = (head + ZSTD_VecMask_next(matches)) & rowMask;
    U32 const matchIndex = row[matchPos];
    if (matchIndex < lowLimit)
        break;
    if (shouldPrefetch) {
        if ((dictMode != ZSTD_extDict) || matchIndex >= dictLimit) {
            PREFETCH_L1(base + matchIndex);
        } else {
            PREFETCH_L1(dictBase + matchIndex);
        }
    }
    matchBuffer[numMatches++] = matchIndex;
Contributor:

Have you tried inserting the scalar fallback here instead?

In the SSE version we iterate over the matches mask.

In the scalar version, we could iterate over the tags.

for (U32 pos = 0; pos < 16; ++pos) {
  U32 const matchPos = (head + pos) & rowMask;
  U32 const matchIndex = row[matchPos];
  if (tagRow[matchPos] != tag) continue;
  if (matchIndex < lowLimit) break;
  if (nbAttempts == 0) break;
  --nbAttempts;

  // Rest is the same

You could also try a cmov-style version to try to avoid branches. Though, who knows if the architecture supports cmov or not, since this code won't be executing on x86.

for (U32 pos = 0; pos < 16; ++pos) {
  U32 const matchPos = (head + pos) & rowMask;
  U32 const matchIndex = row[matchPos];
  int const tagMatches = (tagRow[matchPos] == tag) ? 1 : 0;
  if (matchIndex < lowLimit) break;
  if (nbAttempts == 0) break;
  nbAttempts -= tagMatches;

  // Rest is the same except don't ++ numMatches & don't prefetch b/c it is probably disabled anyways because not x86

  numMatches += tagMatches;

It's probably not right to be measuring performance on x86-64, since we know for a fact that the scalar code won't be executing on x86-64.

You could try executing in 32-bit mode (-m32). That would give a slightly more interesting comparison. And/or you could measure on ARM, though that is harder to set up. Either of those would be a more interesting comparison than x86-64 speed.

Contributor Author:

Definitely an interesting idea, and a more natural way to go about it - I'll give it a shot!

Contributor Author:

After some more measurements, it actually doesn't seem like either of these offers a speed improvement compared to 32-bit with no SIMD. On very small files it seems to provide some benefit, but we don't use the row-based matchfinder for such small files anyway.

@senhuang42 force-pushed the row_hash2 branch 2 times, most recently from f5a867f to 1852ef5 on February 23, 2021 17:17
@senhuang42 changed the title from "[rfc] SIMD Row Based Matchfinder" to "SIMD Row Based Matchfinder 🚀" on Feb 25, 2021
@terrelln left a comment:

I've looked at the integration portion, but not zstd_lazy.c this pass.

tests/fuzzer.c Outdated
Comment on lines 1745 to 1711
size_t const target_nodict_cSize[22+1] = { 3840, 3770, 3870, 3830, 3770+1,
                                           3770+5, 3770+5, 3770+5, 3750+5, 3750+5,
Contributor:

It's fine to change these, but I don't know if they were a tight bound before, or fairly loose. So make sure there isn't a large regression.

@terrelln:

The appveyor + travis tests all seem like legitimate failures

@senhuang42 commented Mar 2, 2021

Advanced param gating

This is now gated behind an advanced parameter, useRowMatchfinder, which toggles a new set of CParams to select from. The new row-based matchfinder uses three new compression strategies: ZSTD_{greedy, lazy, lazy2}_row. This presents some issues:

  1. Some public APIs (like ZSTD_getCParams() and ZSTD_getParams()) will still always return the default set of CParams to avoid having to change their interface.
  2. Though zstd.h says that we're allowed to add new strategies, this introduces some integration difficulties with certain parts of the code that assume that the strategies are mapped to the previous numerical values.
  3. The CI in general can't really run with this advanced param enabled. I've tested that the tests pass if I hardcode useRowMatchfinder = 1, but this is not ideal.

These ongoing changes have introduced about a 1-2% decrease in speed compared to the initial PR.
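For illustration, opting in from user code would look roughly like this (a sketch: the exact parameter name/enum and its value convention are whatever this PR ends up exposing in zstd.h; 1 is used here only because the comment above mentions hardcoding useRowMatchfinder = 1):

#define ZSTD_STATIC_LINKING_ONLY  /* assumed: the new param sits behind the experimental gate */
#include <zstd.h>

static size_t compress_with_row_mf(void* dst, size_t dstCap, const void* src, size_t srcSize)
{
    ZSTD_CCtx* const cctx = ZSTD_createCCtx();
    size_t ret;
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_compressionLevel, 7);
    /* Illustrative parameter name; forces the row matchfinder on in this sketch. */
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_useRowMatchFinder, 1);
    ret = ZSTD_compress2(cctx, dst, dstCap, src, srcSize);
    ZSTD_freeCCtx(cctx);
    return ret;
}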

Memory usage comparison

Parameters have been manually tuned: we're roughly neutral on memory. We can optimize further by making chainLog = 0 a valid value in zstd.
Measured as maximum resident set size ("maxresident" from /usr/bin/time) during CLI compression.
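(For example, an illustrative invocation, not necessarily the exact one used: /usr/bin/time zstd -10 -f silesia.tar -o /tmp/silesia.tar.zst; the "maxresident" figure in time's default output is the number tabulated.)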

silesia.tar: entire file

| level | row_hash5 (mem in KB) | dev (mem in KB) |
|-------|-----------------------|-----------------|
| 5  |  48132 |  48100 |
| 6  |  48100 |  48840 |
| 7  |  48028 |  48972 |
| 8  |  47824 |  48944 |
| 9  |  50968 |  50956 |
| 10 |  94512 |  94520 |
| 11 | 106772 | 106764 |
| 12 | 106660 | 106660 |

enwik7: first 250K

| level | row_hash5 (mem in KB) | dev (mem in KB) |
|-------|-----------------------|-----------------|
| 5  | 3812 | 3812 |
| 6  | 5088 | 4576 |
| 7  | 6108 | 6108 |
| 8  | 6128 | 6128 |
| 9  | 6108 | 6108 |
| 10 | 6108 | 6104 |

enwik7: first 64K

| level | row_hash5 (mem in KB) | dev (mem in KB) |
|-------|-----------------------|-----------------|
| 5  | 3424 | 3428 |
| 6  | 3416 | 3420 |
| 7  | 3416 | 3420 |
| 8  | 3416 | 3420 |
| 9  | 3412 | 3420 |
| 10 | 3412 | 3412 |

32-bit vs. 64-bit comparisons

A comparison of the row hash (useRowMatchfinder = 1) with no SIMD, in 32-bit mode, vs. the default params (useRowMatchfinder = 0). I validated the default params (useRowMatchfinder = 0) against a 1.4.8 checkout to make sure they're still the same.

The regressions don't look too bad. Also, 32-bit + SIMD does perform noticeably better than without. Worst case, 32-bit users can just not enable this advanced parameter. But when we get rid of the advanced parameter, we'll at least want to pay some attention to the impact here.

(Note: no comparisons under 16 KB, since at that point both configurations use the same parameters, i.e. the old lazy matchfinder.)
useRowMatchfinder=1 parameters, 32-bit, no SIMD
1MB:
 5#enwik7            :  10000000 ->   3440416 (2.907),  62.3 MB/s , 459.2 MB/s 
 6#enwik7            :  10000000 ->   3363547 (2.973),  56.0 MB/s , 477.7 MB/s 
 7#enwik7            :  10000000 ->   3280554 (3.048),  42.1 MB/s , 499.5 MB/s 
 8#enwik7            :  10000000 ->   3239530 (3.087),  33.7 MB/s , 519.7 MB/s 
 9#enwik7            :  10000000 ->   3179704 (3.145),  28.8 MB/s , 527.2 MB/s 
10#enwik7            :  10000000 ->   3138517 (3.186),  27.3 MB/s , 529.3 MB/s 
11#enwik7            :  10000000 ->   3127634 (3.197),  25.9 MB/s , 531.2 MB/s 
12#enwik7            :  10000000 ->   3100250 (3.226),  19.2 MB/s , 539.8 MB/s

250K:
 5#enwik7            :  10000000 ->   3655059 (2.736),  56.0 MB/s , 487.0 MB/s 
 6#enwik7            :  10000000 ->   3584634 (2.790),  41.6 MB/s , 505.9 MB/s 
 7#enwik7            :  10000000 ->   3563594 (2.806),  37.6 MB/s , 461.0 MB/s 
 8#enwik7            :  10000000 ->   3535039 (2.829),  29.1 MB/s , 479.7 MB/s 
 9#enwik7            :  10000000 ->   3521692 (2.840),  23.1 MB/s , 485.3 MB/s 
10#enwik7            :  10000000 ->   3512850 (2.847),  14.1 MB/s , 489.6 MB/s

120K:
 5#enwik7            :  10000000 ->   3808402 (2.626),  58.7 MB/s , 423.0 MB/s 
 6#enwik7            :  10000000 ->   3716544 (2.691),  41.0 MB/s , 445.7 MB/s 
 7#enwik7            :  10000000 ->   3688586 (2.711),  33.0 MB/s , 463.3 MB/s 
 8#enwik7            :  10000000 ->   3674326 (2.722),  30.0 MB/s , 468.5 MB/s 
 9#enwik7            :  10000000 ->   3665281 (2.728),  24.6 MB/s , 472.2 MB/s 
10#enwik7            :  10000000 ->   3658267 (2.734),  19.6 MB/s , 476.0 MB/s

64K:
 5#enwik7            :  10000000 ->   3912484 (2.556),  56.8 MB/s , 410.7 MB/s 
 6#enwik7            :  10000000 ->   3832542 (2.609),  40.6 MB/s , 429.7 MB/s 
 7#enwik7            :  10000000 ->   3808642 (2.626),  33.0 MB/s , 442.7 MB/s 
 8#enwik7            :  10000000 ->   3796982 (2.634),  30.6 MB/s , 446.6 MB/s 
 9#enwik7            :  10000000 ->   3790430 (2.638),  25.1 MB/s , 449.4 MB/s 
10#enwik7            :  10000000 ->   3786226 (2.641),  26.4 MB/s , 451.2 MB/s

Default params, 32-bit
1MB:
 5#enwik7            :  10000000 ->   3452564 (2.896),  62.9 MB/s , 459.0 MB/s 
 6#enwik7            :  10000000 ->   3380432 (2.958),  49.6 MB/s , 471.1 MB/s 
 7#enwik7            :  10000000 ->   3267861 (3.060),  34.9 MB/s , 503.0 MB/s 
 8#enwik7            :  10000000 ->   3226874 (3.099),  27.8 MB/s , 523.3 MB/s 
 9#enwik7            :  10000000 ->   3198994 (3.126),  20.8 MB/s , 529.7 MB/s 
10#enwik7            :  10000000 ->   3153880 (3.171),  17.7 MB/s , 528.2 MB/s

250K:
 5#enwik7            :  10000000 ->   3650936 (2.739),  54.2 MB/s , 506.3 MB/s 
 6#enwik7            :  10000000 ->   3583359 (2.791),  41.4 MB/s , 524.8 MB/s 
 7#enwik7            :  10000000 ->   3559814 (2.809),  31.0 MB/s , 480.1 MB/s 
 8#enwik7            :  10000000 ->   3531452 (2.832),  24.3 MB/s , 498.6 MB/s 
 9#enwik7            :  10000000 ->   3518956 (2.842),  18.3 MB/s , 504.3 MB/s 
10#enwik7            :  10000000 ->   3512850 (2.847),  14.2 MB/s , 507.9 MB/s

120K:
 5#enwik7            :  10000000 ->   3808702 (2.626),  62.5 MB/s , 441.5 MB/s 
 6#enwik7            :  10000000 ->   3713482 (2.693),  43.9 MB/s , 466.0 MB/s 
 7#enwik7            :  10000000 ->   3685429 (2.713),  35.5 MB/s , 483.8 MB/s 
 8#enwik7            :  10000000 ->   3669211 (2.725),  29.0 MB/s , 490.3 MB/s 
 9#enwik7            :  10000000 ->   3661579 (2.731),  23.9 MB/s , 494.1 MB/s 
10#enwik7            :  10000000 ->   3658267 (2.734),  20.1 MB/s , 496.7 MB/s

64K:
 5#enwik7            :  10000000 ->   3911244 (2.557),  64.6 MB/s , 429.4 MB/s 
 6#enwik7            :  10000000 ->   3831278 (2.610),  47.6 MB/s , 448.6 MB/s 
 7#enwik7            :  10000000 ->   3807315 (2.627),  39.0 MB/s , 461.8 MB/s 
 8#enwik7            :  10000000 ->   3794626 (2.635),  34.0 MB/s , 466.3 MB/s 
 9#enwik7            :  10000000 ->   3788754 (2.639),  29.8 MB/s , 469.1 MB/s 
10#enwik7            :  10000000 ->   3786226 (2.641),  26.9 MB/s , 470.8 MB/s

Still to-do:

  • Any test failures that show up here.
  • Investigate why, for 32-entry rows, compression doesn't benefit from sizing the rows up to the correct amount: a row actually uses 33 entries including the "head" entry, so jumping between rows in increments of 32 should theoretically be bad, since we lose one entry per row. On the other hand, the current layout does save some memory, since we don't need to waste an additional 31 bytes per row.
  • Clean up git history
  • Remove prefetch toggle code in zstd_lazy.c if the speed impact of the extra code can't be minimized.

@terrelln commented Mar 2, 2021

Investigate why, for 32-entry rows, compression doesn't benefit from sizing the rows up to the correct amount: a row actually uses 33 entries including the "head" entry, so jumping between rows in increments of 32 should theoretically be bad, since we lose one entry per row. On the other hand, the current layout does save some memory, since we don't need to waste an additional 31 bytes per row.

What do you mean by that?

@senhuang42:

Investigate why, for 32-entry rows, compression doesn't benefit from sizing the rows up to the correct amount: a row actually uses 33 entries including the "head" entry, so jumping between rows in increments of 32 should theoretically be bad, since we lose one entry per row. On the other hand, the current layout does save some memory, since we don't need to waste an additional 31 bytes per row.

What do you mean by that?

So for 17-entry tagTable rows, we actually allocate 32 entries per row (wasting 15). When kRowEntries == 32, we actually have 33-entry tagTable rows, so in theory we should be allocating 64 entries per row (wasting 31).

But for some reason, the code seems to work as-is. Changing the logic so that we allocate a bigger tagTable and multiply the tagTable row index we end up on by 2 (or just changing tagTable to be a table of U32 instead of U16) doesn't change anything.

@senhuang42 force-pushed the row_hash2 branch 2 times, most recently from d1e5be0 to e6a5cda on March 3, 2021 22:37
@senhuang42 force-pushed the row_hash2 branch 2 times, most recently from 3df4848 to d20f313 on March 8, 2021 20:45
@senhuang42 force-pushed the row_hash2 branch 4 times, most recently from 1533090 to 19b6704 on March 16, 2021 20:58
@terrelln left a comment:

This is a review of everything except zstd_lazy.c.

@terrelln left a comment:

Review of zstd_lazy.c

@senhuang42 force-pushed the row_hash2 branch 4 times, most recently from 9a53feb to a79f52d on March 31, 2021 15:57
@senhuang42:

Some benchmarks regarding NEON performance (Samsung Galaxy S20):

silesia.tar - NEON row matchfinder:
 5#silesia.tar       : 211950592 ->  63794305 (3.322), 113.6 MB/s , 834.5 MB/s
 6#silesia.tar       : 211950592 ->  62967928 (3.366), 109.6 MB/s , 850.1 MB/s
 7#silesia.tar       : 211950592 ->  61471202 (3.448),  80.5 MB/s , 909.6 MB/s
 8#silesia.tar       : 211950592 ->  60901887 (3.480),  64.7 MB/s , 936.1 MB/s
 9#silesia.tar       : 211950592 ->  59917881 (3.537),  49.7 MB/s , 932.2 MB/s
10#silesia.tar       : 211950592 ->  59283096 (3.575),  40.6 MB/s , 898.5 MB/s
11#silesia.tar       : 211950592 ->  59140552 (3.584),  37.2 MB/s , 891.0 MB/s
12#silesia.tar       : 211950592 ->  58630838 (3.615),  25.9 MB/s , 899.2 MB/s

silesia.tar - scalar row matchfinder:
 5#silesia.tar       : 211950592 ->  63794305 (3.322), 105.2 MB/s , 821.9 MB/s
 6#silesia.tar       : 211950592 ->  62967928 (3.366), 102.1 MB/s , 837.8 MB/s
 7#silesia.tar       : 211950592 ->  61471202 (3.448),  75.3 MB/s , 894.5 MB/s
 8#silesia.tar       : 211950592 ->  60901887 (3.480),  61.0 MB/s , 921.8 MB/s
 9#silesia.tar       : 211950592 ->  59917881 (3.537),  47.5 MB/s , 918.8 MB/s
10#silesia.tar       : 211950592 ->  59283096 (3.575),  38.8 MB/s , 882.4 MB/s
11#silesia.tar       : 211950592 ->  59140552 (3.584),  35.2 MB/s , 873.6 MB/s
12#silesia.tar       : 211950592 ->  58630838 (3.615),  24.2 MB/s , 884.5 MB/s

silesia.tar - no row matchfinder
 5#silesia.tar       : 211950592 ->  63981847 (3.313),  95.4 MB/s , 825.2 MB/s
 6#silesia.tar       : 211950592 ->  62882463 (3.371),  63.9 MB/s , 834.3 MB/s
 7#silesia.tar       : 211950592 ->  61327719 (3.456),  49.3 MB/s , 898.0 MB/s
 8#silesia.tar       : 211950592 ->  60745717 (3.489),  41.9 MB/s , 921.8 MB/s
 9#silesia.tar       : 211950592 ->  60174265 (3.522),  28.4 MB/s , 935.6 MB/s
10#silesia.tar       : 211950592 ->  59525124 (3.561),  18.5 MB/s , 903.0 MB/s
11#silesia.tar       : 211950592 ->  59230515 (3.578),  13.9 MB/s , 894.8 MB/s
12#silesia.tar       : 211950592 ->  58769028 (3.607),  9.26 MB/s , 897.8 MB/s

A mid-to-high single-digit percentage speed improvement. I'd be fine either keeping or removing NEON support, though I don't see much harm in leaving it in, since it's well encapsulated behind a single #ifdef block.

Regarding dictionaries (on devserver, github-users corpus):

--no-row-match-finder: DDS
 5# 9114 files       :   7484607 ->    715168 (10.47), 112.3 MB/s ,1247.2 MB/s 
 6# 9114 files       :   7484607 ->    713140 (10.50),  71.4 MB/s ,1281.5 MB/s 
 7# 9114 files       :   7484607 ->    717198 (10.44),  61.0 MB/s ,1157.1 MB/s 
 8# 9114 files       :   7484607 ->    718963 (10.41),  50.8 MB/s ,1137.3 MB/s 
 9# 9114 files       :   7484607 ->    727015 (10.29),  41.1 MB/s ,1102.1 MB/s 
10# 9114 files       :   7484607 ->    734618 (10.19),  30.2 MB/s ,1052.0 MB/s 
11# 9114 files       :   7484607 ->    738799 (10.13),  42.1 MB/s ,1061.5 MB/s 
12# 9114 files       :   7484607 ->    738201 (10.14),  40.7 MB/s ,1027.4 MB/s

--no-row-match-finder: DMS
 5# 9114 files       :   7484607 ->    719205 (10.41), 108.8 MB/s ,1227.3 MB/s 
 6# 9114 files       :   7484607 ->    713140 (10.50),  70.6 MB/s ,1262.1 MB/s 
 7# 9114 files       :   7484607 ->    717198 (10.44),  61.1 MB/s ,1155.9 MB/s 
 8# 9114 files       :   7484607 ->    718963 (10.41),  49.1 MB/s ,1109.9 MB/s 
 9# 9114 files       :   7484607 ->    727015 (10.29),  39.7 MB/s ,1128.2 MB/s 
10# 9114 files       :   7484607 ->    734618 (10.19),  27.4 MB/s ,1087.0 MB/s 
11# 9114 files       :   7484607 ->    738799 (10.13),  41.4 MB/s ,1044.3 MB/s 
12# 9114 files       :   7484607 ->    738201 (10.14),  41.9 MB/s ,1037.3 MB/s 


--row-match-finder: DDS
 5# 9114 files       :   7484607 ->    714943 (10.47),  92.2 MB/s ,1270.3 MB/s 
 6# 9114 files       :   7484607 ->    713036 (10.50),  61.5 MB/s ,1312.3 MB/s 
 7# 9114 files       :   7484607 ->    717014 (10.44),  50.8 MB/s ,1165.9 MB/s 
 8# 9114 files       :   7484607 ->    719434 (10.40),  43.2 MB/s ,1121.0 MB/s 
 9# 9114 files       :   7484607 ->    727215 (10.29),  35.9 MB/s ,1095.5 MB/s 
10# 9114 files       :   7484607 ->    727215 (10.29),  37.7 MB/s ,1135.3 MB/s 
11# 9114 files       :   7484607 ->    738799 (10.13),  40.4 MB/s ,1064.4 MB/s 
12# 9114 files       :   7484607 ->    738201 (10.14),  40.6 MB/s ,1053.3 MB/s


--row-match-finder: DMS
 5# 9114 files       :   7484607 ->    715591 (10.46),  85.6 MB/s ,1252.9 MB/s 
 6# 9114 files       :   7484607 ->    713574 (10.49),  60.1 MB/s ,1280.3 MB/s 
 7# 9114 files       :   7484607 ->    717583 (10.43),  49.1 MB/s ,1135.5 MB/s 
 8# 9114 files       :   7484607 ->    720191 (10.39),  42.6 MB/s ,1135.5 MB/s 
 9# 9114 files       :   7484607 ->    728295 (10.28),  37.3 MB/s ,1123.1 MB/s 
10# 9114 files       :   7484607 ->    728295 (10.28),  34.8 MB/s ,1066.1 MB/s 
11# 9114 files       :   7484607 ->    738799 (10.13),  41.0 MB/s , 994.5 MB/s 
12# 9114 files       :   7484607 ->    738201 (10.14),  39.7 MB/s ,1016.4 MB/s


--no-row-match-finder: no dict
 5# 9114 files       :   7484607 ->   2471555 (3.028),  64.6 MB/s , 278.6 MB/s 
 6# 9114 files       :   7484607 ->   2471608 (3.028),  61.4 MB/s , 276.5 MB/s 
 7# 9114 files       :   7484607 ->   2471609 (3.028),  60.9 MB/s , 276.8 MB/s 
 8# 9114 files       :   7484607 ->   2471609 (3.028),  60.4 MB/s , 278.1 MB/s 
 9# 9114 files       :   7484607 ->   2471608 (3.028),  40.8 MB/s , 279.9 MB/s 
10# 9114 files       :   7484607 ->   2471608 (3.028),  40.7 MB/s , 281.0 MB/s 
11# 9114 files       :   7484607 ->   2471638 (3.028),  23.2 MB/s , 283.3 MB/s 
12# 9114 files       :   7484607 ->   2454019 (3.050),  19.1 MB/s , 283.0 MB/s 

--row-match-finder: no dict
 5# 9114 files       :   7484607 ->   2462015 (3.040),  52.5 MB/s , 277.3 MB/s 
 6# 9114 files       :   7484607 ->   2462012 (3.040),  50.1 MB/s , 282.6 MB/s 
 7# 9114 files       :   7484607 ->   2462013 (3.040),  51.8 MB/s , 283.7 MB/s 
 8# 9114 files       :   7484607 ->   2462013 (3.040),  49.3 MB/s , 261.5 MB/s 
 9# 9114 files       :   7484607 ->   2471608 (3.028),  40.8 MB/s , 289.7 MB/s 
10# 9114 files       :   7484607 ->   2471608 (3.028),  41.8 MB/s , 283.9 MB/s 
11# 9114 files       :   7484607 ->   2471638 (3.028),  22.6 MB/s , 279.9 MB/s 
12# 9114 files       :   7484607 ->   2454019 (3.050),  19.0 MB/s , 281.5 MB/s

Performance with dictionaries seems reasonable. DMS is slower than DDS. And since these files are small, row hash isn't very advantageous anyway (and for this dataset, it would not be activated by default).

@senhuang42 force-pushed the row_hash2 branch 3 times, most recently from 21c6908 to 8c10476 on March 31, 2021 16:55
@terrelln left a comment:

This is looking really good! Just a few mostly minor comments. We should be ready to land this once they are fixed and #2546 is merged.

Can you also add force-enabled and force-disabled row-based hash tables to the regression test, with both 16 & 32 entry rows, and with/without a dictionary in DMS mode?

if (dictMode == ZSTD_dedicatedDictSearch) {
    ml = ZSTD_dedicatedDictSearch_lazy_search(offsetPtr, ml, nbAttempts, dms,
                                              ip, iLimit, prefixStart, curr, dictLimit, ddsIdx);
} else if (dictMode == ZSTD_dictMatchState) {
Contributor:

I didn't include prefetching at all since it seemed like it was not good on smaller files/smaller hash tables, and DMSes typically aren't huge - but in either case, yeah, we should measure that, and I'll add it to the to-do list.

Prefetching in the DMS is very different than prefetching for the file's own hash table. We're inserting into the hash table as we compress, so those writes mean it is hot and likely already in L1.

However, the dictionary is read-only, and likely starts out in RAM (cold), so we'll see lots of cache misses. @felixhandte found that prefetching was very beneficial for DDS.

@senhuang42 force-pushed the row_hash2 branch 3 times, most recently from f08e097 to 69a6516 on April 2, 2021 16:02
@senhuang42 commented Apr 2, 2021

I've collected a bunch of benchmarks comparing DDS vs. DMS+attach vs. DMS+copy vs. forceLoad, in addition to row hash vs. non-row-hash methods, and I'm currently parsing through the data (as well as the updated results.csv) to make sense of it and see if there's anything particularly worrying, or if it reveals any other bugs in how dictionaries are handled.

Still trying to figure out how best to present this info, but the raw data is here if you're curious (some of the levels aren't using row hash for smaller inputs, though it's reported anyway):
DDS with row: https://pastebin.com/wLuHbkG2
DMS+attach with row: https://pastebin.com/iqC0fMRU
DMS+copy with row: https://pastebin.com/5QBr1iaa
Force Load with row: https://pastebin.com/HpW4Pnn6

DDS no row: https://pastebin.com/wm6d5m92
DMS+attach no row: https://pastebin.com/QFp9a9fW
DMS+copy no row: https://pastebin.com/hqdVKhjc
Force Load with no row: https://pastebin.com/bwqqGNCj

@terrelln left a comment:

Looks good to me!

Please just look into the "with dict load" regression test before landing. It's always giving exactly the same numbers, no matter what level, or whether row is enabled/disabled. That seems very suspicious.

If it is broken, you can fix it in a separate PR, as long as it isn't hiding a bug (unlikely).

github, level 5 row 2 with dict dms, advanced one pass, 38758
github, level 5 row 2 with dict dds, advanced one pass, 38737
github, level 5 row 2 with dict copy, advanced one pass, 38759
github, level 5 row 2 with dict load, advanced one pass, 42252
Contributor:

I'm suspicious of this result. Why do we get exactly the same "with dict load" number for row = 1 and row = 2? The rest are all different.

@senhuang42 commented Apr 2, 2021

I was actually just looking into this, since I noticed it initially in my separate benchmarking as well. It appears to be an existing bug here:

if (cctx->cdict)
    params.compressionLevel = cctx->cdict->compressionLevel; /* let cdict take priority in terms of compression level */

If the cdict exists and a compression level isn't set, and we reload the dictionary, we end up taking the cdict's level (which is fixed) rather than the cParams. I'll put up a separate PR for this, since it affects non-row matchfinder results too.

Contributor Author:

You can replicate this behavior if you hardcode the attach dict pref to forceLoad and run benchmarks like -b1e19. Everything is just compressed with default compression level.

Contributor:

Yeah, I figured it was something like that, nice find!

I'm fine with fixing it in a separate PR; we'll just need to double-check the row-based matchfinder results after that PR to make sure they make sense.

github, level 7 row 1 with dict dms, advanced one pass, 38771
github, level 7 row 1 with dict dds, advanced one pass, 38771
github, level 7 row 1 with dict copy, advanced one pass, 38745
github, level 7 row 1 with dict load, advanced one pass, 42252
Contributor:

This is the same number too.

github, level 9 with dict, advanced one pass, 39437
github, level 9 with dict dms, advanced one pass, 39437
github, level 9 with dict dds, advanced one pass, 39338
github, level 9 with dict copy, advanced one pass, 39398
github, level 9 with dict load, advanced one pass, 42252
Contributor:

And this is the same number.

github.tar, level 5 row 2 with dict dms, advanced one pass, 39024
github.tar, level 5 row 2 with dict dds, advanced one pass, 39023
github.tar, level 5 row 2 with dict copy, advanced one pass, 39040
github.tar, level 5 row 2 with dict load, advanced one pass, 37956
Contributor:

This too.

@senhuang42 commented Apr 7, 2021

Some comparisons of DDS vs. DMS attach vs. DMS copy to help inform a decision on what to do about the three methods. Results for force dict reloading are included too, though those are separate from the main topic here.

It actually seems like one of DMS+attach or DMS+copy is typically faster than DDS, except on the smaller srcSizes (where non-row has the advantage). So if the auto-selection configuration as it stands is reasonably accurate, these results actually seem to indicate that we don't need DDS.

For the case that we typically care about for dictionary compression (smaller srcSize), DMS+attach actually seems to be faster than DDS, and DMS+copy is still pretty good for the larger srcSizes. For reference, the attachDictCutoff is 32KB for all of the affected strategies.

All of the results below are for level 7, with silesia.tar cut up into blocks of srcSize bytes. The levels for which row hash is used tell a pretty similar story across each set of srcSize params.

Speed:
[speed comparison charts not reproduced in this capture]

Ratio:
1K srcsize:

| dictsize | 8000 | 64000 | 112640 | 250000 | 500000 |
|----------|------|-------|--------|--------|--------|
| DDS | 2.093 | 2.220 | 2.314 | 2.399 | 2.428 |
| DMSCopy | 2.093 | 2.219 | 2.309 | 2.394 | 2.421 |
| DMSAttach | 2.093 | 2.219 | 2.309 | 2.394 | 2.421 |
| DDS_norow | 2.100 | 2.228 | 2.324 | 2.411 | 2.439 |
| DMSCopy_norow | 2.100 | 2.228 | 2.324 | 2.411 | 2.439 |
| DMSAttach_norow | 2.100 | 2.228 | 2.324 | 2.411 | 2.439 |
| load | 2.029 | 2.021 | 2.043 | 2.036 | 2.035 |
| load_norow | 2.076 | 2.044 | 2.062 | 2.050 | 2.046 |

10K:

| dictsize | 8000 | 64000 | 112640 | 250000 | 500000 |
|----------|------|-------|--------|--------|--------|
| DDS | 2.719 | 2.813 | 2.866 | 2.935 | 2.934 |
| DMSCopy | 2.719 | 2.813 | 2.863 | 2.933 | 2.930 |
| DMSAttach | 2.719 | 2.813 | 2.863 | 2.933 | 2.930 |
| DDS_norow | 2.723 | 2.815 | 2.867 | 2.937 | 2.936 |
| DMSCopy_norow | 2.723 | 2.815 | 2.867 | 2.937 | 2.936 |
| DMSAttach_norow | 2.723 | 2.815 | 2.867 | 2.937 | 2.936 |
| load | 2.719 | 2.732 | 2.733 | 2.728 | 2.720 |
| load_norow | 2.723 | 2.803 | 2.818 | 2.791 | 2.766 |

100K:

| dictsize | 8000 | 64000 | 112640 | 250000 | 500000 |
|----------|------|-------|--------|--------|--------|
| DDS | 3.086 | 3.167 | 3.181 | 3.205 | 3.191 |
| DMSCopy | 3.088 | 3.167 | 3.180 | 3.206 | 3.190 |
| DMSAttach | 3.087 | 3.167 | 3.180 | 3.205 | 3.190 |
| DDS_norow | 3.096 | 3.168 | 3.183 | 3.207 | 3.191 |
| DMSCopy_norow | 3.127 | 3.167 | 3.182 | 3.207 | 3.191 |
| DMSAttach_norow | 3.127 | 3.168 | 3.183 | 3.207 | 3.191 |
| load | 3.132 | 3.165 | 3.178 | 3.166 | 3.190 |
| load_norow | 3.133 | 3.166 | 3.179 | 3.166 | 3.191 |

200K:

| dictsize | 8000 | 64000 | 112640 | 250000 | 500000 |
|----------|------|-------|--------|--------|--------|
| DDS | 3.119 | 3.237 | 3.245 | 3.269 | 3.254 |
| DMSCopy | 3.122 | 3.236 | 3.244 | 3.269 | 3.253 |
| DMSAttach | 3.119 | 3.236 | 3.245 | 3.269 | 3.254 |
| DDS_norow | 3.143 | 3.234 | 3.249 | 3.271 | 3.255 |
| DMSCopy_norow | 3.180 | 3.232 | 3.248 | 3.271 | 3.255 |
| DMSAttach_norow | 3.179 | 3.234 | 3.249 | 3.271 | 3.255 |
| load | 3.217 | 3.203 | 3.212 | 3.235 | 3.253 |
| load_norow | 3.218 | 3.203 | 3.213 | 3.235 | 3.255 |

1M:

| dictsize | 8000 | 64000 | 112640 | 250000 | 500000 |
|----------|------|-------|--------|--------|--------|
| DDS | 3.145 | 3.332 | 3.334 | 3.383 | 3.381 |
| DMSCopy | 3.149 | 3.331 | 3.332 | 3.383 | 3.379 |
| DMSAttach | 3.145 | 3.332 | 3.334 | 3.383 | 3.381 |
| DDS_norow | 3.189 | 3.323 | 3.344 | 3.379 | 3.386 |
| DMSCopy_norow | 3.234 | 3.321 | 3.343 | 3.376 | 3.384 |
| DMSAttach_norow | 3.231 | 3.323 | 3.344 | 3.379 | 3.386 |
| load | 3.346 | 3.354 | 3.358 | 3.368 | 3.379 |
| load_norow | 3.351 | 3.359 | 3.362 | 3.373 | 3.384 |

@senhuang42 force-pushed the row_hash2 branch 2 times, most recently from 42f78ce to 4d63d6e on April 7, 2021 17:26
@terrelln commented Apr 7, 2021

I'm suspicious of those results, because I don't see a large difference between DDS & DMSAttach_norow, where we know that DDS is significantly faster than DMSAttach_norow.

What is your dictionary? Is it possible that you get about the same compression ratio with or without a dictionary? In that case the matchfinder would basically never find matches within the dictionary, so all methods would appear to be about the same speed.

@terrelln commented Apr 7, 2021

Silesia isn't a good corpus for dictionary benchmarking, because it contains several files, and each file is completely independent. I recommend running this on one of the dictionary compression corpora.

@senhuang42:

I'm suspicious of those results, because I don't see a large difference between DDS & DMSAttach_norow, where we know that DDS is significantly faster than DMSAttach_norow.

What is your dictionary? Is it possible that you get about the same compression ratio with or without a dictionary? In that case the matchfinder would basically never find matches within the dictionary, so all methods would appear to be about the same speed.

The dictionary I used is just a dictionary trained on silesia.tar split up into 128K blocks. Though that's a good point: I suspect that a dictionary of size 8K-500K for silesia.tar can't possibly be very good, since the files are so different. A quick spot check of the dict vs. no-dict sizes confirms that: only a few percent better compression.

@senhuang42 merged commit 56421f3 into facebook:dev on Apr 8, 2021
@aqrit commented Apr 29, 2021

Clang has a codegen issue that costs ZSTD_vmovmaskq_u8 3 instructions.

If ZSTD_row_getMatchMask is passed rowEntries == 32, then ZSTD_vmovmaskq_u8 is called twice (once for every 16 bytes). This is unnecessary; combining the 128-bit halves together would save several operations.

The following snippet avoids clang's vsra issue (it still uses a vsra, but the optimizer is not smart enough to realize it could be lowered to a shift + or):

#include <arm_neon.h>
#include <stdint.h>

uint32_t NEON_i8x16_MatchMask (const uint8_t* ptr, uint8_t match_byte)
{
    uint8x16_t src = vld1q_u8(ptr);
    uint8x16_t dup = vdupq_n_u8(match_byte);
    uint16x8_t cmp = vreinterpretq_u16_u8(vceqq_u8(src, dup)); // #1

    uint16x8_t t0 = vshlq_n_u16(cmp, 7); // #2
    uint32x4_t t1 = vreinterpretq_u32_u16(vsriq_n_u16(t0, t0, 14)); // #3
    uint64x2_t t2 = vreinterpretq_u64_u32(vshrq_n_u32(t1, 14)); // #4
    uint8x16_t t3 = vreinterpretq_u8_u64(vsraq_n_u64(t2, t2, 28));
    return vgetq_lane_u8(t3, 0) | (vgetq_lane_u8(t3, 8) << 8);

/*
1) ...|77777777|66666666|55555555|44444444|33333333|22222222|11111111|00000000
2) ...|76666666|6xxxxxxx|54444444|4xxxxxxx|32222222|2xxxxxxx|10000000|0xxxxxxx
3) ...|76666666|6xxxxx76|54444444|4xxxxx54|32222222|2xxxxx32|10000000|0xxxxx10
4) ...|xxxxxxxx|xxxxxx76|6666666x|xxxx7654|xxxxxxxx|xxxxxx32|2222222x|xxxx3210
*/
}

A 256-bit version could look like this:

uint32_t NEON_i8x32_MatchMask (const unsigned char* ptr, unsigned char val)
{
    uint16x8x2_t src = vld2q_u16((unsigned short*)ptr);
    uint8x16_t dup_val = vdupq_n_u8(val);
    uint8x16_t cmp0 = vceqq_u8(vreinterpretq_u8_u16(src.val[0]), dup_val);
    uint8x16_t cmp1 = vceqq_u8(vreinterpretq_u8_u16(src.val[1]), dup_val);

    uint8x8_t t0 = vreinterpret_u8_s8(vqmovn_s16(vreinterpretq_s16_u8(cmp0)));
    uint8x8_t t1 = vreinterpret_u8_s8(vqmovn_s16(vreinterpretq_s16_u8(cmp1)));
    uint8x8_t t2 = vsri_n_u8(t1, t0, 2); // ...|FEDCxxxx|BA98xxxx|7654xxxx|3210xxxx
    uint8x8x2_t t3 = vuzp_u8(t2, t0); // **see note
    uint8x8_t t4 = vsri_n_u8(t3.val[1], t3.val[0], 4);
    return vget_lane_u32(vreinterpret_u32_u8(t4), 0);

/* 	vuzp_u8 splits the input into odd and even lanes
	we don't care about the hi 32-bits; just pass old garbage as the hi arg.
	(note: don't pass `t2` twice as the cause an extra mov)
*/

/*
pack with signed saturation (vqmovn_s16):
0x0000 -> 0x00
0x00FF -> 0x7F
0xFF00 -> 0x80
0xFFFF -> 0xFF
*/
}

@terrelln:

Thanks for the feedback @aqrit!

Definitely most of our time was spent optimizing the performance on x86. We'd love to take a PR that improves NEON performance!

@aqrit commented May 20, 2021

The scalar version of ZSTD_Vec128_cmpMask8 may not be compatible with big-endian memory byte order?
Is big-endian support desired? If not, can the other big-endian paths be ripped out?

I'm considering creating a PR. I believe a SWAR approach for scalar ZSTD_Vec128_cmpMask8 would use one-fourth as many instructions.

#include <stddef.h>
#include <string.h>

size_t SWAR_MatchMask(const unsigned char* s, unsigned char c, size_t n) {
//	assert((n == 16) || (n == 32));
//	assert((sizeof(size_t) == 4) || (sizeof(size_t) == 8));

	size_t shift = (sizeof(size_t) * 8) - sizeof(size_t);
	size_t mul = 0x0204081;
	size_t cmp = c * 0x01010101;
	size_t x7F = 0x7F7F7F7F;
	size_t matches = 0;
	
 	if (8 == sizeof(size_t)) {
		mul |= mul << 28;
		cmp |= cmp << 32;
		x7F |= x7F << 32;
	}
	
	for (size_t i = 0; i < n; i += sizeof(size_t)) {
		size_t v;
		memcpy(&v, &s[i], sizeof(size_t));
		v ^= cmp;
		v = (~(((v & x7F) + x7F) | v | x7F) * mul) >> shift;
		matches |= v << i;
	}
	return matches;
}

Also, maybe set ZSTD_ROW_HASH_TAG_OFFSET to 16 and use aligned reads?
Each row in the tagTable has (at least) 15 bytes of padding and is always aligned to (at least) a 32 byte boundary..?

@terrelln:

Is big-endian support desired? If not, can the other big-endian paths be ripped out?

Yes, it is desired. We do have big-endian tests, but I suspect this bug wasn't caught because it would only cause a ratio regression, not a crash / bad output. And we only run our ratio regression tests on little endian platforms.

@senhuang42 can you add a test that would expose this? We'd have to add it to one of the tests we run on big-endian platforms, and make sure we run it with vectorization disabled.

I'm considering creating a PR. I believe a SWAR approach for scalar ZSTD_Vec128_cmpMask8 would use one-fourth as many instructions.

Yeah, we'd welcome a PR that improves the scalar version. We've spent most of our time looking at optimizing the vectorized code, so there are probably wins here.

Each row in the tagTable has (at least) 15 bytes of padding and is always aligned to (at least) a 32 byte boundary..?

Yeah, we align the tagTable so the whole row fits in 1 cache line.
