SIMD Row Based Matchfinder 🚀 #2494
Conversation
(force-pushed from a2b0ec6 to 4604216)
If we just need SSE, then we can probably do a compile-time check. It's only for AVX that we need runtime checks.
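To make that distinction concrete, here is a minimal sketch (standard compiler facilities, not code from this PR) of a compile-time SSE2 check next to a runtime AVX2 check:

/* SSE2 can be selected purely at compile time: on x86-64 it is part of the
 * baseline ABI, and compilers expose predefined macros for it. */
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && (_M_IX86_FP >= 2))
#  define HAS_SSE2 1
#else
#  define HAS_SSE2 0
#endif

/* AVX2 is not guaranteed by the ABI, so using it safely generally requires a
 * runtime dispatch decision (GCC/Clang builtin shown; other compilers would
 * need their own CPUID query). */
static int has_avx2_at_runtime(void)
{
#if defined(__GNUC__) || defined(__clang__)
    return __builtin_cpu_supports("avx2");
#else
    return 0;  /* conservative fallback */
#endif
}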
(force-pushed from 75bcbe4 to 233f4de)
lib/compress/zstd_lazy.c (outdated)
for (; (matches > 0) && (nbAttempts > 0); --nbAttempts, matches &= (matches - 1)) {
    U32 const matchPos = (head + ZSTD_VecMask_next(matches)) & rowMask;
    U32 const matchIndex = row[matchPos];
    if (matchIndex < lowLimit)
        break;
    if (shouldPrefetch) {
        if ((dictMode != ZSTD_extDict) || matchIndex >= dictLimit) {
            PREFETCH_L1(base + matchIndex);
        } else {
            PREFETCH_L1(dictBase + matchIndex);
        }
    }
    matchBuffer[numMatches++] = matchIndex;
Have you tried inserting the scalar fallback here instead?
In the SSE version we iterate over the matches mask.
In the scalar version, we could iterate over the tags.
for (U32 pos = 0; pos < 16; ++pos) {
    U32 const matchPos = (head + pos) & rowMask;
    U32 const matchIndex = row[matchPos];
    if (tagRow[matchPos] != tag) continue;
    if (matchIndex < lowLimit) break;
    if (nbAttempts == 0) break;
    --nbAttempts;
    // Rest is the same
You could also try a cmov-style version to try to avoid branches. Though, who knows if the architecture supports cmov or not, since this code won't be executing on x86.
for (U32 pos = 0; pos < 16; ++pos) {
    U32 const matchPos = (head + pos) & rowMask;
    U32 const matchIndex = row[matchPos];
    int const tagMatches = (tagRow[matchPos] == tag) ? 1 : 0;
    if (matchIndex < lowLimit) break;
    if (nbAttempts == 0) break;
    nbAttempts -= tagMatches;
    // Rest is the same except don't ++ numMatches & don't prefetch, b/c it is probably disabled anyways because not x86
    numMatches += tagMatches;
It's probably not right to be measuring performance on x86-64, since we know for a fact that the scalar code won't be executing on x86-64. You could try executing in 32-bit mode (-m32). That would give a slightly more interesting comparison. And/or you could measure on ARM, though that is harder to set up. Either of those would be a more interesting comparison than x86-64 speed.
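For readers following along, here is a hedged, self-contained version of the scalar tag-scan suggested above (names such as row, tagRow, and matchBuffer mirror the quoted snippets; the surrounding state is simplified and is not the PR's actual code). The branchless variant simply replaces the --nbAttempts and numMatches++ updates with additions of the 0/1 tagMatches value, as in the second snippet above.

typedef unsigned int U32;
typedef unsigned char BYTE;

/* Scalar fallback sketch: scan all 16 slots of a row and collect candidates
 * whose 1-byte tag matches. Returns the number of candidates written to
 * matchBuffer (which must have room for 16 entries). */
static U32 scalarCollectMatches(const U32* row, const BYTE* tagRow, BYTE tag,
                                U32 head, U32 rowMask, U32 lowLimit,
                                U32 nbAttempts, U32* matchBuffer)
{
    U32 numMatches = 0;
    U32 pos;
    for (pos = 0; pos < 16; ++pos) {
        U32 const matchPos = (head + pos) & rowMask;
        U32 const matchIndex = row[matchPos];
        if (tagRow[matchPos] != tag) continue;   /* only tag hits are candidates */
        if (matchIndex < lowLimit) break;        /* candidate is out of window   */
        if (nbAttempts == 0) break;
        --nbAttempts;
        matchBuffer[numMatches++] = matchIndex;
    }
    return numMatches;
}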
Definitely an interesting idea, and a more natural way to go about it - I'll give it a shot!
After some more measurements, it actually doesn't seem like either of these offers a speed improvement compared to 32-bit without SIMD. On very small files it seems to provide some benefit, but we don't use the row-based matchfinder for those small files at that point anyways.
(force-pushed from f5a867f to 1852ef5)
I've looked at the integration portion, but not zstd_lazy.c this pass.
tests/fuzzer.c (outdated)
size_t const target_nodict_cSize[22+1] = { 3840, 3770, 3870, 3830, 3770+1,
                                           3770+5, 3770+5, 3770+5, 3750+5, 3750+5,
It's fine to change these, but I don't know if they were a tight bound before, or fairly loose. So make sure there isn't a large regression.
The appveyor + travis tests all seem like legitimate failures.
Advanced param gating
Now this is gated behind an advanced param. These ongoing changes have introduced about a 1-2% decrease in speed compared to the initial PR.

Memory usage comparison
Parameters have been manually tuned: we're roughly neutral on memory. We can optimize further by making

silesia.tar: entire file
enwik7: first 250K
enwik7: first 64K

32-bit vs. 64-bit comparisons
A comparison of the row-hash matchfinder in 32-bit vs. 64-bit builds. The regressions don't look too too bad. Also, 32-bit + SIMD does perform noticeably better than without. Worst case, 32-bit users can just not enable this advanced parameter. But when we get rid of the advanced parameter, we'll at least want to pay some attention to the impact here.
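For reference, force-enabling the matchfinder through the advanced API looks roughly like the sketch below. The parameter and enum names (ZSTD_c_useRowMatchFinder, ZSTD_ps_enable) are the ones current zstd.h exposes behind ZSTD_STATIC_LINKING_ONLY; whether the spelling matched at the time of this comment is an assumption.

#define ZSTD_STATIC_LINKING_ONLY   /* the row-matchfinder switch is an experimental parameter */
#include <zstd.h>

/* Compress src with the row-based matchfinder force-enabled instead of
 * relying on auto-selection. Returns the compressed size, or 0 on error. */
static size_t compressWithRowMatchFinder(void* dst, size_t dstCapacity,
                                         const void* src, size_t srcSize)
{
    size_t cSize = 0;
    ZSTD_CCtx* const cctx = ZSTD_createCCtx();
    if (cctx == NULL) return 0;
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_compressionLevel, 7);
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_useRowMatchFinder, ZSTD_ps_enable);
    cSize = ZSTD_compress2(cctx, dst, dstCapacity, src, srcSize);
    ZSTD_freeCCtx(cctx);
    return ZSTD_isError(cSize) ? 0 : cSize;
}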
Still to-do:
What do you mean by that?
So for 17-entry tagTable rows, we actually use 32 entries (wasting 15). But for some reason, the code seems to work as-is. Changing the logic so that we allocate a bigger tagTable and multiplying the tagTable row we end up on by 2 (or just changing tagTable to be a table of U32 instead of U16) doesn't do anything.
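A small worked illustration of the arithmetic above (illustrative only, not the PR's allocation code): a 16-tag row plus 1 head byte needs 17 bytes, and rounding the row storage up to a power of two gives 32, wasting 15 per row.

#include <stdio.h>

static unsigned nextPow2(unsigned v)
{
    unsigned p = 1;
    while (p < v) p <<= 1;
    return p;
}

int main(void)
{
    unsigned const needed = 16 + 1;               /* 16 tags + 1 head byte */
    unsigned const allocated = nextPow2(needed);  /* rounds 17 up to 32    */
    printf("needed=%u allocated=%u wasted=%u\n", needed, allocated, allocated - needed);
    return 0;
}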
(force-pushed from d1e5be0 to e6a5cda)
(force-pushed from 3df4848 to d20f313)
(force-pushed from 1533090 to 19b6704)
This is a review of everything except zstd_lazy.c
Review of zstd_lazy.c
(force-pushed from 9a53feb to a79f52d)
Some benchmarks regarding NEON performance (samsung galaxy s20):
Mid to high single digits speed improvement. I'd be fine either keeping or removing NEON support, though I don't see much harm in leaving it there since it's well encapsulated behind a single function.

Regarding dictionaries (on devserver, github-users corpus):
Performance with dictionaries seems reasonable. DMS is slower than DDS. And since these files are small, row hash isn't very advantageous anyways (and for this dataset, would not be activated by default).
(force-pushed from 21c6908 to 8c10476)
This is looking really good! Just a few mostly minor comments. We should be ready to land this once they are fixed and #2546 is merged.
Can you also add force-enabled and force-disabled row-based hash tables to the regression test, with both 16 & 32 entry rows, and with/without a dictionary in DMS mode?
if (dictMode == ZSTD_dedicatedDictSearch) {
    ml = ZSTD_dedicatedDictSearch_lazy_search(offsetPtr, ml, nbAttempts, dms,
                                              ip, iLimit, prefixStart, curr, dictLimit, ddsIdx);
} else if (dictMode == ZSTD_dictMatchState) {
I didn't include prefetching at all since it seemed like it was not good on smaller files/smaller hashtables, and the DMS typically isn't huge - but in either case, yeah, we should measure that, and I'll add it as a to-do.
Prefetching in the DMS is very different than prefetching for the file's hash table. We're inserting into the hash table as we compress, so those writes mean it is hot and likely already in L1.
However, the dictionary is read-only, and likely starts out in RAM (cold). So we'll see lots of cache misses. @felixhandte found that prefetching was very beneficial for DDS.
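A hedged sketch of the idea (the function and parameter names below are illustrative, not the PR's code; zstd's own PREFETCH_L1 macro plays the role of the builtin used here): issue prefetches for the dictionary's index row and tag row before they are searched, so the cold loads can overlap with other work.

#include <stddef.h>

typedef unsigned int U32;
typedef unsigned char BYTE;

/* GCC/Clang builtin prefetch; zstd wraps the same idea in PREFETCH_L1. */
#if defined(__GNUC__) || defined(__clang__)
#  define PREFETCH_READ(p) __builtin_prefetch((p), 0 /* read */, 3 /* high temporal locality */)
#else
#  define PREFETCH_READ(p) ((void)(p))
#endif

/* Illustrative only: prefetch one dictionary row's candidate indices and tags. */
static void prefetchDictRow(const U32* dictHashTable, const BYTE* dictTagTable,
                            U32 rowIdx, U32 rowLog)
{
    PREFETCH_READ(dictHashTable + ((size_t)rowIdx << rowLog));
    PREFETCH_READ(dictTagTable  + ((size_t)rowIdx << rowLog));
}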
(force-pushed from f08e097 to 69a6516)
I've collected a bunch of benchmarks regarding DDS vs. DMS+attach vs. DMS+copy vs. forceLoad, in addition to row hash vs. non-row hash methods, and am currently just trying to parse through the data.

Still trying to figure out how to best present this info, but the raw data is here if you're curious (some of the levels aren't using row hash for smaller inputs, though it's reported anyway):

DDS no row: https://pastebin.com/wm6d5m92
Looks good to me!
Please just look into the with dict load regression test before landing. It's always giving exactly the same numbers, no matter what level, or if row is enabled/disabled. That seems very suspicious.
If it is broken, you can fix it in a separate PR, as long as it isn't hiding a bug (unlikely).
tests/regression/results.csv (outdated)
github, level 5 row 2 with dict dms, advanced one pass, 38758
github, level 5 row 2 with dict dds, advanced one pass, 38737
github, level 5 row 2 with dict copy, advanced one pass, 38759
github, level 5 row 2 with dict load, advanced one pass, 42252
I'm suspicious of this result. Why do we get exactly the same result for row = 1 and row = 2 with dict load? The rest are all different.
Was actually just looking into this since I noticed this in my separate benchmarking as well initially. It appears to be an existing bug here:
zstd/lib/compress/zstd_compress.c
Lines 5170 to 5171 in 980f3bb
if (cctx->cdict)
    params.compressionLevel = cctx->cdict->compressionLevel; /* let cdict take priority in terms of compression level */
If the cdict exists but a compression level isn't set, and we reload the dictionary, we end up taking the level of the cdict (which is fixed) rather than the cParams. I'll put up a separate PR for this since it affects non-row matchfinder results too.
You can replicate this behavior if you hardcode the attach dict pref to forceLoad and run benchmarks like -b1e19. Everything is just compressed with the default compression level.
Yeah, I figured it was something like that, nice find!
I'm fine to fix it in a separate PR, we'll just need to double check the row-based match finder results after that PR, to make sure they make sense.
tests/regression/results.csv (outdated)
github, level 7 row 1 with dict dms, advanced one pass, 38771
github, level 7 row 1 with dict dds, advanced one pass, 38771
github, level 7 row 1 with dict copy, advanced one pass, 38745
github, level 7 row 1 with dict load, advanced one pass, 42252
This is the same number too.
tests/regression/results.csv (outdated)
github, level 9 with dict, advanced one pass, 39437
github, level 9 with dict dms, advanced one pass, 39437
github, level 9 with dict dds, advanced one pass, 39338
github, level 9 with dict copy, advanced one pass, 39398
github, level 9 with dict load, advanced one pass, 42252
And this is the same number.
tests/regression/results.csv (outdated)
github.tar, level 5 row 2 with dict dms, advanced one pass, 39024
github.tar, level 5 row 2 with dict dds, advanced one pass, 39023
github.tar, level 5 row 2 with dict copy, advanced one pass, 39040
github.tar, level 5 row 2 with dict load, advanced one pass, 37956
This too.
Some comparisons of DDS vs. DMS attach vs. DMS copy to help inform a decision on what to do about the three methods. Results for force dict reloading are included too, though those are separate from the main topic here.

It actually seems like one of DMS+attach or DMS+copy is typically faster than DDS, except on the smaller srcSizes (where non-row has the advantage). So if the auto-selection configuration as it is right now is somewhat accurate, these results actually seem to indicate that we don't need DDS. For the case that we typically care about for dictionary compression (smaller srcSize), DMS+attach actually seems to be faster than DDS, and DMS+copy is still pretty good for the larger srcSizes.

All of the results below are for level 7.

Ratio:
10K:
100K:
200K:
1M:
(force-pushed from 42f78ce to 4d63d6e)
I'm suspicious of those results, because I don't see a large difference between DDS & DMSAttach_norow, where we know that DDS is significantly faster.
What is your dictionary? Is it possible that you get about the same compression ratio with or without a dictionary? Then the match finder will basically not find matches within the dictionary, so all methods will appear to be about the same speed.
Silesia isn't a good corpus for dictionary benchmarking, because it contains several files, and each file is completely independent. I recommend running this on one of the dictionary compression corpuses.
The dictionary I used is just a dictionary trained on silesia.tar split up into 128K blocks. Though that's a good point, I suspect that a dictionary of size 8K-500K for silesia.tar can't possibly be very good since the files are so different. A quick spot check on the dict vs. no dict sizes reveals that that's accurate: only a few percent better compression.
Clang has a codegen issue that costs ZSTD_vmovmaskq_u8 3 instructions. The following snippet avoids clang's codegen issue.
A 256-bit version could look like:
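The snippets referenced above are not reproduced here. For context, one commonly used way to collapse a byte-wise NEON comparison into a scalar mask without a multi-instruction movemask emulation is the narrowing-shift trick below; this is a general illustration, not necessarily the snippet @aqrit has in mind.

#include <arm_neon.h>
#include <stdint.h>

/* Collapse a byte-wise comparison result (each lane 0x00 or 0xFF) into a
 * 64-bit "nibble mask": 4 result bits per input byte. A matching candidate at
 * byte i shows up as nibble i, so its position is ctz(mask) >> 2. */
static uint64_t neon_nibble_movemask(uint8x16_t cmp)
{
    uint16x8_t const as16   = vreinterpretq_u16_u8(cmp);
    uint8x8_t  const packed = vshrn_n_u16(as16, 4);   /* shift each 16-bit lane right by 4, narrow to 8 bits */
    return vget_lane_u64(vreinterpret_u64_u8(packed), 0);
}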
Thanks for the feedback @aqrit! Definitely most of our time was spent optimizing the performance on x86. We'd love to take a PR that improves NEON performance!
The scalar version of the match-mask code appears to have a problem on big-endian targets - is big-endian support desired? I'm considering creating a PR. I believe a SWAR approach for the scalar path could be faster.
Also, maybe set
Yes, it is desired. We do have big-endian tests, but I suspect this bug wasn't caught because it would only cause a ratio regression, not a crash / bad output. And we only run our ratio regression tests on little endian platforms. @senhuang42 can you add a test that would expose this? We'd have to add it to one of the tests we run on big-endian platforms, and make sure we run it with vectorization disabled.
Yeah, we'd welcome a PR that improves the scalar version. We've spent most of our time looking at optimizing the vectorized code, so there are probably wins here.
Yeah, we align the
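As a rough illustration of the SWAR idea mentioned above (a general sketch, not the code @aqrit is proposing): eight one-byte tags can be compared against the search tag with a single 64-bit operation using an exact per-byte zero detector. Byte order matters when mapping result bits back to positions, which ties into the big-endian discussion above.

#include <stdint.h>
#include <string.h>

/* Illustrative only: returns a 64-bit word whose byte i has its high bit set
 * iff tags[i] == tag, assuming a little-endian load (a big-endian port would
 * need to flip the byte <-> position mapping). */
static uint64_t swarTagMatches8(const uint8_t* tags, uint8_t tag)
{
    uint64_t x;
    uint64_t const splat = (uint64_t)tag * 0x0101010101010101ULL;
    memcpy(&x, tags, sizeof(x));   /* safe unaligned 8-byte load */
    x ^= splat;                    /* bytes equal to tag become 0x00 */
    /* Exact zero-byte test: (b & 0x7F) + 0x7F never carries across bytes,
     * so each result byte depends only on its own input byte. */
    return ~((((x & 0x7F7F7F7F7F7F7F7FULL) + 0x7F7F7F7F7F7F7F7FULL) | x)
             | 0x7F7F7F7F7F7F7F7FULL);
}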
Followups after this PR
This PR introduces the row-based lazy matchfinder, iterating on the prototype from @terrelln, and presents some initial results.
The comment above ZSTD_RowFindBestMatch_generic() explains at a high level how the matchfinder works (a simplified sketch also appears at the end of this description).
Preliminary benchmarks: parameters/compression levels are just approximations, so there will be some compression ratio differences. We see some pretty huge gains though.
gcc: silesia.tar, -b5e12:
gcc: enwik7, -b5e12:
On small sources we are roughly neutral compared to zstd 1.4.8 if we remove prefetching. Currently the templating on the prefetching is probably not done properly and slows down everything in general. The following benchmarks with "no prefetch" compare the speed if the prefetching code is just removed.
gcc: enwik7, -b4e8 -B1KB
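Sketch referenced above: a simplified, hedged illustration of the row-based lookup for one position (SSE2 intrinsics; the real ZSTD_RowFindBestMatch_generic() additionally handles the rotating head slot, prefetching, 32-entry rows, and dictionary modes, none of which are shown here).

#include <emmintrin.h>  /* SSE2 */

typedef unsigned int U32;
typedef unsigned char BYTE;

/* One row lookup with 16 entries per row: compare all 16 one-byte tags
 * against the current position's tag in a single vector compare, turn the
 * result into a bitmask, then walk only the set bits. Returns the number of
 * candidate indices written to matchBuffer (room for 16 required). */
static U32 rowLookupSketch(const U32* rowIndices, const BYTE* rowTags, BYTE tag,
                           U32 lowLimit, U32* matchBuffer)
{
    __m128i const tags  = _mm_loadu_si128((const __m128i*)(const void*)rowTags);
    __m128i const splat = _mm_set1_epi8((char)tag);
    unsigned matches    = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(tags, splat));
    U32 numMatches = 0;

    for (; matches != 0; matches &= (matches - 1)) {   /* clear lowest set bit each iteration */
        unsigned pos = 0;
        unsigned m = matches;
        while ((m & 1) == 0) { m >>= 1; ++pos; }       /* index of lowest set bit */
        {   U32 const matchIndex = rowIndices[pos];
            if (matchIndex < lowLimit) break;          /* candidate outside the window */
            matchBuffer[numMatches++] = matchIndex;
        }
    }
    return numMatches;
}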