Skip to content

Conversation

@regadas
Copy link
Collaborator

@regadas regadas commented Jul 11, 2020

Hi @johnynek long overdue but here it is. I think this bloom filter is inline with what you mentioned previously and it's backed by a slightly modified version of the cats-collections BitSet. Since this is very much experimental it lives under the experimental package.

To be honest this has been sitting in a branch for awhile now and I've been meaning to look more into it to try to squeeze more pref (it should be possible). Putting out here to fish for extra opinions.

Let me know what you think.

[info] BloomFilterCreateBenchmark.createBloomFilter                                      0.001            10000  avgt    3     3.810 ±    0.607  ms/op
[info] BloomFilterCreateBenchmark.createBloomFilter                                      0.001            50000  avgt    3    19.622 ±    1.380  ms/op
[info] BloomFilterCreateBenchmark.createBloomFilter                                       0.01            10000  avgt    3     2.647 ±    0.096  ms/op
[info] BloomFilterCreateBenchmark.createBloomFilter                                       0.01            50000  avgt    3    12.683 ±    1.980  ms/op
[info] BloomFilterCreateBenchmark.createBloomFilterExperimental                          0.001            10000  avgt    3     4.717 ±    0.216  ms/op
[info] BloomFilterCreateBenchmark.createBloomFilterExperimental                          0.001            50000  avgt    3    21.364 ±    0.259  ms/op
[info] BloomFilterCreateBenchmark.createBloomFilterExperimental                           0.01            10000  avgt    3     3.341 ±    0.453  ms/op
[info] BloomFilterCreateBenchmark.createBloomFilterExperimental                           0.01            50000  avgt    3    16.450 ±    0.345  ms/op
[info] BloomFilterQueryBenchmark.queryBloomFilter                            0.001              100  avgt    3  134.870 ± 12.429  ns/op
[info] BloomFilterQueryBenchmark.queryBloomFilter                            0.001             1000  avgt    3  135.747 ± 13.614  ns/op
[info] BloomFilterQueryBenchmark.queryBloomFilter                            0.001            10000  avgt    3  136.107 ±  3.723  ns/op
[info] BloomFilterQueryBenchmark.queryBloomFilter                             0.01              100  avgt    3   96.572 ±  5.242  ns/op
[info] BloomFilterQueryBenchmark.queryBloomFilter                             0.01             1000  avgt    3   97.131 ±  8.785  ns/op
[info] BloomFilterQueryBenchmark.queryBloomFilter                             0.01            10000  avgt    3   98.331 ±  6.771  ns/op
[info] BloomFilterQueryBenchmark.queryBloomFilterExperimental                0.001              100  avgt    3  142.514 ±  6.100  ns/op
[info] BloomFilterQueryBenchmark.queryBloomFilterExperimental                0.001             1000  avgt    3  143.342 ±  4.156  ns/op
[info] BloomFilterQueryBenchmark.queryBloomFilterExperimental                0.001            10000  avgt    3  146.318 ±  6.112  ns/op
[info] BloomFilterQueryBenchmark.queryBloomFilterExperimental                 0.01              100  avgt    3  101.955 ±  3.377  ns/op
[info] BloomFilterQueryBenchmark.queryBloomFilterExperimental                 0.01             1000  avgt    3  111.415 ±  8.235  ns/op
[info] BloomFilterQueryBenchmark.queryBloomFilterExperimental                 0.01            10000  avgt    3  106.131 ±  3.394  ns/op
[info] BloomFilterCreateBenchmark.createBloomFilterAggregator                            0.001            10000  avgt    3   163.437 ±   20.909  ms/op
[info] BloomFilterCreateBenchmark.createBloomFilterAggregator                            0.001            50000  avgt    3  4822.596 ± 1001.072  ms/op
[info] BloomFilterCreateBenchmark.createBloomFilterAggregator                             0.01            10000  avgt    3    91.614 ±   16.930  ms/op
[info] BloomFilterCreateBenchmark.createBloomFilterAggregator                             0.01            50000  avgt    3  2423.146 ±  354.286  ms/op
[info] BloomFilterCreateBenchmark.createBloomFilterAggregatorExperimental                0.001            10000  avgt    3    32.485 ±    1.291  ms/op
[info] BloomFilterCreateBenchmark.createBloomFilterAggregatorExperimental                0.001            50000  avgt    3   587.019 ±  796.350  ms/op
[info] BloomFilterCreateBenchmark.createBloomFilterAggregatorExperimental                 0.01            10000  avgt    3    22.232 ±    1.424  ms/op
[info] BloomFilterCreateBenchmark.createBloomFilterAggregatorExperimental                 0.01            50000  avgt    3   370.341 ±   10.683  ms/op
[info] BloomFilterCreateBenchmark.createBloomFilterUsingFold                             0.001            10000  avgt    3   161.405 ±    5.454  ms/op
[info] BloomFilterCreateBenchmark.createBloomFilterUsingFold                             0.001            50000  avgt    3  4746.247 ±  333.122  ms/op
[info] BloomFilterCreateBenchmark.createBloomFilterUsingFold                              0.01            10000  avgt    3    94.377 ±    8.656  ms/op
[info] BloomFilterCreateBenchmark.createBloomFilterUsingFold                              0.01            50000  avgt    3  2265.540 ±   86.962  ms/op
[info] BloomFilterCreateBenchmark.createBloomFilterUsingFoldExperimental                 0.001            10000  avgt    3    32.108 ±    0.386  ms/op
[info] BloomFilterCreateBenchmark.createBloomFilterUsingFoldExperimental                 0.001            50000  avgt    3   542.427 ±   97.306  ms/op
[info] BloomFilterCreateBenchmark.createBloomFilterUsingFoldExperimental                  0.01            10000  avgt    3    22.257 ±    0.485  ms/op
[info] BloomFilterCreateBenchmark.createBloomFilterUsingFoldExperimental                  0.01            50000  avgt    3   400.538 ±  989.902  ms/op

@regadas regadas requested a review from johnynek July 13, 2020 09:53
Copy link
Collaborator

@johnynek johnynek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this.

So, is it right that the TL;DR is that it may be a bit slower but very close for all operations except folding/aggregating where it is considerably faster?

Would you agree this is the right take?

@regadas
Copy link
Collaborator Author

regadas commented Jul 16, 2020

@johnynek yup that's the right take. If I'm not missing anything obvious we need dig a little bit more into this BitSet.

@regadas
Copy link
Collaborator Author

regadas commented Jul 17, 2020

New numbers:

[info] BloomFilterCreateBenchmark.createBloomFilterUsingFoldExperimental                0.001            10000  avgt    3   14.570 ±  2.050  ms/op
[info] BloomFilterCreateBenchmark.createBloomFilterUsingFoldExperimental                0.001            50000  avgt    3  100.728 ± 10.937  ms/op
[info] BloomFilterCreateBenchmark.createBloomFilterUsingFoldExperimental                 0.01            10000  avgt    3   11.131 ±  1.336  ms/op
[info] BloomFilterCreateBenchmark.createBloomFilterUsingFoldExperimental                 0.01            50000  avgt    3   71.354 ±  1.632  ms/op

previous:

[info] BloomFilterCreateBenchmark.createBloomFilterUsingFoldExperimental                 0.001            10000  avgt    3    32.108 ±    0.386  ms/op
[info] BloomFilterCreateBenchmark.createBloomFilterUsingFoldExperimental                 0.001            50000  avgt    3   542.427 ±   97.306  ms/op
[info] BloomFilterCreateBenchmark.createBloomFilterUsingFoldExperimental                  0.01            10000  avgt    3    22.257 ±    0.485  ms/op
[info] BloomFilterCreateBenchmark.createBloomFilterUsingFoldExperimental                  0.01            50000  avgt    3   400.538 ±  989.902  ms/op

@codecov-commenter
Copy link

codecov-commenter commented Jul 18, 2020

Codecov Report

Merging #840 into develop will increase coverage by 0.30%.
The diff coverage is 96.25%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop     #840      +/-   ##
===========================================
+ Coverage    89.43%   89.74%   +0.30%     
===========================================
  Files          116      119       +3     
  Lines         9164     9913     +749     
  Branches       383      501     +118     
===========================================
+ Hits          8196     8896     +700     
- Misses         968     1017      +49     
Impacted Files Coverage Δ
...om/twitter/algebird/experimental/BloomFilter.scala 95.00% <95.00%> (ø)
.../scala/com/twitter/algebird/immutable/BitSet.scala 96.57% <96.57%> (ø)
...rc/main/scala/com/twitter/algebird/Operators.scala 37.50% <0.00%> (-31.74%) ⬇️
...in/scala/com/twitter/algebird/scalacheck/Gen.scala 95.34% <0.00%> (-4.66%) ⬇️
.../main/scala/com/twitter/algebird/Applicative.scala 54.83% <0.00%> (-3.23%) ⬇️
.../main/scala/com/twitter/algebird/HyperLogLog.scala 90.76% <0.00%> (-1.61%) ⬇️
...ore/src/main/scala/com/twitter/algebird/Ring.scala 66.39% <0.00%> (-0.60%) ⬇️
...rc/main/scala/com/twitter/algebird/MonadLaws.scala 100.00% <0.00%> (ø)
.../main/scala/com/twitter/algebird/Successible.scala 87.50% <0.00%> (ø)
...ain/scala/com/twitter/algebird/AveragedValue.scala 100.00% <0.00%> (ø)
... and 12 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1bafbfc...8a6fadb. Read the comment docs.

Copy link
Collaborator

@johnynek johnynek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry this slipped again.

A couple of minor points.

Also, I think we should just move this into algebird.immutable (not experimental). We have pretty good evidence this is going to be as good or better for virtually all users.

After we merge this, I can see making a separate algebird-legacy package that has things we want to remove, but at the same class name. So users who need it will need to update their build, but not their source code.

I'd like to remove the mutable bitset we are using from the build so we could have a scalajs build of algebird.

* Create a bloom filter with multiple items from an iterator
*/
def create(data: Iterator[A]): Hash = monoid.sum(data.map(Item))

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add a def fromBitSet(bs: BitSet): Try[Hash] and comment that this is how you do serialization: convert to BitSet, serialize that, and then on the other end, use this method?

The Try here verifies that the bitset maximum entry does not go beyond the size we would expect for this particular bloomfilter. Also, if it is empty, we just replace it with Empty.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @johnynek Thanks for getting back to me! Here wouldn't we need to validate the numHashes as well?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We statically know the numHashes, don't we? So, the only dynamic thing we have when we are converting from a BitSet is the maximum set value in the BitSet. So, as long as that isn't invalid, we assume all is good, I guess...

@regadas regadas force-pushed the feature/pd-bloom branch 3 times, most recently from 782f913 to cfadc20 Compare November 2, 2020 22:50
Copy link
Collaborator

@johnynek johnynek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making the changes.

@codecov-io
Copy link

codecov-io commented Nov 3, 2020

Codecov Report

Merging #840 into develop will increase coverage by 0.61%.
The diff coverage is 96.60%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop     #840      +/-   ##
===========================================
+ Coverage    89.04%   89.66%   +0.61%     
===========================================
  Files          117      119       +2     
  Lines         9354     9943     +589     
  Branches       356      489     +133     
===========================================
+ Hits          8329     8915     +586     
- Misses        1025     1028       +3     
Impacted Files Coverage Δ
...a/com/twitter/algebird/immutable/BloomFilter.scala 95.90% <95.90%> (ø)
.../scala/com/twitter/algebird/immutable/BitSet.scala 96.78% <96.78%> (ø)
.../main/scala/com/twitter/algebird/RightFolded.scala 75.00% <0.00%> (-25.00%) ⬇️
.../main/scala/com/twitter/algebird/BloomFilter.scala 94.24% <0.00%> (-0.89%) ⬇️
.../main/scala/com/twitter/algebird/HyperLogLog.scala 90.76% <0.00%> (+0.40%) ⬆️
.../main/scala/com/twitter/algebird/JavaMonoids.scala 80.32% <0.00%> (+1.63%) ⬆️
.../main/scala/com/twitter/algebird/Successible.scala 95.83% <0.00%> (+4.16%) ⬆️
...c/main/scala/com/twitter/algebird/SpaceSaver.scala 93.51% <0.00%> (+4.62%) ⬆️
... and 2 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fb5b48d...3ce45c8. Read the comment docs.

@regadas
Copy link
Collaborator Author

regadas commented Nov 5, 2020

@johnynek Addressed the issue before. After this I'll go ahead and move the other BF impl to a algebird-legacy as you proposed.

@johnynek johnynek merged commit 66944f9 into twitter:develop Nov 5, 2020
@johnynek
Copy link
Collaborator

johnynek commented Nov 5, 2020

Thanks for pushing this through!

@regadas regadas deleted the feature/pd-bloom branch November 9, 2021 09:56
@codecov-commenter
Copy link

codecov-commenter commented Nov 20, 2024

Codecov Report

Attention: Patch coverage is 96.60441% with 20 lines in your changes missing coverage. Please review.

Project coverage is 89.66%. Comparing base (fb5b48d) to head (3ce45c8).
Report is 228 commits behind head on develop.

Files with missing lines Patch % Lines
.../scala/com/twitter/algebird/immutable/BitSet.scala 96.78% 15 Missing ⚠️
...a/com/twitter/algebird/immutable/BloomFilter.scala 95.90% 5 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #840      +/-   ##
===========================================
+ Coverage    89.04%   89.66%   +0.61%     
===========================================
  Files          117      119       +2     
  Lines         9354     9943     +589     
  Branches       356      489     +133     
===========================================
+ Hits          8329     8915     +586     
- Misses        1025     1028       +3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


🚨 Try these New Features:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants