-
Notifications
You must be signed in to change notification settings - Fork 347
Experimental path dependent bloom filter #840
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for doing this.
So, is it right that the TL;DR is that it may be a bit slower but very close for all operations except folding/aggregating where it is considerably faster?
Would you agree this is the right take?
algebird-core/src/main/scala/com/twitter/algebird/experimental/BloomFilter.scala
Outdated
Show resolved
Hide resolved
algebird-core/src/main/scala/com/twitter/algebird/experimental/BloomFilter.scala
Outdated
Show resolved
Hide resolved
algebird-core/src/main/scala/com/twitter/algebird/experimental/BloomFilter.scala
Outdated
Show resolved
Hide resolved
algebird-core/src/main/scala/com/twitter/algebird/experimental/BloomFilter.scala
Outdated
Show resolved
Hide resolved
algebird-core/src/main/scala/com/twitter/algebird/experimental/BloomFilter.scala
Outdated
Show resolved
Hide resolved
algebird-core/src/main/scala/com/twitter/algebird/experimental/BloomFilter.scala
Outdated
Show resolved
Hide resolved
algebird-core/src/main/scala/com/twitter/algebird/experimental/BloomFilter.scala
Outdated
Show resolved
Hide resolved
algebird-core/src/main/scala/com/twitter/algebird/immutable/BitSet.scala
Show resolved
Hide resolved
algebird-core/src/main/scala/com/twitter/algebird/experimental/BloomFilter.scala
Outdated
Show resolved
Hide resolved
algebird-core/src/main/scala/com/twitter/algebird/experimental/BloomFilter.scala
Outdated
Show resolved
Hide resolved
|
@johnynek yup that's the right take. If I'm not missing anything obvious we need dig a little bit more into this BitSet. |
algebird-core/src/main/scala/com/twitter/algebird/experimental/BloomFilter.scala
Outdated
Show resolved
Hide resolved
algebird-core/src/main/scala/com/twitter/algebird/experimental/BloomFilter.scala
Outdated
Show resolved
Hide resolved
|
New numbers: previous: |
Codecov Report
@@ Coverage Diff @@
## develop #840 +/- ##
===========================================
+ Coverage 89.43% 89.74% +0.30%
===========================================
Files 116 119 +3
Lines 9164 9913 +749
Branches 383 501 +118
===========================================
+ Hits 8196 8896 +700
- Misses 968 1017 +49
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sorry this slipped again.
A couple of minor points.
Also, I think we should just move this into algebird.immutable (not experimental). We have pretty good evidence this is going to be as good or better for virtually all users.
After we merge this, I can see making a separate algebird-legacy package that has things we want to remove, but at the same class name. So users who need it will need to update their build, but not their source code.
I'd like to remove the mutable bitset we are using from the build so we could have a scalajs build of algebird.
algebird-core/src/main/scala/com/twitter/algebird/experimental/BloomFilter.scala
Show resolved
Hide resolved
algebird-core/src/main/scala/com/twitter/algebird/experimental/BloomFilter.scala
Outdated
Show resolved
Hide resolved
algebird-core/src/main/scala/com/twitter/algebird/experimental/BloomFilter.scala
Show resolved
Hide resolved
| * Create a bloom filter with multiple items from an iterator | ||
| */ | ||
| def create(data: Iterator[A]): Hash = monoid.sum(data.map(Item)) | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we add a def fromBitSet(bs: BitSet): Try[Hash] and comment that this is how you do serialization: convert to BitSet, serialize that, and then on the other end, use this method?
The Try here verifies that the bitset maximum entry does not go beyond the size we would expect for this particular bloomfilter. Also, if it is empty, we just replace it with Empty.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @johnynek Thanks for getting back to me! Here wouldn't we need to validate the numHashes as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We statically know the numHashes, don't we? So, the only dynamic thing we have when we are converting from a BitSet is the maximum set value in the BitSet. So, as long as that isn't invalid, we assume all is good, I guess...
algebird-core/src/main/scala/com/twitter/algebird/experimental/BloomFilter.scala
Outdated
Show resolved
Hide resolved
algebird-core/src/main/scala/com/twitter/algebird/experimental/BloomFilter.scala
Outdated
Show resolved
Hide resolved
algebird-core/src/main/scala/com/twitter/algebird/experimental/BloomFilter.scala
Show resolved
Hide resolved
782f913 to
cfadc20
Compare
cfadc20 to
06f9125
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for making the changes.
Codecov Report
@@ Coverage Diff @@
## develop #840 +/- ##
===========================================
+ Coverage 89.04% 89.66% +0.61%
===========================================
Files 117 119 +2
Lines 9354 9943 +589
Branches 356 489 +133
===========================================
+ Hits 8329 8915 +586
- Misses 1025 1028 +3
Continue to review full report at Codecov.
|
algebird-core/src/main/scala/com/twitter/algebird/immutable/BloomFilter.scala
Outdated
Show resolved
Hide resolved
|
@johnynek Addressed the issue before. After this I'll go ahead and move the other BF impl to a |
|
Thanks for pushing this through! |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## develop #840 +/- ##
===========================================
+ Coverage 89.04% 89.66% +0.61%
===========================================
Files 117 119 +2
Lines 9354 9943 +589
Branches 356 489 +133
===========================================
+ Hits 8329 8915 +586
- Misses 1025 1028 +3 ☔ View full report in Codecov by Sentry. 🚨 Try these New Features:
|
Hi @johnynek long overdue but here it is. I think this bloom filter is inline with what you mentioned previously and it's backed by a slightly modified version of the
cats-collectionsBitSet. Since this is very much experimental it lives under theexperimentalpackage.To be honest this has been sitting in a branch for awhile now and I've been meaning to look more into it to try to squeeze more pref (it should be possible). Putting out here to fish for extra opinions.
Let me know what you think.