-
Notifications
You must be signed in to change notification settings - Fork 347
CMS: add contramap to convert CMS[K] to CMS[L], add support for String and Bytes, remove Ordering context bound for K #399
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
My biggest open question is how to test the contramap functionality in an elegant and effective way. |
|
So, looking at this indeed your concern about ordering on That said, I'm not sure why we need an Ordering on Keys at all. I was looking back through the code, and it seems to me we are only using the Ordering when we do heavy-hitters, and even then it is just to break ties if there are items with the same count. Seems to me we could have used something like the hash code to do that since it is hard to imagine that ordering is truly meaningful given it will only matter if one is evicted from the heavy hitters and the other is not, but that only happens for the TopK version, which is not associative anyway. I wonder if we are making a mistake by requiring |
|
You raise a good point. I will reply in further detail tomorrow. For starters though I have revisited the core CMS code (counting-only) and successfully removed the |
To add to what you said, technically the requirement on case class HeavyHitters[K: Ordering](hhs: SortedSet[HeavyHitter[K]]) extends java.io.Serializable { ... }However, the API contract of CMS only guarantees
Hmm. Given that we're allowing the user to throw any type I admit that this use case might be a bit contrived. Why not tracking paying and non-paying customers in separate CMS? That being said, it is quite hard to anticipate or predict how clients will use an API. But:
How important such a contrived use case is in practice is debatable if it's applicable to the TopK version only anyways (which I agree is much less attractive than its associative Top-% sibling). |
|
Could we easily move the demand for Ordering closer to where it is needed? For instance, only in the heavy hitter variants? |
|
I already did that in fact -- see the last three commits in this PR, specifically https://github.com/miguno/algebird/commit/e6158973cd043d96d3f09c8c1e30d15aca42d358. So the core CMS is free of Ordering now. Still, maybe we want to think a bit more about whether we truly need Ordering for the heavy hitters variants. What's your opinion? Personally I am still debating pro and con. |
|
See also my comment about the Ordering removal at #399 (comment). |
|
Again, the build failed only because of a single unstable test that is not related to the CMS code change. |
|
Based @johnynek's comments and concerns about "Do we really need Note: This cleanup would not be required but very helpful to achieve the original goal of this pull request, which is a) providing a Proposal
Why we should accept this proposal
Why we should NOT accept this proposal
Current statusI have a version of CMS and TopPctCMS that works without Depending on your feedback "Top-N CMS yes/no" I would take another stab at the Top-N variant but here things will get messy because we must break ties in some way, and this will mean either a) to continue to require What do you think? |
|
I'd prefer not to cut TopK, if for no other reason than to maintain parity with the SketchMap implementation. Instead, my strong preference would be to simply remove the key from the HH ordering and leave it at that. I would go as far as to say that we don't need to worry about consistently sorting two HH with the same count at all; I can't imagine it happening in practice enough to be meaningful. |
|
@jnievelt why is parity with SketchMap important? Can't we just say "if you want top-k, use SketchMap"? I'd also be curious to hear how SketchMap gets used in practice. Is it mostly (entirely?) for HLL values? Because if so I think it's not the best way of achieving that (but that's a discussion for another issue/PR). |
…ate time If we track heavy hitters in a sorted data structure we lower the time complexity of update operations (such as `count()`).
|
Yeah, I think it would help to understand the reasoning behind keeping feature parity between CMS and SketchMap (not saying I disagree, I simply don't know). That being said I pushed two new commits.
object HeavyHitter {
// Note that we can't define the ordering on `count` only. We use a set to track heavy hitters, where equality means
// two elements are identical. If we were to use `count` only, we would label different items in heavy hitters as
// identical simply because they have the same count (and such false duplicates may happen easily in practice). For
// this reason we need to add another dimension to the ordering to prevent such false duplicates. For this second
// dimension we use an item's default Java hash code because we do not want to introduce an `Ordering` constraint.
def ordering[K]: Ordering[HeavyHitter[K]] =
Ordering.by[HeavyHitter[K], (Long, Int)] { hh => (hh.count, hh.item.hashCode()) }
} |
|
@avibryant I would actually prefer to deprecate Of course, there would be significant work required to really make this happen (generalize |
|
👍 from me. Any objections or should we put this out in the next version? |
|
I'd like to make a few improvements to the current PR (e.g. re-enabling the default hasher for Strings, adding tests, run benchmarks, Joe's ordering variant). When to you want to create the release? I will be able to finish this until tomorrow. |
|
@miguno were hoping for today, but maybe not until tomorrow. I'd really like to have this in. |
|
Same here. I’d also like to have this patch included. As I said I’ll have an updated PR for review by tomorrow your time. Today won’t work unfortunately. :-) |
|
@miguno missed the train for this one, but if we can make it binary compatible, we can ship it as minor update. |
|
No problem. Since you released already I wasn't in a rush today. My plan is to submit an updated PR by Friday. |
|
Most of the code is done, but I have some weird issues with the tests for |
|
Further progress has been made, including filing a ScalaTest issue regarding equality/matcher issues when using e.g. There are still two failing test cases for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
damn scala. Equiv and scala.util.hashing.Hashing should be what we want here, but they are so broken and impoverished I guess we should think about it.
I really don't like using ClassTag in general, and I especially don't like it here when it is in fact a round-about way to get at what we really want, which is a hashcode.
So, I guess the least ugly solution is to keep the Ordering[K] bound on cases with the TopK heavy hitters. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be more inclined to say "Damn Java Array!" but otherwise I totally agree. ;-)
Equiv and scala.util.hashing.Hashing should be what we want here, but they are so broken and impoverished I guess we should think about it.
I haven't used either of those yet, but I understand that you would not recommend them. (Did you mean "I guess we should /not/ think about it"?)
I really don't like using ClassTag in general, and I especially don't like it here when it is in fact a round-about way to get at what we really want, which is a hashcode.
So, I guess the least ugly solution is to keep the Ordering[K] bound on cases with the TopK heavy hitters. What do you think?
I'm torn. While I don't like ClassTag shenanigans that much either -- particularly when they are just a workaround for a different issue (read: Array's messed up equality) -- there's one reason I'd prefer ClassTag over Ordering: Not having the Ordering[K] bound clarifies the contract. There is no reason the user should have to provide an Ordering -- we only use an Ordering internally to benefit from better update performance of a SortedSet, and as you pointed out this is mostly because of the TopK use case. We do not guarantee any order(ing) at all in the API, which only guarantees a Set when retrieving the heavy hitters from a CMS instance. In other words, requiring an Ordering kinda leaks internal, implementation-specific information to the user (well, at least in the current scenario where the API does not guarantee a SortedSet).
Another reason would be to keep the code and the look'n'feel as identical as possible between TopPct and TopK.
On the other hand, keeping the Ordering bound for TopK only might be a sneaky transition path to eventually get rid of TopK altogether, though @jnievelt will not like to hear that. ;-)
So, I guess the least ugly solution is to keep the Ordering[K] bound on cases with the TopK heavy hitters.
There's also a tiny performance benefit of using an Ordering internally for TopPct, as we don't have to traverse the full list of heavy hitters every time we perform an update because dropWhile can terminate earlier than filter:
// TopPct without Ordering
override def purgeHeavyHitters(cms: CMS[K])(hitters: HeavyHitters[K]): HeavyHitters[K] = {
val minCount = heavyHittersPct * cms.totalCount
// with Ordering/SortedSet: HeavyHitters[K](hitters.hhs.dropWhile { _.count < minCount })
HeavyHitters[K](hitters.hhs.filter { _.count >= minCount })
}Whether that's an important optimization to warrant keeping Ordering I can't tell. Typically, the list of heavy hitters shouldn't be that huge that terminating early is that big of a benefit, for example.
Edit: The first comment in the Scala snippet above should have read "TopPct withOUT Ordering".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is Equiv bad news?
The reason is that in the object Equiv there is an implicit Equiv for every type that just defaults to a.equals(b). So, with that implicit hanging out, taking an Equiv[T] is not a guarantee of any kind of type-safe equivalence, since if the user does nothing, the default equals gets called. So, in the case of array Equiv[Array[Int]].equiv(Array(1), Array(1)) == false. The correct approach, is my view, would be define Equiv for all the primitive types, String and Array in the object Equiv and then auto generate them in case class companions (along with Ordering and Hashing).
The reason why Hashing is lame is that it is unknown to most (I just learned of it somewhat recently) and it also has almost no good default instances, so it does not buy users much: they still have to write a lot of code, they just write it a certain way.
I really wish the typeclass story in scala were better.
About pushing Ordering onto only TopK: big fan of that. Only those using the variant that needs it should pay the price. The benefit you note with dropWhile is even more of a win.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the explanation. Then I don't have to feel to bad about not having heard of Hashing before. :-)
The benefit you note with dropWhile is even more of a win.
Just to be clear: If we push Ordering to only TopK, then we lose dropWhile for TopPct (so that benefit, whether it's tiny or actually noticeable, would be gone) and we'd have to use filter instead. (I made a mistake in the first comment of the Scala snippet above.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I guess the conclusion is: Revert the ClassTag and bring back Ordering (for TopK only)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's the most correct approach (where correct means that the typeclasses in play are the ones closest to what the code actually requires).
@jnievelt do you have any concerns?
Thank you, sir.
|
Since we haven't heard back from @jnievelt I think I'll count this as silent agreement. :-) I will revamp the code according to what we discussed above. |
|
The equality misbehavior of Why is that? We can't even track heavy hitters for Example: $ ./sbt algebird-core/console
scala> val a1 = HeavyHitter[Array[Byte]](Array(1.toByte), 123)
scala> val a2 = HeavyHitter[Array[Byte]](Array(1.toByte), 123)
// The two objects are NOT equal, "thanks" to the equality semantics of Java arrays.
scala> a1 == a2
res0: Boolean = false
// We must first convert/wrap `Array[Byte]` to achieve the correct equality behavior.
scala> val s1 = HeavyHitter[Seq[Byte]](Array(1.toByte).toSeq, 123)
scala> val s2 = HeavyHitter[Seq[Byte]](Array(1.toByte).toSeq, 123)
scala> s1 == s2
res1: Boolean = true
// This also means that HeavyHitters[K] (plural), i.e. the container
// we use to track heavy hitters, will not work correctly either.Unfortunately these Array-related issues are affecting many places in Scala/CMS, it's not just about /**
* Used for holding a single element, to avoid repeatedly adding elements from sparse counts tables.
*/
case class CMSItem[K](item: K, override val params: CMSParams[K]) extends CMS[K](params) {
// ...other code removed...
override def frequency(x: K): Approximate[Long] = if (item == x) Approximate.exact(1L) else Approximate.exact(0L)
}This also has the implication that you can't use a "safe" source In general I am not aware of an easy and/or clean way to address these Array equality issues, and from what I understand the attempts to address this at the Scala language level failed in Scala 2.8. Also: I would rather not want to track heavy hitters by a hash code of the tracked elements (think: using To recap what we have been trying to achieve in this PR and the related discussions:
So what I am proposing is to skip (4) and "only" do (1), (2), and (3). (We'd still keep the What do you think? Of course I'd also appreciate any ideas how to address the Array equality misbehavior. |
|
PS: I forgot to add: My comment above refers to us trying to achieve the various goals without being able to leverage |
|
@miguno what about adding something like: object BytesWrapper {
implicit val ordering: Ordering[BytesWrapper] = new Ordering[BytesWrapper] {
def compare(a: BytesWrapper, b: BytesWrapper) = {
ByteBuffer.wrap(a.get).compareTo(ByteBuffer.wrap(b.get))
}
}
}
case class BytesWrapper(get: Array[Byte]) extends AnyVal {
override def hashCode = /* murmur here */
override def equals(t: Any) = t match {
case BytesWrapper(that) => java.util.Arrays.equals(get, that)
}
}Then we can side step the issue and add a comment as to why. What do you think of that? |
|
So you mean we'd tell CMS users* to not try to work on a If that's the case, would it make sense to add such an Either way (Bijection or not) we could proceed with the current PR modifications: we'd exclude direct *If I understand correctly we can't automagically use such a wrapper behind the scenes, given the previous findings. |
|
Sounds right to me. I don't know of a way to automatically wrap. But we could provide the right wrapper so that it is easier. Note that MinHashSignature is basically the same thing, but it does not get the hashCode and equality right. If we put the hashCode/equals/toString methods in an object we could call them there too. |
|
Ok. I'll take a stab at this when I'm back early next week. Thanks for the quick feedback, Oscar!
|
|
PS: Note that the wrapper cannot be a value class because we must override |
|
To play devil's advocate: Apart from a Murmur3-based Also, in CMS we use a separate hash code setup that is decoupled from the |
…y[Byte] support This commit includes: - We remove ClassTag usage and track heavy hitters with a Set() instead of a SortedSet() - We drop direct support for Arrays. For the convenience of Algebird users we do provide a wrapper class called Bytes to safely wrap Array[Byte] for use with CMS.
|
Please take a look at the latest changes.
I also did some benchmarking/profiling to compare the performance (and memory impacts etc.) of |
|
Also, we use a The reason for this is that it isn't easy to run with two different |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need some tests to exercise the case that they are not all equal. Right now, equals(t: Any) == (t != null) and compare(a, b) == 0 would pass your tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch regarding the equals test. In the latest commit I enhanced the test spec of Bytes in general.
compare(a, b) == 0 would pass your tests
As part of the changes I also enhanced the compare tests, although compare(a, b) == 0 was already failing the tests prior to the enhancements. Hence I'm not sure but hope that the enhancements will still cover any gap you found yesterday for compare.
|
Looks good. I'm going to wait on merging this until we publish 0.10 just in case we find some bugs or something on 0.9. Thanks for doing this. |
|
Ok, sounds like a plan. Thanks from my side for your patience on finalizing this pull request, I do appreciate it. |
|
I guess 0.10 is coming down the pipe, so I'm going to merge this. |
CMS: add contramap to convert CMS[K] to CMS[L], add support for String and Bytes, remove Ordering context bound for K
Here's a first draft of
contramapfor the CMS monoid as per our discussion in #398.Three things stand out:
CMSStringTest.Ordering[Array[Byte]]. I put this into a separateCMSOrderingImplicitsobject.Array[Byte],BigInt, andStringas per my snippet above. However, I disabled theStringhashing function because I lacked the creativity to come up with a non-BigInt, non-String example to showcase how to translateCMS[K]intoCMS[L](hence I didCMS[Array[Byte]]intoCMS[String]).Let me know what you think. (I do not consider this code to be production-ready yet.)