Skip to content

Conversation

@miguno
Copy link

@miguno miguno commented Jan 15, 2015

Here's a first draft of contramap for the CMS monoid as per our discussion in #398.

Three things stand out:

  • It seems to work. :-) See CMSStringTest.
  • We need another implicit value in scope, which is an instance of Ordering[Array[Byte]]. I put this into a separate CMSOrderingImplicits object.
  • I added the default hashing functions for Array[Byte], BigInt, and String as per my snippet above. However, I disabled the String hashing function because I lacked the creativity to come up with a non-BigInt, non-String example to showcase how to translate CMS[K] into CMS[L] (hence I did CMS[Array[Byte]] into CMS[String]).

Let me know what you think. (I do not consider this code to be production-ready yet.)

@miguno
Copy link
Author

miguno commented Jan 15, 2015

My biggest open question is how to test the contramap functionality in an elegant and effective way.

@johnynek
Copy link
Collaborator

So, looking at this indeed your concern about ordering on Array[Byte] is legit. For UTF-8 strings, they are ordered on unsigned bytes.

That said, I'm not sure why we need an Ordering on Keys at all. I was looking back through the code, and it seems to me we are only using the Ordering when we do heavy-hitters, and even then it is just to break ties if there are items with the same count. Seems to me we could have used something like the hash code to do that since it is hard to imagine that ordering is truly meaningful given it will only matter if one is evicted from the heavy hitters and the other is not, but that only happens for the TopK version, which is not associative anyway.

I wonder if we are making a mistake by requiring Ordering[K] at all.

@miguno
Copy link
Author

miguno commented Jan 15, 2015

You raise a good point. I will reply in further detail tomorrow. For starters though I have revisited the core CMS code (counting-only) and successfully removed the Ordering[K] requirement. That is, we only require K: CMSHasher for CMS (I haven't fiddled with the Top-CMS variants yet).

@miguno
Copy link
Author

miguno commented Jan 15, 2015

I was looking back through the code, and it seems to me we are only using the Ordering when we do heavy-hitters, and even then it is just to break ties if there are items with the same count. Seems to me we could have used something like the hash code to do that since it is hard to imagine that ordering is truly meaningful given it will only matter if one is evicted from the heavy hitters and the other is not, but that only happens for the TopK version, which is not associative anyway.

To add to what you said, technically the requirement on Ordering is caused only by CMS' usage of SortedSet for heavy hitters:

case class HeavyHitters[K: Ordering](hhs: SortedSet[HeavyHitter[K]]) extends java.io.Serializable { ... }

However, the API contract of CMS only guarantees Set for heavy hitters to the user. There has been a related discussion about SortedSet vs. Set for SketchMap in #277, and I also had a question about the very same thing for CMS some time back at #353 (comment).

Seems to me we could have used something like the hash code to do that since it is hard to imagine that ordering is truly meaningful given it will only matter if one is evicted from the heavy hitters and the other is not [...]

Hmm. Given that we're allowing the user to throw any type K at us, Ordering allows the user to control how such tie-breaking should be performed. One use case I can think of is similar to a priority queue (the analogy works if you consider heavy hitters to be a ranked/sorted list): Let's say you have paying customers and non-paying customers, represented via a type C, and you use CMS to track misbehavior such as bandwidth usage in order to enforce throttling (users in heavy hitters will be throttled). Here, if the CMS must break ties between a paying customer and a non-paying customer F, then the client can ensure via Ordering[C] that the non-paying customer will be moved/kept in heavy hitters and thus throttled so that the service will not degrade for the paying customer.

I admit that this use case might be a bit contrived. Why not tracking paying and non-paying customers in separate CMS? That being said, it is quite hard to anticipate or predict how clients will use an API.

But:

[...] but that only happens for the TopK version, which is not associative anyway.

How important such a contrived use case is in practice is debatable if it's applicable to the TopK version only anyways (which I agree is much less attractive than its associative Top-% sibling).

@johnynek
Copy link
Collaborator

Could we easily move the demand for Ordering closer to where it is needed? For instance, only in the heavy hitter variants?

@miguno
Copy link
Author

miguno commented Jan 16, 2015

I already did that in fact -- see the last three commits in this PR, specifically https://github.com/miguno/algebird/commit/e6158973cd043d96d3f09c8c1e30d15aca42d358.

So the core CMS is free of Ordering now. Still, maybe we want to think a bit more about whether we truly need Ordering for the heavy hitters variants. What's your opinion? Personally I am still debating pro and con.

@miguno
Copy link
Author

miguno commented Jan 16, 2015

See also my comment about the Ordering removal at #399 (comment).

@miguno
Copy link
Author

miguno commented Jan 19, 2015

Again, the build failed only because of a single unstable test that is not related to the CMS code change.

@miguno
Copy link
Author

miguno commented Jan 19, 2015

Based @johnynek's comments and concerns about "Do we really need Ordering[K]?" I'd propose the following cleanup to CMS. I think this would adequately address these concerns while improving the code base at the same time.

Note: This cleanup would not be required but very helpful to achieve the original goal of this pull request, which is a) providing a contramap functionality for CMS[K] to CMS[L] auto-translation, which leads up to b) supporting String-based CMS out of the box. Both (a) and (b) are discussed in #398. The goal of #398 is to improve the UX of Algebird for users interested in String-based use cases.

Proposal

  1. Remove the Ordering context bound for K, and
  2. To keep the code simple, remove the Top-N CMS variant.

Why we should accept this proposal

  • Ordering is not really required given the API contract of CMS. Getting rid of Ordering simplifies the API (good for devs) as well as using the API (good for users).
    • We have always guaranteed "only" a Set[K] for heavy hitters, even though we used SortedSet[K] behind the scenes. As Oscar mentioned we'd need Ordering only to break ties when comparing/sorted iff two items have the same count AND we're running a Top-N CMS (= we don't care about this type of tie breaking for the Top-% CMS).
  • The Top-N CMS variant is not associative, and one could argue that "Top-N CMS is considered harmful". Also, this variant is the only real reason we can't get rid of Ordering in a clean way.

Why we should NOT accept this proposal

  • It is a backwards incompatible change for anyone who is already using the Top-N CMS variant in the wild. Note that this would have to be a user running a non-official Algebird verison because the latest public version (0.8.2) does not include the new CMS[K] code yet (i.e. 0.8.2 has neither TopPctCMS nor TopNCMS).
  • If Top-N use cases are important enough to be supported out of the box, then of course we should keep the Top-N CMS in the code base.
    • In our case, for instance, we are not using the Top-N variant at all because, well, it is not associative, which means we can never trust the outcome of a merge operation of two or more CMS instances.

Current status

I have a version of CMS and TopPctCMS that works without Ordering. Behind the scenes we stop using SortedSet (as well as dropWhile during the heavy hitters purge step; note that we require a sorted list of heavy hitters for the dropWhile semantics) and replace it with Set (and a straight filter).

Depending on your feedback "Top-N CMS yes/no" I would take another stab at the Top-N variant but here things will get messy because we must break ties in some way, and this will mean either a) to continue to require Ordering (at least for Top-N) or b) to come up with a way to break ties without ordering, although that would require some thinking (e.g. if we used a "pick an item at random" strategy -- which is a reasonable strategy IMHO -- to break ties then we'd have a hard time to write deterministic tests; Oscar suggested hashing to break up ties, which would be better than random tie-breaking but still we'd end up having to kinda "special-craft" test data for the code).

What do you think?

@jnievelt
Copy link
Contributor

I'd prefer not to cut TopK, if for no other reason than to maintain parity with the SketchMap implementation.

Instead, my strong preference would be to simply remove the key from the HH ordering and leave it at that. I would go as far as to say that we don't need to worry about consistently sorting two HH with the same count at all; I can't imagine it happening in practice enough to be meaningful.

@avibryant
Copy link
Contributor

@jnievelt why is parity with SketchMap important? Can't we just say "if you want top-k, use SketchMap"?

I'd also be curious to hear how SketchMap gets used in practice. Is it mostly (entirely?) for HLL values? Because if so I think it's not the best way of achieving that (but that's a discussion for another issue/PR).

Michael G. Noll added 2 commits January 20, 2015 13:39
…ate time

If we track heavy hitters in a sorted data structure we lower the time
complexity of update operations (such as `count()`).
@miguno
Copy link
Author

miguno commented Jan 20, 2015

Yeah, I think it would help to understand the reasoning behind keeping feature parity between CMS and SketchMap (not saying I disagree, I simply don't know).

That being said I pushed two new commits.

  • The first commit (https://github.com/miguno/algebird/commit/876786085e4bbe7d3794504ae7474cb1af9c57d2) gets rid of the Ordering requirement for K and turns the SortedSet we use behind the scenes into a plain Set. Functionality wise we do not need a SortedSet to do what we want. However:
  • The second commit (6b29c42) brings back SortedSet (but not Ordering!) to minimize the time complexity of write operations (e.g. count()). It's beneficial to track heavy hitters in a sorted data structure. In order for this to work, I needed to create a "generic" Ordering for heavy hitter elements so that SortedSet works. It is quite important that this ordering works correctly, hence I include it here:
object HeavyHitter {

  // Note that we can't define the ordering on `count` only.  We use a set to track heavy hitters, where equality means
  // two elements are identical.  If we were to use `count` only, we would label different items in heavy hitters as
  // identical simply because they have the same count (and such false duplicates may happen easily in practice).  For
  // this reason we need to add another dimension to the ordering to prevent such false duplicates.  For this second
  // dimension we use an item's default Java hash code because we do not want to introduce an `Ordering` constraint.
  def ordering[K]: Ordering[HeavyHitter[K]] =
    Ordering.by[HeavyHitter[K], (Long, Int)] { hh => (hh.count, hh.item.hashCode()) }

}

@jnievelt
Copy link
Contributor

@avibryant I would actually prefer to deprecate SketchMap and use CMS everywhere. To answer your usage question, I believe SketchMap[String, Long] is very common (including with my team), and these could use CMS today.

Of course, there would be significant work required to really make this happen (generalize CMS[K,V], double check performance, etc.). In the meantime, I like to at least move in that direction. At least as I see it, we need compelling reason not to do so.

@johnynek
Copy link
Collaborator

👍 from me. Any objections or should we put this out in the next version?

@miguno
Copy link
Author

miguno commented Jan 21, 2015

I'd like to make a few improvements to the current PR (e.g. re-enabling the default hasher for Strings, adding tests, run benchmarks, Joe's ordering variant).

When to you want to create the release? I will be able to finish this until tomorrow.

@johnynek
Copy link
Collaborator

@miguno were hoping for today, but maybe not until tomorrow. I'd really like to have this in.

@miguno
Copy link
Author

miguno commented Jan 21, 2015

Same here. I’d also like to have this patch included. As I said I’ll have an updated PR for review by tomorrow your time. Today won’t work unfortunately. :-)

@johnynek
Copy link
Collaborator

@miguno missed the train for this one, but if we can make it binary compatible, we can ship it as minor update.

@miguno
Copy link
Author

miguno commented Jan 22, 2015

No problem. Since you released already I wasn't in a rush today. My plan is to submit an updated PR by Friday.

@miguno
Copy link
Author

miguno commented Jan 23, 2015

Most of the code is done, but I have some weird issues with the tests for Array[Byte]. I'll have to do some further work on that. Stay tuned!

@miguno
Copy link
Author

miguno commented Jan 27, 2015

Further progress has been made, including filing a ScalaTest issue regarding equality/matcher issues when using e.g. Set[Array[Byte]]. I solved this with a custom Equality[Set[S]] definition in algebird-test; once ScalaTest 3.x is released we can get rid again of this custom equality definition.

There are still two failing test cases for Array[Byte] that I need to work on.

[info] CMSArrayByteTest:
[info] A Count-Min sketch implementing CMSCounting
[info] - should count total number of elements in a stream
[info] - should estimate frequencies *** FAILED ***
[info]   7 was greater than or equal to 0, but 7 was not less than or equal to 5 (CountMinSketchTest.scala:255)
[info] - should exactly compute frequencies in a small stream
[info] - should estimate inner products *** FAILED ***
[info]   TestFailedException was thrown during property evaluation.
[info]     Message: 6 was greater than or equal to 0, but 6 was not less than or equal to 0
[info]     Location: (CountMinSketchTest.scala:302)
[info]     Occurred when passed generated values (
[info]       totalCount = 25, // 10 shrinks
[info]       range = 100 // 6 shrinks
[info]     )
[info] - should exactly compute inner product of small streams
[info] - should work as an Aggregator when created from a single, small stream

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

damn scala. Equiv and scala.util.hashing.Hashing should be what we want here, but they are so broken and impoverished I guess we should think about it.

I really don't like using ClassTag in general, and I especially don't like it here when it is in fact a round-about way to get at what we really want, which is a hashcode.

So, I guess the least ugly solution is to keep the Ordering[K] bound on cases with the TopK heavy hitters. What do you think?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be more inclined to say "Damn Java Array!" but otherwise I totally agree. ;-)

Equiv and scala.util.hashing.Hashing should be what we want here, but they are so broken and impoverished I guess we should think about it.

I haven't used either of those yet, but I understand that you would not recommend them. (Did you mean "I guess we should /not/ think about it"?)

I really don't like using ClassTag in general, and I especially don't like it here when it is in fact a round-about way to get at what we really want, which is a hashcode.

So, I guess the least ugly solution is to keep the Ordering[K] bound on cases with the TopK heavy hitters. What do you think?

I'm torn. While I don't like ClassTag shenanigans that much either -- particularly when they are just a workaround for a different issue (read: Array's messed up equality) -- there's one reason I'd prefer ClassTag over Ordering: Not having the Ordering[K] bound clarifies the contract. There is no reason the user should have to provide an Ordering -- we only use an Ordering internally to benefit from better update performance of a SortedSet, and as you pointed out this is mostly because of the TopK use case. We do not guarantee any order(ing) at all in the API, which only guarantees a Set when retrieving the heavy hitters from a CMS instance. In other words, requiring an Ordering kinda leaks internal, implementation-specific information to the user (well, at least in the current scenario where the API does not guarantee a SortedSet).
Another reason would be to keep the code and the look'n'feel as identical as possible between TopPct and TopK.

On the other hand, keeping the Ordering bound for TopK only might be a sneaky transition path to eventually get rid of TopK altogether, though @jnievelt will not like to hear that. ;-)

So, I guess the least ugly solution is to keep the Ordering[K] bound on cases with the TopK heavy hitters.

There's also a tiny performance benefit of using an Ordering internally for TopPct, as we don't have to traverse the full list of heavy hitters every time we perform an update because dropWhile can terminate earlier than filter:

// TopPct without Ordering
override def purgeHeavyHitters(cms: CMS[K])(hitters: HeavyHitters[K]): HeavyHitters[K] = {
  val minCount = heavyHittersPct * cms.totalCount
  // with Ordering/SortedSet: HeavyHitters[K](hitters.hhs.dropWhile { _.count < minCount })
  HeavyHitters[K](hitters.hhs.filter { _.count >= minCount })
}

Whether that's an important optimization to warrant keeping Ordering I can't tell. Typically, the list of heavy hitters shouldn't be that huge that terminating early is that big of a benefit, for example.

Edit: The first comment in the Scala snippet above should have read "TopPct withOUT Ordering".

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is Equiv bad news?

The reason is that in the object Equiv there is an implicit Equiv for every type that just defaults to a.equals(b). So, with that implicit hanging out, taking an Equiv[T] is not a guarantee of any kind of type-safe equivalence, since if the user does nothing, the default equals gets called. So, in the case of array Equiv[Array[Int]].equiv(Array(1), Array(1)) == false. The correct approach, is my view, would be define Equiv for all the primitive types, String and Array in the object Equiv and then auto generate them in case class companions (along with Ordering and Hashing).

The reason why Hashing is lame is that it is unknown to most (I just learned of it somewhat recently) and it also has almost no good default instances, so it does not buy users much: they still have to write a lot of code, they just write it a certain way.

I really wish the typeclass story in scala were better.

About pushing Ordering onto only TopK: big fan of that. Only those using the variant that needs it should pay the price. The benefit you note with dropWhile is even more of a win.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation. Then I don't have to feel to bad about not having heard of Hashing before. :-)

The benefit you note with dropWhile is even more of a win.

Just to be clear: If we push Ordering to only TopK, then we lose dropWhile for TopPct (so that benefit, whether it's tiny or actually noticeable, would be gone) and we'd have to use filter instead. (I made a mistake in the first comment of the Scala snippet above.)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I guess the conclusion is: Revert the ClassTag and bring back Ordering (for TopK only)?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's the most correct approach (where correct means that the typeclasses in play are the ones closest to what the code actually requires).

@jnievelt do you have any concerns?

Thank you, sir.

@miguno
Copy link
Author

miguno commented Feb 9, 2015

Since we haven't heard back from @jnievelt I think I'll count this as silent agreement. :-) I will revamp the code according to what we discussed above.

@miguno
Copy link
Author

miguno commented Feb 10, 2015

The equality misbehavior of Array[Byte] is really becoming a nuisance to deal with -- to the point where I am considering to drop Array[Byte] support from CMS.

Why is that? We can't even track heavy hitters for Array[Byte] easily because both Set and SortedSet -- i.e. regardless of whether we use or do not use Ordering -- break for Array[Byte] due to its equality semantics.

Example:

$ ./sbt algebird-core/console

scala> val a1 = HeavyHitter[Array[Byte]](Array(1.toByte), 123)
scala> val a2 = HeavyHitter[Array[Byte]](Array(1.toByte), 123)

// The two objects are NOT equal, "thanks" to the equality semantics of Java arrays.
scala> a1 == a2
res0: Boolean = false

// We must first convert/wrap `Array[Byte]` to achieve the correct equality behavior.
scala> val s1 = HeavyHitter[Seq[Byte]](Array(1.toByte).toSeq, 123)
scala> val s2 = HeavyHitter[Seq[Byte]](Array(1.toByte).toSeq, 123)
scala> s1 == s2
res1: Boolean = true

// This also means that HeavyHitters[K] (plural), i.e. the container
// we use to track heavy hitters, will not work correctly either.

Unfortunately these Array-related issues are affecting many places in Scala/CMS, it's not just about Set and SortedSet. For instance, even the simple function CMSItem#frequency() will break for Arrays because the comparison if (item == x) will not work correctly for Arrays:

/**
 * Used for holding a single element, to avoid repeatedly adding elements from sparse counts tables.
 */
case class CMSItem[K](item: K, override val params: CMSParams[K]) extends CMS[K](params) {

  // ...other code removed...

  override def frequency(x: K): Approximate[Long] = if (item == x) Approximate.exact(1L) else Approximate.exact(0L)

}

This also has the implication that you can't use a "safe" source CMS[K] such as CMS[String], and then use contramap to create a equality-safe wrapper for CMS[Array[Byte]].

In general I am not aware of an easy and/or clean way to address these Array equality issues, and from what I understand the attempts to address this at the Scala language level failed in Scala 2.8.

Also: I would rather not want to track heavy hitters by a hash code of the tracked elements (think: using hashCode instead of equals; the idea would be we do have good hashing in place, so why not use a hash to determine equality?). We should really stick to the "true" equality semantics of the various types K, otherwise Set/SortedSet will incorrectly track heavy hitters due to hash collisions for instances of K. Since our hashing is based on 32-bit integers such collisions are pretty likely, particularly for large domains such as variable-length String instances.

To recap what we have been trying to achieve in this PR and the related discussions:

  1. Add contramap functionality to CMS.
  2. Also, add built-in String support.
  3. Consolidate/re-use code to support String, BigInt, and Array[Byte] via a base hasher (Murmur3) for Array[Byte] that is re-used by the hashers for String and BigInt.
  4. At some point we also began adding Array[Byte] support, kinda because we already had been working on (2) and (3).

So what I am proposing is to skip (4) and "only" do (1), (2), and (3). (We'd still keep the Array[Byte] base hashing in the code for use with String and BigInt hashing, but we wouldn't expose it as an importable implicit to the user.)

What do you think? Of course I'd also appreciate any ideas how to address the Array equality misbehavior.

@miguno
Copy link
Author

miguno commented Feb 10, 2015

PS: I forgot to add: My comment above refers to us trying to achieve the various goals without being able to leverage ClassTag to special-case Arrays.

@johnynek
Copy link
Collaborator

@miguno what about adding something like:

object BytesWrapper {
  implicit val ordering: Ordering[BytesWrapper] = new Ordering[BytesWrapper] {
    def compare(a: BytesWrapper, b: BytesWrapper) = {
      ByteBuffer.wrap(a.get).compareTo(ByteBuffer.wrap(b.get))
    }
  }
}
case class BytesWrapper(get: Array[Byte]) extends AnyVal {
  override def hashCode = /* murmur here */
  override def equals(t: Any) = t match {
    case BytesWrapper(that) => java.util.Arrays.equals(get, that)
  }
}

Then we can side step the issue and add a comment as to why. What do you think of that?

@miguno
Copy link
Author

miguno commented Feb 11, 2015

So you mean we'd tell CMS users* to not try to work on a CMS[Array[Byte]] directly, but rather convert their original Array[Byte] instances to a BytesWrapper, and use a CMS[BytesWrapper] instead?

If that's the case, would it make sense to add such an Array[Byte] wrapper to Bijection?

Either way (Bijection or not) we could proceed with the current PR modifications: we'd exclude direct Array[Byte] support as described above, and tell CMS users to rely on the wrapper to work on Array[Byte] if that's what they need.

*If I understand correctly we can't automagically use such a wrapper behind the scenes, given the previous findings.

@johnynek
Copy link
Collaborator

Sounds right to me. I don't know of a way to automatically wrap. But we could provide the right wrapper so that it is easier.

Note that MinHashSignature is basically the same thing, but it does not get the hashCode and equality right. If we put the hashCode/equals/toString methods in an object we could call them there too.

@miguno
Copy link
Author

miguno commented Feb 11, 2015

Ok. I'll take a stab at this when I'm back early next week. Thanks for the quick feedback, Oscar!

On 11.02.2015, at 19:25, P. Oscar Boykin [email protected] wrote:

Sounds right to me. I don't know of a way to automatically wrap. But we could provide the right wrapper so that it is easier.

Note that MinHashSignature is basically the same thing, but it does not get the hashCode and equality right. If we put the hashCode/equals/toString methods in an object we could call them there too.


Reply to this email directly or view it on GitHub.

@miguno
Copy link
Author

miguno commented Feb 11, 2015

PS: Note that the wrapper cannot be a value class because we must override hashCode and equals. Doh!

@miguno
Copy link
Author

miguno commented Feb 11, 2015

To play devil's advocate: Apart from a Murmur3-based hashCode the user would achieve the same effect by converting Array[Byte] into a Seq[Byte] (backed by a WrappedArray) via array.toSeq and then use a CMS[Seq[Byte]].

Also, in CMS we use a separate hash code setup that is decoupled from the hashCode of type K, so whether or not the Array[Byte] wrapper would have a sane, Murmur3-based hashCode (compared to normal Array[Byte]) would not matter for the CMS use case.

Michael G. Noll added 2 commits February 17, 2015 16:59
…y[Byte] support

This commit includes:

- We remove ClassTag usage and track heavy hitters with a Set() instead of a SortedSet()
- We drop direct support for Arrays.  For the convenience of Algebird users we do provide
  a wrapper class called Bytes to safely wrap Array[Byte] for use with CMS.
@miguno
Copy link
Author

miguno commented Feb 17, 2015

Please take a look at the latest changes.

  • We do not use ClassTag anymore.
  • We drop direct Array (and notably Array[Byte]) support.
  • For the convenience of Algebird users we include a Bytes wrapper class that provides sane implementations of equals, hashCode, and toString as well as an Ordering typeclass. CMS ships with an implicit CMSHasher for Bytes so that you do not have to use contramap to get a CMS[Bytes]. (using contramap results in performance penalties, which I could confirm during benchmarking)
  • I enhanced the docs to explain the use of contramap and how to workaround the Array-related issues (e.g. via the new Bytes wrapper).

I also did some benchmarking/profiling to compare the performance (and memory impacts etc.) of Bytes vs. Seq[Byte] to work on Array[Byte]. In a nutshell, both are in the same ballpark. Bytes was slightly faster than Seq[Int], hence I decided to keep it in the code (instead of adding docs that would tell users to manually convert Array[Byte] into Seq[Byte] via toSeq.)

@miguno
Copy link
Author

miguno commented Feb 17, 2015

Also, we use a Set instead of a SortedSet now. The TopPctCMS does not care either way, but the TopNCMS currently needs to perform a sort of the heavy hitters whenever an instance is being updated (read: a count is performed).

The reason for this is that it isn't easy to run with two different Set implementations (one for TopPctCMS, one for TopNCMS) given the current code structure.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need some tests to exercise the case that they are not all equal. Right now, equals(t: Any) == (t != null) and compare(a, b) == 0 would pass your tests

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch regarding the equals test. In the latest commit I enhanced the test spec of Bytes in general.

compare(a, b) == 0 would pass your tests

As part of the changes I also enhanced the compare tests, although compare(a, b) == 0 was already failing the tests prior to the enhancements. Hence I'm not sure but hope that the enhancements will still cover any gap you found yesterday for compare.

@johnynek
Copy link
Collaborator

Looks good. I'm going to wait on merging this until we publish 0.10 just in case we find some bugs or something on 0.9.

Thanks for doing this.

@johnynek johnynek added this to the 0.10.0 milestone Feb 18, 2015
@miguno
Copy link
Author

miguno commented Feb 18, 2015

Ok, sounds like a plan. Thanks from my side for your patience on finalizing this pull request, I do appreciate it.

@miguno miguno changed the title CMS: add contramap to convert CMS[K] to CMS[L] CMS: add contramap to convert CMS[K] to CMS[L], add support for String and Bytes Feb 18, 2015
@miguno miguno changed the title CMS: add contramap to convert CMS[K] to CMS[L], add support for String and Bytes CMS: add contramap to convert CMS[K] to CMS[L], add support for String and Bytes, remove Ordering context bound for K Feb 18, 2015
@johnynek
Copy link
Collaborator

I guess 0.10 is coming down the pipe, so I'm going to merge this.

johnynek added a commit that referenced this pull request Mar 31, 2015
CMS: add contramap to convert CMS[K] to CMS[L], add support for String and Bytes, remove Ordering context bound for K
@johnynek johnynek merged commit 3b2c292 into twitter:develop Mar 31, 2015
jnievelt pushed a commit that referenced this pull request May 11, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants