An implementation of t-digest numeric distribution sketching #495
Conversation
For reference, my blog post that describes my reasoning behind the binary tree algorithms backing the t-digest:
CI is showing a failure, but the log output makes it seem like the tests succeeded. I'm not able to replicate any failures on my local machine.
return type on public methods, please.
Comment on t-digest monoids: I'm pretty sure I can address the issue of the non-deterministic behavior of the monoid.
can we remove this?
@erikerlandson agreed on "statistically monoidal" - it seems fine, and is consistent with
@johnynek looks like some unrelated HLL unit test failed, but I can't re-run it.
This is non-deterministic, but I wonder if, by tracking a seed that we update on merges, we can make it deterministic and yet keep the property that in expectation it is randomized.
Yes, definitely. I was thinking of just using a Random object seeded with some variation of hashCode on the data, something along the lines of:

    import scala.util.Random

    // shuffle deterministically, seeding from the data itself
    def deterministicShuffle[T](data: Seq[T]): Seq[T] =
      if (data.isEmpty) data else new Random(data.head.hashCode).shuffle(data)
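For example (hypothetical usage, just to illustrate the intended property), repeated calls on the same data always produce the same ordering, because the seed is derived from the data itself:

    val xs = Vector(1, 2, 3, 4, 5)
    // same data => same seed => same shuffle
    assert(deterministicShuffle(xs) == deterministicShuffle(xs))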
This is exciting, and I really appreciate the work on this. I want to get this merged, but I also hope you agree with us that careful code review is important.

One thing I'm wondering: could we break this into two PRs? One to add the red/black trees with tests, and a second, building clearly on the first, that uses the red/black trees to implement the t-digest.

What I'm not seeing yet in the review is why scala.collection.immutable.SortedMap (http://www.scala-lang.org/api/current/index.html#scala.collection.immutable.SortedMap) won't work here. Could you comment in the code? Sorry if I missed it. Also, red/black trees are O(log_2 N), but the hash-trie based standard maps are effectively O(log_32 N). Can you comment on what features we need from the red/black tree that the faster hash-trie can't give us?
@johnynek, I agree, it's a big PR with a lot going on, and I think that reviews may as well be rigorous, otherwise they aren't as useful.
@johnynek regarding inheriting from
Regarding the particular use of binary trees, one nontrivial reason for that choice is that binary trees support log-time maintenance and queries for prefix sums, as well as my "cover" constructs and a couple of other basic algorithms. So there is more going on than just mapping objects. Although that might be possible in a non-binary-tree data structure, I think the logic would be a lot harder at best, and maybe not possible at all. Using red-black trees was just a way to guarantee balanced trees with a well-understood data structure that I had a hope of getting to work along with all the other functionality I was layering on top :)
Maybe an even better way to explain it is that a lot of the logic I need to support requires that clusters are maintained in numeric / location order, so any hash-based container will not give me what I need. A tree data structure maintains the ordering by (numeric) key that I need, along with the desirable log-time operations.
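To make the SortedMap comparison concrete, here is a minimal sketch (hypothetical names, not code from this PR): an ordered map can give you the clusters at or below a point, but summing their mass is still a linear scan of that prefix, whereas a balanced tree whose nodes cache subtree mass totals can answer the same prefix-sum query in O(log n).

    import scala.collection.immutable.SortedMap

    // clusters keyed by centroid location, value = cluster mass
    val clusters: SortedMap[Double, Double] =
      SortedMap(1.0 -> 2.0, 3.5 -> 1.0, 7.2 -> 4.0)

    // total mass at or below x: with a plain SortedMap this is an O(n)
    // scan of the prefix; an augmented tree that maintains per-subtree
    // mass sums can answer it in O(log n), which is what the t-digest needs
    def prefixMass(x: Double): Double =
      clusters.iterator.takeWhile { case (k, _) => k <= x }.map(_._2).sum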
@johnynek regarding separate Red/Black node classes: in an earlier iteration I was doing that, but the code wasn't as DRY. To make a long story short, it resulted in two parallel copies of the internal node logic (which is where most of the logic is). So I felt that designing around a single internal node class, with color as one field, was the cleaner solution.
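As a rough illustration of that design choice (a sketch with hypothetical names, not the PR's actual classes), the color becomes an ordinary field on a single internal-node type, so insertion and rebalancing logic can pattern match on the color rather than being duplicated across separate red and black node classes:

    sealed trait Color
    case object Red extends Color
    case object Black extends Color

    sealed trait Node[K, V]
    case class Leaf[K, V]() extends Node[K, V]
    // one internal node class; color is just data
    case class INode[K, V](color: Color, key: K, value: V,
                           left: Node[K, V], right: Node[K, V]) extends Node[K, V]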
@johnynek splitting into a "tree PR" and a "t-digest PR" is OK with me; would you like me to pull the trigger on that?
@erikerlandson Yes. Please.
(force-pushed 8f10e56 to 4356629)
The supporting tree/map library has been factored out to #496, and I will now be keeping this branch rebased off of that topic branch.
(force-pushed 4356629 to 05f387b)
…ith specific classes that inherit from the hierarchy which interact badly with type-widening
(force-pushed 3bdc3a4 to 503db35)
@johnynek @ianoc @avibryant is there still interest in #495 and #496?
I would very much like to see this PR merged. I am currently using the reference implementation, but would benefit from an Algebird-based implementation.
@kainoa21 coincidentally I started revisiting this last weekend. There was a request to un-factor the various tree functions to reduce the code bulk on the back end, which is still on my to-do list.
      // if we have already distributed all the mass, remaining clusters unchanged
      cmNew = cmNew :+ ((clust.centroid, clust.mass))
    } else if (xn == clust.centroid) {
      // if xn lies exactly on the centroid, add all mass in regardless of bound
What's the justification for this? It doesn't seem to come from the paper.
In fact, there is explicit acknowledgment that two clusters could have the same centroid:
"For centroids with identical means, order of creation is used as a tie-breaker to allow an unambiguous ordering"
It allows the centroid to serve as a unique key, which simplifies things. Multiple clusters with the same centroid add nothing useful to the model.
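A minimal sketch of the invariant being described (hypothetical types, not the PR's code): if an incoming point lands exactly on an existing centroid, its mass is folded into that cluster rather than creating a second cluster under the same key.

    case class Cluster(centroid: Double, mass: Double)

    // keep centroid values unique: merge mass into an existing cluster at x,
    // or create a new cluster if no cluster sits exactly at x
    def addAt(clusters: Map[Double, Cluster], x: Double, m: Double): Map[Double, Cluster] =
      clusters.get(x) match {
        case Some(c) => clusters.updated(x, c.copy(mass = c.mass + m))
        case None    => clusters.updated(x, Cluster(x, m))
      }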
Sorry, we did not communicate clearly on this. This is indeed an exciting algorithm; however, if we are going to become the maintainers of it, it needs to be significantly smaller in size. An implementation that could reuse existing Scala collection classes may be small enough to accept. Honestly, the best approach is probably for @erikerlandson to provide a low-dependency subproject that hosts the t-digest, and then we could have
I'd like to add that these cases are a challenge. We want to welcome contributions, but inclusion can seem like an endorsement, which has burned us a few times in the past when implementations bit-rot or perform poorly (not saying that will happen here; I just mean that we need to really have the bandwidth to understand and maintain the code in order to add it).
@johnynek What about an
This is now available as a package here: so I'm going to close this out.
As described in the following paper:
Computing Extremely Accurate Quantiles Using t-Digests
Ted Dunning and Otmar Ertl
https://github.com/tdunning/t-digest/blob/master/docs/t-digest-paper/histo.pdf