-
Notifications
You must be signed in to change notification settings - Fork 347
Improvements to Aggregator #359
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
This basically makes good on the idea that Scalding is a library to run Algebird on Hadoop. This puts all the functions we have on reducers in scalding into the Aggregator object and composes the Aggregators using the optimized composition from TupleSemigroupN. We could add code to apply Aggregators in parallel to large IndexedSeqs: partition them into N where there are N processors on the machine, each thread reduces part of the IndexedSeq, then have one thread finish combining the N results. |
|
A huge number of the new Aggregators you added have identity present functions. Worth just having a trait for that? |
|
Is it worth adding a bunch of convenience aggregators for common T values of |
|
... in fact I think most of them could be captured by object Aggregator {
def fromPrepare[A,B:Monoid](fn: A => B) = new MonoidAggregator[A,B] {
type B = B
def prepare(a: A) = fn(a)
def present(b: B) = b
}
} |
|
@avibryant good call. Will do. Also, we talked a lot (internally with @ianoc and @Gabriel439 ) about hiding the B type. It occurred to me that in summingbird, the B type will be relevant because you will need a store of that type. Also, in a system with typesafe serialization (unlike scalding), you will need to serialize items of types B across the mappers to reducers. This change (in addition to breaking old code) also makes this impossible. Our rough consensus here is to back out the abstract type B and put it back into the type so that we do not hit these problems. Notice, If you have an /**
* S[T] could be Bufferable[T], Pickler[T] or Store[K, T]
* The result can hide the B because S and Aggregator are bound together
*/
def with[S[_] : Applicative[S], A, B, C](Aggregator[A, B, C], S[B]): LiftedAggregator[S, A, C]That said, all of this starts looking pretty complex (and scala's type inference on higher kinded typeclasses, like Applicative is not great, so it can get ugly). So, any comments on backing out the B type? Does the justification about using it with Store/etc... sound legit? |
|
Ok, that's a reasonably compelling argument. |
|
I'm in favor of:
measure(injection, a join b) = measure(injection, a) join measure(injection, b)
measure(injection, Aggregator.apply(a)) = Metric.apply(a)In other words, The latter two points don't need to be part of this pull request. For the purpose of this pull request I'm fine with just restoring |
|
I hope we decide to keep pre defined aggregators for count, min, max, unique etc as making easier for users to |
|
@Gabriel439 +1 to your suggestions. Let's follow on to this. One small issue: right now, bijection is not a dependency of algebird but there is a algebird-bijection package, we could put it in there or reconsider the dependency heirarchy. |
|
@MansurAshraf yes. I kept the named aggregators (with names from scalding's KeyedListLike type). Note: composing aggregators (with the GeneratedTupleAggregators which are also called via join) uses the composed Semigroups so that sumOption should be called for the semigroups that have optimized that. |
|
@johnynek I'd be fine with this in |
Talked about this in issue #358
/cc @ejconlon @avibryant @Gabriel439