Improvements to Aggregator #359

johnynek · 2014-11-18T22:29:43Z

Talked about this in issue #358

johnynek · 2014-11-19T21:20:06Z

This basically makes good on the idea that Scalding is a library to run Algebird on Hadoop. This puts all the functions we have on reducers in scalding into the Aggregator object and composes the Aggregators using the optimized composition from TupleSemigroupN.

We could add code to apply Aggregators in parallel to large IndexedSeqs: partition them into N where there are N processors on the machine, each thread reduces part of the IndexedSeq, then have one thread finish combining the N results.
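A minimal sketch of that parallel idea, assuming a simplified local `Agg` trait as a stand-in for Algebird's Aggregator. `Agg`, `aggregate`, and `sumLengths` are hypothetical names for illustration, not part of this pull request. Chunks are contiguous and their partial results are combined in order, so `combine` only needs to be associative.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

object ParallelAggregate {
  // Hypothetical stand-in for an Aggregator with a visible intermediate type B.
  trait Agg[A, B, C] {
    def prepare(a: A): B
    def combine(l: B, r: B): B // assumed associative
    def present(b: B): C
  }

  // Split items into at most n contiguous chunks, reduce each chunk in its own
  // Future, then combine the partial results on the calling thread.
  def aggregate[A, B, C](
      items: IndexedSeq[A],
      agg: Agg[A, B, C],
      n: Int = Runtime.getRuntime.availableProcessors
  ): Option[C] = {
    require(n >= 1, "need at least one partition")
    if (items.isEmpty) None
    else {
      val chunkSize = (items.size + n - 1) / n // ceiling division, at least 1
      val partials: List[Future[B]] =
        items.grouped(chunkSize).toList.map { chunk =>
          Future {
            chunk.iterator
              .map(a => agg.prepare(a))
              .reduce((l: B, r: B) => agg.combine(l, r))
          }
        }
      val results: List[B] = Await.result(Future.sequence(partials), Duration.Inf)
      Some(agg.present(results.reduce((l: B, r: B) => agg.combine(l, r))))
    }
  }

  // Example aggregator: total length of all strings.
  val sumLengths: Agg[String, Long, Long] = new Agg[String, Long, Long] {
    def prepare(s: String): Long = s.length.toLong
    def combine(l: Long, r: Long): Long = l + r
    def present(b: Long): Long = b
  }
}
```

Blocking with `Await` on the global execution context keeps the sketch short; a real implementation would probably want a dedicated thread pool and a size threshold below which it stays sequential.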

avibryant · 2014-11-20T05:58:17Z

A huge number of the new Aggregators you added have identity present functions. Worth just having a trait for that?

avibryant · 2014-11-20T06:04:42Z

Is it worth adding a bunch of convenience aggregators for common T values of Aggregator.fromMonoid[T]? I'm thinking of, e.g., Aggregator.long instead of Aggregator.fromMonoid[Long].

avibryant · 2014-11-20T06:08:36Z

... in fact I think most of them could be captured by

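object Aggregator {
  // T is the summed type; the type parameter is not called B, so that
  // `type B = T` is not a cyclic alias. present is the identity.
  def fromPrepare[A, T: Monoid](fn: A => T) = new MonoidAggregator[A, T] {
    type B = T
    def prepare(a: A) = fn(a)
    def present(b: T) = b
  }
}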
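For comparison, here is a self-contained sketch of the same idea with the intermediate type written as an ordinary type parameter. The `Monoid` and `MonoidAggregator` traits below are minimal local stand-ins, not Algebird's real definitions, and `long` and `count` are only illustrations of the convenience aggregators suggested above.

```scala
object FromPrepareSketch {
  // Minimal local stand-ins; the real Algebird traits have more members.
  trait Monoid[T] {
    def zero: T
    def plus(l: T, r: T): T
  }

  trait MonoidAggregator[A, B, C] {
    def monoid: Monoid[B]
    def prepare(a: A): B
    def present(b: B): C
    // Fold every prepared value into the monoid, then present the total.
    def apply(as: Iterable[A]): C =
      present(as.foldLeft(monoid.zero)((b, a) => monoid.plus(b, prepare(a))))
  }

  // An aggregator with an identity present: only prepare has to be supplied.
  def fromPrepare[A, B](fn: A => B)(implicit m: Monoid[B]): MonoidAggregator[A, B, B] =
    new MonoidAggregator[A, B, B] {
      val monoid: Monoid[B] = m
      def prepare(a: A): B = fn(a)
      def present(b: B): B = b
    }

  def fromMonoid[T](implicit m: Monoid[T]): MonoidAggregator[T, T, T] =
    fromPrepare[T, T](identity)

  implicit val longMonoid: Monoid[Long] = new Monoid[Long] {
    val zero: Long = 0L
    def plus(l: Long, r: Long): Long = l + r
  }

  // Convenience aggregators built from the two helpers above.
  val long: MonoidAggregator[Long, Long, Long] = fromMonoid[Long]
  def count[A]: MonoidAggregator[A, Long, Long] = fromPrepare[A, Long](_ => 1L)
}
```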

johnynek · 2014-11-20T19:10:25Z

@avibryant good call. Will do.

Also, we talked a lot (internally with @ianoc and @Gabriel439) about hiding the B type. It occurred to me that in summingbird the B type will be relevant, because you will need a store of that type. Also, in a system with typesafe serialization (unlike scalding), you will need to serialize items of type B from the mappers to the reducers. Hiding B (in addition to breaking old code) makes this impossible.

Our rough consensus here is to back out the abstract type B and restore it as a type parameter so that we do not hit these problems.

Notice that if you have a function like the following, the result can hide B:

/**
 * S[T] could be Bufferable[T], Pickler[T] or Store[K, T]
 * The result can hide the B because S and Aggregator are bound together
 */
def `with`[S[_]: Applicative, A, B, C](agg: Aggregator[A, B, C], s: S[B]): LiftedAggregator[S, A, C]

That said, all of this starts looking pretty complex (and Scala's type inference on higher-kinded type classes like Applicative is not great, so it can get ugly).

So, any comments on backing out the B type? Does the justification about using it with Store/etc... sound legit?

avibryant · 2014-11-20T19:12:52Z

Ok, that's a reasonably compelling argument.

Gabriella439 · 2014-11-20T21:40:22Z

I'm in favor of:

- keeping Aggregator the way it was (with B in the type)
- defining a separate type named Metric (this is what tsar calls it) that hides B and bundles an injection from B to Array[Byte] for serialization purposes
- defining a conversion function named measure from Aggregator to Metric where you supply the injection. The laws for measure are that it is an applicative homomorphism:

measure(injection, a join b) = measure(injection, a) join measure(injection, b)
measure(injection, Aggregator.apply(a)) = Metric.apply(a)

In other words, Metric handles the common, happy path and Aggregator handles more advanced use cases.
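A rough sketch of what such a Metric and measure could look like, under several assumptions: `ByteInjection` is a hypothetical local stand-in for an Injection from B to Array[Byte], the `Aggregator` trait is a simplified local version, and none of these signatures come from Algebird or bijection. The sketch covers only the conversion; it does not define join or check the homomorphism laws.

```scala
import java.nio.ByteBuffer

object MetricSketch {
  // Hypothetical stand-in for an injection from B to Array[Byte]:
  // encoding always succeeds, decoding may fail.
  trait ByteInjection[B] {
    def apply(b: B): Array[Byte]
    def invert(bytes: Array[Byte]): Option[B]
  }

  // Simplified aggregator with the intermediate type B in the signature.
  trait Aggregator[A, B, C] {
    def prepare(a: A): B
    def reduce(l: B, r: B): B
    def present(b: B): C
  }

  // Metric exposes only A and C; intermediate state travels as bytes.
  trait Metric[A, C] {
    def prepare(a: A): Array[Byte]
    def reduce(l: Array[Byte], r: Array[Byte]): Option[Array[Byte]]
    def present(bytes: Array[Byte]): Option[C]
  }

  // Bundle an aggregator with an injection, hiding B from the result type.
  def measure[A, B, C](inj: ByteInjection[B], agg: Aggregator[A, B, C]): Metric[A, C] =
    new Metric[A, C] {
      def prepare(a: A): Array[Byte] = inj(agg.prepare(a))
      def reduce(l: Array[Byte], r: Array[Byte]): Option[Array[Byte]] =
        for { x <- inj.invert(l); y <- inj.invert(r) } yield inj(agg.reduce(x, y))
      def present(bytes: Array[Byte]): Option[C] = inj.invert(bytes).map(b => agg.present(b))
    }

  // Example injection: a Long as 8 big-endian bytes.
  val longBytes: ByteInjection[Long] = new ByteInjection[Long] {
    def apply(b: Long): Array[Byte] = ByteBuffer.allocate(8).putLong(b).array()
    def invert(bytes: Array[Byte]): Option[Long] =
      if (bytes.length == 8) Some(ByteBuffer.wrap(bytes).getLong) else None
  }

  // Example aggregator: length of the longest string seen.
  val maxLengthAgg: Aggregator[String, Long, Long] = new Aggregator[String, Long, Long] {
    def prepare(s: String): Long = s.length.toLong
    def reduce(l: Long, r: Long): Long = math.max(l, r)
    def present(b: Long): Long = b
  }

  val maxLength: Metric[String, Long] = measure(longBytes, maxLengthAgg)
}
```

Decoding on every reduce is wasteful. The point of the sketch is only that callers of `maxLength` never see the Long in the middle.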

The latter two points don't need to be part of this pull request. For the purpose of this pull request I'm fine with just restoring Aggregator to the original type.

MansurAshraf · 2014-11-20T22:24:44Z

I hope we decide to keep the predefined aggregators for count, min, max, unique, etc., since making it easier for users to join multiple aggregators and use them after a GroupBy was the original intent of this refactoring.

johnynek · 2014-11-24T18:07:05Z

@Gabriel439 +1 to your suggestions. Let's follow up on this. One small issue: right now, bijection is not a dependency of algebird, but there is an algebird-bijection package. We could put it in there or reconsider the dependency hierarchy.

johnynek · 2014-11-24T18:09:08Z

@MansurAshraf yes. I kept the named aggregators (with names from scalding's KeyedListLike type).

Note: composing aggregators (with the GeneratedTupleAggregators, which are also called via join) uses the composed Semigroups, so sumOption should be called on each component semigroup that has an optimized implementation of it.
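To illustrate the point about composed semigroups, here is a simplified two-element sketch. The `Semigroup`, `Tuple2Semigroup`, and `Aggregator` definitions are local stand-ins written for this example, not the generated Algebird code, and unzipping into vectors is a simplification chosen for brevity. What matters is that the tuple semigroup's sumOption hands each column to its own component's sumOption.

```scala
object JoinSketch {
  trait Semigroup[T] {
    def plus(l: T, r: T): T
    // Default pairwise sum; instances may override this with a faster batch sum.
    def sumOption(items: Iterator[T]): Option[T] =
      if (items.hasNext) Some(items.reduce((l: T, r: T) => plus(l, r))) else None
  }

  // Tuple semigroup whose sumOption delegates to each component's sumOption.
  class Tuple2Semigroup[X, Y](sx: Semigroup[X], sy: Semigroup[Y]) extends Semigroup[(X, Y)] {
    def plus(l: (X, Y), r: (X, Y)): (X, Y) = (sx.plus(l._1, r._1), sy.plus(l._2, r._2))
    override def sumOption(items: Iterator[(X, Y)]): Option[(X, Y)] =
      if (!items.hasNext) None
      else {
        val (xs, ys) = items.toVector.unzip
        for { x <- sx.sumOption(xs.iterator); y <- sy.sumOption(ys.iterator) } yield (x, y)
      }
  }

  trait Aggregator[A, B, C] { self =>
    def prepare(a: A): B
    def semigroup: Semigroup[B]
    def present(b: B): C
    // Returns None for empty input in this sketch.
    def apply(as: Iterable[A]): Option[C] =
      semigroup.sumOption(as.iterator.map(a => prepare(a))).map(b => present(b))
    // Run two aggregators over the same input in a single pass.
    def join[B2, C2](that: Aggregator[A, B2, C2]): Aggregator[A, (B, B2), (C, C2)] =
      new Aggregator[A, (B, B2), (C, C2)] {
        def prepare(a: A): (B, B2) = (self.prepare(a), that.prepare(a))
        val semigroup: Semigroup[(B, B2)] = new Tuple2Semigroup(self.semigroup, that.semigroup)
        def present(b: (B, B2)): (C, C2) = (self.present(b._1), that.present(b._2))
      }
  }

  def fromSemigroup[T](sg: Semigroup[T]): Aggregator[T, T, T] = new Aggregator[T, T, T] {
    def prepare(a: T): T = a
    val semigroup: Semigroup[T] = sg
    def present(b: T): T = b
  }

  // Sum overrides sumOption with a batch implementation; max keeps the default.
  val intSum: Semigroup[Int] = new Semigroup[Int] {
    def plus(l: Int, r: Int): Int = l + r
    override def sumOption(items: Iterator[Int]): Option[Int] =
      if (items.hasNext) Some(items.sum) else None
  }
  val intMax: Semigroup[Int] = new Semigroup[Int] {
    def plus(l: Int, r: Int): Int = math.max(l, r)
  }

  val sumAndMax: Aggregator[Int, (Int, Int), (Int, Int)] =
    fromSemigroup(intSum).join(fromSemigroup(intMax))
}
```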

Gabriella439 · 2014-11-24T18:10:55Z

@johnynek I'd be fine with this in algebird-bijection. I'm all in favor of good dependency hygiene.

johnynek added 3 commits November 18, 2014 14:27

Improvements to Aggregator

062e702

use Semigroup in Aggregator

71e988c

Fix some comments

1a8ef90

johnynek added 3 commits November 19, 2014 13:43

Added Aggregator.exists

25cec20

Merge in CMS changing

42d0879

Fix a merge issue

7a93cd7

Revert the middle type of Aggregator as an abstract type

6e5b27b

Keep the type in uniqueCount

4d23dc8

ianoc added a commit that referenced this pull request Nov 26, 2014

Merge pull request #359 from twitter/aggregator-type

c0bb4e8

Improvements to Aggregator

ianoc merged commit c0bb4e8 into develop Nov 26, 2014

ianoc deleted the aggregator-type branch November 26, 2014 23:57


6 participants
