Skip to content

Conversation

@johnynek
Copy link
Collaborator

Talked about this in issue #358

/cc @ejconlon @avibryant @Gabriel439

@johnynek
Copy link
Collaborator Author

This basically makes good on the idea that Scalding is a library to run Algebird on Hadoop. This puts all the functions we have on reducers in scalding into the Aggregator object and composes the Aggregators using the optimized composition from TupleSemigroupN.

We could add code to apply Aggregators in parallel to large IndexedSeqs: partition them into N where there are N processors on the machine, each thread reduces part of the IndexedSeq, then have one thread finish combining the N results.

@avibryant
Copy link
Contributor

A huge number of the new Aggregators you added have identity present functions. Worth just having a trait for that?

@avibryant
Copy link
Contributor

Is it worth adding a bunch of convenience aggregators for common T values of Aggregator.fromMonoid[T]? I'm thinking of eg Aggregator.long instead of Aggregator.fromMonoid[Long].

@avibryant
Copy link
Contributor

... in fact I think most of them could be captured by

object Aggregator {
   def fromPrepare[A,B:Monoid](fn: A => B) = new MonoidAggregator[A,B] {
     type B = B
     def prepare(a: A) = fn(a)
     def present(b: B) = b
   }
}

@johnynek
Copy link
Collaborator Author

@avibryant good call. Will do.

Also, we talked a lot (internally with @ianoc and @Gabriel439 ) about hiding the B type. It occurred to me that in summingbird, the B type will be relevant because you will need a store of that type. Also, in a system with typesafe serialization (unlike scalding), you will need to serialize items of types B across the mappers to reducers. This change (in addition to breaking old code) also makes this impossible.

Our rough consensus here is to back out the abstract type B and put it back into the type so that we do not hit these problems.

Notice, If you have an

/**
 * S[T] could be Bufferable[T], Pickler[T] or Store[K, T]
 * The result can hide the B because S and Aggregator are bound together
 */
def with[S[_] : Applicative[S], A, B, C](Aggregator[A, B, C],  S[B]): LiftedAggregator[S, A, C]

That said, all of this starts looking pretty complex (and scala's type inference on higher kinded typeclasses, like Applicative is not great, so it can get ugly).

So, any comments on backing out the B type? Does the justification about using it with Store/etc... sound legit?

@avibryant
Copy link
Contributor

Ok, that's a reasonably compelling argument.

@Gabriella439
Copy link
Contributor

I'm in favor of:

  • keeping Aggregator the way it was (with B in the type)
  • defining a separate type named Metric (this is what tsar calls it) that hides B and bundles an injection between B to Array[Byte] for serialization purposes
  • defining a conversion function named measure from Aggregator to Metric where you supply the injection. The laws for measure are that it is an applicative homomorphism:
measure(injection, a join b) = measure(injection, a) join measure(injection, b)
measure(injection, Aggregator.apply(a)) = Metric.apply(a)

In other words, Metric handles the common, happy path and Aggregator handles more advanced use cases.

The latter two points don't need to be part of this pull request. For the purpose of this pull request I'm fine with just restoring Aggregator to the original type.

@MansurAshraf
Copy link
Contributor

I hope we decide to keep pre defined aggregators for count, min, max, unique etc as making easier for users to join multiple aggregators and use them after a GroupBy was the original intent of this refactoring

@johnynek
Copy link
Collaborator Author

@Gabriel439 +1 to your suggestions. Let's follow on to this. One small issue: right now, bijection is not a dependency of algebird but there is a algebird-bijection package, we could put it in there or reconsider the dependency heirarchy.

@johnynek
Copy link
Collaborator Author

@MansurAshraf yes. I kept the named aggregators (with names from scalding's KeyedListLike type).

Note: composing aggregators (with the GeneratedTupleAggregators which are also called via join) uses the composed Semigroups so that sumOption should be called for the semigroups that have optimized that.

@Gabriella439
Copy link
Contributor

@johnynek I'd be fine with this in algebird-bijection. I'm all in favor of good dependency hygiene.

ianoc added a commit that referenced this pull request Nov 26, 2014
@ianoc ianoc merged commit c0bb4e8 into develop Nov 26, 2014
@ianoc ianoc deleted the aggregator-type branch November 26, 2014 23:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants