Conversation

@johnynek (Collaborator)

This needs tests to make sure the implicits are wired up correctly, but it seems like it's time to create an algebird-spark module to make it easy to use Algebird with Spark.
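
A minimal sketch of the kind of enrichment being proposed here, assuming an implicit conversion from RDD into an Algebird-aware wrapper; this is illustrative only, not the merged implementation:

    import scala.language.implicitConversions
    import scala.reflect.ClassTag
    import org.apache.spark.rdd.RDD
    import com.twitter.algebird.Semigroup

    // Wrap an RDD so that Algebird's Semigroup drives the summing.
    class AlgebirdRDD[T](val rdd: RDD[T]) extends AnyVal {
      // Sum every element with the implicit Semigroup[T]; None if the RDD is empty.
      def sumOption(implicit sg: Semigroup[T], ct: ClassTag[T]): Option[T] = {
        // Sum each partition locally, then combine the partial sums on the driver.
        val partials = rdd.mapPartitions(it => Semigroup.sumOption(it).iterator).collect()
        Semigroup.sumOption(partials)
      }
    }

    object ToAlgebird {
      implicit def toAlgebird[T](rdd: RDD[T]): AlgebirdRDD[T] = new AlgebirdRDD(rdd)
    }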

@johnynek (Collaborator, Author)

@ianoc and @tsdeng care to look at this?

(Contributor)

Why aggregatorOnAll and aggregatorByKey rather than aggregateAll and aggregateByKey? The latter seem a closer fit to Spark's reduceByKey.

(Collaborator, Author)

There is already an aggregateByKey. I will add tests and see whether the implicit enrichment can differentiate based on the arguments. It may, or it may give the user a very confusing message. Let's see.
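
For context, Spark's PairRDDFunctions already has aggregateByKey taking a zero value plus two functions, while the method here takes an Algebird Aggregator, so the argument lists do differ; a rough sketch of the Aggregator-based shape (illustrative, not the merged code):

    import scala.reflect.ClassTag
    import org.apache.spark.rdd.RDD
    import com.twitter.algebird.Aggregator

    object AggregateByKeySketch {
      // Spark's built-in is aggregateByKey(zeroValue)(seqOp, combOp).
      // An Aggregator-based version only needs the Aggregator itself:
      def aggregateByKeySketch[K: ClassTag, V: ClassTag, B: ClassTag, C](
          rdd: RDD[(K, V)])(agg: Aggregator[V, B, C]): RDD[(K, C)] =
        rdd
          .mapValues(v => agg.prepare(v))                  // V => B
          .reduceByKey((l, r) => agg.semigroup.plus(l, r)) // combine B values per key
          .mapValues(b => agg.present(b))                  // B => C
    }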

@ianoc (Collaborator)

ianoc commented Jan 13, 2015

Any tests? Looks fine modulo Avi's comment about naming; it would be nice to be closer to Spark's.


  • Edit: Doh, should have read the code more closely before asking the first question. Never mind. :-)

  • Any particular reason to give a default of null to ordK and/or to use an implicit parameter instead of K: ClassTag: Ordering?

    implicit ordK: Ordering[K] = null

(Collaborator, Author)

Seems like Spark follows this style, which I assume means it can still work even if there is no Ordering. Scalding requires the ordering and uses the approach you suggest.
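
The two styles under discussion, as illustrative signatures only (bodies elided, not the merged code):

    import scala.reflect.ClassTag
    import org.apache.spark.rdd.RDD
    import com.twitter.algebird.Semigroup

    object OrderingStyles {
      // Spark's style: the Ordering is an implicit parameter defaulting to null,
      // so the method still resolves when no Ordering[K] is in scope.
      def sumByKeyNullableOrd[K, V](rdd: RDD[(K, V)])(
          implicit kt: ClassTag[K], vt: ClassTag[V], sg: Semigroup[V],
          ordK: Ordering[K] = null): RDD[(K, V)] = ???

      // Scalding's style: the Ordering is required via a context bound.
      def sumByKeyRequiredOrd[K: ClassTag: Ordering, V: ClassTag: Semigroup](
          rdd: RDD[(K, V)]): RDD[(K, V)] = ???
    }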

@adelbertc

Interested in this effort, especially with the algebra stuff going on. I've talked with a few people briefly about a more open-source effort to share common Spark things that are related to Algebird, e.g. having functions that take a Monoid instead of A => (A => A => A), etc.
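
A tiny illustration of that idea (a hypothetical helper, not an existing API): pass the Monoid once instead of threading a zero value and a combine function through Spark's API.

    import org.apache.spark.rdd.RDD
    import com.twitter.algebird.Monoid

    object MonoidFold {
      // Hypothetical helper: fold an RDD using a Monoid[A]. Note that Spark may
      // combine partition results in any order, so in practice this wants a
      // commutative Monoid for a deterministic answer.
      def foldWithMonoid[A](rdd: RDD[A])(implicit m: Monoid[A]): A =
        rdd.fold(m.zero)(m.plus(_, _))
    }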

@ianoc (Collaborator)

ianoc commented Mar 26, 2015

@johnynek This looks good to me; if we can fix the merge conflict, it's good to merge.

@johnynek (Collaborator, Author)

@ianoc we need tests too, right?

@virtualirfan

An example would be nice too.

@ianoc (Collaborator)

ianoc commented Mar 27, 2015

Tests would be great, but since these are some tiny implicits, I'd happily ship if manually verified and follow up with tests later.

@adelbertc

Latest version of Spark is now 1.3.0

@reconditesea (Contributor)

+1 for a small tutorial job showing how to make Spark work with Algebird.

@johnynek (Collaborator, Author)

johnynek commented Jul 1, 2015

@adelbertc can you review?

@ianoc this has tests now, and I think reasonably efficient implementations, especially for sum, sumOption, aggregate, aggregateOption.

I don't see a way to use sumOption in sumByKey or aggregateByKey without reimplementing or skipping map-side combining. I wish Spark had something like Scalding's sumByLocalKeys.
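
For reference, a rough sketch (not the merged implementation) of a sumByKey that keeps Spark's map-side combining by leaning on combineByKey; because combineByKey is driven entirely by binary merge functions, there is no natural hook for Semigroup.sumOption, which wants a whole iterator of values at once.

    import scala.reflect.ClassTag
    import org.apache.spark.rdd.RDD
    import com.twitter.algebird.Semigroup

    object SumByKeySketch {
      // Sketch only: combineByKey does the map-side combining for us, but it only
      // ever calls binary merges, so sumOption has no place to plug in without
      // reimplementing the combiner machinery.
      def sumByKey[K: ClassTag, V: ClassTag: Semigroup](rdd: RDD[(K, V)]): RDD[(K, V)] =
        rdd.combineByKey[V](
          (v: V) => v,                              // createCombiner
          (acc: V, v: V) => Semigroup.plus(acc, v), // mergeValue (map side)
          (l: V, r: V) => Semigroup.plus(l, r))     // mergeCombiners (reduce side)
    }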

@johnynek changed the title from "Initial sketch of using Algebird with spark" to "Algebird support for spark" on Jul 1, 2015.


Latest Spark version is 1.4.0 now, not sure how you want to handle versioning

(Collaborator)

It's provided, so the user brings their own. As long as the Spark APIs this uses stay ABI compatible, it should work fine for both versions.

@adelbertc

👍 LGTM; the only thing I'm concerned about is whether commutativity is needed in the sumOption case.

(Contributor)

ToAlgebird?

@johnynek (Collaborator, Author)

johnynek commented Jul 2, 2015

I think this addresses all review concerns. I also added Partitioner support for the cases where you want to control partitioning.

@ianoc (Collaborator)

ianoc commented Jul 2, 2015

lgtm, merge when green

@ianoc (Collaborator)

ianoc commented Jul 2, 2015

@johnynek in case you haven't seen, it's failing with:

    [info] Compiling 1 Scala source to /home/travis/build/twitter/algebird/algebird-bijection/target/scala-2.11/classes...
    [error] /home/travis/build/twitter/algebird/algebird-spark/src/main/scala/com/twitter/algebird/spark/AlgebirdRDD.scala:14: in class AlgebirdRDD, multiple overloaded alternatives of method sumByKey define default arguments.
    [error] class AlgebirdRDD[T](val rdd: RDD[T]) extends AnyVal {
    [error] ^
    [error] one error found
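
That error comes from a Scala language restriction rather than anything Spark-specific; a stripped-down illustration (hypothetical signatures, just to show the rule and one common workaround):

    object OverloadDefaults {
      // Scala rejects two overloads of the same method that both declare default
      // arguments, reporting: "multiple overloaded alternatives of method
      // sumByKey define default arguments."
      //   def sumByKey(numPartitions: Int = 1): Int = ???
      //   def sumByKey(label: String, numPartitions: Int = 1): Int = ???
      // Keeping defaults on only one alternative compiles fine:
      def sumByKey(numPartitions: Int = 1): Int = numPartitions
      def sumByKey(label: String, numPartitions: Int): Int = numPartitions
    }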

@johnynek (Collaborator, Author)

johnynek commented Jul 2, 2015

@ianoc yeah. I fixed it by adding a class that @non, @avibryant, and I (I think it was) discussed, and that @non put in algebra.

Still need to get Algebird to depend on algebra.

@ianoc (Collaborator)

ianoc commented Jul 2, 2015

Cool, lgtm anyway with this in place

@johnynek added a commit that referenced this pull request on Jul 2, 2015
@johnynek merged commit 917e6bf into develop on Jul 2, 2015
@johnynek deleted the algebird-spark branch on July 2, 2015 at 20:38
@johnynek (Collaborator, Author)

johnynek commented Jul 2, 2015

Coveralls is keeping this yellow. Need to get this sorted out.
