Stars
An open source ML system for the end-to-end data science lifecycle
Apache Spark - A unified analytics engine for large-scale data processing
An efficient updatable key-value store for Apache Spark
Code to accompany Advanced Analytics with Spark from O'Reilly Media
Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.
Interactive and Reactive Data Science using Scala and Spark.
Example code from the Learning Spark book
A collection of MapReduce tasks translated (from Pig, Hive, MapReduce streaming, Cascalog, etc.) into Scalding.
Original unmaintained version of the Lightbeam extension. See lightbeam-we for the new one, which works in modern versions of Firefox.
A private messenger for Windows, macOS, and Linux.
The FourthParty web measurement platform.
Magpie contains a number of scripts for running Big Data software in HPC environments, including Hadoop and Spark. There is support for Lustre, Slurm, Moab, Torque, LSF, Flux, and more.
The winning solution to the Higgs Boson Machine Learning Challenge.
Solution to the Higgs Boson Machine Learning Challenge on Kaggle
Oryx 2: Lambda architecture on Apache Spark and Apache Kafka for real-time, large-scale machine learning
Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
Storm-yarn enables Storm clusters to be deployed into machines managed by Hadoop YARN.
A simple demonstration of sub-sequence sampling as used for anomaly detection with EKG signals
Seamless multi-primary syncing database with an intuitive HTTP/JSON API, designed for reliability
A repository of information, examples and good practices around the Lambda Architecture
The official home of the Presto distributed SQL query engine for big data
Lightning-fast cluster computing in Java, Scala and Python.