Stars
Public collaboration of Scalable Single Cell Analytics
Examples of using CloudML with genomic data.
A set of command line tools (in Java) for manipulating high-throughput sequencing (HTS) data and formats such as SAM/BAM/CRAM and VCF.
Example Spark integrations.
Apache Spark jobs such as Principal Coordinate Analysis.
A new data structure for accurate on-line accumulation of rank-based statistics such as quantiles and trimmed means
Apache Spark - A unified analytics engine for large-scale data processing
Network-based vaccination game
A simple demonstration of sub-sequence sampling as used for anomaly detection with EKG signals
Stanford Network Analysis Platform (SNAP) is a general purpose network analysis and graph mining library.
ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.
scikit-learn: machine learning in Python
aka "Bayesian Methods for Hackers": An introduction to Bayesian methods + probabilistic programming with a computation/understanding-first, mathematics-second point of view. All in pure Python ;)
Breeze is/was a numerical processing library for Scala.
CoreNLP: A Java suite of core NLP tools for tokenization, sentence segmentation, NER, parsing, coreference, sentiment analysis, etc.
OpenRefine is a free, open source power tool for working with messy data and improving it
Streaming MapReduce with Scalding and Storm
BlinkDB: Sub-Second Approximate Queries on Very Large Data.
A Python script for summarizing articles using NLTK
Twitter common libraries for Python and the JVM (deprecated)
Twitter's collection of LZO and Protocol Buffer-related Hadoop, Pig, Hive, and HBase code.
An accelerated framework for manipulating and interpreting high-throughput sequencing data
Lightning-fast cluster computing in Java, Scala and Python.
Machine Learning / Natural Language Processing / Information Retrieval