Starred repositories
Code repo for "An Empirical Evaluation of Columnar Storage Formats" VLDB Vol 17
An extensible, state-of-the-art columnar file format. Formerly at @spiraldb, now an Incubation Stage project at LF AI & Data, part of the Linux Foundation.
Build reliable AI and agentic applications with DataFrames
Protocol and libraries for sending and receiving OpenTelemetry data using Apache Arrow
DuckLake is an integrated data lake and catalog format
Native Rust TPC-H support for DataFusion using tpchgen
Spark integrations for working with Lance datasets
Lance Namespace is an open specification on top of the storage-based Lance table and file format to standardize access to a collection of Lance tables
TPC-H benchmark data generation in pure Rust
Olympia is a storage-only open catalog format for big data analytics, ML & AI.
Code used to create text embeddings of all Magic: The Gathering cards.
DataFusion TableProviders for reading data from other systems
Analytical database for data-driven Web applications 🪶
The Amazon S3 Tables catalog is a client library that bridges control plane operations provided by S3 Tables to engines like Apache Spark, Flink and others, when used with the Iceberg Table format
Batteries-included CLI, TUI, and server implementations for DataFusion.
🤗 The largest hub of ready-to-use datasets for AI models with fast, easy-to-use and efficient data manipulation tools
Monitoring and insights on your data lakehouse tables
data load tool (dlt) is an open source Python library that makes data loading easy 🛠️
The simplest, highest-throughput Python interface to S3, GCS & Azure Storage, powered by Rust.
Generate relevant synthetic data quickly for your projects. The Databricks Labs synthetic data generator (aka `dbldatagen`) may be used to generate large simulated / synthetic data sets for test, P…