Hadoop
winutils.exe, hadoop.dll, and hdfs.dll binaries for Hadoop on Windows
Example source code accompanying O'Reilly's "Hadoop: The Definitive Guide" by Tom White
The official home of the Presto distributed SQL query engine for big data
Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Upserts, Deletes And Incremental Processing on Big Data.
Scalable, reliable, distributed storage system optimized for data analytics and object store workloads.
Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, etc. It also comes with built-in Hadoop support.
Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
Pentaho Data Integration (ETL), a.k.a. Kettle
Apache Atlas - Open Metadata Management and Governance capabilities across the Hadoop platform and beyond
Apache Ranger - To enable, monitor and manage comprehensive data security across the Hadoop platform and beyond
A composable and fully extensible C++ execution engine library for data management systems.
Apache DataFusion Comet Spark Accelerator
Gluten is a middle layer responsible for offloading JVM-based SQL engines' execution to native engines.
A cross-platform way to express data transformations, relational algebra, standardized record expressions, and plans.