LakeMLB (Data Lake / Lakehouse Machine Learning Benchmark)
Status: Work in Progress
LakeMLB is an evolving benchmark suite designed to evaluate the performance and scalability of machine learning models in data lake and lakehouse environments. It aims to provide a standardized framework for assessing how well ML algorithms handle large-scale, heterogeneous datasets while integrating with modern data architectures.
Goals

- Performance Evaluation: Benchmark training and inference performance across diverse data lake and lakehouse setups.
- Scalability Analysis: Assess how models scale with increasing data volumes and complexity.
- Data Integration: Test how ML models integrate with storage architectures ranging from traditional data lakes to modern lakehouses (see the sketch after this list).
- Reproducibility: Establish standardized tasks and metrics for fair comparisons between different ML approaches.
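To give a rough sense of the training-and-inference workload the performance and integration goals refer to, the sketch below times a small scikit-learn model on a Parquet table read from a lake path. It is illustrative only: the file path, column names, and the pandas/scikit-learn stack are assumptions, not a prescribed LakeMLB workload.

```python
# Illustrative only: the Parquet path, column names, and library choices
# (pandas + scikit-learn) are assumptions, not a prescribed LakeMLB workload.
import time

import pandas as pd
from sklearn.linear_model import LogisticRegression

# Read a table stored in the lake as Parquet; object-store URIs such as
# "s3://bucket/table.parquet" work the same way when the matching filesystem
# package (e.g. s3fs) is installed.
df = pd.read_parquet("lake/churn_features.parquet")   # hypothetical dataset
X, y = df.drop(columns=["label"]), df["label"]

# Time training and batched inference separately, since a benchmark would
# report them as distinct metrics.
start = time.perf_counter()
model = LogisticRegression(max_iter=1000).fit(X, y)
train_s = time.perf_counter() - start

start = time.perf_counter()
model.predict(X)
infer_s = time.perf_counter() - start

print(f"train: {train_s:.2f}s  inference: {infer_s:.2f}s "
      f"({len(df) / infer_s:,.0f} rows/s)")
```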
Key Features

- Standardized Benchmarks: A set of tasks that simulate real-world data lake scenarios.
- Comparative Metrics: Tools to measure throughput, accuracy, latency, and resource efficiency.
- Extensibility: Open framework allowing the community to add new benchmarks, models, and datasets (a possible registration interface is sketched after this list).
- Transparency: Detailed guidelines and documentation to reproduce and validate experimental results.
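To make the extensibility and metrics points concrete, here is one possible shape for a benchmark registry and per-task result record. Every name in it (BenchmarkResult, register, the toy task) is a hypothetical assumption; LakeMLB has not committed to this interface.

```python
# Hypothetical sketch of an extensible benchmark registry; these names are
# illustrative assumptions, not a published LakeMLB API.
import time
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class BenchmarkResult:
    name: str
    latency_seconds: float
    throughput_rows_per_s: float
    accuracy: float


REGISTRY: Dict[str, Callable[[], BenchmarkResult]] = {}


def register(name: str):
    """Decorator that adds a benchmark task to the shared registry."""
    def wrap(fn: Callable[[], BenchmarkResult]) -> Callable[[], BenchmarkResult]:
        REGISTRY[name] = fn
        return fn
    return wrap


@register("toy_classification")  # placeholder task name
def toy_classification() -> BenchmarkResult:
    # Trivial stand-in workload; a real task would train and score a model
    # on lake-resident data instead of this synthetic loop.
    rows = 100_000
    start = time.perf_counter()
    predictions = [i % 2 for i in range(rows)]
    labels = list(predictions)  # trivially perfect accuracy in this stand-in
    accuracy = sum(p == t for p, t in zip(predictions, labels)) / rows
    elapsed = time.perf_counter() - start
    return BenchmarkResult("toy_classification", elapsed, rows / elapsed, accuracy)


if __name__ == "__main__":
    for task in REGISTRY.values():
        print(task())
```

Running the module would execute every registered task and print one BenchmarkResult per task, which is the kind of comparable, per-task record the metrics tooling is meant to produce.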