Starred repositories
Reference PyTorch implementation and models for DINOv3
MTEB: Massive Text Embedding Benchmark
Awesome Unified Multimodal Models
[BMVC2023] Official code for TEMI: Exploring the Limits of Deep Image Clustering using Pretrained Models
Scalable data preprocessing and curation toolkit for LLMs
[NeurIPS 2025 Spotlight] Towards Understanding Camera Motions in Any Video
PKU-DAIR / DataFlow
Forked from OpenDCAI/DataFlow
Easy Data Preparation with latest LLMs-based Operators and Pipelines.
This is a repo with links to everything you'd ever want to learn about data engineering
Automatically crawl arXiv papers daily and summarize them using AI. Illustrating them using GitHub Pages.
Official Repository of "LLM × DATA" Survey Paper
Easy Data Preparation with latest LLMs-based Operators and Pipelines.
🔥[VLDB'26] Official repository for the paper "LEAD: Iterative Data Selection for Efficient LLM Instruction Tuning".
Fantastic Data Engineering for Large Language Models
Best Papers of Top Venues like CVPR, NeurIPS, ICLR, ICML, ICCV, ECCV, ...
A Library for Advanced Deep Time Series Models for General Time Series Analysis.
🔍 An LLM-based Multi-agent Framework of Web Search Engine (like Perplexity.ai Pro and SearchGPT)
Simplistic mobile RSS client built with Flutter
[ICML 2024] Selecting High-Quality Data for Training Language Models
The official repository for the NLP-KG web application [ACL 2024 Demo].
[ICCV 2023 Oral] IOMatch: Simplifying Open-Set Semi-Supervised Learning with Joint Inliers and Outliers Utilization
[ICML'24] Mitigating Privacy Risk in Membership Inference by Convex-Concave Loss
The Open-Source Data Annotation Platform
A Survey on Data Selection for Language Models
Summarize existing representative LLMs text datasets.
[ICML'24] Open-Vocabulary Calibration for Fine-tuned CLIP
Train transformer language models with reinforcement learning.