Stars
Qwen3-VL is the multimodal large language model series developed by the Qwen team at Alibaba Cloud.
Easily turn large sets of image URLs into an image dataset. Can download, resize, and package 100M URLs in 20h on one machine.
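A rough sketch of driving this from Python, assuming the project's documented download() entry point; the exact argument names may vary between versions:

    # Minimal sketch: download, resize, and shard a plain-text list of image URLs.
    from img2dataset import download

    download(
        url_list="urls.txt",          # one image URL per line
        output_folder="images",       # where the shards are written
        output_format="webdataset",   # package images into .tar shards
        image_size=256,               # resize while downloading
        processes_count=8,
        thread_count=32,
    )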
Pre-Training with Whole Word Masking for Chinese BERT (the Chinese BERT-wwm model series)
MTEB: Massive Text Embedding Benchmark
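A minimal sketch of evaluating a sentence-transformers model on one MTEB task, assuming the benchmark's long-standing MTEB(tasks=...) API; the task and model names are illustrative:

    # Run a single classification task and write results to disk.
    from mteb import MTEB
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    evaluation = MTEB(tasks=["Banking77Classification"])
    evaluation.run(model, output_folder="results")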
RayGen: Multi-Modal Dataset Reinforcement for MobileCLIP and MobileCLIP2
This repository contains the official implementation of the research paper "FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization" (ICCV 2023).
An open source implementation of CLIP.
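A minimal zero-shot classification sketch with OpenCLIP; the model name, checkpoint tag, and image path are illustrative choices:

    import torch
    import open_clip
    from PIL import Image

    # Load a pretrained CLIP model plus its preprocessing transform and tokenizer.
    model, _, preprocess = open_clip.create_model_and_transforms(
        "ViT-B-32", pretrained="laion2b_s34b_b79k"
    )
    tokenizer = open_clip.get_tokenizer("ViT-B-32")

    image = preprocess(Image.open("cat.jpg")).unsqueeze(0)
    text = tokenizer(["a photo of a cat", "a photo of a dog"])

    with torch.no_grad():
        image_features = model.encode_image(image)
        text_features = model.encode_text(text)
        image_features /= image_features.norm(dim=-1, keepdim=True)
        text_features /= text_features.norm(dim=-1, keepdim=True)
        probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
    print(probs)  # similarity of the image to each text prompt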
Enriching MS-COCO with Chinese sentences and tags for cross-lingual multimedia tasks
Reproducible scaling laws for contrastive language-image learning (https://arxiv.org/abs/2212.07143)
New generation of CLIP with fine-grained discrimination capability (ICML 2025)
The largest collection of PyTorch image encoders / backbones. Including train, eval, inference, export scripts, and pretrained weights -- ResNet, ResNeXT, EfficientNet, NFNet, Vision Transformer (V…
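A minimal sketch of using timm as a feature backbone; the model name is an illustrative choice and the pretrained weights download on first use:

    import timm
    import torch

    # num_classes=0 drops the classifier head and returns pooled features.
    model = timm.create_model("resnet50", pretrained=True, num_classes=0)
    model.eval()

    x = torch.randn(1, 3, 224, 224)
    with torch.no_grad():
        features = model(x)  # e.g. shape (1, 2048) for resnet50
    print(features.shape)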
Example pybind11 module built with a CMake-based build system
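Once the example is built and installed (e.g. pip install ./cmake_example), it imports like any other module; the add() binding below is an assumption about the example's contents:

    # Hypothetical usage of the compiled pybind11 example module.
    import cmake_example

    print(cmake_example.add(1, 2))  # calls into the C++ binding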
Retrieval and Retrieval-augmented LLMs
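A minimal embedding sketch assuming FlagEmbedding's FlagModel wrapper; the model id and use_fp16 flag are illustrative assumptions:

    # Encode a few sentences into dense vectors for retrieval.
    from FlagEmbedding import FlagModel

    model = FlagModel("BAAI/bge-base-en-v1.5", use_fp16=True)
    embeddings = model.encode(["retrieval augmented generation", "dense passage retrieval"])
    print(embeddings.shape)  # (num_sentences, embedding_dim)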
[ICLR 2025] LLaVA-MoD: Making LLaVA Tiny via MoE-Knowledge Distillation
[ECCV 2024 Oral] PetFace: A Large-Scale Dataset and Benchmark for Animal Identification https://arxiv.org/abs/2407.13555
Refine high-quality datasets and visual AI models
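A minimal sketch of browsing a dataset in FiftyOne; the zoo dataset name is an illustrative choice:

    import fiftyone as fo
    import fiftyone.zoo as foz

    # Load a small sample dataset and open the interactive app.
    dataset = foz.load_zoo_dataset("quickstart")
    session = fo.launch_app(dataset)
    session.wait()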
Official implementation for the paper: "Multi-label Classification with Partial Annotations using Class-aware Selective Loss"
YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite
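A minimal inference sketch via torch.hub, following the project's quickstart; the image path is illustrative:

    import torch

    # Load the small pretrained model and run detection on one image.
    model = torch.hub.load("ultralytics/yolov5", "yolov5s")
    results = model("image.jpg")
    results.print()  # summary of detected objects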
Effortless data labeling with AI support from Segment Anything and other awesome models.
Precision Search through Multi-Style Inputs
Official implementation of paper "Query2Label: A Simple Transformer Way to Multi-Label Classification".
[CVPR 2022] Official code for "Unified Contrastive Learning in Image-Text-Label Space"