RecIS is a unified deep learning framework designed for ultra-large-scale sparse and dense computing. Built on the PyTorch open-source ecosystem, it provides a complete solution for training recommendation models, including recommendation combined with multimodal and large-model training. It was jointly launched by Alibaba's AiCheng Technology team and the Advertising Technology and Algorithm Technology teams of Taobao & Tmall, and is widely applied in Alibaba's advertising, recommendation, and search scenarios.
Unified Framework
- Built on the PyTorch open-source ecosystem, unifying sparse and dense framework requirements
- Meets industrial-grade recommendation model training needs, including scenarios that combine recommendation with multimodal and large models
Performance Optimization
- Optimizing memory access performance for sparse-related operators
- Providing sparse operator fusion optimization capabilities to fully utilize GPU
- Matches or exceeds the performance of TensorFlow-based implementations
Ease of Use
- Flexible feature and embedding configuration
- Automated feature processing and optimization workflows
- Simple sparse model definition
RecIS adopts a modular design with the following core components:
- ColumnIO: Data Reading
  - Supports distributed sharded data reading
  - Supports feature pre-computation during the reading phase
  - Assembles samples into Torch Tensors and provides data prefetching
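The reading-and-prefetching pattern can be sketched in plain PyTorch. This is an illustrative toy, not the ColumnIO API: a background thread assembles shards into tensor batches and parks them in a bounded queue so the consumer never waits on I/O.

```python
import queue
import threading
import torch

# Illustrative sketch (not the ColumnIO API): read sharded data and
# prefetch assembled tensor batches on a background thread.
class PrefetchReader:
    def __init__(self, shards, depth=2):
        self.shards = shards                  # each shard: a list of raw rows
        self.queue = queue.Queue(maxsize=depth)
        self.thread = threading.Thread(target=self._produce, daemon=True)
        self.thread.start()

    def _produce(self):
        for shard in self.shards:
            # Assemble the shard's rows into a Torch tensor batch.
            self.queue.put(torch.tensor(shard))
        self.queue.put(None)                  # end-of-data sentinel

    def __iter__(self):
        while (batch := self.queue.get()) is not None:
            yield batch

reader = PrefetchReader([[1, 2, 3], [4, 5, 6]])
batches = [b.tolist() for b in reader]        # -> [[1, 2, 3], [4, 5, 6]]
```

The bounded queue gives backpressure: the producer blocks once `depth` batches are buffered, capping memory while keeping the consumer fed.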
- Feature Engine: Feature Processing
  - Provides feature engineering and feature transformation operations, including Hash, Mod, and Bucketize
  - Supports automatic operator fusion optimization strategies
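The three transform families can be illustrated with stock PyTorch ops. The function names and the multiplicative hash constant below are assumptions for illustration, not the RecIS Feature Engine API:

```python
import torch

# Illustrative sketch (not the RecIS API) of typical feature transforms
# applied before embedding lookup.

def mod_transform(ids: torch.Tensor, num_buckets: int) -> torch.Tensor:
    # Mod: fold raw IDs into a fixed-size bucket space.
    return ids % num_buckets

def hash_transform(ids: torch.Tensor, num_buckets: int) -> torch.Tensor:
    # Hash: a simple multiplicative hash (Knuth's constant) as a stand-in
    # for whatever hash function the framework actually uses.
    return (ids * 2654435761) % num_buckets

def bucketize_transform(values: torch.Tensor, boundaries: torch.Tensor) -> torch.Tensor:
    # Bucketize: map continuous values to bucket indices via boundaries.
    return torch.bucketize(values, boundaries)

ids = torch.tensor([10_000_000_007, 42, 987654321])
prices = torch.tensor([0.5, 3.2, 99.0])
print(mod_transform(ids, 1000).tolist())                                   # [7, 42, 321]
print(bucketize_transform(prices, torch.tensor([1.0, 10.0, 100.0])).tolist())  # [0, 1, 2]
```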
- Embedding Engine: Embedding Management and Computation
  - Provides conflict-free, scalable KV-storage embedding tables
  - Offers multi-table fusion optimization for better memory access performance
  - Supports feature admission and filtering strategies
- Saver: Parameter Saving and Loading
  - Provides sparse parameter storage and delivery in the standard SafeTensors format
- Pipelines: Training Process Orchestration
  - Connects the above components and encapsulates training workflows
  - Supports complex training processes, including multi-stage (training/testing alternation) and multi-objective computation
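A multi-stage (training/testing alternation) workflow can be sketched as a plain loop. This is a hypothetical illustration of the orchestration idea, not the RecIS Pipelines API; `run_pipeline` and its arguments are invented names:

```python
import torch

# Hypothetical sketch (not the RecIS Pipelines API): alternate a training
# stage and a testing stage, collecting the test loss after each stage.
def run_pipeline(model, loss_fn, opt, train_batches, test_batches, stages=2):
    losses = []
    for _ in range(stages):
        model.train()                     # training stage
        for x, y in train_batches:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
        model.eval()                      # testing stage
        with torch.no_grad():
            losses.append(sum(float(loss_fn(model(x), y)) for x, y in test_batches))
    return losses

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
data = [(torch.randn(8, 4), torch.randn(8, 1)) for _ in range(3)]
losses = run_pipeline(model, torch.nn.MSELoss(), opt, data, data)
```

In a real pipeline each stage would also coordinate the reader, feature engine, and embedding engine; the loop above only shows the stage-alternation skeleton.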
The RecIS framework implements efficient dynamic embeddings (HashTable) through a two-level storage architecture:
- IDMap: first-level storage, using feature IDs as keys and offsets as values
- EmbeddingBlocks: second-level storage
  - Contiguous, sharded memory blocks that store embedding parameters and optimizer states
  - Supports dynamic sharding with flexible scalability
- Flexible Hardware Adaptation: both IDMap and EmbeddingBlocks can be placed on either GPU or CPU
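The two-level design can be made concrete with a toy sketch. All names here are illustrative, not the RecIS implementation: a Python dict stands in for the IDMap, and a list of fixed-size tensors stands in for the growable EmbeddingBlocks.

```python
import torch

# Conceptual sketch (illustrative names, not the RecIS API) of a two-level
# dynamic embedding: IDMap maps feature ID -> offset; EmbeddingBlocks are
# contiguous blocks allocated on demand to hold the parameter rows.
class DynamicEmbedding:
    def __init__(self, dim: int, block_rows: int = 4):
        self.dim = dim
        self.block_rows = block_rows
        self.id_map: dict[int, int] = {}      # IDMap: feature ID -> offset
        self.blocks: list[torch.Tensor] = []  # EmbeddingBlocks

    def _ensure_capacity(self, offset: int):
        while offset >= len(self.blocks) * self.block_rows:
            # Allocate a new contiguous block on demand (dynamic sharding).
            self.blocks.append(torch.randn(self.block_rows, self.dim) * 0.01)

    def lookup(self, ids: torch.Tensor) -> torch.Tensor:
        rows = []
        for fid in ids.tolist():
            # Unseen IDs get a fresh offset: conflict-free, grows without bound.
            offset = self.id_map.setdefault(fid, len(self.id_map))
            self._ensure_capacity(offset)
            block, row = divmod(offset, self.block_rows)
            rows.append(self.blocks[block][row])
        return torch.stack(rows)

emb = DynamicEmbedding(dim=8)
out = emb.lookup(torch.tensor([12345, 42, 12345]))  # shape (3, 8); rows 0 and 2 match
```

Because the table is keyed by ID rather than sized up front, there are no hash collisions and capacity grows one block at a time, which is the property the HashTable design targets.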
- Parameter Aggregation and Sharding:
  - During model creation, merges parameter tables with identical properties (dimensions, initializers, etc.) into a single logical table
  - Distributes parameters evenly across compute nodes
- Request Merging and Splitting:
  - During forward computation, merges lookup requests for parameter tables with identical properties and deduplicates IDs to compute sharding information
  - Gathers embedding vectors from the compute nodes via All-to-All collective communication
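The deduplicate-then-restore step can be shown in a single process with `torch.unique`. In a real multi-worker run, the unique IDs would be partitioned by shard and exchanged via `torch.distributed.all_to_all`; this sketch (not the RecIS API) shows only the dedup and order-restoring logic:

```python
import torch

# Conceptual sketch of request merging: deduplicate IDs before the lookup
# (or before communication), then restore the original order afterwards.
def dedup_lookup(ids: torch.Tensor, table: torch.Tensor) -> torch.Tensor:
    unique_ids, inverse = torch.unique(ids, return_inverse=True)
    # Only |unique_ids| rows are fetched or sent over the wire, not |ids|.
    fetched = table[unique_ids]
    # The inverse index scatters results back to the original positions.
    return fetched[inverse]

table = torch.arange(20.0).reshape(10, 2)   # toy embedding table
ids = torch.tensor([3, 7, 3, 1])            # note the duplicated ID 3
out = dedup_lookup(ids, table)              # identical to table[ids]
```

Deduplication matters most for skewed ID distributions, where a few hot IDs dominate a batch: the communication volume drops from the batch size to the number of distinct IDs.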
- GPU Concurrency Optimization:
  - Fuses feature-processing operators, significantly reducing operator count and kernel-launch overhead
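The effect of fusion can be illustrated without any framework support: instead of launching one small kernel per feature column, concatenate the columns and launch one kernel over the flattened batch. This toy (not the RecIS fusion pass) produces identical results either way:

```python
import torch

# Illustrative sketch of operator fusion for many small per-column ops.
def unfused(columns, num_buckets):
    return [c % num_buckets for c in columns]        # N kernel launches

def fused(columns, num_buckets):
    lengths = [len(c) for c in columns]
    flat = torch.cat(columns) % num_buckets          # 1 kernel launch
    return list(torch.split(flat, lengths))          # split results back out

cols = [torch.tensor([11, 22]), torch.tensor([33, 44, 55])]
assert all(torch.equal(a, b) for a, b in zip(unfused(cols, 10), fused(cols, 10)))
```

With hundreds of sparse features, per-column launches leave the GPU mostly idle on launch latency; one fused launch amortizes that cost, which is the overhead the bullet above refers to.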
- Parameter Table Fusion Optimization:
  - Merges parameter tables with identical properties, reducing feature lookup frequency, significantly decreasing operator count, and improving memory utilization
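Table fusion can be sketched as concatenating same-dimension tables into one physical table addressed by per-table row offsets. The names below are illustrative, not the RecIS implementation:

```python
import torch

# Illustrative sketch of parameter table fusion: two logical tables with
# the same embedding dimension share one physical table; each logical
# table is addressed through its row offset.
dim = 4
table_a = torch.randn(3, dim)
table_b = torch.randn(5, dim)
fused = torch.cat([table_a, table_b])     # one physical table
offsets = {"a": 0, "b": 3}                # row offset of each logical table

def fused_lookup(name: str, ids: torch.Tensor) -> torch.Tensor:
    # One gather against the fused table serves every logical table.
    return fused[ids + offsets[name]]

assert torch.equal(fused_lookup("b", torch.tensor([0, 2])), table_b[[0, 2]])
```

One fused table means one lookup kernel and one contiguous allocation instead of many small ones, which is where the operator-count and memory-utilization wins come from.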
- Operator Implementation Optimization:
  - Implements vectorized memory access in operators to improve memory bandwidth utilization
  - Optimizes reduction operators through warp-level merging, reducing atomic operations and improving memory access utilization
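The reduction these kernels optimize is, functionally, a segment sum: pooling each sample's embedding rows into one output vector. A naive GPU kernel issues one atomic add per input row; warp-level merging first combines rows bound for the same output within a warp, so far fewer atomics reach memory. A minimal Python sketch of what the reduction computes (not how the CUDA kernel is written):

```python
import torch

# Segment sum: the computation behind embedding pooling reductions.
def segment_sum(values: torch.Tensor, segment_ids: torch.Tensor, num_segments: int) -> torch.Tensor:
    out = torch.zeros(num_segments, values.shape[1])
    # index_add_ accumulates each input row into its segment's output row;
    # on GPU this is the scatter-add the warp-level merging optimizes.
    out.index_add_(0, segment_ids, values)
    return out

values = torch.ones(5, 3)
seg = torch.tensor([0, 0, 1, 2, 2])       # rows per output segment: 2, 1, 2
out = segment_sum(values, seg, 3)
```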
If you encounter issues, you can:
- Check the project's Issues page
- Join our WeChat discussion group
This project is open-sourced under the Apache 2.0 license.