RecIS (Recommendation Intelligence System)

RecIS (Recommendation Intelligence System) is a unified deep learning framework designed for ultra-large-scale sparse and dense computing. Built on the PyTorch open-source ecosystem, it provides a complete solution for training recommendation models, including training that combines recommendation with multimodal and large models. RecIS was jointly launched by Alibaba's AiCheng Technology team and the Advertising Technology and Algorithm Technology teams of Taobao & Tmall, and is widely used in Alibaba's advertising, recommendation, and search scenarios.

🎯 Design Goals

Unified Framework

  • Built on the PyTorch open-source ecosystem, unifying sparse and dense computation in a single framework
  • Meeting industrial-grade recommendation model training needs, including scenarios that combine multimodal and large models

Performance Optimization

  • Optimizing memory access performance for sparse operators
  • Providing sparse operator fusion to fully utilize the GPU
  • Matching or exceeding the performance of TensorFlow-based training systems

Ease of Use

  • Flexible feature and embedding configuration
  • Automated feature processing and optimization workflows
  • Simple sparse model definition

🏗️ Core Architecture

RecIS adopts a modular design with the following core components:

System Architecture
  • ColumnIO: Data Reading

    • Supports distributed sharded data reading
    • Supports feature pre-computation during the reading phase
    • Assembles samples into Torch Tensors and provides data prefetching
  • Feature Engine: Feature Processing

    • Provides feature engineering and feature transformation processing capabilities, including Hash / Mod / Bucketize, etc.
    • Supports automatic operator fusion optimization strategies
  • Embedding Engine: Embedding Management and Computation

    • Provides conflict-free, scalable KV storage embedding tables
    • Offers multi-table fusion optimization capabilities for better memory access performance
    • Supports feature admission and filtering strategies
  • Saver: Parameter Saving and Loading

    • Provides sparse parameter storage and delivery in the standard SafeTensors format
  • Pipelines: Training Process Orchestration

    • Connects the above components and encapsulates training workflows
    • Supports complex training processes including multi-stage (training/testing alternation) and multi-objective computation
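The Feature Engine transforms named above (Hash / Mod / Bucketize) can be illustrated with a minimal pure-Python sketch. The function names here are hypothetical helpers for illustration, not the RecIS API:

```python
import bisect
import hashlib

def hash_feature(value: str, num_buckets: int) -> int:
    """Map an arbitrary string feature to a stable integer bucket."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

def mod_feature(feature_id: int, num_buckets: int) -> int:
    """Fold a large integer ID into a fixed-size ID space."""
    return feature_id % num_buckets

def bucketize_feature(value: float, boundaries: list) -> int:
    """Discretize a continuous value against sorted bucket boundaries."""
    return bisect.bisect_right(boundaries, value)

# Example: discretize a continuous age feature into 4 buckets.
print(bucketize_feature(23.0, [18.0, 25.0, 35.0]))  # -> 1
print(mod_feature(10_000_000_019, 1_000_000))       # -> 19
```

In a real pipeline these transforms run per-column during the reading phase, which is what makes the fusion optimization in the next section profitable.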

🚀 Key Optimizations

Efficient Dynamic Embedding

The RecIS framework implements efficient dynamic embeddings (HashTable) through a two-level storage architecture:

  • IDMap: Serves as first-level storage, using feature IDs as keys and storage offsets as values
  • EmbeddingBlocks:
    • Serves as second-level storage, continuous sharded memory blocks for storing embedding parameters and optimizer states
    • Supports dynamic sharding with flexible scalability
  • Flexible Hardware Adaptation Strategy: Supports both GPU and CPU placement for IDMap and EmbeddingBlocks
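The two-level design above can be sketched in a few lines of pure Python. The class, its block size, and its method names are illustrative assumptions, not the RecIS implementation (which stores tensors, not Python lists):

```python
import random

class DynamicEmbedding:
    """Sketch of a conflict-free HashTable embedding: an IDMap (dict)
    maps feature ID -> offset, and EmbeddingBlocks are fixed-size
    chunks allocated on demand, so the table grows without rehashing
    or key collisions."""

    def __init__(self, dim: int, block_size: int = 4):
        self.dim = dim
        self.block_size = block_size
        self.id_map = {}      # first-level storage: feature ID -> offset
        self.blocks = []      # second-level storage: sharded memory blocks
        self.next_offset = 0

    def lookup(self, feature_id: int):
        offset = self.id_map.get(feature_id)
        if offset is None:                      # unseen ID: admit it
            offset = self.next_offset
            self.id_map[feature_id] = offset
            self.next_offset += 1
            if offset % self.block_size == 0:   # existing blocks are full
                self.blocks.append(
                    [[random.uniform(-0.01, 0.01) for _ in range(self.dim)]
                     for _ in range(self.block_size)])
        block, slot = divmod(offset, self.block_size)
        return self.blocks[block][slot]

table = DynamicEmbedding(dim=8)
vec = table.lookup(123456789)   # admits a new ID, allocates a block
same = table.lookup(123456789)  # second lookup hits the IDMap
assert vec is same
```

Because offsets are dense and blocks are appended lazily, capacity scales with the number of admitted IDs rather than a pre-declared vocabulary size, and either level can in principle be placed on GPU or CPU memory.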

Distributed Optimization

  • Parameter Aggregation and Sharding:
    • During model creation, merges parameter tables with identical properties (dimensions, initializers, etc.) into a single logical table
    • Parameters are evenly distributed across compute nodes
  • Request Merging and Splitting:
    • During forward computation, merges requests for parameter tables with identical properties and deduplicates to compute sharding information
    • Obtains embedding vectors from various compute nodes through All-to-All collective communication
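A minimal sketch of the dedup-and-shard step described above, assuming modulo sharding of IDs across workers. The helper name is hypothetical, and a real system performs the exchange with an All-to-All collective rather than this in-process grouping:

```python
def plan_lookup(feature_ids, world_size):
    """Deduplicate requested IDs and group them by owning worker,
    mimicking the routing computed before the All-to-All exchange
    of embedding vectors."""
    unique_ids = sorted(set(feature_ids))          # merge duplicate requests
    per_worker = {rank: [] for rank in range(world_size)}
    for fid in unique_ids:
        per_worker[fid % world_size].append(fid)   # owning shard
    return unique_ids, per_worker

ids = [7, 3, 7, 12, 3, 5]
unique_ids, per_worker = plan_lookup(ids, world_size=4)
print(unique_ids)   # -> [3, 5, 7, 12]
print(per_worker)   # -> {0: [12], 1: [5], 2: [], 3: [3, 7]}
```

Deduplication matters because hot IDs recur many times within a batch; merging them shrinks both the communication volume and the number of embedding reads.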

Efficient Hardware Resource Utilization

  • GPU Concurrency Optimization:

    • Supports feature processing operator fusion optimization, significantly reducing operator count and launch overhead
  • Parameter Table Fusion Optimization:

    • Supports merging parameter tables with identical properties, reducing feature lookup frequency, significantly decreasing operator count, and improving memory space utilization efficiency
  • Operator Implementation Optimization:

    • Implements vectorized memory access in operators to improve memory access utilization
    • Optimizes reduction operators through warp-level merging, reducing atomic operations and improving memory access utilization
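The effect of warp-level merging can be illustrated in pure Python: duplicate indices are combined locally before touching the output, so fewer contended writes are issued. On a GPU each un-merged write to a shared row would be an atomic operation; the function names here are illustrative, not RecIS kernels:

```python
from collections import defaultdict

def scatter_add_naive(out, indices, values):
    """One write per input element; on a GPU each write to a shared
    output row would be an atomic operation."""
    writes = 0
    for idx, val in zip(indices, values):
        out[idx] += val
        writes += 1
    return writes

def scatter_add_merged(out, indices, values):
    """Pre-merge values targeting the same row (as warp-level merging
    does within a warp), then issue one write per distinct row."""
    partial = defaultdict(float)
    for idx, val in zip(indices, values):
        partial[idx] += val          # local, contention-free merge
    writes = 0
    for idx, val in partial.items():
        out[idx] += val
        writes += 1
    return writes

indices = [0, 2, 0, 2, 0, 1]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
a, b = [0.0] * 3, [0.0] * 3
n1 = scatter_add_naive(a, indices, values)   # 6 writes
n2 = scatter_add_merged(b, indices, values)  # 3 writes
assert a == b and n2 < n1                    # same result, fewer writes
```

This is exactly the shape of an embedding-gradient reduction, where many sample rows scatter-add into the same hot embedding row.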

📚 Documentation

🤝 Support and Feedback

If you encounter issues, you can:

  • Check project Issues
  • Join our WeChat discussion group

📄 License

This project is open-sourced under the Apache 2.0 license.
