Overview
Add a mechanism to pre-pull OCI artifacts (especially AI/ML models) before they are needed by workloads, enabling faster pod startup for latency-sensitive inference services.
Problem Statement
Current behavior: When a pod requests an OCI artifact volume, CRI-O downloads the artifact on demand at pod startup. For large AI models (10GB+), this causes significant cold-start delays during autoscaling events.
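For context, a minimal sketch of what the on-demand path looks like from the workload side, assuming the Kubernetes image volume source (which the kubelet resolves through CRI-O's image service); all names and references are illustrative:

```go
package workload

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// inferencePod mounts an OCI artifact (a model) as a read-only volume.
// CRI-O pulls the artifact when the pod starts, which is the cold-start
// cost this issue is about.
func inferencePod() *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "llm-inference"},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name:  "server",
				Image: "example.com/inference-server:latest", // hypothetical
				VolumeMounts: []corev1.VolumeMount{{
					Name:      "model",
					MountPath: "/models/llama",
					ReadOnly:  true,
				}},
			}},
			Volumes: []corev1.Volume{{
				Name: "model",
				VolumeSource: corev1.VolumeSource{
					Image: &corev1.ImageVolumeSource{
						Reference:  "example.com/models/llama-13b:v1", // hypothetical
						PullPolicy: corev1.PullIfNotPresent,
					},
				},
			}},
		},
	}
}
```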
User impact:
- LLM/GenAI workloads are latency-sensitive with spiky traffic patterns
- Autoscalers (KEDA, HPA) cannot respond quickly due to model download delays
- Pod startup times of 3-5 minutes are unacceptable for inference services
- Time between scaling decisions and ability to serve requests is critical for maintaining SLAs
Use Cases
1. KEDA-based AI Inference Autoscaling
Scenario: LLM inference services autoscale based on queue depth. When traffic spikes, new pods must come online quickly.
Problem: New pods spend 3-5 minutes downloading 10GB models. By the time the pods are ready, the spike may have passed, or users have already experienced degraded latency.
Desired outcome: Pods start in <30 seconds with pre-cached models, enabling the system to respond quickly to load changes.
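This issue does not prescribe an interface, but as a rough sketch, here is what a pre-pull agent (for example, a DaemonSet) can already do against CRI-O's image service; the socket path and model list are assumptions:

```go
package main

import (
	"context"
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1"
)

func main() {
	// CRI-O's default CRI socket; adjust for the node being warmed.
	conn, err := grpc.NewClient("unix:///var/run/crio/crio.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("connect to CRI-O: %v", err)
	}
	defer conn.Close()
	client := runtimeapi.NewImageServiceClient(conn)

	// Hypothetical hot-model list for this cluster.
	models := []string{
		"example.com/models/llama-13b:v1",
		"example.com/models/whisper-large:v3",
	}
	for _, ref := range models {
		// PullImage is the same CRI call the kubelet issues when a pod
		// needs the artifact; making it early warms the local store.
		_, err := client.PullImage(context.Background(),
			&runtimeapi.PullImageRequest{
				Image: &runtimeapi.ImageSpec{Image: ref},
			})
		if err != nil {
			log.Printf("pre-pull %s: %v", ref, err)
			continue
		}
		log.Printf("pre-pulled %s", ref)
	}
}
```

crictl pull gives the same one-shot effect from a shell; a first-class mechanism in CRI-O could additionally manage the lifecycle of pre-pulled artifacts.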
2. Spot/Preemptible Instance Preparation
Scenario: Cloud clusters use spot instances for cost savings. New nodes need to be ready immediately to handle workloads.
Problem: A new spot instance joins, a pod is scheduled onto it, and the pod spends 5 minutes downloading the model. If the instance is preempted during the download, that work is lost.
Desired outcome: Node initialization pre-pulls critical artifacts before marking the node ready, so pods start with predictable latency.
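A readiness check for this flow could ask the same image service whether everything is present before the node is marked schedulable. A minimal sketch; the required-artifact list and how the check is wired into provisioning are assumptions:

```go
package prepull

import (
	"context"

	runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1"
)

// artifactsReady reports whether every required artifact is already in
// local storage. ImageStatus returns a nil Image for references that
// have not been pulled yet.
func artifactsReady(ctx context.Context, client runtimeapi.ImageServiceClient, refs []string) (bool, error) {
	for _, ref := range refs {
		resp, err := client.ImageStatus(ctx, &runtimeapi.ImageStatusRequest{
			Image: &runtimeapi.ImageSpec{Image: ref},
		})
		if err != nil {
			return false, err
		}
		if resp.GetImage() == nil {
			return false, nil // still missing; keep the node cordoned
		}
	}
	return true, nil
}
```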
3. Multi-Model Serving Platforms
Scenario: Model serving platforms support hundreds of models, but only a subset are "hot" at any time.
Desired outcome: Pre-pull the 20 most-requested models on every node, so 80% of requests are served with zero download latency, while rare models degrade gracefully to on-demand pulls.
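Selecting the hot set is policy that lives outside CRI-O, but its shape is simple. A sketch, assuming request counts are collected elsewhere (serving metrics, gateway logs); the output would feed whatever pre-pull mechanism this issue adds:

```go
package prepull

import "sort"

// modelStats pairs an artifact reference with its recent request count.
type modelStats struct {
	Ref      string
	Requests int
}

// hotModels returns the n most-requested artifact references, i.e. the
// set to pre-pull on every node.
func hotModels(stats []modelStats, n int) []string {
	sort.Slice(stats, func(i, j int) bool {
		return stats[i].Requests > stats[j].Requests
	})
	if n > len(stats) {
		n = len(stats)
	}
	refs := make([]string, 0, n)
	for _, s := range stats[:n] {
		refs = append(refs, s.Ref)
	}
	return refs
}
```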
4. Air-gapped/Edge Deployments
Scenario: Edge locations or air-gapped environments cannot pull artifacts on demand due to network constraints.
Desired outcome: Pre-pull all required artifacts during node provisioning or via an offline bundle (e.g. a USB drive). Nodes operate without registry access.
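For the offline path, artifacts can be staged into node storage from an archive with no registry access at all. A sketch that shells out to skopeo (assumed installed); the paths, the reference, and containers-storage as the destination are all assumptions, and where artifacts should ultimately live is the subject of #9570:

```go
package prepull

import (
	"fmt"
	"os/exec"
)

// loadBundle copies one artifact from an offline OCI archive (e.g.
// shipped on a USB drive) into local container storage so it can be
// served without network access.
func loadBundle(archivePath, ref string) error {
	cmd := exec.Command("skopeo", "copy",
		"oci-archive:"+archivePath,
		"containers-storage:"+ref,
	)
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("skopeo copy failed: %v: %s", err, out)
	}
	return nil
}
```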
Related
- Default artifact location #9570: Configurable artifact storage locations (complementary feature)
- Combined use case: Pre-pull models into shared storage, then configure it as a read-only artifact store on all nodes