简体中文 | English
Arks is an end-to-end framework for managing LLM-based applications within Kubernetes clusters. It provides a robust and extensible infrastructure tailored for deploying, orchestrating, and scaling LLM inference workloads in cloud-native environments.
- Multi-node scheduling: Run inference across multiple compute nodes.
- Heterogeneous computing support: Works across different hardware types (CPU, GPU, etc.).
- Multi-engine compatibility: Supports vLLM, SGLang, and Dynamo.
- Auto service discovery & load balancing: Dynamically register and balance application traffic.
- Automatic weight adjustment: Adapt to traffic and resource demands in real-time.
- Horizontal Pod Autoscaling (HPA): Autoscale applications based on workload.
- Model caching & optimization: Efficiently download and cache models to reduce cold-start latency.
- Model sharing: Share models across inference nodes to save bandwidth and memory.
- Accelerated loading: Leverage local cache and preloaded strategies for fast startup.
- Fine-grained API Token control: Issue and manage tokens with scoped permissions.
- Flexible quota strategies: Enforce usage limits by total token count or pricing-based policies.
- Request throttling: Support rate limiting by TPM (tokens per minute) and RPM (requests per minute) and more rate limiting stategies.
Arks consists of the following major components:
- Gateway Layer: Acts as the unified entry point for all external traffic. It handles request routing and enforces access policies.
- ArksToken: Provides fine-grained multi-tenant access control with support for:
- API token-based authentication
- Quota enforcement (based on token usage or pricing)
- Rate limiting (TPM, RPM)
 
- ArksEndpoint: Dynamically manages routing rules and traffic distribution across different ArksApplication instances.
- Supports dynamic weight-based routing
- Enables automatic application discovery
- Adjusts traffic flow in real-time based on load or policies
 
 
- ArksToken: Provides fine-grained multi-tenant access control with support for:
- Workload Layer: Each ArksApplication contains one or more runtime instances. Supported runtimes include vLLM, SGLang, Dynamo.
Each runtime is deployed as a Kubernetes workload and benefits from:
- Distributed inference across multiple nodes
- Support for heterogeneous computing environments
- Autoscaling via Kubernetes HPA, based on predefined SLOs
 
- Storage Layer: Using ArksModel to manage model storage.
- Supports auto caching of models to reduce cold start time
- Enables model sharing across applications and nodes
- Designed for high-throughput model loading and reuse
 
More docs:
- Kubernetes cluster (v1.20+)
- kubectl configured to access your cluster
git clone https://github.com/scitix/arks.git
cd arks
# Install envoy gateway, lws dependencies
kubectl create -f dist/dependency.yaml
# Install arks operator
kubectl create -f dist/operator.yaml
# Install arks gateway plugins
kubectl create -f dist/gateway.yamlverification:
# Check all component status, should be ready
kubectl get deployment -n arks-operator-system
---
NAME                               READY   UP-TO-DATE   AVAILABLE   AGE
arks-gateway-plugins               1/1     1            1           22h
arks-operator-controller-manager   1/1     1            1           22h
arks-redis-master                  1/1     1            1           22h
# Check Envoy Gateway status
kubectl get deployment -n   envoy-gateway-system
--- 
NAME                                          READY   UP-TO-DATE   AVAILABLE   AGE
envoy-arks-operator-system-arks-eg-abcedefg   1/1     1            1           22h
envoy-gateway                                 1/1     1            1           22h
Install with:
kubectl create -f examples/quickstart/quickstart.yamlCheck resources ready:
# Check all ARKS custom resources
kubectl get arksapplication,arksendpoint,arksmodel,arksquota,arkstoken,httproute -owide
---
# REPLICAS should equals to READY, PHASE should be Running
NAME                                      PHASE     REPLICAS   READY   AGE   MODEL     RUNTIME   DRIVER
arksapplication.arks.ai/app-qwen   Running   1          1       21m   qwen-7b   sglang
NAME                                  AGE   DEFAULT WEIGHT
arksendpoint.arks.ai/qwen-7b   21m   5
# PHASE should be Ready
NAME                               AGE   MODEL                         PHASE
arksmodel.arks.ai/qwen-7b   21m   Qwen/Qwen2.5-7B-Instruct-1M   Ready
NAME                                   AGE
arksquota.arks.ai/basic-quota   21m
NAME                                     AGE
arkstoken.arks.ai/example-token   21m
NAME                                          HOSTNAMES   AGE
httproute.gateway.networking.k8s.io/qwen-7b               21m
Get the gateway IP:
# Option 1: Kubernetes cluster with LoadBalancer support
LB_IP=$(kubectl get svc -n envoy-gateway-system --selector=gateway.envoyproxy.io/owning-gateway-name=arks-eg -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
ENDPOINT="${LB_IP}:80"
# Option 2: Dev environment without LoadBalancer support. Use port forwarding way instead
ENVOY_SERVICE=$(kubectl get svc -n envoy-gateway-system --selector=gateway.envoyproxy.io/owning-gateway-name=arks-eg -o jsonpath='{.items[0].metadata.name}')
kubectl -n envoy-gateway-system port-forward service/${ENVOY_SERVICE} 8888:80 &
ENDPOINT="localhost:8888"
Curl the example app through Envoy proxy:
curl http://${ENDPOINT}/v1/chat/completions -k \
  -H "Authorization: Bearer sk-test123456" \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen-7b", "messages": [{"role": "user", "content": "Hello, who are you?"}]}'Expected response
{
  "id":"xxxxxxxxx",
  "object":"chat.completion",
  "created": 12332454,
  "model":"qwen-7b",
  "choices":[{
    "index":0,
    "message":{
      "role":"assistant",
      "content":"I'm a large language model created by Alibaba Cloud. I go by the name Qwen.",
      "reasoning_content":null,
      "tool_calls":null
    },
    "logprobs":null,
    "finish_reason":"stop",
    "matched_stop":151645
  }],
  "usage":{
    "prompt_tokens":25,
    "total_tokens":45,
    "completion_tokens":20,
    "prompt_tokens_details":null
  }
}kubectl delete -f examples/quickstart/quickstart.yaml --ignore-not-found=true
kubectl delete -f dist/gateway.yaml
kubectl delete -f dist/operator.yaml
kubectl delete -f dist/dependency.yamlIt is recommended to compile ARKS using Docker. Here are the relevant commands:
make docker-build-operator
make docker-build-gateway
make docker-build-scripts
Arks is licensed under the Apache 2.0 License.
For feedback, questions, or contributions, feel free to:
- Open an issue on GitHub
- Submit a pull request