Open
Labels: kind/bug (Something isn't working)
Description
What happened: When submitting a job with a parallelism of 40 or more (40+ pods to schedule) that requests HAMi's NVIDIA resources, the scheduler's bind endpoint started to throttle. It took roughly 15 to 30 minutes to schedule all the pods of this parallel job, even though the node had sufficient resources. Below is the error message from kubectl describe pod:
Warning FailedScheduling 6m54s hami-scheduler Post "https://127.0.0.1:443/bind": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
However, if the job does not use HAMi's resources, scheduling is very fast, so the issue has to be within vgpu-scheduler-extender.
What you expected to happen: The pods with hami resources should be scheduled faster without the above error message when the node has sufficient resources.
How to reproduce it (as minimally and precisely as possible):
- Install HAMi on an NVIDIA cluster
- Create a job that requests HAMi resources with a parallelism of 40 or more, and ensure the node has sufficient resources. See the example below:
apiVersion: batch/v1
kind: Job
metadata:
  name: mnist-job
  namespace: default
spec:
  parallelism: 80
  completions: 80
  template:
    spec:
      ...
- Describe a pod that is stuck in Pending status; the FailedScheduling warning shown above appears in its events
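To put numbers on the delay seen in the last step, the following is a minimal sketch, not part of HAMi, that computes per-pod scheduling latency from a pod list in the JSON shape returned by the Kubernetes API. It relies only on the standard `metadata.creationTimestamp` field and the `PodScheduled` condition; the function names `scheduling_latencies` and `summarize` are hypothetical helpers introduced for illustration.

```python
from datetime import datetime


def parse_ts(ts: str) -> datetime:
    # Kubernetes serializes these timestamps as RFC 3339 UTC, e.g. "2025-01-01T00:00:00Z".
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")


def scheduling_latencies(pod_list: dict) -> dict:
    """Map pod name -> seconds between creation and PodScheduled=True.

    Pods that have not been scheduled yet map to None.
    """
    result = {}
    for pod in pod_list.get("items", []):
        meta = pod["metadata"]
        created = parse_ts(meta["creationTimestamp"])
        latency = None
        for cond in pod.get("status", {}).get("conditions", []):
            if cond.get("type") == "PodScheduled" and cond.get("status") == "True":
                scheduled_at = parse_ts(cond["lastTransitionTime"])
                latency = (scheduled_at - created).total_seconds()
        result[meta["name"]] = latency
    return result


def summarize(latencies: dict) -> dict:
    """Count scheduled vs. pending pods and report the slowest scheduled pod."""
    scheduled = sorted(v for v in latencies.values() if v is not None)
    pending = sum(1 for v in latencies.values() if v is None)
    return {
        "scheduled": len(scheduled),
        "pending": pending,
        "max_seconds": scheduled[-1] if scheduled else None,
    }
```

One way to use it, assuming the Job from the example above, is to save the output of `kubectl get pods -n default -l job-name=mnist-job -o json`, load it with `json.load`, and pass the result to `scheduling_latencies`. Comparing the summary for a run with HAMi resources against a run without them would show how much of the delay is attributable to the extender.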
Environment:
- K8s Version:
Client Version: v1.30.9
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.31.8+k3s1
- HAMi version: v2.6.1
- nvidia driver version: 550.144.03
- Kernel version from uname -a:
Linux ip-172-31-35-36 6.8.0-1021-aws #23~22.04.1-Ubuntu SMP Tue Dec 10 16:50:46 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
- lscpu:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
...
With 32 GiB of RAM