HAMi Scheduler Throttling With High Number of Pod Submission #1367

Description

@CcccYxx

What happened: When submitting a job with 40+ parallelism (40+ pods that need to be scheduled) that requests HAMi resources (NVIDIA), the scheduler's bind endpoint started to throttle, and it took ~15-30 minutes to schedule all of the pods for the parallel job, even though the node had sufficient resources. Below is the error message from kubectl describe pod:

Warning  FailedScheduling  6m54s                  hami-scheduler  Post "https://127.0.0.1:443/bind": net/http: request canceled (Client.Timeout exceeded while awaiting headers)

However, if the job does not use HAMi resources, scheduling is very fast, so the issue appears to be within the vgpu-scheduler-extender.

What you expected to happen: Pods requesting HAMi resources should be scheduled quickly, without the error message above, when the node has sufficient resources.

How to reproduce it (as minimally and precisely as possible):

  1. Install HAMi on an Nvidia cluster
  2. Create a job that requests a HAMi resource with parallelism of 40+, and ensure the node has sufficient resources; see the (truncated) example below, and the fuller sketch after this list:
apiVersion: batch/v1
kind: Job
metadata:
  name: mnist-job
  namespace: default
spec:
  parallelism: 80
  completions: 80
  template:
    spec:
...
  3. Describe a pod in Pending status; the behavior above can be observed.
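
For completeness, here is a minimal sketch of what the elided pod template could look like; this is an assumption for illustration only. The container name, image, and command are placeholders, and the resource requests assume HAMi's default NVIDIA resource names (nvidia.com/gpu for the vGPU count, nvidia.com/gpumem for device memory in MiB):

apiVersion: batch/v1
kind: Job
metadata:
  name: mnist-job
  namespace: default
spec:
  parallelism: 80
  completions: 80
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: mnist                                # placeholder name
          image: nvcr.io/nvidia/pytorch:24.05-py3    # placeholder image
          command: ["python", "mnist.py"]            # placeholder command
          resources:
            limits:
              nvidia.com/gpu: 1         # one vGPU per pod (assumed HAMi default resource name)
              nvidia.com/gpumem: 2000   # optional device memory in MiB (assumed HAMi default name)

Any manifest of this shape (40+ parallel pods, each requesting HAMi-managed resources) should reproduce the bind-endpoint timeouts described above.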

Environment:

  • K8s Version:
Client Version: v1.30.9
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.31.8+k3s1
  • HAMi version: v2.6.1
  • nvidia driver version: 550.144.03
  • Kernel version from uname -a:
Linux ip-172-31-35-36 6.8.0-1021-aws #23~22.04.1-Ubuntu SMP Tue Dec 10 16:50:46 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
  • lscpu:
Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          46 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   8
  On-line CPU(s) list:    0-7
Vendor ID:                GenuineIntel
  Model name:             Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
...

With 32 GiB of RAM.
