HAMi Scheduler Throttling With High Number of Pod Submission #1367

Description

@CcccYxx

What happened: When submitting a job with 40+ parallelism (40+ pods that need to be scheduled) that requests HAMi resources (NVIDIA), the scheduler's bind endpoint started to throttle, and it took ~15-30 minutes to schedule all of the pods for the parallel job, even though the node had sufficient resources. Below is the error message from kubectl describe pod:

Warning  FailedScheduling  6m54s                  hami-scheduler  Post "https://127.0.0.1:443/bind": net/http: request canceled (Client.Timeout exceeded while awaiting headers)

However, if the job does not use HAMi resources, scheduling is very fast, so the issue appears to be within the vgpu-scheduler-extender.

What you expected to happen: Pods requesting HAMi resources should be scheduled quickly, without the error message above, when the node has sufficient resources.

How to reproduce it (as minimally and precisely as possible):

  1. Install HAMi on an Nvidia cluster
  2. Create a job that requests a HAMi resource with parallelism of 40+, and ensure the node has sufficient resources; see the (truncated) example below, and the fuller sketch after this list:
apiVersion: batch/v1
kind: Job
metadata:
  name: mnist-job
  namespace: default
spec:
  parallelism: 80
  completions: 80
  template:
    spec:
...
  3. Describe a pod in Pending status; the behavior above can be observed.
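
For completeness, here is a minimal sketch of what the elided pod template could look like; this is an assumption for illustration only. The container name, image, and command are placeholders, and the resource requests assume HAMi's default NVIDIA resource names (nvidia.com/gpu for the vGPU count, nvidia.com/gpumem for device memory in MiB):

apiVersion: batch/v1
kind: Job
metadata:
  name: mnist-job
  namespace: default
spec:
  parallelism: 80
  completions: 80
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: mnist                                # placeholder name
          image: nvcr.io/nvidia/pytorch:24.05-py3    # placeholder image
          command: ["python", "mnist.py"]            # placeholder command
          resources:
            limits:
              nvidia.com/gpu: 1         # one vGPU per pod (assumed HAMi default resource name)
              nvidia.com/gpumem: 2000   # optional device memory in MiB (assumed HAMi default name)

Any manifest of this shape (40+ parallel pods, each requesting HAMi-managed resources) should reproduce the bind-endpoint timeouts described above.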

Environment:

  • K8s Version:
Client Version: v1.30.9
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.31.8+k3s1
  • HAMi version: v2.6.1
  • nvidia driver version: 550.144.03
  • Kernel version from uname -a:
Linux ip-172-31-35-36 6.8.0-1021-aws #23~22.04.1-Ubuntu SMP Tue Dec 10 16:50:46 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
  • lscpu:
Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          46 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   8
  On-line CPU(s) list:    0-7
Vendor ID:                GenuineIntel
  Model name:             Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
...

With 32 GiB of RAM.
