Conversation

haitwang-cloud
Contributor

@haitwang-cloud haitwang-cloud commented Jan 16, 2025

What type of PR is this?

/kind design
What this PR does / why we need it:
This pull request improves logging, refactors code for readability, and expands test coverage. The most important changes are migrating logging statements to structured key/value logging, refactoring code for clarity, and adding new test cases.

Logging Improvements:

Code Refactoring:

Test Enhancements:

These changes collectively enhance the robustness, readability, and maintainability of the codebase while providing more detailed and structured logging for easier debugging and monitoring.
Which issue(s) this PR fixes:
Fixes # Avoid the client init error by adding unit tests & a singleton pattern

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

@haitwang-cloud
Contributor Author

Attaching the scheduler logs after this PR's fix:

I0121 02:17:54.850403       1 util.go:303] checklist is [map[DCU:hami.io/dcu-devices-allocated Iluvatar:hami.io/iluvatar-vgpu-devices-allocated MLU:hami.io/cambricon-mlu-devices-allocated Metax:hami.io/metax-gpu-devices-allocated Mthreads:hami.io/mthreads-vgpu-devices-allocated NVIDIA:hami.io/vgpu-devices-allocated]], annos is [map[cni.projectcalico.org/containerID:9a6f90a87396138de769ffbe49360d904261236356fccf19c79726a65888d16b cni.projectcalico.org/podIP:100.96.0.157/32 cni.projectcalico.org/podIPs:100.96.0.157/32 hami.io/bind-phase:success hami.io/bind-time:1737425866 hami.io/vgpu-devices-allocated:GPU-59f8a413-1c74-960f-18e3-d32f52626071,NVIDIA,3000,25:; hami.io/vgpu-devices-to-allocate:; hami.io/vgpu-node:shoot--ait--datalake-nonprod-c32m384-v100g3-z1-977d9-ctcpp hami.io/vgpu-time:1737425866 istio.io/rev:default kubectl.kubernetes.io/default-container:nanogpt kubectl.kubernetes.io/default-logs-container:nanogpt prometheus.io/path:/stats/prometheus prometheus.io/port:15020 prometheus.io/scrape:true sidecar.istio.io/status:{"initContainers":["istio-init"],"containers":["istio-proxy"],"volumes":["workload-socket","credential-socket","workload-certs","istio-envoy","istio-data","istio-podinfo","istio-token","istiod-ca-cert"],"imagePullSecrets":null,"revision":"default"}]]
I0121 02:17:54.850531       1 util.go:278] Start to decode container device GPU-59f8a413-1c74-960f-18e3-d32f52626071,NVIDIA,3000,25:
I0121 02:17:54.850550       1 util.go:298] Finished decoding container devices. Total devices: 1
I0121 02:17:54.850585       1 util.go:325] "Decoded pod annos" poddevices={"NVIDIA":[[{"Idx":0,"UUID":"GPU-59f8a413-1c74-960f-18e3-d32f52626071","Type":"NVIDIA","Usedmem":3000,"Usedcores":25}]]}
I0121 02:17:54.850635       1 pods.go:78] "Pod devices updated" pod="kbf-i742968/nanogpt-0" namespace="kbf-i742968" name="nanogpt-0" devices={"NVIDIA":[[{"Idx":0,"UUID":"GPU-59f8a413-1c74-960f-18e3-d32f52626071","Type":"NVIDIA","Usedmem":3000,"Usedcores":25}]]}
I0121 02:17:55.420883       1 scheduler.go:99] "Pod added" pod="nanogpt-0" namespace="kbf-i742968"
I0121 02:17:55.420920       1 util.go:303] checklist is [map[DCU:hami.io/dcu-devices-allocated Iluvatar:hami.io/iluvatar-vgpu-devices-allocated MLU:hami.io/cambricon-mlu-devices-allocated Metax:hami.io/metax-gpu-devices-allocated Mthreads:hami.io/mthreads-vgpu-devices-allocated NVIDIA:hami.io/vgpu-devices-allocated]], annos is [map[cni.projectcalico.org/containerID:9a6f90a87396138de769ffbe49360d904261236356fccf19c79726a65888d16b cni.projectcalico.org/podIP:100.96.0.157/32 cni.projectcalico.org/podIPs:100.96.0.157/32 hami.io/bind-phase:success hami.io/bind-time:1737425866 hami.io/vgpu-devices-allocated:GPU-59f8a413-1c74-960f-18e3-d32f52626071,NVIDIA,3000,25:; hami.io/vgpu-devices-to-allocate:; hami.io/vgpu-node:shoot--ait--datalake-nonprod-c32m384-v100g3-z1-977d9-ctcpp hami.io/vgpu-time:1737425866 istio.io/rev:default kubectl.kubernetes.io/default-container:nanogpt kubectl.kubernetes.io/default-logs-container:nanogpt prometheus.io/path:/stats/prometheus prometheus.io/port:15020 prometheus.io/scrape:true sidecar.istio.io/status:{"initContainers":["istio-init"],"containers":["istio-proxy"],"volumes":["workload-socket","credential-socket","workload-certs","istio-envoy","istio-data","istio-podinfo","istio-token","istiod-ca-cert"],"imagePullSecrets":null,"revision":"default"}]]
I0121 02:17:55.420978       1 util.go:278] Start to decode container device GPU-59f8a413-1c74-960f-18e3-d32f52626071,NVIDIA,3000,25:
I0121 02:17:55.420993       1 util.go:298] Finished decoding container devices. Total devices: 1
I0121 02:17:55.421023       1 util.go:325] "Decoded pod annos" poddevices={"NVIDIA":[[{"Idx":0,"UUID":"GPU-59f8a413-1c74-960f-18e3-d32f52626071","Type":"NVIDIA","Usedmem":3000,"Usedcores":25}]]}
I0121 02:17:55.421051       1 pods.go:78] "Pod devices updated" pod="kbf-i742968/nanogpt-0" namespace="kbf-i742968" name="nanogpt-0" devices={"NVIDIA":[[{"Idx":0,"UUID":"GPU-59f8a413-1c74-960f-18e3-d32f52626071","Type":"NVIDIA","Usedmem":3000,"Usedcores":25}]]}
I0121 02:17:55.613563       1 scheduler.go:99] "Pod added" pod="language-detection-postgresql-0" namespace="language-detection"
I0121 02:17:55.819818       1 metrics.go:65] Starting to collect metrics for scheduler
I0121 02:17:55.820109       1 pods.go:153] "Retrieved scheduled pods" podCount=6
I0121 02:17:55.820145       1 metrics.go:192] "Collecting metrics" namespace="kbf-i744932" podName="pyspark-img-v12-0" deviceUUID="GPU-c5d16691-bb12-0aad-bb52-9238619d1ad6" usedCores=25 usedMem=3000 nodeID="shoot--ait--datalake-nonprod-c32m384-v100g3-z1-977d9-ctcpp"
I0121 02:17:55.820184       1 metrics.go:225] "Total memory for device" deviceUUID="GPU-c5d16691-bb12-0aad-bb52-9238619d1ad6" totalMemory=16384 nodeID="shoot--ait--datalake-nonprod-c32m384-v100g3-z1-977d9-ctcpp"
I0121 02:17:55.820228       1 metrics.go:192] "Collecting metrics" namespace="sir-service" podName="sir-core-service-66b8c9f8c8-zzzbn" deviceUUID="GPU-c5d16691-bb12-0aad-bb52-9238619d1ad6" usedCores=25 usedMem=3000 nodeID="shoot--ait--datalake-nonprod-c32m384-v100g3-z1-977d9-ctcpp"
I0121 02:17:55.820259       1 metrics.go:225] "Total memory for device" deviceUUID="GPU-c5d16691-bb12-0aad-bb52-9238619d1ad6" totalMemory=16384 nodeID="shoot--ait--datalake-nonprod-c32m384-v100g3-z1-977d9-ctcpp"
I0121 02:17:55.820304       1 metrics.go:192] "Collecting metrics" namespace="dask" podName="worker-777959db8f-gfls9" deviceUUID="GPU-59f8a413-1c74-960f-18e3-d32f52626071" usedCores=25 usedMem=8000 nodeID="shoot--ait--datalake-nonprod-c32m384-v100g3-z1-977d9-ctcpp"
I0121 02:17:55.820338       1 metrics.go:225] "Total memory for device" deviceUUID="GPU-59f8a413-1c74-960f-18e3-d32f52626071" totalMemory=16384 nodeID="shoot--ait--datalake-nonprod-c32m384-v100g3-z1-977d9-ctcpp"
I0121 02:17:55.820376       1 metrics.go:192] "Collecting metrics" namespace="ism" podName="deploy-ism-embedding-8b49866d-hcfz4" deviceUUID="GPU-a346704d-b53d-8ef3-eb65-113a9d671130" usedCores=25 usedMem=3000 nodeID="shoot--ait--datalake-nonprod-c32m384-v100g3-z1-977d9-ctcpp"
I0121 02:17:55.820406       1 metrics.go:225] "Total memory for device" deviceUUID="GPU-a346704d-b53d-8ef3-eb65-113a9d671130" totalMemory=16384 nodeID="shoot--ait--datalake-nonprod-c32m384-v100g3-z1-977d9-ctcpp"
I0121 02:17:55.820448       1 metrics.go:192] "Collecting metrics" namespace="ism" podName="dev-deploy-ism-embedding-849978f877-dvzf8" deviceUUID="GPU-a346704d-b53d-8ef3-eb65-113a9d671130" usedCores=25 usedMem=3000 nodeID="shoot--ait--datalake-nonprod-c32m384-v100g3-z1-977d9-ctcpp"
I0121 02:17:55.820473       1 metrics.go:225] "Total memory for device" deviceUUID="GPU-a346704d-b53d-8ef3-eb65-113a9d671130" totalMemory=16384 nodeID="shoot--ait--datalake-nonprod-c32m384-v100g3-z1-977d9-ctcpp"
I0121 02:17:55.820507       1 metrics.go:192] "Collecting metrics" namespace="kbf-i742968" podName="nanogpt-0" deviceUUID="GPU-59f8a413-1c74-960f-18e3-d32f52626071" usedCores=25 usedMem=3000 nodeID="shoot--ait--datalake-nonprod-c32m384-v100g3-z1-977d9-ctcpp"
I0121 02:17:55.820534       1 metrics.go:225] "Total memory for device" deviceUUID="GPU-59f8a413-1c74-960f-18e3-d32f52626071" totalMemory=16384 nodeID="shoot--ait--datalake-nonprod-c32m384-v100g3-z1-977d9-ctcpp"

@archlitchi
Member

Thanks:) /lgtm

@archlitchi archlitchi merged commit 7fe6183 into Project-HAMi:master Jan 21, 2025
11 checks passed
@haitwang-cloud haitwang-cloud deleted the refine-metrics-logs branch January 21, 2025 06:31